Distributed Systems: Difference between revisions

From NovaOrdis Knowledge Base
Jump to navigation Jump to search
Line 29: Line 29:
Read replication is a strategy to scale a [[#Single_Master_Database|single master database]].
Read replication is a strategy to scale a [[#Single_Master_Database|single master database]].


The pattern involves a '''leader''' database and several '''follower''' databases. The leader database handles updates and is responsible to propagate the updates to the follower databases. The follower databases will be at times out of date with the leader, because it takes a non-zero time for the change to replicate. Also, since the updates for each follower happen independently, followers will be out of sync, for a short interval of time, with each other. This behavior breaks the strong consistency guarantee of a non-distributed database. The strong consistency guarantee is replaced with '''eventual consistency'''. This is something an architect that designs a system using read replication needs to know and plan for.
The pattern involves a '''leader''' database and several '''follower''' databases (read followers). The leader database handles updates and is responsible to propagate the updates to the follower databases. The follower databases will be at times out of date with the leader, because it takes a non-zero time for the change to replicate. Also, since the updates for each follower happen independently, followers will be out of sync, for a short interval of time, with each other. This behavior breaks the strong consistency guarantee of a non-distributed database. The strong consistency guarantee is replaced with '''eventual consistency'''. This is something an architect that designs a system using read replication needs to know and plan for.


The same strategy also applies to non-relational databases. [[MongoDB]] does the read replication the same way.
The same strategy also applies to non-relational databases. [[MongoDB]] does the read replication the same way.

Revision as of 01:54, 4 June 2019

External

Internal

Distributed System Definition

According to Andrew Tannenbaum, a distributed system is a collection of independent computers that appear to their users as one computer. Specifically, there are three specific characteristics any distributed system must have:

  • The computers operate concurrently
  • The computers fail independently. They will fail, sooner or later.
  • The computers do not share a global clock. All activities these computers perform are asynchronous with respect to the other components. This is a very important characteristics, as it imposes some essential limitations on what the distributed system can do.

As we scale and distribute the system, it seems as bad things start happening to us: distributed system are hard, and it is possible to get away without one, go for it. However, we distribute for well defined reasons, and distributing is in some cases the only way we can address the problems we have.

CAP Theorem

Distributed Storage

The non-distributed version of a relational database is called a "single master system". A single master node can scale up to a certain level, after which we need to use more physical nodes - turn it into a distributed database. There are several strategies to scale the database:

Read Replication

Read replication is a strategy to scale a single master database.

The pattern involves a leader database and several follower databases (read followers). The leader database handles updates and is responsible to propagate the updates to the follower databases. The follower databases will be at times out of date with the leader, because it takes a non-zero time for the change to replicate. Also, since the updates for each follower happen independently, followers will be out of sync, for a short interval of time, with each other. This behavior breaks the strong consistency guarantee of a non-distributed database. The strong consistency guarantee is replaced with eventual consistency. This is something an architect that designs a system using read replication needs to know and plan for.

The same strategy also applies to non-relational databases. MongoDB does the read replication the same way.

Read replication works well only for a specific state access pattern: a read-heavy application. Usually, this is the case for web applications. However, in case the application implies a large number of concurrent writes, and the leader cannot keep up and falls over, another distribution strategy called sharding can be used.

Sharding

Sharding implies splitting up the workload, by picking a key and breaking up the workload based on the key. Each shard can have its own replica set, so read replication can also be used.

Sharding3.png

The splitting of the workload is typically done by a layer in the application.

Relational databases can be sharded this way. MongoDB also does sharding the same way.


Cassandra

Distributed file systems.

Distributed Computation

Hadoop

Spark

Storm

Distributed Synchronization

Network Time Protocol

Vector clocks

Consensus

The consensus refers to the problem of agreeing on what is true in a distributed environment.

Zookeeper

Paxos

Distributed Messaging

Kafka

To Schedule