Brian Liao

Consistency and Consensus

This post originally appeared in the 10% Smarter newsletter. Consider subscribing if you like it!

Distributed systems are complex and pose many difficult problems. Fortunately, some techniques for dealing with these issues have been developed and are widely used in production. Today, we’ll discuss the relationship between consistency and consensus, and how they are used by online services all over the world.

The Linearizability Property

Linearizability is the property that a data object appears as if there were only one copy of it, and that all operations on that copy are atomic. Linearizability also guarantees that reads and writes of a single object always see the most up-to-date value. It does not, however, protect against write skew, since that anomaly spans multiple objects.

Linearizability is used by Apache ZooKeeper, etcd, and Google Chubby. These are lock-granting services, often used to implement distributed locks, leader election, and service discovery, and they rely on consensus algorithms to be linearizable.

For example, one way to implement leader election is with a lock: all the eligible nodes start up and try to acquire the lock, and the one that succeeds becomes the leader. The lock must be linearizable.
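As a rough sketch of that idea (LockService, try_acquire, and the node naming here are hypothetical stand-ins, not a real client library):

class LockService:
    """Stand-in for a real lock service client (e.g. a ZooKeeper or etcd recipe)."""
    def try_acquire(self, name: str, owner: str) -> bool:
        raise NotImplementedError  # the real service grants the lock to exactly one caller

def elect_leader(lock_service: LockService, node_id: str) -> bool:
    # Every eligible node runs this at startup. Because the lock is
    # linearizable, all nodes agree on the single node that acquired it.
    if lock_service.try_acquire("leader-lock", owner=node_id):
        print(f"{node_id} became the leader")
        return True
    print(f"{node_id} is a follower")
    return False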

Single-leader, Multi-leader, and Leaderless Replication are, unfortunately, not guaranteed to be linearizable.

Single-leader Replication is potentially linearizable if reads go to the leader or to synchronously updated followers. However, non-linearizable behavior can still occur due to stale replicas or split brain.

Multi-leader Replication is not linearizable because writes are processed concurrently on multiple nodes and replicated asynchronously, which can result in clients seeing different values for the same object.

Leaderless Replication is non-linearizable because conflict resolution typically relies on clock timestamps (last write wins), and due to clock skew those timestamps are not guaranteed to be consistent with the actual ordering of events.

What should we use instead?

The solution is Consensus Algorithms, which are linearizable because they are similar to single-leader replication but have additional measures to prevent stale replicas and split-brain.

The CAP Theorem And The Cost Of Linearizability

The famous CAP Theorem is usually presented as Consistency, Availability, Partition Tolerance: pick 2 out of 3. However, this framing is misleading. Network partitions are inevitable in large-scale systems, so a more accurate statement is: choose Consistency or Availability when experiencing a network partition. The Consistency in this case actually means Linearizability!

(Figure: different databases and their chosen tradeoffs from the CAP theorem.)

The main reason for dropping linearizability is performance, not fault tolerance. If an application requires linearizability and some replicas are disconnected from other replicas due to a network problem, then these replicas become unavailable and must wait until the network problem is fixed to ensure a single value for each object.

Causality

In understanding consensus algorithms, it is also important to understand causality. With causality, an ordering of events is guaranteed, such that a cause always comes before its effect. A system that obeys the ordering imposed by causality is causally consistent.

Causal Order and Total Order

If elements are in a total order, it means that they can always be compared.

With a partial order, we can sometimes compare two elements and say which is bigger or smaller, but cannot in other cases. For example, mathematical sets are partially ordered: you can’t compare {a, b} with {b, c}, since neither is a subset of the other.
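A quick illustration in Python, using subset comparison on sets (a partial order): the two sets from the example are simply incomparable.

a = {"a", "b"}
b = {"b", "c"}

# Subset comparison is a partial order: some pairs are incomparable.
print(a <= b)                # False: {a, b} is not a subset of {b, c}
print(b <= a)                # False: {b, c} is not a subset of {a, b}
print(a <= {"a", "b", "c"})  # True: these two sets ARE comparable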

Linearizability and causal consistency are slightly different. Linearizability imposes a total order on operations: the system behaves as if there is only a single copy of the data, on which operations happen one after another. Causal consistency imposes only a partial order, not a total one: concurrent operations are incomparable, and we can’t say that one happened before the other. Linearizability is therefore stronger than causal consistency.

Sequence Number Ordering

A good way of keeping track of causal dependencies in a database is by using sequence numbers or timestamps to order the events. The timestamp can be a logical clock that generates monotonically increasing numbers for each operation.

This gives a total order that is consistent with causality: if operation A causally happened before B, then A’s sequence number is lower than B’s, while concurrent operations are ordered arbitrarily.

In single-leader databases, the replication log defines a total order of write operations using the monotonically increasing sequence number. All followers apply the writes in that order and will always be in a causally consistent state. Doing this achieves linearizability.
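As a minimal sketch of that mechanism (toy classes, not any particular database), the leader stamps each write with the next sequence number and followers apply writes strictly in that order:

import itertools

class Leader:
    """Toy single-leader replication log: each write gets the next sequence number."""
    def __init__(self):
        self._next_seq = itertools.count(1)  # monotonically increasing counter
        self.log = []                        # list of (sequence_number, operation)

    def write(self, operation):
        self.log.append((next(self._next_seq), operation))

def apply_on_follower(log):
    # Followers apply writes in sequence-number order, so every follower
    # passes through the same causally consistent sequence of states.
    return [operation for _, operation in sorted(log)]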

Lamport Timestamps

A total ordering can also be generated by Multi-Leader and Leaderless databases. If every node keeps track of its own counter and the maximum counter value it has seen so far, a total ordering is possible. When a node receives a request or response with a counter value greater than its own, it immediately increases its own counter to that maximum.

To ensure a total ordering, give each node a unique identifier and use Lamport timestamps, which are pairs of (counter, nodeId). Multiple nodes can have the same counter value, but including the node ID in the timestamp makes it unique.

For example, if you had two nodes 1 and 2, you could get an ordering like: (1,1) -> (2, 2) -> (3, 1) -> (3,2).
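A minimal Lamport clock sketch in Python (the class and method names are illustrative, not from any particular library):

class LamportClock:
    """Toy Lamport clock: timestamps are (counter, node_id) pairs."""
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        # Every local operation increments the counter.
        self.counter += 1
        return (self.counter, self.node_id)

    def on_message(self, received_counter: int):
        # On receiving a request or response, jump to the maximum counter
        # value seen so far before incrementing.
        self.counter = max(self.counter, received_counter) + 1
        return (self.counter, self.node_id)

# Timestamps compare as ordinary tuples: counter first, node ID as tie-breaker,
# giving orderings like (1, 1) < (2, 2) < (3, 1) < (3, 2).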

Total Order Broadcast (Atomic Broadcast)

Total Order Broadcast, or atomic broadcast, is a protocol for exchanging messages between nodes that must satisfy two properties:

Reliable delivery: no messages can be lost.

Totally ordered delivery: messages must be delivered to every node in the same order.

Another way of looking at total order broadcast is that it is a way of creating a log. Delivering a message is like appending to the log.

Linearizable Writes Using Total Order Broadcast

Because log entries are delivered to all nodes in the same order, if there are several concurrent writes, all nodes will agree on which one came first. Choosing the first of the conflicting writes as the winner and aborting later ones ensures that all nodes agree on whether a write was committed or aborted.
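As a sketch of that approach, assuming a hypothetical total order broadcast log with an append() method and a deliver() iterator (the classic example is claiming a unique username):

import uuid

def claim_username(log, username, node_id):
    """Linearizable 'first writer wins' via total order broadcast (hypothetical log API)."""
    my_message_id = str(uuid.uuid4())
    log.append({"id": my_message_id, "username": username, "node": node_id})

    claimed = set()
    # Every node sees the same delivery order, so every node picks the
    # same winner for each username.
    for message in log.deliver():
        if message["username"] in claimed:
            if message["id"] == my_message_id:
                return False        # a conflicting claim was ordered first: abort
            continue
        claimed.add(message["username"])
        if message["id"] == my_message_id:
            return True             # our claim came first: committed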

Linearizable Reads Using Total Order Broadcast

Three ways to make reads linearizable are:

You can sequence reads through the log by appending a message, reading the log, and performing the actual read when the message is delivered back to you (etcd works something like this; see the sketch after this list).

You can fetch the position of the latest log message in a linearizable way, wait for all messages up to that position to be delivered to you, and then perform the read (the idea behind ZooKeeper’s sync()).

You can make your read from a replica that is synchronously updated on writes.
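A sketch of the first option, again assuming the same hypothetical log interface: append a marker message, wait until it is delivered back to you, then read.

import uuid

def linearizable_read(log, store, key):
    """Sketch of sequencing a read through the log (the etcd-style approach)."""
    marker_id = str(uuid.uuid4())
    log.append({"type": "read-marker", "id": marker_id})

    for message in log.deliver():                     # total order delivery
        if message.get("type") == "write":
            store[message["key"]] = message["value"]  # apply earlier writes first
        if message.get("id") == marker_id:
            # Every write ordered before our marker has now been applied,
            # so the value returned is at least as fresh as the marker.
            return store.get(key)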

Consensus

Finally, we can look at consensus itself. Consensus means getting several nodes to agree on something. It’s not an easy problem to solve, but it is fundamental for leader election and for performing an atomic commit.

Two-Phase Commit

Two-phase commit (2PC) is an algorithm for implementing atomic commit, where either all nodes successfully commit a transaction or all of them abort it.

One node is designated as the coordinator. When the application is ready to commit a transaction, the two phases are as follows:

The coordinator sends a prepare request to all the nodes participating in the transaction, for which the nodes have to respond with essentially a ‘YES’ or ‘NO’ message.

If all the participants reply ‘YES’, then the coordinator will send a commit request in the second phase for them to actually perform the commit. However, if any of the nodes reply ‘NO’, the coordinator sends an abort request to all the participants.

When a participant votes ‘YES’, it promises that it will be able to commit. Once the coordinator decides, that decision is irrevocable.

If one of the participants or the network fails during the prepare phase, the coordinator aborts the transaction. If any of the commit or abort requests fail, the coordinator retries them indefinitely.

If the coordinator fails before sending the prepare requests, a participant can safely abort the transaction.

If the coordinator fails after participants have voted ‘YES’, the only way 2PC can complete is to wait for the coordinator to recover. This is why the coordinator must write its commit or abort decision to a transaction log on disk before sending commit or abort requests to participants.
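Putting those rules together, a toy coordinator might look like the sketch below (the participant objects and their prepare/commit/abort methods are hypothetical, and a real implementation also needs timeouts and crash recovery from the log):

def two_phase_commit(coordinator_log, participants, transaction_id):
    """Toy 2PC coordinator: prepare, durably log the decision, then commit or abort."""
    # Phase 1: every participant votes; replying True ('YES') is a promise
    # that it will definitely be able to commit.
    votes = [participant.prepare(transaction_id) for participant in participants]
    decision = "commit" if all(votes) else "abort"

    # The decision must reach the durable log *before* any participant is told,
    # so a recovering coordinator can finish the protocol after a crash.
    coordinator_log.append((transaction_id, decision))

    # Phase 2: the decision is irrevocable; commit/abort requests are retried forever.
    for participant in participants:
        while True:
            try:
                if decision == "commit":
                    participant.commit(transaction_id)
                else:
                    participant.abort(transaction_id)
                break
            except ConnectionError:
                continue  # keep retrying until the participant acknowledges
    return decision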

2PC is also called a blocking atomic commit protocol, as it can become stuck waiting for the coordinator to recover. Three-phase commit (3PC) is an alternative, but it requires a perfect failure detector, which is not achievable in a network with unbounded delays.

Unfortunately, two-phase commit is often avoided in practice, because blocking distributed transactions cause significant performance and operational problems.

Raft and Paxos

On the other hand, Paxos and Raft are popular fault-tolerant consensus algorithms that are widely used in practice. They solve the problem of consensus, which can be used to implement atomic commit, and they do not block as long as a majority of nodes are up and can communicate with each other.

The properties required of a consensus algorithm are:

Uniform agreement: No two nodes decide differently.

Integrity: No node decides twice.

Validity: If a node decides a value v, then v must have been proposed by some node.

Termination: Every node that does not crash eventually decides some value.

How are Paxos and Raft implemented? Consensus algorithms are actually total order broadcast algorithms, using single leader replication.

Both protocols define a monotonically increasing epoch number (ballot number in Paxos and term number in Raft) and make a guarantee that within each epoch, the leader is unique.

Whenever the current leader is thought to be dead, the nodes start a vote to elect a new leader. In each election round, the epoch number is incremented. If we have two leaders belonging to different epochs, the one with the higher epoch number will prevail.

A node cannot trust its own judgement of whether it is the leader. It must collect votes from a quorum of nodes. For every decision that a leader wants to make, it must send the proposed value to the other nodes and wait for a quorum of nodes to respond in favor of the proposal.

There are two rounds of voting: one to choose a leader, and a second to vote on the leader’s proposal. The quorums for those two votes must overlap.
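A heavily simplified sketch of that quorum check (this is not full Raft or Paxos, just the epoch-and-majority idea with made-up attribute names):

def propose(value, epoch, nodes):
    """Toy quorum proposal: the value is decided only if a majority accepts this epoch."""
    accepts = 0
    for node in nodes:
        # A node accepts only if it has not already seen a higher epoch,
        # i.e. it does not know about a more recent leader.
        if node.highest_epoch_seen <= epoch:
            node.highest_epoch_seen = epoch
            node.accepted_value = value
            accepts += 1
    quorum = len(nodes) // 2 + 1
    return accepts >= quorum  # decided only with a majority in favor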

The biggest difference from 2PC is that 2PC requires a ‘YES’ vote from every participant, whereas consensus algorithms only need a majority to agree.


Conclusion

We’ve seen how Paxos and Raft implement consensus, a powerful tool that allows several nodes in a distributed system to agree on some state. We’ve also looked at how consensus algorithms are really total order broadcast algorithms built on single-leader replication, and how this can be used to achieve linearizability. Feel free to subscribe as we look at batch processing next week, and share this article if you liked it!
