Skip to main content
PRL Project

EML code view code  

5 The two-thirds consensus protocol


Consider the following problem: A system has been replicated for fault tolerance. It responds to commands issued to any of the replicas, which must come to consensus on the order in which those commands are to be performed, so that all replicas process commands in the same order. Replicas may fail. We assume that all failures are crash failures: that is, a failed replica ceases all communication with its surroundings. The two-thirds consensus protocol is a simple protocol for coming to consensus, in a manner that tolerates n failures, by using (precisely) 3n + 1 replicas.

Input events communicate proposals, which consist of integer/command pairs: n,cproposes that command c be the nth one performed. The protocol is intended to obtain agreement, for each n, on which command will be the nth to be performed, and to broadcast a notify message with those decisions (which are also integer/command pairs) to a list of clients.

Each copy of the replicated system will contain a module that carries out the consensus negotiations. In this specification we describe only those modules (which we continue to call Replicas). To specify the full system we would have to include a description of how those decisions are used. That is done in the description of the Paxos protocol (section 6).

Browsing notes:Readers may find it helpful to display the full code alongside this page. view code

For convenient display, we split the full specification into smaller chunks: figure 4 contains the prefatory information (parameters, imports, type definitions, message declarations, variables, auxiliaries) and figures  5 through 8 define the classes. Section 5.1 walks through code, redisplaying fragments of the text as they are discussed. A reader may find it helpful first to concentrate on the informal description of each class provided and then, before studying details, turn to section 5.2 to see some scenarios showing the protocol in action. Section 5.3 explains why the protocol satisfies the basic safety property of consistency—it will not send contradictory notifications. That section also defines the precise sense in which the protocol can “tolerate” up to flrs “failures,” but does not provide a proof of that.

5.1 The specification of 2/3-consensus

5.1.1 Preliminaries

This section comments on the preliminary definitions given figure 4, and also introduces the library combinator until.

(* ------ Parameters ------ *) (* consensus on commands of arbitrary type Cmd with equality decider (cmdeq) *) parameter Cmd, cmdeq : Type * Cmd Deq parameter coeff : Int parameter flrs : Int (* max number of failures *) parameter locs : Loc Bag (* locations of (3 * flrs + 1) replicas *) parameter clients : Loc Bag (* locations of the clients to be notified *)

Parameters

The parameters of the protocol are

  • Cmd: the type of commands
  • flrs: the max number of failures to be tolerated
  • locs: the locations of the 3 * flrs + 1 replicated processes that decide on consensus
  • clients: the locations of the clients to be notified of decisions

We make no assumptions about who submits inputs or constraints on how they are submitted.

The declaration of the Cmd parameter also introduces a parameter for an equality operator:

    parameter Cmd, cmdeq : Type * Cmd Deq

When we instantiate the type Cmd, we must also instantiate cmdeq with an operation that decides equality for members of that type. The keyword Deq denotes a type constructor: (Cmd Deq) is the type of all equality deciders for Cmd. We need cmdeq because we want to apply deq-member to compute membership in a list of commands; as noted in section 3.2, we must therefore supply an equality decider.

Variables

(* ------ Variables ------ *) variable sender : Loc variable loc : Loc variable ni : RoundNum variable n : CmdNum

One reason for the variable declarations, such as

   variables sender : Loc 
   variable ni      : RoundNum

is to introduce notational conventions that make the specification easier to read. Type checking will object if the notations are misused.

A second reason is to help the type inference algorithm, which sometimes requires hints about the types of the arguments to functions being defined. An expression or pattern may be labeled with a type, which will be checked statically, and may also constrain polymorphism that might otherwise arise. E.g., after

    let foo x = x ;; 
    let bar (x:Int) = x ;; 
    let baz (x,y:Bool) = (x,y) ;;

foo is the polymorphic identity function on every type; bar is the identity function on integers; and baz is the identity function on pairs whose second coordinate is boolean. Typically, we want library functions to be highly polymorphic and widely applicable, but the functions defined in EventML programs to be much more constrained. By and large, the polymorphism of an EventML program is expressed in its parameters.

Without variable declarations for ni and sender the definition of the newvote operation would have to be expressed as

   let newvote (ni:RoundNum) ((ni’,c),sender:Loc) (_,locs) = ... ;;

but with those declarations, we may simply write

   let newvote ni ((ni’,c),sender) (_,locs) = ... ;;

As a practical matter, there’s not much point in trying to anticipate where type inference needs hints. Most commonly, help may be needed when the right hand side of the definition calls on a polymorphic function such as deq-member, which operates on lists of any type that has a decidable equality operator.

The balance between introducing variable declarations and adding type labels to patterns and expressions is a matter of taste.

Auxiliaries

We introduce a convenient notation for specifying the “init” parameter of SM*-class or Memory* (section 4):

(* ------ Auxiliaries ------ *) let init x loc = {x} ;;

Used in that context, (init x) is the function that assigns the initial state x to every location.

Class combinators

The specification uses one new library combinator:

  • X until Y: v (X until Y ) (e) iff v X (e) and no Y -event has previously occurred at loc (e). That is, at any location l, the class (X until Y) acts exactly like X until a Y-event occurs at l, after which it falls silent.


(* ---------- ReplicaState: a state machine ---------- *) 
 
(* -- inputs -- *) 
let vote2prop loc (((n,i),c),loc’) = {(n,c)} ;; 
class Proposal = propose’base || (vote2prop o vote’base);; 
 
let update_replica (n,c) (max,missing) = 
  if n > max 
  then (n, missing ++ (from-upto (max + 1) n)) 
  else if deq-member (op =) n missing 
       then (max, list-diff (op =) missing [n]) 
       else (max,missing) ;; 
 
class ReplicaState = Memory1 (init (0,nil)) update_replica Proposal ;; 
 
(* ---------- NewVoters ---------- *) 
 
let when_new_proposal loc (n,c) (max,missing) = 
  if n > max or deq-member (op =) n missing then {(n,c)} else {} ;; 
 
class NewVoters = when_new_proposal o (Proposal, ReplicaState) ;;

Figure 5: 2/3 consensus: NewVoters and ReplicaState


 
(* ---------- QuorumState ---------- *) 
 
let newvote ni ((ni’,c),sender) (cmds,locs) = 
  ni = ni’ & !(deq-member (op =) sender locs);; 
 
let add_to_quorum ni ((ni’,c),sender) (cmds,locs) = 
  if newvote ni ((ni’,c),sender) (cmds,locs) 
  then (c.cmds, sender.locs) 
  else (cmds,locs);; 
 
class QuorumState ni = Memory1 (init (nil,nil)) (add_to_quorum ni) vote’base ;; 
 
(* ---------- Quorum ---------- *) 
 
let roundout loc (((n,i),c),sender) (cmds,_) = 
  if length cmds = 2 * flrs 
  then let (k,x) = poss-maj cmdeq (c.cmds) c in 
         if k = 2 * flrs + 1 
         then bag-append (decided’broadcast locs n) 
                         (notify’broadcast clients (n,x)) 
         else { retry’send loc ((n,i+1), x) } 
  else {} ;; 
 
let when_quorum ni loc vote state = 
  if newvote ni vote state then roundout loc vt state else {} ;; 
 
class Quorum ni = (when_quorum ni) o (vote’base, QuorumState ni) ;; 
 
(* ---------- Round ---------- *) 
 
class Round (ni,c) = Output(\loc.vote’broadcast locs ((ni,c),loc)) 
                  || Once(Quorum ni) ;;

Figure 6: 2/3 consensus: Rounds and Quorums


(* ---------- NewRoundsState ---------- *) 
 
let vote2retry loc ((ni,c),sender) = {(ni,c)};; 
let RoundInfo = retry’base || (vote2retry o vote’base);; 
 
let update_round n ((m,i),c) round = if n = m & round < i then i else round ;; 
 
class NewRoundsState n = Memory1 (init 0) (update_round n) RoundInfo ;; 
 
(* ---------- NewRounds ---------- *) 
 
let when_new_round n loc ((m,i),c) round = 
  if n = m & round < i then {((m,i),c)} else {} ;; 
 
class NewRounds n = (when_new_round n) o (RoundInfo, NewRoundsState n) ;; 
 
(* ---------- Voter ---------- *) 
 
class Halt n = (\_.\m. if m = n then {()} else {}) o decided’base;; 
 
class Voter (n,c) = Round ((n,0),c) 
                 || ((NewRounds n >>= Round) until (Halt n));;

Figure 7: 2/3 consensus: NewRounds and Voters


(* ---------- Replica ---------- *) 
 
class Replica = NewVoters >>= Voter;; 
 
(* ---------- Main program ---------- *) 
 
main Replica @ locs ;;

Figure 8: 2/3 consensus: The top level


5.1.2 The top level: Replica

Replica is the event class characterizing the actions of a decider. As noted in figure 8, the main program

main Replica @ locs

installs a decider at each location in locs.

For each n, a Replica will spawn (at most) one instance Voter to communicate with other instances of Voter and come to consensus on a single proposal of the form (n,_).

class Replica = NewVoters >>= Voter ;;

For each n, NewVoters spawns a Voter in response to the first proposal or vote it receives concerning command n.

We define consensus on proposal n,cto mean that 2/3 (plus one) of the replicas vote for it. On any particular poll of the voters that degree of consensus cannot be guaranteed—so we allow do-over polls, for which we adopt the following terminology. Successive polls for each command number are assigned consecutive integers called innings; the pair command_number,inningis called the polling or voting round.

Votes are of type Vote. Each contains:

  • the round in which the vote is cast
  • a command being voted for in that round
  • the voter’s location (used to ensure that repeat votes from the same source are ignored)

5.1.3 ReplicaState and NewVoters

This section refers to figure 5.

A Replica acts when NewVoters does, in response to propose and vote inputs. These are recognized by the class Proposal:

  let vote2prop loc (((n,i),c),sender) = {(n,c)} ;; 
  class Proposal = propose’base || (vote2prop o vote’base);;

Proposal observes the value of type Proposal input in its input.

ReplicaState maintains the state of a Replica, enough information to recognize the first time it sees a Proposal-event about command n (meaning a value of the form n,cfor some command c). Its state has type Int * (Int List). The Int component is the greatest n for which it has seen such an event; and the (Int List) component is the list of all natural numbers less than that maximum for which it has not yet seen a proposal event.

   let update_replica (n,c) (max,missing) = 
     if n > max 
     then (n, missing ++ (from-upto (max + 1) n)) 
     else if deq-member (op =) n missing 
          then (max, list-diff (op =) missing [n]) 
          else (max,missing) ;; 
 
   class ReplicaState = Memory1 (init (0,nil)) update_replica 
                                 Proposal ;;

The initial state of a ReplicaState is 0,nil. The infix operator ++ is the append operator and the imported Nuprl operations from-upto and list-diff have the following meanings:

                from- upto i j =  [i;i+ 1;i+ 2;...;j - 1]

list- diff (op =)[a;b;...][m;n;...]  =  the result of deleting all occurrences of m, n,... from [a;b;...]

Every event is a ReplicaState-event, and observes the state of the state machine when the event occurs (before any processing).

NewVoters-events are Proposal-events. NewVoters compares the data observed by Proposal with the state of the replica when the message arrives, in order to decide whether it is the first proposal about some n.

   let when_new_proposal loc (n,c) (max,missing) = 
     if n > max or deq-member (op =) n missing 
     then {(n,c)} 
     else {} ;; 
 
   class NewVoters = when_new_proposal o (Proposal, ReplicaState) ;;

5.1.4 The next level: Voter

A Voter is a parallel composition of two classes:

class Voter (n,c) = Round ((n,0),c) 
                 || ((NewRounds n >>= Round) until Halt n);;

where:

  • Round((n,i),c) will, at any location, conduct the voting for round (n,i), and will cast its vote in that round for command c.
  • NewRounds n >>= Round will determine when it is time to begin a new round of voting for the nth command and spawn a class to conduct the voting in that new round.
  • The clause “until Halt n will cause termination when it detects a Halt n event, which signals that some Voter has found a consensus for command n.

5.1.5 Round and Quorum

This section refers to figure 6.

Round ((n,i),c)

Round ((n,i),c), running at location loc, broadcasts a vote from loc for command c in round (n,i) and runs an instance of Quorum. Quorum (n,i) keeps a tally of votes received at l in round (n,i) and uses that tally to determine either that consensus has been reached (in which case it notifies the clients and sends every Replica, including itself (i.e., the replica that spawned it), a decided message) or that consensus might not be possible in inning i (in which case it sends to itself a suitable retry message).

class Round (ni,c) = Output(\loc.vote’broadcast locs ((ni,c),loc)) 
                     || Once(Quorum ni) ;;

(Quorum (n,i)) is a state machine that responds to vote messages. Intuitively, its state consists of a pair cmds,locs. Each time it receives a new vote for proposal n,cin inning i, it prepends c to the list cmds. locs is the list of the locations that sent those commands. (We keep the list of senders so that, if a vote from any sender is delivered multiple times, it will only be counted once.) The initial state is a pair of empty lists. (QuorumState (n,i)) is the “pre” Moore machine that observes this state when a vote arrives.

let newvote ni ((ni’,c),sender) (cmds,locs) = 
  ni = ni’ & !(deq-member (op =) sender locs);; 
 
let add_to_quorum ni ((ni’,c),sender) (cmds,locs) = 
  if newvote ni ((ni’,c),sender) (cmds,locs) 
  then (c.cmds, sender.locs) 
  else (cmds,locs);; 
 
class QuorumState ni = Memory1 (init (nil,nil)) 
                                    (add_to_quorum ni) vote’base ;;

The transition function for (QuorumState (n,i)) is (add_to_quorum (n,i)). A vote message is a no-op unless it’s a vote in round (n,i) that comes from a new location. If it’s both, then the vote is tallied by prepending to it state components the command it votes for and the location of its sender.

Quorum (n,i) is a Mealy machine defined from QuorumState. It produces an output once it has received votes from 2 flrs + 1 distinct locations. If all of them are votes for the same command d, it broadcasts notify and decided messages. If not, then it is possible that on this round no proposal will ever receive 2 flrs + 1 votes; so it sends itself a retry message to trigger initiation of inning i + 1. (Once it has sent the retry message it will ignore any votes it subsequently receives in round n,i, even if they would result in some proposal’s receiving 2 flrs + 1.)

 
let roundout loc (((n,i),c),sender) (cmds,_) = 
  if length cmds = 2 * flrs 
  then let (k,x) = poss-maj cmdeq (c.cmds) c in 
         if k = 2 * flrs + 1 
         then bag-append (decided’broadcast locs n) 
                         (notify’broadcast clients (n,x)) 
         else { retry’send loc ((n,i+1), x) } 
  else {} ;; 
 
let when_quorum ni loc vote state = 
  if newvote ni vote state then roundout loc vt state else {} ;; 
 
class Quorum ni = (when_quorum ni) o (vote’base, QuorumState ni) ;;

Consider first the outer conditional. The (cmds,_) argument matches the value observed by QuorumState, so (length cmds) is the number of votes tallied before the input arrives. If this test fails then, even with the new input, the state machine will not yet have received 2 flrs + 1 votes, so the input is ignored.

Consider the inner conditional. The imported operation poss-maj implements the Boyer-Moore majority algorithm. Thus, the locally defined constants k and x have the following meaning: If some element of the list c.comds appears in a majority of its entries, x is that element and k is the number of times it occurs. Thus, the inner conditional tests for unanimity.

The data of a retry message consists of the new round to be initiated and, in addition, the name of a command to propose in this new round. The definition of roundout attempts to choose that command in a reasonable way: So, if the votes are not unanimous, but some command receives a majority, that majority-receiving command will be proposed in the retry message.8This is crucial to the correctness of the protocol.

5.1.6 NewRounds and Voters

This section refers to figure 7.

Halt n

Halt n recognizes the arrival of decided message with body n. We make it a class of type Unit, since the only information conveyed is the fact that the message has arrived.

class Halt n = 
    Once((\_.\i.if i = n then {()} else {}) o decided’base) ;;

NewRounds n

Recall that (NewRounds n) decides when to initiate a new round of voting about the nth command and, when necessary, spawns an instance of Round, supplying it with a new round number of the form n,_and a command to vote for in that round.

(NewRoundsState n) is a “pre” Moore machine. It’s state is an integer, initially 0. At any location it keeps track of the greatest inning i for which it has “participated” in a round of the form n,i. A location has “participated” in such a round if it has received a retry message with data ⟨⟨n,i,_, or a vote message with data ⟨⟨⟨n,i,_,_. So its input events are recognized by RoundInfo, which observes the round/command pair embedded in its input.

let vote2retry loc ((ni,c),sender) = {(ni,c)};; 
class RoundInfo = retry’base || (vote2retry o vote’base);;

The transition function, update_round, updates the state whenever its input constitutes participation in an inning greater than the current state value:

let update_round n ((m,i),c) round = if n = m & round < i 
                                     then i else round ;; 
class NewRoundsState n = 
    Memory1 (init 0) (update_round n) RoundInfo ;;

There’s some redundancy in defining the Mealy machine NewRounds from NewRoundsState.9The next version of the library will contain a different set of combinators that avoids that. The condition in when_new_round is the same as that in the transition function updated_round: when the transition is a no-op, NewRounds ignores the input; when it’s not, NewRounds passes along the input that caused the update.

let when_new_round n loc ((m,i),c) round = 
  if n = m & round < i then {((m,i),c)} else {} ;; 
class NewRounds n = 
    (when_new_round n) o (RoundInfo, NewRoundsState n) ;;

5.2 Illustrative runs of the protocol

This section contains message sequence charts that describe some possible runs of the 2/3-consensus protocol. To make the charts easier to read, all message arrows are drawn horizontally (except for self-messages).10A horizontal arrow does not imply instantaneous communication. That requires a small, but semantically inessential, deviation from the official semantics of EventML. Actions that are atomic in EventML may be shown as nonatomic. Consider figure 9. The top diagram shows A broadcasting message x to B, C, and D as a single event. At C, the act of receiving message x and replying with y is atomic. The second diagram teases everything apart.

We can represent delay in message transit, in part, as a delay in sending the message. Since only message arrivals are observable, no distinction between the picture and the official semantics will be observable.



Figure 9: Simplifying Pictures [+]


A detailed look at retry Figure 10 shows (part of) one possible run of the consensus protocol, in which a round ends not in consensus but in a retry that starts a new round. We assume that flrs = 1, so there are four instances of Replica and a proposal will be accepted if it gets three votes. The diagram does not depict all the classes—in particular, we show only three of the replicas—and does not display all the messages sent. It contains abbreviations, which are defined in the following table:

vote1x=[vote : ((2, 0), x, l1 ) ]
vote2 =[ vote : ((2, 0), x, l2 ) ]
vote4y =[ vote : ((2, 0), y, l4 ) ]
retryx =[ retry : ((2, 1), x) ]
vote2x=[ vote : ((2, 1), x, l2 ) ]
α : start Round ((2,0),x); Quorum state = ([x], [l1 ])
β : Quorum state = ([x; x], [l1 ; l2 ])
γ : Quorum state = ([x; x; y], [l1 ; l2 ; l4 ])
δ : start Round ((2,1),x); Quorum state = ([x], [l2 ])

Note that votes not marked with a “” are cast in inning 0 (i.e., in this case, round (2,0)) and votes marked with “” are cast in inning 1.

This run begins when the Replica at location l1 receives a proposal (2,x) from the environment. We assume that location l1 has not previously received a vote or proposal for command 2; accordingly, it responds by spawning an instance of Voter (2,x) at l1. Only one component of this Voter will play a role: Round ((2,0),x). This class broadcasts vote1x, a vote for the proposal it received—though the diagram shows only two of those messages. Its Quorum component plays no role in this part of the run.

A Replica can respond to either a vote or a proposal. When the Replica at location l2 receives vote1x (also assumed to be new), it spawns an instance of Voter (2,x) at l2. This initiates an instance of Round ((2,0),x) at location l2, which will broadcast vote2 and spawn an instance of Quorum (2,0) at l2. Of this broadcast we show only the message it sends to itself.11The Replica at location l2 sees this vote but, as it has already seen a vote for command 2, the self message does not cause it to spawn a new Voter. Comment α says that the vote that spawned the Round updates the internal state of Quorum to ([x],[l1]), recording the fact that a vote for command x came from l1. As β indicates, the self message updates the state of this Quorum to ([x;x],[l1;l2]).

Meanwhile, the Replica at location l4 has received a competing proposal: that command 2 be y, not x. It spawns Voter (2,y), which broadcasts vote4y; we show only the message received by the Voter at l2. This updates the state of Quorum at l2 to ([x;x;y],[l1;l2;l4]). Once it has received votes from three distinct locations Quorum makes a decision: in this case, because the votes are not unanimous, it must start a new round by sending itself a retry message.12It is possible that the fourth Replica would cast a vote for proposal (2,x), providing the three votes, but that would come too late. As δ indicates, this retry starts Round ((2,1),x). So the Voter at l2 begins by broadcasing vote2x.


retry
Figure 10 Detailed example of a retry [+]


Notification and retry in the same round Figure 11 shows part of a run in which the Voter at l1 broadcasts a notification that the second command will be x, but the Voter at l2 sends a retry that launches a new round. As before, the diagram does not depict all the classes or all the messages sent. Instead of walking through the successive states of the Quorum classes, we only note their states when they reach a decision. The abbreviations are as follows:

vote1x =[ vote : ((2, 0), x, l1 ) ]
vote2x =[ vote : ((2, 0), x, l2 ) ]
vote3x =[ vote : ((2, 0), x, l3 ) ]
vote4y =[ vote : ((2, 0), y, l4 ) ]
decidedx=[ decided : (2, x)) ]
notifyx =[ notify : (2, x) ] is broadcast to all clients
retryx =[ retry : ((2, 1), x) ]
α : Quorum state = ([x; x; x], [l1 ; l2 ; l3 ])
β : Quorum state = ([x; x; y], [l1 ; l2 ; l3 ])
γ : start Round ((2,1),x); Quorum state = ([x], [l2 ])

The first three votes seen by the Voter at location l1 are votes for x, so it notifies all clients that agreement has been reached—command 2 is x—and sends a decided message to stop all the Voters working on command 2. The Voter at location l2 sees two votes for x and one for y and it launches a new round before it receives the decided message that stops it. The crucial point is that, on launching this round it casts its vote for x. If the retry proposed y, it might be possible that the remaining voters in some later round would come to consensus on command y; clients would then receive a contradictory notification saying that command 2 is y. Section 5.3 explains why this calamity cannot occur.


Notify and Retry
Figure 11: Notify and Retry on the same round [+]


Failure to achieve consensus Figure 12 illustrates a run in which this protocol fails to achieve consensus, a possibility that, according to the FLP theorem FLP85] is inevitable. The abbreviations are as follows:

vote1x=[ vote : ((2, 0), x, l1 ])
vote2x =[ vote : ((2, 0), x, l2 ])
vote3y =[ vote : ((2, 0), y, l3 ])
vote4y =[ vote : ((2, 0), y, l4 ])
retryx =[ retry : ((2, 1), x ])
retryy =[ retry : ((2, 1), y ])
α1 : Quorum state = ([x; x], [l1 ; l2 ])
α2 : Quorum state = ([x; x; y], [l1 ; l2 ; l4 ])
α3 : start Round ((2,1),x); Quorum state = ([x], [l1 ])
β1 : Quorum state = ([y; y], [l4 ; l3 ])
β2 : Quorum state = ([y; y; x], [l4 ; l3 ; l1 ])
β3 : start Round ((2,1),y); Quorum state = ([y], [l4 ])

We omit the Voter classes spawned at locations l2 and l3, depicting their messages as coming directly from the Replica classes themselves. In round (2,0), the Replica at l1 votes for x and the Replica voting for y. This exchange of messages results in abandoning round (2,0). But round (2,1) begins in exactly the same way: with Replica at l1 voting for x and at l2 voting for y. The pattern can in principle repeat endlessly.


Notify and Retry
Figure 12: Failure to achieve consensus [+]


5.3 Properties of the 2/3-consensus protocol

Consistency The 2/3-consensus protocol must satisfy the basic safety property of consistency—if the messages ([ notify : n, x]) and ([ notify : n, y]) are sent, then x = y. The example in figure 11 makes it clear that the following property is necessary (though not quite sufficient) to guarantee that.

If, in any round, some Voter finds a quorum for command x, then, in that round, x is the only command that can be proposed by a retry message.

PROOF: Suppose that one Voter sees 2f + 1 votes for command x in a given round. Since each Voter votes for only one command in any round, that round can contain no more than f votes for any command other than x. Now consider the situation of any other Voter making a decision in that round: It will have received 2f + 1 votes, and at most f of them can be for a command other than x. Therefore, at least f + 1 of the votes it sees must be for x; so if it sends a retry message, that retry proposes x.

The argument is not quite done. Suppose one Voter finds a quorum for x in round n,ibut other Voters do not, and will therefore participate in subsequent rounds. Is it possible that one of those later rounds contains a vote for some other command y (possibly as the result of a new proposal received from some external source), and that, as a result, some later round n,jfinds a quorum for y? No, because a stronger property holds.

If some Voter finds a quorum for command x in round n,ithen in any round n,jwith j > i all votes cast are votes for x.

PROOF: Every vote can ultimately be traced either to a retry message or to a proposal message received by some Replica from an external source. However, a Replica will ignore a proposal with body n,cunless it has never before received either a proposal or a vote for something of the form n,_. Thus, votes that arise from external proposals can be cast only in rounds of the form (_,0). That is to say that all votes in round n,iwith i > 0 arise from retry messages sent in round n,i - 1. So, by induction, once we encounter any round in which all retry messages are for command x, all subsequent rounds can only contain votes for x.

Fault tolerance When a process suffers a crash failure it stops sending messages. (It does not perform erratically by, e.g., violating the requirements of the protocol.) The 2/3-consensus protocol will tolerate up to flr crash failures, in the following sense:

All executions of the protocol that suffer only crash failures, and no more than flr of those, are non-blocking—that is, execution never reaches a state from which consensus is impossible.

By the FLP theorem, this is the strongest fault tolerance guarantee that a consistent consensus protocol can provide.