Ethereum Client Diversity Part 3: Consensus-layer Diversity

June 10, 2024

A Wild Nasty Consensus Bug Appears

We take as an example a consensus bug that happened on Prysm at the dawn of the Goerli testnet, where all Prysm validators wrongly computed which epoch was to be finalized. The consequences of this class of bug depend on the market share of the affected consensus client.

As shown in the first article of the series, validators cast attestation votes with a source (what the validator considers finalized) and a target epoch (what the validator tries to justify). These votes are guarded by slashing rules, which constrain the future votes the validator can make. Notably, a validator that commits to a source vote for an epoch can’t later vote with a lower source for the current target epoch without being slashed: its next source vote has to be at a higher height. The implication is that if a validator gets its source vote wrong, its next actions depend on whether the rest of the network is able to make progress and finalize, so that a new vote at a higher epoch becomes possible.

Let’s take an example: a validator running a buggy consensus client computes the consensus state of a block and gets the accounting of validators wrong, to the point that it thinks its view of the network gathered 2/3 of the votes. This splits the network in two parts:

validators that do not run the faulty client are not able to validate the faulty block against their block hash, and as a result consider the block missed (A-side); they continue to vote on top of the missed block in their own branch;
validators that run the faulty client are able to match the block and cast a vote for it as well (B-side); on top of this, they commit to finalizing the branch for this block.

  graph LR
	subgraph slot_64["slot 64"]
   	block_A["Block A"]
	end

  subgraph B_Side["Side B (bug)"]
		subgraph slot_96["slot 96"]
	   	block_B'["Block B'"]
		end
		subgraph slot_128["slot 128"]
	   	block_C'["Block C'"]
		end
	end

	subgraph A_Side["Side A (no bug)"]
		subgraph slot_96'["slot 96"]
	   	block_B["Block B"]
		end
	end

  slot_64 -.- slot_96
  slot_64 -.- slot_96'
  slot_96 -.-> slot_128

  linkStyle 2 stroke-width:2px,fill:none,stroke:red;
  %% linkStyle 4 stroke-width:2px,fill:none,stroke:green;
  %% linkStyle 4,5 stroke-width:2px,fill:none,stroke:green;

*Note: this is a representation of checkpoint slots/blocks (which sit on epoch boundaries); the actual missed blocks leading to the fork can be anywhere between slot 65 and slot 96.

Let’s explore the different scenarios of such a bug in the context of diversity.

Case 1: the faulty client is a minority client

If the bug is on a minority consensus client, there is a super-majority on the correct side, as more than 2/3 of the validators are able to vote on the A-side. As a result, the network continues to finalize on the correct branch; the only loss is the blocks and rewards missed by the validators running the faulty consensus client.

Resolution and impact

The operators of the faulty consensus client have multiple choices in this case, because the rest of the network continued to finalize on the A branch:

upgrade the consensus client to the fixed version once the consensus client team has investigated the bug and come up with a solution,
switch to another consensus client type if they have one already synced on the side.

This would lead to the following situation: a validator who incorrectly voted for finalization on the non-finalized branch (S=96, root=B’) is then able to rejoin the correct branch, which was able to finalize as it gathered more than 2/3 of votes:

graph LR
	subgraph slot_64["slot 64"]
   	block_A["Block A"]
	end


	subgraph A_Side["Side A (no bug)"]
		subgraph slot_96'["slot 96"]
	   	block_B["Block B"]
		end
		subgraph slot_128'["slot 128"]
	   	block_C["Block C"]
		end
		subgraph slot_160'["slot 160"]
	   	block_D["Block D"]
		end
	end

  subgraph B_Side["Side B (bug)"]
		subgraph slot_96["slot 96"]
	   	block_B'["Block B'"]
		end
		subgraph slot_128["slot 128"]
	   	block_C'["Block C'"]
		end
	end

  slot_64 -.- l0[<1/3] -.- slot_96
  slot_96 -.-> slot_128

  slot_64 --- l1[>2/3] --- slot_96'
  slot_96' --- slot_128'
  slot_128' --> slot_160'

  linkStyle 2 stroke-width:2px,fill:none,stroke:red;
  linkStyle 6 stroke-width:2px,fill:none,stroke:green;
  %% linkStyle 4,5 stroke-width:2px,fill:none,stroke:green;

The overall impact here is minimal for the validators running the faulty consensus client, and similar to a short downtime.

*Note: there can’t be both a majority client (>33%) and a super-majority client (>66%) on the network at the same time.

Assuming a 12-hour downtime with 1 million validators on the network, the expected outage cost per validator is around 0.002 ETH (about $6 with ETH at ~$3000).
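
The ~0.002 ETH figure can be roughly reproduced from the reward formula. The sketch below assumes the Altair-era constants of the consensus specification (a base reward factor of 64, and timely source and target weights of 14 and 26 out of 64), every validator at 32 ETH, and an offline validator that both forgoes its full base reward and pays the source and target penalties each epoch. The function name is hypothetical, and proposer and sync-committee luck are ignored, so treat it as an order-of-magnitude estimate rather than an exact accounting.

```python
from math import isqrt

GWEI_PER_ETH = 10**9
EFFECTIVE_BALANCE_INCREMENT = 10**9   # 1 ETH, in Gwei
BASE_REWARD_FACTOR = 64               # spec constant (assumed unchanged)
WEIGHT_DENOMINATOR = 64
TIMELY_SOURCE_WEIGHT = 14
TIMELY_TARGET_WEIGHT = 26
EPOCHS_PER_DAY = 225                  # one epoch = 32 slots * 12 s = 384 s


def downtime_cost_eth(hours: float, validators: int, balance_eth: int = 32) -> float:
    """Estimate missed rewards plus penalties for one offline validator."""
    total_balance = validators * balance_eth * GWEI_PER_ETH
    per_increment = EFFECTIVE_BALANCE_INCREMENT * BASE_REWARD_FACTOR // isqrt(total_balance)
    base_reward = balance_eth * per_increment          # Gwei per epoch
    epochs = hours / 24 * EPOCHS_PER_DAY
    missed = base_reward * epochs
    penalty_weight = TIMELY_SOURCE_WEIGHT + TIMELY_TARGET_WEIGHT
    penalties = base_reward * penalty_weight / WEIGHT_DENOMINATOR * epochs
    return (missed + penalties) / GWEI_PER_ETH


print(f"{downtime_cost_eth(12, 1_000_000):.4f} ETH")
```

With these assumptions the estimate comes out at about 0.002 ETH for 12 hours, roughly two thirds of it missed rewards and one third penalties.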

Case 2: the faulty client is a majority client

If the bug is on a majority consensus client, there is no super-majority on the correct side, as less than 2/3 of the validators are able to vote on the A-side. As a result, the A-side of the network can’t finalize and enters an inactivity leak: validators who do not vote on side A are increasingly penalized.

Resolution and impact

The validators from the B side are stuck in this situation; they can’t cast a useful vote on side A:

There is no future height that finalized on the A-side, so voting on a higher source has no effect: finality is achieved on a link that gathered at least 2/3 of votes,
If they try to finalize the A side by voting on a lower source, they get slashed, because it would violate the surround rule of slashing we discussed in the first article in the series.
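
Both points can be checked against the two slashing conditions. The sketch below follows the structure of the `is_slashable_attestation_data` check from the consensus specification, but uses a simplified, hypothetical `Vote` type and treats the checkpoint slots of the diagrams (64, 96, 128, 160) as heights. It is an illustration, not client code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    """Simplified attestation: source/target heights and target root (hypothetical type)."""
    source: int
    target: int
    root: str


def is_slashable(old: Vote, new: Vote) -> bool:
    # Rule 1 (double vote): two different votes for the same target height.
    double_vote = old != new and old.target == new.target
    # Rule 2 (surround vote): one vote's source..target span strictly contains the other's.
    surround = (old.source < new.source and new.target < old.target) or (
        new.source < old.source and old.target < new.target
    )
    return double_vote or surround


# A side-B validator already voted 96 -> 128 on the faulty branch.
vote_on_b = Vote(source=96, target=128, root="C'")

# Voting 64 -> 160 on side A surrounds the earlier vote, so it is slashable.
print(is_slashable(vote_on_b, Vote(source=64, target=160, root="D")))
# Voting 128 -> 160 is not slashable, but 128 is not justified on side A, so it is pointless.
print(is_slashable(vote_on_b, Vote(source=128, target=160, root="D")))
```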

graph LR
	subgraph slot_64["slot 64"]
   	block_A["Block A"]
	end


	subgraph A_Side["Side A (no bug)"]
		subgraph slot_96'["slot 96"]
	   	block_B["Block B"]
		end
		subgraph slot_128'["slot 128"]
	   	block_C["Block C"]
		end
		subgraph slot_160'["slot 160"]
	   	block_D["Block D"]
		end
		ls0
	end

  subgraph B_Side["Side B (bug)"]
		subgraph slot_96["slot 96"]
	   	block_B'["Block B'"]
		end
		subgraph slot_128["slot 128"]
	   	block_C'["Block C'"]
		end
	end

  slot_64 -.- l0[>1/3 && <2/3] -.- slot_96
  slot_96 -.-> slot_128

  slot_64 -.- l1[>1/3 && <2/3] -.- slot_96'
  slot_96' -.- slot_128'
  slot_128' -.- ls0[pointless, 128 is not finalized] -.- slot_160'
  
  slot_64 -.- ls1[slashable, violates rule 2] -.- slot_160'

  linkStyle 2,6,7,8,9 stroke-width:2px,fill:none,stroke:red;
  %% linkStyle 6 stroke-width:2px,fill:none,stroke:green;
  %% linkStyle 4,5 stroke-width:2px,fill:none,stroke:green;

The only solution for validators from side B is to wait for finalization to happen, that is to say, to accumulate inactivity leak penalties until the total effective balance of side B is less than 1/3 of the network. At that point the A side is able to finalize again, and they can vote on top of it again:

graph LR
	subgraph slot_64["slot 64"]
   	block_A["Block A"]
	end


	subgraph A_Side["Side A (no bug)"]
		subgraph slot_96'["slot 96"]
	   	block_B["Block B"]
		end
		subgraph slot_128'["slot 128"]
	   	block_C["Block C"]
		end
		subgraph slot_160'["slot N"]
	   	block_D["Block D"]
		end
			subgraph slot_192'["slot M"]
	   	block_E["Block E"]
		end
	end

  subgraph B_Side["Side B (bug)"]
		subgraph slot_96["slot 96"]
	   	block_B'["Block B'"]
		end
		subgraph slot_128["slot 128"]
	   	block_C'["Block C'"]
		end
	end

  slot_64 -.- ls1[>1/3] -.-  slot_96
  slot_96 -.-> slot_128

  slot_96' -.- slot_128'
  slot_64 -.- ls2[<2/3] -.- slot_96'
  slot_128' -.- slot_160'
  slot_64 --- ls3[>2/3 after inactivity leak] --- slot_160'
  slot_160' --> slot_192'
 

  linkStyle 2 stroke-width:2px,fill:none,stroke:red;
  linkStyle 9 stroke-width:2px,fill:none,stroke:green;
  %% linkStyle 4,5 stroke-width:2px,fill:none,stroke:green;

The impact of such an event is massive for anyone running validators on the faulty majority client. Let's imagine a situation where a faulty consensus client has a 40% share of the validation power: the effective balance of that 40% of the network has to go down until the remaining 60% can finalize, which means that each faulty validator with 32 ETH has to lose about 7 ETH.
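
The order of magnitude can be checked with a simplified model: if side B holds a share of the stake and every validator starts at 32 ETH, side A finalizes once its balance reaches 2/3 of the total. The sketch below solves for the per-validator loss on side B at that threshold. The function name is hypothetical, and the model assumes side A's balances stay at 32 ETH while ignoring exits, new deposits and effective-balance hysteresis, so it is an approximation that lands slightly above the ~7 ETH quoted in the article.

```python
def loss_needed_for_finality(faulty_share: float, balance_eth: float = 32.0) -> float:
    """ETH each side-B validator must lose before side A holds 2/3 of the stake."""
    if not 1 / 3 < faulty_share < 1:
        raise ValueError("side A is only blocked when the faulty share is between 1/3 and 1")
    # Side A finalizes when A >= 2/3 * (A + B), which simplifies to B <= A / 2.
    side_a = (1 - faulty_share) * balance_eth      # stake-weighted balance of side A
    target_balance = side_a / (2 * faulty_share)   # per-validator balance on side B
    return balance_eth - target_balance


print(f"{loss_needed_for_finality(0.40):.1f} ETH")
```

For a 40% share this simplified model gives a loss of about 8 ETH per faulty validator.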

There is however the possibility for faulty validators to trigger an exit to avoid losing this full amount, but only the first ones to trigger the exit will be able to do it in time. As seen in the first article of the series, burning ~7 ETH in an inactivity leak takes about 12 days. Assuming 1 million active validators (May 2024), it would take about 194 days for 40% of the network to fully exit via the exit queue. Whenever a validator manages to exit from side B, it has a positive effect on the overall situation, as its entire remaining effective balance is removed from the B side. Conversely, if new validators join the network via the entry queue on the A side, their effective balance adds to the A side.
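
The ~12-day figure is consistent with the inactivity-leak formula. The sketch below assumes the post-Bellatrix constants of the consensus specification: the inactivity score grows by 4 per missed epoch, and the per-epoch penalty is balance × score / (4 × 2^24). The function name is hypothetical; it applies the penalty to the running balance rather than the stepped effective balance and ignores ordinary attestation penalties, so it is only an estimate.

```python
INACTIVITY_SCORE_BIAS = 4
INACTIVITY_PENALTY_QUOTIENT = 2**24   # Bellatrix value (assumed still current)
SECONDS_PER_EPOCH = 32 * 12


def days_to_leak(loss_eth: float, balance_eth: float = 32.0) -> float:
    """Days of continuous inactivity leak before a validator has lost `loss_eth`."""
    if not 0 < loss_eth < balance_eth:
        raise ValueError("loss must be positive and smaller than the starting balance")
    balance, score, epochs = balance_eth, 0, 0
    while balance_eth - balance < loss_eth:
        score += INACTIVITY_SCORE_BIAS   # score keeps growing while not voting on the chain
        balance -= balance * score / (INACTIVITY_SCORE_BIAS * INACTIVITY_PENALTY_QUOTIENT)
        epochs += 1
    return epochs * SECONDS_PER_EPOCH / 86_400


print(f"{days_to_leak(7):.1f} days")
```

Under these simplifications, losing 7 ETH takes roughly 12 to 13 days. Because the penalty grows with the score, the loss is approximately quadratic in time: the first ETH takes far longer to burn than the seventh.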

The average penalty for validators who voted on side B can be modeled as a function of the market share of the faulty client, assuming 1 million active validators on the network.

The overall impact here is large, as majority validators can lose up to 23 ETH depending on the market share of the consensus client they use.

Assuming such a bug on a consensus client with a 40% market share and a full exit flow of validators, the expected outage cost per faulty validator is around 5 ETH (about $15,000 with ETH at ~$3000).

Case 3: the faulty client is a super-majority client

Similar to the super-majority execution bugs we covered in the previous article, if the bug is on a super-majority consensus client, finalization will happen on the faulty branch because it has enough votes to reach 2/3. Then either the branch with the bug is accepted by the community, or the validators that voted on it lose a large portion of their stake.
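
The three cases can be summarised by comparing the faulty client's share of stake with the 1/3 and 2/3 thresholds. The helper below is a hypothetical illustration (its name and return labels do not come from any client or specification), and it treats the stake share as a single fixed number, ignoring exits and balance changes.

```python
from fractions import Fraction

TWO_THIRDS = Fraction(2, 3)


def classify_bug(faulty_share: Fraction) -> str:
    """Map the faulty client's stake share to the three cases discussed here."""
    if not 0 < faulty_share < 1:
        raise ValueError("share must be strictly between 0 and 1")
    correct_share = 1 - faulty_share
    # The specification treats exactly 2/3 of the stake as enough to justify.
    if correct_share >= TWO_THIRDS:
        return "case 1: minority bug, side A keeps finalizing"
    if faulty_share >= TWO_THIRDS:
        return "case 3: super-majority bug, side B finalizes"
    return "case 2: majority bug, neither side finalizes until the leak resolves it"


for share in (Fraction(1, 4), Fraction(2, 5), Fraction(7, 10)):
    print(f"{float(share):.0%} -> {classify_bug(share)}")
```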

Case 3A: bug is accepted by the community

If the faulty side B is accepted by the community and the chain continues with the bug in it, validators from side A can join the faulty side: less than 1/3 of the network is on their branch, so they did not justify it. They can:

upgrade the consensus client to the faulty version, if the consensus client team agrees to introduce the bug,
switch to another faulty consensus client type if they have one already synced on the side.

This would lead to the following situation: a validator who voted on the initially correct target (T=96, root=B) is then able to join the faulty branch, which was able to finalize as it gathered more than 2/3 of votes:

graph LR
	subgraph slot_64["slot 64"]
   	block_A["Block A"]
	end


	subgraph A_Side["Side A (no bug)"]
		subgraph slot_96'["slot 96"]
	   	block_B["Block B"]
		end
	end

  subgraph B_Side["Side B (bug)"]
		subgraph slot_96["slot 96"]
	   	block_B'["Block B'"]
		end
		subgraph slot_128["slot 128"]
	   	block_C'["Block C'"]
		end
	end

  slot_64 --- l0[>2/3] --- slot_96
  slot_96 --> slot_128

  slot_64 -.- l1[<1/3] -.-> slot_96'
  
  linkStyle 2 stroke-width:2px,fill:none,stroke:green;
  linkStyle 3,4 stroke-width:2px,fill:none,stroke:red;
  %% linkStyle 4,5 stroke-width:2px,fill:none,stroke:green;

The overall impact here is minimal for the validators running the minority consensus client, and similar to a short downtime.

Assuming a 12-hour downtime with 1 million validators on the network, the expected outage cost per validator is around 0.002 ETH (about $6 with ETH at ~$3000).

Case 3B: bug is not accepted by the community

If the faulty side B is dismissed by the community, the valid side A enters an inactivity leak and burns balances from the B side until it can finalize. Validators from the B side can’t enter side A because they already committed to justify/finalize side B: they would be slashed if they moved, as they would be violating one of the two slashing rules we discussed in the first article in the series.

graph LR
	subgraph slot_64["slot 64"]
   	block_A["Block A"]
	end


	subgraph A_Side["Side A (no bug)"]
		subgraph slot_96'["slot 96"]
	   	block_B["Block B"]
		end
		subgraph slot_128'["slot 128"]
	   	block_C["Block C"]
		end
		subgraph slot_160'["slot 160"]
	   	block_D["Block D"]
		end
			subgraph slot_192'["slot 192"]
	   	block_E["Block E"]
		end
	end

  subgraph B_Side["Side B (bug)"]
		subgraph slot_96["slot 96"]
	   	block_B'["Block B'"]
		end
		subgraph slot_128["slot 128"]
	   	block_C'["Block C'"]
		end
	end

  slot_64 --- ls1[>2/3] ---  slot_96
  slot_96 --> slot_128

  slot_96' -.- slot_128'
  slot_64 -.- ls2[<1/3] -.- slot_96'
  slot_128' -.- slot_160'
  slot_64 --- ls3[>2/3 after inactivity leak] --- slot_160'
  slot_160' --> slot_192'
 

  linkStyle 2 stroke-width:2px,fill:none,stroke:red;
  linkStyle 9 stroke-width:2px,fill:none,stroke:green;
  %% linkStyle 4,5 stroke-width:2px,fill:none,stroke:green;

Similarly to a super-majority execution bug, validators on the B side lose at least 21 ETH.

In most cases, the outcome is for validators from the B side to watch their balance be burned on the A side up until it justifies; then they can vote again on top of the justified block:

graph LR
	subgraph slot_64["slot 64"]
   	block_A["Block A"]
	end
	
  subgraph B_Side["Side B (bug)"]
		subgraph slot_96'["slot 96"]
	   	block_B'["Block B'"]
		end
	end

	subgraph A_Side["Side A (no bug)"]
		subgraph slot_96["slot 96"]
	   	block_B["Block B"]
		end
		subgraph slot_128["slot N"]
	   	block_C["Block C"]
		end
		subgraph slot_160["slot M"]
	   	block_D["Block D"]
		end
	end


  slot_64 --- l1[> 2/3] --> slot_96'
  slot_64 -.- l2[< 1/3] -.- slot_96
  slot_96 -.- slot_128
  slot_64 --- l3[Burn until > 2/3] --> slot_128
  slot_128 --> slot_160

  linkStyle 0,1 stroke-width:2px,fill:none,stroke:red;
  linkStyle 7 stroke-width:2px,fill:none,stroke:green;

Similarly to super-majority execution bugs, the overall impact here is that super-majority validators lose almost everything.

It is important to note that if you are running a minority consensus client during such an event, you did not commit to justification/finalization, so you can still decide to join the other branch of the fork if the social consensus goes that way.

Assuming the minority branch without the bug is chosen by the community, the outage cost is at least 21 ETH for most super-majority validators (~$63,000 with ETH at ~$3000).

Kiln’s approach

To mitigate the risks associated with running popular consensus clients, Kiln employs the following approach:

Use Minority Consensus clients for validation: Currently, our entire validation infrastructure on the mainnet relies on Teku.
Parallel Stacks for Contingency: We maintain parallel validation stacks using alternative consensus clients. These can be activated during periods of inactivity leaks after a thorough assessment of the situation.
Testnet Validation at Scale: We extensively test other consensus clients on testnets. This ensures we are prepared to transition smoothly to alternative clients on the mainnet if necessary.

Conclusions

As of today, the Ethereum network is at risk because it runs with a majority client (Prysm is currently at around 40%), and stakers relying on it risk losing ~5 ETH per validator.

Running a majority or super-majority consensus client is risky both for the operator and for the network (as it can have a dramatic impact on the entire supply). The consequences of such an event are existential for any operator at scale: the impact is massive as all stakes are affected, and it is orders of magnitude higher than a massive slash of the operator's entire infrastructure at once.

It is important that large node operators, who can make a difference in this picture, understand the risks associated with their setups and move the needle in the right direction, both for their own interest and for the network's.

Thanks to Sébastien Rannou (mxs) for writing this post, as well as the Ethereum Foundation, Thorsten Behrens and Emmanuel Nalepa for their support.