Ethereum Client Diversity Part 2: Execution-layer Diversity
Find the first post of the series here.
After covering the underlying mechanisms behind justification and finalization in the previous article of this diversity series, we now focus on execution client diversity and the risks and consequences of using clients with different levels of market share. As we’ll see, it is critical to use an operator that does not rely on a super-majority execution client, as that introduces a risk of all the staked collateral being burnt, without any possible course of action.
We will use the following terminology when discussing diversity:
- A minority client is used by less than 1/3rd of the network,
- A majority client is used by more than 1/3rd of the network but less than 2/3rd of the network,
- A super-majority client is used by more than 2/3rd of the network.
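For illustration, these thresholds can be expressed as a small helper function (a hypothetical sketch, not code from any client):

```python
def classify_client(share: float) -> str:
    """Classify an execution client by its share of the validator network."""
    if share > 2 / 3:
        return "super-majority"
    if share > 1 / 3:
        return "majority"
    return "minority"

print(classify_client(0.20))  # minority
print(classify_client(0.45))  # majority
print(classify_client(0.80))  # super-majority
```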
A wild execution client bug appears
Let’s consider a bug in a specific execution client, which we’ll refer to as the “faulty client”. An operator uses the faulty client to propose an invalid block, for example by wrongly computing a block that contains both a `create2` and a `selfdestruct` instruction, like in the Besu bug from January 2024. Because it computes the block differently from other execution clients, the resulting block hash is different:
- validators that do not have the faulty client are not able to validate it against their block hash, and as a result consider the block missed (A-side),
- validators that have the faulty client are able to match the block and cast a vote for it (B-side).
```mermaid
graph LR
  subgraph slot_64
    block_A["Block A"]
  end
  subgraph B_Side["Side B (bug)"]
    subgraph slot_96'["slot_96"]
      block_B'["Block B'"]
    end
    subgraph slot_128'["slot_128"]
      block_C'["Block C'"]
    end
  end
  subgraph A_Side["Side A (no bug)"]
    subgraph slot_96
      block_B["Block B"]
    end
    subgraph slot_128
      block_C["Block C"]
    end
  end
  slot_64 -.- slot_96
  slot_64 -.- slot_96'
  slot_96 -.- slot_128
  slot_96' -.- slot_128'
  %% linkStyle 4,5 stroke-width:2px,fill:none,stroke:green;
```
Note: this is a representation of checkpoint slots/blocks (which are on epoch boundaries); the actual missed blocks leading to the fork can be anywhere between slot 65 and slot 96.
This situation leads to a fork: without any intervention, each side continues to provide a view of the network that would be coherent if it were right. The A-side sees missed blocks from all faulty validators, as those propose blocks on top of the parent block B’ it cannot interpret (B’ has a block hash that is not part of its canonical view). Conversely, on the B-side, blocks from non-faulty validators on side A are missed while blocks from faulty ones are present.
Case 1: the faulty client is a minority client
If the bug is on a minority execution client, there is a super-majority on the correct side, as more than 2/3 of the validators are able to vote on the A-side. As a result, the network continues to finalize on the correct branch: blocks and rewards are only missed by the validators running the faulty execution client.
Resolution and impact
The operators of the faulty execution client have multiple choices in this case. None of their validators have committed to justification/finalization on either of the two branches: not on the faulty branch, as it has less than 1/3rd of attestations from the network, and not on the non-faulty branch, as it is incompatible with the faulty block. Their options are then:
- upgrade the execution client to the fixed version once the execution client team has investigated the issue and come up with a solution
- switch to another execution client type if they have one already synced on the side
This would lead to the following situation: a validator who voted on the faulty target (T=96, root=B’), is then able to rejoin the correct branch that was able to finalize as it gathered more than 2/3 of votes:
```mermaid
graph LR
  subgraph slot_64
    block_A["Block A"]
  end
  subgraph B_Side["Side B (bug)"]
    subgraph slot_96'["slot_96"]
      block_B'["Block B'"]
    end
  end
  subgraph A_Side["Side A (no bug)"]
    subgraph slot_96
      block_B["Block B"]
    end
    subgraph slot_128
      block_C["Block C"]
    end
  end
  slot_64 --- l1[> 2/3] --- slot_96
  slot_64 -.- l2[< 1/3] -.-> slot_96'
  slot_96 --> slot_128
  linkStyle 2,3 stroke-width:2px,fill:none,stroke:red;
  linkStyle 4 stroke-width:2px,fill:none,stroke:green;
```
The overall impact here is minimal for the validators running with the faulty execution client and similar to a small downtime, as the network does not enter inactivity leak:
Assuming a 12-hour downtime with 1 million validators on the network, the expected outage cost per validator is around 0.002 ETH (i.e. $6 with ETH price at $3000).
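As a rough sanity check of this number, here is a back-of-the-envelope sketch approximating post-Altair reward accounting with its participation-flag weights (an approximation, not exact spec math, and it ignores proposer and sync-committee duties):

```python
import math

BASE_REWARD_FACTOR = 64
GWEI_PER_ETH = 10**9
EPOCHS_PER_DAY = 225
WEIGHT_DENOMINATOR = 64
# Missing an attestation costs the source+target penalties (14 + 26)
# plus the forgone source+target+head rewards (14 + 26 + 14).
MISSED_EPOCH_WEIGHT = (14 + 26 + 14 + 26 + 14) / WEIGHT_DENOMINATOR

def base_reward_gwei(balance_gwei: int, total_staked_gwei: int) -> float:
    # Altair-style base reward: balance * factor / sqrt(total staked)
    return balance_gwei * BASE_REWARD_FACTOR / math.isqrt(total_staked_gwei)

n_validators = 1_000_000
balance_gwei = 32 * GWEI_PER_ETH
total_staked = n_validators * balance_gwei

epochs = EPOCHS_PER_DAY // 2  # ~12 hours
cost_eth = (base_reward_gwei(balance_gwei, total_staked)
            * MISSED_EPOCH_WEIGHT * epochs / GWEI_PER_ETH)
print(f"~{cost_eth:.4f} ETH")  # on the order of 0.002 ETH
```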
Case 2: the faulty client is a majority client
If the bug is on a majority execution client, finalization can’t happen on the consensus side because no branch can gather the 2/3 of votes needed (both branches have more than 1/3 and less than 2/3 of votes). As a result, the network enters inactivity leak and the pressure to get out of the situation increases:
- Blocks and rewards are missed for validators running the faulty execution client
- Inactivity leak kicks in after 4 epochs
- Validators running a valid execution client don’t receive attestation rewards anymore
- Pressure to resolve the situation increases as penalties for faulty validators increase
Resolution and Impact
Similar to the previous scenario, the operators of the faulty execution client have multiple choices: none of their validators have committed to justification on either of the two forks, and thus they can:
- upgrade the execution client to the fixed version once the execution client team has investigated the issue and come up with a solution
- switch to another execution client type if they have one already synced on the side
This would lead to the following situation: a validator who voted on the faulty target (T=96, root=B’), is able after upgrading to cast a vote with the checkpoint prior to the fork as source (S=64, root=A) and the latest valid checkpoint as target (T=128, root=C). This second vote is not slashable as there is no strict surrounding with the first vote. The link then reaches 2/3 of the network as validators stuck on side A would also cast it, leading to finalization:
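The two FFG slashing conditions can be checked directly against these votes with a minimal Python sketch (epoch numbers are the checkpoint slots above divided by 32; the function names are illustrative):

```python
def is_double_vote(t1: int, t2: int) -> bool:
    # Slashing rule 1: two distinct votes for the same target epoch
    return t1 == t2

def is_surround_vote(s1: int, t1: int, s2: int, t2: int) -> bool:
    # Slashing rule 2: one vote strictly surrounds the other
    return (s2 < s1 and t1 < t2) or (s1 < s2 and t2 < t1)

# First vote, cast on the faulty side: source epoch 2 (slot 64), target epoch 3 (slot 96)
s1, t1 = 2, 3
# Recovery vote after switching client: source epoch 2 (slot 64), target epoch 4 (slot 128)
s2, t2 = 2, 4

print(is_double_vote(t1, t2))            # False: different targets
print(is_surround_vote(s1, t1, s2, t2))  # False: sources are equal, no strict surround
```

Because both conditions are false, casting the recovery vote is safe.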
```mermaid
graph LR
  subgraph slot_64
    block_A["Block A"]
  end
  subgraph B_Side["Side B (bug)"]
    subgraph slot_96'["slot_96"]
      block_B'["Block B'"]
    end
  end
  subgraph A_Side["Side A (no bug)"]
    subgraph slot_96
      block_B["Block B"]
    end
    subgraph slot_128
      block_C["Block C"]
    end
  end
  slot_64 -.- l1[> 1/3 && < 2/3] -.- slot_96
  slot_64 -.- l2[> 1/3 && < 2/3] -.-> slot_96'
  slot_96 -.- slot_128
  linkStyle 2,3 stroke-width:2px,fill:none,stroke:red;
```
```mermaid
graph LR
  subgraph slot_64
    block_A["Block A"]
  end
  subgraph B_Side["Side B (bug)"]
    subgraph slot_96'["slot_96"]
      block_B'["Block B'"]
    end
  end
  subgraph A_Side["Side A (no bug)"]
    subgraph slot_96
      block_B["Block B"]
    end
    subgraph slot_128
      block_C["Block C"]
    end
  end
  slot_64 -- > 2/3 --> slot_128
  slot_64 -.- l1[> 1/3 && < 2/3] -.- slot_96
  slot_64 -.- l2[> 1/3 && < 2/3] -.-> slot_96'
  slot_96 -.- slot_128
  linkStyle 3,4 stroke-width:2px,fill:none,stroke:red;
  linkStyle 0 stroke-width:2px,fill:none,stroke:green;
```
In this scenario, when operating a large number of keys on a majority client, it is important to have standby execution clients of other types ready to take over, to limit the impact of the inactivity leak and rejoin the good fork as fast as possible.
Assuming a downtime of 12 hours with 1 million validators on the network during an inactivity leak, the expected outage cost per validator is around 0.01253 ETH (~$38 with ETH price at ~$3000).
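This figure can be sanity-checked with an order-of-magnitude sketch of the quadratic inactivity-leak penalty, using the post-Bellatrix constants (it ignores the regular attestation penalties and is an approximation, not exact spec accounting):

```python
INACTIVITY_SCORE_BIAS = 4
INACTIVITY_PENALTY_QUOTIENT = 2**24  # post-Bellatrix value
GWEI_PER_ETH = 10**9

def leak_cost_eth(balance_eth: float, epochs: int) -> float:
    """Cumulative leak penalty for a validator offline for `epochs` epochs."""
    balance_gwei = balance_eth * GWEI_PER_ETH
    total = 0.0
    score = 0
    for _ in range(epochs):
        score += INACTIVITY_SCORE_BIAS  # inactivity score grows while offline
        total += balance_gwei * score / (INACTIVITY_SCORE_BIAS * INACTIVITY_PENALTY_QUOTIENT)
    return total / GWEI_PER_ETH

# 12 hours is ~112 epochs (225 epochs per day)
print(f"~{leak_cost_eth(32, 112):.4f} ETH")
```

The result lands around 0.012 ETH, the same order of magnitude as the figure above; the remaining gap comes from the attestation penalties this sketch leaves out.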
Case 3: the faulty client is a super-majority client
If the bug is on a super-majority execution client, finalization happens on the consensus side because there are enough votes on the faulty branch to reach 2/3. This is bad because there is no way to roll it back: either the branch with the bug is accepted by the community, or the validators that voted on it lose a large portion of their stake.
Case 3A: bug is accepted by the community
Let’s focus on the case where the faulty side B is accepted by the community and the chain continues with the bug in it. Validators from side A can join the faulty side: since less than 1/3 of the network is on their branch, they did not justify it. They can:
- upgrade the execution client to the faulty version if the execution client team agrees to introduce the bug
- switch to another faulty execution client type if they have one already synced on the side
This would lead to the following situation: a validator who voted on the initially correct target (T=96, root=B), is then able to join the faulty branch that was able to finalize as it gathered more than 2/3rds of votes:
```mermaid
graph LR
  subgraph slot_64
    block_A["Block A"]
  end
  subgraph B_Side["Side B (bug)"]
    subgraph slot_96'["slot_96"]
      block_B'["Block B'"]
    end
    subgraph slot_128'["slot_128"]
      block_C'["Block C'"]
    end
  end
  subgraph A_Side["Side A (no bug)"]
    subgraph slot_96
      block_B["Block B"]
    end
  end
  slot_64 --- l1[> 2/3] --- slot_96'
  slot_64 -.- l2[< 1/3] -.-> slot_96
  slot_96' --> slot_128'
  linkStyle 2,3 stroke-width:2px,fill:none,stroke:red;
  linkStyle 4 stroke-width:2px,fill:none,stroke:green;
```
The overall impact here is minimal for the validators running with the minority execution client and similar to a small downtime:
Assuming a 12-hour downtime with 1 million validators on the network, the expected outage cost per validator is around 0.002 ETH ($6 with ETH price at ~$3000).
Case 3B: bug is not accepted by the community
If the faulty side B is dismissed by the community (a bug could have more impact than the consequences of burning a large portion of the network, for example, if it results in arbitrary issuance of ETH), the valid side A enters inactivity leak and starts burning stake from the B side. Validators from the B side can’t rejoin side A because they already committed to justify/finalize side B: they would be slashed if they moved, as they would violate one of the two slashing rules we discussed in the previous article in the series.
```mermaid
graph LR
  subgraph slot_32
    block_Z["Block Z"]
  end
  subgraph slot_64
    block_A["Block A"]
  end
  subgraph B_Side["Side B (bug)"]
    subgraph slot_96'["slot_96"]
      block_B'["Block B'"]
    end
  end
  subgraph A_Side["Side A (no bug)"]
    subgraph slot_96
      block_B["Block B"]
    end
    subgraph slot_128["slot_N"]
      block_C["Block C"]
    end
  end
  slot_64 --- l1[> 2/3] --> slot_96'
  slot_32 -.- ls2[Violates rule 2] -.-> slot_128
  slot_64 -.- ls1[Violates rule 1] -.-> slot_96
  slot_96 -.- slot_128
  slot_32 --- slot_64
  linkStyle 0,1,2,3,4,5 stroke-width:2px,fill:none,stroke:red;
```
To get an idea of the volumes involved: if 70% of the network wrongly finalizes and the accepted branch is on the 30% side, then for the 30% side to finalize, each validator on the 70% side has to have 25 ETH burnt: their balance would go down from 32 ETH to 7 ETH. Two variables can slightly affect this number:
- if new validators enter the network on the A side via the entry queue, this slightly helps it reach 2/3 of the network faster,
- conversely, existing validators can exit the network from the B side, which happens if they forcibly trigger an exit or once their effective balance goes below 16 ETH.
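The 25 ETH figure can be derived from the 2/3 finality threshold; here is a minimal sketch of that calculation, ignoring entries and exits:

```python
def burn_per_validator_eth(majority_share: float, balance_eth: float = 32.0) -> float:
    """ETH each validator on the wrongly-finalizing side must leak
    before the minority side reaches the 2/3 finality threshold."""
    minority_share = 1.0 - majority_share
    # Minority finalizes once: minority >= 2/3 * (minority + remaining_majority)
    # which rearranges to: remaining_majority <= minority / 2
    remaining_majority = minority_share / 2
    final_balance = balance_eth * remaining_majority / majority_share
    return balance_eth - final_balance

print(f"{burn_per_validator_eth(0.70):.1f} ETH")  # 70/30 split: ~25 ETH burnt
```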
From an impact perspective, exiting a validator is a good strategy but only effective for the very first few to do it: assuming 700 000 validators are trying to exit, the exit queue time would be about 388 days. However, as we discussed in the previous article from the series, it takes about 50 days during an inactivity leak to burn the entire balance of a validator, so most of the validators here would lose the maximal penalty. This leads to the following average loss per validator depending on the market share of the super-majority execution client, assuming 1 million validators on the network:
In most cases, the outcome is for validators from the B side to watch their balance be burned until the A side is able to justify again:
```mermaid
graph LR
  subgraph slot_64
    block_A["Block A"]
  end
  subgraph B_Side["Side B (bug)"]
    subgraph slot_96'["slot_96"]
      block_B'["Block B'"]
    end
  end
  subgraph A_Side["Side A (no bug)"]
    subgraph slot_96
      block_B["Block B"]
    end
    subgraph slot_128["slot_N"]
      block_C["Block C"]
    end
  end
  slot_64 --- l1[> 2/3] --> slot_96'
  slot_64 -.- l2[< 1/3] -.- slot_96
  slot_96 -.- slot_128
  slot_64 --- l3[Burn until > 2/3] --> slot_128
  linkStyle 0,1 stroke-width:2px,fill:none,stroke:red;
```
The overall impact here is maximal as most super-majority validators lose everything:
It is important to note that even when running a minority client during such an event, because you did not commit to justification/finalization, you can still decide to join the other branch of the fork if the social consensus goes that way: you can choose which side of the fork to follow.
Assuming the minority-side branch without the bug is chosen by the community, the outage cost is at least 22 ETH for most super-majority validators (~$66,000 with ETH price at ~$3000).
Kiln’s approach
To mitigate the risks associated with running popular execution clients, Kiln employs the following approach:
- Use non-super-majority execution clients for validation: currently, our entire validation infrastructure on mainnet relies on Nethermind.
- Parallel Stacks for Contingency: We maintain parallel validation stacks using alternative execution clients. These can be activated during periods of inactivity leaks after a thorough assessment of the situation.
- Testnet Validation at Scale: We extensively test other execution clients on testnets. This ensures we are prepared to transition smoothly to alternative clients on the mainnet if necessary.
Conclusions
Up until recently, the Ethereum network was running with a super-majority execution client (Geth), with the systemic risks that go with it; after a push from the community and major operators, its usage decreased below 66%. At the time of writing, the Ethereum validator network runs with majority and minority execution clients only.
Running a super-majority execution client is risky both for the operator (as its recovery strategy depends solely on a hypothetical social consensus) and for the network (as a bug can have a dramatic impact on the entire supply). The consequences of such an event are existential for any operator at scale: all stakes are impacted, with losses orders of magnitude higher than a mass slashing of the entire operator infrastructure at once.
Running an execution majority client still makes it possible to switch in case there is an incident. It is slightly riskier than running an execution minority client due to potential penalties during an inactivity leak, but it is actionable and manageable.
In the next article, we’ll dive into consensus-layer client diversity risks and considerations. Stay tuned!
Thanks to Sébastien Rannou (mxs) for writing this post, as well as the Ethereum Foundation, Thorsten Behrens and Emmanuel Nalepa for their support.