Getting information about a running system is key when operating at scale. It enables a variety of activities that improve the quality of operations, such as:
When operating Ethereum infrastructure, reliability and performance issues can directly lead to a measurable loss of ETH and erode stakers' trust in the operator's ability to properly manage validation keys.
On the other hand, overly focusing on a reliability or performance metric can lead to ill-informed choices to improve it, without balancing it against the risk of being slashed. One classic example is running the same validation key in multiple places to fail over faster. It is therefore important to have access to all signals that impact revenue, but also to understand them and gauge their relevance.
At scale, the improbable becomes probable: issues that have a 0.01% chance of triggering upon submitting an attestation (i.e: issues that manifest 1 in 10,000 attestations) will take on average ~44 days to manifest on a single validation key. But they take only ~384 seconds to manifest themselves when operating 10,000 validation keys!
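As a back-of-the-envelope check, this arithmetic can be sketched in Python (the epoch duration and probabilities are the article's assumptions, not values queried from the network):

```python
# Expected time for a low-probability per-attestation issue to manifest,
# assuming one attestation per key per epoch (32 slots x 12 s = 384 s).
SECONDS_PER_EPOCH = 32 * 12  # 384 seconds

def seconds_to_manifest(probability: float, operated_keys: int) -> float:
    """Average seconds before an issue with the given per-attestation
    probability shows up somewhere across the operated keys."""
    epochs = 1.0 / (probability * operated_keys)
    return epochs * SECONDS_PER_EPOCH

# 0.01% chance per attestation:
print(seconds_to_manifest(0.0001, 1) / 86400)  # ~44 days on a single key
print(seconds_to_manifest(0.0001, 10_000))     # 384.0 seconds on 10,000 keys
```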
Whenever a probabilistic issue impacting operations appears under a specific condition or set of conditions, it is important to:
What follows are some of the key events in the life of a validator and their expected frequencies.
Every 12 seconds, a block is proposed by a selected validator on the network. The average time between two block proposals for an operator is thus:

(TOTAL_VALIDATORS * 12 seconds) / (NUMBER_OF_OPERATED_VALIDATORS)
With ~400 000 validators on the network, this translates to:
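A quick sketch of this formula in Python, using the article's ~400,000-validator assumption:

```python
SECONDS_PER_SLOT = 12

def proposal_interval_days(total_validators: int, operated: int = 1) -> float:
    """Average days between two block proposals, following
    (TOTAL_VALIDATORS * 12 s) / NUMBER_OF_OPERATED_VALIDATORS."""
    return total_validators * SECONDS_PER_SLOT / operated / 86400

# Roughly 55 days between proposals for a single key with ~400,000
# validators on the network (the exact figure moves with the validator set).
print(proposal_interval_days(400_000))
```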
Since the Altair fork, every 256 epochs, 512 validators are chosen to participate in the sync committee, which gives the following formula for the average time between two selections:
(TOTAL_VALIDATORS * 32 * 12 * 256) / (NUMBER_OF_OPERATED_VALIDATORS * 512)
With ~400 000 validators on the network, this translates to:
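The same exercise for sync committee selection, again with the ~400,000-validator assumption:

```python
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32
SYNC_COMMITTEE_PERIOD_EPOCHS = 256
SYNC_COMMITTEE_SIZE = 512

def sync_committee_interval_days(total_validators: int, operated: int = 1) -> float:
    """Average days between two sync committee selections, following
    (TOTAL * 32 * 12 * 256) / (OPERATED * 512)."""
    seconds = (total_validators * SLOTS_PER_EPOCH * SECONDS_PER_SLOT
               * SYNC_COMMITTEE_PERIOD_EPOCHS) / (operated * SYNC_COMMITTEE_SIZE)
    return seconds / 86400

# ~889 days (about 2.4 years) between selections for a single key:
print(sync_committee_interval_days(400_000))
```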
Every epoch, all validators are expected to send an attestation on the network. This translates to:
Being up-to-date at all layers of an infrastructure is critical to keep up with protocol security and upcoming changes. Doing so regularly dilutes the risk: N monthly incremental upgrades are less risky than 1 major upgrade every N months.
Looking at the release cycle of a software client such as Prysm, there is about 1 release every month that can be deployed to production; a similar cadence can be found in the other layers involved. This leads to the following formula:
(NUMBER_OF_LAYERS_TO_UPGRADE * TIME_TO_UPGRADE) / AVG_UPGRADE_FREQUENCY
Assuming there are five layers involved (machine, operating-system/container system, beacon node, execution node, validator), each taking about 10 minutes to upgrade, this translates to:
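Plugging the article's assumptions into the formula gives the hands-on upgrade time per cycle:

```python
def upgrade_overhead_minutes(layers: int, minutes_per_layer: float) -> float:
    """NUMBER_OF_LAYERS_TO_UPGRADE * TIME_TO_UPGRADE: total hands-on time
    spent per upgrade cycle (roughly monthly for clients like Prysm)."""
    return layers * minutes_per_layer

# Five layers (machine, OS/container system, beacon node, execution node,
# validator) at ~10 minutes each:
print(upgrade_overhead_minutes(5, 10))  # 50 minutes per monthly cycle
```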
This is only the tip of the iceberg, as those events are the expected ones; in reality there are more layers involved and probabilistic events happening (a network glitch, a machine reboot, …). These conditions can combine to create more intricate failure conditions and trigger more unlikely bugs.
As an example of such a bug combining multiple events, some early implementations of Doppelganger detection (which verifies at validator startup that no loaded key is active on the network) would prevent validators from starting if one of the loaded keys was scheduled in a sync committee, while validator keys not in a sync committee would load fine. This bug was fixed on the validator side once identified by large operators; the point here is that an operator running at scale will be in a position to notice improbable issues that were not spotted before.
This highlights that it is important to keep logs from all layers, to be able to reconstruct timelines of unlikely combinations of events.
Be sure that you have:
Monitoring comes in two complementary flavours: monitoring from the outside of an infrastructure and monitoring from the inside. Both have pros and cons and can be coupled together to cover a large spectrum of needs.
Operating Ethereum at scale implies having diversity in the setup (different validator client types, beacon node types, execution client types) to reduce the blast radius of an issue that affects a particular piece of software.
This has consequences on monitoring, as it becomes more complex to gather signals that make sense for all flavours in place. This is where monitoring from the outside (or “external monitoring”) shines: it enables watching the behaviour of the set of operated keys in an agnostic way by looking at the blockchain, without having to consider what is under the hood. That way, it is possible to compare versions of validators and spot potential regressions.
Another benefit of monitoring from the outside is specific to the blockchain world: it will always be possible to reconstruct the full monitoring timeseries from the very first attestation made on the network. This helps in seeing the long-term trends of the Ethereum network or of your infrastructure (i.e: whether you are doing better over time), and in comparing how operated keys perform against the rest of the network.
This approach has limits though: when something unusual happens on the blockchain, it is usually already too late and the keys are impacted. There is thus a need for other types of monitoring to prevent outages in the first place:
Monitoring from the inside (or “internal monitoring”) offers access to a wide range of generic metrics about systems such as disk usage or CPU activity, as well as very fine-grained and specific metrics such as inclusion distance of a validation key.
The current standard for exposing those metrics is Prometheus, which is available in most Ethereum software. Software using Prometheus typically exposes an HTTP endpoint serving key/value pairs of metric names associated with their current values. A Prometheus server can be configured to scrape those data points every minute and insert them into a timeseries database, which can then be queried to build dashboards or trigger alerts.
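To make the scraping model concrete, here is a minimal sketch of how such an endpoint's payload looks and can be parsed; it handles only the simplest subset of the Prometheus text exposition format (no labels, no timestamps), and the sample metric names are illustrative:

```python
def parse_prometheus_text(payload: str) -> dict:
    """Parse a minimal subset of the Prometheus text exposition format:
    '# HELP' / '# TYPE' comment lines are skipped, remaining lines are
    '<metric_name> <value>' pairs (labels omitted for brevity)."""
    metrics = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.rsplit(" ", 1)
        metrics[name] = float(value)
    return metrics

# Hypothetical scrape payload from a validator client:
sample = """\
# HELP validator_balance Current validator balance in gwei
# TYPE validator_balance gauge
validator_balance 32001234567
process_cpu_seconds_total 12.5
"""
print(parse_prometheus_text(sample))
```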
Scraping all data points by default is a good approach, as it helps a lot during investigations or when reconstructing what happened. Yet it can be daunting to grasp at first sight whether a signal is important or not, and alerting blindly on whatever looks bad on the internal side can result in an environment so noisy that it has the opposite effect on operations.
One approach here is to set up a strong set of alerts on external monitoring (as it denotes an outage with real consequences), and upon each firing alert, look at all internal signals that could have captured the event in advance. Then set alerts that encompass those events, so that they trigger before the outage next time. For new operators, using this approach on a testnet for some time first can give a good initial set of conditions to start with on mainnet.
For example, if a validator stops attesting because a beacon node crashed due to an out-of-memory issue, external monitoring will fire. Upon investigation, chances are that this memory issue was a slow-ramping one that started 2-3 days before the crash, as captured by internal monitoring. Introducing an alert based on the internal signal of the beacon's memory usage reaching 85% or so might give enough time to mitigate the issue the next time it appears.
As a production system matures, the number of alerts will grow, and so will the confidence in the system.
There are two main dimensions to take into account when operating Ethereum validation keys: the uptime, which can be expressed as the attestation and block proposal rates, and the performance of a validation key, which can be measured by the inclusion distance.
On the Beacon chain, active validator keys are expected to submit attestations every epoch (every ~6 minutes 24 seconds). Doing so in a timely manner will result in a small reward for the validation key; missing one, on the other hand, results in a small penalty.
At scale, in classic architectures, missing 1 attestation in a pool of 10,000 keys is something that can happen for a wide variety of reasons, and alerting on every single missed attestation tends to be noisy and non-actionable for the operator. For example, a validator client can be a bit behind 0.0001% of the time if unlikely conditions are met (a garbage collection happening at the same time as an I/O sync while writing an attestation to the anti-slashing database, for instance).
It is however unlikely to miss two attestations in a row on the same key at scale, because the events that trigger such misses are not tied to a specific key. A good approach to alerting here is to fire whenever at least two attestations in a row are missed on a key. One way to implement such an alert is to monitor the balance of a validation key (as exposed in Prometheus by all validator clients) and alert if the balance did not grow in the past 15 minutes (around two and a half attestation opportunities).
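The alert condition described above can be sketched as follows; this is a simplified illustration operating on in-memory (timestamp, balance) samples, whereas a real deployment would express the same check as a query against the timeseries database:

```python
def balance_stalled(samples: list, window_s: float = 900) -> bool:
    """Return True if a key's balance has not grown over the last
    `window_s` seconds (15 minutes by default, i.e. around two and a
    half attestation opportunities). `samples` is a time-sorted list
    of (unix_timestamp, balance_gwei) pairs."""
    if not samples:
        return False
    now, latest = samples[-1]
    window = [b for t, b in samples if t >= now - window_s]
    return len(window) > 1 and latest <= window[0]

# A balance flat over 15 minutes should fire; a growing one should not.
flat = [(0, 32e9), (400, 32e9), (800, 32e9), (1200, 32e9)]
growing = [(0, 32e9), (400, 32e9 + 1e4), (800, 32e9 + 2e4), (1200, 32e9 + 3e4)]
print(balance_stalled(flat), balance_stalled(growing))  # True False
```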
Unlike the attestation rate, a block proposal is a rare event in the lifetime of a validation key, as we saw above (every ~57 days on average). It also results in a larger reward in terms of execution-layer fees, roughly equivalent to a month of attestation rewards.
Monitoring the block proposal rate on top of the attestation rate is important because there can be issues on the block production side that only manifest whenever a block has to be proposed. This is especially true with the arrival of the execution layer with the Merge, where the mechanisms to produce a block will depend on having a correctly configured execution node, driven exclusively by a single beacon node.
Post-merge, the generated rewards will also depend on the setup of the operator (an MEV setup will generate more execution rewards at the risk of centralizing block production). So not only does the block proposal rate need to be monitored, but the generated yield as well. It is unclear at this stage what this data will look like and how monitoring & alerting need to be addressed there, but having the data in the first place will be key to driving those topics.
An attestation on the Ethereum network made by a validator is a signature over the current slot and head block hash as seen by the validator: in other words, what the validator thinks the current head of the network is. Attestations are aggregated by aggregators before being included into blocks by proposers. This process is multi-stage, and multiple factors can delay it:
Some of these factors are in the control of the operator, some aren't. The reward a validator gets on each attestation is proportional to how timely it is once included in a block. The metric to look for here is the inclusion distance, defined as the difference between the slot attested to by the validator and the slot of the first block it is included in.
As this metric is not fully in the control of the operator, a possible approach to know whether it is actionable or not is to compare the inclusion distance of operated keys with the global average of the network. If it is lower, the operated keys are doing a better job on average than the network. If it is higher, there is likely something on the operator's side that can be changed to improve overall performance and rewards.
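This comparison can be sketched as a simple delta against the network average; the distances and averages below are hypothetical numbers for illustration:

```python
def inclusion_distance_delta(operated_distances: list, network_avg: float) -> float:
    """Difference between the average inclusion distance of operated keys
    and the network-wide average. Negative means the operated keys are
    included faster than the network on average (actionable: nothing);
    positive means there is likely room for improvement on the operator side."""
    ours = sum(operated_distances) / len(operated_distances)
    return ours - network_avg

# Hypothetical sample: 100 operated-key distances averaging 1.2 slots,
# against a network-wide average of 1.10 slots.
print(inclusion_distance_delta([1, 1, 1, 1, 2] * 20, 1.10))
```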
The Ethereum protocol was designed in such a way that missing attestations is not a big issue per se, so that operators and solo stakers can keep operating in a simple way and not be afraid of a few hours of downtime on a validation key.
It is tempting to aim for 100% uptime, but there is no free lunch: there are trade-offs between reliability and slashing risks. At scale, en-masse slashing is an existential risk to all operators. Having monitoring around the metrics described in this article is essential, but they should not be seen as ultimate goals, as they are only one facet of the job.
Better be down than slashed.
Kiln's motto on Ethereum
The following list of tools can help you implement those signals and alerts at scale:
Kiln is the leading enterprise-grade staking platform, enabling institutional customers to stake assets, and to whitelabel staking functionality into their offering. Our platform is API-first and enables fully automated validator, rewards, and commission management.