Monitoring Ethereum staking infrastructure at scale
Philosophy of monitoring
Getting information about a running system is key when operating at scale. It enables a variety of activities that improve the quality of operations, such as:
- alerting in real time on leading indicators of an outage, or on outages themselves
- analyzing long-term trends, such as the memory consumption of a specific piece of software
- running experiments on a subset of production to compare its effectiveness with the rest
- investigating issues by reconstructing the timeline of events that led to a particular condition
When operating Ethereum infrastructure, reliability and performance issues can directly lead to a measurable loss of ETH and trust from stakers in the operator's abilities to properly manage validation keys.
On the other hand, focusing too narrowly on a single reliability or performance metric can lead to misinformed choices that improve it without balancing the gain against the risk of being slashed. One classic example is running the same validation key in multiple places to fail over faster. It is therefore important to have access to all signals that impact revenue, but also to understand them and gauge their relevance.
The improbable at scale
At scale, the improbable becomes probable: an issue that has a 0.01% chance of triggering upon submitting an attestation (i.e. an issue that manifests once in 10,000 attestations) will take on average ~44 days to manifest on a single validation key. But it takes only ~384 seconds to manifest when operating 10,000 validation keys!
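The arithmetic behind those two figures can be sketched as follows. This is a minimal illustration that assumes one attestation per key per epoch of 32 slots of 12 seconds; the function name is made up for the example.

```python
SECONDS_PER_EPOCH = 32 * 12  # 32 slots of 12 seconds, one attestation per key per epoch


def mean_seconds_to_manifest(failure_probability: float, operated_keys: int) -> float:
    """Average time before an issue with this per-attestation probability shows up."""
    attestations_per_second = operated_keys / SECONDS_PER_EPOCH
    return 1.0 / (failure_probability * attestations_per_second)


single_key = mean_seconds_to_manifest(0.0001, 1)      # 0.01% chance, one key
large_pool = mean_seconds_to_manifest(0.0001, 10_000)  # 0.01% chance, 10,000 keys

print(f"1 key: {single_key / 86_400:.1f} days")
print(f"10,000 keys: {large_pool:.0f} seconds")
```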
Whenever a probabilistic issue impacting operations appears under a specific condition or set of conditions, it is important to:
- First, be aware of those conditions by logging extensively in all components involved, and
- Second, be able to compare them with prior similar-looking events, to understand what is different this time
What follows are some of the key events in the life of a validator and their expected frequencies.
Probability of proposing a block
Every 12 seconds, a block is proposed by a selected validator on the network. The average time between two block proposals for an operator is thus:
(TOTAL_VALIDATORS * 12 seconds) / (NUMBER_OF_OPERATED_VALIDATORS)
With ~400 000 validators on the network, this translates to:
- for a single key, once every ~57 days,
- for 10 keys, once every ~6 days,
- for 100 keys, once every 13 hours,
- for 1,000 keys, once every 80 minutes,
- for 10,000 validation keys, once every 8 minutes.
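As a rough sketch, the formula above can be evaluated for any pool size. The total of 400,000 validators is the round figure used in this article; the real number changes over time, so the results are orders of magnitude rather than exact values.

```python
SECONDS_PER_SLOT = 12


def seconds_between_proposals(total_validators: int, operated_validators: int) -> float:
    """Average time between two block proposals for a set of operated keys."""
    return total_validators * SECONDS_PER_SLOT / operated_validators


TOTAL_VALIDATORS = 400_000  # assumption: round figure used in this article

for keys in (1, 10, 100, 1_000, 10_000):
    interval = seconds_between_proposals(TOTAL_VALIDATORS, keys)
    print(f"{keys:>6} keys: one proposal every {interval / 3600:.2f} hours")
```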
Probability of being in a sync committee
Since the Altair fork, every 256 epochs, 512 validators are chosen to participate in the sync committee, which gives the following average time between two selections for an operator:
(TOTAL_VALIDATORS * 32 * 12 * 256) / (NUMBER_OF_OPERATED_VALIDATORS * 512)
With ~400 000 validators on the network, this translates to:
- for a single key, every 30 months a key goes in sync committee for 27.3 hours,
- for 10 keys, every 3 months a key goes in sync committee for 27.3 hours,
- for 100 keys, every 9 days a key goes in sync committee for 27.3 hours,
- for 1,000 keys, every day a key goes in sync committee for 27.3 hours,
- for 10,000 validation keys, 27.3 hours every 2 hours.
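The same formula can be written as a small sketch. The constants come from the text above (12-second slots, 32 slots per epoch, 256 epochs per period, 512 members); the helper names and the expected-members helper are illustrative additions.

```python
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32
EPOCHS_PER_SYNC_COMMITTEE_PERIOD = 256
SYNC_COMMITTEE_SIZE = 512


def sync_committee_period_seconds() -> int:
    """Duration of one sync committee period (256 epochs)."""
    return SECONDS_PER_SLOT * SLOTS_PER_EPOCH * EPOCHS_PER_SYNC_COMMITTEE_PERIOD


def seconds_between_selections(total_validators: int, operated_validators: int) -> float:
    """Average time between two sync committee selections among the operated keys."""
    period = sync_committee_period_seconds()
    return total_validators * period / (operated_validators * SYNC_COMMITTEE_SIZE)


def expected_members(total_validators: int, operated_validators: int) -> float:
    """Average number of operated keys sitting in the committee at any time."""
    return SYNC_COMMITTEE_SIZE * operated_validators / total_validators


print(f"period: {sync_committee_period_seconds() / 3600:.1f} hours")
print(f"10,000 keys: one selection every {seconds_between_selections(400_000, 10_000) / 3600:.1f} hours")
```

With 10,000 keys the interval between selections is much shorter than the committee period itself, so several operated keys are expected to be in the committee at the same time, which is what `expected_members` estimates.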
Probability of submitting an attestation
Every epoch (32 slots of 12 seconds, i.e. 384 seconds), all validators are expected to send an attestation on the network. This translates to:
- for a single key, once every 6 minutes and 24 seconds,
- for 10 keys, once every ~38 seconds,
- for 100 keys, once every ~4 seconds,
- for 1,000 keys, ~3 times per second,
- for 10,000 validation keys, ~26 times per second.
Probability of operational upgrades
Being up to date at all layers of an infrastructure is critical to keep up with protocol security and upcoming changes. Doing so regularly dilutes the risk: N monthly incremental upgrades are less risky than 1 major upgrade every N months.
Looking at the release cycle of a software client such as Prysm, there is about 1 release every month that can be deployed to production. A similar cycle can be found on the other parts involved. This leads to the following formula:
(NUMBER_OF_LAYERS_TO_UPGRADE * TIME_TO_UPGRADE) / AVG_UPGRADE_FREQUENCY
Assuming there are five layers involved (machine, operating system/container system, beacon node, execution node, validator), each taking about 10 minutes to upgrade, this translates to:
- 50 minutes every month for all managed keys.
Gotta catch 'em all
This is only the tip of the iceberg, as those events are the expected ones. In reality there are more layers involved and more probabilistic events happening (a network glitch, a machine reboot, …). These conditions can combine to create more intricate failure conditions and trigger more unlikely bugs.
As an example of a bug combining multiple events, some early implementations of Doppelganger protection, which verifies at validator startup that no loaded key is already active on the network, would prevent validators from starting if one of the loaded keys was scheduled in a sync committee, while validators whose keys were not in a sync committee would load fine. This bug was fixed on the validator side once identified by large operators. The point here is that an operator running at scale is in a position to notice improbable issues that were not spotted before.
This highlights that it is important to keep logs from all layers in order to reconstruct timelines of unlikely combinations of events.
Be sure that you have:
- Info logs of your validators, beacon nodes & execution nodes,
- Lower-level logs of your infrastructure (kernel, Kubernetes, …),
- The same timezone used as reference throughout your infrastructure
- At least 90 days of history on logging/metrics data
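To illustrate why a single reference timezone matters, here is a minimal sketch that merges log records from several layers into one timeline ordered in UTC. The record shape `(timestamp, layer, message)` and the sample log lines are hypothetical; real logs would first need to be parsed into this form.

```python
from datetime import datetime, timezone


def to_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp carrying an explicit offset and convert it to UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        # A timestamp without offset cannot be placed reliably on a shared timeline.
        raise ValueError(f"timestamp without timezone: {timestamp!r}")
    return parsed.astimezone(timezone.utc)


def build_timeline(*sources):
    """Merge (timestamp, layer, message) records from several layers, ordered by UTC time."""
    events = [(to_utc(ts), layer, msg) for source in sources for ts, layer, msg in source]
    return sorted(events, key=lambda event: event[0])


# Hypothetical records: the validator host logs in UTC+2, the kernel in UTC.
kernel_logs = [("2022-09-01T10:00:05+00:00", "kernel", "oom-killer invoked")]
validator_logs = [("2022-09-01T12:00:07+02:00", "validator", "failed to submit attestation")]

for when, layer, message in build_timeline(validator_logs, kernel_logs):
    print(when.isoformat(), layer, message)
```

Read naively, the validator line appears to happen two hours after the kernel line; once both are expressed in UTC, they are two seconds apart.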
Internal & external monitoring
Monitoring comes in two flavours that are complementary: monitoring from the outside of an infrastructure and monitoring from the inside. Both have pros and cons which can be coupled together to cover a large spectrum of needs.
Monitoring from the outside
Operating Ethereum at scale implies having diversity in the setup (different validator clients, beacon nodes and execution clients) to reduce the blast radius of an issue that affects a particular piece of software.
This has consequences on monitoring, as it becomes more complex to gather signals that make sense for all flavours in place. This is where monitoring from the outside (or “external monitoring”) shines: it makes it possible to watch the behaviour of the set of operated keys in an agnostic way by looking at the blockchain, without having to consider what is under the hood. That way, it is possible to compare versions of validators and spot potential regressions.
Another benefit of monitoring from the outside is specific to the blockchain world: it will always be possible to reconstruct the full monitoring timeseries from the very first attestation made on the network. This helps to see the long-term trends of the Ethereum network or of your infrastructure (e.g. whether you are doing better over time), and also to compare how operated keys perform against the rest of the network.
This approach has limits though: when something unusual shows up on the blockchain, it is usually already too late and the keys are impacted. There is thus a need for other types of monitoring to prevent outages in the first place.
- Set up external monitoring of the keys you operate using public dashboards and APIs,
- Set up alerting on external signals such as attestation rate or missed block proposals,
- Alternatively build your own beaconchain aggregator,
- Don't host your external monitoring on the same infrastructure as your validators,
- Periodically watch long-term trends.
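As a sketch of the client-agnostic comparison described above, on-chain attestation results can be grouped by whatever the operator knows about each key, for instance the client and version serving it. The input shape, the key names and the group labels below are hypothetical; the observations themselves would come from a public API or from your own aggregator.

```python
from collections import defaultdict


def attestation_rate_by_group(results, group_of_key):
    """Compute the attestation rate per group of keys.

    results: iterable of (pubkey, attested) observations read from the chain.
    group_of_key: mapping from pubkey to a label such as a client version.
    """
    made = defaultdict(int)
    expected = defaultdict(int)
    for pubkey, attested in results:
        group = group_of_key[pubkey]
        expected[group] += 1
        made[group] += int(attested)
    return {group: made[group] / expected[group] for group in expected}


# Hypothetical observations over two epochs for two keys.
observations = [("0xa", True), ("0xa", True), ("0xb", True), ("0xb", False)]
groups = {"0xa": "client-A v1.0", "0xb": "client-B v2.0"}

print(attestation_rate_by_group(observations, groups))
```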
Monitoring from the inside
Monitoring from the inside (or “internal monitoring”) offers access to a wide range of generic metrics about systems such as disk usage or CPU activity, as well as very fine-grained and specific metrics such as inclusion distance of a validation key.
The current standard for exposing those metrics is Prometheus, which is supported by most Ethereum software. Software using Prometheus typically exposes an HTTP endpoint containing key/value pairs of metric names associated with their current value. A Prometheus server can be configured to scrape those data points every minute and insert them in a timeseries database, which can be queried to either build dashboards or trigger alerts.
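To make the key/value nature of such an endpoint concrete, here is a deliberately simplified parser for the Prometheus text format. It assumes one sample per line without the optional trailing timestamp, and the `validator_balance` metric name in the sample is made up for illustration; in practice a Prometheus server or an official client library does this work.

```python
def parse_metrics(text: str) -> dict:
    """Parse a simplified Prometheus text exposition into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines, HELP and TYPE comments
            continue
        # The value is the last space-separated field (assumes no trailing timestamp).
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Hypothetical scrape output from a validator client.
SAMPLE = """\
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.5e+09
validator_balance{pubkey="0xabc"} 32.01
"""

for series, value in parse_metrics(SAMPLE).items():
    print(series, value)
```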
Scraping all data points by default is a good approach, as it helps a lot during investigations or reconstructions of what happened. Yet it can be daunting to grasp at first sight whether a signal is important or not, and alerting blindly on what looks bad on the internal side can result in an environment that is too noisy, which has the opposite effect on operations.
One approach here is to set up a strong set of alerts on external monitoring (as these denote outages with real consequences) and, each time one fires, to look at all internal signals that could have captured the event in advance. Then set alerts that encompass those signals, so that they trigger before the outage next time. For new operators, using this approach on testnet for some time first can give a good initial set of conditions to start with on mainnet.
For example, if a validator stops attesting because a beacon node crashed due to an out-of-memory issue, external monitoring will fire. Upon investigation, chances are that this memory issue was a slow-ramping one that started 2-3 days before the crash, as captured by internal monitoring. Introducing an alert based on internal signals, such as the beacon node's memory usage reaching 85% or so, might give enough time to mitigate the issue the next time it appears.
As a production setup becomes more mature, the number of alerts grows, and so does the confidence in the system.
- Capture all internal signals exposed by Prometheus by default,
- Capture generic signals about resource usage (CPU, disk, memory, network),
- Start with basic alerting on high resource usage (CPU, disk, memory, network),
- Investigate internal signals whenever there is an outage to introduce a new set of alerts.
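The memory example above can be sketched as two small checks: a threshold on the usage ratio, and a naive linear extrapolation of how long remains before the limit is reached. The 85% threshold comes from the example; the function names, the sample format and the linear model are assumptions made for illustration, and in a real setup this logic would usually live in alerting rules rather than in application code.

```python
def should_alert(used_bytes: float, limit_bytes: float, threshold: float = 0.85) -> bool:
    """Fire when memory usage reaches the given fraction of the limit."""
    return used_bytes / limit_bytes >= threshold


def hours_until_limit(samples, limit_bytes: float):
    """Estimate the hours left before hitting the limit.

    samples: list of (hours_since_start, used_bytes), oldest first.
    Uses a straight line between the first and last sample.
    Returns None when usage is not growing.
    """
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    if t1 <= t0 or m1 <= m0:
        return None
    growth_per_hour = (m1 - m0) / (t1 - t0)
    return max(0.0, (limit_bytes - m1) / growth_per_hour)


GIB = 1024 ** 3
# Hypothetical slow ramp: 4 GiB to 12 GiB over 48 hours, with a 16 GiB limit.
ramp = [(0.0, 4 * GIB), (24.0, 8 * GIB), (48.0, 12 * GIB)]

print(should_alert(12 * GIB, 16 * GIB))
print(hours_until_limit(ramp, 16 * GIB))
```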
Signals of Ethereum
There are two main dimensions to take into account when operating Ethereum validation keys: the uptime which can be expressed as the attestation and block proposal rate, and the performance of a validation key which can be measured by the inclusion distance.
On the Beacon chain, active validator keys are expected to submit an attestation every epoch (every ~6 min 24 s). Doing so in a timely manner results in a small reward for the validation key; missing one, on the other hand, results in a small penalty.
At scale in classic architectures, missing 1 attestation on a pool of 10,000 keys is something that can happen for a wide variety of reasons, and alerting on every single missed attestation tends to be noisy and non-actionable for the operator. For example, a validator client can be a bit behind 0.0001% of the time if unlikely conditions are met (for instance, garbage collection running at the same time as an I/O sync while an attestation is being written to the anti-slashing database).
It is however unlikely to miss two attestations in a row on the same key at scale, because the events that trigger such misses are not tied to a specific key. A good approach to alerting here is to fire whenever at least two attestations are missed in a row on a key. One way to implement such an alert is to monitor the balance of a validation key (as exposed in Prometheus by all validator clients), and alert if the balance did not grow in the past 15 minutes (a bit more than two epochs).
- Check that you have monitoring for the global attestation rate of your entire setup,
- Check that you have an alert on the global attestation rate of your entire setup,
- Check that you have an alert on consecutive attestations missed on the same key,
- Periodically review your global attestation rate to prevent regressions that are above the alert thresholds.
Unlike an attestation, a block proposal is a rare event in the lifetime of a validation key, as we saw above (every ~57 days on average). It also results in a larger reward in terms of execution-layer fees, roughly equivalent to a month of attestation rewards.
Monitoring the block proposal rate on top of the attestation rate is important because there can be issues on the block production side that only manifest whenever the block has to be proposed. This is especially true with the arrival of the execution layer with the Merge, where the mechanisms to produce a block will depend on having a correctly configured execution node, driven exclusively by a single beacon node.
Post-merge, the generated rewards will also depend on the setup of the operator (an MEV setup will generate more execution rewards, at the risk of centralizing block production). So not only does the rate of block proposals need to be monitored, but the generated yield as well. It is unclear at this stage what this data will look like and how monitoring & alerting should be addressed there, but having the data in the first place will be key to drive those topics.
- Check that you have monitoring for the global block proposal rate of your entire setup,
- Consider alerting on the global block proposal rate,
- Explore how to have data about execution rewards,
- Periodically review your global proposal rate and execution rewards.
An attestation made by a validator on the Ethereum network is a signature of the current slot and hash as seen by the validator; in other words, of what the validator thinks the current head of the network is. Attestations are aggregated by aggregators before being included into blocks by proposers. This process is multi-stage and multiple factors can delay it:
- The validator being late when generating the attestation due to resource issues,
- The attestation taking time to reach an aggregator due to network issues,
- The aggregator itself being slow to process the attestation,
- The block proposer not picking up the propagated aggregate.
Some of these factors are in the control of the operator, some aren't. The reward a validator gets on each attestation depends on how quickly it is included in a block. The metric to look for here is the inclusion distance, which is defined as the difference between the slot attested by the validator and the slot of the first block that includes the attestation.
As this metric is not fully in the control of the operator, a possible approach to know whether it is actionable is to compare the inclusion distance of operated keys with the global average of the network. If it is lower, it means the operated keys are doing a better job on average than the network. If it is higher, it means there is likely something on the operator's side that can be changed to improve the overall performance and rewards.
- Check that you have signals in place to monitor inclusion distance,
- Check what your inclusion distance looks like compared to the average of the network,
- Periodically review your inclusion distance to catch regressions potentially on your side.
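The comparison with the network average can be sketched as follows. The input is a list of hypothetical `(attested_slot, inclusion_slot)` pairs for operated keys; the network average would come from an external source, and the tolerance value is an arbitrary assumption to avoid flagging noise.

```python
from statistics import mean


def inclusion_distance(attested_slot: int, inclusion_slot: int) -> int:
    """Slots between the attested slot and the first block including the attestation."""
    if inclusion_slot <= attested_slot:
        raise ValueError("an attestation is included in a later slot than the one it attests")
    return inclusion_slot - attested_slot


def compare_to_network(attestations, network_average: float, tolerance: float = 0.05):
    """Return (our_average, verdict) for a list of (attested_slot, inclusion_slot) pairs."""
    ours = mean([inclusion_distance(a, i) for a, i in attestations])
    verdict = "investigate" if ours > network_average + tolerance else "ok"
    return ours, verdict


# Hypothetical sample: two attestations included in the next slot, one a slot later.
sample = [(100, 101), (101, 102), (102, 104)]

print(compare_to_network(sample, network_average=1.2))
```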
How to work around those signals
The Ethereum protocol was designed in such a way that missing attestations is not a big issue per se, so that operators and solo stakers can keep operating in a simple way and not be afraid of having a few hours of downtime on a validation key.
It is tempting to aim for 100% uptime but there is no free lunch: there are trade-offs between reliability and slashing risks. At scale, en-masse slashing is an existential risk to all operators. Having monitoring around the metrics described in this article is essential, but they should not be seen as ultimate goals, as they are only one facet of our job.
Better be down than slashed.
Kiln's motto on Ethereum
Tools from the ecosystem
The following list of tools can help you implement those signals and alerts at scale:
- Rated Network, whose API can be used to implement external monitoring per validation key. It also offers the ability to group keys per operator, so as to provide statistics on a cohort of keys.
- Beaconcha.in, whose API can be used to implement external monitoring per validation key. There is no out-of-the-box facility to use it at scale though.
- Chaind, an open-source aggregator which can be used as a building block to implement external monitoring. It directly fills a PostgreSQL database with the entire history of what happened on the beacon chain.
- Prometheus, the standard monitoring system & time series database.
- Loki/Promtail, a lightweight log collector and forwarder that can be used to gather all the logs per key as well as the consensus and execution node logs, and to correlate them if needed.
- Grafana, a frontend that can be used to build dashboards on top of Prometheus and Loki.
- Metrika, a monitoring and analytics platform showing the consensus performance of the network and validator drill-downs.
Kiln is the leading enterprise-grade staking platform, enabling institutional customers to stake assets, and to whitelabel staking functionality into their offering. Our platform is API-first and enables fully automated validator, rewards, and commission management.