When operating production systems at scale, monitoring can be approached in two complementary ways:
1️⃣ open-box monitoring, using internal signals from the stack,
2️⃣ closed-box monitoring, using signals observed from outside the stack.
The former is usually fine-grained but depends on the monitored stack itself, while the latter tends to be more generic but more robust. The two approaches are meant to be combined: for instance, if one of your internal stacks stops reporting metrics, chances are open-box monitoring will not catch it, while closed-box monitoring will. It also helps to have two independent pairs of eyes watching a critical system to build up confidence in it.
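To make the complementarity concrete, here is a minimal sketch of how the two kinds of signals might be combined into a single diagnosis. The function name, its parameters, and the 120-second staleness threshold are illustrative assumptions, not part of any particular tool.

```python
from typing import Optional


def diagnose(last_sample_age_s: Optional[float],
             external_ok: bool,
             stale_after_s: float = 120.0) -> str:
    """Combine an open-box signal (age of the last internal metric sample,
    None if no sample was ever received) with a closed-box signal
    (whether an outside observer sees healthy behaviour)."""
    internal_silent = (last_sample_age_s is None
                       or last_sample_age_s > stale_after_s)
    if internal_silent and external_ok:
        # Only the closed-box view can tell us the service is still fine.
        return "metrics pipeline broken, service healthy"
    if internal_silent:
        return "service down"
    if not external_ok:
        return "service degraded, internal metrics still flowing"
    return "healthy"
```

The first branch is the case described above: internal metrics have gone silent, so open-box alerts have nothing to evaluate, and only the outside view shows whether the service itself is affected.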
Landscape of Monitoring
Most Ethereum validators today support Prometheus as a metrics exporter, which can be used to pull in metrics and build open-box dashboards and alerts. This approach is widely used and is the most straightforward way to operate. For instance, Prysm validators come with a ready-to-use Grafana dashboard built on top of their metrics.
In large setups with client diversity, open-box monitoring tends to get complicated, because each validator type may expose different signals, and finding a common set of signals that is meaningful and comparable across types is challenging.
Closed-box monitoring, on the other hand, is agnostic to the client type, as it relies purely on metrics from the beacon chain: it requires a process that watches attestations on the beacon chain from outside your infrastructure. Tooling in this area is sparse and usually requires building your own aggregator or instrumenting an existing one like Chaind to expose per-validator metrics to Prometheus.
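As a rough sketch of what such an aggregator computes, the snippet below counts missed attestations per watched validator from externally observed data. The input shape (a mapping from epoch to the set of validator indices seen attesting) and the function name are assumptions made for illustration; a real aggregator would populate this data from a beacon node or from a tool like Chaind.

```python
from typing import Dict, Iterable, Set


def missed_attestations(watched: Iterable[int],
                        attesters_by_epoch: Dict[int, Set[int]]) -> Dict[int, int]:
    """For each watched validator index, count the epochs in which
    it was not seen attesting."""
    missed = {validator: 0 for validator in watched}
    for attesters in attesters_by_epoch.values():
        for validator in missed:
            if validator not in attesters:
                missed[validator] += 1
    return missed
```

Because the input only contains what is visible on chain, the same computation applies whichever client runs the validator.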
Another approach to closed-box monitoring is to use Rated Network, which offers statistics about validation keys via a public API.
With the help of the Rated Network team, we have developed a Prometheus exporter for Rated Network that can be configured to watch a set of validation keys and export their statistics to a Prometheus stack.
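The core of an exporter like this can be sketched as follows. This is not the actual implementation: the `fetch_stats` callable stands in for a call to the Rated Network API, and the metric prefix, label name, and stat names are hypothetical. The sketch only shows how per-key statistics might be turned into the Prometheus text exposition format.

```python
from typing import Callable, Dict, Iterable

# Hypothetical shape: stat name -> value, for a single validation key.
Stats = Dict[str, float]


def render_metrics(keys: Iterable[str],
                   fetch_stats: Callable[[str], Stats],
                   prefix: str = "rated_validator") -> str:
    """Fetch stats for each watched key and render one sample per line
    in the Prometheus text exposition format."""
    lines = []
    for key in keys:
        # Sort stat names so the output is stable between refreshes.
        for name, value in sorted(fetch_stats(key).items()):
            lines.append(f'{prefix}_{name}{{pubkey="{key}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Stand-in for the real API client; the values are made up.
    def fake_api(key: str) -> Stats:
        return {"uptime": 0.99, "avg_inclusion_delay": 1.02}

    print(render_metrics(["0xabc"], fake_api), end="")
```

Passing the API client in as a function keeps the rendering logic testable without network access.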
The watcher is configured via a YAML config file containing the validation keys to watch. Every 24 hours, it polls the Rated Network API for metrics about each validation key and refreshes the exported Prometheus values. It also comes with a Grafana dashboard that can be installed to watch key metrics.
For now, the implementation supports a subset of the signals from the Rated Network API: