Learnings from running Web3Signer at scale on Holesky

At Kiln, we have been running Web3Signer at scale on Holesky, the new Ethereum testnet launched in September 2023.

💡 Developed by ConsenSys, Web3Signer is a remote signing API for Ethereum. It lets validators sign blocks and attestations securely while minimizing the risk of slashing penalties.

While implementing Web3Signer at scale on Holesky, we encountered several challenges and gathered valuable insights, particularly with the distribution of validators across diverse geographic locations. This blog post delves into our experiences and findings from that setup.

Overview

This post describes the infrastructure setup for running validation at scale with Web3Signer, which we put in place at Kiln to run our ~100k Holesky validators.


One of the goals was to assess the feasibility of running validators from multiple geographical locations while preserving Web3Signer's guarantees. We initially adopted a straightforward approach; this post covers the pitfalls it exposed and the enhancements we eventually made.

We extend our heartfelt gratitude to the ConsenSys team for their unwavering support, in particular for providing custom Web3Signer flags that let us tweak its threading model.

Architecture

The overall architecture looks as follows:

Each validator client loads a subset of validation keys, establishes connections to a beacon/exec pair and gets signatures from a fleet of Web3Signer instances linked to an anti-slashing database. This configuration is quite common when using Web3Signer. 

However, what sets our setup apart from classic ones is the scale at which it runs, and the fact that validator, beacon, and execution nodes run in different geographical locations.
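For reference, a signing request from a validator client boils down to a single HTTP call to Web3Signer's Eth2 signing endpoint. Below is a minimal Java sketch of that call; the signer URL, public key, and the (abbreviated) payload are illustrative placeholders, not our production values.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SignWithWeb3Signer {
    public static void main(String[] args) throws Exception {
        // Placeholders: the signer URL and validator key depend on your deployment.
        String signerUrl = "http://web3signer.internal:9000";
        String validatorPubkey = "0x<validator-public-key>";

        // Abbreviated payload: a real request also carries fork_info and the full
        // attestation data, which Web3Signer needs for its anti-slashing checks.
        String body = "{\"type\": \"ATTESTATION\", \"signing_root\": \"0x<signing-root>\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(signerUrl + "/api/v1/eth2/sign/" + validatorPubkey))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}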

Impact of geographical spread

TL;DR: You likely want to favor having a Web3Signer instance as close to the database as possible.

In the above architecture, the Web3Signer instances can be positioned:

  • close to the validator clients and far away from the database
  • close to the database and far away from the validator clients

We initially assumed both placements would yield similar signing latencies, since the overall end-to-end distance is the same, but that's not the case.

Whenever a validator client reaches out to Web3Signer, it sends a single HTTP request, so there is only one round trip. On the other hand, upon receiving the signing request, Web3Signer opens a transaction to the database: it first locks the validator row to check it, then issues multiple sub-queries from within the transaction handler. This means increasing the latency between Web3Signer and the database has roughly a 5x impact on the overall signing latency, because several sequential round trips happen on that leg:
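Here is a back-of-envelope sketch of the effect, in Java for illustration. The one-way latencies and the figure of 5 sequential database round trips per signature are assumptions chosen to match the ~5x impact described above, not measured values.

public class SigningLatencyEstimate {
    public static void main(String[] args) {
        int dbRoundTrips = 5; // assumed sequential queries inside the anti-slashing transaction

        // Placement A: Web3Signer next to the database, far from the validator client.
        double nearDb = 2 * 50.0 + dbRoundTrips * 2 * 1.0;   // ~110 ms
        // Placement B: Web3Signer next to the validator client, far from the database.
        double nearVc = 2 * 1.0  + dbRoundTrips * 2 * 50.0;  // ~502 ms

        System.out.printf("signer near DB: ~%.0f ms, signer near VC: ~%.0f ms%n", nearDb, nearVc);
    }
}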

Increased latency between the Web3Signer instances and the database leads to longer queueing delays and, eventually, timeouts if Web3Signer's threading model is not tuned (see next section).

Threading model of Web3Signer and latency

TL;DR: You may want to tweak the threading model of Web3Signer.

Our initial setup scheduled Web3Signer instances near the validator clients, which increased their latency to the database. At the beginning of each epoch, when the signing load is at its peak, we observed:

  • a significant number of missed attestations
  • validators seeing timeouts on signing requests
  • nominal resource usage by beacons, execution clients, and validators
  • little to no load on our database, as well as our Web3Signer instances
  • no saturation on the network side

These observations suggested that there was contention in the way Web3Signer processed incoming requests. After a thorough investigation, we concluded that we needed to tweak the worker pool size of Vertx, the Java framework Web3Signer uses to dispatch request processing. Vertx excels at handling asynchronous operations concurrently, spreading them across multiple Unix threads.

However, performing some categories of blocking operations from a handler can block the event loop. We suspect this is what happens around the SQL transactions used for anti-slashing. As we couldn't tweak the Vertx configuration, we coordinated with the ConsenSys team to build a version of Web3Signer allowing the -Xworker-thread-pool value to be tuned, which increases the number of Unix threads available to workers. Tuning this parameter has a noticeable impact on performance, especially when latency to the database is high.
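To illustrate what this parameter controls, here is how a worker pool is sized in a plain Vertx application. This is not Web3Signer's actual code; we simply assume the custom flag maps onto something equivalent to VertxOptions.setWorkerPoolSize.

import io.vertx.core.Vertx;
import io.vertx.core.VertxOptions;

public class WorkerPoolSizing {
    public static void main(String[] args) {
        // Vertx defaults to 20 worker threads. If each signing request holds a
        // worker for the whole blocking anti-slashing transaction, at most ~20
        // signatures are processed concurrently, regardless of CPU or database headroom.
        VertxOptions options = new VertxOptions()
                .setWorkerPoolSize(200)                    // analogous to a large -Xworker-thread-pool
                .setMaxWorkerExecuteTime(2_000_000_000L);  // warn after 2s of blocking work (nanoseconds)

        Vertx vertx = Vertx.vertx(options);
        System.out.println("started " + vertx + " with a 200-thread worker pool");
    }
}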

There are Prometheus metrics that are worth checking to get an insight into this:

  • http_vertx_worker_queue_delay: time spent by requests in queues before being processed
  • http_vertx_worker_pool_completed_total: number of queries processed by Web3Signer

With high latency to the database

The greater the distance between the database and the Web3Signer instances, the more significant this issue becomes, because the blocking SQL transaction takes longer and holds up other requests. We got to the point of observing improvements while running with as many as 200 Unix threads on a single CPU, without a noticeable increase in CPU load. This suggests there is room for improvement at the Web3Signer level:

Note: the delays in this case are not on the database side; they come from incoming requests waiting in Web3Signer's processing queue, with no contention at the underlying database level.

In this extreme example, with a latency on the order of 50 ms between Web3Signer and the database, requests quickly accumulate in queues awaiting processing, as Web3Signer can only handle a few at a time due to the blocking SQL transaction.

With low latency to the database

When there is low latency between the database and the Web3Signer instances, average performance is generally satisfactory, and tweaking the worker pool size only marginally improves it:

Note: with latencies < 0.02 ms, these values are multiple orders of magnitude lower than in the previous graph and depict a well-functioning Web3Signer.

However, zooming in on the 99th percentile, we still observe a clear improvement in queue waiting times when using a large value for -Xworker-thread-pool:

We could not (yet) correlate this with a clear improvement in our attestation rate, because there is currently too much noise on Holesky. However, we believe it has the potential to save ~50 ms at the 99th percentile on more stable networks, or to mitigate the impact of an increase in database latency.

Takeaways

The current state of Web3Signer's request processing reveals opportunities for improvement, as it currently delays and queues incoming requests even in situations where there is no resource contention in the pipeline. Yielding back to the Vertx scheduler around the blocking code inside the signing transaction might allow other concurrent requests to progress.
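To make that idea concrete, here is a generic Vertx sketch of the pattern, as we imagine it could look (this is not Web3Signer's code): the blocking database work is offloaded with executeBlocking so the event loop keeps accepting requests, and ordered=false lets independent requests progress concurrently.

import io.vertx.core.Vertx;

public class YieldAroundBlockingWork {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();

        vertx.createHttpServer()
             .requestHandler(req ->
                 // Offload the blocking anti-slashing transaction to the worker pool;
                 // ordered=false avoids serializing unrelated signing requests.
                 vertx.executeBlocking(promise -> {
                     // placeholder for the locking SELECT, sub-queries and INSERT
                     promise.complete("0x<signature>");
                 }, false).onComplete(ar ->
                     req.response().end(ar.succeeded() ? ar.result().toString() : "signing failed")))
             .listen(9000);
    }
}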

In the meantime, we think raising the -Xworker-thread-pool value above 20 could prove beneficial should an incident increase our latency to the database, helping Web3Signer cope with it.

Ingress load balancing

TL;DR: At scale, you likely want an ingress load-balancer.

We initially didn't use an ingress, relying instead on the default load-balancing mechanism in Kubernetes. With this approach, each validator client connects to a random Web3Signer instance and then keeps that HTTP connection for all of its signatures. This is problematic because there is no guarantee that the random selection at socket-opening time results in a balanced distribution; we observed a high level of disparity between the load processed by each Web3Signer instance:
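To illustrate why balancing at connection-opening time drifts out of balance, here is a tiny simulation with illustrative numbers (not our actual topology): each client picks one signer at random and then sticks to it.

import java.util.Arrays;
import java.util.Random;

public class ConnectionPinningSim {
    public static void main(String[] args) {
        int clients = 30, signers = 6;   // illustrative numbers only
        int[] load = new int[signers];
        Random rng = new Random();

        // Service-style balancing: each long-lived connection is assigned once,
        // at socket-opening time, and all subsequent signing requests stay
        // pinned to that instance.
        for (int c = 0; c < clients; c++) {
            load[rng.nextInt(signers)]++;
        }

        // With per-request balancing each signer would serve clients/signers = 5
        // clients worth of load; random pinning typically leaves some instances
        // well above the average while others sit nearly idle.
        System.out.println("clients pinned per signer: " + Arrays.toString(load));
    }
}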

As we saw before, the waiting queue can quickly become a bottleneck due to the threading model; hence, validators that all happen to connect to the same instance can experience latencies up to 5 times higher and potentially time out, while others are fine. Using an ingress, without any extra configuration, balances the load on a per-request basis:

On the plus side, we get additional metrics from the ingress that can quickly point out issues, such as the overall QPS (which spikes at every epoch):

As well as latency distribution graphs which can pinpoint underlying issues on the Web3Signer queues or at the database level:

Conclusion

Our journey running Web3Signer at scale on Holesky highlighted potential areas of optimization in its request processing. Fine-tuning parameters like -Xworker-thread-pool can provide better performance, especially when faced with unexpected latency issues.

Additionally, implementing an ingress load-balancer at scale ensures a more balanced and efficient distribution of requests. These insights reflect the importance of continuous assessment and adjustment when operating in a dynamic environment like blockchain technology.

Thanks to Sébastien Rannou for writing this article, as well as the Ethereum Foundation for their support.

Reach out to start staking with Kiln

About Kiln

Kiln is the leading enterprise-grade staking platform, enabling institutional customers to stake assets, and to whitelabel staking functionality into their offering. Kiln runs validators on all major PoS blockchains, with over $2.2 billion crypto assets being programmatically staked, and running over 3% of the Ethereum network on a multi-cloud, multi-region infrastructure. Kiln also provides a validator-agnostic suite of products for fully automated deployment of validators, and reporting and commission management, enabling custodians, wallets, and exchanges to streamline staking operations across providers. Kiln is also SOC2 Type 2 certified.
