We had been struggling with an issue that affects the majority of Radix validator node runners: occasional missed proposals with no apparent cause in the log files or monitoring dashboards. In this blog entry we explain what the issue is and how we got it under control, minimizing the impact to the Radical Staking node and our delegators.
The issue manifested as missed proposals following a large drop in peers, for example several peer validator nodes becoming unreachable at once, or a few of the top 10 validators going offline. A random missed proposal would occur roughly every 7 to 14 days; the impact was minimal and did not affect our uptime. What we observed was that when a large drop in peers occurred, a spike in network traffic and system load followed, and that spike led to the missed proposal. This made the issue particularly difficult to deal with, because the trigger is external (other validator nodes experiencing cloud service problems), so the missed proposals were ultimately caused by circumstances beyond our control.

The Radix team has confirmed that an optimization in the validation logic could prevent missed proposals caused by other nodes becoming unavailable, but in the meantime the issue grew progressively worse: on October 4th we had 5 missed proposals and our uptime on the Radix Explorer dropped as low as 99.97%. That level of impact should still be negligible in terms of delegation rewards, but we prioritized troubleshooting and resolving the problem before it could grow to the point of affecting our delegators' rewards.
After testing different solutions proposed by the validator community, we found that the change that finally resolved the issue for us was upgrading the validator node's server memory from 16GB to 32GB and reconfiguring the node's Docker container to use the expanded memory. With the default memory configuration, usage inside the container sits above 90%, so when a drop in peers causes a memory spike the container hits 100% and proposals are missed. Of the possible solutions we discussed with our fellow node runners, the memory expansion and reconfiguration was the one that worked best at minimizing the issue. We implemented it on October 6, and since then we have held a solid 100% uptime with no missed proposals.
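For node runners who want to check how their own container is configured, here is a minimal sketch using the Docker SDK for Python. It only inspects, and optionally raises, the container's hard memory limit at runtime; the container name and the 24 GiB target are hypothetical placeholders, and this is an illustration of the idea rather than the exact reconfiguration we applied, which was made in the node's Docker configuration before restarting.

```python
# Minimal sketch (not our exact procedure): inspect and raise a validator
# container's hard memory limit with the Docker SDK for Python.
# The container name and the 24 GiB target are hypothetical placeholders;
# the durable change belongs in the node's Docker run/compose configuration.
import docker

CONTAINER_NAME = "radixdlt_core_1"   # hypothetical container name
NEW_LIMIT_BYTES = 24 * 1024 ** 3     # hypothetical target, leaving host headroom

client = docker.from_env()
container = client.containers.get(CONTAINER_NAME)

host_total = client.info()["MemTotal"]                   # total RAM on the host
current_limit = container.attrs["HostConfig"]["Memory"]  # 0 means "no limit"
print(f"host RAM: {host_total} bytes, container limit: {current_limit or 'unlimited'}")

# Raise the hard limit on the running container. The swap limit generally has
# to be raised alongside the memory limit or the daemon rejects the update.
container.update(mem_limit=NEW_LIMIT_BYTES, memswap_limit=NEW_LIMIT_BYTES)
print(f"container memory limit raised to {NEW_LIMIT_BYTES} bytes")
```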
The memory monitoring chart below shows memory usage inside the validator node's Docker container before and after the memory upgrade. With the default configuration, usage hovers near 100%; after the upgrade the container has the "cushion" it needs to absorb unexpected memory spikes. We upgraded both our primary and backup nodes from 16GB to 32GB of memory. The upgrade increased our monthly cloud services cost, but we are not passing that additional cost on to our delegators, and our fee will remain at 1%.
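If you don't run a full monitoring stack, a small script can track the same metric the chart is based on: memory usage inside the container as a percentage of its limit. This is only a hedged sketch; the container name and the 90% warning threshold are assumptions for illustration, not our actual setup.

```python
# Illustrative sketch: poll memory usage inside the validator container as a
# percentage of its limit, the metric shown in the monitoring chart above.
# The container name and warning threshold are hypothetical assumptions.
import time
import docker

CONTAINER_NAME = "radixdlt_core_1"   # hypothetical container name
WARN_PERCENT = 90.0                  # warn when headroom is nearly gone

client = docker.from_env()
container = client.containers.get(CONTAINER_NAME)

while True:
    stats = container.stats(stream=False)      # one-shot stats snapshot
    usage = stats["memory_stats"]["usage"]     # bytes in use (includes page cache)
    limit = stats["memory_stats"]["limit"]     # container memory limit in bytes
    percent = usage / limit * 100
    print(f"container memory: {percent:.1f}% of limit")
    if percent >= WARN_PERCENT:
        print("WARNING: little cushion left for a peer-drop spike")
    time.sleep(60)
```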
Hopefully the Radix team will soon release a patch that enables the node to handle this scenario without a memory upgrade. In the meantime, we'll continue to monitor the issue closely and will provide further updates to our delegator community if anything changes. We would also like to give special thanks to our friends at Radix DLT Staking for collaborating with us on the analysis and troubleshooting of this issue; you can find out more about them at radixdltstaking.com.