The Radical Staking team participated in the Stokenet Babylon migration today and successfully migrated our Olympia validator node to Babylon (in case you haven’t heard of “Stokenet”, it is the Radix test network). The Stokenet migration is intended to be identical to the Mainnet migration, so we prioritized this task: it was crucial for preparing for the Radix Mainnet Babylon migration, which is planned to occur on or around September 27.
We actively participated in live discussions with the RDX Works team and other validator node runners in the Discord #node-runner and #babylon-test-validators channels as the migration happened. We came out of this experience with a full understanding of how the process will work, what to expect, the potential risks, and how to prepare for them. Participating was a time-consuming task, and as a result several validator node runners were not able to take part. One of our core values is to foster Radix ecosystem growth and success through open knowledge sharing, so we decided to write this blog post to share our migration experience, including lessons learned, recommended best practices, and troubleshooting steps for common migration issues.
Our migration approach used two separate nodes running on separate virtual machines: one for the Olympia Stokenet and another for the Babylon Stokenet. It is also possible to do an “in-place” migration, where both nodes run on the same VM, but based on our previous experience executing similar migrations we believe this is a risky approach because of possible conflicts between the two configurations (e.g., port conflicts) and resource contention between the nodes. One general observation during the migration was that the Babylon node consumes a large amount of memory during the migration process; as a result, the RDX Works team is planning to update the documentation to make 32 GB of RAM the minimum requirement for Babylon (currently 16 GB is the minimum requirement, and 32 GB is the recommended configuration). We avoided this issue by allocating 32 GB of memory to each of our nodes, but based on our observations, configurations with less memory (or less than 64 GB if doing an in-place migration on a single VM) will be prone to failure.
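If you want to double-check how much memory and how many cores each VM actually has before starting, a quick generic Linux check (not specific to the Radix tooling) is:
free -h     # total, used and available RAM
nproc       # number of CPU cores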
The process started with upgrading the Stokenet node to version 1.5.0-stokenet (the equivalent Mainnet release is 1.5.0, which we have already deployed on our Olympia Mainnet node). This version of the node includes an olympia-end-state API endpoint that provides the genesis transaction used to initialize the Babylon node. In other words, the Babylon node connects to the Olympia node on this API endpoint to obtain the genesis transaction. This API runs on ports that are not currently used by Olympia validator nodes, and the port differs depending on the validator configuration (Docker or systemd). This caused some confusion, and we know that some node runners were not able to participate in the Stokenet migration because they couldn’t get past this first step (in other words, their Babylon node was not able to connect to the Stokenet node). First, you need to determine which port you are using for the olympia-end-state endpoint: the port is 443 if you deployed the node on Docker, and 3400 if you are using systemd. However, some systemd node runners have mimicked the Docker architecture and deployed NGINX as a reverse proxy in front of the core node process because it provides a more robust architecture. For those node runners, the olympia-end-state endpoint may be reachable on 443 instead of 3400, or on both 443 and 3400, depending on how they customized the NGINX configuration.
To verify whether the node is listening on port 3400 or 443, you can run the following commands:
netstat -ano | grep LISTEN | grep 3400
netstat -ano | grep LISTEN | grep 443
In our case we used port 443 and received this output confirming that the node was listening on it:
tcp 0 0 0.0.0.0:443 0.0.0.0:* LISTEN off (0.00/0/0)
tcp6 0 0 :::443 :::* LISTEN off (0.00/0/0)
Once you’ve determined which port the olympia-end-state endpoint is listening on, make sure that your firewall rules allow inbound traffic on that port. For the Mainnet migration, we recommend opening the firewall on your Olympia node (on port 3400 or 443) only to your Babylon node, rather than to any address (i.e., the 0.0.0.0/0 CIDR range), in order to maintain a secure configuration. Once the firewall is configured, you can verify the port connectivity from the Babylon node by running "telnet x.x.x.x 443" or "telnet x.x.x.x 3400", where x.x.x.x is the IP address of your Olympia node. You should get a message saying "Connected to x.x.x.x"; if the connection times out, re-check the firewall configuration to make sure it's correct.
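As an illustration, if the Olympia node uses ufw and y.y.y.y is the public IP address of your Babylon node (both hypothetical here), a rule restricted to that single host could look like the following; adjust the port to 3400 if that is what your node listens on, and translate accordingly if you use iptables or a cloud security group instead:
sudo ufw allow from y.y.y.y to any port 443 proto tcp comment 'olympia-end-state access for Babylon node'
sudo ufw status numbered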
After verifying the listening port and completing the firewall configuration, we further verified connectivity to the olympia-end-state endpoint by running the following command from the Babylon node, where x.x.x.x is the IP address of the Olympia node:
curl -u admin:admin_password -k -X POST https://x.x.x.x/olympia-end-state \
--data '{"network_identifier": {"network": "stokenet"}, "include_test_payload": false}'
If you are using systemd and listening on port 3400, the URL should be modified to https://x.x.x.x:3400/olympia-end-state. For Docker configurations, there’s no need to specify port 443 since it is the default port for https.
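For reference, assuming the same admin credentials and flags as above, the systemd variant of the check would look like this:
curl -u admin:admin_password -k -X POST https://x.x.x.x:3400/olympia-end-state \
--data '{"network_identifier": {"network": "stokenet"}, "include_test_payload": false}'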
A successful execution of either command should return the following output:
{
  "status": "not_ready"
}
If you receive this message, you are ready to start working on your Babylon node (despite the “not_ready” status 😊).
We deployed node version rcnet-v3.1-r1 on our Babylon Stokenet node. We followed the usual process for validator node configuration, with three major exceptions:
1. We set network_id=2, which is the same setting as in the Olympia node (likewise, for the Mainnet migration the setting will be network_id=1 for both nodes)
2. We used the same validator node keystore (node-keystore.ks) from the Olympia node in the Babylon node, and updated the RADIX_NODE_KEYSTORE_PASSWORD variable accordingly
3. We updated the RADIXDLT_NETWORK_SEEDS_REMOTE setting based on the systemd settings published here. According to the instructions, creating the Babylon node with babylonnode docker config -m CORE MIGRATION automatically steps you through the equivalent of the manual systemd steps described on that page; however, it did not cover the RADIXDLT_NETWORK_SEEDS_REMOTE configuration, so we had to set it manually, taking the values from the network.p2p.seed_nodes setting documented for the systemd default.config file, since the corresponding steps were not published for Docker (an illustrative excerpt of the resulting configuration is shown after this list)
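To make these three changes concrete, here is an illustrative sketch of how they might look in the two deployment styles. The RADIXDLT_NETWORK_ID and network.id names are our assumption (above we only refer to the setting as network_id), the seed node entries are placeholders, and the actual values must come from the official migration documentation, not from this example:
# Docker (docker-compose.yml environment excerpt, illustrative only)
RADIXDLT_NETWORK_ID: 2                                   # 1 for the Mainnet migration
RADIX_NODE_KEYSTORE_PASSWORD: "<password of the Olympia node-keystore.ks>"
RADIXDLT_NETWORK_SEEDS_REMOTE: "radix://node_tdx_2_1...@<seed-host-1>,radix://node_tdx_2_1...@<seed-host-2>"
# systemd (default.config excerpt, illustrative only)
network.id=2
network.p2p.seed_nodes=radix://node_tdx_2_1...@<seed-host-1>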
Our key takeaway from these steps was that if you’re running on Docker you still have to read the instructions for the systemd configuration, just in case a configuration step was omitted from the Docker instructions. We realized this only after the fact, and it caused a problem later on; below we describe what the issue was and how we resolved it.
With this configuration, our Babylon node came up with no errors, but instead of the typical “node started” messages we only saw the following:
Successfully connected to the Olympia stokenet node, but the end state hasn't yet been generated (will keep polling)
This is expected, and the message repeats every second until the network shutdown is signaled on the Olympia Stokenet. The shutdown process starts as soon as a majority of stake votes for the fork. In this case, the RDX Works team holds the majority of staked XRD on Stokenet and was able to force the shutdown in a controlled manner. As soon as all of their validator nodes signaled readiness for the migration, these messages appeared in our Olympia node log file:
2023-09-07T11:53:47,299 [INFO/CandidateForkVotesPostProcessor/SyncRunner tv...ptysfgc0w] (CandidateForkVotesPostProcessor.java:129) - Forks votes results: [ForkVotingResult[epoch=16205, candidateForkId=73746f6b656e65742d73687464776e325e5d2025d0e4f983, stakePercentageVoted=8392]]
2023-09-07T11:53:47,313 [INFO/RadixEngineStateComputer/SyncRunner tv...ptysfgc0w] (RadixEngineStateComputer.java:466) - Forking RadixEngine to stokenet-shtdwn2
2023-09-07T11:53:47,313 [WARN/RadixEngineStateComputer/SyncRunner tv...ptysfgc0w] (RadixEngineStateComputer.java:475) - The time of the Olympia network has come to an end. It will no longer process any transactions.
Run the Babylon node to continue your Radix journey!
After these messages the Olympia Stokenet node continued validating for about 5 minutes, then the logs showed this message:
2023-09-07T11:58:42,306 [INFO/OlympiaEndStateHandler/XNIO-2 task-1] (OlympiaEndStateHandler.java:159) - Olympia end state prepared in 294 s
At this point the Olympia node became read-only and stopped updating the ledger.
Almost immediately, the logs in the Babylon node changed from the “will keep polling” messages to the following (IP address removed):
2023-09-07T11:59:00,092 [INFO/undertow/OlympiaGenesisService] (Undertow.java:259) - stopping server: Undertow - 2.2.9.Final
2023-09-07T11:59:00,095 [INFO/RadixNodeBootstrapper/OlympiaGenesisService] (RadixNodeBootstrapper.java:338) - Genesis data has been successfully received from the Olympia node (2138 data chunks). Initializing the Babylon node...
2023-09-07T11:59:01,583 [INFO/RunningRadixNode/OlympiaGenesisService] (RunningRadixNode.java:98) - Starting Radix node subsystems...
2023-09-07T11:59:01,583 [INFO/RunningRadixNode/OlympiaGenesisService] (RunningRadixNode.java:99) - Using a genesis of hash fe034c412d934813f46ec89a65e48d8773a0d35416646874a7950a2783a0f780
2023-09-07T11:59:02,084 [INFO/StandardHostIp/OlympiaGenesisService] - Using a configured host IP address: x.x.x.x
At this point the radixdlt-core.log in the Babylon node stopped logging messages, but we were able to continue following the progress in the console by executing the following steps:
1. Run the command "docker ps" and copy the CONTAINER ID value for the radixdlt/babylon-node container
2. Run the command "docker logs --follow container_id", replacing container_id with the container ID identified in the previous step (or use the one-liner shown below)
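If you prefer a single command, and assuming the image name is radixdlt/babylon-node as above (you may need to include the image tag in the filter depending on how your image is tagged), the two steps can be combined like this:
docker logs --follow $(docker ps -q --filter "ancestor=radixdlt/babylon-node")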
This shows the console output from the container, where we saw the following messages:
Committing data ingestion chunk (resource_balances) 6 of 2138
Committing data ingestion chunk (resource_balances) 7 of 2138
Committing data ingestion chunk (resource_balances) 8 of 2138
These “Committing data ingestion chunk” messages continued for about an hour, until the node finally showed “Committing data ingestion chunk (resource_balances) 2138 of 2138”. Note that during this time no data is being transmitted from the Olympia node: that only happened during the genesis data transmission, which ended with the “Genesis data has been successfully received from the Olympia node (2138 data chunks). Initializing the Babylon node...” message; after that it is just the Babylon node “churning” through the data locally. Another item to keep in mind is that from the moment we started the Babylon node up until this point the node status was “BOOTING_AT_GENESIS”, which is the expected value.
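If you want to confirm the status yourself during this phase, the babylonnode CLI should expose a health check alongside the network-status command we mention further below (this is our assumption based on the Olympia radixnode CLI; the exact subcommand and output format may differ in your version):
babylonnode api system health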
Once this process completed, the node started up and we saw the “Radix node started successfully” message and the other messages we normally see in the validator node log file. However, once the node started processing proposals it got stuck, with bft_timeout messages similar to this:
bft_timeout{epoch=16205 round=1 leader=validator_tdx_xxx next_leader=validator_tdx_xxx count=36}
We saw numerous messages like this with epoch=16205 and round=1, where these numbers didn't change but the count kept going up. After discussing this issue with the RDX Works team and other validators, we discovered that it was due to using the wrong value for the RADIXDLT_NETWORK_SEEDS_REMOTE variable. We updated the variable with the correct values, restarted the node, and the issue was resolved: the node status finally changed from “BOOTING_AT_GENESIS” to “UP”, the number of peers increased to the total number of validator nodes running (26), and the consensus status changed to "VALIDATING_IN_CURRENT_EPOCH". One item of concern was that the node status kept flipping between “UP” and “OUT_OF_SYNC”, even though the output of the command "babylonnode api system network-status" showed the same number for current_state_version and target_state_version, with both version numbers increasing. Fortunately, David from RDX Works clarified that this metric is “rubbish” and should be ignored 🤣
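If you want to keep an eye on the sync progress without re-running that command by hand, a simple (hypothetical) way to do it, assuming the network-status output contains the state-version fields as plain text, is:
watch -n 10 "babylonnode api system network-status | grep state_version"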
Overall the migration went very smoothly for all node runners who participated, and the only issues that come to mind were the connectivity issues related to the olympia-end-state endpoint and the confusion about the value for RADIXDLT_NETWORK_SEEDS_REMOTE because of how the documentation was written. We came out of this experience very confident that the Babylon Mainnet migration will be a success, and we hope the details we shared in this blog post are helpful for other validator node runners who did not have a chance to participate in the Stokenet migration today.