Network Restart Proposal: Feedback Requested

I have begun to sketch an initial proposal to improve our ability to restart the network when it enters a paused state. The proposal I present here is not final, nor do I believe it is the "best" method. However, I do believe it would be an improvement and, more to the point, it is relatively simple and achievable.

As I understand it, the current system has the following problems:

1. It is always active, so a central authority could restart the entire network at any time without permission from node operators.
2. If the central authority were to go offline, it would be a significant logistical challenge to switch to a new central authority, particularly if the network is in a stalled state.
3. It requires operators to run their nodes in Docker, which may not be the preference of the operator and creates a potential attack vector.
4. Setup is complicated.
5. An operator must coordinate with the central authority to move their node or change their IP.

A better restart system might have the following properties:

1. Operators can choose when to allow a central authority to restart their node (i.e. the system is not always on).
2. A new central authority can be selected with minimal effort.
3. The system is agnostic to how factomd is being operated, meaning that an operator can run factomd however they like (e.g. docker, systemd, etc).
4. Operators would be able to move their node or change their IP without the cooperation of a central authority.
5. Requires minimal configuration.

Proposal:

A restart agent, inspired by Chockablock, that is installed locally on an authority server and can manage the restart process as directed by a centralised coordinator.

The restart agent would be a small, basic agent that would be started in the event of a network pause. It would connect directly over websockets to a centralised coordinator as defined in a local config file. The coordinator would be able to issue start and stop commands to the restart agent. Upon receipt of one of these commands, the restart agent would run a corresponding bash script defined by the operator: start.sh or stop.sh. These operator-defined scripts would contain the commands to stop or start the factomd process. The state of the factomd process would be fed back up the websocket connection to the central coordinator.
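To make the agent's shape concrete, here is a minimal sketch of the command-dispatch step only, assuming the design above: the websocket transport is omitted, and the function and field names are my own, not part of the proposal. `handle_command` would be called with each message received from the coordinator, and its return value is the kind of status that would be fed back up the connection.

```python
import subprocess

# Hypothetical sketch of the agent's command handler. The operator-defined
# start.sh/stop.sh scripts come from the proposal; everything else here is
# an assumption for illustration.
SCRIPTS = {
    "start": "./start.sh",
    "stop": "./stop.sh",
}

def handle_command(command, runner=subprocess.run):
    """Run the operator-defined script for a coordinator command.

    Returns a status dict that would be reported back over the websocket.
    `runner` is injectable so the dispatch logic can be tested without
    actually executing a shell script.
    """
    script = SCRIPTS.get(command)
    if script is None:
        return {"command": command, "ok": False, "error": "unknown command"}
    result = runner(["bash", script], capture_output=True)
    return {"command": command, "ok": result.returncode == 0,
            "exit_code": result.returncode}
```

Keeping the script lookup and the transport separate would also make it easy to swap the websocket library without touching the restart logic.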

The coordinator itself would consist of a websocket server and a UI to allow core devs and operators to orchestrate network restarts. The UI would contain a public interface to display network state and a private interface available only to people orchestrating network restarts. The public interface would show the state of the network, including the height and minute of each factomd instance, and the time since the last block. The private interface would include those details, in addition to controls to restart the entire network or to restart only selected nodes.
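The fields the public interface would display suggest a very small status message from agent to coordinator. A hedged sketch of what that payload might look like, assuming JSON over the websocket; the field names are my invention, since the proposal only says that height, minute, and time since the last block are shown:

```python
import json
import time

def status_message(identity, height, minute, last_block_ts, now=None):
    """Build a hypothetical agent -> coordinator status payload.

    `identity` would be the node's identity chain ID; `last_block_ts` is
    the Unix timestamp of the last completed block.
    """
    now = time.time() if now is None else now
    return json.dumps({
        "identity": identity,
        "height": height,
        "minute": minute,
        "seconds_since_block": int(now - last_block_ts),
    })
```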

[Attached image: restart.png]

Pros:
  • The system is not always on. Operators can decide when to allow authenticated users to restart their authority nodes by controlling whether the restart agent is running.
  • The user-defined start.sh and stop.sh scripts will allow operators to run factomd however they prefer.
  • Setup is likely to be much simpler than the current process.
  • Authority nodes will become more flexible and mobile, as they won't be bound to a specific IP address.
  • It would require minimal reconfiguration for node operators to select a new central authority. They would only need to change the endpoint in a config file.
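To illustrate the last point, switching to a new central authority could be a one-line change, assuming a config file along these lines (the file name, section, and key are hypothetical):

```toml
# restart-agent.toml (hypothetical name and layout)
[coordinator]
# Changing this endpoint is all an operator would need to do
# to adopt a new central authority.
endpoint = "wss://coordinator.example.com/ws"
```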

Cons:
  • All (or virtually all) operators would be required to respond to a network pause, as each will need to start their own agent.
  • It is not as decentralised as it could be. We would still need a central authority to hit a restart button.

I am keen to hear any feedback. Thanks!
 
I like this idea, and this kind of setup is the next logical step for improving the current workflow. The idea is simple and can be easily created and integrated. The biggest challenge will likely be writing the bash stop/start scripts and monitoring the processes, but assuming most people will continue running factomd in Docker, that should be doable. The biggest concern would be the security model of the coordinator.

We should do it, and I'd be happy to assist in any way I can.

Cons:
  • It is not as decentralised as it could be. We would still need a central authority to hit a restart button.
I've mentioned this before somewhere: the whole idea of a "network restart" is in itself a centralized operation. I would rather make it a goal that centralized restarts become unnecessary through protocol improvements than invest a ton of resources into trying to decentralize something that can't be decentralized. This proposal improves the existing access to restarts, and IMO this is not a con because decentralization is not a goal here in the first place.



Question: Does anyone who currently works with the existing factom docker swarm know of a feature the docker swarm provides that would be missing under this system?
 
Thanks for the feedback, Who. Your input is much appreciated.

I have given some thought to the security model of the coordinator. I wanted to avoid maintaining a list of API keys. Instead, I wondered about the possibility of using asymmetric cryptography to sign messages using the private key associated with the node identity. The upside would be that the coordinator would only need to be aware of which identities currently belong to the authority set, meaning that it could access all necessary public data on-chain via OpenNode. The downside would be that the server would need to parse and track the admin chain, which would be slow for fresh installs and would add more complexity to the coordinator. I am no stranger to parsing the chain, however, so it is not a big drawback.
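The flow I have in mind is essentially challenge-response: the coordinator issues a nonce, the agent signs it with the node identity key, and the coordinator verifies the signature against the identities it knows to be in the authority set. Below is a sketch of that structure only. The real system would use ed25519 signatures with the identity key; since the Python standard library has no ed25519, HMAC-SHA256 with a shared secret stands in for sign/verify here, so this shows the protocol shape, not the actual cryptography.

```python
import hashlib
import hmac
import os

def issue_challenge():
    """Coordinator: send a random nonce to the connecting agent."""
    return os.urandom(32)

def sign_challenge(identity_key, nonce):
    """Agent: 'sign' the nonce with its identity key.

    HMAC is a stand-in; the real agent would produce an ed25519
    signature with the node identity's private key.
    """
    return hmac.new(identity_key, nonce, hashlib.sha256).digest()

def verify(authority_keys, identity, nonce, signature):
    """Coordinator: check the response against the known authority set.

    `authority_keys` maps identity -> key material and would be derived
    from on-chain data in the real system.
    """
    key = authority_keys.get(identity)
    if key is None:
        return False  # identity not in the current authority set
    expected = hmac.new(key, nonce, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)
```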

However, my knowledge of factomd cryptography is hazy at best, so I am not sure whether or not such an approach would be feasible. What do you think?

With regards to monitoring the progress of a restarted node, I was planning to rely on the exit code of the bash script and the status of the factomd API. That would massively simplify the contents of the bash script to, potentially, a single line. However, I am keen to understand better why you think this element might be tricky? Am I overlooking something?
 
Instead, I wondered about the possibility of using asymmetric cryptography to sign messages using the private key associated with the node identity. The upside would be that the coordinator would only need to be aware of which identities currently belong to the authority set, meaning that it could access all necessary public data on-chain via OpenNode. The downside would be that the server would need to parse and track the admin chain, which would be slow for fresh installs and would add more complexity to the coordinator. I am no stranger to parsing the chain, however, so it is not a big drawback.

However, my knowledge of factomd cryptography is hazy at best, so I am not sure whether or not such an approach would be feasible. What do you think?
That's pretty much what I was going to suggest 😸 It's much simpler than you think: there's an undocumented "authorities" API call you can use. (Sample output). Nodes logging in can be accomplished via Kambani, since the node keys are just ed25519 keys like FCT addresses OR they can sign a message locally if loading the key into Kambani is deemed too insecure. The only hurdle with Kambani would be converting the hex private key to an Fs... address so it can be imported as an FCT address, or extending Kambani to support a third address type ("raw ed25519 key").
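The hex-to-Fs conversion mentioned above can be sketched as follows, assuming the standard Factom human-readable address encoding as I understand it: base58 of a two-byte prefix (0x6478 for Fs), the 32-byte key, and a checksum taken from the first four bytes of a double SHA-256 of the prefixed key. Treat the encoding details as an assumption to be checked against the Factom data structures documentation.

```python
import hashlib

# Bitcoin-style base58 alphabet (no 0, O, I, l).
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58encode(data):
    """Big-integer base58 encoding, preserving leading zero bytes as '1'."""
    n = int.from_bytes(data, "big")
    out = ""
    while n:
        n, r = divmod(n, 58)
        out = B58_ALPHABET[r] + out
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out

def hex_key_to_fs(hex_key):
    """Convert a hex ed25519 private key to an Fs... address.

    Assumed scheme: base58(prefix || key || checksum), where the Fs
    prefix is 0x6478 and the checksum is the first 4 bytes of
    SHA256(SHA256(prefix || key)).
    """
    key = bytes.fromhex(hex_key)
    assert len(key) == 32, "expected a 32-byte ed25519 private key"
    payload = bytes([0x64, 0x78]) + key
    checksum = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    return b58encode(payload + checksum)
```

The prefix bytes are chosen so that every valid encoding begins with "Fs", which is what makes the addresses recognisable at a glance.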

With regards to monitoring the progress of a restarted node, I was planning to rely on the exit code of the bash script and the status of the factomd API. That would massively simplify the contents of the bash script to, potentially, a single line. However, I am keen to understand better why you think this element might be tricky? Am I overlooking something?
It's not really tricky per se, just the most challenging aspect.

If it's not running inside Docker, the script has to make up for a lot. The factomd API doesn't start immediately when the node does; it becomes available only after the initial setup and fastboot are done. If fastboot is somehow corrupt or otherwise not working, it might take half an hour between running the executable and the API being available, which would be indistinguishable from a node failing for other reasons. Ideally you'd monitor the process itself to see if it's running, which could be a challenge to get working on all platforms currently used by ANOs.
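One way to handle that gap between process start and API availability is to poll with a deliberately generous timeout. A sketch, assuming the agent probes factomd's JSON-RPC endpoint (http://localhost:8088/v2 in a default setup, which is an assumption here); the probe is injectable so the wait logic can be tested without a node:

```python
import time

def wait_for_api(probe, timeout=2400, interval=10,
                 clock=time.monotonic, sleep=time.sleep):
    """Poll until `probe()` returns True or the timeout expires.

    `probe` would, in a real agent, POST a cheap JSON-RPC request (for
    example "heights") to the factomd v2 endpoint and return whether it
    answered. The long default timeout reflects the point above: a node
    rebuilding state can take on the order of half an hour before the
    API responds, which is not the same thing as a failed restart.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if probe():
            return True
        sleep(interval)
    return False
```

Distinguishing "still booting" from "dead" more precisely would still need process-level monitoring, as noted above; this only bounds how long the coordinator waits before flagging a node.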

My gut feeling at first was to just provide restart scripts for the docker container, and let ANOs write their own scripts if they want a custom setup. (Unless that was your plan all along and I misinterpreted it). The docker restart script would indeed be simple.
 
The technical approach here seems like it could work in an ideal system - I think some of the challenges we have with the current codebase may prove problematic (though there is plenty of merit in moving to a more decentralized tooling)

Question: Does anyone who currently works with the existing factom docker swarm know of a feature the docker swarm provides that would be missing under this system?
From my observation of the current process - the main thing that would be missed is the ability to get a global view of the overall network. There is an 'Observer' tool that queries the nodes attached to the swarm network, and a user with the proper authorization can create a tunnel to observe any Admin panel. (and see logs also - though most nodes don't do verbose logging)

The largest challenges with the current system happen when a given Authority node doesn't restart when we need it to (this requires a human in the loop to intervene).

Deeper Dive Into the Problem

The design of the protocol is currently sensitive to time; some adjustments have been made to address this.
In the past, the workaround has been to connect the authorities using a backhaul follower network running https://github.com/FactomProject/factomd/tree/talktalktalk

On a longer timescale, it would be interesting to have the blockchain protocol support different modes of operation.
For example, in the case of a stall, the network could move to a 'startup' mode where different message-filtering rules take effect (as in the talktalktalk branch), and a reboot could be coordinated using the p2p network instead of a side channel.
 
There is an 'Observer' tool that queries the nodes attached to the swarm network, and a user with the proper authorization can create a tunnel to observe any Admin panel
Accessing the admin panel is something I had not yet considered. It may be possible to serve the admin panel over the same websocket connection as everything else. The admin panel continuously polls factomd, so the challenge there is probably around how to correctly route those requests. I don't have a ready-made solution in mind for how that would be done, but I am relatively confident that it is doable. Logs would be more difficult, as the restart agent would have no knowledge of where those logs are.
 
All (or virtually all) operators would be required to respond to a network pause, as they will be needed to start their agent.
Maybe I'm being dense, but doesn't this make the idea of a network-wide restart moot? I'd have to imagine that the point of this is to not cause a system-wide outage just because a few people are asleep during a stall 🤔

If people have to be awake to start the restart agent, then why wouldn't they just be able to restart their own node? I'll confess the more technical aspects of the proposal go over my head, so there could be multiple benefits I'm not seeing here.

Perhaps a middle-ground would be to only grant access to a certain number of people, f.ex. 10 in total, with diverse "active hours", limiting the risk of abuse. I would imagine there would be a log of restarts initiated, by whom, along with a "reason for restart" input area, so if anyone restarted the network without a valid reason they could have their access removed.
 
Thank you for bringing this up Alex.

This is a first step towards something more resilient. We could probably try it on the testnet first (since we also had some issues with restarts recently...)

Maybe I'm being dense, but doesn't this make the idea of a network-wide restart moot? I'd have to imagine that the point of this is to not cause a system-wide outage just because a few people are asleep during a stall 🤔
Even if the script is "always on" and listening for a potential restart, it's still a better solution since currently you have to go through a centralized process to get your node up in the swarm (if it's a new one or if you have changed IP).

Would be pleased to help and set that up
 