The curious case of the doppelganger DBsig. (planned fix for v5.4.3)

Brian Deery

Factom Inc.
Core Committee
Core Developer
Good news, Clay found the bug that caused last Tuesday's stall. Let me see if I can explain it well enough. I am posting it here in the hope that it can inform some learning about Factom in the future.

As part of getting servers un-gummed-up, a change was made in the last release to take the latest valid DBsig received from the network. This gives the code the ability to recover if, for example, an older DBsig from before a stall is still floating around. Stale DBsigs like that contributed to difficult recovery in earlier stalls.

The change happened to have a negative effect on elections.
When an election starts, an audit server volunteers a replacement for whatever the faulted fed neglected to produce and offers it to the remaining federated servers. In this case, an audit server would create a DBsig, then offer it to the remaining feds. They would vote amongst each other and agree that the volunteering audit server gave an appropriate DBsig. After 3 rounds of the majority agreeing, they would all take the new DBsig and add it as the first item in the process list where it was missing. (DBsigs are always the first item in the PL.)

This is where the earlier bugfix came in. The audit server that volunteered would put the DBsig it offered the feds into its process list. There was a bug where it would then create and add another, doppelganger DBsig, which was only mostly the same. Almost isn't close enough though. It would sign the same block as before, and ed25519 sigs are deterministic, so the *block* signature would match the earlier one. But messages that go in the process list are themselves signed by the federated server, and under that message signature is the DBsig, a timestamp, a hash of the previous PL message, and some other stuff.

The audit server that is promoted to the fed position erroneously generates that second DBsig and puts it into its local process list. This second DBsig is not broadcast to the rest of the network. Since it is made at a different time, it has a different timestamp, and thus a different hash. The next message that the promoted audit server sends out will have a pointer to the doppelganger DBsig. All the other feds will reject those subsequent process list messages, since the pointer to the previous message does not match the original, non-doppelganger DBsig which they are using. At the end of the first minute, the other feds get tired of waiting and a new, different audit server offers up a satisfactory EOM0 to close out the first minute period. (This typically happens after the other servers have been waiting about 2 minutes.)

Here is the fix.

With the new condition `pl.VMs[dbs.VMIndex].List[0] != m` the code no longer overwrites that existing dbsig.
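As a rough illustration of what such a guard does, here is a hedged Go sketch. Only the comparison `pl.VMs[dbs.VMIndex].List[0] != m` comes from the post; the `message`, `vm`, and `processList` types and the `placeDBSig` function are hypothetical stand-ins, and the surrounding logic in factomd may well be structured differently.

```go
package main

import "fmt"

// Hypothetical, stripped-down stand-ins for the process list structures.
type message struct{ ID string }
type vm struct{ List []*message }
type processList struct{ VMs []*vm }

// placeDBSig puts m in slot 0 of the VM's list unless a different DBsig
// already occupies that slot. It returns the DBsig that is in effect.
func placeDBSig(pl *processList, vmIndex int, m *message) *message {
	list := pl.VMs[vmIndex].List
	if len(list) > 0 && list[0] != nil && list[0] != m {
		// A different DBsig is already installed: keep it, do not overwrite.
		return list[0]
	}
	if len(list) == 0 {
		pl.VMs[vmIndex].List = append(list, m)
	} else {
		pl.VMs[vmIndex].List[0] = m
	}
	return m
}

func main() {
	pl := &processList{VMs: []*vm{{}}}
	original := &message{ID: "original"}
	doppelganger := &message{ID: "doppelganger"}

	placeDBSig(pl, 0, original)
	kept := placeDBSig(pl, 0, doppelganger) // second attempt is ignored
	fmt.Println("slot 0 holds:", kept.ID)
}
```

With a guard of this shape, the DBsig agreed on in the election stays in slot 0, so later messages from the promoted server point at the same message the other feds hold.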

The code you see above is the test that replicates a server dropping out at this critical moment, so that if this bug arises again it will be caught when the code gets pushed to GitHub.

Paul Bernier

Core Committee
Core Developer
Just a quick question: what do you mean by an audit volunteer? Isn't the audit server chosen in a deterministic way? Do all audit servers volunteer? Basically, what's the logic to choose a new audit server? Thanks!