Troubleshooting
Restate Clusters
Node id misconfiguration puts log server in data-loss
state
If a misconfigured Restate node with the log server role attempts to join a cluster where the node id is already in use, you will observe that the newly started node aborts with an error:
ERROR restate_core::task_center: Shutting down: task 4 failed with: Node cannot start a log-server on N3, it has detected that it has lost its data. storage-state is `data-loss`
Restarting the existing node that previously owned this id will also cause it to stop with the same message. Follow these steps to return the initial log server into service without losing its stored log segments.
First, prevent the misconfigured node from starting again until the configuration has been corrected. If this was a brand new node, there should be no data stored on it, and you may delete it altogether.
The reused node id has been marked as having data-loss
. This precaution that tells the Restate control plane to avoid selecting this node as member of new log nodesets. You can view the current status using the restatectl replicated-loglet
tool:
restatectl replicated-loglet servers
Node configuration v21Log chain v6NODE GEN STORAGE-STATE HISTORICAL LOGLETS ACTIVE LOGLETSN1 N1:5 read-write 8 2N2 N2:4 read-write 8 2N3 N3:6 data-loss 6 0
You should also observe that the control plane is now avoiding using this node for log storage. This will result in reduced fault tolerance or even unavailability, depending on the configured minimum log replication:
restatectl logs list
Logs v3└ Logs Provider: replicated├ Log replication: {node: 2}└ Nodeset size: 0L-ID FROM-LSN KIND LOGLET-ID REPLICATION SEQUENCER NODESET0 2 Replicated 0_1 {node: 2} N2:1 [N1, N2]1 2 Replicated 1_1 {node: 2} N2:1 [N1, N2]
To restore the original node's ability to accept writes, we can update its metadata using set-storage-state
subcommand.
Only proceed if you are confident that you understand the reason why the node is in this state, and are certain that its locally stored data is still intact. Since Restate cannot automatically validate that it safe to put this node back into service, we must use the --force
flag to override the default state transition rules.
restatectl replicated-loglet set-storage-state --node-id 3 --storage-state 'read-write' --force
Node N3 storage-state updated from data-loss to read-write
You can validate that the server is once again being used for log storage using logs list
and replicated-loglet servers
subcommands.