Error Handling
Infrastructure errors (transient) vs. application errors (terminal)
Handling transient errors via retries
Levels at which retries happen (durable code block, service)
Configuring retries with policies (interval, upper bounds)
Configuring retries with policies
You can influence the retry behavior of your service invocations by configuring the retry policy.
This can be set at different levels: at the Restate-level (global), at the service-level, and at the run-block-level.
At the Restate-level (Global)
This is the default retry policy used for all invocations, unless it is overridden at the service level or the run-block level.
You can set the global retry policy in the Restate Server configuration.
By default, Restate will use an exponential backoff retry policy:
[worker.invoker.retry-policy]
type = "exponential" # retry strategy; required
initial-interval = "50ms" # time between the first and second retry; required
factor = 2.0 # factor used to calculate the next retry interval; required
max-interval = "10s" # max time between retries; default: unset (=interval keeps increasing)
max-attempts = "10" # max number of attempts before terminal error; default: unset (=infinite)
You can tune this policy to your needs.
Note that all durations should follow the humantime format.
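To get a feel for how these settings interact, the sketch below computes the waits between attempts for a given policy. It is an illustration only: the function `backoff_schedule` is hypothetical, and it assumes that each wait is the previous one multiplied by `factor`, capped at `max-interval`, and that N attempts imply N - 1 waits. The server's exact accounting may differ.

```python
def backoff_schedule(initial_ms, factor, max_interval_ms=None, max_attempts=None, limit=12):
    """Return the assumed waits (in ms) between consecutive attempts.

    `limit` only bounds the output when max_attempts is unset (infinite retries).
    """
    # Assumption: N attempts are separated by N - 1 waits.
    count = limit if max_attempts is None else max(max_attempts - 1, 0)
    delays = []
    delay = float(initial_ms)
    for _ in range(count):
        # Cap the wait at max-interval if one is configured.
        capped = delay if max_interval_ms is None else min(delay, max_interval_ms)
        delays.append(capped)
        delay *= factor  # grow the next wait by the configured factor
    return delays


# Values taken from the exponential policy shown above.
schedule = backoff_schedule(50, 2.0, max_interval_ms=10_000, max_attempts=10)
print(schedule)
```

Under these assumptions the waits double from 50 ms until they hit the 10 s cap, so a policy with ten attempts gives up after roughly 23 seconds of accumulated waiting.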
You can also use a fixed-delay retry policy:
[worker.invoker.retry-policy]
type = "fixed-delay" # retry strategy; required
interval = "50ms" # time between retries; required
max-attempts = "10" # max number of attempts before terminal error; default: unset (=infinite)
If you set a maximum number of attempts, then the handler will throw a TerminalException once the retries are exhausted.
Then run the Restate Server with:
restate-server --config-file restate.toml
Or set it via environment variables, for example:
RESTATE_WORKER__INVOKER__RETRY_POLICY__TYPE=fixed-delay \
RESTATE_WORKER__INVOKER__RETRY_POLICY__INTERVAL=100ms \
restate-server
At the Service-Level
Coming soon!
At the Handler-level
Handler-level retry policy configuration does not exist and is not planned.
At the Run-block-level
Handlers use run blocks to execute actions involving other systems and services (API call, DB write, ...). These run blocks are especially prone to transient failures, and you might want to configure a specific retry policy for them.
Most Restate SDKs allow you to configure the retry policy for a run-block.
Note that these retries are coordinated and initiated by the Restate Server. So the handler goes through the regular retry cycle outlined above.
If you set a maximum number of attempts, then the ctx.run block will fail with a TerminalException once the retries are exhausted.
When you throw a terminal error, you need to undo the actions you did earlier in your handler to make sure that your system remains in a consistent state. Have a look at our Sagas guide to learn more.
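The compensation idea can be sketched in plain Python, without any SDK. Everything here is a stand-in: `TerminalError` is a local class that mimics the SDK's terminal error type, and `book_trip` with its three steps is a hypothetical handler. The sketch registers an undo action after each successful step and, when a terminal error occurs, runs the undo actions in reverse order before re-raising.

```python
class TerminalError(Exception):
    """Local stand-in for an SDK's terminal error type (hypothetical)."""


def book_trip(log, fail_at=None):
    """Book three resources; undo the completed ones if a step fails terminally."""
    compensations = []
    try:
        for name in ("flight", "hotel", "car"):
            if name == fail_at:
                raise TerminalError(f"{name} booking rejected")
            log.append(f"book {name}")
            # Remember how to undo this step; n=name binds the current value.
            compensations.append(lambda n=name: log.append(f"cancel {n}"))
    except TerminalError:
        # Undo in reverse order so later steps are rolled back first.
        for undo in reversed(compensations):
            undo()
        raise  # still report the failure to the caller


log = []
try:
    book_trip(log, fail_at="car")
except TerminalError as err:
    print("trip failed:", err)
print(log)
```

A real handler would typically wrap each step and each compensation in a run block so that they are executed durably. The Sagas guide covers that variant.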
Application errors (terminal)
- throwing from handlers or code blocks
- catching, handling, re-throwing --> basically show it works exactly like in a normal program
- Terminal errors propagate across RPCs (compare to error bubble up / stack unwind)
By default, Restate retries all errors indefinitely. In some cases, you might not want to retry an error (for example, because business logic rejects the request or because the issue is not transient).
The SDK lets you signal this by throwing or returning a terminal error. A terminal error is a Restate-specific error that is not retried and is treated as a permanent failure of the invocation (see your SDK's documentation for the exact syntax). Terminal errors are also proxied back to the client.
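The difference between the two error classes can be shown with a small, SDK-free model of a retry loop. This is a sketch under stated assumptions, not Restate's implementation: `TerminalError` is a local stand-in class, `invoke_with_retries` is a hypothetical helper, and the wait between attempts is omitted.

```python
class TerminalError(Exception):
    """Local stand-in for an SDK's terminal error type (hypothetical)."""


def invoke_with_retries(handler, max_attempts=5):
    """Call handler(attempt) until it succeeds, fails terminally, or runs out of attempts."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return handler(attempt)
        except TerminalError:
            raise  # permanent failure: surface it to the caller, never retry
        except Exception as err:
            if attempt >= max_attempts:
                # Exhausted retries are reported as a terminal failure.
                raise TerminalError(f"retries exhausted after {attempt} attempts") from err
            # A real runtime would wait for the configured retry interval here.


calls = {"flaky": 0, "rejected": 0}


def flaky(attempt):
    calls["flaky"] += 1
    if attempt < 3:
        raise ConnectionError("downstream unavailable")  # transient, worth retrying
    return "ok"


def rejected(attempt):
    calls["rejected"] += 1
    raise TerminalError("card declined")  # business failure, retrying cannot help


result = invoke_with_retries(flaky)

try:
    invoke_with_retries(rejected)
    outcome = "succeeded"
except TerminalError as err:
    outcome = f"failed: {err}"

print(result, calls, outcome)
```

The transient failure is retried until the handler succeeds, while the terminal error reaches the caller after a single attempt.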
Timeouts to bound response times
Inactivity timeout
Default 1 minute
This timer guards against stalled service/handler invocations. Once it expires, Restate triggers a graceful termination by asking the service invocation to suspend (which preserves intermediate progress).
The 'abort timeout' is used to abort the invocation, in case it doesn't react to the request to suspend.
The inactivity timeout can be configured using the humantime format.
Abort timeout
Default 1 minute
This timer guards against stalled service/handler invocations that are supposed to terminate. The abort timeout is started after the 'inactivity timeout' has expired and the service/handler invocation has been asked to gracefully terminate. Once the timer expires, it will abort the service/handler invocation.
This timer potentially interrupts user code. If the user code needs longer to gracefully terminate, then this value needs to be set accordingly.
The abort timeout can be configured using the humantime format.
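The interplay of the two timers can be modelled as a two-phase shutdown. The sketch below uses `asyncio` and is a deliberate simplification: the hypothetical `supervise` function treats "inactive" as "has not finished yet", whereas the real inactivity timer tracks stalled invocations, and all function names are illustrative.

```python
import asyncio


async def supervise(work, inactivity_timeout, abort_timeout):
    """Run work; request suspension after one timeout, abort after the next."""
    suspend_requested = asyncio.Event()
    task = asyncio.create_task(work(suspend_requested))

    # Phase 1: let the invocation run until the inactivity timeout fires.
    done, _ = await asyncio.wait({task}, timeout=inactivity_timeout)
    if done:
        return "completed"

    # Phase 2: ask for a graceful suspension and start the abort timeout.
    suspend_requested.set()
    done, _ = await asyncio.wait({task}, timeout=abort_timeout)
    if done:
        return "suspended"

    # The invocation did not react in time: interrupt the user code.
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return "aborted"


async def quick(suspend_requested):
    await asyncio.sleep(0)  # finishes well within the inactivity timeout


async def cooperative(suspend_requested):
    await suspend_requested.wait()  # stalls, but stops as soon as it is asked to


async def stubborn(suspend_requested):
    await asyncio.sleep(10)  # ignores the suspension request


async def main():
    return {
        "quick": await supervise(quick, 0.05, 0.05),
        "cooperative": await supervise(cooperative, 0.05, 0.05),
        "stubborn": await supervise(stubborn, 0.05, 0.05),
    }


outcomes = asyncio.run(main())
print(outcomes)
```

Only the last case interrupts user code, which is why the abort timeout should be long enough for the slowest graceful termination you expect.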
Cancellations are Terminal Errors
If you handle Terminal Errors, you automatically handle cancellation signals
Common patterns
- catch and apply compensation
- dead-letter-queue (catch-all wrapper)
- rpc-or-timeout
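Two of these patterns, rpc-or-timeout and the dead-letter-queue wrapper, can be sketched together in plain `asyncio`. The names `call_or_timeout`, `with_dead_letter`, and `slow_service` are hypothetical, `TerminalError` is a local stand-in for the SDK's terminal error type, and the in-memory list stands in for a real dead-letter queue.

```python
import asyncio


class TerminalError(Exception):
    """Local stand-in for an SDK's terminal error type (hypothetical)."""


async def call_or_timeout(coro, seconds):
    """rpc-or-timeout: turn a missing response into a terminal failure."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        raise TerminalError(f"no response within {seconds}s") from None


async def with_dead_letter(handler, request, dead_letters):
    """Catch-all wrapper: park terminally failed requests for later inspection."""
    try:
        return await handler(request)
    except TerminalError as err:
        dead_letters.append((request, str(err)))
        return None


async def slow_service(request):
    await asyncio.sleep(1.0)  # simulates a callee that answers too late
    return f"done {request}"


async def handler(request):
    return await call_or_timeout(slow_service(request), 0.05)


async def main():
    dead_letters = []
    result = await with_dead_letter(handler, "order-1", dead_letters)
    return result, dead_letters


result, dead_letters = asyncio.run(main())
print(result, dead_letters)
```

Because the wrapper only catches terminal errors, transient failures would still go through the normal retry cycle before anything lands in the dead-letter list.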