Error Handling
Restate handles retries for failed invocations. By default, Restate infinitely retries all errors with an exponential backoff strategy.
This guide helps you fine-tune the retry behavior for your use cases.
Infrastructure errors (transient) vs. application errors (terminal)
In Restate, we distinguish between two types of errors: transient errors and terminal errors.
- Transient errors are temporary and can be retried. They are typically caused by infrastructure issues (network problems, service overload, API unavailability,...).
- Terminal errors are permanent and should not be retried. They are typically caused by application logic (invalid input, business rule violation, ...).
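For example (a minimal TypeScript sketch; chargeCard is a hypothetical call to an external payment API):

async function handler(ctx: restate.Context, order: { amount: number }) {
  // Invalid input is an application error: mark it as terminal so Restate does not retry.
  if (order.amount <= 0) {
    throw new restate.TerminalError("Order amount must be positive");
  }
  // A network hiccup or an unavailable payment API throws a regular error:
  // Restate treats it as transient and retries the invocation.
  await ctx.run("charge", () => chargeCard(order.amount));
}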
Handling transient errors via retries
Restate assumes by default that all errors are transient errors and therefore retryable. If you do not want an error to be retried, you need to specifically label it as a terminal error (see below).
Restate lets you configure the retry strategy at different levels: at the Restate-level (global) and at the run-block-level.
At the Restate-Level (Global)
This defines the default retry policy that will be used for all invocations, unless overridden at the service or run-block level.
You can set the global retry policy in the Restate Server configuration. By default, Restate will use an exponential backoff retry policy:
[worker.invoker.retry-policy]
type = "exponential"       # retry strategy; required
initial-interval = "50ms"  # time between the first and second retry; required
factor = 2.0               # factor used to calculate the next retry interval; required
max-interval = "10s"       # max time between retries; default: unset (=interval keeps increasing)
You can tune this policy to your needs. Note that all durations should follow the humantime format.
You can also use a fixed-delay retry policy:
[worker.invoker.retry-policy]
type = "fixed-delay"  # retry strategy; required
interval = "50ms"     # time between retries; required
max-attempts = "10"   # max number of attempts before terminal error; default: unset (=infinite)
If you set a maximum number of attempts, then the handler will throw a terminal error once the retries are exhausted.
Then run the Restate Server with:
restate-server --config-file restate.toml
Or set it via environment variables, for example:
RESTATE_WORKER__INVOKER__RETRY_POLICY__TYPE=fixed-delay \
RESTATE_WORKER__INVOKER__RETRY_POLICY__INTERVAL=100ms \
restate-server
At the Run-Block-Level
Handlers use run blocks to execute non-deterministic actions, often involving other systems and services (API call, DB write, ...). These run blocks are especially prone to transient failures, and you might want to configure a specific retry policy for them. Most Restate SDKs allow this:
- TypeScript
- Python
- Java
- Kotlin
- Rust
const myRunRetryPolicy = {
  initialRetryIntervalMillis: 500,
  retryIntervalFactor: 2,
  maxRetryIntervalMillis: 1000,
  maxRetryAttempts: 5,
  maxRetryDurationMillis: 1000,
};
await ctx.run("write", () => writeToOtherSystem(), myRunRetryPolicy);
await ctx.run("write", lambda: write_to_other_system(),# Max number of retry attempts to complete the action.max_attempts=3,# Max duration for retrying, across all retries.max_retry_duration=timedelta(seconds=10))
RetryPolicy myRunRetryPolicy =
    RetryPolicy.exponential(Duration.ofMillis(500), 2)
        .setMaxDelay(Duration.ofSeconds(10))
        .setMaxAttempts(10)
        .setMaxDuration(Duration.ofMinutes(5));
ctx.run(myRunRetryPolicy, () -> writeToOtherSystem());
val myRunRetryPolicy = retryPolicy {
  initialDelay = 5.seconds
  exponentiationFactor = 2.0f
  maxDelay = 60.seconds
  maxAttempts = 10
  maxDuration = 5.minutes
}
ctx.runBlock("write", myRunRetryPolicy) { writeToOtherSystem() }
let my_run_retry_policy = RunRetryPolicy::default()
    .initial_delay(Duration::from_millis(100))
    .exponentiation_factor(2.0)
    .max_delay(Duration::from_millis(1000))
    .max_attempts(10)
    .max_duration(Duration::from_secs(10));
ctx.run(|| write_to_other_system())
    .retry_policy(my_run_retry_policy)
    .await?;
Note that these retries are coordinated and initiated by the Restate Server. So the handler goes through the regular retry cycle of suspension and re-invocation.
If you set a maximum number of attempts, then the run block will fail with a TerminalException once the retries are exhausted.
Service-level retry policies are planned and will come soon.
Application errors (terminal)
By default, Restate infinitely retries all errors. In some cases, you might not want to retry an error (e.g. because of business logic, because the issue is not transient, ...).
For these cases you can throw a terminal error. Terminal errors are permanent and are not retried by Restate.
You can throw a terminal error as follows:
- TypeScript
- Python
- Java
- Kotlin
- Go
- Rust
throw new TerminalError("Something went wrong.", { errorCode: 500 });
raise TerminalError("Something went wrong.")
throw new TerminalException(500, "Something went wrong");
throw TerminalException(500, "Something went wrong")
return restate.TerminalError(fmt.Errorf("Something went wrong."), 500)
Err(TerminalError::new("This is a terminal error").into());
You can throw terminal errors from any place in your handler, including run blocks.
Unless caught, terminal errors stop the execution and are propagated back to the caller. If the caller is another Restate service, the terminal error propagates across the RPC and gets thrown at the line where the RPC was made. If it is not caught there, it propagates further up the call stack until it reaches the original caller.
You can catch terminal errors just like any other error and build control flow around this. For example, the catch block can run compensating actions for the steps completed earlier in your handler, to bring it back to a consistent state before rethrowing the terminal error.
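For instance (a minimal TypeScript sketch; MyService and its myHandler are placeholder names), catching a terminal error that propagated from a called service at the RPC call site:

try {
  // If myHandler throws a TerminalError, it resurfaces here at the call site.
  await ctx.serviceClient(MyService).myHandler("hello");
} catch (e) {
  if (e instanceof restate.TerminalError) {
    // Undo earlier actions here, then rethrow to propagate the error further up.
  }
  throw e;
}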
For example, to catch a terminal error of a run block:
- TypeScript
- Python
- Java
- Kotlin
- Go
- Rust
try {
  // Fails with a terminal error after 3 attempts or if the function throws one
  await ctx.run("write", () => writeToOtherSystem(), { maxRetryAttempts: 3 });
} catch (e) {
  if (e instanceof restate.TerminalError) {
    // Handle the terminal error: undo previous actions and
    // propagate the error back to the caller
  }
  throw e;
}
try:
    # Fails with a terminal error after 3 attempts or if the function throws one
    await ctx.run("write", lambda: write_to_other_system(), max_attempts=3)
except TerminalError as err:
    # Handle the terminal error: undo previous actions and
    # propagate the error back to the caller
    raise err
try {
  // Fails with a terminal error after 3 attempts or if the function throws one
  ctx.run(RetryPolicy.defaultPolicy().setMaxAttempts(3), () -> writeToOtherSystem());
} catch (TerminalException e) {
  // Handle the terminal error: undo previous actions and
  // propagate the error back to the caller
}
try {
  // Fails with a terminal error after 3 attempts or if the function throws one
  ctx.runBlock(
      "write",
      RetryPolicy(initialDelay = 500.milliseconds, maxAttempts = 3, exponentiationFactor = 2.0f)
  ) {
    writeToOtherSystem()
  }
} catch (e: TerminalException) {
  // Handle the terminal error: undo previous actions and
  // propagate the error back to the caller
  throw e
}
result, err := restate.Run(ctx, func(ctx restate.RunContext) (string, error) {
  return writeToOtherSystem()
})
if err != nil {
  if restate.IsTerminalError(err) {
    // Handle the terminal error: undo previous actions and
    // propagate the error back to the caller
  }
  return err
}
// Fails with a terminal error after 3 attempts or if the function throws one
if let Err(e) = ctx
    .run(|| write_to_other_system())
    .retry_policy(RunRetryPolicy::default().max_attempts(3))
    .await
{
    // Handle the terminal error: undo previous actions and
    // propagate the error back to the caller
    return Err(e);
}
When you throw a terminal error, you might need to undo the actions you did earlier in your handler to make sure that your system remains in a consistent state. Have a look at our sagas guide to learn more.
Cancellations are Terminal Errors
You can cancel invocations via the CLI, UI and programmatically.
When you cancel an invocation, a terminal error gets thrown in the handler processing the invocation the next time it awaits a Promise or Future of a Restate Context action (e.g. run block, RPC, sleep, ...; a CombineablePromise in TypeScript, an Awaitable in Java).
Unless caught, this terminal error will propagate up the call stack until it reaches the original caller.
Here again, the handler needs compensation logic in place to make sure the system remains in a consistent state when you cancel an invocation.
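As a minimal TypeScript sketch (undoReservation is a hypothetical compensation helper; the sleep stands in for any awaited context action):

try {
  // A cancellation surfaces as a TerminalError at an awaited context action, e.g. this sleep.
  await ctx.sleep(24 * 60 * 60 * 1000);
} catch (e) {
  if (e instanceof restate.TerminalError) {
    // Undo earlier side effects before letting the error propagate.
    await ctx.run("undo", () => undoReservation());
  }
  throw e;
}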
Timeouts between Restate and the service
There are two types of timeouts describing the behavior between Restate and the service.
Inactivity timeout
The inactivity timeout guards against stalled service/handler invocations. When the Restate Server does not receive the next journal entry from a running handler within this timeout, it triggers a graceful termination by asking the invocation to suspend (which preserves intermediate progress).
By default, the inactivity timeout is set to one minute.
You can increase the inactivity timeout if you have long-running ctx.run blocks that lead to long pauses between journal entries. Otherwise, this timeout might kill the ongoing execution.
Abort timeout
The abort timeout guards against stalled service/handler invocations that are supposed to terminate. It starts after the inactivity timeout has expired and the invocation has been asked to gracefully terminate. Once it expires, Restate aborts the service/handler invocation.
By default, the abort timeout is set to one minute. This timer potentially interrupts user code. If the user code needs longer to gracefully terminate, then this value needs to be set accordingly.
If you have long-running ctx.run blocks, you need to increase both timeouts to prevent the handler from terminating prematurely.
Configuring the timeouts
You can set these timeouts via the UI, the CLI, or the Restate Server configuration.
Via the CLI:
restate services config edit <SERVICE>
Then you can adapt the configuration file and save it for the new settings to take effect.
Via the Restate Server Configuration:
[worker.invoker]
inactivity-timeout = "1m"
abort-timeout = "1m"
restate-server --config-file restate.toml
Both timeouts follow the humantime format.
Or set it via environment variables, for example:
RESTATE_WORKER__INVOKER__INACTIVITY_TIMEOUT=5m \
RESTATE_WORKER__INVOKER__ABORT_TIMEOUT=5m \
restate-server
Common patterns
These are some common patterns for handling errors in Restate:
Sagas
Have a look at the sagas guide to learn how to revert your system back to a consistent state after a terminal error. Keep track of compensating actions throughout your business logic and apply them in the catch block after a terminal error.
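A minimal TypeScript sketch of the pattern (bookFlight/cancelFlight, bookHotel/cancelHotel, and bookingId are hypothetical): register an undo action before each step, and run the registered actions in reverse order when a terminal error occurs.

const compensations: (() => Promise<unknown>)[] = [];
try {
  // Register the undo action before executing each step.
  compensations.push(() => ctx.run("cancel flight", () => cancelFlight(bookingId)));
  await ctx.run("book flight", () => bookFlight(bookingId));

  compensations.push(() => ctx.run("cancel hotel", () => cancelHotel(bookingId)));
  await ctx.run("book hotel", () => bookHotel(bookingId));
} catch (e) {
  if (e instanceof restate.TerminalError) {
    // Run the compensations in reverse order, then rethrow the terminal error.
    for (const compensate of compensations.reverse()) {
      await compensate();
    }
  }
  throw e;
}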
Dead-letter queue
A dead-letter queue (DLQ) is a queue where you can send messages that could not be processed due to errors.
You can implement this in Restate by wrapping your handler in a try-catch block. In the catch block, you can forward the failed invocation to a DLQ Kafka topic or to a catch-all handler that, for example, reports or backs up the failed requests.
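A minimal TypeScript sketch of this wrapping, assuming a hypothetical DeadLetterQueue service with a handle handler and a placeholder MyRequest type (ctx.serviceSendClient performs a one-way send):

myHandler: async (ctx: restate.Context, request: MyRequest) => {
  try {
    // ... your business logic ...
  } catch (e) {
    if (e instanceof restate.TerminalError) {
      // Forward the failed request to the catch-all handler / DLQ producer.
      ctx.serviceSendClient(DeadLetterQueue).handle({ request, reason: e.message });
    }
    throw e;
  }
},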
Catching failed invocations before handler execution starts
Some errors might happen before the handler code gets invoked/starts running (e.g. service does not exist, request decoding errors in SDK HTTP server, ...).
By default, Restate fails these requests with a 400 response.
Handle these as follows:
- In case the caller waited for the response of the failed call, the caller can handle the propagation to the DLQ.
- If the caller did not wait for the response (one-way send), you would lose these messages.
- Decoding errors can be caught by doing the decoding inside the handler. The called handler then takes raw input and does the decoding and validation itself. In this case, it would be included in the try-catch block which would do the dispatching:
- TypeScript
- Python
- Java
- Kotlin
- Go
- Rust
myHandler: async (ctx: restate.Context) => {
  try {
    const rawRequest = ctx.request().body;
    const decodedRequest = decodeRequest(rawRequest);
    // ... rest of your business logic ...
  } catch (e) {
    if (e instanceof restate.TerminalError) {
      // Propagate to DLQ/catch-all handler
    }
    throw e;
  }
},

@my_service.handler()
async def my_handler(ctx: Context):
    try:
        raw_request = ctx.request().body
        decoded_request = decode_request(raw_request)
        # ... rest of your business logic ...
    except TerminalError as err:
        # Propagate to DLQ/catch-all handler
        raise err

@Handler
public void myHandler(Context ctx, @Accept("*/*") @Raw byte[] request) {
  try {
    var decodedRequest = decodeRequest(request);
    // ... rest of your business logic ...
  } catch (TerminalException e) {
    // Propagate to DLQ/catch-all handler
  }
}

@Handler
suspend fun myHandler(ctx: Context, @Accept("*/*") @Raw request: ByteArray) {
  try {
    val decodedRequest = decodeRequest(request)
    // ... rest of your business logic ...
  } catch (e: TerminalException) {
    // Propagate to DLQ/catch-all handler
    throw e
  }
}

func (MyService) myHandler(ctx restate.Context) (string, error) {
  rawRequest := ctx.Request().Body
  decodedRequest, err := decodeRequest(rawRequest)
  if err != nil {
    if restate.IsTerminalError(err) {
      // Propagate to DLQ/catch-all handler
    }
    return "", err
  }
  // ... rest of your business logic ...
  return decodedRequest, nil
}

// Use Vec<u8> to represent a binary request
async fn my_handler(&self, ctx: Context<'_>, request: Vec<u8>) -> Result<(), HandlerError> {
    let decoded_request = decode_request(&request).map_err(|e| {
        // Propagate to DLQ/catch-all handler
        e
    })?;
    // ... rest of your business logic ...
    Ok(())
}

The other errors mainly occur due to misconfiguration of your setup (e.g. wrong service name, wrong handler name, forgotten service registration, ...). You cannot handle those.
Timeouts for context actions
You can set timeouts for context actions like calls, awakeables, etc. to bound the time they take:
- TypeScript
- Java
- Kotlin
- Go
try {
  // If the timeout hits first, it throws a `TimeoutError`.
  // If you do not catch it, it will lead to a retry.
  await ctx.serviceClient(MyService).myHandler("hello").orTimeout(5000);

  const { id, promise } = ctx.awakeable();
  // do something that will trigger the awakeable
  await promise.orTimeout(5000);
} catch (e) {
  if (e instanceof restate.TimeoutError) {
    // Handle the timeout error
  }
  throw e;
}
try {
  // If the timeout hits first, it throws a `TimeoutException`.
  // If you do not catch it, it will lead to a retry.
  MyServiceClient.fromContext(ctx).myHandler("Hello").await(Duration.ofSeconds(5));

  var awakeable = ctx.awakeable(JsonSerdes.BOOLEAN);
  // ... do something that will trigger the awakeable
  awakeable.await(Duration.ofSeconds(5));
} catch (TimeoutException e) {
  // Handle the timeout error
}
val awakeable = ctx.awakeable<String>()
// do something that will trigger the awakeable
val timeout = ctx.timer(5.seconds)
try {
  val result = select {
    awakeable.onAwait { it }
    timeout.onAwait { throw TimeoutException() }
  }
} catch (e: TimeoutException) {
  // Handle the timeout
}

val callAwaitable = MyServiceClient.fromContext(ctx).myHandler("Hello")
val callTimeout = ctx.timer(5.seconds)
try {
  val result = select {
    callAwaitable.onAwait { it }
    callTimeout.onAwait { throw TimeoutException() }
  }
} catch (e: TimeoutException) {
  // Handle the timeout
}
// ... rest of your business logic ...
awakeable := restate.Awakeable[string](ctx)
timeout := restate.After(ctx, 5*time.Second)

selector := restate.Select(ctx, awakeable, timeout)
switch selector.Select() {
case awakeable:
  result, err := awakeable.Result()
  if err != nil {
    return err
  }
  slog.Info("Awakeable resolved first with: " + result)
case timeout:
  if err := timeout.Done(); err != nil {
    return err
  }
  slog.Info("Timeout hit first")
}