Error Handling
Restate handles retries for failed invocations. By default, Restate infinitely retries all errors with an exponential backoff strategy.
This guide helps you fine-tune the retry behavior for your use cases.
Infrastructure errors (transient) vs. application errors (terminal)
In Restate, we distinguish between two types of errors: transient errors and terminal errors.
- Transient errors are temporary and can be retried. They are typically caused by infrastructure issues (network problems, service overload, API unavailability,...).
- Terminal errors are permanent and should not be retried. They are typically caused by application logic (invalid input, business rule violation, ...).
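For example (a minimal TypeScript sketch; chargeCard is a hypothetical call to an external payment API):

async function handler(ctx: restate.Context, order: { amount: number }) {
  // Invalid input is an application error: mark it as terminal so Restate does not retry.
  if (order.amount <= 0) {
    throw new restate.TerminalError("Order amount must be positive");
  }
  // A network hiccup or an unavailable payment API throws a regular error:
  // Restate treats it as transient and retries the invocation.
  await ctx.run("charge", () => chargeCard(order.amount));
}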
Handling transient errors via retries
Restate assumes by default that all errors are transient errors and therefore retryable. If you do not want an error to be retried, you need to specifically label it as a terminal error (see below).
Restate lets you configure the retry strategy at different levels: at the Restate-level (global) and at the run-block-level.
At the Restate-Level (Global)
This defines the default retry policy that will be used for all invocations, unless overridden at the service or run-block level.
You can set the global retry policy in the Restate Server configuration. By default, Restate will use an exponential backoff retry policy:
[worker.invoker.retry-policy]
type = "exponential"       # retry strategy; required
initial-interval = "50ms"  # time between the first and second retry; required
factor = 2.0               # factor used to calculate the next retry interval; required
max-interval = "10s"       # max time between retries; default: unset (=interval keeps increasing)
You can tune this policy to your needs. Note that all durations should follow the humantime format.
You can also use a fixed-delay retry policy:
[worker.invoker.retry-policy]
type = "fixed-delay"  # retry strategy; required
interval = "50ms"     # time between retries; required
max-attempts = "10"   # max number of attempts before terminal error; default: unset (=infinite)
If you set a maximum number of attempts, then the handler will throw a terminal error once the retries are exhausted.
Then run the Restate Server with:
restate-server --config-file restate.toml
Or set it via environment variables, for example:
RESTATE_WORKER__INVOKER__RETRY_POLICY__TYPE=fixed-delay \
RESTATE_WORKER__INVOKER__RETRY_POLICY__INTERVAL=100ms \
restate-server
At the Run-Block-Level
Handlers use run blocks to execute non-deterministic actions, often involving other systems and services (API call, DB write, ...). These run blocks are especially prone to transient failures, and you might want to configure a specific retry policy for them. Most Restate SDKs allow this:
- TypeScript
- Python
- Java
- Kotlin
- Rust
const myRunRetryPolicy = {
  initialRetryIntervalMillis: 500,
  retryIntervalFactor: 2,
  maxRetryIntervalMillis: 1000,
  maxRetryAttempts: 5,
  maxRetryDurationMillis: 1000,
};
await ctx.run("write", () => writeToOtherSystem(), myRunRetryPolicy);
await ctx.run("write", lambda: write_to_other_system(),# Max number of retry attempts to complete the action.max_attempts=3,# Max duration for retrying, across all retries.max_retry_duration=timedelta(seconds=10))
RetryPolicy myRunRetryPolicy =
    RetryPolicy.exponential(Duration.ofMillis(500), 2)
        .setMaxDelay(Duration.ofSeconds(10))
        .setMaxAttempts(10)
        .setMaxDuration(Duration.ofMinutes(5));
ctx.run(myRunRetryPolicy, () -> writeToOtherSystem());
val myRunRetryPolicy = retryPolicy {
  initialDelay = 5.seconds
  exponentiationFactor = 2.0f
  maxDelay = 60.seconds
  maxAttempts = 10
  maxDuration = 5.minutes
}
ctx.runBlock("write", myRunRetryPolicy) { writeToOtherSystem() }
let my_run_retry_policy = RunRetryPolicy::default()
    .initial_delay(Duration::from_millis(100))
    .exponentiation_factor(2.0)
    .max_delay(Duration::from_millis(1000))
    .max_attempts(10)
    .max_duration(Duration::from_secs(10));
ctx.run(|| write_to_other_system())
    .retry_policy(my_run_retry_policy)
    .await?;
Note that these retries are coordinated and initiated by the Restate Server. So the handler goes through the regular retry cycle of suspension and re-invocation.
If you set a maximum number of attempts, then the run block will fail with a TerminalException once the retries are exhausted.
Service-level retry policies are planned and will come soon.
Application errors (terminal)
By default, Restate infinitely retries all errors. In some cases, you might not want to retry an error (e.g. because of business logic, because the issue is not transient, ...).
For these cases you can throw a terminal error. Terminal errors are permanent and are not retried by Restate.
You can throw a terminal error as follows:
- TypeScript
- Python
- Java
- Kotlin
- Go
- Rust
throw new TerminalError("Something went wrong.", { errorCode: 500 });
raise TerminalError("Something went wrong.")
throw new TerminalException(500, "Something went wrong");
throw TerminalException(500, "Something went wrong")
return restate.TerminalError(fmt.Errorf("Something went wrong."), 500)
Err(TerminalError::new("This is a terminal error").into());
You can throw terminal errors from any place in your handler, including run blocks.
Unless caught, terminal errors stop the execution and are propagated back to the caller. If the caller is another Restate service, the terminal error propagates across the RPC and gets thrown at the line where the RPC was made. If it is not caught there, it propagates further up the call stack until it reaches the original caller.
You can catch terminal errors just like any other error and build control flow around this. For example, the catch block can run compensating actions for the steps completed earlier in your handler, to bring it back to a consistent state before rethrowing the terminal error.
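For instance (a minimal TypeScript sketch; MyService and its myHandler are placeholder names), catching a terminal error that propagated from a called service at the RPC call site:

try {
  // If myHandler throws a TerminalError, it resurfaces here at the call site.
  await ctx.serviceClient(MyService).myHandler("hello");
} catch (e) {
  if (e instanceof restate.TerminalError) {
    // Undo earlier actions here, then rethrow to propagate the error further up.
  }
  throw e;
}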
For example, to catch a terminal error of a run block:
- TypeScript
- Python
- Java
- Kotlin
- Go
- Rust
try {
  // Fails with a terminal error after 3 attempts or if the function throws one
  await ctx.run("write", () => writeToOtherSystem(), { maxRetryAttempts: 3 });
} catch (e) {
  if (e instanceof restate.TerminalError) {
    // Handle the terminal error: undo previous actions and
    // propagate the error back to the caller
  }
  throw e;
}
try:
    # Fails with a terminal error after 3 attempts or if the function throws one
    await ctx.run("write", lambda: write_to_other_system(), max_attempts=3)
except TerminalError as err:
    # Handle the terminal error: undo previous actions and
    # propagate the error back to the caller
    raise err
try {
  // Fails with a terminal error after 3 attempts or if the function throws one
  ctx.run(RetryPolicy.defaultPolicy().setMaxAttempts(3), () -> writeToOtherSystem());
} catch (TerminalException e) {
  // Handle the terminal error: undo previous actions and
  // propagate the error back to the caller
}
try {
  // Fails with a terminal error after 3 attempts or if the function throws one
  ctx.runBlock(
      "write",
      RetryPolicy(initialDelay = 500.milliseconds, maxAttempts = 3, exponentiationFactor = 2.0f)
  ) {
    writeToOtherSystem()
  }
} catch (e: TerminalException) {
  // Handle the terminal error: undo previous actions and
  // propagate the error back to the caller
  throw e
}
result, err := restate.Run(ctx, func(ctx restate.RunContext) (string, error) {
  return writeToOtherSystem()
})
if err != nil {
  if restate.IsTerminalError(err) {
    // Handle the terminal error: undo previous actions and
    // propagate the error back to the caller
  }
  return err
}
// Fails with a terminal error after 3 attempts or if the function throws one
if let Err(e) = ctx
    .run(|| write_to_other_system())
    .retry_policy(RunRetryPolicy::default().max_attempts(3))
    .await
{
    // Handle the terminal error: undo previous actions and
    // propagate the error back to the caller
    return Err(e);
}
When you throw a terminal error, you might need to undo the actions you did earlier in your handler to make sure that your system remains in a consistent state. Have a look at our sagas guide to learn more.
Cancellations are Terminal Errors
You can cancel invocations via the CLI, UI and programmatically.
When you cancel an invocation, a terminal error gets thrown in the handler processing the invocation the next time it awaits a Promise or Future of a Restate Context action (e.g. run block, RPC, sleep, ...; a CombineablePromise in TypeScript, an Awaitable in Java).
Unless caught, this terminal error will propagate up the call stack until it reaches the original caller.
Here again, the handler needs compensation logic in place to make sure the system remains in a consistent state when you cancel an invocation.
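As a minimal TypeScript sketch (undoReservation is a hypothetical compensation helper; the sleep stands in for any awaited context action):

try {
  // A cancellation surfaces as a TerminalError at an awaited context action, e.g. this sleep.
  await ctx.sleep(24 * 60 * 60 * 1000);
} catch (e) {
  if (e instanceof restate.TerminalError) {
    // Undo earlier side effects before letting the error propagate.
    await ctx.run("undo", () => undoReservation());
  }
  throw e;
}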
Timeouts between Restate and the service
There are two types of timeouts describing the behavior between Restate and the service.
Inactivity timeout
The inactivity timeout guards against stalled service/handler invocations. When the Restate Server does not receive the next journal entry from a running handler within this timeout, it triggers a graceful termination by asking the invocation to suspend (which preserves intermediate progress).
By default, the inactivity timeout is set to one minute.
You can increase the inactivity timeout if you have long-running ctx.run blocks that lead to long pauses between journal entries. Otherwise, this timeout might kill the ongoing execution.
Abort timeout
The abort timeout guards against stalled service/handler invocations that are supposed to terminate. It starts after the inactivity timeout has expired and the invocation has been asked to gracefully terminate. Once it expires, Restate aborts the service/handler invocation.
By default, the abort timeout is set to one minute. This timer potentially interrupts user code. If the user code needs longer to gracefully terminate, then this value needs to be set accordingly.
If you have long-running ctx.run blocks, you need to increase both timeouts to prevent the handler from terminating prematurely.
Configuring the timeouts
You can set these timeouts via the UI, the CLI, or the Restate Server configuration.
Via the CLI:
restate services config edit <SERVICE>
Then you can adapt the configuration file and save it for the new settings to take effect.
Via the Restate Server Configuration:
[worker.invoker]
inactivity-timeout = "1m"
abort-timeout = "1m"
restate-server --config-file restate.toml
Both timeouts follow the humantime format.
Or set it via environment variables, for example:
RESTATE_WORKER__INVOKER__INACTIVITY_TIMEOUT=5m \
RESTATE_WORKER__INVOKER__ABORT_TIMEOUT=5m \
restate-server
Common patterns
These are some common patterns for handling errors in Restate:
Sagas
Have a look at the sagas guide to learn how to revert your system back to a consistent state after a terminal error. Keep track of compensating actions throughout your business logic and apply them in the catch block after a terminal error.
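A minimal TypeScript sketch of the pattern (bookFlight/cancelFlight, bookHotel/cancelHotel, and bookingId are hypothetical): register an undo action before each step, and run the registered actions in reverse order when a terminal error occurs.

const compensations: (() => Promise<unknown>)[] = [];
try {
  // Register the undo action before executing each step.
  compensations.push(() => ctx.run("cancel flight", () => cancelFlight(bookingId)));
  await ctx.run("book flight", () => bookFlight(bookingId));

  compensations.push(() => ctx.run("cancel hotel", () => cancelHotel(bookingId)));
  await ctx.run("book hotel", () => bookHotel(bookingId));
} catch (e) {
  if (e instanceof restate.TerminalError) {
    // Run the compensations in reverse order, then rethrow the terminal error.
    for (const compensate of compensations.reverse()) {
      await compensate();
    }
  }
  throw e;
}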
Dead-letter queue
A dead-letter queue (DLQ) is a queue where you can send messages that could not be processed due to errors.
You can implement this in Restate by wrapping your handler in a try-catch block. In the catch block, you can forward the failed invocation to a DLQ Kafka topic or to a catch-all handler that, for example, reports or backs up the failed requests.
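A minimal TypeScript sketch of this wrapping, assuming a hypothetical DeadLetterQueue service with a handle handler and a placeholder MyRequest type (ctx.serviceSendClient performs a one-way send):

myHandler: async (ctx: restate.Context, request: MyRequest) => {
  try {
    // ... your business logic ...
  } catch (e) {
    if (e instanceof restate.TerminalError) {
      // Forward the failed request to the catch-all handler / DLQ producer.
      ctx.serviceSendClient(DeadLetterQueue).handle({ request, reason: e.message });
    }
    throw e;
  }
},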
Catching failed invocations before handler execution starts
Some errors might happen before the handler code gets invoked/starts running (e.g. service does not exist, request decoding errors in SDK HTTP server, ...).
By default, Restate fails these requests with a 400 response.
Handle these as follows:
- In case the caller waited for the response of the failed call, the caller can handle the propagation to the DLQ.
- If the caller did not wait for the response (one-way send), you would lose these messages.
- Decoding errors can be caught by doing the decoding inside the handler. The called handler then takes raw input and does the decoding and validation itself. In this case, it would be included in the try-catch block which would do the dispatching:
- TypeScript
- Python
- Java
- Kotlin
- Go
- Rust
myHandler: async (ctx: restate.Context) => {
  try {
    const rawRequest = ctx.request().body;
    const decodedRequest = decodeRequest(rawRequest);
    // ... rest of your business logic ...
  } catch (e) {
    if (e instanceof restate.TerminalError) {
      // Propagate to DLQ/catch-all handler
    }
    throw e;
  }
},

@my_service.handler()
async def my_handler(ctx: Context):
    try:
        raw_request = ctx.request().body
        decoded_request = decode_request(raw_request)
        # ... rest of your business logic ...
    except TerminalError as err:
        # Propagate to DLQ/catch-all handler
        raise err

@Handler
public void myHandler(Context ctx, @Accept("*/*") @Raw byte[] request) {
  try {
    var decodedRequest = decodeRequest(request);
    // ... rest of your business logic ...
  } catch (TerminalException e) {
    // Propagate to DLQ/catch-all handler
  }
}

@Handler
suspend fun myHandler(ctx: Context, @Accept("*/*") @Raw request: ByteArray) {
  try {
    val decodedRequest = decodeRequest(request)
    // ... rest of your business logic ...
  } catch (e: TerminalException) {
    // Propagate to DLQ/catch-all handler
    throw e
  }
}

func (MyService) myHandler(ctx restate.Context) (string, error) {
  rawRequest := ctx.Request().Body
  decodedRequest, err := decodeRequest(rawRequest)
  if err != nil {
    if restate.IsTerminalError(err) {
      // Propagate to DLQ/catch-all handler
    }
    return "", err
  }
  // ... rest of your business logic ...
  return decodedRequest, nil
}

// Use Vec<u8> to represent a binary request
async fn my_handler(&self, ctx: Context<'_>, request: Vec<u8>) -> Result<(), HandlerError> {
    let decoded_request = decode_request(&request).map_err(|e| {
        // Propagate to DLQ/catch-all handler
        e
    })?;
    // ... rest of your business logic ...
    Ok(())
}

The other errors mainly occur due to misconfiguration of your setup (e.g. wrong service name, wrong handler name, forgotten service registration, ...). You cannot handle those.
Timeouts for context actions
You can set timeouts for context actions like calls, awakeables, etc. to bound the time they take:
- TypeScript
- Java
- Kotlin
- Go
try {
  // If the timeout hits first, it throws a `TimeoutError`.
  // If you do not catch it, it will lead to a retry.
  await ctx.serviceClient(MyService).myHandler("hello").orTimeout(5000);

  const { id, promise } = ctx.awakeable();
  // do something that will trigger the awakeable
  await promise.orTimeout(5000);
} catch (e) {
  if (e instanceof restate.TimeoutError) {
    // Handle the timeout error
  }
  throw e;
}
try {
  // If the timeout hits first, it throws a `TimeoutException`.
  // If you do not catch it, it will lead to a retry.
  MyServiceClient.fromContext(ctx).myHandler("Hello").await(Duration.ofSeconds(5));

  var awakeable = ctx.awakeable(JsonSerdes.BOOLEAN);
  // ... do something that will trigger the awakeable
  awakeable.await(Duration.ofSeconds(5));
} catch (TimeoutException e) {
  // Handle the timeout error
}
val awakeable = ctx.awakeable<String>()
// do something that will trigger the awakeable
val timeout = ctx.timer(5.seconds)
try {
  val result = select {
    awakeable.onAwait { it }
    timeout.onAwait { throw TimeoutException() }
  }
} catch (e: TimeoutException) {
  // Handle the timeout
}

val callAwaitable = MyServiceClient.fromContext(ctx).myHandler("Hello")
val callTimeout = ctx.timer(5.seconds)
try {
  val result = select {
    callAwaitable.onAwait { it }
    callTimeout.onAwait { throw TimeoutException() }
  }
} catch (e: TimeoutException) {
  // Handle the timeout
}
// ... rest of your business logic ...
awakeable := restate.Awakeable[string](ctx)
timeout := restate.After(ctx, 5*time.Second)

selector := restate.Select(ctx, awakeable, timeout)
switch selector.Select() {
case awakeable:
  result, err := awakeable.Result()
  if err != nil {
    return err
  }
  slog.Info("Awakeable resolved first with: " + result)
case timeout:
  if err := timeout.Done(); err != nil {
    return err
  }
  slog.Info("Timeout hit first")
}