How replication works

Keeping services available is crucial in distributed applications, especially during failures or incidents. Temporal’s replication and automated failover features ensure high availability. By enabling High Availability replication, you allow Temporal to copy Namespace metadata and Workflow Executions to a replica. This redundancy, combined with failover capability, enhances availability during outages.

Replication

Replication is the process of copying and synchronizing data or services across Temporal Server deployments. This ensures availability and consistency in the event of a failure. Temporal uses replication to support high availability, ensuring that Workflows and data remain available even if parts of the system fail or become unreachable.

In Temporal, replication operates at the Namespace level. Each Namespace is replicated across isolation domains or separate regions. If the primary becomes unavailable, its replica can take over, so that Workflows continue without interruption.

Temporal Cloud replicates both Workflow Execution details and metadata, including configurations such as retention periods, Search Attributes, and other settings. All parts of the system will eventually synchronize to a consistent view of the Namespace metadata, even if the primary and its replica temporarily lose communication.

Dive deeper: Workflow replication restrictions

Temporal Cloud restricts certain Workflow operations to the primary:

  • You may only update Workflows in the primary.
  • You may only dispatch Workflow Tasks and Activity Tasks from the primary. Because of this, forward progress in a Workflow Execution can only be made in the primary.

These limits mean that certain requests, such as Start Workflow and Signal Workflow, are processed by and limited to the primary. Replicas may receive API requests from Clients and Workers. They automatically forward these requests to the primary for execution.

Namespaces with High Availability features provide an “all-active” experience for Temporal users. This helps limit or eliminate downtime during Namespace failover. There's a short time window between the moment a replica becomes active and the moment Clients and Workers receive the DNS update. During this window, the now-passive (formerly active) primary forwards requests to the newly active (formerly passive) replica.

As Workflow Executions progress and are operated on, the primary creates replication Tasks and dispatches them to the replica. Processing these replication Tasks ensures that the replica undergoes the same state transitions as the active primary, so each replicated Workflow Execution reaches the same state as the original.

Replicas do not distribute Workflow or Activity Tasks. Instead, they perform verification tasks to confirm that intended operations are executed so Workflows reach the desired state. This mechanism ensures consistency and reliability in the replication process.
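
The replication flow described above can be pictured with a small toy model. This is only a sketch under stated assumptions: the `Region` and `ReplicationTask` classes, the `record_event` helper, and the in-memory `outbox` queue are hypothetical names invented for illustration and do not correspond to Temporal Cloud internals or to any SDK API.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicationTask:
    """Hypothetical record shipped from the primary to the replica."""
    workflow_id: str
    event_id: int
    event_type: str


@dataclass
class Region:
    """Toy stand-in for one side of a replicated Namespace."""
    name: str
    histories: dict = field(default_factory=dict)

    def apply(self, task: ReplicationTask) -> None:
        history = self.histories.setdefault(task.workflow_id, [])
        # Apply tasks strictly in order; skip anything out of sequence.
        if task.event_id == len(history) + 1:
            history.append(task.event_type)


def record_event(primary, outbox, workflow_id, event_type):
    """Write an event in the primary and queue a replication task for it."""
    history = primary.histories.setdefault(workflow_id, [])
    history.append(event_type)
    task = ReplicationTask(workflow_id, len(history), event_type)
    outbox.append(task)
    return task


primary, replica = Region("primary"), Region("replica")
outbox = []

for event in ("WorkflowExecutionStarted",
              "WorkflowTaskScheduled",
              "WorkflowTaskCompleted"):
    record_event(primary, outbox, "order-42", event)

# The replica lags behind until it drains the queued replication tasks.
lag_before = len(outbox)
while outbox:
    replica.apply(outbox.pop(0))

print(lag_before, replica.histories == primary.histories)
```

In this model the replica only replays state transitions; it never schedules work of its own, which mirrors the restriction that forward progress happens in the primary.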


Failovers

Occasionally, a Namespace may become temporarily unavailable due to an unexpected incident. Temporal Cloud detects these issues using regular health checks.

Health checks

Temporal Cloud monitors error rates, latencies, and infrastructure problems, such as request timeouts. If it finds unhealthy conditions, where indicators exceed the allowed thresholds, Temporal Cloud automatically shifts the active role from the primary to the replica. This process is known as failover. In most cases, the replica is unaffected by the issue.
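
To make the idea of threshold-based health checks concrete, here is a minimal sketch. The threshold values, the `HealthSample` fields, and the rule of waiting for several consecutive unhealthy samples are all assumptions made for illustration; the actual indicators and thresholds Temporal Cloud uses are not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSample:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float   # observed tail latency
    timeouts: int           # request timeouts in the sampling window


# Hypothetical thresholds, chosen for illustration only.
MAX_ERROR_RATE = 0.05
MAX_P99_LATENCY_MS = 2000.0
MAX_TIMEOUTS = 10
CONSECUTIVE_BAD_SAMPLES = 3  # assumed debounce to avoid flapping


def is_unhealthy(sample):
    """A sample is unhealthy if any indicator exceeds its threshold."""
    return (sample.error_rate > MAX_ERROR_RATE
            or sample.p99_latency_ms > MAX_P99_LATENCY_MS
            or sample.timeouts > MAX_TIMEOUTS)


def should_fail_over(samples):
    """Trigger only after several consecutive unhealthy samples."""
    recent = samples[-CONSECUTIVE_BAD_SAMPLES:]
    return (len(recent) == CONSECUTIVE_BAD_SAMPLES
            and all(is_unhealthy(s) for s in recent))


healthy = HealthSample(error_rate=0.01, p99_latency_ms=300.0, timeouts=0)
degraded = HealthSample(error_rate=0.20, p99_latency_ms=300.0, timeouts=0)

print(should_fail_over([healthy, degraded, degraded]))            # one healthy sample in the window
print(should_fail_over([healthy, degraded, degraded, degraded]))  # three unhealthy in a row
```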

Automatic failovers

Failovers prevent data loss and application interruptions. Existing Workflows continue, and new Workflows can start while the incident is addressed. Once the incident is resolved, Temporal Cloud performs a "failback," shifting Workflow Execution processing back to the original primary.

Temporal Cloud handles failovers automatically, ensuring continuity without manual intervention.

On failover, the replica becomes active and the Namespace endpoint directs access to it.

For more control over the failover process, you can disable automated failovers.

tip

You can test the failover of a Namespace with High Availability features by manually triggering a failover using the UI or the tcld CLI utility. In most scenarios, we recommend you let Temporal handle failovers for you.

After failover, be aware of the following points:

  • When working with Multi-region Namespaces, your CNAME may change. For example, it may switch from aws-us-west-1.region.tmprl.com to aws-us-east-1.region.tmprl.com. This change doesn't affect same-region Namespaces.

  • Your Namespace endpoint will not change. If it is my_namespace.my_account.tmprl.cloud:7233 before failover, it will be my_namespace.my_account.tmprl.cloud:7233 after failover.

The failover process

Temporal's automated failover process works as follows:

  • During normal operation, the primary asynchronously copies operations and metadata to its replica, keeping them in sync.
  • If the primary becomes unavailable, Temporal detects the issue through health checks. It automatically switches to the replica, using one of its available failover scenarios.
  • The replica takes over the active role and becomes the primary. Operations continue with minimal disruption.
  • When the original primary recovers, the roles either switch back (failback, the default) or remain as they are, based on your Namespace settings. Automatic role switching with failover and failback minimizes downtime for consistent availability.

info

A Namespace failover, which updates the "active region" field in the Namespace record, is a metadata update. This update is replicated through the Namespace metadata mechanism.
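
The point that a failover is a metadata update, while the Namespace endpoint stays fixed, can be sketched as follows. The `NamespaceRecord` class, its field names, and the `version` counter are hypothetical; only the endpoint and the region names are borrowed from the examples on this page.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NamespaceRecord:
    """Hypothetical, simplified view of a Namespace's metadata."""
    endpoint: str
    regions: tuple
    active_region: str
    version: int = 0  # assumed counter, bumped on every metadata change


def fail_over(record, target_region):
    """Return a new record whose active region is target_region."""
    if target_region not in record.regions:
        raise ValueError(f"{target_region} is not a region of this Namespace")
    if target_region == record.active_region:
        return record  # already active there; nothing to update
    return replace(record,
                   active_region=target_region,
                   version=record.version + 1)


before = NamespaceRecord(
    endpoint="my_namespace.my_account.tmprl.cloud:7233",
    regions=("aws-us-west-1", "aws-us-east-1"),
    active_region="aws-us-west-1",
)
after = fail_over(before, "aws-us-east-1")
restored = fail_over(after, "aws-us-west-1")  # failback is the same update in reverse

print(after.active_region, after.endpoint == before.endpoint)
```

Because Clients and Workers connect through the endpoint rather than a region-specific address, they need no configuration change when the active region field flips.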

Failover scenarios

The Temporal Cloud failover mechanism supports several modes for executing Namespace failovers. These modes include graceful failover ("handover"), forced failover, and a hybrid mode. The hybrid mode is Temporal Cloud’s default Namespace behavior.

Graceful failover (handover)

In this mode, Temporal Cloud fully processes and drains replication Tasks. Temporal Cloud pauses traffic to the Namespace before the failover. Graceful failover prevents the loss of progress and avoids data conflicts.

The Namespace experiences a short period of unavailability, defaulting to 10 seconds. During this period:

  • Existing Workflows stop making progress.
  • Temporal Cloud returns a "Service unavailable" error. The Temporal SDKs treat this error as retryable.
  • State transitions do not happen and tasks are not dispatched.
  • User requests, such as Start Workflow and Signal Workflow, are rejected.
  • Operations are paused during handover.

This mode favors consistency over availability.
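
The Temporal SDKs already retry the "Service unavailable" error for you, so application code normally needs nothing extra. Purely to illustrate what a retry across the handover window looks like, here is a minimal sketch. The `ServiceUnavailable` exception, the `call_with_retry` helper, and the backoff values are hypothetical and are not part of any Temporal SDK.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for the retryable error returned during handover."""


def call_with_retry(operation, attempts=5, initial_delay=0.1,
                    backoff=2.0, sleep=time.sleep):
    """Call operation, retrying on ServiceUnavailable with exponential backoff."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ServiceUnavailable:
            if attempt == attempts:
                raise  # out of attempts; surface the error to the caller
            sleep(delay)
            delay *= backoff


# Simulate a Namespace that rejects the first two requests during handover.
calls = {"count": 0}


def start_workflow():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ServiceUnavailable("handover in progress")
    return "started"


delays = []  # record the waits instead of actually sleeping
result = call_with_retry(start_workflow, sleep=delays.append)
print(result, calls["count"], delays)
```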

Forced failover

In this mode, a Namespace immediately activates in the replica. Events not replicated due to replication lag undergo conflict resolution upon reaching the new active Namespace.

This mode prioritizes availability over consistency.

Hybrid failover mode

While graceful failovers are preferred for consistency, they aren’t always practical. Temporal Cloud’s hybrid failover mode (the default mode) limits the initial graceful failover attempt to 10 seconds or less.

During this period:

  • Existing Workflows stop making progress.
  • Temporal Cloud returns a "Service unavailable" error, which the Temporal SDKs retry.

If the graceful failover doesn’t complete within that window, Temporal Cloud automatically switches to a forced failover.

This strategy balances consistency and availability requirements.
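
The hybrid strategy can be sketched as "try to drain replication within a time budget, then force." The model below uses simulated time rather than real clocks. The `hybrid_failover` function and the per-task cost are assumptions for illustration; only the 10-second budget comes from the text above.

```python
from collections import deque

GRACEFUL_TIMEOUT_SECONDS = 10.0  # graceful attempt budget stated above


def hybrid_failover(pending_tasks, seconds_per_task,
                    timeout=GRACEFUL_TIMEOUT_SECONDS):
    """Return (mode, unreplicated_tasks) for a simulated failover.

    Drains replication tasks while the simulated clock stays within the
    timeout. If the backlog empties in time the failover is graceful;
    otherwise it is forced and the leftover tasks are the replication lag
    that conflict resolution has to deal with later.
    """
    queue = deque(pending_tasks)
    elapsed = 0.0
    while queue and elapsed + seconds_per_task <= timeout:
        queue.popleft()
        elapsed += seconds_per_task
    if not queue:
        return "graceful", []
    return "forced", list(queue)


small_backlog = hybrid_failover(range(5), seconds_per_task=1.0)
large_backlog = hybrid_failover(range(30), seconds_per_task=1.0)

print(small_backlog[0], large_backlog[0], len(large_backlog[1]))
```

With a small backlog the attempt finishes gracefully. With a large one, the budget runs out and the remaining tasks are carried into a forced failover.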

Scenario summary

Failover scenario            | Characteristics
Graceful failover (handover) | Favors consistency over availability.
Forced failover              | Prioritizes availability over consistency.
Hybrid failover mode         | Balances consistency and availability requirements.

Network partitions

At any time, only the primary or the replica is active. The only exception occurs during a network partition, when a network splits into separate subnetworks. Should this occur, you can promote a replica to active status. Caution: this temporarily makes both regions active. After the network partition is resolved and communication between the isolation domains or regions is restored, a conflict resolution algorithm determines whether the primary or the replica remains active.

tip

In traditional active/active replication, multiple nodes serve requests and accept writes simultaneously, ensuring strong synchronous data consistency. In contrast, in a Temporal Cloud Namespace with High Availability features, only the primary accepts requests and writes at any given time. Workflow History Events are written to the primary first and then asynchronously replicated to the replica, which keeps the replica in sync.

Conflict resolution

Namespaces with replicas rely on asynchronous event replication. Updates made to the primary may not immediately be reflected in the replica due to replication lag, particularly during failovers. In the event of a non-graceful failover, replication lag may cause a temporary setback in Workflow progress.

Namespaces that aren't replicated can be configured to provide at-most-once semantics for Activity execution by setting the retry policy's maximum attempts to 1, which disables retries (a value of 0 means unlimited attempts). High Availability Namespaces provide at-least-once semantics for Activity execution. Completed Activities may be re-dispatched in a newly active Namespace, leading to repeated executions.
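
Because an Activity may run more than once after a failover, a common defensive pattern is to make its side effects idempotent. The sketch below assumes a downstream system that deduplicates by key. `PaymentLedger`, `charge_activity`, and the key format are hypothetical names for illustration and are not Temporal APIs.

```python
class PaymentLedger:
    """Hypothetical downstream system that deduplicates by idempotency key."""

    def __init__(self):
        self._receipts = {}

    def charge(self, idempotency_key, amount_cents):
        # A repeated key returns the original receipt instead of charging again.
        if idempotency_key in self._receipts:
            return self._receipts[idempotency_key]
        receipt = f"receipt-{len(self._receipts) + 1}"
        self._receipts[idempotency_key] = receipt
        return receipt

    def total_charges(self):
        return len(self._receipts)


def charge_activity(ledger, workflow_id, activity_id, amount_cents):
    """Activity body: derive a stable key so re-execution is harmless."""
    key = f"{workflow_id}:{activity_id}"
    return ledger.charge(key, amount_cents)


ledger = PaymentLedger()

# First execution in the original primary.
first = charge_activity(ledger, "order-42", "charge-card", 1999)
# The same Activity is re-dispatched in the newly active replica.
second = charge_activity(ledger, "order-42", "charge-card", 1999)

print(first == second, ledger.total_charges())
```

The key is built from identifiers that stay the same across re-dispatch, so the second execution observes the first result rather than repeating the side effect.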

When a Workflow Execution is updated in a newly active replica following a failover, events from the previously active Namespace that arrive after the failover can't be directly applied. At this point, Temporal Cloud has forked the Workflow History.

After failover, Temporal Cloud creates a new history branch for the Workflow Execution and begins its conflict resolution process. The Temporal Service ensures that Workflow Histories remain valid and replayable by SDKs after failover or conflict resolution. This capability is crucial for ensuring that Workflow Executions continue forward without losing progress and for maintaining consistency across replicas, even during incidents that disrupt replication.
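
To visualize what a forked Workflow History means, the toy sketch below compares two branches and finds where they diverge. The event tuples, the region labels, and the `find_fork_point` helper are assumptions for illustration. The sketch does not reproduce Temporal Cloud's conflict resolution algorithm, and it deliberately does not decide which branch wins.

```python
def find_fork_point(branch_a, branch_b):
    """Return how many leading events the two branches have in common."""
    shared = 0
    for event_a, event_b in zip(branch_a, branch_b):
        if event_a != event_b:
            break
        shared += 1
    return shared


# Events are (event_id, written_in_region, event_type) tuples.
old_active_branch = [
    (1, "west", "WorkflowExecutionStarted"),
    (2, "west", "WorkflowTaskScheduled"),
    (3, "west", "WorkflowTaskStarted"),
    (4, "west", "WorkflowTaskCompleted"),  # not replicated before the failover
]

# The newly active replica had received events 1 to 3, then wrote its own event 4.
new_active_branch = old_active_branch[:3] + [
    (4, "east", "WorkflowTaskTimedOut"),
]

fork = find_fork_point(old_active_branch, new_active_branch)
late_events = old_active_branch[fork:]  # arrive after failover; can't be applied directly

print(fork, late_events)
```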