CockroachDB provides built-in high availability (HA) features and disaster recovery (DR) tooling to achieve operational resiliency in various deployment topologies and use cases.
- HA features maximize uptime by ensuring continuous access to data, even in the presence of failures or disruptions.
- DR tools allow for recovery from major incidents to minimize downtime and data loss.
You can balance required SLAs and recovery objectives against the cost and management overhead of each of these features to build a resilient deployment.
- Recovery Point Objective (RPO): The maximum amount of data loss (measured by time) that an organization can tolerate.
- Recovery Time Objective (RTO): The maximum length of time it should take to restore normal operations following an outage.
For a practical guide on how CockroachDB uses Raft to replicate, distribute, and rebalance data, refer to the CockroachDB Resilience demo.
High availability
- Multi-active availability: CockroachDB's built-in Raft replication stores data safely and consistently on multiple nodes, so the cluster keeps serving traffic even during a temporary node outage. Replication controls let you configure the number and placement of replicas to suit a deployment (see the SQL sketch after this list).
- For more detail on planning for single-region or multi-region recovery, refer to Single-region survivability planning or Multi-region survivability planning.
- Advanced fault tolerance: Capabilities built into CockroachDB that perform routine maintenance operations with minimal impact on foreground performance; for example, online schema changes and write-ahead log (WAL) failover.
- Logical data replication (LDR) (Preview): A cross-cluster replication tool between active CockroachDB clusters, which supports a range of topologies. LDR provides eventually consistent, table-level replication between the clusters. Individually, each active cluster uses CockroachDB multi-active availability to achieve low, single-region write latency with transactionally consistent writes using Raft replication.
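As a minimal sketch of the replication controls and survivability goals mentioned above (the `movr` database name is illustrative, and `SURVIVE REGION FAILURE` assumes the database already has at least three database regions configured):

```sql
-- Replicate every range across 5 nodes instead of the default 3,
-- so each range tolerates 2 simultaneous node failures.
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;

-- For a multi-region database, require that data survive the loss of
-- an entire region (rather than only an availability zone).
-- Assumes the database already has at least three database regions.
ALTER DATABASE movr SURVIVE REGION FAILURE;
```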
Choose an HA strategy
| | Single-region replication (synchronous) | Multi-region replication (synchronous) | Logical data replication (asynchronous) |
|---|---|---|---|
| RPO | 0 seconds | 0 seconds | Immediate mode: 0.5 seconds |
| RTO | Zero RTO (potential increased latency for 1-9 seconds) | Zero RTO (potential increased latency for 1-9 seconds) | Zero RTO, plus application traffic cutover time |
| Write latency | Region-local write latency: p50 < 5 ms (for multiple availability zones in us-east1) | Cross-region write latency: p50 > 50 ms (for a multi-region cluster in us-east1, us-east-2, us-west-1) | Region-local latency, depending on design: p50 < 5 ms (for multiple availability zones in us-east1) |
| Recovery | Automatic | Automatic | Semi-automatic |
| Fault tolerance | Zero RPO for node and availability zone failures within a cluster | Zero RPO for node, availability zone, and region failures within a cluster | Zero RPO for node and availability zone failures within a cluster; region failures with loss up to RPO in a two-region (or two-datacenter) setup |
| Minimum regions to achieve fault tolerance | 1 | 3 | 2 |
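As a rough illustration of the LDR option in the comparison above, a stream is started from the destination cluster. The external connection name (`source_a`) and the table are illustrative placeholders, and creating the external connection to the source cluster is assumed to have been done already:

```sql
-- On the destination cluster: replicate a table from the source cluster,
-- identified by a pre-created external connection named "source_a".
-- Connection and table names are illustrative.
CREATE LOGICAL REPLICATION STREAM
  FROM TABLE movr.public.rides
  ON 'external://source_a'
  INTO TABLE movr.public.rides;
```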
For details on designing your cluster topology for HA with replication, refer to the Disaster Recovery Planning page.
Disaster recovery
- Backup and point-in-time restore: Point-in-time backup and restore allows you to roll your data back to a specific point in time. Support for multiple cloud storage providers means that you can store backups with your chosen provider. Incremental backups allow you to increase backup frequency for a lower RPO (see the sketch after this list).
- Physical cluster replication (PCR): A cross-cluster replication tool between an active primary CockroachDB cluster and a passive standby CockroachDB cluster. PCR provides transactionally consistent full-cluster replication.
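For example, a backup schedule with revision history and a later point-in-time restore might look like the following sketch; the external connection name and the timestamp are placeholders:

```sql
-- Take a full backup daily and incremental backups every 10 minutes,
-- keeping revision history so restores can target an exact point in time.
-- 'external://backup_storage' is a placeholder external connection.
CREATE SCHEDULE movr_backups
  FOR BACKUP INTO 'external://backup_storage'
  WITH revision_history
  RECURRING '*/10 * * * *'
  FULL BACKUP '@daily';

-- Restore the database as of a specific timestamp covered by the backups.
RESTORE DATABASE movr
  FROM LATEST IN 'external://backup_storage'
  AS OF SYSTEM TIME '2025-01-15 10:30:00';
```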
Choose a DR strategy
CockroachDB is designed to recover automatically; however, building backups or PCR into your DR plan protects against unforeseen incidents.
| | Point-in-time backup & restore | Physical cluster replication (asynchronous) |
|---|---|---|
| RPO | >= 5 minutes | 10s of seconds |
| RTO | Minutes to hours, depending on data size and number of nodes | Seconds to minutes, depending on cluster size and time of failover |
| Write latency | No impact | No impact |
| Recovery | Manual restore | Manual failover |
| Fault tolerance | Not applicable | Zero RPO for node and availability zone failures within a cluster; region failures with loss up to RPO in a two-region (or two-datacenter) setup |
| Minimum regions to achieve fault tolerance | 1 | 2 |
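As a hedged sketch of the PCR workflow summarized in the table above (connection details are placeholders, and prerequisites such as cluster virtualization and a replication service account on the primary are assumed):

```sql
-- On the standby cluster: start replicating the primary's data into a
-- standby virtual cluster. The connection string is a placeholder.
CREATE VIRTUAL CLUSTER main
  FROM REPLICATION OF main
  ON 'postgresql://{replication_user}@{primary_host}:26257?options=-ccluster%3Dsystem&sslmode=verify-full';

-- To fail over, advance the standby to the latest consistent replicated
-- time, then bring it online to serve application traffic.
ALTER VIRTUAL CLUSTER main COMPLETE REPLICATION TO LATEST;
ALTER VIRTUAL CLUSTER main START SERVICE SHARED;
```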