Physical Cluster Replication

Note:

Physical cluster replication is only supported in CockroachDB self-hosted clusters.

CockroachDB physical cluster replication (PCR) continuously sends all data at the cluster level from a primary cluster to an independent standby cluster. Existing data and ongoing changes on the active primary cluster, which is serving application data, replicate asynchronously to the passive standby cluster.

You can cut over from the primary cluster to the standby cluster. This will stop the replication stream, reset the standby cluster to a point in time where all ingested data is consistent, and mark the standby as ready to accept application traffic.
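
One way to picture "a point in time where all ingested data is consistent" is as the latest timestamp that every part of the standby has fully ingested. The following minimal sketch models that idea only. The span names, integer timestamps, and function names are hypothetical, and this is not a description of how CockroachDB implements cutover.

```python
from typing import Dict, List, Tuple

# Hypothetical model: each key span on the standby has ingested data up to
# some timestamp (plain integers here, not real CockroachDB timestamps).
def consistent_time(ingested_up_to: Dict[str, int]) -> int:
    """Return the latest timestamp that every span has fully ingested."""
    if not ingested_up_to:
        raise ValueError("no spans to consider")
    return min(ingested_up_to.values())


def reset_to(rows: List[Tuple[str, int, str]], cutover_time: int) -> List[Tuple[str, int, str]]:
    """Keep only (key, timestamp, value) writes at or before the cutover time."""
    return [row for row in rows if row[1] <= cutover_time]


progress = {"span_a": 105, "span_b": 98, "span_c": 110}
writes = [("k1", 97, "x"), ("k2", 98, "y"), ("k3", 104, "z")]

cutover = consistent_time(progress)  # limited by the slowest span
kept = reset_to(writes, cutover)     # writes newer than the cutover are dropped
```

In this model the slowest span bounds the cutover time, which is why a lagging replication stream pushes the consistent point further into the past.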

For a list of requirements for PCR, refer to the Before you begin section of the setup tutorial.

Use cases

You can use PCR to:

  • Meet your RTO (Recovery Time Objective) and RPO (Recovery Point Objective) requirements. PCR provides lower RTO and RPO than backup and restore.
  • Automatically replicate everything in your primary cluster to recover quickly from a control plane or full cluster failure.
  • Protect against region failure when you cannot use individual multi-region clusters. For example, you have a two-datacenter architecture without access to three regions, or you need low write latency in a single region. PCR allows for an active-passive (primary-standby) structure across two clusters with the passive cluster in a different region.
  • Quickly recover from user error (for example, dropping a database) by cutting over to a timestamp in the recent past.
  • Create a blue-green deployment model by using the standby cluster for testing upgrades and hardware changes.

Features

  • Asynchronous cluster-level replication: When you initiate a replication stream, it will replicate byte-for-byte all of the primary cluster's existing user data and associated metadata to the standby cluster asynchronously. From then on, it will continuously replicate the primary cluster's data and metadata to the standby cluster. PCR will automatically replicate changes related to operations such as schema changes, user and privilege modifications, and zone configuration updates without any manual work.
  • Transactional consistency: Avoid conflicts in data after recovery; the replication completes to a transactionally consistent state as of a certain point in time.
  • Improved RPO and RTO: Depending on workload and deployment configuration, replication lag between the primary and standby is generally in the tens-of-seconds range. The cutover process from the primary cluster to the standby should typically happen within five minutes when completing a cutover to the latest replicated time using LATEST.
  • Cutover to a timestamp in the past or the future: In the case of logical disasters or mistakes, you can cut over from the primary to the standby cluster to a timestamp in the past. This means that you can return the standby to a timestamp before the mistake was replicated to the standby. You can also configure the WITH RETENTION option to control how far in the past you can cut over to. Furthermore, you can plan a cutover by specifying a timestamp in the future.
  • Monitoring: To monitor the replication's initial progress, current status, and performance, you can use metrics available in the DB Console and Prometheus. For more detail, refer to Physical Cluster Replication Monitoring.
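
The relationship between replication lag, the retention window, and past or future cutover timestamps can be sketched as follows. This is a simplified model under stated assumptions: the function names are hypothetical, and the retention window is assumed to be measured back from the latest replicated time, which may differ from how CockroachDB tracks it.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(now: datetime, replicated_time: datetime) -> timedelta:
    """How far the standby trails the primary (a rough bound on data loss)."""
    return now - replicated_time


def classify_cutover_target(target: datetime,
                            replicated_time: datetime,
                            retention: timedelta) -> str:
    """Classify a requested cutover timestamp in this simplified model.

    Assumption: the earliest usable timestamp is replicated_time - retention.
    """
    earliest = replicated_time - retention
    if target < earliest:
        raise ValueError("target is older than the retention window allows")
    if target <= replicated_time:
        return "past"    # cut over to already-replicated data
    return "future"      # planned cutover, waits for replication to catch up


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
replicated = now - timedelta(seconds=20)   # lag in the tens-of-seconds range
retention = timedelta(hours=24)

lag = replication_lag(now, replicated)
recover_from_mistake = classify_cutover_target(now - timedelta(hours=1), replicated, retention)
planned = classify_cutover_target(now + timedelta(minutes=30), replicated, retention)
```

A longer retention window widens the range of past timestamps you can return to, at the cost of keeping more historical data on the standby.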

Known limitations

  • Physical cluster replication is supported in CockroachDB self-hosted clusters on v23.2 or later. The primary cluster can be a new or existing cluster. The standby cluster must be a new cluster started with the --virtualized-empty flag.
  • Read queries are not supported on the standby cluster before cutover.
  • The primary and standby clusters must have the same zone configurations.
  • Before cutover to the standby, the standby cluster does not support running backups or changefeeds.

  • When you cut back to a cluster that was previously the primary cluster, you should cut over to the LATEST timestamp. Using a historical timestamp may lead to the cutback failing. #117984

  • After the cutover process for physical cluster replication, scheduled changefeeds will continue on the promoted cluster. You will need to manage pausing or canceling the schedule on the promoted standby cluster to avoid two clusters running the same changefeed to one sink. #123776

  • After a cutover, there is no mechanism to stop applications from connecting to the original primary cluster. It is necessary to redirect application traffic manually, such as by using a network load balancer or adjusting DNS records.

Note:

Frequent large schema changes or imports may cause a significant spike in replication lag.

Get started

This section is a quick overview of the initial requirements to start a replication stream.

For more comprehensive guides, refer to the setup tutorial and Physical Cluster Replication Monitoring.

Manage replication in the SQL shell

To start, manage, and observe PCR, you can use the following SQL statements:

Statement | Action
CREATE VIRTUAL CLUSTER ... FROM REPLICATION OF ... | Start a replication stream.
ALTER VIRTUAL CLUSTER ... PAUSE REPLICATION | Pause a running replication stream.
ALTER VIRTUAL CLUSTER ... RESUME REPLICATION | Resume a paused replication stream.
ALTER VIRTUAL CLUSTER ... START SERVICE SHARED | Initiate a cutover.
SHOW VIRTUAL CLUSTER | Show all virtual clusters.
DROP VIRTUAL CLUSTER | Remove a virtual cluster.

Cluster versions and upgrades

Warning:

The standby cluster must be at the same version as, or one version ahead of, the primary's virtual cluster.

When PCR is enabled, upgrade the standby cluster before the primary cluster. Within each CockroachDB cluster (primary and standby), the system virtual cluster must be at a cluster version greater than or equal to that of the virtual cluster. Upgrade with the following procedure:

  1. Upgrade the binaries on the primary and standby clusters. Replace the binary on each node of the cluster and restart the node.
  2. Finalize the upgrade on the standby's system virtual cluster.
  3. Finalize the upgrade on the primary's system virtual cluster.
  4. Finalize the upgrade on the standby's virtual cluster.
  5. Finalize the upgrade on the primary's virtual cluster.
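
To see why the finalization order matters, the following sketch replays steps 2 through 5 against the two version rules stated in this section. It is an illustrative model only: versions are plain integers, the class and function names are hypothetical, and the "same or one version ahead" rule is assumed to compare the standby's virtual cluster with the primary's virtual cluster.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Versions:
    primary_system: int
    primary_virtual: int
    standby_system: int
    standby_virtual: int


def violations(v: Versions) -> List[str]:
    """List every version rule from this section that the state breaks."""
    problems = []
    if v.primary_system < v.primary_virtual:
        problems.append("primary: system virtual cluster is behind its virtual cluster")
    if v.standby_system < v.standby_virtual:
        problems.append("standby: system virtual cluster is behind its virtual cluster")
    if v.standby_virtual - v.primary_virtual not in (0, 1):
        problems.append("standby must match, or be one version ahead of, the primary")
    return problems


# Finalization order from steps 2-5 of the procedure.
DOCUMENTED_ORDER = ("standby_system", "primary_system",
                    "standby_virtual", "primary_virtual")


def first_violation(start: Versions, order: Tuple[str, ...],
                    target: int) -> Optional[Tuple[str, List[str]]]:
    """Finalize each component in turn; report the first step that breaks a rule."""
    state = start
    for step in order:
        state = replace(state, **{step: target})
        problems = violations(state)
        if problems:
            return step, problems
    return None


start = Versions(1, 1, 1, 1)
ok = first_violation(start, DOCUMENTED_ORDER, target=2)
bad = first_violation(start, tuple(reversed(DOCUMENTED_ORDER)), target=2)
```

With the documented order no rule is broken at any step, while the reversed order fails immediately because the primary's virtual cluster moves ahead of both its own system virtual cluster and the standby.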
