Replication Dashboard

The Replication dashboard in the DB Console lets you monitor the replication metrics for your cluster, such as range status, replicas per store, and replica quiescence.

To view this dashboard, access the DB Console, click Metrics in the left-hand navigation, and select Dashboard > Replication.

Note:

The Replication dashboard is distinct from the Physical Cluster Replication dashboard, which tracks metrics related to physical cluster replication jobs.

Review of CockroachDB terminology

Range: CockroachDB stores all user data and almost all system data in a giant sorted map of key-value pairs. This keyspace is divided into "ranges", contiguous chunks of the keyspace, so that every key can always be found in a single range.
Range Replica: CockroachDB replicates each range (3 times by default) and stores each replica on a different node.
Range Lease: For each range, one of the replicas holds the "range lease". This replica, referred to as the "leaseholder", is the one that receives and coordinates all read and write requests for the range.
Under-replicated Ranges: When a cluster is first initialized, the few default starting ranges have a single replica. As more nodes become available, the cluster replicates these ranges to other nodes until the number of replicas for each range reaches the desired replication factor (3 by default). If a range has fewer replicas than the replication factor, the range is said to be "under-replicated". Non-voting replicas, if configured, are not counted when calculating replication status.
Unavailable Ranges: If a majority of a range's replicas are on nodes that are unavailable, then the entire range is unavailable and will be unable to process queries.
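
The definitions of under-replicated and unavailable ranges above can be illustrated with a minimal sketch. The names `classify_range` and `RangeStatus` are hypothetical and are not part of CockroachDB. The sketch assumes that only voting replicas are counted, and it expresses unavailability as loss of a Raft quorum (fewer than a majority of voting replicas on live nodes), which matches the wording above for odd replica counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeStatus:
    under_replicated: bool
    unavailable: bool


def classify_range(voting_replicas: int,
                   live_voting_replicas: int,
                   replication_factor: int = 3) -> RangeStatus:
    """Classify one range from its voting-replica counts (illustrative only).

    voting_replicas: voting replicas the range currently has.
    live_voting_replicas: how many of those are on available nodes.
    replication_factor: desired number of replicas (3 by default).
    """
    if not 0 <= live_voting_replicas <= voting_replicas:
        raise ValueError("live_voting_replicas must be between 0 and voting_replicas")
    # A Raft group needs a strict majority of its voters to make progress.
    quorum = voting_replicas // 2 + 1
    return RangeStatus(
        under_replicated=voting_replicas < replication_factor,
        unavailable=live_voting_replicas < quorum,
    )
```

For example, a range with 3 voting replicas stays available with 1 node down (2 of 3 live) but becomes unavailable with 2 nodes down (1 of 3 live).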

For more details, see Scalable SQL Made Easy: How CockroachDB Automates Operations.

Dashboard navigation

Use the Graph menu to display metrics for your entire cluster or for a specific node.

To the right of the Graph and Dashboard menus, a time interval selector allows you to filter the view for a predefined or custom time interval. Use the navigation buttons to move to the previous, next, or current time interval. When you select a time interval, the same interval is selected in the SQL Activity pages. However, if you select 10 or 30 minutes, the interval defaults to 1 hour in SQL Activity pages.

Hovering your mouse pointer over the graph title will display a tooltip with a description and the metrics used to create the graph.

When hovering on graphs, crosshair lines will appear at your mouse pointer. The series' values corresponding to the given time in the crosshairs are displayed in the legend under the graph. Hovering the mouse pointer on a given series displays the corresponding value near the mouse pointer and highlights the series line (graying out other series lines). Click anywhere within the graph to freeze the values in place. Click anywhere within the graph again to cause the values to change with your mouse movements once more.

In the legend, click on an individual series to isolate it on the graph. The other series will be hidden, while the hover will still work. Click the individual series again to make the other series visible. If there are many series, a scrollbar may appear on the right of the legend. This is to limit the size of the legend so that it does not get endlessly large, particularly on clusters with many nodes.

The Replication dashboard displays the following time series graphs:

Ranges

The Ranges graph shows you various details about the status of ranges.

In the node view, the graph shows details about ranges on the node.
In the cluster view, the graph shows details about ranges across all nodes in the cluster.

On hovering over the graph, the values for the following metrics are displayed:

Metric	Description
Ranges	The number of ranges.
Leaders	The number of ranges with leaders. If the number does not match the number of ranges for a long time, troubleshoot your cluster.
Lease Holders	The number of ranges that have leases.
Leaders w/o Leases	The number of Raft leaders without leases. If the number is non-zero for a long time, troubleshoot your cluster.
Unavailable	The number of unavailable ranges. If the number is non-zero for a long time, troubleshoot your cluster.
Under-replicated	The number of under-replicated ranges. Non-voting replicas are not included in this value.
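
Several of the descriptions above recommend troubleshooting when a value stays non-zero "for a long time". One way to make that check concrete is sketched below. The function name `nonzero_for_at_least` and the `min_seconds` threshold are assumptions chosen for illustration; this page does not define how long "a long time" is.

```python
from typing import Iterable, Optional, Tuple


def nonzero_for_at_least(samples: Iterable[Tuple[float, float]],
                         min_seconds: float) -> bool:
    """Return True if the series has been continuously non-zero, up to and
    including its most recent sample, for at least min_seconds.

    samples: (timestamp_seconds, value) pairs sorted by timestamp.
    """
    streak_start: Optional[float] = None
    last_ts: Optional[float] = None
    for ts, value in samples:
        if value != 0:
            if streak_start is None:
                streak_start = ts  # first non-zero sample of the current streak
        else:
            streak_start = None    # a zero sample ends any streak
        last_ts = ts
    return streak_start is not None and last_ts - streak_start >= min_seconds
```

A caller could feed this function samples of the Unavailable or Leaders w/o Leases series and choose a `min_seconds` value that suits the cluster's alerting policy.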

Logical Bytes per Store

Metric	Description
Logical Bytes per Store	Number of logical bytes stored in key-value pairs on each node. This includes historical and deleted data.

Note:

Logical bytes reflect the approximate number of bytes stored in the database. This value may deviate from the number of physical bytes on disk, due to factors such as compression and write amplification.

Replicas Per Store

In the node view, the graph shows the number of range replicas on the store.
In the cluster view, the graph shows the number of range replicas on each store.

You can use Replication Controls to set the number and location of replicas. You can monitor the configuration changes using the DB Console, as described in CockroachDB Resilience.

Replica Quiescence

In the node view, the graph shows the number of replicas on the node.
In the cluster view, the graph shows the number of replicas across all nodes.

On hovering over the graph, the values for the following metrics are displayed:

Metric	Description
Replicas	The number of replicas.
Quiescent	The number of replicas that haven't been accessed for a while.

Range Operations

In the node view, the graph shows the number of range operation events on the node.
In the cluster view, the graph shows the number of range operation events across all nodes.

Metric	CockroachDB Metric Name	Description
Splits	`range.splits`	Number of range splits
Merges	`range.merges`	Number of range merges
Adds	`range.adds`	Number of range additions
Removes	`range.removes`	Number of range removals
Lease Transfers	`leases.transfers.success`	Number of successful lease transfers
Load-based Lease Transfers	`rebalancing.lease.transfers`	Number of lease transfers motivated by store-level load imbalances
Load-based Range Rebalances	`rebalancing.range.rebalances`	Number of range rebalance operations motivated by store-level load imbalances
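
The metric names in this table can also be read outside the DB Console from a Prometheus-format export of the cluster's metrics. The sketch below parses such text and sums one metric across its label sets. It assumes that `.` and `-` in a CockroachDB metric name become `_` in the exported name, and that sample lines carry no trailing timestamp. The `SAMPLE` text, its `store` labels, and its values are made up for illustration.

```python
import re

# Made-up excerpt in Prometheus text exposition format (not real cluster output).
SAMPLE = """\
# HELP range_splits Number of range splits
# TYPE range_splits counter
range_splits{store="1"} 12
range_splits{store="2"} 3
leases_transfers_success{store="1"} 7
"""


def prometheus_name(metric: str) -> str:
    """Map a dotted metric name to its assumed exported form."""
    return re.sub(r"[.\-]", "_", metric)


def sum_metric(exposition_text: str, metric: str) -> float:
    """Sum every sample of `metric` across label sets (for example, stores)."""
    name = prometheus_name(metric)
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Sample lines look like `name{labels} value` or `name value`.
        head, _, value = line.rpartition(" ")
        if head.split("{", 1)[0] == name:
            total += float(value)
    return total
```

With the sample text above, `sum_metric(SAMPLE, "range.splits")` adds the two per-store samples for `range_splits`.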

Snapshots

Usually the nodes in a Raft group stay synchronized by following along with the log message by message. However, if a node is far enough behind the log (e.g., if it was offline or is a new node getting up to speed), rather than send all the individual messages that changed the range, the cluster can send it a snapshot of the range and it can start following along from there. Commonly this is done preemptively, when the cluster can predict that a node will need to catch up, but occasionally the Raft protocol itself will request the snapshot.

Metric	Description
Generated	The number of snapshots created per second.
Applied (Raft-initiated)	The number of snapshots applied to nodes per second that were initiated within Raft.
Applied (Learner)	The number of snapshots applied to nodes per second that were anticipated ahead of time (e.g., because a node was about to be added to a Raft group). This metric replaces the `Applied (Preemptive)` metric in v19.2 and later.
Applied (Preemptive)	The number of snapshots applied to nodes per second that were anticipated ahead of time (e.g., because a node was about to be added to a Raft group). This metric was used in pre-v19.2 releases and will be removed in future releases.
Reserved	The number of slots reserved per second for incoming snapshots that will be sent to a node.

Snapshot Data Received

The Snapshot Data Received graph shows the rate per second of data received in bytes by each node via Raft snapshot transfers. Data is split into recovery and rebalancing snapshot data received: recovery includes all upreplication due to decommissioning or node failure, and rebalancing includes all other snapshot data received.

On hovering over the graph, the values for the following metrics are displayed:

Metric	Description
`{node}-recovery`	The rate per second of recovery snapshot data received (e.g., from node decommissioning or node failure) in bytes per node, as tracked by the `range.snapshots.recovery.rcvd-bytes` metric.
`{node}-rebalancing`	The rate per second of rebalancing snapshot data received (i.e., all snapshot data received that is not recovery data) in bytes per node, as tracked by the `range.snapshots.rebalancing.rcvd-bytes` metric.
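
A rate per second like the ones in this graph can be derived from successive readings of a byte counter. The following is a minimal sketch, assuming the underlying metric is a cumulative counter that restarts from zero when the process restarts; the function name `per_second_rates` is hypothetical.

```python
from typing import List, Tuple


def per_second_rates(samples: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Convert cumulative counter readings into per-second rates.

    samples: (timestamp_seconds, cumulative_bytes) pairs sorted by timestamp.
    Returns one (timestamp, bytes_per_second) pair per consecutive pair of
    readings, stamped with the later timestamp.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # ignore duplicate or out-of-order readings
        delta = v1 - v0
        if delta < 0:
            # Assumed counter reset: count only the growth since the restart.
            delta = v1
        rates.append((t1, delta / dt))
    return rates
```

For example, readings of 0, 1000, and 1500 bytes taken 10 seconds apart yield rates of 100 and 50 bytes per second.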

Receiver Snapshots Queued

The Receiver Snapshots Queued graph shows the number of Raft snapshot transfers queued to be applied on a receiving node, which can only accept one snapshot at a time per store.

On hovering over the graph, the value for the following metric is displayed:

Metric	Description
`{node}`	The number of queued snapshot transfers waiting to be applied per node, as tracked by the `range.snapshots.recv-queue` metric.

Circuit Breaker Tripped Replicas

When individual ranges become temporarily unavailable, requests to those ranges are refused by a per-replica circuit breaker instead of hanging indefinitely.

In the node view, the graph shows the number of replicas for which the per-replica circuit breaker is currently tripped, for the selected node.
In the cluster view, the graph shows the number of replicas for which the per-replica circuit breaker is currently tripped, for each node in the cluster.

On hovering over the graph, the value for the following metric is displayed:

Metric	Description
`{node}`	The number of replicas on that node for which the per-replica circuit breaker is currently tripped.

Circuit Breaker Tripped Events

When individual ranges become temporarily unavailable, requests to those ranges are refused by a per-replica circuit breaker instead of hanging indefinitely. While a range's per-replica circuit breaker remains tripped, each incoming request to that range triggers a ReplicaUnavailableError event until the range becomes available again.

In the node view, the graph shows the rate of ReplicaUnavailableError events logged per aggregated interval of time since the cockroach process started, for the selected node.
In the cluster view, the graph shows the rate of ReplicaUnavailableError events logged per aggregated interval of time since the cockroach process started, for each node in the cluster.

Metric	Description
`{node}`	The rate of `ReplicaUnavailableError` events that have occurred per aggregated interval of time on that node since the `cockroach` process started.

Paused Follower

The Paused Follower graph displays the number of nonessential replicas in the cluster that have replication paused. A value of 0 represents that a node is replicating as normal, while a value of 1 represents that replication has been paused for the listed node.

In the node view, the graph shows whether replication has been paused, for the selected node.
In the cluster view, the graph shows each node in the cluster and indicates whether replication has been paused for each node.

On hovering over the graph, the value for the following metric is displayed:

Metric	Description
`{node}`	Whether replication is paused on that node. A value of `0` represents that the node is replicating as normal, while a value of `1` represents that replication has been paused for the listed node.

Replicate Queue Actions: Successes

The Replicate Queue Actions: Successes graph shows the rate of various successful replication queue actions per second.

In the node view, the graph shows the rate of successful replication queue actions for the selected node.
In the cluster view, the graph shows the average rate of successful replication queue actions across the cluster.

On hovering over the graph, the values for the following metrics are displayed:

Metric	Description
Replicas Added / Sec	The number of successful replica additions processed by the replicate queue, as tracked by the `queue.replicate.addreplica.success` metric.
Replicas Removed / Sec	The number of successful replica removals processed by the replicate queue, as tracked by the `queue.replicate.removereplica.success` metric.
Dead Replicas Replaced / Sec	The number of successful dead replica replacements processed by the replicate queue, as tracked by the `queue.replicate.replacedeadreplica.success` metric.
Dead Replicas Removed / Sec	The number of successful dead replica removals processed by the replicate queue, as tracked by the `queue.replicate.removedeadreplica.success` metric.
Decommissioning Replicas Replaced / Sec	The number of successful decommissioning replica replacements processed by the replicate queue, as tracked by the `queue.replicate.replacedecommissioningreplica.success` metric.
Decommissioning Replicas Removed / Sec	The number of successful decommissioning replica removals processed by the replicate queue, as tracked by the `queue.replicate.removedecommissioningreplica.success` metric.

Replicate Queue Actions: Failures

The Replicate Queue Actions: Failures graph shows the rate of various failed replication queue actions per second.

In the node view, the graph shows the rate of failed replication queue actions for the selected node.
In the cluster view, the graph shows the average rate of failed replication queue actions across the cluster.

On hovering over the graph, the values for the following metrics are displayed:

Metric	Description
Replicas Added Errors / Sec	The number of failed replica additions processed by the replicate queue, as tracked by the `queue.replicate.addreplica.error` metric.
Replicas Removed Errors / Sec	The number of failed replica removals processed by the replicate queue, as tracked by the `queue.replicate.removereplica.error` metric.
Dead Replicas Replaced Errors / Sec	The number of failed dead replica replacements processed by the replicate queue, as tracked by the `queue.replicate.replacedeadreplica.error` metric.
Dead Replicas Removed Errors / Sec	The number of failed dead replica removals processed by the replicate queue, as tracked by the `queue.replicate.removedeadreplica.error` metric.
Decommissioning Replicas Replaced Errors / Sec	The number of failed decommissioning replica replacements processed by the replicate queue, as tracked by the `queue.replicate.replacedecommissioningreplica.error` metric.
Decommissioning Replicas Removed Errors / Sec	The number of failed decommissioning replica removals processed by the replicate queue, as tracked by the `queue.replicate.removedecommissioningreplica.error` metric.
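
Because each replicate queue action has a `.success` metric and a matching `.error` metric, the two graphs can be combined into a failure fraction per action. The sketch below is one way to do that. The function name `failure_fractions` and the shape of its input (a dictionary from metric name to rate) are assumptions; only the metric names come from the tables on this page.

```python
from typing import Dict, Optional

# Action names taken from the queue.replicate.* metrics listed on this page.
ACTIONS = (
    "addreplica",
    "removereplica",
    "replacedeadreplica",
    "removedeadreplica",
    "replacedecommissioningreplica",
    "removedecommissioningreplica",
)


def failure_fractions(rates: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Return errors / (successes + errors) for each replicate queue action.

    rates: metric name -> rate per second; missing metrics count as 0.
    An action with no activity maps to None rather than 0.0.
    """
    result: Dict[str, Optional[float]] = {}
    for action in ACTIONS:
        ok = rates.get(f"queue.replicate.{action}.success", 0.0)
        err = rates.get(f"queue.replicate.{action}.error", 0.0)
        total = ok + err
        result[action] = err / total if total > 0 else None
    return result
```

Returning `None` for idle actions keeps "no attempts" distinguishable from "all attempts succeeded".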

Decommissioning Errors

The Decommissioning Errors graph shows the rate per second of decommissioning replica replacement failures experienced by the replication queue, by node.

On hovering over the graph, the value for the following metric is displayed:

Metric	Description
`{node} - Replaced Errors / Sec`	The rate per second of failed decommissioning replica replacements processed by the replicate queue by node, as tracked by the `queue.replicate.replacedecommissioningreplica.error` metric.

Other graphs

The Replication dashboard shows other time series graphs that are important for CockroachDB developers:

Leaseholders per Store
Average Queries per Store

For monitoring CockroachDB, it is sufficient to use the Ranges, Replicas per Store, and Replica Quiescence graphs.

Summary and events

Summary panel

A Summary panel of key metrics is displayed to the right of the time series graphs.

Metric	Description
Total Nodes	The total number of nodes in the cluster. Decommissioned nodes are not included in this count.
Capacity Used	The storage capacity used as a percentage of usable capacity allocated across all nodes.
Unavailable Ranges	The number of unavailable ranges in the cluster. A non-zero number indicates an unstable cluster.
Queries per second	The total number of `SELECT`, `UPDATE`, `INSERT`, and `DELETE` queries executed per second across the cluster.
P99 Latency	The 99th percentile of service latency.

Note:

If you are testing your deployment locally with multiple CockroachDB nodes running on a single machine (this is not recommended in production), you must explicitly set the store size per node in order to display the correct capacity. Otherwise, the machine's actual disk capacity will be counted as a separate store for each node, thus inflating the computed capacity.

Events panel

Underneath the Summary panel, the Events panel lists the 5 most recent events logged for all nodes across the cluster. To list all events, click View all events.

The following types of events are listed:

Database created
Database dropped
Table created
Table dropped
Table altered
Index created
Index dropped
View created
View dropped
Schema change reversed
Schema change finished
Node joined
Node decommissioned
Node restarted
Cluster setting changed
