This feature is in preview and subject to change. To share feedback and/or issues, contact Support.
You can monitor logical data replication (LDR) using:

- SHOW LOGICAL REPLICATION JOBS in the SQL shell to view a list of LDR jobs on the cluster.
- The Logical Data Replication dashboard on the DB Console to view metrics at the cluster level.
- Prometheus and Alertmanager to track and alert on LDR metrics.
- Metrics export with Datadog.
- Metrics labels to view metrics at the job level.
There are some tradeoffs between enabling one table per LDR job versus multiple tables in one LDR job. Multiple tables in one LDR job can be easier to operate. For example, if you pause and resume the single job, LDR will stop and resume for all of the tables. However, the most granular level of observability will be at the job level. One table in one LDR job will allow for table-level observability.
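For example, the following minimal sketch (using a hypothetical job ID) pauses and then resumes a single LDR job, which stops and restarts replication for every table in that job:

-- Hypothetical job ID; find the actual ID with SHOW LOGICAL REPLICATION JOBS.
PAUSE JOB 1012877040439033857;
-- Replication stops for all tables in the job until it is resumed.
RESUME JOB 1012877040439033857;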
When you start an LDR stream, one job is created on each cluster:

- The history retention job on the source cluster, which runs while the LDR job is active to protect changes in the table from garbage collection until they have been applied to the destination cluster. The history retention job is viewable in the DB Console or with SHOW JOBS (see the example following this list). Any manual changes to the history retention job could disrupt the LDR job.
- The logical replication job on the destination cluster. You can view the status of this job in the SQL shell with SHOW LOGICAL REPLICATION JOBS and on the DB Console Jobs page.
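On the source cluster, a minimal sketch of locating the history retention job by filtering the general SHOW JOBS output; the ILIKE pattern on the job description is an assumption, so confirm it against the output on your cluster:

-- Filter general job output on the source cluster for replication-related jobs.
-- The description pattern is an assumption; inspect SHOW JOBS output to confirm.
SELECT job_id, job_type, status, description
  FROM [SHOW JOBS]
  WHERE description ILIKE '%replication%';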
SQL Shell
In the destination cluster's SQL shell, you can query SHOW LOGICAL REPLICATION JOBS to view the LDR jobs running on the cluster:
SHOW LOGICAL REPLICATION JOBS;
job_id | status | targets | replicated_time
----------------------+---------+---------------------------+------------------
1012877040439033857 | running | {database.public.table} | NULL
(1 row)
For additional detail on each LDR job, use the WITH details option:
SHOW LOGICAL REPLICATION JOBS WITH details;
job_id | status | targets | replicated_time | replication_start_time | conflict_resolution_type | description
----------------------+----------+--------------------------------+-------------------------------+-------------------------------+--------------------------+-----------------------------------------------------------------------------------------
1010959260799270913 | running | {movr.public.promo_codes} | 2024-10-24 17:50:05+00 | 2024-10-10 20:04:42.196982+00 | LWW | LOGICAL REPLICATION STREAM into movr.public.promo_codes from external://cluster_a
1014047902397333505 | canceled | {defaultdb.public.office_dogs} | 2024-10-24 17:30:25+00 | 2024-10-21 17:54:20.797643+00 | LWW | LOGICAL REPLICATION STREAM into defaultdb.public.office_dogs from external://cluster_a
You can also use SHOW JOBS for general job details.
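For example, a minimal sketch that narrows the general SHOW JOBS output on the destination cluster to LDR jobs; the job_type value is an assumption, so confirm it against your SHOW JOBS output:

-- Narrow general job output to logical replication jobs.
-- The job_type string is an assumption; confirm it on your cluster.
SELECT job_id, status, description
  FROM [SHOW JOBS]
  WHERE job_type = 'LOGICAL REPLICATION';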
Responses
Field | Response
---|---
job_id | The job's ID. Use with CANCEL JOB, PAUSE JOB, RESUME JOB, and SHOW JOB.
status | The job's current state. Possible values: pending, paused, pause-requested, failed, succeeded, canceled, cancel-requested, running, retry-running, retry-reverting, reverting, revert-failed. Refer to Jobs status for a description of each status.
targets | The fully qualified name of the table(s) that are part of the LDR job.
replicated_time | The latest timestamp at which the destination cluster has consistent data. This time advances automatically as long as the LDR job proceeds without error. replicated_time is updated periodically (every 30s).
replication_start_time | The start time of the LDR job.
conflict_resolution_type | The type of conflict resolution: LWW (last write wins).
description | Description of the job, including the replicating table(s) and the source cluster connection.
Recommended LDR metrics to track
- Replication latency: The commit-to-commit replication latency, which is tracked from when a row is committed on the source cluster to when it is applied on the destination cluster. An LDR commit is when the job either applies a row successfully to the destination cluster or adds a row to the dead letter queue (DLQ).
  - logical_replication.commit_latency-p50
  - logical_replication.commit_latency-p99
- Replication lag: How far the destination cluster is behind the source cluster at a specific point in time. The replication lag is equivalent to RPO during a disaster. Calculate the replication lag with this metric, for example: time.now() - replicated_time_seconds (see the SQL sketch after this list).
  - logical_replication.replicated_time_seconds
- Row updates applied: These metrics indicate whether the destination cluster is actively receiving and applying data from the source cluster.
  - logical_replication.events_ingested
  - logical_replication.events_dlqed
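As a rough point-in-time check, the following minimal SQL sketch estimates replication lag on the destination cluster from the replicated_time column; it assumes SHOW LOGICAL REPLICATION JOBS can be used as a statement source inside square brackets, as with other SHOW statements:

-- Estimate replication lag per LDR job on the destination cluster.
-- replicated_time can be NULL (for example, during the initial scan), in which
-- case the computed lag is also NULL.
SELECT job_id,
       now() - replicated_time AS replication_lag
  FROM [SHOW LOGICAL REPLICATION JOBS]
  WHERE status = 'running';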
DB Console
In the DB Console, you can use:
- The Metrics dashboard for LDR to view metrics for the job on the destination cluster.
- The Jobs page to view the history retention job on the source cluster and the LDR job on the destination cluster.
The metrics for LDR in the DB Console are at the cluster level. This means that if there are multiple LDR jobs running on a cluster, the DB Console will show the average metrics across jobs.
Metrics dashboard
You can use the Logical Data Replication dashboard of the destination cluster to monitor the following metric graphs at the cluster level:
- Replication latency
- Replication lag
- Row updates applied
- Logical bytes received
- Batch application processing time: 50th percentile
- Batch application processing time: 99th percentile
- DLQ causes
- Retry queue size
To track replicated time, ingested events, and events added to the DLQ at the job level, refer to Metrics labels.
Jobs page
On the Jobs page, select:
- The Replication Producer in the source cluster's DB Console to view the history retention job.
- The Logical Replication Ingestion job in the destination cluster's DB Console. When you start LDR, the Logical Replication Ingestion job will show a bar that tracks the initial scan progress of the source table's existing data.
Monitoring and alerting
Prometheus
You can use Prometheus and Alertmanager to track and alert on LDR metrics. Refer to the Monitor CockroachDB with Prometheus tutorial for steps to set up Prometheus.
Metrics labels
To view metrics at the job level, you can use the label option when you start LDR to add a metrics label to the LDR job. This enables the export of child metrics, which are Prometheus time series with extra labels. You can track the following metrics for an LDR job with labels:
- logical_replication.replicated_time_seconds
- logical_replication.events_ingested
- logical_replication.events_dlqed
- logical_replication.scanning_ranges
- logical_replication.catchup_ranges
To use metrics labels, ensure you have enabled the child metrics cluster setting:
SET CLUSTER SETTING server.child_metrics.enabled = true;
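Optionally, you can confirm that the setting took effect before starting the job:

-- Verify the child metrics cluster setting is enabled.
SHOW CLUSTER SETTING server.child_metrics.enabled;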
When you start LDR, include the label option:
CREATE LOGICAL REPLICATION STREAM FROM TABLE {database.public.table_name}
ON 'external://{source_external_connection}'
INTO TABLE {database.public.table_name} WITH label=ldr_job;
For a full reference on tracking metrics with labels, refer to the Child Metrics page.
Datadog
You can export metrics to Datadog for LDR jobs. For steps to set up metrics export, refer to the Monitor CockroachDB Self-Hosted with Datadog tutorial.