This page shows how to troubleshoot scenarios where you believe replicas are not behaving as specified by your zone configurations.
Cockroach Labs does not recommend modifying zone configurations manually.
Most users should use multi-region SQL statements instead. If additional control is needed, use Zone Config Extensions to augment the multi-region SQL statements.
Prerequisites
This page assumes you have read and understood the following:
- Replication controls > Replication zone levels, which describes how the inheritance hierarchy of replication zones works. This is critical to understand for troubleshooting.
- Monitoring and alerting > Critical nodes endpoint, which is used to monitor if any of your cluster's ranges are under-replicated, or if your data placement constraints are being violated.
- SHOW ZONE CONFIGURATIONS, which is the SQL statement used to view details about the replication zone configuration of various schema objects.
Types of problems
The most common types of problems you may encounter when manually configuring replication zones are:
- The replica location problem, a.k.a., "The replicas are not where they should be". For replica location problems, zone config variables such as constraints, voter_constraints, and lease_preferences may be involved.
- The replica state problem, a.k.a., "The replicas are not how they should be". For replica state problems, zone config variables such as num_replicas, num_voters, and gc.ttlseconds may be involved.
The replica location problem is the most common. It is most often caused by misconfiguration introduced by manually configuring replication zones, which is why manual zone config management is not recommended. Most users should use the multi-region SQL statements. If additional control is needed, Zone config extensions can be used to augment the multi-region SQL statements.
If you just did a cluster restore and are seeing problems with your zone configs, it may be because zone configs are overwritten by a cluster restore.
Troubleshooting steps
Use the following steps to determine which schema objects (if any) have zone configurations that are misconfigured.
The examples assume a local multi-region cockroach demo cluster started using the following command:
cockroach demo --global --nodes 9 --insecure
Next, execute the following statements to set the database regions for the movr database:
ALTER DATABASE movr SET PRIMARY REGION "us-east1";
ALTER DATABASE movr ADD REGION "us-west1";
ALTER DATABASE movr ADD REGION "europe-west1";
Step 1. Start with a target schema object
Look at the zone configuration for the schema object that you think may be misconfigured. Depending on the type of problem, you might have come to this conclusion by monitoring the critical nodes endpoint.
Use the SHOW ZONE CONFIGURATION statement to inspect the target object's zone configuration.
For example, to view the zone configuration for the movr.users table:
SHOW ZONE CONFIGURATION FOR TABLE movr.users;
     target     |                                     raw_config_sql
----------------+-------------------------------------------------------------------------------------------
DATABASE movr | ALTER DATABASE movr CONFIGURE ZONE USING
| range_min_bytes = 134217728,
| range_max_bytes = 536870912,
| gc.ttlseconds = 14400,
| num_replicas = 5,
| num_voters = 3,
| constraints = '{+region=europe-west1: 1, +region=us-east1: 1, +region=us-west1: 1}',
| voter_constraints = '[+region=us-east1]',
| lease_preferences = '[[+region=us-east1]]'
(1 row)
The preceding output is expected for a multi-region cluster spread across 9 nodes that was configured using multi-region SQL. Since nothing about the zone configuration has been manually modified, the output shows that movr.users is using the zone configuration from its parent movr database. For more information about how this works, see How zone config inheritance works.
However, if the zone configuration had been manually modified, there could be inconsistencies in the output that would show a misconfiguration.
For example:
- If the type of problem were a constraint violation, you'd want to check whether the values in constraints, voter_constraints, and lease_preferences are logically inconsistent, which would cause constraint violations (see the hypothetical example after this list).
- If the type of problem were an under-replicated range, you'd want to check the values of num_replicas and num_voters. Modifying these values can cause under-replication. For example, num_replicas could be set too low. It's also easy to make an arithmetic mistake when configuring num_voters; if num_voters is less than num_replicas, the difference dictates the number of non-voting replicas. This is why most users should control non-voting replica placement with the high-level multi-region SQL features instead.
- If range splits are failing, the value of gc.ttlseconds may be too high.
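For illustration only, the following hypothetical statement shows the kind of logical inconsistency described in the first bullet, using the region names from the demo cluster above: it pins all voting replicas to us-west1 while still preferring a leaseholder in us-east1, a preference that can never be satisfied because the leaseholder must be a voting replica.
ALTER TABLE movr.users CONFIGURE ZONE USING
    voter_constraints = '[+region=us-west1]',
    lease_preferences = '[[+region=us-east1]]';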
If the zone configuration for the target schema object looks good, move to Step 2.
If the zone configuration does not look right, repair it now using ALTER TABLE ... CONFIGURE ZONE. You can set the problem-causing field to another value, but often the best thing to do is to discard the changed settings using ALTER TABLE ... CONFIGURE ZONE DISCARD so that it returns to inheriting values from its parent object:
ALTER TABLE movr.users CONFIGURE ZONE DISCARD;
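Alternatively, if you want to keep the rest of a customized zone configuration and only correct the problem-causing field, you can set that field explicitly. As a sketch, the following statement sets num_replicas back to 5 to match the parent movr database shown in the earlier output; substitute whatever value is correct for your cluster:
ALTER TABLE movr.users CONFIGURE ZONE USING num_replicas = 5;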
Step 2. Move upward in the inheritance hierarchy as needed
If the target schema object looked good in Step 1, look at its parent schema object at the next level up in the inheritance hierarchy.
This is the new target schema object. Return to the previous step and follow the instructions there.
Continue this process recursively until you either find the misconfigured zone configuration or make it all the way up to the default replication zone and confirm that all of your schema objects have the expected zone configurations.
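For example, if the movr.users table looked fine in Step 1, you would inspect its parent database next, and eventually the default replication zone if everything in between also looks correct:
SHOW ZONE CONFIGURATION FOR DATABASE movr;
SHOW ZONE CONFIGURATION FOR RANGE default;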
Considerations
Monitor for ranges that are under-replicated or violating constraints
Monitor the output of the critical nodes endpoint to see if you have ranges that are under-replicated or violating constraints.
Using the range IDs from that endpoint, you can map from range IDs to schema objects as described in the following example. This will give you a target schema object to start from.
To monitor for under-replicated range IDs, see Critical nodes endpoint > Replication status - under-replicated ranges.
To monitor for range IDs that are violating constraints, see Critical nodes endpoint > Replication status - constraint violation.
Once you have a range ID, you need to map from that ID to the name of a schema object that you can pass to SHOW ZONE CONFIGURATIONS.
The following example query uses the SHOW RANGES statement to show, for each range ID, which tables and indexes use that range for their underlying storage. The query assumes the movr schema that is loaded by cockroach demo, so you'll need to modify it to work with your schema.
WITH movr_tables AS (SHOW RANGES FROM DATABASE movr WITH TABLES),
movr_indexes AS (SHOW RANGES FROM DATABASE movr WITH INDEXES)
SELECT array_agg(movr_indexes.range_id) AS ranges,
movr_tables.table_name,
movr_indexes.index_name
FROM movr_tables, movr_indexes
WHERE movr_tables.range_id = movr_indexes.range_id
GROUP BY movr_tables.table_name, movr_indexes.index_name
ORDER BY table_name, index_name;
In the following output, each range ID in the leftmost column maps to a table name and index name in the subsequent columns. For example, given this output, if the critical nodes endpoint said the constraint violation was for range ID 101, we would know to look at the zone configuration(s) for the users table and the users_pkey primary index.
ranges | table_name | index_name
-------------------------------------+----------------------------+------------------------------------------------
{150} | promo_codes | promo_codes_pkey
{150} | promo_codes | rides_auto_index_fk_city_ref_users
{150} | promo_codes | rides_auto_index_fk_vehicle_city_ref_vehicles
{150} | promo_codes | rides_pkey
{150} | promo_codes | user_promo_codes_pkey
{150} | promo_codes | vehicle_location_histories_pkey
{150} | rides | promo_codes_pkey
{150} | rides | rides_auto_index_fk_city_ref_users
{150} | rides | rides_auto_index_fk_vehicle_city_ref_vehicles
{83,153,152,96,154,95,141,140,150} | rides | rides_pkey
{150} | rides | user_promo_codes_pkey
{150} | rides | vehicle_location_histories_pkey
{83} | rides | vehicles_auto_index_fk_city_ref_users
{83} | rides | vehicles_pkey
{150} | user_promo_codes | promo_codes_pkey
{150} | user_promo_codes | rides_auto_index_fk_city_ref_users
{150} | user_promo_codes | rides_auto_index_fk_vehicle_city_ref_vehicles
{150} | user_promo_codes | rides_pkey
{150} | user_promo_codes | user_promo_codes_pkey
{150} | user_promo_codes | vehicle_location_histories_pkey
{101,73,72,71,100,70,81,80,90} | users | users_pkey
{90} | users | vehicles_pkey
{150} | vehicle_location_histories | promo_codes_pkey
{150} | vehicle_location_histories | rides_auto_index_fk_city_ref_users
{150} | vehicle_location_histories | rides_auto_index_fk_vehicle_city_ref_vehicles
{150} | vehicle_location_histories | rides_pkey
{150} | vehicle_location_histories | user_promo_codes_pkey
{150} | vehicle_location_histories | vehicle_location_histories_pkey
{83} | vehicles | rides_pkey
{90} | vehicles | users_pkey
{83} | vehicles | vehicles_auto_index_fk_city_ref_users
{90,94,93,92,85,91,84,82,83} | vehicles | vehicles_pkey
(32 rows)
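If you are only interested in one of the range IDs reported by the critical nodes endpoint, you can filter rather than aggregate. For example, a query along the following lines (again assuming the movr demo schema) shows which tables use range 101 for their underlying storage:
SELECT DISTINCT range_id, table_name
FROM [SHOW RANGES FROM DATABASE movr WITH TABLES]
WHERE range_id = 101;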
Replication system priorities: data placement vs. data durability
As noted in Data Domiciling with CockroachDB:
Zone configs can be used for data placement but these features were historically built for performance, not for domiciling. The replication system's top priority is to prevent the loss of data and it may override the zone configurations if necessary to ensure data durability.
Performance
Changes you make to a schema object's zone configuration may take some time to be reflected in the schema object's actual state, and can result in an increase in CPU usage, IOPS, and network traffic while the cluster rebalances replicas to meet the provided constraints. This is especially true for larger clusters.
For more information about how replica rebalancing works, see Load-based replica rebalancing.
Zone configs are overwritten during cluster restore
During a cluster restore, any zone configurations present on the destination cluster are overwritten with the zone configurations from the backed-up cluster. If no customized zone configurations were on the cluster when the backup was taken, then after the restore the destination cluster will use the zone configuration from the RANGE DEFAULT configuration.
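After a restore completes, you can list every zone configuration on the destination cluster with SHOW ZONE CONFIGURATIONS and confirm that the results match what you expect:
SHOW ZONE CONFIGURATIONS;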
For more information, see RESTORE considerations.
See also
- Troubleshooting overview
- Troubleshoot Self-Hosted Setup > Replication issues
- Replication Controls
- Critical nodes status endpoint: check the status of your cluster's data replication, data placement, and zone constraint conformance.