Why is my process hanging when I try to start nodes with the --background flag?
Cockroach Labs recommends against using the --background flag when starting a cluster. In production, operators usually use a process manager like systemd to start and manage the cockroach process on each node. Refer to Deploy CockroachDB On-Premises. When testing locally, starting nodes in the foreground is recommended so you can monitor the runtime closely.
If you do use --background, you should also set --pid-file. To stop or restart a cluster, send a SIGTERM or SIGHUP signal to the process ID in the PID file.
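As a minimal sketch of the signaling step, the following Python helper reads a PID file and sends a signal to the process it names. It assumes a POSIX system and a PID file that contains a single integer; the function names and any file path you pass in are hypothetical, not part of CockroachDB.

```python
import os
import signal


def read_pid(pid_file: str) -> int:
    """Return the process ID stored in a PID file (assumed: one integer)."""
    with open(pid_file) as f:
        return int(f.read().strip())


def signal_node(pid_file: str, sig: int = signal.SIGTERM) -> int:
    """Send `sig` to the process named in `pid_file` and return its PID.

    On POSIX systems, passing sig=0 only checks that the process exists.
    """
    pid = read_pid(pid_file)
    os.kill(pid, sig)
    return pid
```

For example, `signal_node(path_to_pid_file)` would send SIGTERM to the node whose PID was written by --pid-file.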
Check whether you have previously run a multi-node cluster using the same data directory. If you have not, refer to Troubleshoot Cluster Setup.
If you have previously started and stopped a multi-node cluster, and are now trying to bring it back up, note the following:
The --background flag of cockroach start causes the start command to wait until the node has fully initialized and is able to start serving queries. In addition, to keep your data consistent, CockroachDB waits until a majority of nodes are running. This means that if only one node of a three-node cluster is running, that one node will not be operational.
As a result, starting nodes with the --background flag will cause cockroach start to hang until a majority of nodes are fully initialized.
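The majority rule described above can be illustrated with a short Python sketch. This is a simplification that counts whole nodes, as the text does; the function names are hypothetical.

```python
def majority(total_nodes: int) -> int:
    """Smallest number of nodes that forms a majority of the cluster."""
    return total_nodes // 2 + 1


def has_majority(running_nodes: int, total_nodes: int) -> bool:
    """True if enough nodes are running for the cluster to become operational."""
    return running_nodes >= majority(total_nodes)
```

With a three-node cluster, `majority(3)` is 2, so a single running node is not enough and a start command using --background on that node keeps waiting.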
To restart your cluster, you should either:
- Use multiple terminal windows to start multiple nodes in the foreground.
- Start each node in the background using your shell's functionality (e.g., cockroach start &) instead of using the --background flag.
Why is memory usage increasing despite lack of traffic?
Like most databases, CockroachDB caches the most recently accessed data in memory so that it can provide faster reads, and its periodic writes of time-series data cause that cache size to increase until it hits its configured limit. For information about manually controlling the cache size, see Recommended Production Settings.
Why is disk usage increasing despite lack of writes?
By default, DB Console stores time-series cluster metrics within the cluster. By default, data is retained at 10-second granularity for 10 days, and at 30-minute granularity for 90 days. An automatic job periodically runs and prunes historical data. For the first several days of your cluster's life, the cluster's time-series data grows continually.
CockroachDB writes about 15 KiB per second per node to the time-series database. About half of that is optimized away by the storage engine, leaving roughly 8 KiB per second. Therefore, an estimate of how much data will be stored in the time-series database is:
8 KiB/second * 24 hours/day * 3600 seconds/hour * number of days
For the first 10 days of your cluster's life, you can expect storage per node to increase by about the following amount:
8 * 24 * 3600 * 10 = 6912000 KiB
or about 6.6 GiB. With on-disk compression, the actual disk usage is likely to be about 4 GiB.
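The estimate above can be reproduced with a small Python sketch. The 8 KiB per second figure comes from the text; the constant and function names are illustrative only, and the result is before on-disk compression.

```python
NET_KIB_PER_SECOND = 8          # ~15 KiB/s written, about half optimized away
SECONDS_PER_DAY = 24 * 3600


def timeseries_kib_per_node(days: int) -> int:
    """Estimated time-series data stored per node after `days`, in KiB."""
    return NET_KIB_PER_SECOND * SECONDS_PER_DAY * days


def kib_to_gib(kib: float) -> float:
    """Convert KiB to GiB (1 GiB = 1024 * 1024 KiB)."""
    return kib / (1024 * 1024)
```

Calling `timeseries_kib_per_node(10)` gives 6912000 KiB, and `kib_to_gib` converts that to roughly 6.6 GiB.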
However, depending on your usage of time-series charts in the DB Console, you may prefer to reduce the amount of disk used by time-series data. To reduce the amount of time-series data stored, or to disable it altogether, refer to Can I reduce or disable the storage of time-series data?
Why is my disk usage not decreasing after deleting data?
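There are several reasons why disk usage may not decrease right after deleting data. The data could be preserved for MVCC history, or it could be in the process of being compacted. Each case is described below.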
For instructions on how to free up disk space as quickly as possible after dropping a table, see How can I free up disk space that was used by a dropped table?
The data could be preserved for MVCC history
CockroachDB implements Multi-Version Concurrency Control (MVCC), which means that it maintains a history of all mutations to a row. This history is used for a wide range of functionality: transaction isolation, historical AS OF SYSTEM TIME queries, incremental backups, changefeeds, cluster replication, and so on. The requirement to preserve history means that CockroachDB "soft deletes" data: The data is marked as deleted by a tombstone record so that CockroachDB will no longer surface the deleted rows to queries, but the old data is still present on disk.
The length of history preserved by MVCC is determined by two things: the gc.ttlseconds of the zone that contains the data, and whether any protected timestamps exist. You can check the range's statistics to observe the key_bytes, value_bytes, and live_bytes. The live_bytes metric reflects data that's not garbage. The value of (key_bytes + value_bytes) - live_bytes will tell you how much MVCC garbage is resident within a range.
This information can be accessed in the following ways:
- Using the SHOW RANGES SQL statement, which lists the above values under the names live_bytes, key_bytes, and val_bytes.
- In the DB Console, under Advanced Debug Page > Even more Advanced Debugging, click the Range Status link, which takes you to a page where the values are displayed in a tabular format like the following: MVCC Live Bytes/Count | 2.5 KiB / 62 count.
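The garbage calculation described above can be sketched in Python as follows. The function names are hypothetical, and the byte counts in any call are values you would copy from a range's statistics; nothing here queries a cluster.

```python
def mvcc_garbage_bytes(key_bytes: int, val_bytes: int, live_bytes: int) -> int:
    """MVCC garbage resident in a range: (key_bytes + val_bytes) - live_bytes."""
    return (key_bytes + val_bytes) - live_bytes


def mvcc_garbage_fraction(key_bytes: int, val_bytes: int, live_bytes: int) -> float:
    """Share of the range's key and value bytes that is garbage (0.0 if empty)."""
    total = key_bytes + val_bytes
    if total == 0:
        return 0.0
    return mvcc_garbage_bytes(key_bytes, val_bytes, live_bytes) / total
```

For instance, a range with 1000 key bytes, 4000 value bytes, and 2000 live bytes holds 3000 bytes of garbage.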
When data has been deleted for at least the duration specified by gc.ttlseconds, CockroachDB will consider it eligible for 'garbage collection'. Asynchronously, CockroachDB will perform garbage collection of ranges that contain significant quantities of garbage. Note that if there are backups or other processes that haven't completed yet but require the data, these processes may prevent the garbage collection of that data by setting a protected timestamp until these processes have completed.
For more information about how MVCC works, see MVCC.
The data could be in the process of being compacted
When MVCC garbage is deleted by garbage collection, the data is still not yet physically removed from the filesystem by the Storage Layer. Removing data from the filesystem requires rewriting the files containing the data using a process also known as compaction, which can be expensive. The storage engine has heuristics to compact data and remove deleted rows when enough garbage has accumulated to warrant a compaction. It strives to always restrict the overhead of obsolete data (called the space amplification) to at most 10%. If a lot of data was just deleted, it may take the storage engine some time to compact the files and restore this property.
For instructions on how to free up disk space as quickly as possible after dropping a table, see How can I free up disk space that was used by a dropped table?
How can I free up disk space when dropping a table?
If you've noticed that your disk space is not freeing up quickly enough after dropping a table, you can take the following steps to free up disk space more quickly the next time you drop a table. This example assumes a table t exists.
The procedure shown here only works if you get the range IDs from the table before running DROP TABLE. If you are in an emergency situation due to running out of disk, see What happens when a node runs out of disk space?
1. Lower the gc.ttlseconds parameter to 10 minutes:
ALTER TABLE t CONFIGURE ZONE USING gc.ttlseconds = 600;
2. Find the IDs of the ranges storing the table data using SHOW RANGES:
SELECT range_id FROM [SHOW RANGES FROM TABLE t];
  range_id
------------
        68
        69
        70
...
3. Drop the table using DROP TABLE:
DROP TABLE t;
4. Visit the Advanced Debug page and click the link Run a range through an internal queue to visit the Manually enqueue range in a replica queue page. On this page, select mvccGC from the Queue dropdown and enter each range ID from the previous step. Check the SkipShouldQueue checkbox to speed up the MVCC garbage collection process.
5. Monitor GC progress in the DB Console by watching the MVCC GC Queue and the overall disk space used as shown on the Overview Dashboard.
What is the internal-delete-old-sql-stats process and why is it consuming my resources?
When a query is executed, a process records query execution statistics on system tables. This is done by recording SQL statement fingerprints.
The CockroachDB internal-delete-old-sql-stats process cleans up query execution statistics collected on system tables, including system.statement_statistics and system.transaction_statistics. These system tables have a default row limit of 1 million, set by the sql.stats.persisted_rows.max cluster setting. When this limit is exceeded, there is an hourly cleanup job that deletes all of the data that surpasses the row limit, starting with the oldest data first. For more information about the cleanup job, use the following query:
> SELECT * FROM crdb_internal.jobs WHERE job_type='AUTO SQL STATS COMPACTION';
In general, the internal-delete-old-sql-stats process is not expected to impact cluster performance. There are a few cases where CPU usage spiked because a very large amount of data was being processed; however, those cases were resolved through workload optimizations and general improvements over time.
Can I reduce or disable the storage of time-series data?
Yes, you can either reduce the interval for time-series storage or disable time-series storage entirely.
After reducing or disabling time-series storage, it can take up to 24 hours for time-series data to be deleted and for the change to be reflected in DB Console metrics.
Reduce the interval for time-series storage
To reduce the interval for storage of time-series data:
- For data stored at 10-second resolution, reduce the timeseries.storage.resolution_10s.ttl cluster setting to an INTERVAL value less than 240h0m0s (10 days).
For example, to change the storage interval for time-series data at 10s resolution to 5 days, run the following SET CLUSTER SETTING command:
> SET CLUSTER SETTING timeseries.storage.resolution_10s.ttl = '120h0m0s';
> SHOW CLUSTER SETTING timeseries.storage.resolution_10s.ttl;
timeseries.storage.resolution_10s.ttl
+---------------------------------------+
120:00:00
(1 row)
This setting has no effect on time-series data aggregated at 30-minute resolution, which is stored for 90 days by default.
- For data stored at 30-minute resolution, reduce the timeseries.storage.resolution_30m.ttl cluster setting to an INTERVAL value less than 2160h0m0s (90 days).
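To make the interval values above concrete, here is a small Python sketch that converts a retention period in days to the hour-based interval string and builds the matching statement. The two setting names are taken from the text; the helper names and the "10s"/"30m" keys are hypothetical conveniences.

```python
TTL_SETTINGS = {
    "10s": "timeseries.storage.resolution_10s.ttl",
    "30m": "timeseries.storage.resolution_30m.ttl",
}


def ttl_interval(days: int) -> str:
    """Format a retention period in days as an hour-based interval string."""
    if days < 0:
        raise ValueError("days must be non-negative")
    return f"{days * 24}h0m0s"


def set_ttl_statement(resolution: str, days: int) -> str:
    """Build the SET CLUSTER SETTING statement for the given resolution."""
    return f"SET CLUSTER SETTING {TTL_SETTINGS[resolution]} = '{ttl_interval(days)}';"
```

Here `ttl_interval(10)` yields 240h0m0s and `ttl_interval(90)` yields 2160h0m0s, matching the default retention periods.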
Cockroach Labs recommends that you avoid increasing the period of time that DB Console retains time-series metrics. If you need to retain this data for a longer period, consider using a third-party tool such as Prometheus to collect the cluster's metrics and disabling the DB Console's collection of time-series metrics. Refer to Monitoring and Alerting.
Disable time-series storage
Disabling time-series storage is recommended only if you exclusively use a third-party tool such as Prometheus for time-series monitoring. Prometheus and other such tools do not rely on CockroachDB-stored time-series data; instead, they ingest metrics exported by CockroachDB from memory and then store the data themselves.
When storage of time-series metrics is disabled, the Metrics dashboards in the DB Console are still available, but their visualizations are blank because the dashboards rely on data that is no longer available.
To disable the storage of time-series data, run the following command:
> SET CLUSTER SETTING timeseries.storage.enabled = false;
> SHOW CLUSTER SETTING timeseries.storage.enabled;
timeseries.storage.enabled
+----------------------------+
false
(1 row)
This setting only prevents the collection of new time-series data. To delete all existing time-series data as well, change both the timeseries.storage.resolution_10s.ttl and timeseries.storage.resolution_30m.ttl cluster settings:
> SET CLUSTER SETTING timeseries.storage.resolution_10s.ttl = '0s';
> SET CLUSTER SETTING timeseries.storage.resolution_30m.ttl = '0s';
Historical data is not deleted immediately, but is eventually removed by a background job within 24 hours.
What happens when a node runs out of disk space?
When a node runs out of disk space, it shuts down and cannot be restarted until space is freed up.
To prepare for this case, CockroachDB automatically creates an emergency ballast file in each node's storage directory that can be deleted to free up enough space to be able to restart the node.
For more information about troubleshooting disk usage issues, see storage issues.
In addition to using ballast files, it is important to actively monitor remaining disk space.
For instructions on how to free up disk space as quickly as possible after dropping a table, see How can I free up disk space that was used by a dropped table?
Why would increasing the number of nodes not result in more operations per second?
If queries operate on different data, then increasing the number of nodes should improve the overall throughput (transactions/second or QPS).
However, if your queries operate on the same data, you may be observing transaction contention. For details, see Transaction Contention.
Why does CockroachDB collect anonymized cluster usage details by default?
Cockroach Labs collects information about CockroachDB's real-world usage to help prioritize the development of product features. We set the default to "opt-in" (reporting is enabled unless you opt out) to strengthen the information collected, and are careful to send only anonymous, aggregate usage statistics. For details on what information is collected and how to opt out, see Diagnostics Reporting.
What happens when node clocks are not properly synchronized?
CockroachDB requires moderate levels of clock synchronization to preserve data consistency. For this reason, when a node detects that its clock is out of sync with at least half of the other nodes in the cluster by 80% of the maximum offset allowed, it spontaneously shuts down. This offset defaults to 500ms but can be changed via the --max-offset flag when starting each node.
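As a quick worked example of the 80% rule, the following Python sketch computes the offset at which a node would shut down for a given maximum offset. The 500ms default and the 80% figure come from the text; the function name is illustrative.

```python
DEFAULT_MAX_OFFSET_MS = 500   # default value of --max-offset
SHUTDOWN_PERCENT = 80         # node shuts down at 80% of the maximum offset


def shutdown_threshold_ms(max_offset_ms: int = DEFAULT_MAX_OFFSET_MS) -> float:
    """Clock offset, in milliseconds, at which a node shuts itself down."""
    return max_offset_ms * SHUTDOWN_PERCENT / 100
```

With the default setting the threshold is 400ms; with --max-offset set to 250ms it is 200ms.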
While serializable consistency is maintained regardless of clock skew, skew outside the configured clock offset bounds can result in violations of single-key linearizability between causally dependent transactions. It's therefore important to prevent clocks from drifting too far by running NTP or other clock synchronization software on each node.
In very rare cases, CockroachDB can momentarily run with a stale clock. This can happen when using vMotion, which can suspend a VM running CockroachDB, migrate it to different hardware, and resume it. This will cause CockroachDB to be out of sync for a short period before it jumps to the correct time. During this window, it would be possible for a client to read stale data and write data derived from stale reads. By enabling the server.clock.forward_jump_check_enabled cluster setting, you can be alerted when the CockroachDB clock jumps forward, indicating it had been running with a stale clock. To protect against this on vMotion, however, use the --clock-device flag to specify a PTP hardware clock for CockroachDB to use when querying the current time. When doing so, you should not enable server.clock.forward_jump_check_enabled because forward jumps will be expected and harmless. For more information on how --clock-device interacts with vMotion, see this blog post.
In CockroachDB versions prior to v22.2.13, and in v23.1 versions prior to v23.1.9, the --clock-device flag had a bug that could cause it to generate timestamps in the far future. This could cause nodes to crash due to incorrect timestamps, or in the worst case irreversibly advance the cluster's HLC clock into the far future. This bug is fixed in CockroachDB v23.2.
Considerations
When setting up clock synchronization:
- All nodes in the cluster must be synced to the same time source, or to different sources that implement leap second smearing in the same way. For example, Google and Amazon have time sources that are compatible with each other (they implement leap second smearing in the same way), but are incompatible with the default NTP pool (which does not implement leap second smearing).
- For nodes running in AWS, we recommend Amazon Time Sync Service. For nodes running in GCP, we recommend Google's internal NTP service. For nodes running elsewhere, we recommend Google Public NTP. Note that the Google and Amazon time services can be mixed with each other, but they cannot be mixed with other time services (unless you have verified leap second behavior). Either all of your nodes should use the Google and Amazon services, or none of them should.
- If you do not want to use the Google or Amazon time sources, you can use chrony and enable client-side leap smearing, unless the time source you're using already does server-side smearing. In most cases, we recommend the Google Public NTP time source because it handles smearing the leap second. If you use a different NTP time source that doesn't smear the leap second, you must configure client-side smearing manually and do so in the same way on each machine.
- Do not run more than one clock sync service on VMs where cockroach is running.
- For new clusters using the multi-region SQL abstractions, Cockroach Labs recommends lowering the --max-offset setting to 250ms. This setting is especially helpful for lowering the write latency of global tables. Nodes can run with different values for --max-offset, but only for the purpose of updating the setting across the cluster using a rolling upgrade.
Tutorials
For guidance on synchronizing clocks, see the tutorial for your deployment environment:
Environment | Featured Approach |
---|---|
On-Premises | Use NTP with Google's external NTP service. |
AWS | Use the Amazon Time Sync Service. |
Azure | Disable Hyper-V time synchronization and use NTP with Google's external NTP service. |
Digital Ocean | Use NTP with Google's external NTP service. |
GCE | Use NTP with Google's internal NTP service. |
How can I tell how well node clocks are synchronized?
As explained in more detail in our monitoring documentation, each CockroachDB node exports a wide variety of metrics at http://<host>:<http-port>/_status/vars in the format used by the popular Prometheus timeseries database. Two of these metrics export how close each node's clock is to the clock of all other nodes:
Metric | Definition |
---|---|
clock_offset_meannanos | The mean difference between the node's clock and other nodes' clocks in nanoseconds |
clock_offset_stddevnanos | The standard deviation of the difference between the node's clock and other nodes' clocks in nanoseconds |
As described in the above answer, a node will shut down if the mean offset of its clock from the other nodes' clocks exceeds 80% of the maximum offset allowed. It's recommended to monitor the clock_offset_meannanos metric and alert if it's approaching the 80% threshold of your cluster's configured max offset.
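The monitoring advice above could be sketched in Python roughly as follows. This is an illustration only: it parses Prometheus-style text that you have already fetched from the /_status/vars endpoint, assumes each sample line ends with the value (no trailing timestamp), and uses a hypothetical warning level of 50% of the max offset, which is an assumption rather than a documented recommendation. The sample text and its values are made up.

```python
from typing import List

# Hypothetical sample of Prometheus-format output; values are invented.
SAMPLE = """\
# HELP clock_offset_meannanos Mean clock offset with other nodes
# TYPE clock_offset_meannanos gauge
clock_offset_meannanos 3.5e+08
clock_offset_stddevnanos 1.2e+06
"""


def parse_metric(exposition_text: str, name: str) -> List[float]:
    """Return all sample values for metric `name`, ignoring comments and labels."""
    values = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        metric_part, value = line.rsplit(None, 1)
        if metric_part.split("{", 1)[0] == name:
            values.append(float(value))
    return values


def offset_alert(mean_offset_nanos: float, max_offset_ms: int = 500,
                 warn_fraction: float = 0.5) -> bool:
    """True if the mean offset has reached the (assumed) warning fraction."""
    return abs(mean_offset_nanos) >= max_offset_ms * 1_000_000 * warn_fraction
```

A monitoring script might call `parse_metric(text, "clock_offset_meannanos")` on each node's output and raise an alert when `offset_alert` returns True, well before the 80% shutdown threshold is reached.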
You can also see these metrics in the Clock Offset graph on the DB Console.
How do I prepare for planned node maintenance?
Perform a node shutdown to temporarily stop a node that you plan to restart.