Technical Advisory 131639


Publication date: October 8, 2024

Description

In v22.2 through v24.1.0, in rare circumstances where there is a sustained period of disk slowness in the presence of lease transfers, it is possible for some writes in a transaction that straddles multiple ranges to be lost. In cases where these lost writes are not also present in one or more secondary indexes that would allow the row to be reconstructed or repaired, the writes could be irrecoverably lost.

Transactions that write to a single range, or transactions that take advantage of the One-Phase-Commit (1PC) optimization, are not affected.

In certain scenarios, the lost writes may be fully recoverable. For example, writes lost from a secondary index can be fully repaired from the primary index, and a primary index can be fully repaired if there exists a set of secondary indexes that fully covers the primary index columns. The lost writes may also be only partially recoverable, for example, if only some of the lost writes from a primary index are contained in secondary indexes.
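
When a secondary index is the damaged side, one illustration of a full repair is to rebuild the index from the primary index, since an index backfill reads the table's primary data. The following is a minimal sketch in CockroachDB SQL; the table and index names are hypothetical:

-- Hypothetical names; substitute the affected table and index.
-- Recreating the index backfills it from the primary index,
-- restoring any entries the lost writes dropped.
DROP INDEX my_table@my_table_a_idx;
CREATE INDEX my_table_a_idx ON my_table (a);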

In detail, since v22.2, when a lease for a range is transferred, it is initially transferred as an expiration-based lease, which is then promoted to an epoch-based lease. When the block device backing a store experiences intermittent slowness over an extended period of time, the expiration time of the lease transferred to the node with the slow store can move backward in time during this promotion; the lease expiration is said to have "regressed". This regression can produce a brief window in which another node can claim the lease, leaving the range with two leaseholders.

Transactions will typically be routed from gateway nodes to the true leaseholder (on the node with the healthy store), and write intents will be written on that true leaseholder. On the node with the slow store, however, those write intents are still queued in the range's Raft log and have not yet been applied. If asynchronous intent resolution (the means by which intents are committed to durable records once a transaction commits) is then triggered on the slow node before it has caught up on its Raft log, intent resolution becomes a no-op instead of marking the intent as committed. A future transaction that interacts with the row may then encounter this intent, and if the committed transaction's record has been garbage collected by that point, it may incorrectly determine that the intent belongs to an aborted transaction. The intent would then be cleaned up, effectively dropping the committed write.

This bug is rare in that it requires intermittent slowness of a block device backing a store, coupled with lease transfers onto the store with the intermittently slow block device, followed by asynchronous intent resolution on the slow node before it has caught up on its Raft log. Writes must touch more than one range: single-range writes, including those that benefit from the One-Phase-Commit (1PC) fast path, are not affected.

The following CockroachDB versions are impacted by this bug:

  • v22.2
  • v23.1.0 - v23.1.26
  • v23.2.0 - v23.2.10
  • v24.1.0 (subsequent patch versions are unaffected)

Versions prior to v23.1 are no longer eligible for maintenance support, and this issue will not be addressed in those versions.

Statement

In #123442, we resolved an issue in CockroachDB's expiration-to-epoch lease promotion process, where a lease's effective expiration could be allowed to regress, resulting in two nodes believing they are the leaseholder for a range.

The patch has been applied to the following maintenance releases of CockroachDB: v23.1.27, v23.2.11, and v24.1.1.

This public issue is tracked by #131639.

Mitigation

Users are encouraged to upgrade to v23.1.27, v23.2.11, v24.1.1, or a later version that includes the patch.

Detection via logs

Users can run the script a131639_check.sh to determine whether any lease expiration regressions are evident from their logs. The script must be run from within a decompressed debug.zip directory.

As described previously, a lease expiration regression is only one of the two race conditions needed to encounter a lost write, with the other being a stale read during asynchronous intent resolution on the range that had the lease expiration regression. Consequently, this script does not provide a complete method of detection for a lost write. Running this script may result in a limited number of false positives. For complete confirmation, this script should be combined with the method described in Detection via an audit of primary and secondary indexes.

Users should not run the script on a production system. Instead, users should move the debug.zip file elsewhere for analysis and run the script there.

The script requires the following dependencies:

Example usage and output:

➜ pwd
/path/of/unzipped/debug

➜ /path/to/script/a131639_check.sh
Searching for liveness epoch increments ...
Searching for slow lease promotions ...
Querying the logs for lease expiration regressions ...

❌ Lease expiration regressions found. Symptoms detected.

symptoms
node 123 observed possible symptoms at 2024-09-01 12:34:56 on ranges 123456

Detection via an audit of primary and secondary indexes

If the detection script finds symptoms of a potential lease expiration regression and you suspect that a particular row may have been subject to a lost write, you can query the primary and secondary indexes to confirm that the row exists in both places.

For example, if you suspect that a row in my_table with primary key pk=1 and secondary index keys a=2 and b=3 has a lost write, you can run the following queries to confirm:

BEGIN AS OF SYSTEM TIME '-10s';
SELECT count(*) FROM my_table@my_table_pkey WHERE pk = 1;
SELECT count(*) FROM my_table@my_table_a_idx WHERE pk = 1 AND a = 2;
SELECT count(*) FROM my_table@my_table_b_idx WHERE pk = 1 AND b = 3;
COMMIT;

If any of these queries returns a different count from the others, that likely indicates a lost write.

The AS OF SYSTEM TIME clause helps to minimize the impact on foreground traffic.
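
To extend the per-row check into a coarser, table-wide audit, the same pattern can compare total row counts across indexes. This sketch assumes the example table and index names above, and that the secondary indexes are not partial (a partial index legitimately contains fewer rows):

BEGIN AS OF SYSTEM TIME '-10s';
-- Each count scans a different index; for non-partial indexes
-- on a healthy table, all three numbers should match.
SELECT count(*) FROM my_table@my_table_pkey;
SELECT count(*) FROM my_table@my_table_a_idx;
SELECT count(*) FROM my_table@my_table_b_idx;
COMMIT;

A mismatch narrows the problem to the table but does not identify the affected rows; the per-row queries above are still needed for confirmation.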

If you are unsure which rows may have been affected but suspect that your cluster experienced lost writes, contact the support team.

Repair rows with lost writes

If you experienced lost writes, it is possible to repair rows by re-inserting data. This will generally be easier if you know all the column values that need to be re-inserted. Contact the support team for assistance with repairing rows with lost writes.
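
As an illustration of such a repair, assuming the example row above (pk=1, a=2, b=3) and that these are all of the table's columns, an UPSERT rewrites the row through the primary index and updates every secondary index in the same transaction:

-- Assumed schema: my_table(pk, a, b), matching the earlier example.
-- UPSERT writes the primary index entry and all secondary index
-- entries atomically, restoring whichever side lost the write.
UPSERT INTO my_table (pk, a, b) VALUES (1, 2, 3);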

Impact

In circumstances where there is a sustained period of disk slowness in the presence of lease transfers, it is possible for some writes in a transaction that straddles multiple ranges to be lost. In cases where these lost writes are not also present in one or more secondary indexes that would allow the row to be reconstructed or repaired, the writes could be irrecoverably lost.

Reach out to the support team if more information or assistance is needed.

