Publication date: August 15, 2023
Description
A CockroachDB store with encryption-at-rest enabled maintains per-file metadata necessary for file decryption within an append-only log. When this log file grows large, the log file is rotated by writing a snapshot of the metadata to a new, replacement log file. The introduction of this log file in v21.2.0 contains a bug in log rotation that could result in the omission of a single file’s metadata from the persisted log file. When a node encounters this bug, it continues to operate normally until the process is restarted. If the omitted file still exists after a process restart, the cockroach
process cannot decrypt or read the file, and the local store appears to be corrupted. This bug most frequently causes the cockroach
process on a node to exit shortly after it starts. An error such as “pebble/table: invalid table (bad magic number: 0x08cf6caea91de887)
” is logged.
Only CockroachDB deployments using encryption-at-rest are affected. This issue is rare, given that the average lifespan of a file within CockroachDB’s storage engine is relatively short. If a file is omitted from the registry during a rotation, the file is frequently removed before the process is restarted.
The corruption is limited to a single store, allowing recovery through decommissioning and replication. The restart of multiple processes simultaneously or in quick succession increases the risk of quorum loss through the simultaneous loss of nodes.
Statement
This is resolved in CockroachDB by #107249, which fixes the ordering of steps during encryption-at-rest log rotation.
The fix has been applied to maintenance releases of CockroachDB v23.1.8, v22.2.13, v22.1.22.
This public issue is tracked by #106617.
Mitigation
Users of CockroachDB v21.2, ≤22.1.21, ≤22.2.12, ≤23.1.7 using encryption-at-rest are encouraged to upgrade to v22.1.22, v22.2.13, v23.1.8 or a later version. Users are encouraged to deploy the patch release within a best practice “rolling upgrade.” Nodes should be upgraded one at a time, pausing 15 minutes between nodes.
If a node encounters corruption, manifested as repeated node start failures, users should:
- Pause the rolling upgrade.
- Decommission the node with corruption.
- Wait for all ranges to fully replicate.
- Resume the rolling upgrade.
Impact
Restarting a node with a store encrypted using encryption-at-rest may surface existing latent, local corruption of the store’s data, requiring the node to be decommissioned and replaced. Customers are encouraged to upgrade to a patched version through a rolling upgrade.
Questions about any technical alert can be directed to our support team.