This document outlines the foundational skills required to deploy and operate CockroachDB in production environments.
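The skills are organized into sections based on the following operational domains:
- Infrastructure configuration
- Security
- Cluster maintenance
- Troubleshooting
- Disaster recovery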
Each section includes links to relevant documentation for the listed skills.
Cockroach Labs offers Professional Services that can assist you with getting applications into production faster and more efficiently.
Infrastructure configuration
This section covers how to ensure that your hardware and network are properly configured to meet the performance and connectivity requirements of CockroachDB.
- Verify vCPU, RAM, storage, and disk IOPS performance
- Configure time synchronization with an NTP server
- Validate network connectivity
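As an illustration of the time synchronization item, the following Python sketch flags nodes whose clock skew looks risky. It assumes you already collect a clock offset in milliseconds for each node from your own monitoring. The 500 ms maximum offset, the 80% warning fraction, the "more than half of the other nodes" rule, and all function names are assumptions made for this example; confirm the actual limits in the documentation for your CockroachDB version.

```python
from itertools import combinations

MAX_OFFSET_MS = 500.0  # assumed maximum clock offset; verify for your version
WARN_FRACTION = 0.8    # assumed fraction of the maximum treated as risky


def pairwise_skew(offsets_ms):
    """Return {(a, b): absolute skew in ms} for every pair of nodes."""
    return {
        (a, b): abs(offsets_ms[a] - offsets_ms[b])
        for a, b in combinations(sorted(offsets_ms), 2)
    }


def nodes_at_risk(offsets_ms, max_offset_ms=MAX_OFFSET_MS, fraction=WARN_FRACTION):
    """List nodes whose skew exceeds the limit against more than half of the others."""
    limit = max_offset_ms * fraction
    skew = pairwise_skew(offsets_ms)
    risky = []
    for node in sorted(offsets_ms):
        others = [n for n in offsets_ms if n != node]
        # Count peers whose skew against this node is over the limit.
        over = sum(1 for n in others if skew[tuple(sorted((node, n)))] > limit)
        if over * 2 > len(others):
            risky.append(node)
    return risky


if __name__ == "__main__":
    # Hypothetical offsets, in milliseconds, relative to a common reference.
    print(nodes_at_risk({"n1": 0.0, "n2": 5.0, "n3": 450.0}))
```

A check like this is only a pre-flight aid; it does not replace configuring NTP on every node.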
Security
This section covers how to secure a CockroachDB deployment, including certificate management, load balancing setup, role-based access control, and data encryption.
- Create and distribute certificates; initialize cluster
- Configure load balancer and direct a workload
- Configure RBAC
- Configure encryption at rest
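For the RBAC item, a minimal Python sketch of the kind of SQL statements involved: create a role, grant it table privileges, and add members to it. The role, table, and user names are hypothetical, the helper functions are invented for this example, and the privilege list is limited to a small set on purpose. Review the generated statements against the SQL reference for your version before running them.

```python
import re

_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")
_ALLOWED_PRIVILEGES = {"SELECT", "INSERT", "UPDATE", "DELETE"}


def ident(name):
    """Accept only simple lowercase identifiers so no quoting is needed."""
    if not _IDENT.match(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return name


def table_name(name):
    """Validate a possibly dotted table name such as db.tbl."""
    return ".".join(ident(part) for part in name.split("."))


def rbac_statements(role, table, privileges, members):
    """Return SQL that creates a role, grants privileges, and adds members."""
    privs = [p.upper() for p in privileges]
    unknown = [p for p in privs if p not in _ALLOWED_PRIVILEGES]
    if unknown:
        raise ValueError(f"unsupported privileges: {unknown}")
    stmts = [
        f"CREATE ROLE IF NOT EXISTS {ident(role)};",
        f"GRANT {', '.join(privs)} ON TABLE {table_name(table)} TO {ident(role)};",
    ]
    # Role membership is granted per user.
    stmts += [f"GRANT {ident(role)} TO {ident(m)};" for m in members]
    return stmts


if __name__ == "__main__":
    for stmt in rbac_statements("app_reader", "bank.accounts", ["SELECT"], ["alice"]):
        print(stmt)
```

Granting privileges to a role and then assigning users to that role keeps access changes in one place, which is the usual reason to prefer roles over per-user grants.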
Cluster maintenance
This section covers how to manage the lifecycle of CockroachDB nodes, including adding and removing nodes, handling outages, performing upgrades or downgrades, and modifying cluster settings.
- Shut down a node gracefully
- Handle unplanned node outages
- Add nodes
- Remove nodes
- Add a region
- Remove a region
- Perform rolling upgrades
- Downgrade a cluster from a patch version
- Downgrade a cluster from a major version
- Change a cluster setting
- Repave a cluster. Cluster repaving involves the following individual skills, which are also used during rolling upgrades:
    1. Shut down a node gracefully
    2. Detach the persistent volume (also known as a persistent disk) from the removed node's virtual machine (VM). This step is optional but recommended.
    3. Delete the removed node's VM
    4. Start a new VM
    5. Reattach the persistent disk to the new VM (necessary if you did step 2)
    6. Add a node to the cluster from the new VM
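The repave steps listed above can be sketched as an ordered plan. This Python example only generates the sequence of steps; it does not call any cloud or CockroachDB API. Processing one node at a time and the `keep_disk` switch are assumptions of this sketch, and the function name is hypothetical.

```python
def repave_plan(nodes, keep_disk=True):
    """Return (node, step) pairs in order, finishing each node before the next."""
    plan = []
    for node in nodes:
        plan.append((node, "shut down the node gracefully"))
        if keep_disk:
            # Optional but recommended: preserve the node's data.
            plan.append((node, "detach the persistent disk from the VM"))
        plan.append((node, "delete the VM"))
        plan.append((node, "start a new VM"))
        if keep_disk:
            # Only needed when the disk was detached earlier.
            plan.append((node, "reattach the persistent disk to the new VM"))
        plan.append((node, "add a node to the cluster from the new VM"))
    return plan


if __name__ == "__main__":
    for node, step in repave_plan(["n1", "n2"]):
        print(f"{node}: {step}")
```

Keeping the detach and reattach steps tied to a single flag mirrors the dependency in the list: the reattach step applies only when the disk was detached.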
Troubleshooting
This section lists common issues related to SQL performance, cluster stability, memory usage, load balancing, connection errors, and changefeed lag, along with the diagnostic data to gather when investigating them.
- SQL response time for specific queries
- SQL throughput degradation across the board
- Cluster instability: Dead/suspect nodes
- Out of memory (OOM) problems
- Imbalanced cluster load
- End of file (EOF) errors
- Changefeed is falling behind
- Gather diagnostic data from a "debug zip" file
- Collect timeseries diagnostic data from a "tsdump" file
Disaster recovery
This section covers how to set up and manage backup and restore of your cluster to ensure data recovery in case of failures.
- Create an AWS IAM access key
- Create an S3 bucket for backup data
- Full cluster backup to S3
- Incremental backup to S3
- Cluster restore from AWS S3
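The backup and restore items above can be illustrated with a short Python sketch that builds the corresponding SQL statements for an S3 location. The bucket name, path, and credentials are placeholders, the helper names are hypothetical, and the `AS OF SYSTEM TIME '-10s'` clause is an assumption of this example. Check the statement forms against the `BACKUP` and `RESTORE` reference for your CockroachDB version before use.

```python
from urllib.parse import quote, urlencode


def s3_uri(bucket, path, access_key_id, secret_access_key):
    """Build an S3 URI with URL-encoded credentials in the query string."""
    query = urlencode({
        "AWS_ACCESS_KEY_ID": access_key_id,
        "AWS_SECRET_ACCESS_KEY": secret_access_key,
    })
    return f"s3://{bucket}/{quote(path.strip('/'))}?{query}"


def sql_literal(value):
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def full_backup(uri):
    """Full cluster backup into a backup collection."""
    return f"BACKUP INTO {sql_literal(uri)} AS OF SYSTEM TIME '-10s';"


def incremental_backup(uri):
    """Incremental backup appended to the latest full backup."""
    return f"BACKUP INTO LATEST IN {sql_literal(uri)} AS OF SYSTEM TIME '-10s';"


def restore_latest(uri):
    """Full cluster restore from the most recent backup in the collection."""
    return f"RESTORE FROM LATEST IN {sql_literal(uri)};"


if __name__ == "__main__":
    # Placeholder values; never hard-code real credentials.
    uri = s3_uri("my-bucket", "backups", "AKIAEXAMPLE", "abc/def")
    print(full_backup(uri))
    print(incremental_backup(uri))
    print(restore_latest(uri))
```

URL-encoding the secret key matters because AWS secret keys can contain characters such as `/` and `+` that would otherwise corrupt the query string.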