Distributed Glossary

RTO vs RPO

RTO (Recovery Time Objective) and RPO (Recovery Point Objective) are two critical metrics in disaster recovery and business continuity planning. They help you answer two essential questions when an outage or disaster strikes:

  1. How quickly do we need to recover? (RTO)

  2. How much data can we afford to lose? (RPO)

Understanding these objectives allows businesses to design and implement a robust strategy that balances cost, complexity, and risk tolerance. Below, we’ll explore what RTO and RPO mean, how they differ, their role in disaster recovery, and best practices for meeting these objectives. After that, we’ll address frequently asked questions in the FAQ section.


What is RTO?

Recovery Time Objective (RTO) is the maximum acceptable downtime for your systems after an outage or disaster. If operations do not resume within the RTO, the business may experience unacceptable costs, lost revenue, or reputational damage.

Key Characteristics of RTO

  • Focus on Downtime: RTO measures how quickly you must restore services after a disruption.

  • Business-Driven: RTO typically aligns with financial or operational thresholds. For instance, an online retailer might have an RTO of one hour for its ordering system, because every hour of downtime equates to lost sales.

  • Measured in Time Units: RTO is usually expressed in hours or minutes, though some mission-critical systems push it down to seconds.

For example, if your customer database fails at 2:00 PM and your RTO is 2 hours, your goal is to have the system up and running again by 4:00 PM. Missing that deadline means violating your own recovery objective, with potential financial or reputational consequences. Downtime is expensive: smaller operations can lose up to $427 per minute, while industries like healthcare and retail face hourly costs of roughly $636,000 and $1.1 million, respectively, and a single outage can cost a company anywhere from $100,000 to over $1 million. Meeting your RTO is therefore crucial to limiting these financial impacts.
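
The deadline arithmetic above can be sketched in a few lines of Python (the times and the 2-hour RTO come from the example; the helper names are ours):

```python
from datetime import datetime, timedelta

def rto_deadline(failure_time: datetime, rto: timedelta) -> datetime:
    """The latest moment service must be restored to stay within the RTO."""
    return failure_time + rto

def met_rto(failure_time: datetime, restored_time: datetime, rto: timedelta) -> bool:
    """True if actual downtime stayed within the objective."""
    return restored_time - failure_time <= rto

failure = datetime(2025, 1, 1, 14, 0)  # system fails at 2:00 PM
rto = timedelta(hours=2)
print(rto_deadline(failure, rto).strftime("%I:%M %p"))          # 04:00 PM
print(met_rto(failure, datetime(2025, 1, 1, 15, 30), rto))      # True: restored by 3:30 PM
```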

Balancing Cost and Speed

Shorter RTOs typically require redundant systems, automation, and higher availability architectures. While this can minimize the damage of downtime, it also often increases costs. Finding a balance involves:

  • Weighing the cost of downtime (lost transactions, customer dissatisfaction).

  • Evaluating the cost of continuous availability (hot standbys, distributed systems, high-speed network connections).

Many modern systems, such as distributed SQL databases (e.g., CockroachDB), help reduce RTO by automatically re-routing traffic if a node or region goes offline, thereby shortening downtime.


What is RPO?

Recovery Point Objective (RPO) is the maximum acceptable amount of data loss when systems are restored. It defines how far back in time your recovery process can “roll back” data to a consistent state.

Key Characteristics of RPO

  • Focus on Data Loss: RPO addresses how much recent data might be irretrievable following an incident.

  • Driving Backup Frequency: RPO influences how often data backups or replications must occur to ensure that any lost data falls within acceptable limits.

For example, if your RPO is 15 minutes, you have determined the business can tolerate losing up to 15 minutes of recent transactions. To achieve this, you might rely on near-real-time replication or frequent incremental backups. If a disaster strikes at 3:00 PM, you can confidently restore from a snapshot taken at 2:45 PM, with only 15 minutes of potential data loss.
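
As a rough sketch of how RPO drives backup cadence (a simplification that ignores backup duration and replication lag), the worst-case data loss equals the time elapsed since the last good backup:

```python
def max_backup_interval_minutes(rpo_minutes: float) -> float:
    """Under this simplification, the backup interval must not exceed the RPO."""
    return rpo_minutes

def worst_case_loss_minutes(disaster_minute: int, backup_minutes: list[int]) -> int:
    """Minutes of data lost if disaster strikes at `disaster_minute`,
    restoring from the most recent backup taken at or before it."""
    usable = [b for b in backup_minutes if b <= disaster_minute]
    return disaster_minute - max(usable)

# Backups every 15 minutes; disaster at minute 180 (3:00 PM if minute 0 is noon).
backups = list(range(0, 181, 15))
print(worst_case_loss_minutes(180, backups))  # 0: a backup just completed
print(worst_case_loss_minutes(178, backups))  # 13 minutes of lost writes
```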

Zero RPO

A zero RPO means no data loss is acceptable. Often, this requires synchronous replication to multiple sites, ensuring every transaction is committed across multiple locations before it’s considered final. Distributed databases can get close to zero RPO by writing data to multiple replicas, minimizing the risk of data loss.

Balancing Cost and Downtime  

While a short Recovery Time Objective (RTO) helps minimize downtime—and the associated revenue loss or reputational damage—it often demands more robust infrastructure, specialized failover mechanisms, and a skilled IT response team. For example:

  • Infrastructure Costs: Maintaining hot standby systems, redundant servers, and automated failover solutions can be expensive, but these investments significantly reduce potential downtime.

  • Complexity vs. Speed: Highly available architectures (e.g., distributed databases or clustered applications) may increase operational complexity, yet they can restore services within minutes—or even seconds—when a primary node fails.

  • Opportunity Costs: Some organizations find that a marginal improvement (e.g., reducing RTO from 30 minutes to near-zero) requires disproportionately higher spending. Weigh this extra cost against the potential revenue loss or compliance penalties incurred if you exceed your RTO.

  • Risk Mitigation: If your business model depends on continuous operations (e.g., online retail, financial services), the risk of extended outages may far outweigh the cost of additional infrastructure. Conversely, for non-mission-critical workloads, a slightly longer RTO might be acceptable given a lower budget.

Striking the right balance requires collaborating with both technical stakeholders and business leaders to determine how much downtime is tolerable — and what the organization is willing to spend to prevent it.


Difference between RPO and RTO: Explained

In short, RTO caps how long systems can be down, while RPO caps how much data can be lost.

Interplay Between the Two

It’s possible to meet one objective but not the other. For example, you could restore a system quickly (meeting RTO) but revert to a backup from hours earlier (failing your RPO). Conversely, you might restore to a perfectly up-to-date data state (RPO met) but require several hours of downtime (RTO not met).
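
The independence of the two objectives can be expressed as a tiny check (the minute values here are illustrative):

```python
def evaluate_recovery(downtime_min: float, data_loss_min: float,
                      rto_min: float, rpo_min: float) -> dict:
    """RTO and RPO are judged independently: you can meet one and miss the other."""
    return {"rto_met": downtime_min <= rto_min, "rpo_met": data_loss_min <= rpo_min}

# Fast restore from a stale backup: RTO met, RPO missed.
print(evaluate_recovery(downtime_min=20, data_loss_min=240, rto_min=60, rpo_min=15))
# Up-to-date data state after a long outage: RPO met, RTO missed.
print(evaluate_recovery(downtime_min=300, data_loss_min=0, rto_min=60, rpo_min=15))
```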

Organizations typically set RTO and RPO targets simultaneously, based on risk tolerance, budget, and operational needs.

For many business-critical applications, both low RTO and low RPO (i.e., fast recovery and minimal data loss) are required. This can drive adoption of high-availability architectures and continuous data protection solutions.


RPO and RTO in Disaster Recovery and Business Continuity

Disaster Recovery (DR) focuses on the technical aspects of restoring IT systems after a disruption, while Business Continuity (BC) considers the wider organizational response to maintain essential functions. RPO and RTO are cornerstones in both realms:

  1. Defining Tolerable Disruption

    • RTO sets the maximum downtime for critical operations before severe damage occurs (e.g., lost revenue, regulatory penalties).

    • RPO sets the maximum data loss your organization can handle without damaging credibility or violating compliance rules.

  2. Influencing Technology Choices

    • If your RPO is near-zero, you might opt for real-time replication and distributed databases that write data to multiple nodes.

    • If your RTO is extremely short, you may implement auto-failover, hot standbys, or container-based microservices that can redeploy quickly.

  3. Supporting Compliance

    • Regulatory standards in sectors like finance and healthcare often mandate certain recovery capabilities.

    • Internal SLAs or external contracts can specify RTO and RPO targets as measurable obligations.

  4. Driving Investment

    • The lower (stricter) your RTO and RPO, the more you typically invest in infrastructure, replication, offsite backups, and skilled personnel.

    • Some businesses opt for multi-region or multi-cloud setups to ensure resilience. For instance, CockroachDB’s multi-region capabilities allow data to be replicated closer to users and automatically fail over.

Real-World Example: E-commerce Platform

  • RTO: 30 minutes. Beyond this point, lost sales and frustrated customers become unacceptable.

  • RPO: 5 minutes. Because orders, payments, and customer data are vital, losing more than five minutes of transactional data could cause confusion over inventory, financial records, and shipping addresses.

To achieve these goals, the company might employ a distributed SQL database for near-real-time data replication and maintain hot standby servers that can take over almost instantly. The investment in servers, networking, and automation is offset by the reduction in lost sales if a primary server goes down.



Best Practices for Optimizing RPO and RTO

Achieving minimal data loss and fast recovery requires a combination of smart planning, technical solutions, and ongoing maintenance. Below are proven best practices to help your business optimize RPO and RTO.

1. Conduct a Business Impact Analysis (BIA)

A BIA identifies critical business processes, quantifies potential losses, and determines acceptable downtime. This helps you:

  • Prioritize Systems: Not all applications are equally critical.

  • Set Realistic Targets: Define RTO and RPO values based on the real-world impact of downtime or data loss.

2. Tier Your Applications

Categorize applications into tiers. For example:

  • Tier 1: Mission-critical systems (e.g., customer-facing payments).

  • Tier 2: Important systems (e.g., internal reporting, HR).

  • Tier 3: Non-critical or infrequently used systems.

Assign stricter RTO and RPO to Tier 1 applications while allowing more relaxed targets for lower tiers. This approach ensures you allocate resources appropriately rather than over-investing across the board.
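
A tier policy like the one above can be captured as plain data. The targets here are hypothetical, echoing the example ranges discussed later in the FAQ:

```python
# Hypothetical tier policy mapping criticality to recovery targets.
TIER_OBJECTIVES = {
    1: {"rto_minutes": 15,   "rpo_minutes": 0},    # mission-critical: near-zero loss
    2: {"rto_minutes": 240,  "rpo_minutes": 60},   # important internal systems
    3: {"rto_minutes": 1440, "rpo_minutes": 720},  # non-critical workloads
}

def objectives_for(app_tier: int) -> dict:
    """Look up the recovery targets an application must be engineered to meet."""
    return TIER_OBJECTIVES[app_tier]

print(objectives_for(1)["rpo_minutes"])  # 0: Tier 1 tolerates no data loss
```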

3. Choose the Right Data Protection Strategy

Data backup and replication strategies must align with your RPO:

  • Frequent Snapshots or Continuous Backups: If you require a very low RPO, schedule backups frequently. 

  • Cloud and Offsite Storage: Keep backups offsite or in the cloud for protection against local disasters.

  • Encryption and Integrity Checks: Ensure backups are both secure and verified to prevent restoration failures.

4. Leverage High Availability (HA) Architectures

To minimize RTO, invest in resilient architectures that reduce or eliminate downtime:

  • Distributed Systems: A multi-region, distributed SQL database can automatically redirect queries if one node fails, often delivering near-continuous availability.

  • Clustering and Failover: Automated failover to a hot or warm standby can restore operations within seconds or minutes.

  • Containerization and Orchestration: Tools like Kubernetes simplify redeploying services on healthy nodes, cutting recovery times.

5. Automate and Document Recovery Procedures

Manual processes can slow you down during an emergency:

  • Automated Failover Scripts: Trigger an immediate failover to a standby instance.

  • Infrastructure-as-Code: Provision new servers quickly using scripted configurations.

  • Comprehensive Documentation: Keep runbooks current and store them in an accessible, fail-safe location.
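
As an illustration only, the shape of an automated failover monitor might look like the following sketch; the health probe and `promote_standby` are hypothetical callables your platform would supply, not a real API:

```python
import time

def is_healthy(check) -> bool:
    """`check` is any zero-argument callable that probes the primary
    (e.g. a TCP connect or an SQL ping); exceptions count as unhealthy."""
    try:
        return bool(check())
    except Exception:
        return False

def monitor_and_failover(check, promote_standby, retries=3, interval_s=1.0) -> str:
    """Retry the health probe a few times, then trigger failover exactly once."""
    for _ in range(retries):
        if is_healthy(check):
            return "primary-ok"
        time.sleep(interval_s)
    promote_standby()
    return "failed-over"

# Simulated outage: the probe always fails, so the standby is promoted.
events = []
print(monitor_and_failover(lambda: False, lambda: events.append("promoted"),
                           retries=2, interval_s=0.0))  # failed-over
```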

6. Test Regularly

Testing is an essential part of an enterprise disaster recovery plan to confirm you can meet your stated objectives:

  • Disaster Recovery Drills: Simulate real outages or data loss scenarios.

  • Restore from Backups: Verify backups are valid and that your recovery time meets RTO.

  • Measure and Refine: Compare actual recovery metrics with your targets, then adjust resources or processes as needed.
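
The "measure and refine" step can be made mechanical. This sketch assumes each drill yields a (recovery minutes, data loss minutes) pair, compared against targets we've chosen for illustration:

```python
def drill_report(results, rto_min: float, rpo_min: float) -> list[str]:
    """Summarize DR drill runs; a drill passes only if both objectives are met."""
    lines = []
    for i, (recovery, loss) in enumerate(results, 1):
        ok = recovery <= rto_min and loss <= rpo_min
        lines.append(f"drill {i}: recovery={recovery}m loss={loss}m -> "
                     f"{'PASS' if ok else 'FAIL'}")
    return lines

# Three drills against a 30-minute RTO and 5-minute RPO.
for line in drill_report([(25, 3), (45, 2), (20, 9)], rto_min=30, rpo_min=5):
    print(line)
```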

7. Continuously Review and Update

Business requirements, infrastructure, and threats can change:

  • Track System Growth: Larger data volumes may slow backup procedures.

  • Adopt New Technologies: For instance, if a new distributed storage service improves replication speeds, you might achieve tighter RPO.

  • Revisit BIA: Ensure your objectives still align with new business lines, regulations, or market demands.

By following these best practices, your organization can systematically reduce downtime (RTO) and limit data loss (RPO), giving you a resilient foundation that supports continuous operations—even under adverse conditions.


RTO vs. RPO FAQ

Below are common questions related to RTO, RPO, and their optimization.

How do we determine the right RTO and RPO for our business?

Start with a Business Impact Analysis (BIA) to understand how downtime or data loss affects each department. Quantify potential revenue impact and compliance risks. Engage both technical teams and business stakeholders to decide on maximum tolerable downtime (RTO) and data loss (RPO). Balance the cost of downtime/data loss against the cost of building more resilient systems.

Can we achieve zero RTO or zero RPO?

  • Zero RTO means no perceived downtime at all. In practice, you’d need highly available architectures with automatic failover to ensure continuous service.

  • Zero RPO means no data loss whatsoever. This typically requires synchronous replication to multiple locations so that every write is confirmed in real time. Modern distributed SQL databases (like CockroachDB) can approach near-zero RPO and RTO thanks to their replicated architecture. Most businesses aim for low or near-zero metrics where justified.

How can we reduce our RTO and RPO?

  • For RTO (faster recovery): Use automated failover, maintain standbys, and script your recovery processes. 

  • For RPO (less data loss): Increase backup frequency or adopt continuous replication to a standby. If your backups currently happen once a day, consider hourly or near-real-time snapshots.

Improving network bandwidth, verifying backup integrity, and ensuring offsite replication are additional ways to tighten both RTO and RPO.

Why is RTO testing and RPO testing important, and how often should we test?

RTO and RPO testing is crucial because theoretical objectives might not hold up under real conditions. By regularly testing:

  • You verify backups and failover processes work as intended.

  • You uncover hidden dependencies or bottlenecks.

  • Your team gains familiarity with emergency procedures, reducing mistakes during an actual outage.

Annual testing is common, though critical systems might warrant semi-annual or quarterly drills. Always reassess your results to see if they meet your RTO and RPO targets.

How do RTO and RPO relate to our SLA or compliance obligations?

Some regulations require specific recovery capabilities (e.g., within X hours for financial records). You may incorporate RTO and RPO into internal SLAs to set performance benchmarks for your IT department. Externally, customers might demand certain uptime or data protection guarantees. Documenting (and meeting) your RTO/RPO objectives helps maintain compliance, avoid fines, and build customer trust.

Should we tier our services for different RTO and RPO objectives?

Yes. It’s common to tier services based on criticality:

  • Tier 1: Minimal downtime and data loss (e.g., 15-minute RTO, near-zero RPO).

  • Tier 2: Moderate tolerance (e.g., 2–4 hours RTO, 1-hour RPO).

  • Tier 3: Extended downtime acceptable (e.g., 24 hours RTO, 12-hour RPO).

This ensures you invest heavily where it matters most. Testing each tier separately also helps refine your overall continuity plan.

Where should backups be stored to meet a good RPO?

Store backups offsite or in geographically separated cloud regions. On-premises backups alone may fail in regional disasters like floods or fires. Using a distributed backup service or multi-cloud approach further protects data. Regularly confirm that backups are valid and restorable, since corrupt or incomplete backups won’t meet any RPO goals.
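
One common way to confirm a backup is restorable is to record a checksum when the backup is taken and re-check it before relying on the copy. A minimal sketch using SHA-256:

```python
import hashlib

def checksum(path: str) -> str:
    """SHA-256 of a backup file, computed in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_backup(path: str, recorded_digest: str) -> bool:
    """Compare against the digest recorded at backup time; a mismatch means
    the copy is corrupt and won't meet any RPO goal."""
    return checksum(path) == recorded_digest
```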

Can distributed databases help with both RTO and RPO?

Yes. Distributed SQL solutions—such as CockroachDB—replicate data across multiple nodes (and often multiple regions). This approach:

  • Minimizes downtime (improving RTO) by automatically failing over if one node goes offline.

  • Reduces data loss (improving RPO) by maintaining near-real-time replicas that keep data current.

Such architectures are particularly valuable for organizations with global user bases or strict uptime/data integrity requirements.

More about RTO vs RPO

Cloud Migration

What is cloud migration?

Cloud migration is the process of moving an organization's digital assets, services, databases, IT resources, and applications into the cloud. This process may involve transferring data and applications from an on-premises data center to a cloud-based infrastructure or shifting resources from one cloud environment to another. With Gartner predicting that “Worldwide end-user spending on public cloud services [will] total $723.4 billion in 2025, up from $595.7 billion in 2024,” every company should know where it stands in its cloud journey, because that is where modern business is heading. This article discusses the benefits and challenges of moving from fully on-prem to hybrid or cloud deployments, as well as the benefits of multi-cloud and managing cross-cloud migration.

Particularly as companies aim to stay competitive in today’s economic landscape, many are turning to modernization efforts, which often includes migrating data from purely on-prem data centers to a hybrid, multi-region, or multi-cloud deployment.

One of the more common strategies is called rehosting, also known as “lift and shift.” This involves moving applications, workloads, and data from on-premises or another cloud environment to the cloud with minimal or no changes. This strategy is often used for quick cloud adoption and is usually the fastest and simplest method of cloud migration. Other common strategies include replatforming – lifting and then tinkering to get the shift to work more seamlessly – refactoring (rearchitecting) applications for the cloud, repurchasing (transitioning products entirely), or retiring. 

Benefits of cloud migration

  1. Cost Savings: Cloud migration can help reduce IT costs by minimizing the need for physical hardware and maintenance while optimizing resource usage based on demand.

  2. Scalability and Flexibility: It allows organizations to scale resources up or down easily according to their needs, enhancing their ability to respond to changing market conditions and business demands.

  3. Performance and Reliability: Cloud providers offer high availability and disaster recovery options, ensuring better performance and reliability for applications and services.

  4. Access to Advanced Technologies: Migrating to the cloud enables organizations to leverage cutting-edge technologies like artificial intelligence, machine learning, and big data analytics, promoting innovation and competitive advantage.

  5. Global Reach: Cloud services support deployment across multiple geographic regions, enabling organizations to deliver services closer to their customers, thus reducing latency and improving user experience.

  6. Supporting Company Initiatives: Many companies are focused on modernization efforts, and today, that includes moving to the cloud.

Common challenges of cloud migration

While there are certainly many benefits to cloud migration, there also can be some challenges:

  1. Downtime and Service Disruption: Migrating services and applications can lead to downtime, which can cause service disruptions and impact business operations.

  2. Data Security and Compliance: Ensuring the security and compliance of data in transit and at rest during and after the migration is critical. Adhering to regulatory requirements and industry standards needs proper planning and execution.

  3. Cost Management: Unexpected costs can arise during and after migration due to poor planning, oversized or underutilized resources, and lack of cost optimization strategies.

  4. Technical Compatibility: Ensuring that existing applications and systems are compatible with the cloud infrastructure can be challenging. Some legacy applications may require significant modifications to migrate successfully.

  5. Performance Issues: Post-migration performance may suffer due to differences in cloud infrastructure, network latency, or misconfiguration, impacting the user experience.

  6. Data Loss and Integrity: Ensuring data accuracy and preventing data loss during migration can also be a key concern. Proper data backup and validation processes are necessary to mitigate these risks.

  7. Complexity of Cloud Architecture: Understanding and effectively managing the complexity of cloud architecture, such as hybrid or multi-cloud environments, can be challenging.

  8. Governance and Management: Establishing proper governance and management practices to monitor, control, and optimize cloud resources is crucial and can be complex.

  9. Staff Expertise and Training: Existing IT teams may lack the required skills and expertise to handle cloud environments and migration processes. Investing in training and hiring skilled professionals is essential.

Addressing these challenges requires careful planning, a well-thought-out migration strategy, strong project management, and collaboration between business and IT teams. To simplify cloud migrations, companies can turn to different tools provided by database companies, who focus on ensuring a smooth transition.

What is cross-cloud migration?

Cross-cloud migration, also known as multi-cloud migration, refers to the process of moving data, applications, and workloads between different cloud service providers or utilizing multiple cloud environments to run various parts of an organization's IT landscape. This can involve migrating from one public cloud to another (e.g., from AWS to Azure), leveraging both public and private cloud resources, or distributing workloads across several public cloud providers.

Organizations undertake cross-cloud migration for several reasons:

  1. Avoiding Vendor Lock-in: By using multiple cloud providers, organizations can avoid dependency on a single vendor. This flexibility allows them to switch providers more easily if necessary, without being restricted by contractual or technical limitations.

  2. Cost Optimization: Different cloud providers offer varied pricing models, services, and performance characteristics. Organizations can optimize costs by selecting the most cost-effective services for specific workloads or taking advantage of promotional pricing and discounts from multiple providers.

  3. Performance and Availability: Using multiple cloud environments can enhance disaster recovery and business continuity. Distributing workloads across various clouds can ensure higher availability and redundancy, reducing the risk of downtime and data loss.

  4. Geographic Distribution: Organizations with a global presence may use multiple cloud providers to ensure that their services are geographically distributed closer to their users, reducing latency and improving performance.

  5. Compliance and Data Sovereignty: Different cloud providers may meet specific regulatory and compliance requirements in various regions. Cross-cloud strategies allow organizations to store and process data in specific jurisdictions to comply with local laws and data sovereignty rules.

  6. Enhanced Innovation and Functionality: Each cloud provider offers unique tools, services, and innovations. By leveraging multiple providers, organizations can take advantage of a broader range of features and capabilities to better meet their business needs and drive innovation.

  7. Risk Mitigation: Relying on a single cloud provider can pose risks if that provider experiences outages, security breaches, or other issues. Cross-cloud strategies can mitigate risk by ensuring that the failure of one provider does not disrupt the entire operation.

Overall, cross-cloud migration enables organizations to build a more flexible, resilient, and cost-effective IT infrastructure that can adapt to changing business needs and market conditions. However, it also introduces additional complexity in managing and integrating multiple cloud environments, which requires careful planning and robust management practices.

Cross-cloud migration with CockroachDB

CockroachDB offers flexible replication controls that make it easy to run a single CockroachDB cluster across multiple cloud platforms and migrate data from one cloud to another without service interruption. This process involves starting nodes on different clouds, and using load balancers like HAProxy to ensure even distribution of client requests.

The migration process includes steps such as initializing the cluster, setting up load balancing, running sample workloads, and watching data balance across nodes. Eventually, you can migrate all data to a new cloud by adding constraints to ensure all replicas are on the desired cloud nodes, or simply decommissioning nodes in the source cloud. This is called a stretch migration.

What are cloud migration services?

Moving data, applications, and services to the cloud or between clouds can be a daunting task. As outlined above, there are many benefits, but also substantial challenges to consider when adopting a cloud migration plan. Cloud migration services help organizations move data, applications, and other business elements from an on-premises environment or one cloud provider to another cloud environment. The three major cloud service providers are Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).

AWS Database Migration Service (DMS) is a popular tool used to migrate data from one database to another. Easy to use, AWS DMS offers minimal downtime, support for various database engines, flexibility, scalability, cost-effectiveness, automatic monitoring, security, integration with the AWS ecosystem, and overall reliability. Its ability to simplify and streamline the database migration process while providing robust performance and security makes it an attractive option for organizations looking to modernize their database infrastructure and take advantage of cloud benefits. 

For example, users can use AWS DMS to migrate from PostgreSQL, MySQL, Oracle, or Microsoft SQL Server to a distributed SQL database, like CockroachDB. The process involves setting up a replication instance, configuring source and target endpoints, and creating a database migration task.

Distributed SQL databases are uniquely powerful in that they provide the much-needed consistency and familiar structure of traditional, relational databases along with the scalability, survivability, and performance of NoSQL databases, necessary for modern businesses.

The MOLT (Migrate Off Legacy Technology) Suite

MOLT stands for "Migrate Off Legacy Technology." It is a suite of tools designed by Cockroach Labs to facilitate the database migration process to CockroachDB (self-hosted or in the cloud). There are three tools to support users throughout the migration process.

  • MOLT SCT (Schema Conversion Tool): Converts database schemas from PostgreSQL, MySQL, Oracle, and SQL Server to a CockroachDB-compatible schema.

  • MOLT Fetch: Moves data from a source database (PostgreSQL, MySQL, or CockroachDB) to CockroachDB. It supports initial data loads and continuous replication.

  • MOLT Verify: Checks for data discrepancies between the source and target databases to ensure data integrity during migration.
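
The kind of discrepancy check MOLT Verify performs can be approximated in principle. This toy sketch compares in-memory rows keyed by a primary key, not live databases, and is purely illustrative:

```python
def find_discrepancies(source_rows, target_rows, key="id") -> dict:
    """Report rows missing from the target and rows whose values differ,
    keyed by a primary-key column."""
    src = {r[key]: r for r in source_rows}
    tgt = {r[key]: r for r in target_rows}
    missing = sorted(set(src) - set(tgt))
    mismatched = sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k])
    return {"missing": missing, "mismatched": mismatched}

source = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
target = [{"id": 1, "name": "a"}, {"id": 2, "name": "B"}]
print(find_discrepancies(source, target))  # {'missing': [3], 'mismatched': [2]}
```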

These tools aim to simplify and streamline the migration process, enabling organizations to modernize their infrastructure and move away from legacy systems.

We hope you've enjoyed this overview of cloud migration! Happy modernizing!

Cloud Migration FAQ

What is cloud migration?

Cloud migration is the process of moving an organization's digital assets, services, databases, IT resources, and applications into the cloud. This may involve transferring data and applications from an on-premises data center to a cloud-based infrastructure or shifting resources from one cloud environment to another.

What are the top cloud service providers?

Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).

What are the benefits of cloud migration?

Cloud migration reduces IT costs and provides scalability, flexibility, and high reliability for applications and services while enabling the use of advanced technologies like AI and big data analytics. It supports global reach, enhancing user experience and modernization efforts.

What are the common challenges of cloud migration?

Without careful planning, cloud migration can cause downtime and service disruptions. A clear understanding of the nuances of cloud migration is needed to ensure data security, technical compatibility, and proper cost management. Navigating the complexity of cloud architecture and ensuring staff expertise also pose significant challenges. There is no one-size-fits-all solution, so attention must be paid to the specifics of each ecosystem.

What is cross-cloud migration?

Cross-cloud migration, also known as multi-cloud migration, refers to the process of moving data, applications, and workloads between different cloud service providers or utilizing multiple cloud environments to run various parts of an organization's IT landscape.

What are cloud migration services?

Cloud migration services involve moving data, applications, and other business elements from an on-premises environment or one cloud provider to another cloud environment.

What are common cloud migration services?

AWS DMS (Database Migration Service) is a popular tool for organizations to utilize, as it provides access to the AWS ecosystem. Database vendors also often offer their own migration services. Cockroach Labs provides a set of tools to support every step of the data migration process (for both self-hosted and managed Cloud offerings): MOLT SCT, MOLT Fetch, and MOLT Verify.

How do I get started with cloud migration?

You can check out comprehensive resources online from database providers, such as The Architect's Guide to SQL Database Modernization: Your Step-by-Step Roadmap, which provides guidelines on how to navigate the entire modernization journey, from on-prem to hybrid to multi-region and multi-cloud.

More about Cloud Migration

DSQL

What is DSQL?

Published on March 14, 2025

DSQL can refer to either the proprietary AWS Aurora DSQL database or to distributed SQL databases – a class of databases designed to provide horizontal scalability, high availability, and strong consistency while maintaining the relational database capabilities that developers expect. Unlike traditional SQL databases, which often require complex sharding or replication strategies to scale, distributed SQL databases natively distribute data across multiple nodes and regions, allowing applications to handle increasing workloads without sacrificing performance.

What is Amazon Aurora DSQL?

Amazon Aurora DSQL is a serverless distributed SQL database introduced by AWS in late 2024. It is designed to offer scalability, strong consistency, and high availability while eliminating infrastructure management. Aurora DSQL extends the Amazon Aurora family by providing an active-active multi-region architecture with automatic sharding, replication, and failover capabilities.

Key Features of Aurora DSQL

  • Serverless & Fully Managed – No need for manual provisioning, scaling, or maintenance.

  • Multi-Region Active-Active Availability – Supports synchronous replication across multiple AWS regions with high availability guarantees

  • Optimistic Concurrency Control (OCC) – Handles transactions using OCC and snapshot isolation, which may lead to higher transaction abort rates under high contention workloads

  • Automatic Sharding & Scale-Out Performance – Aurora DSQL automatically distributes data across nodes, scaling for both reads and writes
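
Under OCC, the burden of handling commit-time conflicts shifts to the client, which is expected to retry aborted transactions. The sketch below is a toy in-memory illustration of the validate-at-commit pattern — `Store`, `ConflictError`, and `transfer` are hypothetical names for illustration, not part of any AWS API:

```python
class ConflictError(Exception):
    """Raised when commit-time validation detects a conflicting write."""

class Store:
    """Toy versioned key-value store with optimistic commit validation."""
    def __init__(self, data):
        self._data = dict(data)
        self._versions = {k: 0 for k in data}

    def get(self, key):
        return self._data[key]

    def version(self, key):
        return self._versions[key]

    def commit(self, read_versions, writes):
        # Validation: every key read must be unchanged since it was read.
        for key, ver in read_versions.items():
            if self._versions[key] != ver:
                raise ConflictError(key)
        for key, value in writes.items():
            self._data[key] = value
            self._versions[key] += 1

def transfer(store, src, dst, amount):
    """One optimistic attempt: read, compute, then validate at commit time."""
    read_versions = {k: store.version(k) for k in (src, dst)}
    new_src = store.get(src) - amount
    new_dst = store.get(dst) + amount
    store.commit(read_versions, {src: new_src, dst: new_dst})

def run_with_retries(txn, attempts=5):
    """OCC clients retry aborted transactions, ideally with backoff."""
    for _ in range(attempts):
        try:
            return txn()
        except ConflictError:
            continue  # another writer won; re-read and try again
    raise RuntimeError("transaction aborted after repeated conflicts")

store = Store({"a": 100, "b": 0})
run_with_retries(lambda: transfer(store, "a", "b", 30))
print(store.get("a"), store.get("b"))  # 70 30
```

The practical takeaway is the retry loop: under high contention, more attempts abort at validation, which is the trade-off the OCC bullet above describes.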

Aurora DSQL belongs to the distributed SQL class of databases, and it can be compared to offerings like Google Cloud Spanner, Apache Cassandra, and CockroachDB. All of these systems address the challenge of scaling databases horizontally while maintaining consistency, but they differ in design decisions, consistency models, performance characteristics, and pricing.

Aurora DSQL: Key Limitations

AWS announced the public preview of Aurora DSQL at AWS re:Invent 2024, and as of March 2025, Aurora DSQL is still in preview. In the preview, Aurora DSQL is available in limited regions (e.g., three US regions) and is free to try. AWS’ investment in distributed SQL is particularly exciting to companies like Cockroach Labs, which offered the first commercially available distributed SQL database a decade ago.

Because Aurora DSQL is in preview, there are still a number of feature gaps, including incomplete PostgreSQL compatibility and availability in only a few regions. In addition, the service is not yet battle-tested at the level of the original Aurora or other mature databases. AWS is likely to continue improving Aurora DSQL’s compatibility and expanding its regional availability throughout 2025. If it follows a trajectory similar to past Aurora features, general availability (GA) could arrive by late 2025, possibly with more enterprise features and compliance certifications (at the time of writing it is preview-only, so not for production use yet).

Overall, Aurora DSQL is an exciting development in cloud databases – bringing the scalability of NoSQL and the power of SQL together. Positioning itself against technologies like Google Cloud Spanner and CockroachDB, DSQL is another option for customers looking for a serverless operation, multi-region design, and aspirational PostgreSQL compatibility. For now (early 2025), it’s a technology to watch and experiment with, especially for those already in the AWS ecosystem or those hitting the limits of single-node databases and considering a distributed SQL alternative.

Aurora DSQL vs. Google Cloud Spanner

Google Cloud Spanner is a distributed SQL database (first described by Google in 2012) that offers global-scale transactions with strong consistency. Both Spanner and Aurora DSQL emphasize horizontal scalability, fault-tolerance, and ACID consistency across regions, but there are notable differences:

  • Architecture & Consistency: Spanner relies on a proprietary TrueTime API (atomic clocks & GPS) to achieve external consistency (linearizable global order of transactions). Aurora DSQL uses AWS’s Time Sync Service for tightly synchronized clocks to support its snapshot isolation model. In practice, both deliver synchronous replication across regions with no data loss, but Spanner’s approach guarantees a single global serialization order for transactions. Aurora DSQL’s approach aims for minimal latency impact and resilience (it requires 3 regions—two active and one witness—for quorum, similar to Spanner’s Paxos-based replication).

  • Performance: AWS has claimed significant performance advantages for Aurora DSQL over Spanner. In fact, Amazon’s CEO announced that Aurora DSQL achieved reads and writes four times faster than Spanner in internal benchmarks. It will be interesting to see the performance comparison play out as DSQL gains more users.

  • Scalability & Use Cases: Both databases can scale horizontally to large sizes. Aurora DSQL can also operate across multiple regions, but gives you more control to link specific regions (it’s currently only in a few AWS regions during preview). Spanner has been used for multi-region consistent deployments, while Aurora DSQL is yet to be truly tested on an enterprise scale.

  • Platform & Compatibility: Spanner is available only on Google Cloud Platform, whereas Aurora DSQL is an AWS service (and currently in limited preview regions on AWS). Aurora DSQL’s PostgreSQL compatibility means it can work with many existing tools and ORMs, whereas Spanner uses its own SQL dialect (though JDBC drivers exist, and Google provides a PostgreSQL interface for Spanner in a separate product variant). Neither offers compelling options for those seeking multi-cloud deployments in the future.

Aurora DSQL vs. Apache Cassandra

Apache Cassandra is a popular open-source NoSQL database known for its extreme scalability and always-on design. It takes a very different approach from Aurora DSQL, trading some consistency guarantees for partition tolerance and speed. Comparing Cassandra to Aurora DSQL highlights the differences between NoSQL (AP) systems and distributed SQL (CP/ACID) systems:

  • Data Model and Features: Cassandra is a NoSQL wide-column store, whereas Aurora DSQL is a relational SQL database. Aurora DSQL supports SQL queries, joins, secondary indexes, and ACID multi-row transactions. Cassandra has long offered secondary indexes (with storage-attached indexes arriving in version 5.0) and lightweight compare-and-set transactions, but general-purpose ACID transactions remain on its roadmap rather than broadly available.

  • Consistency Model: Cassandra clients can tune consistency per query (e.g., ONE, QUORUM, or ALL for reads and writes). Aurora DSQL is designed for strong consistency – a committed transaction’s changes are visible immediately to all subsequent reads cluster-wide. Essentially, Aurora DSQL claims immediate consistency and full ACID semantics, whereas Cassandra follows an “eventually consistent” model that sacrifices some correctness for availability and performance.

  • Performance and Scalability: Both systems scale horizontally, but Cassandra’s scaling is “shared-nothing” and proven in some of the largest data workloads in the world. Aurora DSQL’s scaling is impressive in that it handles complex SQL operations while scaling, but being newer, real-world benchmarks will further reveal its performance at extreme scale. Cassandra can ingest massive write loads and scale out to petabytes of data across dozens or hundreds of nodes. Because it forgoes synchronous cross-node coordination on writes (unless configured otherwise), it can offer lower write latency and higher throughput in some scenarios.

  • Use Cases & Flexibility: Use cases that suit Apache Cassandra include scenarios where you need to handle huge volumes of data with simple queries, and can tolerate eventual consistency – for example, IoT data ingestion, analytics where slight delays are okay, messaging logs, or content feed storage. If you need to perform complex queries (multi-table joins, aggregations, etc.) and rely on transactional consistency, a distributed SQL system like Aurora DSQL or CockroachDB could be more appropriate.

Aurora DSQL vs. CockroachDB

CockroachDB was the first commercially available distributed SQL database, known for its performance, high availability, scalability, and resilience. CockroachDB and Aurora DSQL share goals of combining SQL features with horizontal scale, but they differ in maturity and deployment models:

  • Deployment and Flexibility: CockroachDB can be deployed anywhere – on-premises, on any cloud, or via Cockroach Labs’ managed service. It supports hybrid and multi-cloud deployments, giving companies cloud flexibility and even the ability to run across cloud providers. Aurora DSQL, on the other hand, is a proprietary AWS service only, offered as a serverless managed database on AWS.

  • Consistency and Transactions: Both systems ensure strong consistency for transactions. A key difference is the transaction isolation level: CockroachDB runs all transactions with serializability by default, which is the strictest isolation level, ensuring no anomalies or write skew issues occur. Aurora DSQL implements repeatable-read snapshot isolation (optimistic concurrency control), which allows for high concurrency and performance, but under higher contention may result in transaction rollbacks for conflicts. For workloads with lots of contention or hot spots, CockroachDB’s approach avoids anomalies and it naturally supports features like foreign keys and integrity constraints across the cluster. Aurora DSQL in preview does not yet support certain PostgreSQL features like foreign keys, triggers, or views, whereas CockroachDB has long supported foreign keys, JSON data, and many Postgres-compatible features. This means that today CockroachDB will handle complex relational schemas more completely, while Aurora DSQL’s focus has been on core transactional performance and will likely add additional features over time.

  • Performance and Scalability: Both databases are built to scale out on commodity hardware and handle heavy loads. CockroachDB has proven performance in production deployments, executing massive numbers of transactions with low latency and no manual sharding – it automatically balances data across nodes. It maintains indexes and constraints cluster-wide, so applications always see consistent, up-to-date data. One area of distinction is latency and locality: CockroachDB allows configuring data locality, and it automatically routes queries to nearest data copies. As of its preview, Aurora DSQL does not yet offer granular data locality controls to, say, pin a table to one region – it treats the entire cluster as a single logical database that is replicated. This could change in the future, but currently CockroachDB provides more control and has been successfully adopted in enterprises around the world.
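
The “write skew” anomaly mentioned above is the classic gap between snapshot isolation and serializability: two transactions each read an invariant-relevant value, write disjoint rows, and both commit because neither overwrote what the other wrote. The following toy Python simulation of that schedule (using the textbook on-call doctors example, not real database code) shows the invariant breaking:

```python
# Invariant we want to preserve: at least one doctor must stay on call.
on_call = {"alice": True, "bob": True}

# Under snapshot isolation, both transactions read from a snapshot taken
# before either one writes.
snapshot_t1 = dict(on_call)
snapshot_t2 = dict(on_call)

# T1: Alice goes off call because, per her snapshot, someone else is on call.
if sum(snapshot_t1.values()) > 1:
    on_call["alice"] = False

# T2: Bob does the same, reading from his own (now stale) snapshot.
if sum(snapshot_t2.values()) > 1:
    on_call["bob"] = False

# The writes touch disjoint rows, so there is no write-write conflict and
# both transactions commit -- yet the invariant is silently violated.
# A serializable database would abort one of the two instead.
print(sum(on_call.values()))  # 0 -- nobody is left on call
```

This is the class of anomaly a serializable-by-default system prevents at the cost of more transaction retries, and an OCC/snapshot-isolation system trades away for concurrency.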


RELATED: Watch our webinar on-demand, “CockroachDB vs. Amazon Aurora: Comparing Distributed Relational Databases.”


If an organization values total cost of ownership and freedom from vendor lock-in, they might lean toward CockroachDB; if they value hands-off management and deep AWS integration, Aurora DSQL is attractive. It’s also worth noting that CockroachDB, with multi-cloud capability, can help avoid cloud concentration risk, whereas Aurora DSQL ties your data layer to AWS.

Battle-Tested vs. Beta: Aurora DSQL vs. CockroachDB

While both Aurora DSQL and CockroachDB are distributed SQL databases, CockroachDB offers superior transactional consistency, PostgreSQL compatibility, operational flexibility, multi-region, and multi-cloud capabilities. Since CockroachDB has been commercially available for nearly a decade, the team has been innovating, pioneering, and grappling with distributed SQL technology for significantly longer. The maturity of CockroachDB is a testament to the engineers behind the technology. The table below highlights key differences:

When to Choose CockroachDB Over Aurora DSQL

  • For business-critical workloads: CockroachDB offers stronger consistency (Serializable isolation), ensuring better reliability under concurrent transactions.

  • For multi-cloud or hybrid deployments: Aurora DSQL is AWS-exclusive, while CockroachDB runs across all cloud providers and on-prem.

  • For PostgreSQL compatibility: CockroachDB supports features like foreign keys, triggers, stored procedures, and sequences, while Aurora DSQL currently does not.

  • For predictable pricing and control: CockroachDB offers flexible pricing models and deployment options (serverless, dedicated, self-hosted, and pay-as-you-go), whereas Aurora DSQL, while currently in a free preview, provides no clarity as to how the pricing could change over time.

Aurora DSQL and CockroachDB both provide scalable, distributed SQL solutions, but CockroachDB stands out for enterprise-grade consistency, PostgreSQL compatibility, and deployment flexibility. Additionally, CockroachDB is battle-tested. Organizations looking for a multi-cloud, high-performance transactional database with rich SQL support will find CockroachDB to be a superior choice over Aurora DSQL.

Get started with CockroachDB Cloud today. We’re offering $400 of free credits to help kickstart your CockroachDB journey, or get in touch today to learn more.

DSQL FAQ

What is DSQL (Distributed SQL)?

DSQL can refer to either the proprietary database, AWS Aurora DSQL, or to a class of databases designed to provide horizontal scalability, high availability, and strong consistency while maintaining the relational database capabilities that developers expect.

What is distributed SQL?

Distributed SQL databases are a class of databases designed to provide horizontal scalability, high availability, and strong consistency while maintaining the relational database capabilities that developers expect.

What is Aurora DSQL?

Announced at AWS re:Invent 2024, Aurora DSQL belongs to the class of distributed SQL databases, and it can be compared to offerings like Google Cloud Spanner, Apache Cassandra, and CockroachDB. All these systems address the challenge of scaling databases horizontally while maintaining consistency, but they differ in design decisions, consistency models, performance characteristics, and pricing.

What are the key features of DSQL?

At the time of writing, DSQL is serverless and fully managed, and offers multi-region active-active availability, Optimistic Concurrency Control (OCC), and automatic sharding. Since Aurora DSQL is in preview, we anticipate changes to be forthcoming.

What are the most commonly used distributed SQL databases?

Today, the most commonly used distributed SQL databases include Google Spanner and CockroachDB. CockroachDB was the first commercially available distributed SQL database.

How does CockroachDB compare with Aurora DSQL?

While both Aurora DSQL and CockroachDB are distributed SQL databases, CockroachDB offers superior transactional consistency, PostgreSQL compatibility, operational flexibility, and multi-cloud capabilities. Since CockroachDB has been commercially available for nearly a decade, the team has been innovating, pioneering, and grappling with the distributed SQL technology for significantly longer.

More about DSQL

Disaster Recovery

What is Disaster Recovery? 

Disaster recovery (also known as DR) is a critical aspect of database management, ensuring that systems can quickly recover from unexpected events that disrupt normal operations, with minimal to no data loss. 

Disaster recovery refers to the strategies and processes that an organization puts in place to restore normal operations after a disruptive event. These events can range from hardware failures and data corruption to security breaches and natural disasters. A strong disaster recovery plan helps to minimize downtime and data loss, which maintains business continuity.

What is a Disaster Recovery Plan?

A disaster recovery plan is a documented, structured approach with instructions for responding to unplanned incidents. It includes:

  • Definition and significance: A disaster recovery plan outlines the steps to recover from various types of disasters, ensuring minimal downtime and data loss.

  • Key strategies and goals: These include minimizing downtime, ensuring data integrity, and maintaining business continuity.

  • Roles and responsibilities: Clearly defined roles for IT staff and service providers are crucial for effective disaster recovery.

  • Critical systems involved: This includes databases, applications, and network infrastructure.

Steps in Disaster Recovery

Recovering from disasters requires detailed steps, depending on the type of failure. A few examples of possible failures include: 

Hardware and network failures 

  • Disk failures: Hard drive crashes, SSD failures, or RAID controller issues

  • Data center outages: Power outages or network partitioning impacting access

Natural disasters

  • Earthquakes, floods, fires: Physical damage to database servers in on-premises data centers

  • Power failures: Unexpected shutdowns leading to incomplete transactions or corruption

Human errors

  • Accidental deletion: Unintentional DROP TABLE, DELETE, or UPDATE commands

Security breaches

  • Malware or ransomware attacks: Encrypting or corrupting database files, making data inaccessible

After determining the kinds of failure that your system could encounter, it’s important to identify and formalize the basis of your disaster recovery plan. For example: What specific actions are available for short-term and long-term remediation? What tools and support are at your disposal? What are the potential risks and their impacts? Then you can build out your recovery objectives and codify your plan for regular testing.

Below, we have further explained each of the above areas.

Specific actions for short-term and long-term remediation 

In disaster recovery, short-term remediation refers to getting all systems operational and then keeping applications running, whereas long-term remediation refers to improving the overall recovery plan. 

Immediate short-term actions might include recovering from your latest backup or failing over to a standby, while long-term actions involve analyzing the cause of the disaster and improving the recovery plan.

In the case of disaster recovery for a database, the distributed SQL database CockroachDB, for example, provides tools like automated backups, monitoring dashboards, and support services to assist in disaster recovery.

Identifying potential risks and their impact

The first step in constructing a reliable DR plan is to conduct a comprehensive risk assessment. Preparing this assessment involves the cataloging of all potential threats that could disrupt normal operations — these may include:

  • Hardware failures 

  • Cyberattacks 

  • Data corruption 

  • Natural disasters (such as floods or earthquakes) 

For distributed SQL databases, this also means identifying risks associated with network instability, regional outages, and replication delays. 

By understanding each threat’s likelihood and severity, organizations can gauge the potential impact on their data, applications, and overall productivity. The result of this exercise should be a prioritized list of vulnerabilities, enabling teams to focus their attention on mitigating the most critical risks and formulating tailored recovery strategies. 

Establishing Recovery Objectives 

Recovery objectives serve as the guiding metrics for measuring the effectiveness of a DR plan. Two primary parameters stand out: 

  1. Recovery Time Objective (RTO) – the maximum acceptable amount of time to restore normal operations after a disaster

  2. Recovery Point Objective (RPO) – the maximum acceptable amount of data loss measured in time

In other words, the RTO defines how quickly systems must be restored after a disruption, while the RPO specifies how much data loss is tolerable to an organization. For example, an enterprise might have an RTO of two hours and an RPO of 15 minutes, meaning that it can tolerate up to two hours of downtime and 15 minutes of data loss. By contrast, a financial services firm might require near-zero data loss and mere minutes of downtime, whereas a smaller e-commerce platform could be more flexible. 

Determining these objectives ensures that teams can choose technologies and architectures — such as CockroachDB optimized for high availability and geo-replication — that align with business needs. CockroachDB's architecture, for example, supports low RTO and RPO through features like automated backups, multi-region replication, and high availability.

Specifying RTO and RPO benchmarks helps organizations set realistic recovery goals and choose appropriate recovery strategies. These benchmarks also help to set budgetary constraints, guiding investments in infrastructure, automation, and redundancy.
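
As a back-of-the-envelope check, the worst-case data loss under a periodic backup schedule is roughly the backup interval plus any replication or upload lag, and that figure (along with measured restore time) can be compared against the stated objectives. A small illustrative Python sketch, with all numbers hypothetical:

```python
def meets_objectives(backup_interval_min, upload_lag_min, restore_time_min,
                     rpo_min, rto_min):
    """Worst case: the disaster hits just before the next backup completes."""
    worst_case_data_loss = backup_interval_min + upload_lag_min
    return worst_case_data_loss <= rpo_min and restore_time_min <= rto_min

# The enterprise from the example above: RTO = 120 min, RPO = 15 min.
print(meets_objectives(backup_interval_min=10, upload_lag_min=2,
                       restore_time_min=90, rpo_min=15, rto_min=120))  # True

# Hourly backups blow past a 15-minute RPO even though the RTO is fine.
print(meets_objectives(backup_interval_min=60, upload_lag_min=2,
                       restore_time_min=90, rpo_min=15, rto_min=120))  # False
```

The second case is why tight RPOs typically push organizations from periodic backups toward continuous replication.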

Implementing Backup and Recovery Solutions

Once the risk landscape and objectives are clear, the next step is to deploy suitable backup and restoration mechanisms. In a distributed SQL environment, data is often replicated across multiple nodes or regions, providing built-in resilience. However, true disaster recovery requires a multi-layered approach. This may include:

  • Snapshot-based backups 

  • Point-in-time recovery (PITR) capabilities 

  • Encrypted offsite or cloud-based backups

Automation is key to this component: scheduling regular backups and verifying data integrity ensures a smooth and rapid failover process. Some organizations also maintain “warm” or “hot” standby systems, which stand ready to instantly take over operations if the primary environment fails. By combining these tools and technologies, teams can ensure that their data and applications remain accessible, even under extremely challenging circumstances.
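
“Verifying data integrity” can be as simple as recording a checksum when a backup is written and re-checking it before relying on that backup in a restore. A minimal sketch of that check — the file paths and helper names here are illustrative, not tied to any particular backup tool:

```python
import hashlib

def checksum(path, chunk_size=1 << 20):
    """SHA-256 of a backup file, read in chunks to handle large archives."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def verify_backup(path, recorded_checksum):
    """Compare the checksum stored at backup time against a fresh one."""
    return checksum(path) == recorded_checksum
```

A scheduled job that runs `verify_backup` against every archive turns a silent corruption into an alert long before the backup is needed.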

Regularly Testing and Updating the Plan

A disaster recovery plan is not a static document — it must evolve with an enterprise’s changing technologies, growth in data volume, and emerging threats. Regularly conducting DR drills and simulations helps validate the effectiveness of each strategy, reveals gaps in processes, and uncovers any hidden dependencies. Teams should test failover procedures, confirm backup integrity, and measure recovery performance against established objectives. 

Equally important, IT teams must keep the disaster recovery plan up-to-date to reflect infrastructure changes, software upgrades, and new compliance mandates. Continuous testing and refinement ensure that when a true disaster strikes, the organization is fully prepared to restore operations quickly, minimize loss, and maintain customer trust.

What is Cloud Disaster Recovery?

Cloud disaster recovery involves using cloud-based resources and services to back up data and applications, ensuring that they can be restored in the event that a disaster takes place. Key aspects include:

Definition and significance: Cloud disaster recovery leverages the scalability and flexibility of cloud services to provide a cost-effective and efficient recovery solution.

Cloud Disaster Recovery Plan: The cloud-native distributed SQL database CockroachDB, for example, offers its users comprehensive disaster recovery capabilities that include automated backups, multi-region deployments, and high availability features.

What is Disaster Recovery as a Service (DRaaS)? 

Disaster Recovery as a Service (DRaaS) leverages cloud-based solutions to provide organizations with an outsourced, turnkey approach to recovering their IT environments after a disruptive event. Instead of relying solely on internal infrastructure and expertise, businesses leveraging DRaaS partner with a third-party provider which maintains the necessary hardware, software, and processes to restore operations quickly and efficiently.

DRaaS is particularly appealing for organizations that prefer to avoid the complexity and cost of building and managing their own secondary data centers. The DRaaS provider ensures that critical systems, applications, and data are continuously replicated in a secure, offsite environment. In the event of a disaster — whether it’s a cyberattack, hardware failure, or natural calamity — teams can fail over to the DRaaS provider’s resources and continue operating with minimal downtime.

Another important advantage of DRaaS is its scalability and flexibility. As businesses grow or their data protection needs evolve, they can easily adjust their DRaaS solution without significant capital investments. DRaaS vendors also often offer a range of recovery options, which helps organizations to fine-tune their RTOs and RPOs based on their unique requirements and compliance mandates.

With DRaaS, enterprises have a partner that is continuously monitoring their systems, verifying backups, and testing restoration procedures. It’s a proactive approach to ensure that a robust, well-orchestrated response is a failover command away when disaster strikes, significantly reducing financial losses, operational disruptions, and reputational damage.

What is Business Continuity Planning (BCP)?

Business Continuity Planning (BCP) is another aspect of disaster recovery. It extends beyond the immediate recovery of IT systems, taking a holistic approach to maintaining organizational operations during and after disruptive events. 

While disaster recovery focuses on restoring specific infrastructure and data, BCP ensures that the business as a whole can continue functioning — serving customers, meeting regulatory requirements, and protecting brand reputation — regardless of the interruption’s scale or nature.

A comprehensive BCP begins with a detailed analysis of:

  • Critical business functions 

  • Supply chain dependencies 

  • Communication strategies 

This process involves identifying the technology required for seamless operations, as well as the workforce, facilities, and external partners essential for day-to-day activities. For example, BCP might entail contingency plans for shifting workforces to alternate locations, rerouting logistics, or enabling remote operations if key offices become inaccessible.

Regular training and simulation exercises are fundamental to effective BCP. These drills verify that stakeholders understand their roles, communication lines remain open, and everyone can adapt quickly if a crisis occurs. Continuous refinement is equally important: The plan must be revisited and updated as the business evolves, with steps like integrating new software, diversifying suppliers, or introducing hybrid work models.

Ultimately, BCP ensures organizations are ready to respond swiftly and cohesively to any disruption – safeguarding both their IT systems and their overall resilience. By focusing on maintaining core business functions rather than just technology recovery, an organization’s BCP can ensure that operations remain stable, customers are supported, and reputations stay intact, no matter what challenges they face.

What is IT Disaster Recovery?

IT Disaster Recovery (ITDR) is focused on restoring the technical core of an organization’s infrastructure after a catastrophic event. While BCP and DRaaS provide broader frameworks, ITDR concentrates on specific tasks: rebuilding servers, retrieving critical data, reinstating network connectivity, and ensuring that key applications return to their pre-disaster performance levels.

A solid ITDR strategy often starts with a detailed inventory of IT assets — including servers, storage arrays, network devices, and software platforms — coupled with an understanding of their criticality. By prioritizing systems according to their importance, teams can allocate resources wisely and focus their recovery efforts where they matter most. Techniques like virtualization, multi-region replication, and snapshot-based backups are commonly employed to ensure rapid, reliable restoration.

Consistent testing is central to ITDR. Regular drills, failover simulations, and integrity checks help validate the effectiveness of backup processes, highlight gaps in procedures, and confirm that recovery steps can be performed smoothly under pressure. As technology evolves, ITDR plans must be updated to include new tools, platforms, and security protocols, ensuring ongoing alignment with an organization’s environment and risk profile.

ITDR’s primary goal is to minimize downtime and data loss at the technical layer. By maintaining robust backups, establishing clear restoration workflows, and continuously refining ITDR processes, organizations can ensure that their technology stack remains reliable, resilient, and ready to recover when the unexpected strikes. This lays the groundwork for overall organizational stability and long-term success.

The Importance of a Disaster Recovery Plan

Having a robust disaster recovery plan is essential for maintaining business continuity and minimizing the impact of unexpected events. When organizations regularly update and test their plan, this ensures that it remains effective and can be relied upon when needed. Implementing a comprehensive disaster recovery strategy with a distributed SQL database like CockroachDB can help organizations to achieve their recovery objectives and maintain high availability.

Explore CockroachDB for Disaster Recovery

CockroachDB offers robust disaster recovery capabilities essential for modern database management. With features like Raft replication and fault tolerance, CockroachDB ensures high availability and automatic recovery from disruptions, including hardware failures and data corruption. Its built-in resilience minimizes downtime and data loss, maintaining business continuity.

Get started with CockroachDB Cloud for free today!


RELATED

Architect your zero downtime strategy with CockroachDB: The Definitive Guide.


Disaster Recovery FAQ

What is a disaster recovery plan?

A disaster recovery plan is a documented, structured approach with instructions for responding to unplanned incidents. It outlines the procedures to follow to recover and protect a business’s IT infrastructure in the event of a disaster. This plan includes strategies for handling various types of disruptions, such as hardware failures, data corruption, and security breaches, ensuring that the organization can quickly resume normal operations.

What are the benefits of a disaster recovery plan?

A disaster recovery plan provides several benefits for organizations. Firstly, it ensures business continuity by minimizing downtime and data loss during catastrophic events such as hardware failures, data corruption, or security breaches. By having a structured approach with predefined roles and responsibilities, organizations can quickly restore operations and maintain data integrity. Regular testing and validation of the disaster recovery plan also help identify potential weaknesses and improve overall resilience, which ensures that the organization can effectively respond to – and recover from – unexpected incidents.

What is the difference between disaster recovery and business continuity?

Disaster recovery focuses specifically on restoring IT systems and data access after a disruption. It involves technical solutions like backups, data replication, and failover procedures. Business continuity, on the other hand, encompasses a broader scope, ensuring that all critical business functions can continue during and after a disaster. This includes not only IT systems but also other aspects like personnel, facilities, and communication strategies.

How does CockroachDB aid in disaster recovery?

CockroachDB aids in disaster recovery by providing built-in high availability and fault tolerance features. It uses Raft replication to ensure data is consistently replicated across multiple nodes, which helps maintain data integrity and availability even during failures. CockroachDB also supports point-in-time backups and Physical Cluster Replication (PCR). These features allow for quick recovery from data loss or corruption, which minimizes downtime and helps to ensure business continuity. By integrating these DR strategies, CockroachDB helps organizations meet their RTO and RPO requirements effectively.

What is the difference between RTO and RPO? 

RTO (Recovery Time Objective) and RPO (Recovery Point Objective) are key metrics in disaster recovery planning. RTO refers to the maximum acceptable amount of time that a system can be offline after a failure, indicating how quickly services must be restored. Meanwhile, RPO defines the maximum acceptable amount of data loss measured in time – it represents the point in time to which data must be recovered. Put another way, RTO focuses on downtime, while RPO focuses on data loss.

What is disaster recovery in a database? 

Disaster recovery in a database refers to the strategies and processes that are set up to restore database operations after a catastrophic event. This could include hardware failures, data corruption, or security breaches. The goal is to minimize downtime and data loss. CockroachDB, for example, is designed to be fault-tolerant and recover automatically, but having a disaster recovery plan ensures quick recovery and limits the impact of such events. This plan typically involves regular backups and possibly PCR to maintain data integrity and availability.

What is backup and restore?

Backup and restore in a database context involves creating copies of data (backups) at specific points in time and using these copies to recover data (restore) in case of data loss or corruption. This process is crucial for disaster recovery, ensuring data integrity and availability. 

CockroachDB’s Physical Cluster Replication (PCR), on the other hand, continuously replicates data at the byte level from a primary cluster to a standby cluster. PCR provides lower RTO and RPO compared to backup and restore, as it allows for quicker failover and minimal data loss.
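A rough way to see why continuous replication improves RPO over periodic backups is to compare worst-case data loss under each strategy. The sketch below uses illustrative assumed values (a 4-hour backup interval, 10 seconds of replication lag); these are not CockroachDB guarantees:

```python
from datetime import timedelta

# Illustrative assumptions, not measured values.
backup_interval = timedelta(hours=4)     # periodic backup schedule
replication_lag = timedelta(seconds=10)  # lag of continuous replication

# With periodic backups, a failure just before the next backup loses
# everything written since the last one.
worst_case_rpo_backup = backup_interval

# With continuous replication (e.g. PCR), worst-case loss is roughly
# whatever has not yet reached the standby cluster.
worst_case_rpo_replication = replication_lag

print("Backup worst-case RPO:     ", worst_case_rpo_backup)
print("Replication worst-case RPO:", worst_case_rpo_replication)
```

The comparison is why backup/restore is typically paired with replication for systems with tight RPO requirements rather than used alone.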

More about Disaster Recovery

pgvector

What is pgvector?

pgvector is an open-source extension for PostgreSQL that provides efficient storage, retrieval, and similarity search of vector data. It is particularly useful for applications involving semantic search, recommendation systems, and natural language processing, which are critical for AI-driven applications, including Large Language Models (LLMs).

CockroachDB introduced vector search capabilities in the 24.2 release. This implementation uses the same interface as pgvector and aims to be compatible with pgvector’s API. This compatibility facilitates seamless integration with tools like Langchain and Hugging Face, making it easier to incorporate data into AI models.

The integration of vector search in CockroachDB combines the strengths of a vector database and an operational database into a single, horizontally scalable solution. This approach simplifies the architecture and eliminates the operational and financial costs associated with maintaining a dedicated vector database. Additionally, CockroachDB’s distributed SQL engine allows for complex SQL operations on vector data, enhancing the performance and efficiency of AI and machine learning workloads.

pgvector adds support for vector data types and vector similarity search to the PostgreSQL database. It is particularly useful for applications involving machine learning, natural language processing, and other domains where vector representations of data (such as embeddings) are commonly used.

Key features of pgvector include:

  • Vector Data Type: It introduces a new data type for storing vectors (arrays of floating-point numbers) directly in PostgreSQL tables.

  • Similarity Search: It provides functions for performing similarity searches on vector data, such as finding the nearest neighbors based on various distance metrics (e.g., Euclidean distance, cosine similarity).

  • Indexing: It supports indexing mechanisms to speed up similarity searches, making it efficient to query large datasets.
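The distance metrics mentioned above reduce to simple arithmetic. The following pure-Python sketch shows Euclidean distance and cosine similarity on toy vectors; pgvector computes these natively via SQL operators, so this is just the underlying math for illustration:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two vectors: smaller means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Angle-based similarity in [-1, 1]: 1.0 means "same direction",
    # regardless of vector magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0]
doc = [2.0, 0.0]    # same direction as the query, different magnitude
other = [0.0, 1.0]  # orthogonal to the query

print(cosine_similarity(query, doc))    # 1.0
print(cosine_similarity(query, other))  # 0.0
```

Cosine similarity ignores magnitude, which is why it is a common choice for comparing embeddings whose lengths carry no meaning.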

What is vector search?

Vector search refers to the process of finding vectors in a dataset that are similar to a given query vector. Vector search is a powerful tool for handling high-dimensional data and is widely used in modern AI and data-driven applications to provide relevant and accurate results based on the inherent similarities in the data. 

Common applications of vector search

Vector search is essential in applications where data is represented as high-dimensional vectors, such as:

  • Machine Learning: Finding similar data points in embedding spaces, such as word embeddings, image embeddings, or user embeddings.

  • Natural Language Processing (NLP): Searching for semantically similar text documents, sentences, or words based on their vector representations.

  • Recommendation Systems: Identifying items (e.g., products, movies) that are similar to a user's preferences, represented as vectors.

  • Computer Vision: Finding images that are visually similar to a query image based on their feature vectors.

Check out this video showcasing how to use CockroachDB to build a product recommendation engine for ecommerce with AI:

How does vector search work?

Vector search typically involves the following steps:

  • Vector Representation: Data is converted into vector representations using techniques like word embeddings (e.g., Word2Vec, GloVe), sentence embeddings (e.g., BERT), or image embeddings (e.g., convolutional neural networks).

  • Similarity Metric: A similarity metric (e.g., Euclidean distance, cosine similarity) is chosen to measure the closeness between vectors.

  • Indexing: Efficient indexing structures (e.g., KD-trees, ball trees, HNSW) are used to speed up the search process, especially for large datasets.

  • Querying: The query vector is compared against the indexed vectors to find the most similar ones based on the chosen similarity metric.
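The steps above can be sketched as a brute-force nearest-neighbor search. This toy Python example uses made-up three-dimensional "embeddings" (real embeddings have hundreds or thousands of dimensions) and skips the indexing step, scanning every vector, which is exactly what index structures like HNSW exist to avoid at scale:

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity: 0.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def nearest_neighbors(query, vectors, k=2):
    # Brute-force scan: compare the query against every stored vector,
    # then keep the k closest.
    scored = sorted(vectors.items(), key=lambda kv: cosine_distance(query, kv[1]))
    return [name for name, _ in scored[:k]]

# Toy embeddings: "cat" and "dog" point in similar directions, "car" does not.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.3],
}

print(nearest_neighbors([0.85, 0.15, 0.05], embeddings))  # ['cat', 'dog']
```

An index replaces the `sorted(...)` full scan with an approximate search over a precomputed structure, trading a small amount of recall for much faster queries.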

Why is pgvector commonly used?

pgvector has become one of the most popular extensions for working with vectors in part because it allows easy integration with PostgreSQL. Users can thus leverage the robustness, scalability, and familiarity of PostgreSQL, making it easier for developers to integrate vector search capabilities into their existing databases without needing a separate specialized system. PostgreSQL has a large and active community, and pgvector benefits from this ecosystem. Users can leverage existing PostgreSQL tools, extensions, and best practices while working with vector data.

By extending PostgreSQL, pgvector allows users to work with vector data using standard SQL queries. This reduces the learning curve and simplifies the development process. pgvector also supports efficient indexing and querying, and can handle large-scale vector data and perform similarity searches quickly, which is crucial for real-time applications. pgvector also allows users to choose the best approach for their specific use case with various distance metrics and indexing methods.

pgvector-compatible vector search in CockroachDB

CockroachDB is deeply invested in providing robust database offerings so that organizations can run their business-critical applications at scale. To that end, the Cockroach Labs team continues to focus on expanding generative AI capabilities, leading to the release of vector search features in v24.2. 

Our vector search features make it easier for users to deploy AI-driven applications with CockroachDB. CockroachDB's vector search functionality is compatible with the pgvector extension for PostgreSQL, allowing users to store and manipulate vectors within the database. 

Although CockroachDB does not currently support vector indexing, it allows for the storage of billions of vectors and supports various vector comparison operators, such as Euclidean distance, inner product, and cosine distance, to facilitate similarity searches. This capability enables developers to build fault-tolerant AI applications that leverage vectors stored in CockroachDB, benefiting from its horizontal data distribution and multi-region abstractions.

pgvector and Vector Search FAQ

What is pgvector?

pgvector is an open-source PostgreSQL extension for efficient storage, retrieval, and similarity search of vector data. It's useful for semantic search, recommendation systems, and natural language processing, critical for AI applications like Large Language Models (LLMs).

What are the key features of pgvector?

The key features of pgvector include storing vectors in PostgreSQL tables, similarity search, and vector indexing.

What is vector search?

Vector search finds vectors in a dataset similar to a query vector. It's used in AI and data-driven applications to provide relevant results based on data similarities.

What are common applications of vector search?

Common applications of vector search include natural language processing, where it finds semantically similar documents, sentences, or words, and recommendation systems, where it identifies similar items such as products, movies, books, documents, or videos.

How does vector search work?

Generally, vector search works by first converting data into vectors using embeddings, then measuring the closeness between vectors with a similarity metric. Indexing can be used to speed up searches; the query vector is then compared against the indexed vectors to find the most similar ones.

Why is pgvector commonly used?

pgvector is popular because it's open-source, integrates with PostgreSQL, and supports efficient indexing and querying. It handles large-scale vector data and performs quick similarity searches, crucial for real-time applications.

What is vector search in CockroachDB?

Starting in v24.2, vector search entered preview in CockroachDB. Vector search in CockroachDB finds data points similar to a query using vectors. Compatible with pgvector, it stores and manipulates vectors, useful for AI applications like LLMs, and supports various vector comparison operators for similarity searches.

More about pgvector

DBaaS (Database-as-a-Service)

What is Database-as-a-Service (DBaaS)?

A Database-as-a-Service (DBaaS) is a cloud-based service model that provides users with access to a database without the need for setting up physical hardware, installing software, or managing the database infrastructure. This service allows organizations to focus on their core business activities while the DBaaS provider handles the database management tasks such as backups, updates, and scaling.

What are the benefits of DBaaS?

There are many potential benefits to using a DBaaS in your organization. DBaaS simplifies the architecture and reduces the operational and financial costs associated with maintaining a dedicated database infrastructure. This service is particularly valuable for organizations looking to modernize their database solutions and reduce the operational headaches of managing existing databases. Benefits include:

  • Operational Simplicity: A DBaaS can offer zero downtime for planned routine maintenance, upgrades, patching, cluster settings changes, and scaling. By letting a company that specializes in databases handle the details of deploying your database solutions, your team can trust the experts to ensure data consistency, regulatory compliance, no data loss, and continued innovation in the database space. In turn, your team can focus on creating business impact rather than operating a database.

  • Scalability: At larger organizations, where multiple teams need different kinds of related data and support different services, platforms, applications, or products for your clients, a dedicated DBaaS can help standardize operations across the organization and allow a smoother rollout of new offerings.

  • Enterprise Integration: You can typically integrate your database solutions with enterprise security features such as single sign-on and role-based access controls. A DBaaS may also support observability tools for exporting metrics and logs, or help manage compliance, audit, and data loss protection.

DBaaS vs. Cloud Database

The primary difference between database-as-a-service (DBaaS) and a cloud database lies in the level of management and operational responsibilities handled by the service provider versus the user. For example, a DBaaS provider might manage the database infrastructure, including setup, maintenance, backups, updates, and scaling. This allows organizations to focus on their core business activities without worrying about the underlying database management tasks. On the other hand, a cloud database might not offer these “concierge” services. While a cloud database provider might still host the database on their cloud infrastructure, the user is typically responsible for managing the database, including the tasks listed above, increasing the complexity and resource requirements for managing the database.

In summary, DBaaS provides a more managed and simplified approach to database management, reducing the operational burden on the user.

Key considerations when choosing a DBaaS provider

It is a big decision to entrust a provider with managing your database needs. Here are a few key considerations if you choose to find a third-party database provider:

Performance and Latency

Performance and latency are critical considerations when using Database-as-a-Service (DBaaS). Since DBaaS relies on cloud infrastructure, the physical distance between the data center and the end-users can impact the speed at which data is accessed and processed. High latency can lead to slower application performance, which can be particularly problematic for real-time applications and services that require quick data retrieval and updates. Providers often offer various performance tiers and configurations, allowing businesses to select the appropriate level of resources to meet their specific needs.

Compliance and Regulatory Issues

Compliance and regulatory issues are significant challenges when adopting DBaaS, especially for industries with stringent data protection and privacy requirements, such as healthcare, finance, and government sectors. Organizations must ensure that their DBaaS provider complies with relevant regulations, such as GDPR, HIPAA, and PCI-DSS. This includes ensuring that data is stored, processed, and transmitted securely, with appropriate encryption and access controls in place. Additionally, organizations should verify that the provider offers features that support compliance, such as audit logs, data residency options, and regular security assessments. By carefully evaluating the compliance capabilities of a DBaaS provider, businesses can mitigate the risk of regulatory breaches and ensure that their data management practices align with legal requirements.

Data Migration

Data migration to a DBaaS platform can be a complex and resource-intensive process. Migrating existing databases may involve transferring large volumes of data, reconfiguring applications, and ensuring data integrity throughout the process. This can lead to potential downtime and disruptions to business operations if not managed effectively. To facilitate a smooth migration, organizations should leverage the tools and services offered by DBaaS providers, such as automated migration tools, detailed documentation, and professional support. It is also crucial to plan the migration meticulously, conduct thorough testing, and implement a phased approach to minimize risks. By taking these steps, businesses can ensure a seamless transition to a DBaaS environment with minimal impact on their operations.

Check out the video below to learn how Rightmove, the UK’s #1 property portal – and a publicly traded, multi-billion dollar company that has been in business since 2000 – migrated from Oracle to CockroachDB to support their 24 years’ worth of industry data:

Downtime

Downtime is a critical concern for any database system, and DBaaS is no exception. Relying on a third-party provider means that any service outages or disruptions on their end can directly impact the availability of your database. This can lead to significant business interruptions, loss of revenue, and damage to reputation. To mitigate the risk of downtime, it is essential to choose a DBaaS provider with a strong track record of reliability and robust Service Level Agreements (SLAs) that guarantee high availability. Additionally, implementing multi-region deployments and failover mechanisms can enhance resilience and ensure that your database remains accessible even in the event of localized outages.

Data Sovereignty

Data sovereignty refers to the concept that data is subject to the laws and regulations of the country or jurisdiction where it is stored. This is a critical consideration for organizations using DBaaS, as storing data in the cloud often involves data centers located in different jurisdictions. Compliance with data sovereignty requirements is essential to avoid legal and regulatory complications. Organizations must ensure that their DBaaS provider offers data residency options that allow them to store data in specific geographic locations that comply with local laws. Additionally, understanding the data protection regulations of the countries where data is stored and processed is crucial for maintaining compliance and protecting sensitive information. By addressing data sovereignty concerns, businesses can confidently leverage DBaaS while adhering to legal and regulatory obligations.

CockroachDB as a DBaaS provider

CockroachDB as a DBaaS simplifies the architecture and reduces the operational and financial costs associated with maintaining a dedicated database infrastructure. This service is particularly valuable for organizations looking to modernize their database solutions and reduce the operational headaches of managing existing databases. CockroachDB is PostgreSQL compatible, so it feels like PostgreSQL and scales like NoSQL, making it easier to adopt across your whole organization.

The above video is a talk by Netflix engineers on how they provide CockroachDB-as-a-Service to their customers – Netflix developers – who have a variety of use cases, from a reliable device management platform and supporting a ML orchestration platform, to Netflix’s new gaming platform. By deploying CockroachDB-as-a-Service, Netflix engineers can focus on expanding the scope of their business and rely on CockroachDB to ensure that all their applications are always running smoothly with consistent data.

DBaaS FAQ

What is Database-as-a-Service (DBaaS)?

Database-as-a-Service (DBaaS) is a cloud computing service that provides users with access to a database without the need for physical hardware, software installation, or database management. It allows businesses to focus on using the database rather than maintaining it, as the service provider handles all backend tasks such as updates, backups, and scaling.

How does DBaaS work?

DBaaS works by hosting databases on cloud infrastructure. Users can access and manage their databases through a web interface or API. The service provider takes care of the underlying infrastructure, including servers, storage, and network resources, ensuring high availability, security, and performance.

What are the benefits of using DBaaS?

Benefits of using DBaaS include cost efficiency, scalability, accessibility, and performance. In addition, a major benefit is that the DBaaS provider can handle updates, backups, and security, freeing up your resources to focus on building and maintaining business-critical applications.

What types of databases are available as DBaaS?

DBaaS offerings include a variety of database types such as relational databases like CockroachDB or NoSQL databases like MongoDB.

Is DBaaS secure?

Yes, DBaaS providers implement robust security measures including encryption, access controls, and regular security audits. However, it is essential for users to follow best practices such as strong password policies and regular monitoring to ensure data security.

How do I choose the right DBaaS provider?

Consider the following factors when choosing a DBaaS provider:

  • Database Compatibility: Ensure the provider supports the database type you need.

  • Performance and Scalability: Check the provider's performance benchmarks and scalability options.

  • Security Features: Look for comprehensive security measures and compliance certifications.

  • Cost: Compare pricing models and ensure they fit your budget.

  • Support and Reliability: Evaluate the provider's customer support and service level agreements (SLAs).

Can I migrate my existing database to a DBaaS?

Yes, most DBaaS providers offer tools and services to help migrate existing databases to their platform. The migration process typically involves exporting your current database, transferring the data to the cloud, and configuring the new environment. For example, you can check out CockroachDB’s MOLT (Migrate Off Legacy Technology) tools.

What is the difference between DBaaS and traditional database hosting?

Traditional database hosting requires users to manage the hardware, software, and maintenance of the database environment. In contrast, DBaaS offloads these responsibilities to the service provider, allowing users to focus on database usage and application development.

Can DBaaS handle large-scale applications?

Yes, DBaaS is designed to handle applications of all sizes, from small projects to large-scale enterprise applications. Providers offer various scaling options to accommodate growing data and user demands.

More about DBaaS (Database-as-a-Service)

Black Swan Event

A black swan event is an event that has the following three attributes:

  1. It was unexpected.

  2. It had significant, wide-ranging consequences.

  3. After it happens, people will suggest that it was predictable, despite the fact that it was not widely predicted before it happened.

This definition comes from mathematical statistician Nassim Taleb, who coined and popularized the term in his books Fooled by Randomness and The Black Swan.

However, in everyday language when people talk about a “black swan event,” they’re generally thinking just about unexpected events with wide-ranging consequences. The third criterion – that people will rationalize the event as predictable after the fact – isn’t typically discussed.

Officially the term doesn’t have a positive or negative connotation. Black swan events can theoretically be good or bad or neutral. In the real world, however, the term is often used to describe events with negative impacts, such as financial crashes, widespread service outages, and even natural disasters and terrorism.

Long story short: while the official definition of a black swan event is quite a bit more nuanced, in everyday life the term is often used to mean something pretty simple: an unexpected event with significant negative consequences.

(The name, in case you’re wondering, comes from the once-widespread perception that all swans were white, and black swans either didn’t exist or were incredibly rare. In reality, black swans do exist. But they’re only native to Australia, so they’re quite rare everywhere else – including in Europe, where the idea of a “black swan” as a symbol for something unpredictable, unexpected, or unlikely first came to be used.)

Black swan events in technology

In the tech industry, most discussion of black swan events is typically related to infrastructure, and infrastructure failures leading to service outages.

On a global scale, a black swan event in technology could be something like a coronal mass ejection from the sun causing a kind of natural EMP that knocks out electronic systems over a large area. (Whether this would be a true black swan event is debatable, considering that it has happened before, but it happens only rarely and is considered unlikely).

More commonly, though, discussions of black swan events in technology are company-specific, meaning that a “black swan event” is an event that significantly and negatively impacts the company’s services, generally due to some kind of infrastructure failure or outage.

Examples of potential company-level black swan events in technology include:

  • Cloud provider outages.

  • Power outages.

  • System failures.

  • Human mistakes.

Real-world example of a black swan event in tech

More about Black Swan Event

Distributed SQL

What is Distributed SQL?

Distributed SQL represents a significant evolution in database technology, designed to meet the needs of modern cloud applications. Traditional SQL databases were built for data consistency, vertical scalability, and tight integration, which worked well for monolithic applications on single-server environments. However, as the paradigm shifted to distributed applications in the cloud, these traditional databases began to show limitations.

SQL vs. NoSQL

SQL (Structured Query Language) and NoSQL (Not Only SQL) databases serve different purposes and have distinct characteristics. SQL databases, such as CockroachDB, MySQL, and PostgreSQL, are relational databases that use structured query language for defining and manipulating data. They are known for their ACID (Atomicity, Consistency, Isolation, Durability) properties, which ensure reliable transactions and data integrity. SQL databases are ideal for complex queries and transactions, supporting structured data with predefined schemas and relationships.

On the other hand, NoSQL databases, like MongoDB and Cassandra, are designed to handle unstructured or semi-structured data. They may offer flexible schemas, allowing for rapid development and iteration. NoSQL databases are often used for large-scale data storage and real-time web applications due to their ability to scale horizontally and handle high volumes of read and write operations. They typically provide eventual consistency rather than the strong consistency guaranteed by SQL databases.

The Need for Distributed SQL

Modern applications require horizontal scalability, elasticity, and support for microservices. Traditional single-node relational databases, with their fixed schemas and lack of support for distributed data models, are not suited for these needs. Distributed SQL databases address these challenges by combining the consistency and structure of early relational databases with the scalability, survivability, and performance first pioneered in NoSQL databases.

Core Capabilities of Distributed SQL Databases

Horizontally scalable: Distributed SQL databases can seamlessly scale to mirror the capabilities of cloud environments without introducing operational complexity. They distribute data across multiple nodes, ensuring efficient resource utilization.

Consistency: These databases deliver high levels of isolation in distributed environments, mediating contention and ensuring transactional consistency across multiple operations.

Resilience: Distributed SQL databases provide inherent resilience, reducing recovery times to near zero and replicating data naturally without external configuration.

Geo-replication: They allow for the distribution of data across geographically dispersed environments, ensuring low latency access and compliance with data sovereignty requirements.

In addition to the unique capabilities of distributed SQL, these databases must also meet foundational requirements such as:

  • Operational efficiency: Easy installation, configuration, and control of the database environment.

  • Optimization: Advanced features like cost-based optimizers for query performance.

  • Security: Key capabilities for authentication, authorization, and accountability.

  • Integration: Compatibility with existing applications, ORMs, ETL tools, and more.

How does distributed SQL work?

Distributed SQL databases combine the consistency and structure of relational databases with the scalability, survivability, and performance first pioneered in NoSQL databases. They distribute data evenly across multiple nodes, ensuring efficient resource utilization and high availability. These databases deliver high levels of isolation in distributed environments, mediating contention and ensuring transactional consistency across multiple operations. Additionally, they provide inherent resilience, reducing recovery times to near zero and replicating data naturally without external configuration.

CockroachDB, for example, is a distributed SQL database built on a transactional and strongly-consistent key-value store. It scales horizontally and is resilient against many kinds of outages, including disk, machine, rack, and even datacenter failures, with minimal latency disruption and no manual intervention. CockroachDB supports strongly-consistent ACID transactions and provides a familiar SQL API for structuring, manipulating, and querying data. It guarantees serializable SQL transactions, the highest isolation level defined by the SQL standard, using the Raft consensus algorithm for writes and a custom time-based synchronization algorithm for reads.
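The fault-tolerance arithmetic behind Raft-style majority replication is straightforward: a write commits once a majority of replicas acknowledge it, so a cluster stays available as long as a majority survives. The sketch below illustrates that arithmetic only; it is not CockroachDB's actual implementation:

```python
def quorum(replicas: int) -> int:
    # A Raft-style write commits once a strict majority of replicas
    # acknowledge it.
    return replicas // 2 + 1

def tolerated_failures(replicas: int) -> int:
    # The cluster can keep committing writes as long as a quorum of
    # replicas remains reachable.
    return replicas - quorum(replicas)

for n in (3, 5, 7):
    print(f"{n} replicas: quorum={quorum(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

This is why replication factors are odd: going from 3 to 4 replicas raises the quorum from 2 to 3 without tolerating any additional failures.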

Examples of Distributed SQL Databases

Several databases meet the requirements of distributed SQL, including Google Spanner, Amazon Aurora, YugabyteDB, FaunaDB, and CockroachDB. These databases offer varying levels of support for the core capabilities mentioned above, making each more appropriate for different use cases.

  • Amazon Aurora: Amazon Aurora is a cloud-based relational database engine that combines the speed and reliability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases. Aurora is often described as a distributed database, but it does not scale for writes, and therefore is not truly distributed. Aurora is designed to provide high availability and durability, with automatic failover and replication across multiple Availability Zones. It supports MySQL and PostgreSQL and offers features like automatic scaling and serverless options.

  • CockroachDB: CockroachDB is a distributed SQL database designed for global, cloud-native applications. CockroachDB is PostgreSQL compatible, so that most applications built on PostgreSQL can be migrated without changing the application code. CockroachDB provides strong consistency, horizontal scalability, and high availability. CockroachDB supports ACID transactions and a familiar SQL interface, making it easy to use for developers. It survives various types of failures, including mechanical, data center, region, and even cloud failures, with minimal latency disruption.

  • FaunaDB: FaunaDB is a distributed, multi-model database that provides strong consistency, ACID transactions, and a flexible data model. It is designed for serverless applications and offers a globally distributed architecture that ensures low-latency access to data. FaunaDB supports GraphQL and FQL (Fauna Query Language) for querying data.

  • Google Spanner: Google Spanner is a globally distributed database service that provides strong consistency and horizontal scalability. It uses a combination of hardware (atomic clocks) and software to achieve global consistency and high availability. Spanner supports SQL queries and is designed to handle large-scale, mission-critical applications.

  • YugabyteDB: YugabyteDB is an open-source, distributed SQL database. It supports both SQL and NoSQL workloads, making it versatile for various use cases.

Evaluating Distributed SQL Databases

When evaluating a distributed SQL database, it is essential to consider several core requirements to ensure it meets the needs of modern applications. First, assess the database's scalability. A distributed SQL database should be able to scale horizontally, distributing data evenly across multiple nodes to handle increased loads without compromising performance. This capability is crucial for applications with variable or growing workloads, and it allows the database to support businesses at any stage of growth by scaling with demand. Additionally, the database should support strong consistency, ensuring that all nodes reflect the same data state, which is vital for maintaining data integrity across distributed environments.

Another critical factor is high availability and resilience. The database should be designed to handle failures gracefully, whether they occur at the disk, machine, rack, or even datacenter level, with minimal disruption to operations. This includes features like automatic failover, data replication, and quick recovery times.


RELATED

To learn more about inherently resilient systems, check out this webinar hosted by Cockroach Labs’ CTO and CPO, Peter Mattis and Technical Evangelist, Rob Reid: “The Always-On Dilemma: Disaster Recovery vs. Inherent Resilience.”


Geo-replication is also important, allowing data to be distributed across multiple geographic locations to reduce latency and comply with data sovereignty requirements. 

Finally, consider operational efficiency, security, and integration capabilities. The database should be easy to install, configure, and manage, offer robust security features for authentication and authorization, and integrate seamlessly with existing applications and tools.

A mature distributed SQL database should meet all these requirements, ensuring it is suitable for business-critical applications.

Distributed SQL Use-Cases

Distributed SQL databases are utilized across various verticals to address specific industry challenges and requirements. 

For example, check out this presentation by Netflix engineers, who provide CockroachDB as a service (DBaaS) to internal Netflix developers for a variety of use cases:

Here are some examples of vertical use cases for distributed SQL databases:

Distributed SQL for Financial Services

In the financial sector, distributed SQL databases are crucial for applications such as payment processing, trading platforms, and identity management. These applications require strong consistency, high availability, and the ability to handle high transaction volumes. For instance, payment systems must ensure accurate and timely transactions, while trading platforms need to process large volumes of trades with minimal latency. Identity management systems benefit from distributed SQL databases by providing secure and consistent access to user data across multiple regions.

Distributed SQL for Retail and eCommerce

Retail and eCommerce companies leverage distributed SQL databases for order and inventory management. These systems must handle large spikes in traffic, such as during Black Friday or Cyber Monday, without compromising performance (or overselling products). Distributed SQL databases provide horizontal scalability, ensuring that the system can manage increased demand. They also offer strong consistency, which is essential for maintaining accurate inventory levels and processing transactions reliably. Additionally, these databases support multi-region deployments, allowing retailers to provide low-latency access to customers worldwide.

Distributed SQL for Gaming

The gaming industry uses distributed SQL databases to manage user accounts, in-game transactions, and real-time data processing. Gaming platforms often experience high concurrency with thousands of players interacting simultaneously. Distributed SQL databases ensure that user data is consistent and available, even during peak usage times. They also support the scalability needed to accommodate growing user bases and the resilience required to maintain uptime during unexpected failures.

Distributed SQL for Logistics and Supply Chain

In logistics and supply chain management, distributed SQL databases are used for scheduling, workflow management, and tracking systems. These applications require precise coordination and data integrity to ensure timely deliveries and efficient operations. Distributed SQL databases provide the high availability and fault tolerance needed to prevent disruptions in logistics workflows. They also support geo-replication, which helps in maintaining data consistency across different geographic locations.

These examples illustrate how distributed SQL databases can be tailored to meet the specific needs of various industries, providing the scalability, consistency, and resilience required for modern applications.

Try a Distributed SQL Database

Distributed SQL is the future of database management in the cloud, offering the scalability, consistency, and resilience needed for modern applications. CockroachDB, among others, exemplifies these capabilities, making it a strong candidate for organizations looking to transition to cloud-native distributed SQL databases.

Get started with CockroachDB Cloud for free, today!

Distributed SQL FAQ

What is distributed SQL?

Distributed SQL refers to a class of relational databases that distribute data across multiple nodes to ensure high availability, fault tolerance, and scalability. These databases maintain SQL capabilities while leveraging a distributed architecture to handle large-scale, geographically dispersed data.

How do distributed SQL databases differ from traditional SQL databases?

Traditional SQL databases are typically monolithic, meaning they run on a single server. Distributed SQL databases, on the other hand, spread data across multiple servers or nodes. This distribution allows for better performance, scalability, and resilience against failures.

What are the benefits of using distributed SQL?

Distributed SQL databases offer several benefits, including scalability, high availability, and resilience against mechanical failures. Distributed SQL databases combine the consistency and structure of early relational databases with the scalability, survivability, and performance first pioneered in NoSQL databases.

How does CockroachDB implement distributed SQL?

CockroachDB is a distributed SQL database that uses a transactional and strongly-consistent key-value store. It scales horizontally, survives various types of failures, including mechanical, data center, region, and even cloud failures with minimal latency disruption and no manual intervention. CockroachDB supports strongly-consistent ACID transactions and provides a familiar SQL API for structuring, manipulating, and querying data.

How does CockroachDB ensure data consistency in a distributed environment?

CockroachDB guarantees serializable SQL transactions, the highest isolation level defined by the SQL standard. It uses the Raft consensus algorithm for writes and a custom time-based synchronization algorithm for reads to ensure consistency between replicas.

What are some common use cases for distributed SQL?

Distributed SQL is ideal for many use cases: serving a geographically distributed customer base (you can bring the data closer to the customer), handling high transaction volumes, and absorbing spiky workloads. Examples include financial institutions, retailers and eCommerce businesses, and gaming platforms.

How does CockroachDB compare to other distributed databases?

CockroachDB supports SQL syntax and scales easily without the manual complexity of sharding, rebalances and repairs itself automatically, and distributes transactions seamlessly across your cluster. It provides strong consistency and supports distributed transactions, unlike many other distributed databases.

More about Distributed SQL

CockroachDB

What is CockroachDB?

At the highest level, CockroachDB is software for storing data. More specifically, CockroachDB is a distributed SQL database technology that enables users to enjoy many of the benefits of the traditional relational database (such as the familiar SQL language, reliable schema, etc.) while also offering the key features required for a modern, cloud-native database, such as high availability, bulletproof resilience, elastic scale, and geographic scale.

Learn more about CockroachDB.

More about CockroachDB

Cost-Based Optimizer

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is a cost-based optimizer?

The Cost-Based Optimizer is a feature of CockroachDB that looks at all possible ways in which a query can be executed and assigns each a “cost” that indicates how efficiently the query can be run. Then the optimizer chooses the way that has the lowest cost, and is therefore most efficient. This feature only works with databases that speak SQL, so it’s an added benefit obtained from having a SQL layer.

More about Cost-Based Optimizer

Encoding

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is encoding?

Encoding is the process by which CockroachDB converts SQL statements into bytes (because the lower layers of CockroachDB deal with bytes).

More about Encoding

Gateway Node

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is a Gateway Node?

When a request comes into CockroachDB, the load balancer routes it to the node it thinks can best handle the request at that time. This node is called the Gateway Node. The Gateway Node identifies which nodes in the cluster are the leaseholders for the ranges involved in the request, forwards the work to those nodes, and then returns the response to the client.

More about Gateway Node

Key Value (KV) Layer

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is the Key Value (KV) Layer?

The key value layer is a figurative layer of CockroachDB. One helpful way to think about the architecture of CockroachDB is as a SQL system built on top of a key value store. In the upper layers, data follows the rules of a SQL system and is structured in table format. Deeper in the database, the table format no longer works, due to the distributed nature of the database. So instead of being stored in tables, the data is stored in a different way: in key-value pairs. It’s important to note that this combination of a SQL upper layer with a key-value store underneath is a relatively unusual setup, because translating structured table data into key-value pairs is a difficult task.

Key Value (KV) Pair

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is a Key Value (KV) Pair?

A key-value pair is a way of storing data. In CockroachDB, individual rows from the tables are mapped into key-value pairs. One column becomes the index, meaning that each piece of data in that column is the “key” in its own key-value pair. One or more other columns become the “value.”

For example, in a table called “Customer Data” the first column might be “Name,” the second “Hometown” and the third “Email Address.” You might decide that the “Name” should be the key because ultimately you want all your data to be sorted by name. Then you might decide that the “Hometown” and “Email Address” columns should be the values.

This information gets mapped, row-by-row, into key-value pairs, and the ultimate format of a single row might end up reading as a string: “Customer Data Table/John Smith/New York City/Johnsmith@gmail.com.” Then, once all the data in the table is translated into the monolithic sorted key-value map, these pieces of data are all sorted by the name value.
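The row-to-key-value mapping above can be sketched in a few lines. This is a toy illustration with invented names — CockroachDB's real key encoding is more involved:

```python
# Toy illustration of mapping table rows to key-value pairs.
# CockroachDB's actual encoding is more involved; names here are invented.

rows = [
    {"Name": "John Smith", "Hometown": "New York City", "Email": "Johnsmith@gmail.com"},
    {"Name": "Ada Park", "Hometown": "Chicago", "Email": "ada@example.com"},
]

def to_kv(table, row, key_column):
    """Use one column as the key; the remaining columns become the value."""
    key = f"{table}/{row[key_column]}"
    value = {c: v for c, v in row.items() if c != key_column}
    return key, value

# Build the sorted key-value map: sorting the keys sorts the data by Name.
kv_map = dict(sorted(to_kv("Customer Data Table", r, "Name") for r in rows))
print(kv_map["Customer Data Table/John Smith"])
# {'Hometown': 'New York City', 'Email': 'Johnsmith@gmail.com'}
```

Because the map is sorted by key, scanning it returns rows in "Name" order, which is exactly the behavior the example above describes.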

More about Key Value (KV) Layer

Load Balance

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is the Load Balancer?

The Load Balancer is a piece of software added to the front of CockroachDB that helps balance the requests coming into the database. All the nodes in CockroachDB can process requests (they’re all symmetrical), so it makes the most sense to spread the work around between nodes so that the requests can be completed more quickly. The load balancer accomplishes that.

More about Load Balance

Meta Range

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is a Meta Range?

A meta range is information stored in the cluster that tells the database where to find ranges – i.e., which nodes have certain ranges – so that the database can access the correct range. Sometimes also referred to as an index, but a different definition of an index than above.

More about Meta Range

ACID or A.C.I.D

What is ACID?

ACID stands for Atomicity, Consistency, Isolation, and Durability. These are a set of characteristics that are desirable for database transactions. Together, they guarantee that transactions, and all data stored in the database, remain valid even in the event of errors or power failures. These characteristics are extremely important for any OLTP database.

Specifically:

Atomicity: Each transaction must be treated as a single, indivisible unit (even if processing the transaction requires multiple steps). In other words, every step of the transaction must complete successfully or the entire transaction will fail and no change will be written to the database. This is desirable because without atomicity, a transaction that’s interrupted while processing may only write a portion of the changes it’s supposed to make, which could leave the database in an inconsistent state.

Consistency: No transaction can violate the integrity of the database – all transactions must leave the database in a valid state.

Isolation: Any concurrently-processed transactions (i.e. transactions happening at the same time) leave the database in the same state as if they were executed sequentially (i.e. one after another).

Durability: Once committed, the transaction remains committed even in the case of hardware failure, power outage, etc.
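Atomicity in particular is easy to see in action. The sketch below uses Python's built-in sqlite3 module as a stand-in for any ACID database: when one statement in a transaction fails, the whole transaction rolls back:

```python
import sqlite3

# Atomicity demo with SQLite (a stand-in for any ACID database):
# either every statement in the transaction commits, or none do.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # opens a transaction; rolls back automatically on error
        conn.execute("UPDATE accounts SET balance = balance - 80 WHERE name = 'alice'")
        # This statement violates the NOT NULL constraint, so the whole
        # transaction (including Alice's debit above) is rolled back.
        conn.execute("UPDATE accounts SET balance = NULL WHERE name = 'bob'")
except sqlite3.IntegrityError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50} — unchanged
```

Without atomicity, Alice's debit could persist even though Bob's credit failed, leaving the database inconsistent.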

More about ACID or A.C.I.D

Unstructured Data

What is Unstructured Data?

Unstructured Data is essentially the opposite of structured data: data that is not arranged with any kind of data model or schema and that can’t be easily adapted to a table format.

For example, datasets consisting of video or audio files are good examples of unstructured data. A traditional table structure would work for storing metadata about videos (such as title, description, Youtube link, etc.), but it is not a good fit for storing the videos themselves.

CockroachDB does support some types of unstructured data via its JSON support, but it was designed with primarily structured data in mind.

More about Unstructured Data

Transaction Layer

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is the transaction layer?

The Transaction Layer is the layer of CockroachDB that receives requests from the SQL Layer and coordinates concurrent operations.

More about Transaction Layer

TPC-C

What is TPC-C?

TPC-C, short for Transaction Processing Performance Council Benchmark C, is a benchmark used to measure how well a database holds up when it’s trying to run certain workloads. TPC-C is specifically an OLTP benchmark, and it’s widely recognized and standardized. TPC-C simulates an environment where a lot of users are making requests of a database, and then it measures how well the database holds up – i.e., how fast it can complete the transactions, what the cost of completing the transactions is, and so on.

More about TPC-C

Structured Data

What is Structured Data?

Structured Data (also called relational data) is data that lives best in a structured format, i.e., the kind of data you would enter into a table in a spreadsheet. The best way to organize it is in columns and rows.

Two examples of structured data are an inventory of products for an eCommerce site, or a list of customers and their information. CockroachDB was designed primarily to support this kind of data (although it does also include support for unstructured data).

More about Structured Data

Storage Layer

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is the Storage Layer?

The storage Layer is the layer of CockroachDB that writes data to the disk (the Physical Storage component), and reads data from the disk. It’s still part of the software, but it’s the piece that communicates with the hardware.

More about Storage Layer

SQL Layer

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is the SQL Layer?

The SQL Layer is the layer of CockroachDB that speaks to the application or client using the programming language SQL, adhering to the PostgreSQL Wire Protocol. After performing various tasks, the SQL layer sends the relevant requests to the Transaction Layer.

More about SQL Layer

SQL API

What is a SQL API?

A SQL API is an API for communicating with a database. CockroachDB offers a SQL API to users and apps: they send SQL commands to the API and receive the results of their queries in return. The commands must follow the PostgreSQL Wire Protocol, because CockroachDB adheres to it.

More about SQL API

Serializable Isolation

What is Serializable Isolation?

Serializable Isolation is the highest level of isolation possible under the guidelines provided by the “ANSI SQL Standard” (an official standard of best SQL practices). It means that transactions committed to a database appear as if they were executed one after another, even if they were processed in parallel. Most distributed databases only achieve a lower level of isolation, called “snapshot isolation,” but CockroachDB is able to achieve serializable isolation.

More about Serializable Isolation

Secondary Index

What is a Secondary Index?

A Secondary Index is a secondary column by which you can sort data. CockroachDB supports both primary and secondary indexes. This just adds another level of organization to data sorting. For example, in a table of data about employees, you might first sort alphabetically by “name,” (the primary index) and then within that, sort alphabetically by “hometown” (the secondary index).
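The employee example above can be mimicked with a composite sort key — a toy illustration of the resulting ordering, not how a database actually stores indexes:

```python
# Sorting rows by a primary index (name), then a secondary index (hometown).
employees = [
    {"name": "Lee", "hometown": "Boston"},
    {"name": "Avery", "hometown": "Denver"},
    {"name": "Avery", "hometown": "Austin"},
]

# Rows with the same name are further ordered by hometown.
ordered = sorted(employees, key=lambda r: (r["name"], r["hometown"]))
print([(r["name"], r["hometown"]) for r in ordered])
# [('Avery', 'Austin'), ('Avery', 'Denver'), ('Lee', 'Boston')]
```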

More about Secondary Index

RTO (Recovery Time Objective)

What is RTO (Recovery Time Objective)?

RTO (Recovery Time Objective) is the maximum acceptable amount of time a system can remain unavailable after a failure before it must return to service. The goal is an RTO of zero, and CockroachDB achieves this through its multi-active availability model.

More about RTO (Recovery Time Objective)

RocksDB

What is RocksDB?

RocksDB is a piece of software that was embedded in CockroachDB to store key-value data. CockroachDB used RocksDB to communicate with the disk in order to actually store data. CockroachDB now uses Pebble, a RocksDB-inspired and compatible key-value store that is specifically designed for distributed SQL databases. Many other tech companies still use RocksDB.

More about RocksDB

Replication Layer

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is the Replication Layer?

The Replication Layer is the layer of distributed database software that copies data between nodes and ensures consistency between these copies. In CockroachDB, this is accomplished by implementing the Raft consensus protocol.

More about Replication Layer

Region

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is a Region?

A region is a specific geographical location where you can host your resources; public cloud providers typically let you choose the region or regions you’d like to deploy to. Each region has one or more zones (most regions have three or more). For example, US-West is the West Coast of the US, and US-East is the East Coast.

More about Region

Range / Shard

What is a Range / Shard?

A range in CockroachDB (called a “shard” in other databases) is a chunk of data, stored as key-value pairs. The Distribution Layer breaks tables apart into these chunks so the data can be distributed across different nodes.

In CockroachDB, a range is 512 MiB or smaller. This default range size is a sweet spot – small enough to move quickly between nodes, but large enough to store a meaningfully contiguous set of data whose keys are more likely to be accessed together. Once a range grows beyond 512 MiB, it’s split into two smaller ranges.
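The split rule can be sketched like this. The helper names are invented, and real splitting also takes factors such as load and table boundaries into account:

```python
# Toy sketch of the split rule: when a range of sorted keys exceeds the
# maximum size, split it in half at the middle key.
MAX_RANGE_BYTES = 512 * 1024 * 1024  # 512 MiB default

def maybe_split(keys, sizes):
    """keys: sorted keys in the range; sizes: bytes stored under each key.
    Returns the original range, or two halves if it is over the limit."""
    if sum(sizes) <= MAX_RANGE_BYTES:
        return [keys]
    mid = len(keys) // 2
    return [keys[:mid], keys[mid:]]

# A range holding two 300 MiB chunks totals 600 MiB, so it splits in two.
ranges = maybe_split(["a", "m"], [300 * 1024**2, 300 * 1024**2])
print(len(ranges))  # 2
```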

More about Range / Shard

Range Replication

What is Range Replication?

Range replication is the duplication of ranges on multiple nodes so that if one node fails, the range’s data is still accessible via another node. In CockroachDB, each range is replicated three times by default. This replication allows CockroachDB to be highly available and remain online even when a node goes down, because the data lives in two other places, and two of the three replicas are enough to achieve quorum. When a failure happens, CockroachDB automatically redistributes data to the surviving nodes.

More about Range Replication

Raft Consensus Protocol

What is the Raft Consensus Protocol?

The Raft Consensus Protocol is the set of quorum guidelines that CockroachDB follows to ensure each range is consistent. It’s an algorithm that makes sure all copies of a range agree on changes to the data. A “leader” is elected for each range to coordinate changes for that range (in CockroachDB, the Raft leader is typically also the range’s leaseholder), and the other replicas are called “followers.” Changes are entered in the leader’s log, then the leader sends the changes to the followers, which enter them into their own logs. Once a majority of replicas have the change in their logs, the change is committed.
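A heavily simplified sketch of that flow, with no elections, terms, heartbeats, or log repair (all names here are invented):

```python
# Minimal sketch of Raft-style log replication for one range.
class Replica:
    def __init__(self, name, up=True):
        self.name, self.up = name, up
        self.log, self.committed = [], []

def replicate(leader, followers, entry):
    """Leader appends and forwards the entry; commit once a quorum has it."""
    leader.log.append(entry)
    acks = 1  # the leader's own copy counts toward quorum
    for f in followers:
        if f.up:
            f.log.append(entry)
            acks += 1
    quorum = (1 + len(followers)) // 2 + 1  # majority of all replicas
    if acks >= quorum:
        for r in [leader, *followers]:
            if r.up:
                r.committed.append(entry)
        return True
    return False

n1, n2, n3 = Replica("n1"), Replica("n2"), Replica("n3", up=False)
print(replicate(n1, [n2, n3], "SET x = 1"))  # True: 2 of 3 replicas is a quorum
```

Even with one replica down, the write commits, because two of three replicas still form a majority; the real protocol later repairs the lagging replica's log.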

More about Raft Consensus Protocol

Quorum

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is Quorum?

Quorum is the consensus required to commit changes in a distributed database such as CockroachDB. Different types of distributed databases may use different systems for quorum. CockroachDB uses the Raft Consensus Protocol, in which a majority of replicas – for example, two nodes in a three-node system – is required to achieve consensus (i.e. quorum).
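The arithmetic behind quorum is just "a majority of n replicas":

```python
# Quorum is a majority of the replicas: floor(n / 2) + 1.
def quorum(n):
    return n // 2 + 1

for n in (3, 5, 7):
    # replicas, quorum size, failures tolerated
    print(n, quorum(n), n - quorum(n))
# 3 2 1
# 5 3 2
# 7 4 3
```

This is why replication factors are odd: going from three replicas to four raises the quorum from two to three without tolerating any additional failures.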

More about Quorum

Public Cloud

What is a Public Cloud?

Public Cloud is the term for the most common kind of cloud computing vendor. Data lives in “public,” and hardware, storage, and network devices are shared with other organizations or cloud “tenants.” The services are accessed and managed via a web browser.

More about Public Cloud

Private Cloud

What is a Private Cloud?

In a private cloud, a company’s data is stored in a dedicated cloud environment, rather than being stored on shared machines. The services and infrastructure are always maintained on a private network and the hardware and software are dedicated solely to that single organization. Benefits include increased security and more customization. Vendors include Amazon Virtual Private Cloud (VPC), Dell Cloud Solutions, and Microsoft Private Cloud.

More about Private Cloud

Primary Index

What is a Primary Index?

An index is a column whose data is designated as the “key” in a key-value pair. Think of sorting data in a spreadsheet; the column by which you’re sorting the data is called the index. A primary index is the main column by which the data is sorted.

More about Primary Index

PostgreSQL Wire Protocol

What is the PostgreSQL Wire Protocol?

The PostgreSQL Wire Protocol is the network protocol that the PostgreSQL database and its clients use to communicate. Many people have used PostgreSQL before and are familiar with its SQL dialect, and many apps, drivers, and ORMs already speak its wire protocol. The fact that CockroachDB adheres to the PostgreSQL Wire Protocol (and supports much of PostgreSQL’s SQL dialect) makes it easy for customers to plug into the database and use it.

More about PostgreSQL Wire Protocol

Physical Storage

What is Physical Storage?

Physical Storage is the hardware on which the data is stored. CockroachDB’s Storage Layer (software) communicates with this hardware to physically write data onto the disk and read data from the disk.

More about Physical Storage

ORM (Object-Relational Mapper)

What is an ORM (Object-Relational Mapper)?

An ORM is a software intermediary between the application and the database. It allows developers to speak to a SQL database using a language other than SQL. Some developers may not be experienced with SQL, or simply prefer to write in languages like Python, C++, Javascript etc. When writing the parts of their application that communicate with a database, they can use an ORM as a go-between, to translate their code into SQL and send it to the database to make requests.
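A toy sketch of the translation an ORM performs — a plain Python object becomes a SQL statement. Real ORMs (SQLAlchemy, Django ORM, etc.) do far more; the helper below is invented for illustration:

```python
# Toy sketch of what an ORM does under the hood: translate a plain
# Python object into a parameterized SQL statement.
from dataclasses import astuple, dataclass, fields

@dataclass
class Customer:
    name: str
    hometown: str

def to_insert_sql(obj):
    """Build an INSERT statement and its parameters from a dataclass."""
    table = type(obj).__name__.lower()
    cols = ", ".join(f.name for f in fields(obj))
    placeholders = ", ".join("%s" for _ in fields(obj))
    return f"INSERT INTO {table} ({cols}) VALUES ({placeholders})", astuple(obj)

sql, params = to_insert_sql(Customer("John Smith", "New York City"))
print(sql)     # INSERT INTO customer (name, hometown) VALUES (%s, %s)
print(params)  # ('John Smith', 'New York City')
```

The developer works only with the `Customer` object; the SQL generation and parameter binding happen behind the scenes, which is the convenience an ORM provides.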

More about ORM (Object-Relational Mapper)

On-Prem (On-Premises)

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is On-Prem (On-Premises)?

In the context of databases, on-prem describes a database deployment in which a company’s data is stored on a machine that it owns, rather than stored with a public cloud provider such as GCP or AWS. Companies often choose to store proprietary or high-security data on-prem because it gives them more control over its security.

More about On-Prem (On-Premises)

OLTP (OnLine Transaction Processing)

What is OLTP (OnLine Transaction Processing)?

OLTP (Online Transaction Processing) describes the kind of data processing that deals with heavy transactional workloads. In OLTP workloads, in other words, there are many relatively simple, transactional operations (many reads and writes) constantly coming into the database. The data typically relates to standard business tasks like keeping track of inventory, hotel reservations, or online banking functions.

The other main kind of data processing is OLAP (Online Analytics Processing). Compared to OLTP workloads, the workloads run on OLAP databases are usually much more complicated but much less frequent.

Often, database technologies are designed with one or the other in mind. OLTP databases such as CockroachDB are focused on transactional use cases such as payment processing, logistics, metadata management – basically any use case that involves frequent reads and writes. OLAP databases, in contrast, are typically used for data analytics.

Often a company may use both, with the OLTP database handling the transactions coming from the application, and with relevant data periodically offloaded to an OLAP database for analysis. This approach ensures that even very complex analytics work will not interfere with the performance of the OLTP database.

More about OLTP (OnLine Transaction Processing)

Node

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is a Node?

A node is a single instance of a distributed database such as CockroachDB, or one individual machine of many that are running the same distributed database. Many nodes join together to create the full cluster.

More about Node

MVCC (Multiversion Concurrency Control)

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is MVCC (Multiversion Concurrency Control)?

MVCC (Multiversion Concurrency Control) is a protocol that CockroachDB follows to ensure isolation of transactions when concurrent transactions are happening. Without MVCC, if a database is being used in multiple ways at the same time, then someone might see half-written or inconsistent data. MVCC keeps multiple copies of data, so each user sees a snapshot of the database at a particular instant in time, and they won’t see changes until the transaction has been committed.
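A minimal sketch of the idea: each write keeps the old version, and a reader sees the newest version at or before its snapshot timestamp. The class below is invented for illustration, not CockroachDB's implementation:

```python
# Toy multiversion store: writes append new versions instead of
# overwriting, and reads observe a consistent snapshot in time.
class MVCCStore:
    def __init__(self):
        self.versions = {}  # key -> list of (timestamp, value), in write order

    def write(self, key, value, ts):
        self.versions.setdefault(key, []).append((ts, value))

    def read(self, key, snapshot_ts):
        """Return the newest value written at or before snapshot_ts."""
        best = None
        for ts, value in self.versions.get(key, []):
            if ts <= snapshot_ts:
                best = value
        return best

store = MVCCStore()
store.write("balance", 100, ts=1)
store.write("balance", 40, ts=5)   # a later transaction's update
print(store.read("balance", snapshot_ts=3))  # 100: snapshot predates the update
print(store.read("balance", snapshot_ts=9))  # 40
```

A transaction reading at timestamp 3 never sees the half-finished state of the later write, which is the isolation guarantee MVCC provides.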

More about MVCC (Multiversion Concurrency Control)

Multi-Cloud

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is Multi-Cloud?

Multi-Cloud refers to a strategy where multiple cloud providers are being used, rather than just one. Typically, these are multiple public cloud providers (e.g. AWS and GCP). CockroachDB is one of the few distributed SQL databases that supports multi-cloud deployments.

More about Multi-Cloud

Multi-Active Availability

What is Multi-Active Availability?

Multi-Active Availability is CockroachDB’s high availability model. All replicas in the cluster can handle traffic, including both reads and writes, and Raft consensus is used to ensure the data remains consistent. This model also keeps RTO at effectively zero if a node goes down.

More about Multi-Active Availability

Monolithic Sorted Key Value Map

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is a Monolithic Sorted Key Value Map?

When all of the data from all of the tables is translated into key-value pairs in CockroachDB, it’s called a monolithic sorted key-value map. This just means a giant list of key-value pairs that correspond to rows in tables, organized in a way that allows you to easily insert and find data.

More about Monolithic Sorted Key Value Map

Mainframe

What is a mainframe?

A mainframe is a gigantic machine (typically made by IBM) that has large storage capacity and high computing power. Mainframes are almost exclusively on-prem and privately owned by the businesses that use them. However, IBM has recently released a version of its mainframe for use in private cloud settings.

More about Mainframe

Machine / Server

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is a machine or server?

A machine is just a computer. It can live in a data center or on-prem, and machines vary in computing power and storage capacity. Typically, the machines that live in data centers for cloud use are much smaller and less powerful than mainframes.

More about Machine / Server

JSON

Note: Here, we are defining JSON in the context of a distributed database.

What is JSON?

JSON, or JavaScript Object Notation, is a way of formatting “semi-structured” data, and CockroachDB supports it by default. In a typical CockroachDB deployment, JSON makes up a small portion of the stored data.

To understand what JSON is, follow this example: Imagine a table that stores information about blog posts. Much of the data is structured; every blog post needs data on title, author, and number of words (and these become the columns in the table). But there might be additional data that’s not applicable to every post, such as if a user comments on a post or likes a post. Instead of having to make a separate column for each of these data pieces (which would be inefficient since many of the cells would be empty), you can make a single column that supports JSON – and then you format the unique data pieces in JSON and put any item at all into that column, without having a predefined label attached to it.
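In application code, the pattern looks something like this, using Python's standard json module (column and field names are invented for the example):

```python
import json

# Structured columns plus one JSON column for post-specific extras,
# mirroring the blog-post example above.
post = {
    "title": "Hello, World",
    "author": "A. Blogger",
    "word_count": 312,
    # Semi-structured extras that don't warrant their own columns:
    "extras": json.dumps({"likes": 42, "comments": ["Nice post!", "+1"]}),
}

# Reading the JSON column back gives structured access to the extras.
extras = json.loads(post["extras"])
print(extras["likes"])  # 42
```

Each post can carry a different set of extras without any schema change, which is exactly what the single JSON column buys you.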

More about JSON

Isolation

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is Isolation?

Isolation describes a desirable database characteristic in which concurrently-processed transactions (i.e. transactions happening at the same time) leave the database in the same state as if they were executed sequentially (i.e. one after another).

It is one of the four ACID properties that are desirable for databases dealing with transactional workloads.

More about Isolation

Hybrid Logical Clock (HLC) Timestamps

What are Hybrid Logical Clock (HLC) Timestamps?

Accurate time is very important in distributed systems, because events often occur at different nodes at the same time, and these events need to be ordered accurately. Google Spanner uses atomic clocks to accomplish this, but because CockroachDB is open source and can run on any public or private cloud, or on-prem, it can’t rely on atomic clocks. Instead, CockroachDB uses HLCs, a clock method that has both logical and physical components. CockroachDB applies HLC timestamps to all transactions so the system knows when they occurred and can order them correctly.
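A minimal sketch of the idea, covering local events only — a full HLC also merges timestamps received from other nodes, and real implementations carefully bound clock skew:

```python
# Minimal hybrid logical clock sketch: a physical component (wall time)
# plus a logical counter that orders events sharing the same wall time.
class HLC:
    def __init__(self):
        self.wall, self.logical = 0, 0

    def now(self, physical_time):
        if physical_time > self.wall:
            # Wall clock moved forward: adopt it and reset the counter.
            self.wall, self.logical = physical_time, 0
        else:
            # Same (or older) wall time: bump the logical counter so
            # successive events still get strictly increasing timestamps.
            self.logical += 1
        return (self.wall, self.logical)

clock = HLC()
print(clock.now(100))  # (100, 0)
print(clock.now(100))  # (100, 1) — same wall time, ordered by logical part
print(clock.now(105))  # (105, 0)
```

Comparing the (wall, logical) pairs lexicographically yields a total order over events, even when the physical clock alone cannot distinguish them.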

More about Hybrid Logical Clock (HLC) Timestamps

Hybrid Cloud

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is Hybrid Cloud?

Hybrid Cloud refers to a strategy where cloud storage and on-prem storage are both being used. A typical example of this would be a single company storing sensitive data on-prem and less sensitive data in the cloud.

More about Hybrid Cloud

High availability

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is high availability?

High Availability is a desirable characteristic for databases; it describes the ability of a database to survive (and thus remain available) even when parts of the system fail. For example, a CockroachDB database with five nodes could survive and remain available even if a node failed (or several, depending on the replication factor).

Different databases use different models to achieve high availability. The two most common high availability models are Active-Passive and Active-Active. Meanwhile, CockroachDB uses a Multi-Active Availability model.

More about High availability

Gossip Protocol

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is a gossip protocol?

In the context of a distributed database, the gossip protocol is the form of communication used between nodes, to allow each node to locate data across the cluster.
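
A toy simulation of the idea in Python: one node starts out knowing where a key range lives, and in each round every node exchanges what it knows with a random peer, so the information spreads across the cluster without any central coordinator (the node count and key name are invented for the sketch):

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

NUM_NODES = 8
# Each node's "knowledge": which node holds which key range.
knowledge = [dict() for _ in range(NUM_NODES)]
knowledge[0]["range-42"] = "node-0"  # only node 0 knows this at first

rounds = 0
while not all("range-42" in k for k in knowledge):
    rounds += 1
    for i in range(NUM_NODES):
        peer = random.randrange(NUM_NODES)
        # A bidirectional exchange, like a real gossip "conversation":
        knowledge[i].update(knowledge[peer])
        knowledge[peer].update(knowledge[i])

print(f"all {NUM_NODES} nodes learned the location in {rounds} round(s)")
```

Because each exchange can roughly double the number of informed nodes, the whole cluster typically learns new information in a number of rounds that grows only logarithmically with cluster size.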

More about Gossip Protocol

Durability

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is durability?

In the context of a database, durability is a desirable property that describes a system where, once data has been committed to the database, it will remain even in the event of machine failures, power outages, etc.

It is one of the four ACID properties that are desirable for databases dealing with transactional workloads.

More about Durability

Driver

Note: This term has other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is a driver?

In the context of databases, a driver is a piece of code that you add to your app to help it speak the language necessary for communicating with the database, such as SQL. For example, if you’re building a Python app, you might use the psycopg2 driver to enable it to communicate with CockroachDB using SQL.
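
The pattern can be sketched with Python's built-in sqlite3 driver (used here only because it needs no running server; with CockroachDB you would swap in a PostgreSQL-compatible driver such as psycopg2 and point it at your cluster):

```python
import sqlite3

# The driver exposes a Python API (connect / cursor / execute) and handles
# translating these calls into the protocol the database understands.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO users (name) VALUES (?)", ("Ada",))
cur.execute("SELECT name FROM users WHERE id = 1")
name = cur.fetchone()[0]
print(name)  # Ada
conn.commit()
conn.close()
```

The app never constructs wire-protocol messages itself; it just calls the driver's API with SQL strings.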

More about Driver

Distribution Layer

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is the distribution layer?

In a distributed database such as CockroachDB, the distribution layer takes data from the SQL layer and breaks it down into chunks called ranges to be stored in a distributed way (i.e., it is stored in multiple locations instead of just a single location).
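
A toy sketch of the idea in plain Python (the range size, key format, and node names are all made up; real systems split ranges by data size, not row count):

```python
# Take a sorted key space, break it into contiguous chunks ("ranges"),
# and assign each range to a node in the cluster.
RANGE_SIZE = 16
NODES = ["node-1", "node-2", "node-3"]

keys = [f"user/{i:04d}" for i in range(64)]  # already sorted

ranges = [keys[i:i + RANGE_SIZE] for i in range(0, len(keys), RANGE_SIZE)]
placement = {
    (rng[0], rng[-1]): NODES[idx % len(NODES)]  # round-robin placement
    for idx, rng in enumerate(ranges)
}

for (start, end), node in placement.items():
    print(f"[{start} .. {end}] -> {node}")
```

Because each range covers a contiguous span of the sorted key space, a lookup only needs to find which range a key falls into, then ask the node holding that range.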

More about Distribution Layer

Data Warehouse / Datacenter

What is a data warehouse or datacenter?

A data warehouse or datacenter is a giant warehouse where thousands of machines live, arranged in racks. Cloud providers own many of these warehouses, and when you run or store something on the cloud, it lives in one of these warehouses.

More about Data Warehouse / Datacenter

CPU

What is a CPU?

A CPU (Central Processing Unit) is a chip that sits on a computer's motherboard and carries out instructions from the software. Usually CPUs are multi-core, meaning that there is more than one processing core on a single chip. In the context of databases, the power of the CPUs on the machines running your database will impact the performance of the database.

More about CPU

Core

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is a core?

A core is a component of a CPU that carries out instructions. A multi-core CPU has multiple CPUs together on a single chip, increasing the chip’s overall computing power.

More about Core

Consistency

What is database consistency?

Most database systems have rules for what kinds of data can and cannot be stored (among other rules). Consistency is the term used to describe a database in which those rules are always enforced. A database is said to have consistency when no transaction can violate the integrity of the database – all transactions must leave the database in a valid state.

It is one of the four ACID properties that are desirable for databases dealing with transactional workloads.
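
A small illustration using Python's built-in sqlite3 module: a CHECK constraint is one way a database enforces such a rule, rejecting any change that would leave the data invalid (the table and rule here are invented for the example):

```python
import sqlite3

# Rule: an account balance can never go negative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts "
    "(id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts (balance) VALUES (100)")

try:
    # This update would violate the rule, so the database rejects it.
    conn.execute("UPDATE accounts SET balance = -50 WHERE id = 1")
except sqlite3.IntegrityError as err:
    print("rejected:", err)

balance = conn.execute(
    "SELECT balance FROM accounts WHERE id = 1"
).fetchone()[0]
print(balance)  # still 100: the invalid change never took effect
conn.close()
```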

More about Consistency

Cluster

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is a cluster?

A cluster is the full collection of nodes associated with a distributed database. For example, a company might spin up a CockroachDB cluster with five nodes for its database.

More about Cluster

Cloud

What is the cloud (in the context of databases)?

In the context of databases, “cloud” refers to storing your data on machines that belong to a third-party cloud provider. This is more cost-efficient than on-prem storage because it eliminates the need for a company to maintain its own machines, and it typically increases the availability and scalability of the company’s services. This is because with cloud storage, data is stored across many smaller computers within and across data warehouses, whereas on-prem data is usually stored on a single massive computer or mainframe.

More about Cloud

Byte

What is a byte?

In the context of data storage, a byte is the most basic unit for encoding a character on a computer. A byte is a group of eight 0s and 1s whose specific order represents a character. For example, the letter “A,” when translated into a byte, reads as: “01000001”
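
This can be checked in a couple of lines of Python:

```python
# In ASCII/UTF-8, the letter "A" is the number 65,
# i.e. the bit pattern 01000001.
encoded = "A".encode("utf-8")     # b'A', a single byte
bits = format(encoded[0], "08b")  # that byte written as eight 0s and 1s
print(bits)  # 01000001
```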

More about Byte

AZ (Availability Zone, or Zone)

What is an AZ?

An AZ, availability zone, or just “zone”, typically refers to a single data warehouse within a region. Multiple warehouses/zones make up a single region. Sometimes, an availability zone isn’t at the warehouse level, but instead at the rack level - i.e., a single rack within a warehouse.

More about AZ (Availability Zone, or Zone)

Atomicity

What is atomicity?

Atomicity is a desirable characteristic for database transactions. The name comes from the idea of an indivisible “atomic unit”, and it describes a method for processing transactions that treats each transaction as a single, indivisible unit (even if processing the transaction requires multiple steps). In other words, every step of the transaction must complete successfully or the entire transaction will fail and no change will be written to the database.

This is desirable because without atomicity, a transaction that’s interrupted while processing may only write a portion of the changes it’s supposed to make, which could leave the database in an inconsistent state.

It is one of the four ACID properties that are desirable for databases dealing with transactional workloads.
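
A minimal sketch using Python's built-in sqlite3 module: a two-step transfer is interrupted midway and rolled back, so no partial change survives (the table, names, and amounts are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    # A transfer takes two steps; atomicity means both happen or neither does.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    raise RuntimeError("crash between the two steps")  # simulated failure
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()
except RuntimeError:
    conn.rollback()  # undo the partial write

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50} -- no half-finished transfer
conn.close()
```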

More about Atomicity

Application / Client

What is an application or client?

An application or client is a software program. In the context of a database, the client is what sends data to a database, and/or gets data from it. An example of an application is a phone app – a piece of software that runs on users’ phones and may request and send data to and from a database as the user takes different actions in the app.

More about Application / Client

API

What is an API?

API stands for Application Programming Interface. Simply put, an API is a way for developers, apps, and clients to communicate with an application. It’s an interface that allows developers to send requests to an application and receive responses from it.

More about API

Active-Passive Availability

What is Active-Passive Availability?

Active-passive availability is one way to configure a distributed database to offer high availability.

In an active-passive availability setup, all traffic is routed to a single active replica, and changes are copied to a backup passive replica. If the active replica fails, the passive one takes over. However, the active replica might fail before the passive one has copied over all of the data, leading to data loss (a nonzero RPO). And because the passive replica can take a while to come online, there is also some downtime (a nonzero RTO).

Other configurations include active-active availability and multi-active availability.

More about Active-Passive Availability

Active-Active Availability

What is Active-Active Availability?

Active-active availability is one way to configure a distributed database to offer high availability.

In an active-active availability setup, multiple replicas run identical services and traffic is routed to all of them. If any replica fails, the others handle the extra traffic. This model runs into consistency issues in database contexts because multiple replicas might be trying to edit the same data at the same time.

Other configurations include active-passive availability and multi-active availability.

More about Active-Active Availability

Edge Computing

What is edge computing?

Edge computing is a somewhat overloaded term to describe locality-sensitive distributed computing architecture. Wikipedia defines it as “a distributed computing paradigm that brings computation and data storage closer to the sources of data.” We can think of the “sources of data” being users or even sensors making requests to our system.

The main aim of edge computing or multi-access edge computing is to reduce location-related latency in applications to enable high-performance, real-time use cases in widespread geographies. Edge computing systems are faster when computation and data are closer to the devices.

Developers today are beginning to realize that to get computation and data closer to devices, there are better choices than centralized databases, or even distributed databases limited to a single region.

Operating a database in a single region leads to high latency for users or edge devices in areas outside of the database region. Even if you distribute your application across multiple regions, users or devices outside the database region may experience unacceptable response times. And unexpectedly high latency can translate into dissatisfied users.

Edge computing uses data that has life cycles or life spans. In the same application, you can have ephemeral data, in-memory data, short-term persistent data, and long-term persistent (LTP) data. Typically, long-term persistent data is stored in databases. Unfortunately, when LTP data is far from the edge, access to it is slow, and this slow-data effect tends to give databases a lousy reputation.

Choosing the right database solution can improve your edge computing architecture and user experience.

More about Edge Computing

Bit

What is a bit?

In the context of data storage, a bit is a single 0 or 1 (also called a binary digit). There are 8 bits in a byte.

More about Bit