A Database-as-a-Service (DBaaS) is a cloud-based service model that provides users with access to a database without the need for setting up physical hardware, installing software, or managing the database infrastructure. This service allows organizations to focus on their core business activities while the DBaaS provider handles the database management tasks such as backups, updates, and scaling.
There are many potential benefits for utilizing a DBaaS in your organization. DBaaS simplifies the architecture and reduces the operational and financial costs associated with maintaining a dedicated database infrastructure. This service is particularly valuable for organizations looking to modernize their database solutions and reduce the operational headaches of managing existing databases. Including:
Operational Simplicity: The service could help offer zero downtime for planned routine maintenance, upgrades, patching, cluster settings, and scaling. By allowing a company that specializes in databases to handle the nitty gritty of deploying your database solutions, your team can trust the experts to ensure data consistency, regulatory compliance, no data loss, and continued innovation in the database space. In turn, your team can focus on creating more business impact rather than operating a database.
Scalability: Particularly at larger organizations where you have multiple teams that need different kinds of related data, and that support different services, platforms, applications, or products for your clients, having a dedicated DBaaS can help standardize operations across the organization, and allow a smoother implementation of new offerings.
Enterprise Integration: You likely can also integrate your database solution or solutions with enterprise security features such as single sign-on and role-based access controls. The DBaaS could also support observability tools for exporting metrics and logs, or help manage compliance, audit, and data loss protection.
The primary difference between database-as-a-service (DBaaS) and a cloud database lies in the level of management and operational responsibilities handled by the service provider versus the user. For example, a DBaaS provider might manage the database infrastructure, including setup, maintenance, backups, updates, and scaling. This allows organizations to focus on their core business activities without worrying about the underlying database management tasks. On the other hand a cloud database might not offer these “concierge” services. While a cloud database provider might still host the database on their cloud infrastructure, the user is typically responsible for managing the database, including the tasks listed above, increasing the complexity and resource requirements for managing the database.
In summary, DBaaS provides a more managed and simplified approach to database management, reducing the operational burden on the user.
It is a big decision to entrust a provider with managing your database needs. Here are a few key considerations if you choose to find a third-party database provider:
Performance and latency are critical considerations when using Database-as-a-Service (DBaaS). Since DBaaS relies on cloud infrastructure, the physical distance between the data center and the end-users can impact the speed at which data is accessed and processed. High latency can lead to slower application performance, which can be particularly problematic for real-time applications and services that require quick data retrieval and updates. Providers often offer various performance tiers and configurations, allowing businesses to select the appropriate level of resources to meet their specific needs.
Compliance and regulatory issues are significant challenges when adopting DBaaS, especially for industries with stringent data protection and privacy requirements, such as healthcare, finance, and government sectors. Organizations must ensure that their DBaaS provider complies with relevant regulations, such as GDPR, HIPAA, and PCI-DSS. This includes ensuring that data is stored, processed, and transmitted securely, with appropriate encryption and access controls in place. Additionally, organizations should verify that the provider offers features that support compliance, such as audit logs, data residency options, and regular security assessments. By carefully evaluating the compliance capabilities of a DBaaS provider, businesses can mitigate the risk of regulatory breaches and ensure that their data management practices align with legal requirements.
Data migration to a DBaaS platform can be a complex and resource-intensive process. Migrating existing databases may involve transferring large volumes of data, reconfiguring applications, and ensuring data integrity throughout the process. This can lead to potential downtime and disruptions to business operations if not managed effectively. To facilitate a smooth migration, organizations should leverage the tools and services offered by DBaaS providers, such as automated migration tools, detailed documentation, and professional support. It is also crucial to plan the migration meticulously, conduct thorough testing, and implement a phased approach to minimize risks. By taking these steps, businesses can ensure a seamless transition to a DBaaS environment with minimal impact on their operations.
Check out the video below to learn about how Rightmove, the UK’s #1 property portal – and a publicly traded, multi-billion dollar company that has been in business since 2000 – migrated from Oracle to CockroachDB to support their 24 years worth of industry data:
Downtime is a critical concern for any database system, and DBaaS is no exception. Relying on a third-party provider means that any service outages or disruptions on their end can directly impact the availability of your database. This can lead to significant business interruptions, loss of revenue, and damage to reputation. To mitigate the risk of downtime, it is essential to choose a DBaaS provider with a strong track record of reliability and robust Service Level Agreements (SLAs) that guarantee high availability. Additionally, implementing multi-region deployments and failover mechanisms can enhance resilience and ensure that your database remains accessible even in the event of localized outages.
Data sovereignty refers to the concept that data is subject to the laws and regulations of the country or jurisdiction where it is stored. This is a critical consideration for organizations using DBaaS, as storing data in the cloud often involves data centers located in different jurisdictions. Compliance with data sovereignty requirements is essential to avoid legal and regulatory complications. Organizations must ensure that their DBaaS provider offers data residency options that allow them to store data in specific geographic locations that comply with local laws. Additionally, understanding the data protection regulations of the countries where data is stored and processed is crucial for maintaining compliance and protecting sensitive information. By addressing data sovereignty concerns, businesses can confidently leverage DBaaS while adhering to legal and regulatory obligations.
CockroachDB as a DBaaS simplifies the architecture and reduces the operational and financial costs associated with maintaining a dedicated database infrastructure. This service is particularly valuable for organizations looking to modernize their database solutions and reduce the operational headaches of managing existing databases. CockroachDB is PostgreSQL compatible so CockroachDB feels like PostgreSQL and scales like NoSQL, making it easier to adopt for your whole organization.
The above video is a talk by Netflix engineers on how they provide CockroachDB-as-a-Service to their customers – Netflix developers – who have a variety of use cases, from a reliable device management platform and supporting a ML orchestration platform, to Netflix’s new gaming platform. By deploying CockroachDB-as-a-Service, Netflix engineers can focus on expanding the scope of their business and rely on CockroachDB to ensure that all their applications are always running smoothly with consistent data.
Database-as-a-Service (DBaaS) is a cloud computing service that provides users with access to a database without the need for physical hardware, software installation, or database management. It allows businesses to focus on using the database rather than maintaining it, as the service provider handles all backend tasks such as updates, backups, and scaling.
DBaaS works by hosting databases on cloud infrastructure. Users can access and manage their databases through a web interface or API. The service provider takes care of the underlying infrastructure, including servers, storage, and network resources, ensuring high availability, security, and performance.
Benefits of using DBaaS include cost efficiency, scalability, accessibility, and performance. In addition, a major benefit is that the DBaaS provider can handle updates, backups, and security, freeing up your resources to focus on building and maintaining business-critical applications.
DBaaS offerings include a variety of database types such as relational databases like CockroachDB or NoSQL databases like MongoDB.
Yes, DBaaS providers implement robust security measures including encryption, access controls, and regular security audits. However, it is essential for users to follow best practices such as strong password policies and regular monitoring to ensure data security.
Consider the following factors when choosing a DBaaS provider:
Database Compatibility: Ensure the provider supports the database type you need.
Performance and Scalability: Check the provider's performance benchmarks and scalability options.
Security Features: Look for comprehensive security measures and compliance certifications.
Cost: Compare pricing models and ensure they fit your budget.
Support and Reliability: Evaluate the provider's customer support and service level agreements (SLAs).
Yes, most DBaaS providers offer tools and services to help migrate existing databases to their platform. The migration process typically involves exporting your current database, transferring the data to the cloud, and configuring the new environment. For example, you can check out CockroachDB’s MOLT (Migrate Off Legacy Technology) tools.
Traditional database hosting requires users to manage the hardware, software, and maintenance of the database environment. In contrast, DBaaS offloads these responsibilities to the service provider, allowing users to focus on database usage and application development.
Yes, DBaaS is designed to handle applications of all sizes, from small projects to large-scale enterprise applications. Providers offer various scaling options to accommodate growing data and user demands.
pgvector
is an open-source extension for PostgreSQL that provides efficient storage, retrieval, and similarity search of vector data. It is particularly useful for applications involving semantic search, recommendation systems, and natural language processing, which are critical for AI-driven applications, including Large Language Models (LLMs).
CockroachDB introduced vector search capabilities in the 24.2 release. This implementation uses the same interface as pgvector
and aims to be compatible with pgvector
’s API. This compatibility facilitates seamless integration with tools like Langchain and Hugging Face, making it easier to incorporate data into AI models.
The integration of vector search in CockroachDB combines the strengths of a vector database and an operational database into a single, horizontally scalable solution. This approach simplifies the architecture and eliminates the operational and financial costs associated with maintaining a dedicated vector database. Additionally, CockroachDB’s distributed SQL engine allows for complex SQL operations on vector data, enhancing the performance and efficiency of AI and machine learning workloads.
pgvector
adds support for vector data types and vector similarity search to the PostgreSQL database. It is particularly useful for applications involving machine learning, natural language processing, and other domains where vector representations of data (such as embeddings) are commonly used.
Key features of pgvector
include:
Vector Data Type: It introduces a new data type for storing vectors (arrays of floating-point numbers) directly in PostgreSQL tables.
Similarity Search: It provides functions for performing similarity searches on vector data, such as finding the nearest neighbors based on various distance metrics (e.g., Euclidean distance, cosine similarity).
Indexing: It supports indexing mechanisms to speed up similarity searches, making it efficient to query large datasets.
Vector search refers to the process of finding vectors in a dataset that are similar to a given query vector. Vector search is a powerful tool for handling high-dimensional data and is widely used in modern AI and data-driven applications to provide relevant and accurate results based on the inherent similarities in the data.
Vector search is essential in applications where data is represented as high-dimensional vectors, such as:
Machine Learning: Finding similar data points in embedding spaces, such as word embeddings, image embeddings, or user embeddings.
Natural Language Processing (NLP): Searching for semantically similar text documents, sentences, or words based on their vector representations.
Recommendation Systems: Identifying items (e.g., products, movies) that are similar to a user's preferences, represented as vectors.
Computer Vision: Finding images that are visually similar to a query image based on their feature vectors.
Check out this video showcasing how to use CockroachDB to build a product recommendation engine for ecommerce with AI:
Vector search typically involves the following steps:
Vector Representation: Data is converted into vector representations using techniques like word embeddings (e.g., Word2Vec, GloVe), sentence embeddings (e.g., BERT), or image embeddings (e.g., convolutional neural networks).
Similarity Metric: A similarity metric (e.g., Euclidean distance, cosine similarity) is chosen to measure the closeness between vectors.
Indexing: Efficient indexing structures (e.g., KD-trees, ball trees, HNSW) are used to speed up the search process, especially for large datasets.
Querying: The query vector is compared against the indexed vectors to find the most similar ones based on the chosen similarity metric.
pgvector
commonly used?pgvector
has become one of the most popular extensions for working with vectors in part because it allows easy integration with PostgreSQL. Users can thus leverage the robustness, scalability, and familiarity of PostgreSQL, making it easier for developers to integrate vector search capabilities into their existing databases without needing a separate specialized system. PostgreSQL has a large and active community, and pgvector benefits from this ecosystem. Users can leverage existing PostgreSQL tools, extensions, and best practices while working with vector data.
By extending PostgreSQL, pgvector
allows users to work with vector data using standard SQL queries. This reduces the learning curve and simplifies the development process. pgvector
also supports efficient indexing and querying, and can handle large-scale vector data and perform similarity searches quickly, which is crucial for real-time applications. pgvector
also allows users to choose the best approach for their specific use case with various distance metrics and indexing methods.
pgvector
-compatible vector search in CockroachDBCockroachDB is deeply invested in providing robust database offerings so that organizations can run their business-critical applications at scale. To that end, the Cockroach Labs team continues to focus on expanding generative AI capabilities, leading to the release of vector search features in v24.2.
Our vector search features make it easier for users to deploy AI-driven applications with CockroachDB. CockroachDB's vector search functionality is compatible with the pgvector
extension for PostgreSQL, allowing users to store and manipulate vectors within the database.
Although CockroachDB does not currently support vector indexing, it allows for the storage of billions of vectors and supports various vector comparison operators, such as Euclidean distance, inner product, and cosine distance, to facilitate similarity searches. This capability enables developers to build fault-tolerant AI applications that leverage vectors stored in CockroachDB, benefiting from its horizontal data distribution and multi-region abstractions.
pgvector
is an open-source PostgreSQL extension for efficient storage, retrieval, and similarity search of vector data. It's useful for semantic search, recommendation systems, and natural language processing, critical for AI applications like Large Language Models (LLMs).
The key features of pgvector
include storing vectors in PostgreSQL tables, similarity search, and vector indexing.
Vector search finds vectors in a dataset similar to a query vector. It's used in AI and data-driven applications to provide relevant results based on data similarities.
The common applications of vector search include natural language processing to search for semantically similar text in documents, sentences, or words, and identifying similar items, such as products, movies, books, documents, or videos in recommendation systems.
Generally vector search works by first converting the data into vectors using embeddings. Then measuring the closeness between vectors via a similarity metric. Additionally, one can use indexing to speed up searches, and then can search for similar vectors by comparing a query vector against the indexed vectors.
pgvector
is popular because it's open-source, integrates with PostgreSQL, and supports efficient indexing and querying. It handles large-scale vector data and performs quick similarity searches, crucial for real-time applications.
Starting in v24.2, vector search entered preview in CockroachDB. Vector search in CockroachDB finds data points similar to a query using vectors. Compatible with pgvector
, it stores and manipulates vectors, useful for AI applications like LLMs, and supports various vector comparison operators for similarity searches.
A black swan event is an event that has the following three attributes:
It was unexpected.
It had significant, wide-ranging consequences.
After it happens, people will suggest that it was predictable, despite the fact that it was not widely predicted before it happened.
This definition comes from mathematical statistician Nassim Taleb, who coined and popularized the term in his books Fooled by Randomness and The Black Swan.
However, in everyday language when people talk about a “black swan event,” they’re generally thinking just about unexpected events with wide-ranging consequences. The third criteria – that people will rationalize the event as predictable after the fact – isn’t typically discussed.
Officially the term doesn’t have a positive or negative connotation. Black swan events can theoretically be good or bad or neutral. In the real world, however, the term is often used to describe events with negative impacts, such as financial crashes, widespread service outages, and even natural disasters and terrorism.
Long story short: while the official definition of a black swan event is quite a bit more nuanced, in everyday life the term is often used to mean something pretty simple: an unexpected event with significant negative consequences.
(The name, in case you’re wondering, comes from the once-widespread perception that all swans were white, and black swans either didn’t exist or were incredibly rare. In reality, black swans do exist. But they’re only native to Australia, so they’re quite rare everywhere else – including in Europe, where the idea of a “black swan” as a symbol for something unpredictable, unexpected, or unlikely first came to be used.)
In the tech industry, most discussion of black swan events is typically related to infrastructure, and infrastructure failures leading to service outages.
On a global scale, a black swan event in technology could be something like a coronal mass ejection from the sun causing a kind of natural EMP that knocks out electronic systems over a large area. (Whether this would be a true black swan event is debatable, considering that it has happened before, but it happens only rarely and is considered unlikely).
More commonly, though, discussions of black swan events in technology are company-specific, meaning that a “black swan event” is an event that significantly and negatively impacts the company’s services, generally due to some kind of infrastructure failure or outage.
* Cloud provider outages.
* Power outages.
* System failures.
* Human mistakes.
Distributed SQL represents a significant evolution in database technology, designed to meet the needs of modern cloud applications. Traditional SQL databases were built for data consistency, vertical scalability, and tight integration, which worked well for monolithic applications on single-server environments. However, as the paradigm shifted to distributed applications in the cloud, these traditional databases began to show limitations.
SQL (Structured Query Language) and NoSQL (Not Only SQL) databases serve different purposes and have distinct characteristics. SQL databases, such as CockroachDB, MySQL, and PostgreSQL, are relational databases that use structured query language for defining and manipulating data. They are known for their ACID (Atomicity, Consistency, Isolation, Durability) properties, which ensure reliable transactions and data integrity. SQL databases are ideal for complex queries and transactions, supporting structured data with predefined schemas and relationships.
On the other hand, NoSQL databases, like MongoDB and Cassandra, are designed to handle unstructured or semi-structured data. They may offer flexible schemas, allowing for rapid development and iteration. NoSQL databases are often used for large-scale data storage and real-time web applications due to their ability to scale horizontally and handle high volumes of read and write operations. They typically provide eventual consistency rather than the strong consistency guaranteed by SQL databases.
Modern applications require horizontal scalability, elasticity, and support for microservices. Traditional single-node relational databases, with their fixed schemas and lack of support for distributed data models, are not suited for these needs. Distributed SQL databases address these challenges by combining the consistency and structure of early relational databases with the scalability, survivability, and performance first pioneered in NoSQL databases.
Horizontally scalable: Distributed SQL databases can seamlessly scale to mirror the capabilities of cloud environments without introducing operational complexity. They distribute data across multiple nodes, ensuring efficient resource utilization.
Consistency: These databases deliver high levels of isolation in distributed environments, mediating contention and ensuring transactional consistency across multiple operators.
Resilience: Distributed SQL databases provide inherent resilience, reducing recovery times to near zero and replicating data naturally without external configuration.
Geo-replication: They allow for the distribution of data across geographically dispersed environments, ensuring low latency access and compliance with data sovereignty requirements.
In addition to the unique capabilities of distributed SQL, these databases must also meet foundational requirements such as:
Operational efficiency: Easy installation, configuration, and control of the database environment.
Optimization: Advanced features like cost-based optimizers for query performance.
Security: Key capabilities for authentication, authorization, and accountability.
Integration: Compatibility with existing applications, ORMs, ETL tools, and more.
Distributed SQL databases address these challenges by combining the consistency and structure of relational databases with the scalability, survivability, and performance first pioneered in NoSQL databases. They distribute data evenly across multiple nodes, ensuring efficient resource utilization and high availability. These databases deliver high levels of isolation in distributed environments, mediating contention and ensuring transactional consistency across multiple operators. Additionally, they provide inherent resilience, reducing recovery times to near zero and replicating data naturally without external configuration.
CockroachDB, for example, is a distributed SQL database that uses a transactional and strongly-consistent key-value store. It scales horizontally, is incredibly resilient against various kinds of outages, including disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention. CockroachDB supports strongly-consistent ACID transactions and provides a familiar SQL API for structuring, manipulating, and querying data. It ensures data consistency in a distributed environment by guaranteeing serializable SQL transactions, the highest isolation level defined by the SQL standard, using the Raft consensus algorithm for writes and a custom time-based synchronization algorithm for reads to guarantee strong data consistency.
Several databases meet the requirements of distributed SQL, including Google Spanner, Amazon Aurora, Yugabyte, FaunaDB, and CockroachDB. These databases offer various levels of support for the core capabilities mentioned above, and therefore are more appropriate for different use cases.
Amazon Aurora: Amazon Aurora is a cloud-based relational database engine that combines the speed and reliability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases. Aurora is often described as a distributed database, but it does not scale for writes, and therefore is not truly distributed. Aurora is designed to provide high availability and durability, with automatic failover and replication across multiple Availability Zones. It supports MySQL and PostgreSQL and offers features like automatic scaling and serverless options.
CockroachDB: CockroachDB is a distributed SQL database designed for global, cloud-native applications. CockroachDB is PostgreSQL compatible, so that most applications built on PostgreSQL can be migrated without changing the application code. CockroachDB provides strong consistency, horizontal scalability, and high availability. CockroachDB supports ACID transactions and a familiar SQL interface, making it easy to use for developers. It survives various types of failures, including mechanical, data center, region, and even cloud failures, with minimal latency disruption.
FaunaDB: FaunaDB is a distributed, multi-model database that provides strong consistency, ACID transactions, and a flexible data model. It is designed for serverless applications and offers a globally distributed architecture that ensures low-latency access to data. FaunaDB supports GraphQL and FQL (Fauna Query Language) for querying data.
Google Spanner: Google Spanner is a globally distributed database service that provides strong consistency and horizontal scalability. It uses a combination of hardware (atomic clocks) and software to achieve global consistency and high availability. Spanner supports SQL queries and is designed to handle large-scale, mission-critical applications.
YugabyteDB: YugabyteDB is an open-source, distributed SQL database. It supports both SQL and NoSQL workloads, making it versatile for various use cases.
When evaluating a distributed SQL database, it is essential to consider several core requirements to ensure it meets the needs of modern applications. First, assess the database's scalability. A distributed SQL database should be able to scale horizontally, distributing data evenly across multiple nodes to handle increased loads without compromising performance. This capability is crucial for applications that experience variable or growing workloads, and can support businesses at any stage of growth, scaling as demand scales. Additionally, the database should support strong consistency, ensuring that all nodes reflect the same data state, which is vital for maintaining data integrity across distributed environments.
Another critical factor is high availability and resilience. The database should be designed to handle failures gracefully, whether they occur at the disk, machine, rack, or even datacenter level, with minimal disruption to operations. This includes features like automatic failover, data replication, and quick recovery times.
RELATED
To learn more about inherently resilient systems, check out this webinar hosted by Cockroach Labs’ CTO and CPO, Peter Mattis and Technical Evangelist, Rob Reid: “The Always-On Dilemma: Disaster Recovery vs. Inherent Resilience.”
Geo-replication is also important, allowing data to be distributed across multiple geographic locations to reduce latency and comply with data sovereignty requirements.
Finally, consider operational efficiency, security, and integration capabilities. The database should be easy to install, configure, and manage, offer robust security features for authentication and authorization, and integrate seamlessly with existing applications and tools.
A mature distributed SQL database should meet all these requirements, ensuring it is suitable for business-critical applications.
Distributed SQL databases are utilized across various verticals to address specific industry challenges and requirements.
For example, check out this presentation by Netflix engineers who provide CockroachDB-as-a-Service (DBaaS) to Netflix developers with a variety of use-cases:
Here are some examples of vertical use cases for distributed SQL databases:
In the financial sector, distributed SQL databases are crucial for applications such as payment processing, trading platforms, and identity management. These applications require strong consistency, high availability, and the ability to handle high transaction volumes. For instance, payment systems must ensure accurate and timely transactions, while trading platforms need to process large volumes of trades with minimal latency. Identity management systems benefit from distributed SQL databases by providing secure and consistent access to user data across multiple regions.
Retail and eCommerce companies leverage distributed SQL databases for order and inventory management. These systems must handle large spikes in traffic, such as during Black Friday or Cyber Monday, without compromising performance (or overselling products). Distributed SQL databases provide horizontal scalability, ensuring that the system can manage increased demand. They also offer strong consistency, which is essential for maintaining accurate inventory levels and processing transactions reliably. Additionally, these databases support multi-region deployments, allowing retailers to provide low-latency access to customers worldwide.
The gaming industry uses distributed SQL databases to manage user accounts, in-game transactions, and real-time data processing. Gaming platforms often experience high concurrency with thousands of players interacting simultaneously. Distributed SQL databases ensure that user data is consistent and available, even during peak usage times. They also support the scalability needed to accommodate growing user bases and the resilience required to maintain uptime during unexpected failures.
In logistics and supply chain management, distributed SQL databases are used for scheduling, workflow management, and tracking systems. These applications require precise coordination and data integrity to ensure timely deliveries and efficient operations. Distributed SQL databases provide the high availability and fault tolerance needed to prevent disruptions in logistics workflows. They also support geo-replication, which helps in maintaining data consistency across different geographic locations.
These examples illustrate how distributed SQL databases can be tailored to meet the specific needs of various industries, providing the scalability, consistency, and resilience required for modern applications.
Distributed SQL is the future of database management in the cloud, offering the scalability, consistency, and resilience needed for modern applications. CockroachDB, among others, exemplifies these capabilities, making it a strong candidate for organizations looking to transition to cloud-native distributed SQL databases.
Get started with CockroachDB Cloud for free, today!
Distributed SQL refers to a class of relational databases that distribute data across multiple nodes to ensure high availability, fault tolerance, and scalability. These databases maintain SQL capabilities while leveraging a distributed architecture to handle large-scale, geographically dispersed data.
Traditional SQL databases are typically monolithic, meaning they run on a single server. Distributed SQL databases, on the other hand, spread data across multiple servers or nodes. This distribution allows for better performance, scalability, and resilience against failures.
Distributed SQL databases offer several benefits, including scalability, high availability, and resilience against mechanical failures. Distributed SQL databases combine the consistency and structure of early relational databases with the scalability, survivability, and performance first pioneered in NoSQL databases.
CockroachDB is a distributed SQL database that uses a transactional and strongly-consistent key-value store. It scales horizontally, survives various types of failures, including mechanical, data center, region, and even cloud failures with minimal latency disruption and no manual intervention. CockroachDB supports strongly-consistent ACID transactions and provides a familiar SQL API for structuring, manipulating, and querying data.
CockroachDB guarantees serializable SQL transactions, the highest isolation level defined by the SQL standard. It uses the Raft consensus algorithm for writes and a custom time-based synchronization algorithm for reads to ensure consistency between replicas.
Distributed SQL is ideal for many use cases including for distributed customer bases as you can bring the data closer to the customer, businesses that handle a high transaction volume, and businesses that handle spiky workloads. For example, this can include financial institutions, retailers or eCommerce businesses, and gaming platforms.
CockroachDB supports SQL syntax and scales easily without the manual complexity of sharding, rebalances and repairs itself automatically, and distributes transactions seamlessly across your cluster. It provides strong consistency and supports distributed transactions, unlike many other distributed databases.
At the highest level, CockroachDB is software for storing data. More specifically, CockroachDB is a distributed SQL database technology that enables users to enjoy many of the benefits of the traditional relational database (such as the familiar SQL language, reliable schema, etc.) while also offering the key features required for a modern, cloud-native database, such as high availability, bulletproof resilience, elastic scale, and geographic scale.
Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.
The Cost-Based Optimizer is a feature of CockroachDB that looks at all possible ways in which a query can be executed and assigns each a “cost” that indicates how efficiently the query can be run. Then the optimizer chooses the way that has the lowest cost, and is therefore most efficient. This feature only works with databases that speak SQL, so it’s an added benefit obtained from having a SQL layer.
Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.
Encoding is the process by which CockroachDB converts SQL statements into bytes (because the lower layers of CockroachDB deal with bytes).
Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.
When a request comes into CockroachDB, the load balancer routes the request to the node it thinks can best handle the request at that time. This node is called the Gateway Node. The Gateway Node receives the request and then responds to the request. It identifies which nodes in the cluster are the leaseholders for the specific requests that came in, and sends the requests to those nodes.
Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.
The key value layer is a figurative layer of CockroachDB. One helpful way to think about the architecture of CockroachDB is as a SQL system built on top of a key value store. When the data is up in the top layers, it follows the rules of a SQL system and is structured in a table format. When the data travels deeper into the database, table format no longer works, due to the distributed nature of the database. So instead of being stored in tables, the data is stored in a different way: in key-value pairs. It’s important to note that this combination of a SQL upper layer with a key-value store underneath is a relatively unusual setup, because translating structured table data into key-value pairs is a difficult task.
Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.
A key-value pair is a way of storing data. In CockroachDB, individual rows from the tables are mapped into key-value pairs. One column becomes the index, meaning that each piece of data in that column is the “key” in its own key-value pair. One or more other columns become the “value.”
For example, in a table called “Customer Data” the first column might be “Name,” the second “Hometown” and the third “Email Address.” You might decide that the “Name” should be the key because ultimately you want all your data to be sorted by name. Then you might decide that the “Hometown” and “Email Address” columns should be the values.
This information gets mapped, row-by-row, into key-value pairs, and the ultimate format of a single row might end up reading as a string: “Customer Data Table/John Smith/New York City/Johnsmith@gmail.com.” Then, once all the data in the table is translated into the monolithic sorted key-value map, these pieces of data are all sorted by the name value.
Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.
The Load Balancer is a piece of software added to the front of CockroachDB that helps balance the requests coming into the database. All the nodes in CockroachDB can process requests (they’re all symmetrical), so it makes the most sense to spread the work around between nodes so that the requests can be completed more quickly. The load balancer accomplishes that.
Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.
A meta range is information stored in the cluster that tells the database where to find ranges – i.e., which nodes have certain ranges – so that the database can access the correct range. Sometimes also referred to as an index, but a different definition of an index than above.
ACID stands for Atomicity, Consistency, Isolation, and Durability. These are a set of characteristics that are desirable for database transactions. Together, they guarantee that transactions and all data stored in the database remains valid even in the event of errors or power failures. These characteristics are extremely important for any OLTP database.
Specifically:
Atomicity: Each transaction must be treated as a single, indivisible unit (even if processing the transaction requires multiple steps). In other words, every step of the transaction must complete successfully or the entire transaction will fail and no change will be written to the database. This is desirable because without atomicity, a transaction that’s interrupted while processing may only write a portion of the changes it’s supposed to make, which could leave the database in an inconsistent state.
Consistency: No transaction can violate the integrity of the database – all transactions must leave the database in a valid state.
Isolation: Any concurrently-processed transactions (i.e. transactions happening at the same time) leave the database in the same state as if they were executed sequentially (i.e. one after another).
Durability: Once committed, the transaction remains committed even in the case of hardware failure, power outage, etc.
Unstructured Data is essentially the opposite of structured data: data that is not arranged with any kind of data model or schema and that can’t be easily adapted to a table format.
For example, datasets consisting of video or audio files are good examples of unstructured data. A traditional table structure would work for storing metadata about videos (such as title, description, Youtube link, etc.), but it is not a good fit for storing the videos themselves.
CockroachDB does support some types of unstructured data via its JSON support, but it was designed with primarily structured data in mind.
Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.
The Transaction Layer is the layer of CockroachDB that receives requests from the SQL Layer and coordinates concurrent operations.
TPC-C, short for Transaction Processing Performance Council Benchmark C, is a benchmark used to measure how well a database holds up when it’s trying to run certain workloads. TPC-C is specifically an OLTP benchmark, and it’s widely recognized and standardized. TPC-C simulates an environment where a lot of users are making requests of a database, and then it measures how well the database holds up – i.e., how fast it can complete the transactions, what the cost of completing the transactions is, and so on.
Structured Data (also called relational data) is data that lives best in a structured format, i.e., the kind of data you would enter into a table in a spreadsheet. The best way to organize it is in columns and rows.
Two examples of structured data are an inventory of products for an eCommerce site, or a list of customers and their information. CockroachDB was designed primarily to support this kind of data (although it does also include support for unstructured data).
Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.
The storage Layer is the layer of CockroachDB that writes data to the disk (the Physical Storage component), and reads data from the disk. It’s still part of the software, but it’s the piece that communicates with the hardware.
Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.
The SQL Layer is the layer of CockroachDB that speaks to the application or client using the programming language SQL, adhering to the PostgreSQL Wire Protocol. After performing various tasks, the SQL layer sends the relevant requests to the Transaction Layer.
A SQL API is an API for interacting and communicating with a database. In the case of CockroachDB, CockroachDB offers a SQL API to users and apps. This means that users and apps can send SQL commands into the API and receive the results of their query in return. The commands must follow the PostgreSQL Wire Protocol, because CockroachDB adheres to it.
Serializable Isolation is the highest level of isolation possible under the guidelines provided by the “ANSI SQL Standard” (an official standard of best SQL practices). It means that transactions committed to a database appear as if they were executed one after another, even if they were processed in parallel. Most distributed databases only achieve a lower level of isolation, called “snapshot isolation,” but CockroachDB is able to achieve serializable isolation.
A Secondary Index is a secondary column by which you can sort data. CockroachDB supports both primary and secondary indexes. This just adds another level of organization to data sorting. For example, in a table of data about employees, you might first sort alphabetically by “name,” (the primary index) and then within that, sort alphabetically by “hometown” (the secondary index).
RTO (Recovery Time Objective) is the amount of time a system’s data is unavailable due to a failure, before the system returns to service. The goal is zero RTO, and CockroachDB achieves this through its multi-active availability model.
RocksDB is a piece of software that was embedded in CockroachDB to store key-value data. CockroachDB used RocksDB to communicate with the disk in order to actually store data. CockroachDB now uses Pebble, a RocksDB-inspired and compatible key-value store that is specifically designed for distributed SQL databases. Many other tech companies still use RocksDB.
Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.
The Replication Layer is the layer of distributed database software that copies data between nodes and ensures consistency between these copies. In CockroachDB, this is accomplished by implementing the Raft consensus protocol.
Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.
A region is a specific geographical location where you can host your resources. Each region has one or more zones; most regions have three or more zones. For example, US-West is the West Coast of the US. US-East is the East Coast. For example, public cloud providers often allow you to choose the region or regions you’d like to deploy to.
A range in CockroachDB (called a “shard” in other databases) is a chunk of data, stored as key-value pairs. The Distribution Layer breaks tables apart into these chunks so the data can be distributed across different nodes.
In CockroachDB, a range is 512 MiB or smaller. This default range size is a sweet spot – small enough to move quickly between nodes, but large enough to store a meaningfully contiguous set of data whose keys are more likely to be accessed together. Once a range gets bigger than 512 MiB, it’s split into two smaller chunks in order to keep the ranges from getting too big.
Range replication is the duplication of ranges on multiple nodes so that if one node fails, the range’s data is still accessible via another node. In CockroachDB, each range is replicated three times by default. This replication allows CockroachDB to be highly available and remain online even when a node goes down, because the data lives in two other places (and two nodes are required to achieve quorum). When a failure happens, CockroachDB automatically redistributes data to the surviving nodes.
The Raft Consensus Protocol is the set of quorum guidelines that CockroachDB follows to ensure each range is consistent. It’s an algorithm that makes sure all copies agree on changes to data. A “leader” is elected for each range (leaders are also known as Leaseholders), to coordinate changes for that range, and the two other ranges are called “followers.” Changes get entered in the leader’s log, then the leader sends out the changes to the followers, and the changes get entered into their logs. Once this happens, the change is committed to the leader’s data, and then committed to the followers' data.
Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.
Quorum is the consensus required to commit changes in a distributed database such as CockroachDB. Different types of distributed databases may use different systems for quorum. CockroachDB uses the Raft Consensus Protocol, in which a minimum of two nodes are required to achieve a consensus (i.e. quorum) in a three-node system.
Public Cloud is the term for the most common kind of cloud computing vendor. Data lives in “public,” and hardware, storage, and network devices are shared with other organizations or cloud “tenants.” The services are accessed and managed via a web browser.
In a private cloud, a company’s data is stored in a dedicated cloud environment, rather than being stored on shared machines. The services and infrastructure are always maintained on a private network and the hardware and software are dedicated solely to that single organization. Benefits include increased security and more customization. Vendors include Amazon Virtual Private Cloud (VPC), Dell Cloud Solutions, and Microsoft Private Cloud.
An index is a column whose data is designated to the “key” location in a Key-Value pair. Think of sorting data in a spreadsheet; the column by which you’re sorting the data is called the index. A primary index is the main column by which the data is sorted.
PostgreSQL Wire Protocol is a dialect of the SQL language that’s spoken by the database called PostgreSQL. Many people have used PostgreSQL before and are familiar with its language, and many apps and ORMs already use the dialect. The fact that CockroachDB adheres to the PostgreSQL Wire Protocol makes it easy for customers to plug into the database and use it.
Physical Storage is the hardware on which the data is stored. CockroachDB’s Storage Layer (software) communicates with this hardware to physically write data onto the disk and read data from the disk.
An ORM is a software intermediary between the application and the database. It allows developers to speak to a SQL database using a language other than SQL. Some developers may not be experienced with SQL, or simply prefer to write in languages like Python, C++, Javascript etc. When writing the parts of their application that communicate with a database, they can use an ORM as a go-between, to translate their code into SQL and send it to the database to make requests.
Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.
In the context of databases, on-prem describes a database deployment in which a company’s data is stored on a machine that they own, rather than stored with a public cloud provider such as GCP or AWS. Companies often choose to store proprietary or high-security data on-prem because they can have more control over the safety of it.
OLTP (Online Transaction Processing) describes the kind of data processing that deals with heavy transactional workloads. In OLTP workloads, in other words, there are many relatively simple, transactional operations (many reads and writes) constantly coming into the database. The data typically relates to standard business tasks like keeping track of inventory, hotel reservations, or online banking functions.
The other main kind of data processing is OLAP (Online Analytics Processing). Compared to OLTP workloads, the workloads run on OLAP databases are usually much more complicated but much less frequent.
Often, database technologies are designed with one or the other in mind. OLTP databases such as CockroachDB are focused on transactional use cases such as payment processing, logistics, metadata management – basically any use case that involves frequent reads and writes. OLAP databases, in contrast, are typically used for data analytics.
Often a company may use both, with the OLTP database handling the transactions coming from the application, and with relevant data periodically offloaded to an OLAP database for analysis. This approach ensures that even very complex analytics work will not interfere with the performance of the OLTP database.
Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.
A node is a single instance of a distributed database such as CockroachDB, or one individual machine of many that are running the same distributed database. Many nodes join together to create the full cluster.
Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.
MVCC (Multiversion Concurrency Control) is a protocol that CockroachDB follows to ensure isolation of transactions when concurrent transactions are happening. Without MVCC, if a database is being used in multiple ways at the same time, then someone might see half-written or inconsistent data. MVCC keeps multiple copies of data, so each user sees a snapshot of the database at a particular instant in time, and they won’t see changes until the transaction has been committed.
Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.
Multi-Cloud refers to a strategy where multiple cloud providers are being used, rather than just one. Typically, these are multiple public cloud providers (i.e. AWS and GCP). CockroachDB is one of the few distributed SQL databases that supports multi-cloud deployments.
Multi-Active Availability is CockroachDB’s high availability model. All replicas in the cluster can handle traffic, including both reads and writes, but the Raft consensus is used to ensure data remains consistent. This model also prevents RTO if a node goes down.
Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.
When all of the data from all of the tables is translated into key-value pairs in CockroachDB, it’s called a monolithic sorted key-value map. This just means a giant, list of key-value pairs that correspond to rows in tables, organized in a way that allows you to easily insert and find data.
A mainframe is a gigantic machine (typically made by IBM) that has a lot of storage abilities and high computing power. Mainframes are almost exclusively on-prem and privately owned by the businesses that use them. However, recently IBM has released a version of its mainframe to be used in private cloud settings.
Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.
A machine is just a computer. It can live in a data warehouse or on-prem. It can vary in computing power and storage abilities. Typically, the machines that live in data warehouses to be used for cloud purposes are much smaller with less power than mainframes.
Note: Here, we are defining JSON in the context of a distributed database.
JSON, or Javascript Object Notation, is a way of formatting “semi-structured” data, and CockroachDB supports it by default. A small amount of data in CockroachDB is typically JSON data.
To understand what JSON is, follow this example: Imagine a table that stores information about blog posts. Much of the data is structured; every blog post needs data on title, author, and number of words (and these become the columns in the table). But there might be additional data that’s not applicable to every post, such as if a user comments on a post or likes a post. Instead of having to make a separate column for each of these data pieces (which would be inefficient since many of the cells would be empty), you can make a single column that supports JSON – and then you format the unique data pieces in JSON and put any item at all into that column, without having a predefined label attached to it.
Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.
Isolation describes a desirable database characteristic in which concurrently-processed transactions (i.e. transactions happening at the same time) leave the database in the same state as if they were executed sequentially (i.e. one after another).
It is one of the four ACID properties that are desirable for databases dealing with transactional workloads.
Accurate time is very important in distributed systems, because events often occur at different nodes at the same time, and these events need to be ordered accurately. Google Spanner uses atomic clocks to accomplish this, but because CockroachDB is open source and can run on any public/private cloud/on-prem, it’d be impossible to use atomic clocks. Instead CockroachDB used HLC, which is a clock method that has logical and physical components. CockroachDB applies HLC timestamps to all transactions so the system knows when they occured and can order them correctly.
Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.
Hybrid Cloud refers to a strategy where cloud storage and on-prem storage are both being used. A typical example of this would be a single company storing sensitive data on-prem and less sensitive data in the cloud.
Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.
High Availability is a desirable characteristic for databases; it describes the ability of a database to survive (and thus remain available) even when parts of the system fail. For example, a CockroachDB database with five nodes could survive and remain available even if several nodes failed.
Different databases use different models to achieve high availability. The two most common high availability models are Active-Passive and Active-Active. Meanwhile, CockroachDB uses a Multi-Active Availability model.
Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.
In the context of a distributed database, the gossip protocol is the form of communication used between nodes, to allow each node to locate data across the cluster.
Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.
In the context of a database, durability is a desirable property that describes a system where, once data has been committed to the database, it will remain even in the event of machine failures, power outages, etc.
It is one of the four ACID properties that are desirable for databases dealing with transactional workloads.
Note: This term has other meanings in other contexts. Here, we are defining it in the context of a distributed database.
In the context of databases, a driver is a piece of code that you add to your app to help it speak the language necessary for communicating with the database, such as SQL. For example, if you’re building a Python app, you might use the psycopg2 driver to enable it to communicate with CockroachDB using SQL.
Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.
In a distributed database such as CockroachDB, the distribution layer takes data from the SQL layer and breaks it down into chunks called ranges to be stored in a distributed way (i.e., it is stored in multiple locations instead of just a single location).
A data warehouse or datacenter is a giant warehouse where thousands of machines live, arranged in racks. Cloud providers own many of these warehouses, and when you run or store something on the cloud, it lives in one of these warehouses.
A CPU (Central Processing Unit) is a chip that sits on top of the motherboard of a computer and carries out instructions from the software. Usually CPUs are multi-core, meaning that there is more than one CPU on a single chip. In the context of databases, the power of the CPUs on the machines running your database will impact the performance of the database.
Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.
A core is a component of a CPU that carries out instructions. A multi-core CPU has multiple CPUs together on a single chip, increasing the chip’s overall computing power.
Most database systems have rules for what kinds of data can and cannot be stored (among other rules). Consistency is the term used to describe a database in which those rules are always enforced. A database is said to have consistency when no transaction can violate the integrity of the database – all transactions must leave the database in a valid state.
It is one of the four ACID properties that are desirable for databases dealing with transactional workloads.
Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.
A cluster is the full collection of nodes associated with a distributed database. For example, a company might spin up a CockroachDB cluster with five nodes for its database.
In the context of databases, “cloud” refers to storing your data on machines that belong to a third-party cloud provider. This is more cost efficient than on-prem storage, as it eliminates the need for a company to maintain its own machines, and typically increases the availability and scalability of the company’s services. This is because with cloud storage, data is stored across many smaller computers inside a data warehouse and across data warehouses (when storing data on-prem, it’s usually stored in a single massive computer or mainframe).
In the context of data storage, a byte is the most basic format for coding a character on a computer. A byte is a group of eight 0’s and 1’s in a specific order that represents their character. For example, the letter “A,” when translated into bytes, reads as: “01000001”
An AZ, availability zone, or just “zone”, typically refers to a single data warehouse within a region. Multiple warehouses/zones make up a single region. Sometimes, an availability zone isn’t at the warehouse level, but instead at the rack level - i.e., a single rack within a warehouse.
Atomicity is a desirable characteristic for database transactions. The name comes from the idea of an indivisible “atomic unit”, and it describes a method for processing transactions that treats each transaction as a single, indivisible unit (even if processing the transaction requires multiple steps). In other words, every step of the transaction must complete successfully or the entire transaction will fail and no change will be written to the database.
This is desirable because without atomicity, a transaction that’s interrupted while processing may only write a portion of the changes it’s supposed to make, which could leave the database in an inconsistent state.
It is one of the four ACID properties that are desirable for databases dealing with transactional workloads.
An application or client is a software program. In the context of a database, the client is what sends data to a database, and/or gets data from it. An example of an application is a phone app – a piece of software that runs on the phones of users, and may re quest and send data to and from a database as the user takes different actions in the app.
API stands for Application Programming Interface. Simply put, an API is a way for developers/apps/clients to communicate with applications. It’s an interface that allows the developers to send requests to an application and receive simple responses from it.
Active-passive availability is one way to configure a distributed database to offer high availability.
In an Active-Passive Availability setup, all traffic is routed through a single active replica, and changes are copied to a backup passive replica. If the active replica fails, the passive one takes over. However, the active one might fail before the passive one copies all the data over, leading to data loss. Plus, it can take a while for the passive replica to boot up, causing some RTO.
Other configurations include active-active availability and multi-active availability.
Active-active availability is one way to configure a distributed database to offer high availability.
In an active-active availability setup, multiple replicas run identical services and traffic is routed to all of them. If any replica fails, the others handle the extra traffic. This model runs into consistency issues in database contexts because multiple replicas might be trying to edit the same data at the same time.
Other configurations include active-passive availability and multi-active availability.
Edge computing is a somewhat overloaded term to describe locality-sensitive distributed computing architecture. Wikipedia defines it as “a distributed computing paradigm that brings computation and data storage closer to the sources of data.” We can think of the “sources of data” being users or even sensors making requests to our system.
The main aim of edge computing or multi-access edge computing is to reduce location-related latency in applications to enable high-performance, real-time use cases in widespread geographies. Edge computing systems are faster when computation and data are closer to the devices.
Developers today are beginning to realize that to get computation data closer to devices, there are better choices than centralized databases and even distributed databases with limited distribution to a single region.
Operating a database in a single region leads to high latency for users or edge devices in areas outside of the database region. Even if you distribute your application across multiple regions, users or devices outside the database region may experience unacceptable response times. And unexpectedly high latency can translate into dissatisfied users.
Edge computing uses data that has life cycles or life spans. In the same application, you can have ephemeral data, in-memory data, short-term persistent data, and long-term persistent data. Typically, long-term persistent data is stored in databases. Unfortunately, when LTP data is far from the edge, it’s slower, and this slow data effect tends to give databases a lousy reputation.
Choosing the right database solution can improve your edge computing architecture and user experience.
In the context of data storage, a bit is a single 0 or 1 (also called a binary digit). There are 8 bits in a byte.