Blog
Engineering
SIGMOD 2020: Cockroach Labs publishes research paper on CockroachDB
Over the past few months, a team of our engineers, technical writers, product managers, and sales engineers codified the research and learnings of CockroachDB and are now contributing this knowledge back into the very system from which we have benefited with the hope of further advancing distributed systems research and design. The research paper, "CockroachDB: The Resilient Geo-Distributed SQL Database", is a labor of love that we are honored to have published by SIGMOD, the Association for Computing Machinery's (ACM) Special Interest Group on Management of Data, which specializes in large-scale data management problems.
Jessica Edwards
June 10, 2020
Engineering
How online primary key changes work in CockroachDB
As of our 20.1 release, CockroachDB supports online primary key changes. This means that it is now possible to change the primary key of a table while performing read and write operations on the table. Online primary key changes make it easier to transition an application from being single-region to multi-region, or to iterate on the schema of an application with no down time. Let’s dive into the technical challenges behind this feature, and our approach to solving them. As part of the deep dive, let’s review how data is stored in CockroachDB, and how online schema changes work in CockroachDB.
Rohan Yadav
May 21, 2020
Engineering
Tutorial: How to build a low-latency Flask app that runs across multiple regions
If your company has a global customer base, you’ll likely be building an application for users in different areas across the world. Deploying the application in just a single region can make high latencies a serious problem for users located far from the application’s deployment region. Latency can dramatically affect user experience, and it can also lead to more serious problems with data integrity, like transaction contention. As a result, limiting latency is among the top priorities for global application developers. In this blog, we walk you through a low-latency multi-region application that we built and deployed for MovR, a fictional vehicle-sharing company with a growing user base. The MovR application is a Flask web application, connected to a CockroachDB cluster with SQLAlchemy. We deployed the application in multiple regions across the US and Europe, using Google Kubernetes Engine, with some additional Google Cloud Services for load balancing, ingress, and domain hosting. For the database, we used a multi-region CockroachDB Dedicated deployment on GCP. The application source code is available on GitHub, and an end-to-end tutorial for developing and deploying the application is available on our documentation website.
Eric Harmeling
April 2, 2020
Engineering
How to run chaos tests in a multi-cloud environment
This year, as every year, Black Friday and Cyber Monday stressed e-commerce systems to their breaking points. Major companies like H&M, Nordstrom Rack, and other retailers experienced the kinds of costly outages that keep SREs up at night. Multi-cloud infrastructure is sometimes offered as a panacea to these kinds of outages. But multi-cloud deployments are not a band-aid. In fact, they often introduce new complexities into the system that need to be sniffed out. But sniffing out bugs in multi-cloud environments is, by nature, complicated. Ana Medina, a chaos engineer from Gremlin, spoke at ESCAPE/19 about how to do it, including a detailed list of the kinds of errors to search for and checklists of questions to ask.
Dan Kelly
December 9, 2019
Engineering
Availability and region failure: Joint consensus in CockroachDB
At Cockroach Labs, we write quite a bit about consensus algorithms. They are a critical component of CockroachDB and we rely on them in the lower layers of our transactional, scalable, distributed key-value store. In fact, large clusters can contain tens of thousands of consensus groups because in CockroachDB, every Range (similar to a shard) is an independent consensus group. Under the hood, we run a large number of instances of Raft (a consensus algorithm), which has come with interesting engineering challenges. This post dives into one that we’ve tackled recently: adding support for atomic replication changes (“Joint Quorums”) to etcd/raft and using them in CockroachDB to improve resilience against region failures.
Tobias Grieger
November 26, 2019
Engineering
Parallel Commits: An atomic commit protocol for globally distributed transactions
Distributed ACID transactions form the beating heart of CockroachDB. They allow users to manipulate any and all of their data transactionally, no matter where it physically resides. Distributed transactions are so important to CockroachDB’s goal to “Make Data Easy” that we spend a lot of time thinking about how to make them as fast as possible. Specifically, CockroachDB specializes in globally distributed deployments, so we put a lot of effort into optimizing CockroachDB’s transaction protocol for clusters with high inter-node latencies.
Nathan VanBenschoten
November 7, 2019
Engineering
SQLsmith: Randomized SQL testing in CockroachDB
Randomized testing is a way for programmers to automate the discovery of interesting test cases that would be difficult or overly time consuming to come up with by hand. CockroachDB uses randomized testing in many parts of its code. I previously wrote about generating random, valid SQL. Since then we’ve added an improved SQL generator to our suite called SQLsmith, inspired by a C compiler tester called Csmith. It improves on the previous tool by generating type and column-aware SQL that usually passes semantic checking and tests the execution logic of the database. It has found over 40 new bugs in just a few months that the previous tool was unable to produce. Here I’ll discuss the evolution of our randomized SQL testing, how the new SQLsmith tool works, and some thoughts on the future of targeted randomized testing.
Matt Jibson
June 27, 2019
Engineering
High availability without giving up consistency
If you’re reading this, you’re surely familiar with the arguments for high availability: services are only useful when they’re online. Unavailable services not only lose money, but also deteriorate your credibility in customers’ eyes. This could lead to immeasurable costs to your company in the future. Given that CockroachDB got its name because of its ability to survive failures, we thought we would cover some architectural considerations when building high availability services on top of Cockroach.
Sean Loiselle
August 23, 2018
Engineering
Kubernetes: The state of stateful apps
Over the past year, Kubernetes––also known as K8s––has become a dominant topic of conversation in the infrastructure world. Given its pedigree of literally working at Google-scale, it makes sense that people want to bring that kind of power to their DevOps stories; container orchestration turns many tedious and complex tasks into something as simple as a declarative config file.
Sean Loiselle
May 1, 2018