[CASE STUDY]

How City Storage Systems survives regional outages with CockroachDB

City Storage Systems, a startup founded by Uber’s Travis Kalanick, is on a mission to reinvent the business of food. Their product brands, including Otter, CloudKitchens, and Lab37, offer software to run growing food businesses and kitchens to produce quality food at dramatically lower cost.

City Storage Systems’ strategy is made up of two key elements: 1) tackling technical problems in today’s digital world and connecting them to the physical world, such as a robotics platform and ML prediction for real estate, and 2) optimizing their platform to support hundreds of daily restaurant use cases.

The Infrastructure Storage Team at City Storage Systems is in charge of building a resilient foundation so that their developers can take technical challenges head-on without worrying about the reliability of their platform. In 2019, the team started to build their infrastructure on CockroachDB, and it now runs a fleet of clusters that are designed to tolerate a region-wide outage.

[ INDUSTRY ]

Software development (for food businesses)


[ CHALLENGE ]

Zero downtime for mission-critical apps and flexible deployment configurations.


[ SOLUTION ]

A highly-available, strongly consistent database with multi-region and multi-cloud support that remains performant at scale.

  • Downtime: 0

  • Regions: 3

  • Multi-active architecture: 1

Tolerating a region failure

Before using CockroachDB, City Storage Systems built applications on GCP Cloud SQL for PostgreSQL, which requires maintenance windows for many standard operations. City Storage Systems did not like having frequent downtime for maintenance and wanted a no-downtime solution for Tier 0 applications.

They wanted to build a storage-as-a-system service that would allow app developers to move fast, which meant delivering a SQL interface that’s easy to understand, strongly consistent, and easy to scale. What City Storage Systems didn't want was developers getting bogged down worrying about eventual consistency, especially when scaling across regions. Additionally, they wanted to avoid being locked in to a single cloud provider and to keep the ability to deploy their infrastructure across multiple clouds in the future.

In summary, their requirements for a new database included: 

  • Ability to tolerate the outage of an entire region 

  • Cloud-native foundation, with no vendor lock-in

  • Flexibility to scale up and down on demand

  • Portability and multi-cloud from day one

  • Quick development speed with strongly consistent storage and no downtime for maintenance 

After they evaluated several available database solutions, the City Storage Systems team felt that CockroachDB was the most mature product that met all of their requirements.


It’s a design goal of ours to make sure that we have the option of running a multi-cloud deployment, or at the very least portability across clouds. It’s great to be able to say, ‘We are just going to take this stuff and move it due to regional availability.’ Not many organizations have that capability.

-Rasmus Bach Krabbe, Engineering Manager, City Storage Systems

Mission-critical migration

Once City Storage Systems selected CockroachDB to support their Tier 0 workloads, they started to migrate applications over from PostgreSQL. They still run PostgreSQL for smaller use cases and backend apps that don’t require the advanced capabilities of CockroachDB. 

For example, they migrated their order management system and user authentication platform to CockroachDB. They wanted these applications to run across multiple regions and be highly available at all times. 

These systems also make up the bulk of their online transaction processing (OLTP) load. For example, their order management system can see spikes in volume during periods when more people are placing orders. When this happens, the Infrastructure Storage Team can add more nodes to the cluster and scale up. When demand levels out, they can scale back down.

The migration process itself required work – especially since City Storage Systems wanted to do it without downtime. At a high level, they built a system that supported dual writes, synchronized the data across the two databases, and at some point flipped over the source of truth. Now the team can leverage CockroachDB’s suite of migration tools, called MOLT, to make the migration process much easier for any additional use cases that require zero downtime.
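The dual-write pattern described above can be sketched roughly as follows. This is a minimal illustration, not City Storage Systems' actual system: the `DualWriteStore` class, its method names, and the in-memory dictionaries standing in for PostgreSQL and CockroachDB are all hypothetical.

```python
class DualWriteStore:
    """Hypothetical dual-write wrapper; plain dicts stand in for the two databases."""

    def __init__(self):
        self.old_db = {}  # stands in for PostgreSQL (assumption)
        self.new_db = {}  # stands in for CockroachDB (assumption)
        self.source_of_truth = "old"

    def write(self, key, value):
        # During the migration, every write goes to both databases.
        self.old_db[key] = value
        self.new_db[key] = value

    def read(self, key):
        # Reads are served by whichever database is currently the source of truth.
        db = self.old_db if self.source_of_truth == "old" else self.new_db
        return db[key]

    def backfill(self):
        # Copy rows that predate dual writes, without overwriting
        # anything the new database already received.
        for key, value in self.old_db.items():
            self.new_db.setdefault(key, value)

    def in_sync(self):
        return self.old_db == self.new_db

    def flip(self):
        # Switch the source of truth only once both copies match.
        if not self.in_sync():
            raise RuntimeError("databases differ; refusing to flip")
        self.source_of_truth = "new"
```

In this sketch, a caller would enable dual writes, run `backfill()` until `in_sync()` holds, and only then call `flip()`. A real migration would also have to handle partial write failures and concurrent updates, which this toy version ignores.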


The ability to have zero downtime restarts with CockroachDB is huge. We can do actual maintenance of the nodes without disturbing our customers and having them know that we are even doing it. Running a multi-active data center gives us the type of survivability we are looking for.

-Frederik Stenum Mogensen, Software Engineer, City Storage Systems

Kubernetes & CockroachDB for app efficiency

City Storage Systems builds their applications on Kubernetes for some of the same reasons they build on CockroachDB. Kubernetes… 

  • Is programmable, so they don’t need a standard ops org to run many clusters

  • Facilitates high availability by automating application deployment and restarts

  • Is portable and doesn’t lock them into a specific cloud vendor 

  • Can speed up development because many devs are already familiar with it

In fact, CockroachDB is built to work on Kubernetes and that compatibility helped make the development process smoother than it might otherwise have been. However, that doesn’t mean achieving an ideal setup with minimal operational upkeep was easy. 

After many trials and tribulations, the City Storage Systems team decided to build their own custom operator. By using the operator, they were able to free up significant time by simplifying and automating a variety of tasks. The diagram below shows the architecture of the operator on the left, and a sample of a YAML file defining resources and hierarchies on the right. 

[City Storage Systems] Diagram

The team runs multiple in-house orchestrations for the CockroachDB clusters to make them configurable by code instead of ClickOps. This helps keep the clusters reliable, consistent across environments, and tolerant of the disruptions that are inherent to Kubernetes. 

For a majority of their Tier 0 applications, they are running CockroachDB in Kubernetes across three U.S. regions (in a single AZ per region) for high availability. For some of their smaller, non-mission-critical applications, they run CockroachDB in a single region, which meets the availability requirements.
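The reason three regions can ride out a regional outage comes down to quorum arithmetic: CockroachDB replicates data with a consensus protocol (Raft) that needs a majority of replicas to stay available. The sketch below assumes a replication factor of 3 with one replica per region; the source does not state the team's actual replica layout, and the function and region names are illustrative.

```python
def quorum(replicas):
    """Smallest majority of a replica group."""
    return replicas // 2 + 1


def survives_region_loss(replicas_per_region, failed):
    """True if a majority of replicas remains after one region fails."""
    total = sum(replicas_per_region.values())
    remaining = total - replicas_per_region.get(failed, 0)
    return remaining >= quorum(total)


# Assumed layouts, for illustration only.
three_regions = {"region-a": 1, "region-b": 1, "region-c": 1}
two_regions = {"region-a": 2, "region-b": 1}
```

With `three_regions`, losing any one region still leaves two of three replicas, which is a majority. With `two_regions`, losing `region-a` leaves only one of three, so the data would be unavailable until that region returns.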

Flexible cloud deployments 

After an initial successful deployment via one cloud provider, in 2022 City Storage Systems made the decision to move CockroachDB to Microsoft Azure, aligning to the company's broader investment in Azure to take advantage of operational efficiencies. 

Additionally, the company was keen to establish a partnership, similar to the one it has with CockroachDB, to help evolve CSS' stack and Azure together. These business objectives were easily supported by CockroachDB's flexible cloud-native architecture, which enables seamless migration between cloud platforms.

The team reports that the cloud migration process was made possible via two major components: CockroachDB itself and the Kubernetes control plane. 

As the City Storage Systems team was removing CockroachDB nodes from the original cloud provider, they were able to drain data across nodes and control data placement via CockroachDB node attributes. They would then spin the nodes back up in Azure and move the data. The Kubernetes control plane gave them a single-pane-of-glass experience across the two cloud providers that abstracted away many of the details.
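One way to steer replicas toward nodes in the new cloud is a replication zone constraint that matches an attribute or locality set on those nodes. The sketch below only builds the SQL text for such a constraint. The `cloud=azure` label, the database name `orders`, and the helper function are hypothetical, and the source does not describe the exact statements the team ran.

```python
def placement_statement(database, required):
    """Build an ALTER DATABASE ... CONFIGURE ZONE statement.

    Each entry in `required` becomes a '+' (required) constraint, which
    asks CockroachDB to place replicas only on nodes carrying that
    attribute or locality.
    """
    terms = ",".join("+" + constraint for constraint in required)
    return (
        "ALTER DATABASE " + database + " CONFIGURE ZONE "
        "USING constraints = '[" + terms + "]';"
    )


# Hypothetical example: require replicas to live on nodes labeled cloud=azure.
statement = placement_statement("orders", ["cloud=azure"])
```

Here `statement` holds `ALTER DATABASE orders CONFIGURE ZONE USING constraints = '[+cloud=azure]';`. Moving replicas off nodes before removing them is commonly done with `cockroach node decommission`, though the source does not say whether the team used that command.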


The cloud migration we performed was actually very smooth for our use case. We now run CockroachDB on Azure across multiple regions and are able to tolerate the loss of an entire region.

-Rasmus Bach Krabbe, Engineering Manager, City Storage Systems


Addressing modern application challenges with CockroachDB & Azure

CockroachDB helped City Storage Systems address modern application challenges so that they could deliver on their business requirements. First, CockroachDB enables them to build a resilient foundation that can tolerate region-wide outages, ensuring their critical Tier 0 applications remain operational.

Additionally, CockroachDB allows City Storage Systems to scale their database to handle an increase in demand for applications like their order management system. CockroachDB’s familiar SQL interface and strong consistency model have simplified application development and freed developers from concerns about eventual consistency.

Not only have they been able to increase developer productivity, but they’ve also been able to simplify operations. CockroachDB features like zero-downtime upgrades and online schema changes have reduced the operational overhead for City Storage Systems, allowing them to perform maintenance without disrupting their customers. They were also able to seamlessly migrate to Azure from another cloud provider since CockroachDB is cloud-native.

What’s Next

Today, CockroachDB, in tandem with a custom Kubernetes operator running on Azure, is delivering what City Storage Systems needs: a highly-available, strongly consistent database with multi-region and multi-cloud support that remains performant at scale.

In the future, like many startups, City Storage Systems will introduce automation where possible, leverage new technologies to improve performance, and push GenAI further into their use cases. For example, they have created an advanced food prep prediction model that leverages unique data to predict food prep time 42% more accurately than the baseline. To learn more about what the team is building, subscribe to their tech blog.

If you’d like to run CockroachDB on Azure, visit the Azure Marketplace or learn more about our partnership here.