Exporting data is a crucial capability in any database user’s toolkit. In CockroachDB, the EXPORT command has long provided this essential functionality to:
- Move data to a different data store for business analytics
- Migrate data to a new database
- Archive data in a platform-neutral way
- Seed an application with data
Say, for example, you need to export a sizable amount of JSON to seed a core data store for a streaming service. Sometimes you can pre-seed this data alongside your application with an EXPORT. But if you’re working at the scale of a company like Netflix, EXPORT commands come with limitations that led us to explore an additional way to export data.
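For reference, a traditional export might look like the following sketch; the table name and bucket URI are placeholders:

EXPORT INTO CSV 's3://my-bucket/exports?AWS_ACCESS_KEY_ID=...&AWS_SECRET_ACCESS_KEY=...'
  FROM TABLE users;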
Problems with EXPORT
First of all, EXPORTs have some configurability and integration limitations, but these are minor issues compared to two other problems:
- EXPORT commands have a natural scale limit: above a certain table size (terabytes of data), EXPORTs are likely to run into snags that cause them to restart. Since they must restart from the beginning, an EXPORT at very high data scale is unlikely to ever finish.
- EXPORTs have limited observability compared to jobs like schema changes and backups, since they aren’t actually jobs themselves.
There is a solution to these problems. In CockroachDB you can export data using changefeeds, available in CockroachDB Enterprise and in our fully managed offerings, CockroachDB Dedicated and CockroachDB Serverless. (CockroachDB Core users are invited to try an experimental feature, EXPERIMENTAL CHANGEFEED FOR. This statement creates a new changefeed that streams row-level changes to the client indefinitely, until the underlying connection is closed or the changefeed is canceled.)
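For Core users, a minimal sketch of that experimental statement, assuming a table named users, looks like this:

EXPERIMENTAL CHANGEFEED FOR users;

Rows are emitted back over the same SQL connection as they change, so the client consuming the stream is responsible for processing them.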
How to export data with changefeeds
Enterprise change data capture (CDC) provides row-level change subscriptions for downstream processing. In CockroachDB, users can create a new changefeed that streams row-level changes in a configurable format to a configurable sink such as Apache Kafka.
CREATE CHANGEFEED FOR TABLE table_name
  INTO 'sink-uri'  -- Kafka, webhook, cloud storage, or Google Cloud Pub/Sub
  WITH format = csv, initial_scan = 'only';
The CREATE CHANGEFEED statement creates a new CockroachDB Enterprise changefeed, which targets an allowlist of tables, called “watched rows”. Every change to a watched row is emitted as a record in a configurable format (JSON, Avro, or CSV for initial-scan-only exports) to a configurable sink (Kafka, Google Cloud Pub/Sub, a cloud storage sink, or a webhook sink). You can create, pause, resume, alter, or cancel an Enterprise changefeed.
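Because a changefeed is a job, the standard job controls apply. A quick sketch, using an illustrative job ID:

SHOW CHANGEFEED JOBS;                            -- list changefeeds and their status
PAUSE JOB 786667963988787201;                    -- pause the changefeed
RESUME JOB 786667963988787201;                   -- pick up where it left off
CANCEL JOB 786667963988787201;                   -- stop it permanently
ALTER CHANGEFEED 786667963988787201 ADD rides;   -- start watching another table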
In the base case, changefeeds send a continuous stream of updates for a table to an external system or over a SQL connection. With the initial_scan = 'only' option, however, a changefeed emits a one-time snapshot of the table and the job then completes.
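Putting this together, a one-time export of a table as CSV to cloud storage might look like the following sketch; the database, table, and bucket URI are placeholders:

CREATE CHANGEFEED FOR TABLE movr.users
  INTO 's3://my-bucket/exports?AWS_ACCESS_KEY_ID=...&AWS_SECRET_ACCESS_KEY=...'
  WITH format = csv, initial_scan = 'only';

The changefeed writes the table’s current contents to the bucket as CSV files, and the job then completes instead of streaming further updates.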
Using changefeeds allows us to take advantage of a wider set of integrations and configurability, as well as the robust-at-scale changefeed architecture.
Automatic data scaling
The huge advantage: in CockroachDB, changefeeds are built to scale organically with your data. Here’s how it works:
- Changefeeds process data in parallel
- Changefeeds run on the jobs system, which brings enhanced observability and scale infrastructure (see the sketch after this list)
- Changefeeds use checkpointing to remain robust as data scales out
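Because changefeeds run through the jobs system, you can check on an in-flight export with standard jobs queries; for example, the checkpoint a changefeed has reached is surfaced as its high-water timestamp:

SELECT job_id, status, high_water_timestamp
  FROM [SHOW CHANGEFEED JOBS];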
Let’s compare CREATE CHANGEFEED to a traditional EXPORT. A changefeed runs as a job: it can be paused, resumed, altered, or canceled, it checkpoints its progress rather than restarting from scratch after a failure, and it supports a wider range of formats and sinks. An EXPORT is not a job, offers limited observability, and must restart from the beginning when it hits a snag.
For more details about the differences in configurability, check out the EXPORT and Enterprise changefeed docs.
Changefeeds offer a great alternative to EXPORT for users operating at high data scale. A few limitations remain compared to the EXPORT command, but changefeeds are quickly approaching feature parity, and they already offer many configuration and destination options beyond what EXPORT provides.