CockroachDB snapshot backups in Kubernetes

9/10/2019

I am trying to take snapshot backups with Velero of a 12-node test CockroachDB cluster running in Kubernetes so that, if the cluster failed, we could rebuild it and restore CockroachDB from these snapshots.

The snapshot and restore seem to work, but on recovery CockroachDB appears to lose ranges.

Has anyone gotten snapshot backups to work with a large-scale CockroachDB database? (Given the size of the dataset, taking dumps or restoring from dumps is not viable.)

-- outside2344
cockroachdb
kubernetes

1 Answer

9/11/2019

Performing backups of the underlying disks while CockroachDB nodes are running is unlikely to work as expected.

The main reason is that even if each persistent disk snapshot is atomic, there is no way to ensure that all disks are captured at the exact same time (time being defined by CockroachDB's consistency mechanism). The restored cluster would contain replicas of the same data at different commit indices across nodes, resulting in data loss or loss of quorum (shown in the Admin UI as "unavailable" ranges).
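To make the skew concrete, here is a toy Python model. It is not CockroachDB's replication code: it assumes one range with three replicas, one entry committed per tick on every replica, and each node's disk snapshotted at a different tick. The node names and tick values are made up for illustration, and what a restored cluster does with such a state is not modeled.

```python
def simulate_skewed_snapshots(snapshot_ticks, total_ticks):
    """Return the commit index captured by each node's disk snapshot.

    Toy assumption: every tick commits exactly one entry on every replica,
    and each node's disk is snapshotted at its own tick.
    """
    commit_index = {name: 0 for name in snapshot_ticks}
    captured = {}
    for tick in range(1, total_ticks + 1):
        for name in commit_index:
            commit_index[name] = tick
            if snapshot_ticks[name] == tick:
                # The snapshot freezes whatever this disk holds right now.
                captured[name] = commit_index[name]
    return captured


def quorum_index(captured):
    """Highest commit index that a majority of the restored replicas hold."""
    indices = sorted(captured.values(), reverse=True)
    majority = len(indices) // 2 + 1
    return indices[majority - 1]


# Hypothetical snapshot times: n1 at tick 3, n2 at tick 7, n3 at tick 10.
captured = simulate_skewed_snapshots({"n1": 3, "n2": 7, "n3": 10}, total_ticks=10)
print(captured)                # each replica restored at a different index
print(quorum_index(captured))  # only entries up to here are on a majority
```

In this run the cluster had committed 10 entries when the last snapshot was taken, yet only the first 7 exist on a majority of the restored replicas, and one replica is back at entry 3.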

You have a few options (in order of convenience):

  • CockroachDB BACKUP, which has all nodes write data to external storage (S3, GCS, etc.), but requires an enterprise license.
  • SQL dump, which is impractical for large datasets.
  • Stop all nodes, snapshot all disks, then start all nodes again. Warning: we have used this to quickly load testing datasets, but have not used it in production environments.
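
The last option could be scripted roughly as follows. This is only a sketch under several assumptions that are not in the original answer: the cluster runs as a StatefulSet named `cockroachdb` whose pods carry the label `app=cockroachdb`, Velero uses volume snapshots (not file-level backup from running pods), and the namespace and backup name are placeholders. It builds the command list and prints it; nothing is executed unless `dry_run=False` is passed.

```python
import subprocess


def cold_snapshot_plan(namespace, statefulset, replicas, backup_name,
                       pod_selector="app=cockroachdb"):
    """Build the commands for: stop all nodes, snapshot disks, start nodes.

    All names passed in are assumptions about the deployment, not values
    taken from a real cluster.
    """
    kubectl = ["kubectl", "-n", namespace]
    return [
        # 1. Stop every CockroachDB node so no disk is written during snapshots.
        kubectl + ["scale", "statefulset", statefulset, "--replicas=0"],
        # 2. Wait for the pods to be gone (may error if they are already deleted).
        kubectl + ["wait", "--for=delete", "pod", "-l", pod_selector,
                   "--timeout=300s"],
        # 3. Snapshot the persistent volumes while the cluster is down.
        ["velero", "backup", "create", backup_name,
         "--include-namespaces", namespace, "--snapshot-volumes", "--wait"],
        # 4. Start all nodes again and wait for them to become ready.
        kubectl + ["scale", "statefulset", statefulset,
                   f"--replicas={replicas}"],
        kubectl + ["rollout", "status", f"statefulset/{statefulset}"],
    ]


def run(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(" ".join(cmd))
        if not dry_run:
            # Stops at the first failure, which can leave the cluster scaled down.
            subprocess.run(cmd, check=True)


plan = cold_snapshot_plan("default", "cockroachdb", 12, "crdb-cold-snapshot")
run(plan)  # dry run: only prints the commands
```

The cluster is unavailable for the duration of the snapshot, and a failure partway through leaves the nodes stopped, so the same caveat applies: treat this as a test-environment technique.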
-- Marc
Source: StackOverflow