Schedule a Job using the same PVCs as one other Pod in a StatefulSet

1/2/2020

In a Kubernetes cluster, I'd like to be able to schedule a Job (with a CronJob) that mounts the same Volumes as one Pod of a given StatefulSet. Which Pod that is, is a runtime decision, depending on the labels set on the Pod at the time the Job is scheduled.

I expect many people will wonder why, so here is a description of what we're doing and what we're trying to do:

Current setup

We have a StatefulSet that serves a PostgreSQL database (one primary, multiple replicas). We want to be able to create a backup from one of the Pods of the StatefulSet.

For PostgreSQL we can already do backups over the network with pg_basebackup. However, we are running multi-TB PostgreSQL databases, which means full streaming backups (with pg_basebackup) are not feasible.

We currently use pgBackRest to back up the databases, as it allows for incremental backups.
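
For reference, a minimal sketch of such an incremental backup invocation (the stanza name main is an assumption):

# Hypothetical invocation; the stanza name is an assumption.
# pgBackRest only copies files changed since the last backup in the
# repository, but it needs direct access to the data and WAL volumes
# of the PostgreSQL instance.
pgbackrest --stanza=main --type=incr backup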

As the incremental backup of pgBackRest requires access to the data volume and the WAL volume, we need to run the Backup Container on the same Kubernetes Node as the PostgreSQL instance; we currently even run it inside the same Pod, in a separate Container.

Inside that container, a small API wraps pgBackRest and can be triggered by sending POST requests to it; this triggering is currently done using CronJobs, roughly as sketched below.
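
As an illustration, the trigger CronJob currently looks roughly like this (a minimal sketch; the Service name, port and endpoint path of the wrapper API are assumptions):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: trigger-backup
spec:
  schedule: "13 03 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: trigger
            image: curlimages/curl
            # POST to the pgBackRest wrapper API in the PostgreSQL Pod;
            # hostname, port and path are hypothetical
            args: ["-X", "POST", "http://postgres-primary:8081/backup?type=incr"]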

Downsides

  • Every PostgreSQL instance has multiple containers in the Pod: one to serve Postgres, one to serve a tiny wrapper around pgBackRest
  • Job logs only show successful backup triggers; the actual backup logs are part of the Backup Container
  • The Pod that will run the backup may run on a relatively old configuration; changing the backup configuration requires rescheduling the Pod, which may mean a failover of the PostgreSQL primary.

Proposed setup

Have a CronJob schedule a Pod that has the same Volumes as one of the Pods of the StatefulSet. This will allow the backup to use these Volumes.

However, which Volumes it needs is a runtime decision: we may want to run the backup on the Volumes connected to the primary, or we may want to back up using the Volumes of a replica. The primary/replica roles may change at any moment, as auto-failover of the PostgreSQL primary is part of the solution.

Currently, this is not possible, as I cannot find any way in the CronJob spec to use information from the k8s API.

What does work, but is not very nice:

  • Use a CronJob that schedules a Job
  • This Job queries the k8s API and schedules another Job

For example, this is what we can do to have a Job create another Job using this runtime information:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: schedule-backup
spec:
  schedule: "13 03 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup-trigger
            image: bitnami/kubectl
            command:
            - sh
            - -c
            - |
              # Look up the name of the current primary Pod at run time
              PRIMARYPOD=$(kubectl get pods -l cluster-name=<NAME>,role=master -o custom-columns=":metadata.name" --no-headers)
              # Create a Job that mounts the PVC belonging to that Pod;
              # note: the __JOB__ terminator may not be indented
              kubectl apply -f - <<__JOB__
              apiVersion: batch/v1
              kind: Job
              metadata:
                name: test
              spec:
                template:
                  spec:
                    volumes:
                    - name: storage-volume
                      persistentVolumeClaim:
                        claimName: data-volume-${PRIMARYPOD}
                    [...]
              __JOB__

The above may be better served by an Operator instead of just a CronJob, but I'm wondering if anyone has a solution to this.

Downsides

  • Job logs only show successful backup triggers; the actual backup logs are part of another Job
  • The Job requires permissions to schedule Pods, requiring yet another Role/RoleBinding (see the sketch after this list)
  • Using heredocs in Bash makes things harder to read/parse/understand
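
For completeness, a minimal sketch of the extra Role/RoleBinding the trigger Job would need (the names and the ServiceAccount are assumptions):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: backup-trigger
rules:
# needed to look up the current primary Pod by label
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
# needed to create the actual backup Job
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: backup-trigger
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: backup-trigger
subjects:
- kind: ServiceAccount
  name: backup-trigger   # assumed ServiceAccount used by the trigger Job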

Summary

Long story, but these are the constraints we want to satisfy:

  • Run a backup of a PostgreSQL database
  • These are multi-TB databases
  • Therefore, incremental backups are required
  • Therefore, we need to mount the same PVs as an already running Pod
  • Therefore, we need to run a Pod (or container) on the same K8s Node as the PV
  • We want to be able to express this in a CronJob spec, instead of having to do runtime kubernetes api calls
-- Feike Steenbergen
kubernetes
kubernetes-cronjob
kubernetes-statefulset

1 Answer

1/2/2020

Well, the simple and short answer would be: you generally can't.

But let's be creative for a while :)

A very limited number of storage backends support RWX (ReadWriteMany) access, and in most cases these are the slower ones, which you want to avoid for a database. This means that unless you run your backup wrapper as a sidecar (which you do now), you can't really access the PVs from a different Pod, period.
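
To illustrate: cross-Pod access would require a claim like the hypothetical one below, and most block-storage backends will simply never bind it (the PVC stays Pending):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
  - ReadWriteMany   # only a few (mostly file/NFS-like) backends support this
  resources:
    requests:
      storage: 1Ti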

I'd probably stick to your original approach, with some tweaks (like making sure that you never bring down the primary due to a backup / config change).

On an up-to-date K8s cluster and a supported infrastructure provider, you could probably look into VolumeSnapshots for snapshot-based backups, potentially using the snapshot as a source to spin up an incremental backup Job. Sounds a bit convoluted though.
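
A minimal sketch of that idea, assuming a CSI driver with snapshot support and a VolumeSnapshotClass named csi-snapclass (both assumptions, as is the <primary-pod> placeholder):

apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: data-volume-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: data-volume-<primary-pod>   # PVC of the primary
---
# a new PVC can then be pre-populated from the snapshot
# and mounted by the backup Job
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: backup-source
spec:
  dataSource:
    name: data-volume-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti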

You could also run a backup-dedicated Postgres replica Pod with limited resources (taking no live traffic) and embed the backup logic only in that Pod.

-- Radek 'Goblin' Pieczonka
Source: StackOverflow