In a Kubernetes cluster, I'd like to be able to schedule a Job (via a CronJob) that mounts the same Volumes as one Pod of a given StatefulSet. Which Pod that is, is a runtime decision, depending on the labels set on the Pods at the time the Job is scheduled.
I expect many people will wonder why, so here's a description of what we're doing and what we're trying to achieve:
We have a StatefulSet which serves a PostgreSQL database (one primary, multiple replicas). We want to be able to create a backup from one of the Pods of the StatefulSet.
For PostgreSQL we can already do backups over the network with pg_basebackup. However, we are running multi-TB PostgreSQL databases, which means full streaming backups (with pg_basebackup) are not feasible.
We currently use pgBackRest to back up the databases, which allows for incremental backups.
As the incremental backup of pgBackRest requires access to the data volume and the WAL volume, we need to run the backup container on the same Kubernetes Node as the PostgreSQL instance; we currently even run it inside the same Pod, in a separate container.
Inside that container, a small API wraps around pgBackRest and can be triggered by sending POST requests to it; this triggering is currently done using CronJobs.
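For context, the triggering CronJob is essentially just an HTTP client; a minimal sketch of that piece could look like the following (the service name, port and path of the wrapper API are assumptions for illustration, not our actual names):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: trigger-backup
spec:
  schedule: "13 03 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: trigger
              image: curlimages/curl
              # POST to the pgBackRest wrapper API running as a sidecar in the database Pod;
              # host, port and path are placeholders
              command: ["curl", "-X", "POST", "http://backup-api:8080/backup"]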
What we would like to do is have a CronJob schedule a Pod that has the same Volumes as one of the Pods of the StatefulSet, so the backup can use those Volumes.
However, which Volumes it needs is a runtime decision: we may want to run the backup on the Volumes connected to the primary, or we may want to back up using the Volumes of a replica. The primary/replica may change at any moment, as auto-failover of the PostgreSQL primary is part of the solution.
Currently, this is not possible, as I cannot find any way in the CronJob spec to use information from the k8s API.
What does work, but is not very nice:
For example, this is what we can do to have a Job create another Job using this runtime information:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: schedule-backup
spec:
  schedule: "13 03 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup-trigger
              image: bitnami/kubectl
              command:
                - sh
                - -c
                - |
                  PRIMARYPOD=$(kubectl get pods -l cluster-name=<NAME>,role=master -o custom-columns=":metadata.name" --no-headers)
                  kubectl apply -f - <<__JOB__
                  apiVersion: batch/v1
                  kind: Job
                  metadata:
                    name: test
                  spec:
                    template:
                      spec:
                        volumes:
                          - name: storage-volume
                            persistentVolumeClaim:
                              claimName: data-volume-${PRIMARYPOD}
                        [...]
                  __JOB__
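Note that for the backup-trigger container to be allowed to list Pods and create Jobs, the CronJob's Pod needs to run under a ServiceAccount with appropriate RBAC; a rough sketch (all names here are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: backup-trigger
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: backup-trigger
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: backup-trigger
subjects:
  - kind: ServiceAccount
    name: backup-trigger

The Pod template in the CronJob above would then set serviceAccountName accordingly.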
This may be better served by an Operator instead of just a CronJob, but I'm wondering if anyone has a solution to the problem described above.
Long story, but those are the constraints we want to satisfy.
Well, the simple and short answer would be: you generally can't.
But let's be creative for a while :)
Only a very limited number of storage backends support RWX (ReadWriteMany) access, and in most cases these are the slower ones you want to avoid for a database. This means that unless you run your backup wrapper as a sidecar (which you do now), you can't really access the PVs from a different Pod, period.
I'd probably stick with your original approach, with some tweaks (like making sure that you never bring down the primary due to a backup or config change).
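For reference, the "original approach" here means keeping pgBackRest as a sidecar in the StatefulSet's Pod template, so it shares the data and WAL volumes with PostgreSQL; a rough sketch (images and mount paths are assumptions, not your actual spec):

# Pod template inside the StatefulSet: both containers mount the same volumes
spec:
  containers:
    - name: postgres
      image: postgres:14
      volumeMounts:
        - name: data-volume
          mountPath: /var/lib/postgresql/data
        - name: wal-volume
          mountPath: /var/lib/postgresql/wal
    - name: pgbackrest-api
      image: example/pgbackrest-api   # hypothetical image exposing the HTTP wrapper around pgBackRest
      volumeMounts:
        - name: data-volume
          mountPath: /var/lib/postgresql/data
        - name: wal-volume
          mountPath: /var/lib/postgresql/wal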
On an up-to-date K8s cluster with a supported infrastructure provider, you could probably look into VolumeSnapshots for snapshot-based backups, potentially using the snapshot as the source to spin up an incremental backup Job. Sounds a bit convoluted, though.
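If your CSI driver supports snapshots, the building blocks would look roughly like this (snapshot class, PVC names and size are assumptions): a VolumeSnapshot of the data PVC, and a new PVC restored from it that a backup Job could then mount:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass        # assumed snapshot class
  source:
    persistentVolumeClaimName: data-volume-<NAME>-0
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: backup-source
spec:
  storageClassName: standard                    # assumed storage class
  dataSource:
    name: data-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Ti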
You could also run a backup-dedicated Postgres replica Pod with limited resources (serving no live traffic) and embed the backup logic only in that Pod.