I'm maintaining a Kubernetes cluster which includes two PostgreSQL servers in two different pods, a primary and a replica. The replica is synced from the primary via log shipping.
A glitch caused the log shipping to start failing, so the replica is no longer in sync with the primary.
The process for bringing a replica back into sync with the primary requires, amongst other things, stopping the postgres service on the replica, and this is where I'm having trouble.
It appears that Kubernetes is restarting the container as soon as I shut down the postgres service, which immediately restarts postgres again. I need the container running with the postgres service inside it stopped, to allow me to perform the next steps in fixing the broken replication.
How can I get Kubernetes to allow me to shut down the postgres service without restarting the container?
Further Details:
To stop the replica I run a shell on the replica pod via kubectl exec -it <pod name> -- /bin/sh, then run pg_ctl stop from the shell. I get the following response:
server shutting down
command terminated with exit code 137
and I'm kicked out of the shell.
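For what it's worth, watching the pod from a second terminal while running pg_ctl stop (pod name and namespace as in the describe output below) should make the restart visible as the RESTARTS counter increments:

kubectl get pod pgset-primary-1 -n qa -w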
When I run kubectl describe pod I see the following:
Name: pgset-primary-1
Namespace: qa
Priority: 0
Node: aks-nodepool1-95718424-0/10.240.0.4
Start Time: Fri, 09 Jul 2021 13:48:06 +1200
Labels: app=pgset-primary
controller-revision-hash=pgset-primary-6d7d65c8c7
name=pgset-replica
statefulset.kubernetes.io/pod-name=pgset-primary-1
Annotations: <none>
Status: Running
IP: 10.244.1.42
IPs:
IP: 10.244.1.42
Controlled By: StatefulSet/pgset-primary
Containers:
pgset-primary:
Container ID: containerd://bc00b4904ab683d9495ad020328b5033ecb00d19c9e5b12d22de18f828918455
Image: *****/crunchy-postgres:centos7-9.6.8-1.6.0
Image ID: docker.io/*****/crunchy-postgres@sha256:2850e00f9a619ff4bb6ff889df9bcb2529524ca8110607e4a7d9e36d00879057
Port: 5432/TCP
Host Port: 0/TCP
State: Running
Started: Sat, 06 Nov 2021 18:29:34 +1300
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 06 Nov 2021 18:28:09 +1300
Finished: Sat, 06 Nov 2021 18:29:18 +1300
Ready: True
Restart Count: 6
Limits:
cpu: 250m
memory: 512Mi
Requests:
cpu: 10m
memory: 256Mi
Environment:
PGHOST: /tmp
PG_PRIMARY_USER: primaryuser
PG_MODE: set
PG_PRIMARY_HOST: pgset-primary
PG_REPLICA_HOST: pgset-replica
PG_PRIMARY_PORT: 5432
[...]
ARCHIVE_TIMEOUT: 60
MAX_WAL_KEEP_SEGMENTS: 400
Mounts:
/backrestrepo from backrestrepo (rw)
/pgconf from pgbackrestconf (rw)
/pgdata from pgdata (rw)
/var/run/secrets/kubernetes.io/serviceaccount from pgset-sa-token-nh6ng (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
pgdata:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: pgdata-pgset-primary-1
ReadOnly: false
backrestrepo:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: backrestrepo-pgset-primary-1
ReadOnly: false
pgbackrestconf:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: pgbackrest-configmap
Optional: false
pgset-sa-token-nh6ng:
Type: Secret (a volume populated by a Secret)
SecretName: pgset-sa-token-nh6ng
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 88m (x3 over 3h1m) kubelet Back-off restarting failed container
Normal Pulled 88m (x7 over 120d) kubelet Container image "*****/crunchy-postgres:centos7-9.6.8-1.6.0" already present on machine
Normal Created 88m (x7 over 120d) kubelet Created container pgset-primary
Normal Started 88m (x7 over 120d) kubelet Started container pgset-primary
The events suggest the container was started by Kubernetes.
The pod has no liveness or readiness probes, so I don't know what would prompt Kubernetes to restart the container when I shut down the postgres service running within it.
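For reference, assuming the pod name and namespace from the describe output above, the restart policy in effect can be read straight off the pod spec:

kubectl get pod pgset-primary-1 -n qa -o jsonpath='{.spec.restartPolicy}'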
This happens because of the pod's restartPolicy. Stopping postgres causes the container's main process to exit (your describe output shows Last State: Terminated with Reason: Completed), so Kubernetes treats the container as finished and the kubelet restarts it according to the restart policy, which defaults to Always. If you do not want a new container to be created, you need to change the restart policy for these pods.
If this pod is part of a Deployment, look into kubectl explain deployment.spec.template.spec.restartPolicy
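As a minimal sketch of where that field lives (the pod name here is hypothetical and this is not your actual StatefulSet manifest, just an illustration of the field path):

apiVersion: v1
kind: Pod
metadata:
  name: pg-maintenance            # hypothetical name, for illustration only
spec:
  restartPolicy: Never            # valid values: Always (the default), OnFailure, Never
  containers:
  - name: postgres
    image: "*****/crunchy-postgres:centos7-9.6.8-1.6.0"   # image as shown (redacted) in the describe output above
    ports:
    - containerPort: 5432

For pods owned by a controller, the same field sits in the pod template, i.e. spec.template.spec.restartPolicy, which is what the kubectl explain path above points at.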