I have a Kubernetes cluster containing (amongst others) a JupyterHub pod and a PostgreSQL pod that serves as its database. Everything worked fine for months, until a recent incident in which a shared storage volume ran full; the resulting file system errors forced the connected Linux machines (including this cluster node) into a read-only state. That problem and all the other issues it caused have since been fixed; the nodes and pods all seem to start up fine, but the JupyterHub pod alone runs into a CrashLoopBackOff because, for some reason, it can no longer connect to the database service/pod.
Here are the logs from the relevant pods that I've gathered so far. I've redacted the username and password for obvious reasons, but I have checked that they match between the pods. As mentioned, I haven't changed the configuration, and the system ran fine before the incident.
kubectl logs <jupyterhub> | tail
[I 2022-01-22 08:04:28.905 JupyterHub app:2349] Running JupyterHub version 1.3.0
[I 2022-01-22 08:04:28.906 JupyterHub app:2379] Using Authenticator: builtins.MyAuthenticator
[I 2022-01-22 08:04:28.906 JupyterHub app:2379] Using Spawner: builtins.MySpawner
[I 2022-01-22 08:04:28.906 JupyterHub app:2379] Using Proxy: jupyterhub.proxy.ConfigurableHTTPProxy-1.3.0
[I 2022-01-22 08:04:28.981 JupyterHub app:1465] Writing cookie_secret to /jhub/jupyterhub_cookie_secret
[E 2022-01-22 08:04:39.048 JupyterHub app:1597] Failed to connect to db: postgresql://[redacted]:[redacted]@postgres:1500
[C 2022-01-22 08:04:39.049 JupyterHub app:1601] If you recently upgraded JupyterHub, try running
jupyterhub upgrade-db
to upgrade your JupyterHub database schema
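For what it's worth, a basic reachability check from inside the cluster, run from a throwaway pod reusing the postgres image (which ships pg_isready; the pod name is arbitrary), would look like this:
# "postgres:1500 - accepting connections" would mean the network path through the service is fine
kubectl run -n jhub -it --rm pgcheck --image=postgres --restart=Never -- pg_isready -h postgres -p 1500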
The database itself seems to be running fine, though:
kubectl logs <postgres> | tail
2022-01-19 13:47:50.245 UTC [1] LOG: starting PostgreSQL 14.1 (Debian 14.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2022-01-19 13:47:50.245 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 1500
2022-01-19 13:47:50.245 UTC [1] LOG: listening on IPv6 address "::", port 1500
2022-01-19 13:47:50.380 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.1500"
2022-01-19 13:47:50.494 UTC [62] LOG: database system was shut down at 2022-01-19 13:47:49 UTC
2022-01-19 13:47:50.535 UTC [1] LOG: database system is ready to accept connections
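To double-check that Postgres really accepts connections on port 1500 and that the credentials work, I can also exec into the pod and run psql locally (this assumes the stock postgres image's default pg_hba.conf, which allows local socket connections without a password):
# a returned row of "1" means the server itself is healthy
kubectl exec <postgres> -- sh -c 'psql -U "$POSTGRES_USER" -p 1500 -c "SELECT 1;"'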
The service also looks healthy:
kubectl describe service postgres
Name: postgres
Namespace: jhub
Labels: <none>
Annotations: <none>
Selector: app=postgres
Type: ClusterIP
IP Families: <none>
IP: 10.100.209.184
IPs: 10.100.209.184
Port: <unset> 1500/TCP
TargetPort: 1500/TCP
Endpoints: 10.0.0.139:1500
Session Affinity: None
Events: <none>
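The Endpoints address (10.0.0.139:1500) should correspond to the actual postgres pod, which is easy to confirm:
# the IP column should match the Endpoints entry of the service above
kubectl get pod -n jhub -l app=postgres -o wide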
For reference, here are the relevant YAMLs.
postgres.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres
          ports:
            - containerPort: 1500
          env:
            - name: POSTGRES_USER
              value: <redacted>
            - name: POSTGRES_PASSWORD
              value: <redacted>
            - name: PGPORT
              value: '1500'
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  selector:
    app: postgres
  ports:
    - protocol: TCP
      port: 1500
      targetPort: 1500
The db_url in jupyterhub_config.py also doesn't look unusual:
postgres_passwd = os.getenv('POSTGRES_PASSWORD')
c.JupyterHub.db_url = f'postgresql://redacted:{postgres_passwd}@postgres:1500'
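Since the URL uses the bare service name postgres, the hub relies on cluster DNS to resolve it. A resolution check from a temporary pod in the same namespace (busybox:1.28 is simply the image the Kubernetes DNS debugging docs use for nslookup) would be:
# should return the ClusterIP 10.100.209.184; a timeout or "can't resolve" would point at cluster DNS
kubectl run -n jhub -it --rm dnstest --image=busybox:1.28 --restart=Never -- nslookup postgres.jhub.svc.cluster.local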
This was everything I deemed relevant for now; if there's more you need, let me know.
I'm rather stumped, mostly because, as I said, everything ran fine before the incident and the overall cluster configuration hasn't changed. All the other issues had some outside cause through which they could be identified and fixed, but this one seems to be contained entirely within the cluster.
Thanks for reading and I appreciate any help or hints.
Edit: list of actions taken so far:
kubectl get endpoints kube-dns --namespace=kube-system
NAME ENDPOINTS AGE
kube-dns 260d
kubectl describe endpoints kube-dns --namespace=kube-system
Name: kube-dns
Namespace: kube-system
Labels: k8s-app=kube-dns
kubernetes.io/cluster-service=true
kubernetes.io/name=KubeDNS
Annotations: endpoints.kubernetes.io/last-change-trigger-time: 2022-01-10T13:09:07Z
Subsets:
Addresses: <none>
NotReadyAddresses: 10.0.6.78,10.0.9.76
Ports:
Name Port Protocol
---- ---- --------
dns-tcp 53 TCP
dns 53 UDP
metrics 9153 TCP
Events: <none>
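Both addresses appearing only under NotReadyAddresses suggests the CoreDNS pods are failing their readiness probes. A quick overview of their state and the nodes they run on:
# READY 0/1 would confirm the failing readiness probes
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide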
kubectl logs --namespace=kube-system -l k8s-app=kube-dns
Error from server: Get "https://141.83.188.131:10250/containerLogs/kube-system/coredns-74ff55c5b-n76km/coredns?tailLines=10": dial tcp 141.83.188.131:10250: connect: connection refused
kubectl describe --namespace=kube-system -l k8s-app=kube-dns
error: You must specify the type of resource to describe. Use "kubectl api-resources" for a complete list of supported resources.
kubectl describe pod --namespace=kube-system -l k8s-app=kube-dns
Name: coredns-74ff55c5b-n76km
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: k8s-worker-08/141.83.188.131
Start Time: Mon, 06 Dec 2021 01:40:35 +0100
Labels: k8s-app=kube-dns
pod-template-hash=74ff55c5b
Annotations: <none>
Status: Running
IP: 10.0.6.78
IPs:
IP: 10.0.6.78
Controlled By: ReplicaSet/coredns-74ff55c5b
Containers:
coredns:
Container ID: docker://d7239ff0f11295180ebff1434bc8a0dcb357a5d55128e8cf02b2b821822da6b3
Image: k8s.gcr.io/coredns:1.7.0
Image ID: docker-pullable://k8s.gcr.io/coredns@sha256:73ca82b4ce829766d4f1f10947c3a338888f876fbed0540dc849c89ff256e90c
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Running
Started: Mon, 06 Dec 2021 01:40:41 +0100
Ready: True
Restart Count: 0
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-gzml9 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady True
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-gzml9:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-gzml9
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly op=Exists
node-role.kubernetes.io/control-plane:NoSchedule
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Name: coredns-74ff55c5b-vv8v7
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: k8s-worker-10/141.83.188.161
Start Time: Mon, 06 Dec 2021 01:49:21 +0100
Labels: k8s-app=kube-dns
pod-template-hash=74ff55c5b
Annotations: <none>
Status: Running
IP: 10.0.9.76
IPs:
IP: 10.0.9.76
Controlled By: ReplicaSet/coredns-74ff55c5b
Containers:
coredns:
Container ID: docker://986105a2646ecdadf6fadbd700b9fdbeb578325603ee8353e5283b2b65967c23
Image: k8s.gcr.io/coredns:1.7.0
Image ID: docker-pullable://k8s.gcr.io/coredns@sha256:73ca82b4ce829766d4f1f10947c3a338888f876fbed0540dc849c89ff256e90c
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Running
Started: Mon, 06 Dec 2021 01:49:27 +0100
Ready: True
Restart Count: 0
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-gzml9 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady True
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-gzml9:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-gzml9
Optional: false
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: CriticalAddonsOnly op=Exists
node-role.kubernetes.io/control-plane:NoSchedule
node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
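Both pods report Ready: False, and the earlier connection refused on port 10250 of k8s-worker-08 also makes me suspicious of the kubelet on that node, which is worth cross-checking against the node status:
# a NotReady node here would explain why logs can't be fetched from it
kubectl get nodes -o wide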
After tracing the issue back to the kube-dns pods, I restarted them. This fixed the issue, though I still don't know why it occurred.
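For anyone running into the same thing: restarting them boils down to deleting the CoreDNS pods and letting the ReplicaSet recreate them, e.g.:
# the ReplicaSet brings replacement pods up immediately
kubectl delete pod -n kube-system -l k8s-app=kube-dns
(kubectl rollout restart deployment coredns -n kube-system should achieve the same, assuming the default deployment name.)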