JupyterHub pod no longer connects to Postgres pod

1/22/2022

I have a Kubernetes cluster containing (amongst others) a JupyterHub pod and a PostgreSQL pod that serves as its database. Everything worked fine for months until a recent incident in which a shared storage volume ran full; the resulting file-system warnings forced the connected Linux machines (including this cluster node) into a read-only state. That problem and every other issue stemming from it have since been fixed, and the nodes and pods all seem to start up fine, but the JupyterHub pod alone ends up in a CrashLoopBackOff because, for some reason, it can no longer connect to the database service/pod.

Here are the logs I've gathered so far from the relevant pods. I've redacted the username and password for obvious reasons, but I have checked that they match between the pods. As mentioned, I haven't changed the configuration, and the system ran fine before the incident.
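
(One way to compare the credentials without digging through the YAML by hand; kubectl resolves deploy/postgres to the single pod behind that Deployment, and the hub's own Deployment can be queried the same way:)

kubectl -n jhub set env deployment/postgres --list | grep POSTGRES
kubectl -n jhub exec deploy/postgres -- env | grep POSTGRES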

kubectl logs <jupyterhub> | tail

[I 2022-01-22 08:04:28.905 JupyterHub app:2349] Running JupyterHub version 1.3.0
[I 2022-01-22 08:04:28.906 JupyterHub app:2379] Using Authenticator: builtins.MyAuthenticator
[I 2022-01-22 08:04:28.906 JupyterHub app:2379] Using Spawner: builtins.MySpawner
[I 2022-01-22 08:04:28.906 JupyterHub app:2379] Using Proxy: jupyterhub.proxy.ConfigurableHTTPProxy-1.3.0
[I 2022-01-22 08:04:28.981 JupyterHub app:1465] Writing cookie_secret to /jhub/jupyterhub_cookie_secret
[E 2022-01-22 08:04:39.048 JupyterHub app:1597] Failed to connect to db: postgresql://[redacted]:[redacted]@postgres:1500
[C 2022-01-22 08:04:39.049 JupyterHub app:1601] If you recently upgraded JupyterHub, try running
        jupyterhub upgrade-db
    to upgrade your JupyterHub database schema

The database itself seems to be running fine though.

kubectl logs <postgres> | tail

2022-01-19 13:47:50.245 UTC [1] LOG:  starting PostgreSQL 14.1 (Debian 14.1-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2022-01-19 13:47:50.245 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 1500
2022-01-19 13:47:50.245 UTC [1] LOG:  listening on IPv6 address "::", port 1500
2022-01-19 13:47:50.380 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.1500"
2022-01-19 13:47:50.494 UTC [62] LOG:  database system was shut down at 2022-01-19 13:47:49 UTC
2022-01-19 13:47:50.535 UTC [1] LOG:  database system is ready to accept connections
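
If needed, the server can also be checked from inside the pod itself, bypassing the Service and cluster DNS entirely (pg_isready ships in the official postgres image; a healthy server answers "accepting connections"):

kubectl -n jhub exec deploy/postgres -- pg_isready -h 127.0.0.1 -p 1500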

And so does the service:

kubectl describe service postgres

Name:              postgres
Namespace:         jhub
Labels:            <none>
Annotations:       <none>
Selector:          app=postgres
Type:              ClusterIP
IP Families:       <none>
IP:                10.100.209.184
IPs:               10.100.209.184
Port:              <unset>  1500/TCP
TargetPort:        1500/TCP
Endpoints:         10.0.0.139:1500
Session Affinity:  None
Events:            <none>
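
To separate name resolution from plain reachability, the Service can also be tested by its ClusterIP from a throwaway pod in the same namespace (<user> and <password> are placeholders; the IP is the one shown above). If this succeeds while the hostname variant fails, the problem is DNS rather than the Service or the database:

kubectl -n jhub run -it --rm --restart=Never pgtest --image=postgres -- psql 'postgresql://<user>:<password>@10.100.209.184:1500' -c 'SELECT 1'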

For reference, here are the relevant YAMLs.

postgres.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres
          ports:
            - containerPort: 1500
          env:
            - name: POSTGRES_USER
              value: <redacted>
            - name: POSTGRES_PASSWORD
              value: <redacted>
            - name: PGPORT
              value: '1500'

---

apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  selector:
    app: postgres
  ports:
    - protocol: TCP
      port: 1500
      targetPort: 1500

The DB URL in jupyterhub_config.py also doesn't look unusual:

postgres_passwd = os.getenv('POSTGRES_PASSWORD')
c.JupyterHub.db_url = f'postgresql://redacted:{postgres_passwd}@postgres:1500'
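
The same connection can be reproduced in isolation with SQLAlchemy (which is what the hub uses under the hood). A minimal sketch, assuming it is run from an environment that has psycopg2 installed and can resolve the cluster DNS name:

import os
from sqlalchemy import create_engine, text

# Same URL as in jupyterhub_config.py; the username stays redacted here.
postgres_passwd = os.getenv('POSTGRES_PASSWORD')
engine = create_engine(f'postgresql://redacted:{postgres_passwd}@postgres:1500')
with engine.connect() as conn:  # raises OperationalError if the DB is unreachable
    print(conn.execute(text('SELECT 1')).scalar())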

That's everything I deemed relevant for now; if there's anything more you need, let me know.

I'm rather stumped, mostly because, as I said, everything ran fine before the incident and the overall cluster configuration hasn't changed. All the other issues had some outside factor as their cause and could be identified and fixed through it, but this one seems to be contained entirely within the cluster.

Thanks for reading and I appreciate any help or hints.

Edit: List of actions taken so far:

  • Restarting the entire node had already been tried.
  • Started a Postgres client in the same namespace and tried connecting with the URI from the jupyterhub_config; this gave me "could not translate host name "postgres" to address: Temporary failure in name resolution".
  • Working through the DNS troubleshooting guide, I found the following issues (a readiness-probe check is sketched after the output below):
kubectl get endpoints kube-dns --namespace=kube-system
NAME       ENDPOINTS   AGE
kube-dns               260d
kubectl describe endpoints kube-dns --namespace=kube-system
Name:         kube-dns
Namespace:    kube-system
Labels:       k8s-app=kube-dns
              kubernetes.io/cluster-service=true
              kubernetes.io/name=KubeDNS
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2022-01-10T13:09:07Z
Subsets:
  Addresses:          <none>
  NotReadyAddresses:  10.0.6.78,10.0.9.76
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    dns-tcp  53    TCP
    dns      53    UDP
    metrics  9153  TCP

Events:  <none>
kubectl logs --namespace=kube-system -l k8s-app=kube-dns
Error from server: Get "https://141.83.188.131:10250/containerLogs/kube-system/coredns-74ff55c5b-n76km/coredns?tailLines=10": dial tcp 141.83.188.131:10250: connect: connection refused
kubectl describe pod --namespace=kube-system -l k8s-app=kube-dns
Name:                 coredns-74ff55c5b-n76km
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 k8s-worker-08/141.83.188.131
Start Time:           Mon, 06 Dec 2021 01:40:35 +0100
Labels:               k8s-app=kube-dns
                      pod-template-hash=74ff55c5b
Annotations:          <none>
Status:               Running
IP:                   10.0.6.78
IPs:
  IP:           10.0.6.78
Controlled By:  ReplicaSet/coredns-74ff55c5b
Containers:
  coredns:
    Container ID:  docker://d7239ff0f11295180ebff1434bc8a0dcb357a5d55128e8cf02b2b821822da6b3
    Image:         k8s.gcr.io/coredns:1.7.0
    Image ID:      docker-pullable://k8s.gcr.io/coredns@sha256:73ca82b4ce829766d4f1f10947c3a338888f876fbed0540dc849c89ff256e90c
    Ports:         53/UDP, 53/TCP, 9153/TCP
    Host Ports:    0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    State:          Running
      Started:      Mon, 06 Dec 2021 01:40:41 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-gzml9 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   True
  PodScheduled      True
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-gzml9:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-gzml9
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     CriticalAddonsOnly op=Exists
                 node-role.kubernetes.io/control-plane:NoSchedule
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>


Name:                 coredns-74ff55c5b-vv8v7
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 k8s-worker-10/141.83.188.161
Start Time:           Mon, 06 Dec 2021 01:49:21 +0100
Labels:               k8s-app=kube-dns
                      pod-template-hash=74ff55c5b
Annotations:          <none>
Status:               Running
IP:                   10.0.9.76
IPs:
  IP:           10.0.9.76
Controlled By:  ReplicaSet/coredns-74ff55c5b
Containers:
  coredns:
    Container ID:  docker://986105a2646ecdadf6fadbd700b9fdbeb578325603ee8353e5283b2b65967c23
    Image:         k8s.gcr.io/coredns:1.7.0
    Image ID:      docker-pullable://k8s.gcr.io/coredns@sha256:73ca82b4ce829766d4f1f10947c3a338888f876fbed0540dc849c89ff256e90c
    Ports:         53/UDP, 53/TCP, 9153/TCP
    Host Ports:    0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    State:          Running
      Started:      Mon, 06 Dec 2021 01:49:27 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-gzml9 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   True
  PodScheduled      True
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-gzml9:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-gzml9
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     CriticalAddonsOnly op=Exists
                 node-role.kubernetes.io/control-plane:NoSchedule
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
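
Both CoreDNS pods report ContainersReady True but Ready False, so it is the readiness probe (the http-get on :8181/ready listed above) that is failing. To confirm that by hand, the endpoint can be curled from a cluster node using the pod IPs above (whether they are reachable depends on where the command is run):

curl -v http://10.0.6.78:8181/ready
curl -v http://10.0.9.76:8181/ready
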
-- Suthek
jupyterhub
kubernetes
postgresql

1 Answer

1/22/2022

After tracing the issue back to the kube-dns pods, I restarted them. This fixed the issue, though I still don't know why it occurred.
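
For reference, "restarting" here just means recreating the CoreDNS pods so their Deployment spins up fresh ones; either of the following should do it (the Deployment name is inferred from the ReplicaSet shown in the question):

kubectl -n kube-system rollout restart deployment coredns
kubectl -n kube-system delete pod -l k8s-app=kube-dns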

-- Suthek
Source: StackOverflow