Setting up a readiness, liveness or startup probe

1/22/2020

I'm having difficulty understanding which would be best for my situation and how to actually implement it.

In a nutshell, the problem is this:

  • I'm spinning up my DB (Postgres), BE (Django), and FE (React) deployments with Skaffold
  • About 50% of the time the BE spins up before the DB
  • One of the first things Django tries to do is connect to the DB
  • It only tries once (by design and can't be changed), if it can't, it fails and the application is broken

  • Thus, I need to make sure every single time I spin up my deployments, the DB deployment is running before starting the BE deployment

I came across readiness, liveness, and starup probes. I've read it a couple times and readiness probes sound like what I need: I don't want the BE deployment to start until the DB deployment is ready to accept connections.

I guess I'm not understanding how to set it up. This is what I've tried, but I still run into instances where one is being loaded before another.

postgres.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      component: postgres
  template:
    metadata:
      labels:
        component: postgres
    spec:
      containers:
        - name: postgres
          image: testappcontainers.azurecr.io/postgres
          ports:
            - containerPort: 5432
          env: 
            - name: POSTGRES_DB
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGDATABASE
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGUSER
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGPASSWORD
            - name: POSTGRES_INITDB_ARGS
              value: "-A md5"
          volumeMounts:
            - name: postgres-storage
              mountPath: /var/lib/postgresql/data
              subPath: postgres
      volumes:
        - name: postgres-storage
          persistentVolumeClaim:
            claimName: postgres-storage
---
apiVersion: v1
kind: Service
metadata:
  name: postgres-cluster-ip-service
spec:
  type: ClusterIP
  selector:
    component: postgres
  ports:
    - port: 1423
      targetPort: 5432

api.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      component: api
  template:
    metadata:
      labels:
        component: api
    spec:
      containers:
        - name: api
          image: testappcontainers.azurecr.io/testapp-api
          ports:
            - containerPort: 5000
          env:
            - name: PGUSER
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGUSER
            - name: PGHOST
              value: postgres-cluster-ip-service
            - name: PGPORT
              value: "1423"
            - name: PGDATABASE
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGDATABASE
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: PGPASSWORD
            - name: SECRET_KEY
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: SECRET_KEY
            - name: DEBUG
              valueFrom:
                secretKeyRef:
                  name: testapp-secrets
                  key: DEBUG
          readinessProbe:
            httpGet:
              host: postgres-cluster-ip-service
              port: 1423
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 2
---
apiVersion: v1
kind: Service
metadata:
  name: api-cluster-ip-service
spec:
  type: ClusterIP
  selector:
    component: api
  ports:
    - port: 5000
      targetPort: 5000

client.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: client-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      component: client
  template:
    metadata:
      labels:
        component: client
    spec:
      containers:
        - name: client
          image: testappcontainers.azurecr.io/testapp-client
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet:
              path: api-cluster-ip-service
              port: 5000
            initialDelaySeconds: 10
            periodSeconds: 5
            timeoutSeconds: 2
---
apiVersion: v1
kind: Service
metadata:
  name: client-cluster-ip-service
spec:
  type: ClusterIP
  selector:
    component: client
  ports:
    - port: 3000
      targetPort: 3000

I don't think the ingress.yaml and the skaffold.yaml will be helpful, but let me know if I should add those.

So what am I doing wrong here?


Edit:

So I've tried out a few things based on David Maze's response. This helped me understand what is going on better, but I am still running into issues I'm not quite understanding how to resolve.

The first problem is that even with a default restartPolicy: Always, and even though Django fails, the Pods themselves don't fail. The Pods think they are perfectly healthy even though Django has failed.

The second problem is that apparently the Pods need to be made aware of Django's status. That is the part I'm not quite wrapping my brain around, particularly should probes be checking the status of other deployments or themselves?

Yesterday my thinking was the former, but today I'm thinking it is the latter: the Pod needs to know the program contained in it has failed. However, everything I've tried just results in a failed probe, connection refused, etc.:

# referring to itself
host: /health
port: 5000

host: /healthz
port: 5000

host: /api
port: 5000

host: /
port: 5000

host: /api-cluster-ip-service
port: 5000

host: /api-deployment
port: 5000

# referring to the DB deployment
host: /health
port: 1423 #or 5432

host: /healthz
port: 1423 #or 5432

host: /api
port: 1423 #or 5432

host: /
port: 1423 #or 5432

host: /postgres-cluster-ip-service
port: 1423 #or 5432

host: /postgres-deployment
port: 1423 #or 5432

So apparently I'm setting up the probe wrong, despite it being a "super-easy" implementation (as a few blogs have described it). For example, the /health and /healthz routes: are these built into Kubernetes or do these need to be setup? Rereading the docs to hopefully clarify this.

-- eox.dev
docker
kubernetes
readinessprobe
skaffold

2 Answers

1/22/2020

Actually, think I might have sorted it out.

Part of the problem is that even though restartPolicy: Always is the default, the Pods are not aware the Django has failed so it thinks they are healthy.

My thinking was wrong in that I originally assumed I needed to refer to the DB deployment to see if it had start before starting the API deployment. Instead I needed to check if Django had failed and redeploy it if it had.

Doing the following accomplished this for me:

livenessProbe:
  tcpSocket:
    port: 5000
  initialDelaySeconds: 2
  periodSeconds: 2
readinessProbe:
  tcpSocket:
    port: 5000
  initialDelaySeconds: 2
  periodSeconds: 2

I'm learning Kubernetes so please correct me if there is a better way to do this or if this is just plain wrong. I just know it accomplishes what I want.

-- eox.dev
Source: StackOverflow

1/22/2020

You're just not waiting long enough.

The deployment artifacts you're showing here look pretty normal. It's even totally normal for your application to fail fast if it can't reach the database, say because it hasn't started up yet. Every pod has a restart policy, though, which defaults to Always. So, when the pod fails, Kubernetes will restart it; and when it fails again, it will get restarted again; and when it keeps failing, Kubernetes will pause tens of seconds between restarts (the dreaded CrashLoopBackOff state).

Eventually if you're in this wait-and-restart loop, the database will actually come up, and then Kubernetes will restart your application pods, at which point the application will start up normally.

The only thing that I'd change here is that your readiness probes for the two pods should probe the services themselves, not some other service. You probably want the path to be something like / or /healthz or something else that is an actual HTTP request path in the service. That can return 503 Service Unavailable if it detects its dependency isn't available, or you can just crash. Just crashing is fine.

This is a totally normal setup in Kubernetes; there's no way to more directly say that pod A can't start until service B is ready. The flip side of this is that the pattern is actually pretty generic: if your application crashes and restarts whenever it can't reach its database, it doesn't matter if the database is hosted outside the cluster, or if it crashes sometime well after startup time; the same logic will try to restart your application until it works again.

-- David Maze
Source: StackOverflow