kubernetes Pod's readinessProbe errored but endpoint not removed from Service

2/23/2019

I'm running Spinnaker on Kubernetes 1.10.111. One of the Spinnaker services is a Pod running a service called Clouddriver. This Pod was running fine, but then the readinessProbe started erroring continuously. Kubernetes docs say

readinessProbe: Indicates whether the Container is ready to service requests. If the readiness probe fails, the endpoints controller removes the Pod’s IP address from the endpoints of all Services that match the Pod.

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes

But this Pod's IP is still in the Service's endpoints. Why?

Clouddriver Pod YAML

kubectl -n spinnaker-test get pods spin-clouddriver-5559d44484-mp8q9 -o yaml

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: spotify.backend-service
  creationTimestamp: 2019-02-15T20:46:38Z
  generateName: spin-clouddriver-5559d44484-
  labels:
    app: spin
    app.kubernetes.io/managed-by: halyard
    app.kubernetes.io/name: clouddriver
    app.kubernetes.io/part-of: spinnaker
    app.kubernetes.io/version: 1.12.1
    cluster: spin-clouddriver
    pod-template-hash: "1115800040"
  name: spin-clouddriver-5559d44484-mp8q9
  namespace: spinnaker-test
  ownerReferences:
  - apiVersion: extensions/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: spin-clouddriver-5559d44484
    uid: ce79561c-3161-11e9-acdf-42010a800082
  resourceVersion: "53541277"
  selfLink: /api/v1/namespaces/spinnaker-test/pods/spin-clouddriver-5559d44484-mp8q9
  uid: caa66d7c-3162-11e9-acdf-42010a800082
spec:
  containers:
  - env:
    - name: JAVA_OPTS
      value: -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2
    - name: SPRING_PROFILES_ACTIVE
      value: local
    image: gcr.io/spinnaker-marketplace/clouddriver:4.3.1-20190130095322
    imagePullPolicy: IfNotPresent
    lifecycle: {}
    name: clouddriver
    ports:
    - containerPort: 7002
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - wget
        - --no-check-certificate
        - --spider
        - -q
        - http://localhost:7002/health
      failureThreshold: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      limits:
        cpu: "20"
        memory: 5000Mi
      requests:
        cpu: "20"
        memory: 5000Mi
    securityContext:
      allowPrivilegeEscalation: false
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /opt/spinnaker/config
      name: spin-clouddriver-files-1952526246
    - mountPath: /home/halyard/.hal/k8s-spinnaker/staging/dependencies
      name: spin-clouddriver-files-1757773194
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-w2lt5
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: gke-production-us-ce-terraform-201812-d63606d6-9vq9
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 720
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: spin-clouddriver-files-1952526246
    secret:
      defaultMode: 420
      secretName: spin-clouddriver-files-1952526246
  - name: spin-clouddriver-files-1757773194
    secret:
      defaultMode: 420
      secretName: spin-clouddriver-files-1757773194
  - name: default-token-w2lt5
    secret:
      defaultMode: 420
      secretName: default-token-w2lt5
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2019-02-15T20:46:38Z
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: 2019-02-15T20:53:40Z
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: 2019-02-15T20:46:38Z
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://3509b48511b1ea7bc97812cb82831c559d9410cb9eaaa26b4f492d881603fb31
    image: gcr.io/spinnaker-marketplace/clouddriver:4.3.1-20190130095322
    imageID: docker-pullable://gcr.io/spinnaker-marketplace/clouddriver@sha256:466228b97b8c4a61a0270c53ae4c397eb04bc3661bc4f1ee9ef4d5fce70d187d
    lastState: {}
    name: clouddriver
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2019-02-15T20:47:26Z
  hostIP: 10.178.32.98
  phase: Running
  podIP: 10.179.34.24
  qosClass: Guaranteed
  startTime: 2019-02-15T20:46:38Z

Describing the Pod shows the readinessProbe has been continuously erroring for over a day.

kubectl -n spinnaker-test describe pods spin-clouddriver-5559d44484-mp8q9

Name:           spin-clouddriver-5559d44484-mp8q9
Namespace:      spinnaker-test
Node:           gke-production-us-ce-terraform-201812-d63606d6-9vq9/10.178.32.98
Start Time:     Fri, 15 Feb 2019 15:46:38 -0500
Labels:         app=spin
                app.kubernetes.io/managed-by=halyard
                app.kubernetes.io/name=clouddriver
                app.kubernetes.io/part-of=spinnaker
                app.kubernetes.io/version=1.12.1
                cluster=spin-clouddriver
                pod-template-hash=1115800040
Annotations:    kubernetes.io/psp=spotify.backend-service
Status:         Running
IP:             10.179.34.24
Controlled By:  ReplicaSet/spin-clouddriver-5559d44484
Containers:
  clouddriver:
    Container ID:   docker://3509b48511b1ea7bc97812cb82831c559d9410cb9eaaa26b4f492d881603fb31
    Image:          gcr.io/spinnaker-marketplace/clouddriver:4.3.1-20190130095322
    Image ID:       docker-pullable://gcr.io/spinnaker-marketplace/clouddriver@sha256:466228b97b8c4a61a0270c53ae4c397eb04bc3661bc4f1ee9ef4d5fce70d187d
    Port:           7002/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Fri, 15 Feb 2019 15:47:26 -0500
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     20
      memory:  5000Mi
    Requests:
      cpu:      20
      memory:   5000Mi
    Readiness:  exec [wget --no-check-certificate --spider -q http://localhost:7002/health] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      JAVA_OPTS:               -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2
      SPRING_PROFILES_ACTIVE:  local
    Mounts:
      /home/halyard/.hal/k8s-spinnaker/staging/dependencies from spin-clouddriver-files-1757773194 (rw)
      /opt/spinnaker/config from spin-clouddriver-files-1952526246 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-w2lt5 (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          True
  PodScheduled   True
Volumes:
  spin-clouddriver-files-1952526246:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  spin-clouddriver-files-1952526246
    Optional:    false
  spin-clouddriver-files-1757773194:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  spin-clouddriver-files-1757773194
    Optional:    false
  default-token-w2lt5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-w2lt5
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                From                                                          Message
  ----     ------     ----               ----                                                          -------
  Warning  Unhealthy  3m (x321 over 1d)  kubelet, gke-production-us-ce-terraform-201812-d63606d6-9vq9  Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded

But Service still has the Pod's IP of 10.179.34.24 in its Endpoints.

kubectl -n spinnaker-test describe services spin-clouddriver

Name:              spin-clouddriver
Namespace:         spinnaker-test
Labels:            app=spin
                   cluster=spin-clouddriver
Annotations:       kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"spin","cluster":"spin-clouddriver"},"name":"spin-clouddriver","namesp...
Selector:          app=spin,cluster=spin-clouddriver
Type:              ClusterIP
IP:                10.178.65.100
Port:              <unset>  7002/TCP
TargetPort:        7002/TCP
Endpoints:         10.179.34.24:7002
Session Affinity:  None
Events:            <none>

kubectl -n spinnaker-test describe endpoints spin-clouddriver

Name:         spin-clouddriver
Namespace:    spinnaker-test
Labels:       app=spin
              cluster=spin-clouddriver
Annotations:  <none>
Subsets:
  Addresses:          10.179.34.24
  NotReadyAddresses:  <none>
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    <unset>  7002  TCP

Events:  <none>

footnotes

  1. GKE 1.10.11-gke.1 to be exact, but the fact that it's GKE shouldn't matter.
-- David Xia
kubernetes
spinnaker

1 Answer

2/27/2019

A probe by the kubelet can end in one of three states:

  • successful
  • failed (command returned a non-0 exit code)
  • errored (command did not return before the timeout elapsed, the command does not exist inside the container, etc)

Here is the code (in 1.10.11) where the event probe errored is recorded. Note that err != nil.

Here is the code that calls the above function - when err != nil (the probe returned an error), the result is discarded.

Only probes that fail will actually cause the pod's ready state to be changed.

-- matt b
Source: StackOverflow