I'm running Spinnaker on Kubernetes 1.10.111. One of the Spinnaker services is a Pod running a service called Clouddriver. This Pod was running fine, but then the readinessProbe started erroring continuously. Kubernetes docs say
readinessProbe: Indicates whether the Container is ready to service requests. If the readiness probe fails, the endpoints controller removes the Pod’s IP address from the endpoints of all Services that match the Pod.
— https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-probes
But this Pod's IP is still in the Service's endpoints. Why?
Clouddriver Pod YAML
kubectl -n spinnaker-test get pods spin-clouddriver-5559d44484-mp8q9 -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/psp: spotify.backend-service
creationTimestamp: 2019-02-15T20:46:38Z
generateName: spin-clouddriver-5559d44484-
labels:
app: spin
app.kubernetes.io/managed-by: halyard
app.kubernetes.io/name: clouddriver
app.kubernetes.io/part-of: spinnaker
app.kubernetes.io/version: 1.12.1
cluster: spin-clouddriver
pod-template-hash: "1115800040"
name: spin-clouddriver-5559d44484-mp8q9
namespace: spinnaker-test
ownerReferences:
- apiVersion: extensions/v1beta1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: spin-clouddriver-5559d44484
uid: ce79561c-3161-11e9-acdf-42010a800082
resourceVersion: "53541277"
selfLink: /api/v1/namespaces/spinnaker-test/pods/spin-clouddriver-5559d44484-mp8q9
uid: caa66d7c-3162-11e9-acdf-42010a800082
spec:
containers:
- env:
- name: JAVA_OPTS
value: -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2
- name: SPRING_PROFILES_ACTIVE
value: local
image: gcr.io/spinnaker-marketplace/clouddriver:4.3.1-20190130095322
imagePullPolicy: IfNotPresent
lifecycle: {}
name: clouddriver
ports:
- containerPort: 7002
protocol: TCP
readinessProbe:
exec:
command:
- wget
- --no-check-certificate
- --spider
- -q
- http://localhost:7002/health
failureThreshold: 3
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: "20"
memory: 5000Mi
requests:
cpu: "20"
memory: 5000Mi
securityContext:
allowPrivilegeEscalation: false
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /opt/spinnaker/config
name: spin-clouddriver-files-1952526246
- mountPath: /home/halyard/.hal/k8s-spinnaker/staging/dependencies
name: spin-clouddriver-files-1757773194
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-w2lt5
readOnly: true
dnsPolicy: ClusterFirst
nodeName: gke-production-us-ce-terraform-201812-d63606d6-9vq9
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 720
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: spin-clouddriver-files-1952526246
secret:
defaultMode: 420
secretName: spin-clouddriver-files-1952526246
- name: spin-clouddriver-files-1757773194
secret:
defaultMode: 420
secretName: spin-clouddriver-files-1757773194
- name: default-token-w2lt5
secret:
defaultMode: 420
secretName: default-token-w2lt5
status:
conditions:
- lastProbeTime: null
lastTransitionTime: 2019-02-15T20:46:38Z
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: 2019-02-15T20:53:40Z
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: 2019-02-15T20:46:38Z
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://3509b48511b1ea7bc97812cb82831c559d9410cb9eaaa26b4f492d881603fb31
image: gcr.io/spinnaker-marketplace/clouddriver:4.3.1-20190130095322
imageID: docker-pullable://gcr.io/spinnaker-marketplace/clouddriver@sha256:466228b97b8c4a61a0270c53ae4c397eb04bc3661bc4f1ee9ef4d5fce70d187d
lastState: {}
name: clouddriver
ready: true
restartCount: 0
state:
running:
startedAt: 2019-02-15T20:47:26Z
hostIP: 10.178.32.98
phase: Running
podIP: 10.179.34.24
qosClass: Guaranteed
startTime: 2019-02-15T20:46:38Z
Describing the Pod shows the readinessProbe has been continuously erroring for over a day.
kubectl -n spinnaker-test describe pods spin-clouddriver-5559d44484-mp8q9
Name: spin-clouddriver-5559d44484-mp8q9
Namespace: spinnaker-test
Node: gke-production-us-ce-terraform-201812-d63606d6-9vq9/10.178.32.98
Start Time: Fri, 15 Feb 2019 15:46:38 -0500
Labels: app=spin
app.kubernetes.io/managed-by=halyard
app.kubernetes.io/name=clouddriver
app.kubernetes.io/part-of=spinnaker
app.kubernetes.io/version=1.12.1
cluster=spin-clouddriver
pod-template-hash=1115800040
Annotations: kubernetes.io/psp=spotify.backend-service
Status: Running
IP: 10.179.34.24
Controlled By: ReplicaSet/spin-clouddriver-5559d44484
Containers:
clouddriver:
Container ID: docker://3509b48511b1ea7bc97812cb82831c559d9410cb9eaaa26b4f492d881603fb31
Image: gcr.io/spinnaker-marketplace/clouddriver:4.3.1-20190130095322
Image ID: docker-pullable://gcr.io/spinnaker-marketplace/clouddriver@sha256:466228b97b8c4a61a0270c53ae4c397eb04bc3661bc4f1ee9ef4d5fce70d187d
Port: 7002/TCP
Host Port: 0/TCP
State: Running
Started: Fri, 15 Feb 2019 15:47:26 -0500
Ready: True
Restart Count: 0
Limits:
cpu: 20
memory: 5000Mi
Requests:
cpu: 20
memory: 5000Mi
Readiness: exec [wget --no-check-certificate --spider -q http://localhost:7002/health] delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
JAVA_OPTS: -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2
SPRING_PROFILES_ACTIVE: local
Mounts:
/home/halyard/.hal/k8s-spinnaker/staging/dependencies from spin-clouddriver-files-1757773194 (rw)
/opt/spinnaker/config from spin-clouddriver-files-1952526246 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-w2lt5 (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
spin-clouddriver-files-1952526246:
Type: Secret (a volume populated by a Secret)
SecretName: spin-clouddriver-files-1952526246
Optional: false
spin-clouddriver-files-1757773194:
Type: Secret (a volume populated by a Secret)
SecretName: spin-clouddriver-files-1757773194
Optional: false
default-token-w2lt5:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-w2lt5
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 3m (x321 over 1d) kubelet, gke-production-us-ce-terraform-201812-d63606d6-9vq9 Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded
But Service still has the Pod's IP of 10.179.34.24 in its Endpoints.
kubectl -n spinnaker-test describe services spin-clouddriver
Name: spin-clouddriver
Namespace: spinnaker-test
Labels: app=spin
cluster=spin-clouddriver
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"spin","cluster":"spin-clouddriver"},"name":"spin-clouddriver","namesp...
Selector: app=spin,cluster=spin-clouddriver
Type: ClusterIP
IP: 10.178.65.100
Port: <unset> 7002/TCP
TargetPort: 7002/TCP
Endpoints: 10.179.34.24:7002
Session Affinity: None
Events: <none>
kubectl -n spinnaker-test describe endpoints spin-clouddriver
Name: spin-clouddriver
Namespace: spinnaker-test
Labels: app=spin
cluster=spin-clouddriver
Annotations: <none>
Subsets:
Addresses: 10.179.34.24
NotReadyAddresses: <none>
Ports:
Name Port Protocol
---- ---- --------
<unset> 7002 TCP
Events: <none>
A probe by the kubelet can end in one of three states:
Here is the code (in 1.10.11) where the event probe errored
is recorded. Note that err != nil
.
Here is the code that calls the above function - when err != nil
(the probe returned an error), the result is discarded.
Only probes that fail will actually cause the pod's ready state to be changed.