I have implemented an HTTP health check and a separate HTTP liveness check for my pod. For both, I see that Kubernetes behaves as expected if my pod delays before responding. However, when the endpoints respond immediately with status 500, Kubernetes treats that as a success. This happens after the pod is up and running OK, i.e. the checks only start returning status 500 once the pod is already healthy.
In fact, I see that returning status 500 actually resets the failure count, so it causes my pod to be treated as healthy again.
My question: am I doing something wrong? How do I get Kubernetes to react when my pod is unhealthy?
$ k version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:00:47Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
To investigate this problem, I added test endpoints to my pod so that I can change its behaviour at runtime: pass (return 200), fail (return 500), and delayed fail (wait 15 seconds, then return 500). I also separated the health and liveness endpoints.
From describe pod:
Liveness: exec [curl http://localhost:30030/livez] delay=10s timeout=1s period=10s #success=1 #failure=5
Readiness: exec [curl http://localhost:30030/healthz] delay=10s timeout=1s period=10s #success=1 #failure=3
I tested the endpoints by exec'ing into the pod and curling them from there (details below).
Then I cycled both the liveness check and the health check through the three modes and monitored how Kubernetes responded.
Health (readiness) check: expect the pod's IP address to be removed from the Service's list of endpoints after 3 failed probes in a row.
Liveness check: expect the pod to be restarted after 5 failed probes in a row (watch commands are sketched below).
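Something like the following can be used to watch both effects from outside the pod; my-service and my-pod are placeholders for the actual Service and pod names:
$ kubectl get endpoints my-service -w    # pod IP should disappear when the readiness check fails
$ kubectl get pod my-pod -w              # RESTARTS should increase when the liveness check fails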
Success case:
bash-4.4$ curl http://localhost:30030/unfailhealth
unfailhealth: REMOVE force all health checks to fail, was failHealth=false, delayFailHealth=false
bash-4.4$ curl http://localhost:30030/healthz -v
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 3
< ETag: W/"3-CftlTBfMBbEe9TvTWqcB9tVQ6OE"
< Date: Fri, 05 Feb 2021 13:30:59 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
OK
* Connection #0 to host localhost left intact
Failure case:
bash-4.4$ curl http://localhost:30030/failhealth
failhealth: force all health checks to fail, was failHealth=true, delayFailHealth=false
bash-4.4$ curl http://localhost:30030/healthz -v
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 26
< ETag: W/"1a-yI5D4Rtao1KH34GZVYKKvxZoEVo"
< Date: Fri, 05 Feb 2021 13:29:14 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
FAKE HEALTH CHECK FAILURE
* Connection #0 to host localhost left intact
Delayed failure case:
bash-4.4$ curl http://localhost:30030/delayfailhealth
delayfailhealth: force all health checks to sleep 15sec, then fail, was failHealth=false, delayFailHealth=true
bash-4.4$ date; curl http://localhost:30030/healthz -v
Fri Feb 5 13:33:08 UTC 2021
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 47
< ETag: W/"2f-n+Ix8oU/09OT9+cpPVm1/EejE9Y"
< Date: Fri, 05 Feb 2021 13:33:23 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
FAKE HEALTH CHECK FAILURE - AFTER 15 SEC DELAY
* Connection #0 to host localhost left intact
Test Results
Default to SUCCESS for both health and liveness endpoints (return status 200) -> pod starts and works OK.
Set liveness check to FAIL (return status 500) -> no change, pod IP still in the service, requests still dispatched to the pod.
Set liveness check to DELAY before responding (then 500) -> pod is removed from the Kubernetes service (yippee).
Set liveness check to FAIL (quickly) again -> pod is restored to the service (treated like a success).
Set health check to FAIL (return status 500) -> no effect, pod continues without a restart.
Set health check to DELAY before responding (then 500) -> pod is restarted after 5 failed probes.
Thanks for any help with this. I guess I could change my code to delay before responding in the failure case, but that seems like a workaround rather than a fix.
Problem solved, thanks to the comment from @mdaniel. I'm expanding on it here because it took me a while to fully understand the comment.
The problem was in the configuration of the readiness and liveness probes in the pod spec.
readinessProbe:
  exec:
    command:
      - curl
      - http://localhost:30030/healthz
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
This probe relies on the exit code of the curl command in the exec clause, not on the HTTP status of the response. curl exits with code 0 as long as it receives any response from the server, even a 500, so the probe always looks successful. If you want to keep using curl, add the -f (--fail) flag, e.g. curl -f http://localhost:30030/healthz; then curl exits with a non-zero code (22) for HTTP status codes of 400 and above, and the probe fails as expected.
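For illustration, here is a sketch of what the exit codes look like from inside the pod while the health endpoint is toggled to fail (-s silences progress output, -o /dev/null discards the response body):
bash-4.4$ curl -s -o /dev/null http://localhost:30030/healthz; echo $?
0
bash-4.4$ curl -sf -o /dev/null http://localhost:30030/healthz; echo $?
22
Exit code 0 is what the kubelet treats as a passing exec probe, even though the response was a 500. With -f, curl exits 22 ("HTTP page not retrieved") for any status of 400 or above, which the kubelet counts as a probe failure.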
But it is better to use httpGet in the pod spec, like this:
readinessProbe:
  httpGet:
    path: /healthz
    port: 30030
    scheme: HTTP
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
I tested both and both work. I will go with httpGet as suggested; it is the right tool for the job.
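For completeness, probe failures can also be confirmed from outside the pod, for example (my-pod is a placeholder for the actual pod name):
$ kubectl describe pod my-pod | grep -iA2 "probe failed"    # failed probes appear as Unhealthy events
$ kubectl get events --field-selector involvedObject.name=my-pod --sort-by=.lastTimestamp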
Note that the reason for using exec/curl instead of httpGet in the first place was that the pod uses TLS, which prevents the kubelet's plain-HTTP probe from reaching it. Ref. https://medium.com/cloud-native-the-gathering/kubernetes-liveness-probe-for-scratch-image-with-istio-mtls-enabled-90543e4bae34
Thanks!