Kubernetes http health check not working as expected - 500 response is ignored

2/5/2021

I have implemented an HTTP health check and a separate HTTP liveness check for my pod. For both, Kubernetes works as expected if my pod delays before responding. However, when they respond immediately with status 500, Kubernetes treats that as a success response. The pod is up and running OK before the checks start returning status 500.

In fact, I see that returning status 500 resets the failure count, so my pod is treated as healthy again.

Am I doing something wrong? How do I get Kubernetes to act when my pod is unhealthy?

$ k version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:00:47Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

To investigate this problem, I added test endpoints to my pod so that I can change the behaviour at runtime: pass (200), fail (500), and delayed fail (wait 15 seconds, then return 500). I also separated the health and liveness endpoints.

From describe pod:

Liveness:   exec [curl http://localhost:30030/livez] delay=10s timeout=1s period=10s #success=1 #failure=5
Readiness:  exec [curl http://localhost:30030/healthz] delay=10s timeout=1s period=10s #success=1 #failure=3

I tested the endpoints by exec'ing into the pod and curling them from there (details below).
Then I cycled both the liveness check and the health check through the 3 modes and monitored how Kubernetes responded.
Health Check: expect pod to be restarted after failing health check 5 times in a row.
Liveness Check: describe the service and expect IP address of the pod to be removed from the list of endpoints.
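
The two outcomes being watched for here (container restarts and removal from the service's endpoints) can also be read directly with kubectl rather than scanning the describe output. The following is a minimal sketch, not part of the original test setup; the function names are hypothetical, and the pod and service names are whatever you pass in.

```shell
# Hypothetical helpers; pass your own pod and service names as arguments.

# Print how many times the pod's first container has been restarted.
restart_count() {
  kubectl get pod "$1" -o jsonpath='{.status.containerStatuses[0].restartCount}'
}

# Print the pod IPs currently listed as ready addresses of the service.
ready_endpoint_ips() {
  kubectl get endpoints "$1" -o jsonpath='{.subsets[*].addresses[*].ip}'
}
```

For example, `restart_count my-pod` and `ready_endpoint_ips my-service` (both names are placeholders) can be run before and after switching an endpoint's mode to compare the results.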

Success case:

bash-4.4$ curl http://localhost:30030/unfailhealth
unfailhealth: REMOVE force all health checks to fail, was failHealth=false, delayFailHealth=false

bash-4.4$ curl http://localhost:30030/healthz -v
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 3
< ETag: W/"3-CftlTBfMBbEe9TvTWqcB9tVQ6OE"
< Date: Fri, 05 Feb 2021 13:30:59 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
OK
* Connection #0 to host localhost left intact

Failure case:

bash-4.4$ curl http://localhost:30030/failhealth
failhealth: force all health checks to fail, was failHealth=true, delayFailHealth=false

bash-4.4$ curl http://localhost:30030/healthz -v
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 26
< ETag: W/"1a-yI5D4Rtao1KH34GZVYKKvxZoEVo"
< Date: Fri, 05 Feb 2021 13:29:14 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
FAKE HEALTH CHECK FAILURE
* Connection #0 to host localhost left intact

Delayed failure case:

bash-4.4$ curl http://localhost:30030/delayfailhealth
delayfailhealth: force all health checks to sleep 15sec, then fail, was failHealth=false, delayFailHealth=true

bash-4.4$ date; curl http://localhost:30030/healthz -v
Fri Feb  5 13:33:08 UTC 2021
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 47
< ETag: W/"2f-n+Ix8oU/09OT9+cpPVm1/EejE9Y"
< Date: Fri, 05 Feb 2021 13:33:23 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
FAKE HEALTH CHECK FAILURE - AFTER 15 SEC DELAY
* Connection #0 to host localhost left intact

Test Results

Default to SUCCESS for both health and liveness endpoints, return status 200 -> pod starts and works OK.

Set liveness check to FAIL, return status 500 -> no change, pod IP still in service, requests still dispatched to the pod.
Set liveness check to DELAY before responding (then 500) -> pod is removed from Kubernetes service (yippee)
Set liveness check to FAIL (quickly) again -> pod is restored to the service (treated like success).

Set health check to FAIL (return status 500) -> no effect, pod continues without restart.
Set health check to DELAY before responding (then 500) -> pod is restarted after 5 failed probes
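
The way a quick 500 restored the pod is what you would see if each quick response were being counted as a success, because a failure threshold applies to consecutive failures and a single success clears the streak. The sketch below only illustrates that counting rule; it is not kubelet code, and the function name is made up for this example.

```shell
# Illustration only: simulate consecutive-failure counting.
# Args: failure threshold, then one exit code per probe run (0 = success).
# Prints the 1-based probe number at which the threshold is reached, or "never".
first_trigger() {
  threshold="$1"; shift
  failures=0
  n=0
  for code in "$@"; do
    n=$((n + 1))
    if [ "$code" -eq 0 ]; then
      failures=0                    # any success clears the streak
    else
      failures=$((failures + 1))    # another consecutive failure
    fi
    if [ "$failures" -ge "$threshold" ]; then
      echo "$n"
      return 0
    fi
  done
  echo "never"
}
```

With a threshold of 3, `first_trigger 3 1 1 0 1 1 1` prints 6, because the success in third position resets the count, while `first_trigger 3 1 1 0 1 1 0` prints never.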

Thanks for any help with this. I guess I can change my code to delay before responding in the failure case but that seems like a workaround.

-- Dave Deasy
kubernetes

1 Answer

2/5/2021

Problem solved thanks to a comment from @mdaniel. I am expanding on it here because it took me a while to fully understand the comment.

The problem was in the configuration of the health and liveness checks in the pod spec.

        readinessProbe:
          exec:
            command:
            - curl
            - http://localhost:30030/healthz
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

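An exec probe is judged by the exit code of its command, not by the command's output.
By default, curl exits with code 0 whenever it receives an HTTP response, whatever the status code, so a 500 counts as success. The delayed case was presumably counted as a failure only because curl did not finish within the 1-second probe timeout. If you want to keep curl, use curl -f, which exits with a non-zero code (22) when the HTTP status is 400 or higher.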
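
To check the exit-code behaviour by hand from inside the pod, a small wrapper like the one below may help. It is a sketch under assumptions: the function name http_ok and the default 1-second limit are choices made for this example, not something from the original pod spec.

```shell
# Hypothetical wrapper: succeed only when the URL answers with a non-error status.
#   -f            exit non-zero (22) when the HTTP status is 400 or higher
#   -s -S         hide the progress meter but still print error messages
#   -o /dev/null  discard the response body
#   --max-time    give up after this many seconds (second argument, default 1)
http_ok() {
  curl -f -s -S -o /dev/null --max-time "${2:-1}" "$1"
}
```

Running `http_ok http://localhost:30030/healthz; echo "exit=$?"` while the endpoint is in fail mode should then report a non-zero exit code, whereas plain curl against the same URL reports 0.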

It is better, though, to use httpGet in the pod spec, like this:

        readinessProbe:
          httpGet:
            path: /healthz
            port: 30030
            scheme: HTTP
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

I tested both and both work. I will go with httpGet as suggested - the right tool for the job.
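
For comparison with the exec approach: an httpGet probe is judged by the response status itself, and Kubernetes documents any status from 200 up to but not including 400 as success. The rule can be sketched as follows; the function name is invented for this example and this is not kubelet code.

```shell
# Illustration of the documented httpGet rule: 200 <= status < 400 is success.
# Arg: the numeric HTTP status code returned by the probe endpoint.
http_probe_result() {
  if [ "$1" -ge 200 ] && [ "$1" -lt 400 ]; then
    echo success
  else
    echo failure
  fi
}
```

So `http_probe_result 200` prints success and `http_probe_result 500` prints failure, which is the behaviour the original exec/curl probe was missing.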

Note that the original reason for using exec/curl instead of httpGet was that the pod uses TLS, which prevents Kubernetes from probing the pod over plain HTTP. Ref. https://medium.com/cloud-native-the-gathering/kubernetes-liveness-probe-for-scratch-image-with-istio-mtls-enabled-90543e4bae34

Thanks!

-- Dave Deasy
Source: StackOverflow