I have a Kubernetes HPA set up in my cluster, and it works as expected, scaling pod instances up and down as CPU/memory usage increases and decreases.
The only problem is that my pods handle web requests, so the HPA occasionally scales down a pod that is in the middle of handling one. The web server never gets a response back from the pod that was scaled down, so the caller of the web API gets an error.
This all makes sense theoretically. My question: does anyone know of a best-practice way to handle this? Is there some way I can wait until all requests are processed before scaling down? Or some other way to ensure that requests complete before the HPA scales down the pod?
I can think of a few solutions, none of which I like:
Any suggestions would be appreciated. Thanks in advance!
You must design your apps to support graceful shutdown. First your pod will receive a SIGTERM signal, and after 30 seconds (the default grace period, which can be configured via terminationGracePeriodSeconds in the pod spec) your pod will receive a SIGKILL signal and be removed. See Termination of pods.
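SIGTERM: When your app receives the termination signal, the pod is also being removed from the Service endpoints, so it stops receiving new requests. That removal happens in parallel with the signal, so a few new requests may still arrive for a short time. Either way, you should finish responding to the requests you have already received before exiting.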
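As a rough illustration of what handling SIGTERM can look like, here is a minimal sketch using only Python's standard library. The names Handler, GracefulServer, run and main, and port 8080, are my own assumptions for the example and not anything from your setup. The idea is: on SIGTERM, stop accepting new connections, then wait for in-flight handler threads to finish before the process exits.

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    """Hypothetical handler standing in for the real web API."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


class GracefulServer(ThreadingHTTPServer):
    # ThreadingHTTPServer uses daemon threads by default, which are not
    # waited for. Non-daemon threads are joined by server_close().
    daemon_threads = False
    block_on_close = True


def run(server, stop_event):
    """Serve until stop_event is set, then drain in-flight requests."""
    worker = threading.Thread(target=server.serve_forever)
    worker.start()
    stop_event.wait()       # set by the SIGTERM handler in main()
    server.shutdown()       # stop the accept loop; no new requests
    server.server_close()   # close the socket and join handler threads
    worker.join()


def main():
    stop_event = threading.Event()
    # Kubernetes sends SIGTERM first; we only flag the event here.
    signal.signal(signal.SIGTERM, lambda signum, frame: stop_event.set())
    server = GracefulServer(("0.0.0.0", 8080), Handler)  # port is an assumption
    run(server, stop_event)
```

Calling main() as the container entrypoint would serve until SIGTERM arrives. The drain has to finish inside the grace period, otherwise SIGKILL cuts it off anyway, so long-running requests may need a longer grace period. Most web frameworks and servers have their own built-in equivalent of this, which is usually preferable to rolling your own.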
Your apps should also be designed for idempotency so you can safely retry failed requests.
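To make the idempotency point concrete, here is a small sketch, again with made-up names (IdempotentProcessor, deposit, call_with_retries) and an in-memory dict standing in for whatever shared store a real service would need. The caller attaches a request ID, the server remembers the result per ID, and a retry after a lost response does not apply the operation twice.

```python
import time


class IdempotentProcessor:
    """Remembers results by request ID so a retried request is applied once."""

    def __init__(self):
        self._results = {}  # assumption: in-memory; real pods need a shared store
        self.balance = 0

    def deposit(self, request_id, amount):
        # A repeated request ID returns the stored result without side effects.
        if request_id in self._results:
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance


def call_with_retries(func, attempts=3, delay=0.1):
    """Retry func on ConnectionError with exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionError as exc:
            last_error = exc
            time.sleep(delay * (2 ** attempt))
    raise last_error
```

The retry helper is only safe because deposit is keyed by request ID. Without that, a request that was processed just before the pod was killed would be applied a second time on retry.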