So I have a completely working EKS cluster with an ELB in front of the service. The problem is that when I do a helm upgrade that does nothing more than pull a new version of the application image, the target pod goes out of service on the load balancer and never recovers. I've verified that the pod is running properly with the new image, so I'm not sure what is going on here.
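For context, the upgrade is just an image tag bump, roughly like this (the release name, chart path, and values key are placeholders, not my actual names):

    # Hypothetical example of the upgrade that triggers the problem:
    # only the image tag changes, the chart itself is untouched.
    helm upgrade my-release ./my-chart --set image.tag=1.2.3 --reuse-values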
If I just destroy the entire release and recreate it, everything is fine again, but as soon as I trigger a deploy it goes out of service.
Sometimes killing the pod and letting it recreate fixes it (see the command below), but I'm still not sure why it goes out of service to begin with.
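The manual bounce is nothing special, just deleting the pod and letting the Deployment recreate it (pod name is a placeholder):

    # Force a fresh pod; the Deployment recreates it and sometimes the
    # replacement registers healthy on the ELB again.
    kubectl delete pod my-app-pod-abc123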
I can run a tcpdump in the container and see the health check requests reaching the pod, but they are never answered, which is obviously what makes the ELB mark it out of service. Yet if I tail the pod's logs everything looks fine: I can see the readiness checks succeeding, and the readiness probe actually runs a function inside the application.
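For reference, this is roughly how I'm checking both sides (pod name and port are placeholders; tcpdump happens to be available in my image):

    # Watch the ELB health check traffic from inside the container.
    kubectl exec -it my-app-pod-abc123 -- tcpdump -ni any port 8080

    # Tail the application logs at the same time; the readiness probe
    # hits show up here and report healthy.
    kubectl logs -f my-app-pod-abc123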
Not quite sure where to look from here. Everything checks out as far as I can see.