How to detect an exception that occurred in a Pod in Kubernetes?

8/19/2018

I have a multi-node Kubernetes cluster. Multiple services are deployed as Pods. They communicate with each other via RabbitMQ, which also runs as a Pod in the cluster.

Problem Scenario:

Services often fail to connect to the required queue in RabbitMQ. The errors are reported in the RabbitMQ pod logs as well as in the logs of the service Pods. This happens primarily due to connectivity issues and is inconsistent. The failure breaks functionality, but since it is NOT a crash, the pod always stays in the Running state in Kubernetes. To fix it we have to go and restart the pod manually.

I want to create a liveness probe for every pod. But how should this work to catch the exception? Since many processes in a service may be using the connection, any one of them can fail.

-- C.v
docker
kubectl
kubernetes
pod
rabbitmq

1 Answer

8/19/2018

I'd suggest implementing an HTTP endpoint for the liveness probe that checks the state of the connection to RabbitMQ, or simply failing hard and exiting the whole process when the RabbitMQ connection does not work.
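As a rough illustration, here is a minimal sketch of such an endpoint in Python, assuming the service talks to RabbitMQ through the pika library. The `/healthz` path, port 8080, and the throwaway test connection are illustrative choices, not something prescribed by Kubernetes or RabbitMQ:

```python
# Minimal liveness-endpoint sketch. Assumes the pika library and a RabbitMQ
# service reachable under the DNS name "rabbitmq"; adjust to your setup.
from http.server import BaseHTTPRequestHandler, HTTPServer

import pika

RABBIT_HOST = "rabbitmq"  # assumed in-cluster service name of the RabbitMQ Pod


def rabbit_connection_ok() -> bool:
    """Open and close a throwaway connection to verify RabbitMQ is reachable."""
    try:
        conn = pika.BlockingConnection(pika.ConnectionParameters(host=RABBIT_HOST))
        conn.close()
        return True
    except pika.exceptions.AMQPError:
        return False


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz" and rabbit_connection_ok():
            self.send_response(200)
        else:
            self.send_response(503)  # kubelet treats non-2xx as a failed probe
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("", 8080), HealthHandler).serve_forever()
```

The Deployment would then point an `httpGet` liveness probe at `/healthz` on port 8080, so the kubelet restarts the container once the check keeps failing.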

But... the best solution would be to retry the connection indefinitely when it fails, so that a temporary networking issue is recovered from transparently. A well-written service should wait for the services it depends on to become operational instead of cascading the failure up the stack.
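A sketch of that retry loop, again assuming pika; the delay values and the backoff cap are arbitrary illustrative choices:

```python
# Retry the RabbitMQ connection indefinitely with capped exponential backoff,
# instead of letting one failed connect break the whole service.
import time

import pika


def connect_with_retry(host: str = "rabbitmq") -> pika.BlockingConnection:
    """Block until RabbitMQ is reachable, backing off between attempts."""
    delay = 1
    while True:
        try:
            return pika.BlockingConnection(pika.ConnectionParameters(host=host))
        except pika.exceptions.AMQPConnectionError:
            print(f"RabbitMQ not reachable, retrying in {delay}s")
            time.sleep(delay)
            delay = min(delay * 2, 30)  # cap the backoff at 30 seconds
```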

Imagine you have a liveness check like the one you ask about here on 20 services that use that RabbitMQ (or any other shared service). That service goes down for a while, and you end up with a cluster with 20+ services in CrashLoopBackOff state due to the incremental backoff on failure. This means your cluster will take some time to recover once the originally failing service is back, and the overall picture will be messy enough to make it harder to understand what actually happened at first glance.

-- Radek 'Goblin' Pieczonka
Source: StackOverflow