We are noticing restarts of our Kubernetes pods (Google Container Engine) whenever the JVM's garbage collection pause runs a little long.
Specifically, any time a GC pause crosses ~20 seconds, the pod gets restarted.
1) The JVM is not out of heap memory. It's still using less than 20% of the allocated heap. It's just that once in a long while a particular GC cycle takes a long time (possibly due to I/O on that pod's disk at the time).
2) I tried adjusting the liveness check parameters to periodSeconds=12 and failureThreshold=5, so that the liveness checker waits for at least 12 * 5 = 60 seconds before deciding a pod has become unresponsive and replacing it with a new one, but it still restarts the pod as soon as the GC pause crosses 20-22 seconds. A sketch of the probe configuration I tried is shown below.
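This is roughly what the probe looked like, assuming an HTTP liveness probe; the path and port are placeholders, not our actual health endpoint:

```yaml
livenessProbe:
  httpGet:
    path: /healthz        # placeholder health endpoint
    port: 8080            # placeholder port
  periodSeconds: 12       # probe every 12 seconds
  failureThreshold: 5     # tolerate 5 consecutive failures (12 * 5 = 60s, or so I assumed)
  # timeoutSeconds not set, so it stays at the default of 1 second
```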
Could anyone comment on why this might be happening and what else I can adjust so the pod is not restarted during this GC pause? It's a pity, because there is a lot of heap capacity still available, and memory is not really the reason it should be replaced.
Found it.
I had to raise timeoutSeconds from its default of 1 second to 5 seconds, in addition to setting periodSeconds to 12, to make it wait for ~60 seconds before flagging a pod as unresponsive.
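Presumably, with the default 1-second timeout, every probe sent during the GC pause timed out almost immediately, so the failure threshold was exhausted well before 60 seconds had elapsed. A minimal sketch of the working configuration (path and port are placeholders):

```yaml
livenessProbe:
  httpGet:
    path: /healthz        # placeholder health endpoint
    port: 8080            # placeholder port
  periodSeconds: 12       # probe every 12 seconds
  timeoutSeconds: 5       # raised from the default 1s so a slow response isn't an instant failure
  failureThreshold: 5     # 5 consecutive failures before the pod is restarted
```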