Kubernetes POD restarting

1/18/2020

I am running a GKE cluster with two node pools.

1st node pool: 1 node, no auto scaling (4 vCPU, 16 GB RAM)

2nd node pool: 1 node, auto scaling up to 2 nodes (1 vCPU, 3.75 GB RAM)

Here is the output of kubectl top node:

(screenshot of kubectl top node output)

We started the cluster with a single node running Elasticsearch, Redis, RabbitMQ and all microservices on that one node. We cannot add more nodes to the 1st node pool, as that would waste resources; the 1st node can satisfy all resource requirements.

We are facing pod restarts for only one microservice.

(screenshot of pod restart counts)

Only the core service pod keeps restarting. When I describe the pod, its last state shows it was terminated with exit code 137 (Error).

In the GKE Stackdriver graphs, memory and CPU are not reaching the limits.

Utilization of all pods in the cluster:

(screenshot of per-pod utilization)

In the cluster logs I have found this warning:

0/3 nodes are available: 3 Insufficient CPU. 

But the 3 nodes together have around 6 vCPU, which is more than enough.
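That warning comes from the scheduler, which compares the CPU requests of pending pods against each node's allocatable CPU, not against live usage, so nodes that look idle in kubectl top can still be fully booked. A quick way to check what is actually reserved (a sketch; adjust names to your cluster):

# Allocatable capacity and the per-node totals of requests/limits:
kubectl describe nodes | grep -A 7 "Allocated resources"

# CPU request of every pod and the node it was scheduled on:
kubectl get pods --all-namespaces -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,CPU_REQ:.spec.containers[*].resources.requests.cpu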

Also this error:

Memory cgroup out of memory: Kill process 3383411 (python3) score 2046 or sacrifice child Killed process 3384902 (python3) total-vm:14356kB, anon-rss:5688kB, file-rss:4572kB, shmem-rss:0kB
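Exit code 137 is 128 + SIGKILL (9), which lines up with this message: the kernel's memory cgroup OOM killer terminated a python3 child process when the container hit its memory limit. Sampled graphs can miss a spike like that, so polling the live numbers can help (a sketch, assuming metrics-server works in this cluster; the pod name is the one from the describe output below):

# Poll per-container CPU/memory of the restarting pod every 5 seconds:
watch -n 5 "kubectl top pod test-core-7fc8bbcb4c-vrbtw --containers"

# Any OOM-related events recorded in the cluster:
kubectl get events --all-namespaces | grep -i oom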

EDIT : 1

Name:           test-core-7fc8bbcb4c-vrbtw
Namespace:      default
Priority:       0
Node:           gke-test-cluster-highmem-pool-gen2-f2743e02-msv2/10.128.0.7
Start Time:     Fri, 17 Jan 2020 19:59:54 +0530
Labels:         app=test-core
                pod-template-hash=7fc8bbcb4c
                tier=frontend
Annotations:    <none>
Status:         Running
IP:             10.40.0.41
IPs:            <none>
Controlled By:  ReplicaSet/test-core-7fc8bbcb4c
Containers:
  test-core:
    Container ID:   docker://0cc49c15ed852e99361590ee421a9193e10e7740b7373450174f549e9ba1d7b5
    Image:          gcr.io/test-production/core/production:fc30db4
    Image ID:       docker-pullable://gcr.io/test-production/core/production@sha256:b5dsd03b57sdfsa6035ff5ba9735984c3aa714bb4c9bb92f998ce0392ae31d055fe
    Ports:          9595/TCP, 443/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Sun, 19 Jan 2020 14:54:52 +0530
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Sun, 19 Jan 2020 07:36:42 +0530
      Finished:     Sun, 19 Jan 2020 14:54:51 +0530
    Ready:          True
    Restart Count:  7
    Limits:
      cpu:     990m
      memory:  1Gi
    Requests:
      cpu:      200m
      memory:   128Mi
    Liveness:   http-get http://:9595/k8/liveness delay=25s timeout=5s period=5s #success=1 #failure=30
    Readiness:  http-get http://:9595/k8/readiness delay=25s timeout=8s period=5s #success=1 #failure=30
    Environment Variables from:
      test-secret             Secret     Optional: false
      core-staging-configmap  ConfigMap  Optional: false
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  default-token-hcz6d:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-hcz6d
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
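The Last State and Restart Count above can also be pulled directly, which makes it easier to watch the next restart without re-reading the whole describe output (a sketch; the pod name changes when the ReplicaSet creates a new pod):

# Last termination reason/exit code and the restart count for the pod:
kubectl get pod test-core-7fc8bbcb4c-vrbtw -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
kubectl get pod test-core-7fc8bbcb4c-vrbtw -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}'

# Recent events for the pod (none were left in the describe output above):
kubectl get events --field-selector involvedObject.name=test-core-7fc8bbcb4c-vrbtw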

Please help. Thank you in advance.

-- Harsh Manvar
docker
google-cloud-platform
google-kubernetes-engine
kubernetes
python

1 Answer

1/19/2020

The application running in the pod may be consuming more memory than the specified limit. You can docker exec / kubectl exec into the container and monitor the application using top. From the perspective of managing the whole cluster, this is done with cAdvisor (which is part of the kubelet) plus Heapster, but Heapster has now been replaced by the metrics server (https://kubernetes.io/docs/tasks/debug-application-cluster/resource-usage-monitoring).
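For example (a sketch: it assumes top exists inside the image and that metrics-server is running, which GKE clusters normally provide in kube-system; it also assumes the Deployment and container are both named test-core, as the ReplicaSet name in the question suggests, and 2Gi is only an example value to be sized from observed usage):

# Live usage from inside the container (requires top in the image):
kubectl exec -it test-core-7fc8bbcb4c-vrbtw -- top

# The same numbers via metrics-server, without entering the container:
kubectl top pod test-core-7fc8bbcb4c-vrbtw --containers

# If the service genuinely needs more than the current 1Gi limit,
# raise the request/limit and watch whether the restarts stop:
kubectl set resources deployment test-core --containers=test-core --requests=memory=512Mi --limits=memory=2Gi
kubectl get pods -l app=test-core -w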

-- pr-pal
Source: StackOverflow