Recently we've experienced issues with both our Non-Production and Production clusters, where nodes hit a 'System OOM encountered' condition.
The Pods within the Non-Production cluster don't seem to be spread across nodes: a single node appears to be running all the Pods, putting heavy load on that system.
Also, the Pods are stuck in the 'Waiting: ContainerCreating' status.
Any help/guidance with the above issues would be greatly appreciated. We are building more and more services in this cluster and want to make sure there is no instability or environment issue, and that proper checks and configuration are in place before we go live.
Unable to mount volumes for pod "xxx-3615518044-6l1cf_xxx-qa(8a5d9893-230b-11e8-a943-000d3a35d8f4)": timeout expired waiting for volumes to attach/mount for pod "xxx-service-3615518044-6l1cf"/"xxx-qa"
That indicates your Pod is having trouble mounting the volume specified in your configuration. This is often a permissions issue. If you post your config files (e.g., in a gist) with private information removed, we can probably be more helpful.
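As a first step, the events on the failing Pod usually show why the mount timed out. A sketch of the commands I would run, using the Pod and namespace names from your error message:

```shell
# Show the full event history for the failing Pod, including any
# FailedMount / FailedAttachVolume events with the underlying cause.
kubectl describe pod xxx-3615518044-6l1cf --namespace xxx-qa

# Or list mount/attach-related events across the namespace.
kubectl get events --namespace xxx-qa | grep -i -E 'mount|attach'
```

The event messages typically name the exact volume and the reason (permissions, a missing secret/PVC, or a cloud-disk attach failure).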
I would recommend managing container compute resources properly within your Kubernetes cluster. When creating a Pod, you can optionally specify how much CPU and memory (RAM) each Container needs, which helps avoid OOM situations.
When Containers have resource requests specified, the scheduler can make better decisions about which nodes to place Pods on. And when Containers have their limits specified, contention for resources on a node can be handled in a specified manner. CPU specifications are in units of cores, and memory is specified in units of bytes.
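For example, a Pod spec with requests and limits might look like the following; the name, image, and values here are placeholders, not taken from your setup:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app          # placeholder name
spec:
  containers:
  - name: app
    image: nginx             # placeholder image
    resources:
      requests:
        memory: "256Mi"      # scheduler only places the Pod on a node with this much free
        cpu: "250m"          # 250 millicores = 0.25 core
      limits:
        memory: "512Mi"      # container is OOM-killed if it exceeds this
        cpu: "500m"          # container is throttled above this
```

With requests set on every container, the scheduler can spread Pods across nodes based on actual capacity instead of stacking them on one node.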
An event is produced each time the scheduler fails to place a Pod. Use the command below to see the events for a Pod:
$ kubectl describe pod <pod-name> | grep -A 20 Events
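To see whether a node itself is under memory pressure (the likely trigger for the 'System OOM encountered' events), I would also check the node conditions and scheduling failures, for example:

```shell
# Node conditions include MemoryPressure / DiskPressure flags that
# the kubelet sets when the node is running low on resources.
kubectl describe nodes | grep -A 6 'Conditions:'

# List scheduling failures across the whole cluster.
kubectl get events --all-namespaces | grep -i FailedScheduling
```

If MemoryPressure is True on the overloaded node, that confirms the OOM events are capacity-driven rather than an application bug.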
Also, read the official Kubernetes guide on “Configure Out Of Resource Handling”. Always make sure to:

- reserve 10-20% of memory capacity for system daemons like the kubelet and the OS kernel;
- identify Pods which can be evicted at 90-95% memory utilization, to reduce thrashing and the incidence of system OOM.
To facilitate this kind of scenario, the kubelet would be launched with options like the following:
--eviction-hard=memory.available<xMi
--system-reserved=memory=yGi
Replacing x and y with actual memory values.
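As an illustration (the values are placeholders you would tune to your node sizes, not a recommendation), a node with 16 GiB of RAM might reserve roughly 10% for the system and set a hard eviction threshold like this:

```shell
# Hypothetical flags for a 16 GiB node: reserve ~1.5 GiB of memory for
# system daemons, and start hard-evicting Pods once less than 500 MiB
# of memory remains available on the node.
kubelet --eviction-hard=memory.available<500Mi \
        --system-reserved=memory=1.5Gi
```

The kubelet then evicts lower-priority Pods before the kernel's OOM killer has to step in, which is far more predictable than a system OOM.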
Having Heapster container monitoring in place should also be helpful for visualization.
Further reading: Kubernetes and Docker Administration.