Issue with service/deployment intermittently stuck on ContainerCreating due to the docker data directory having been moved on a node

5/1/2019

Running a kubeadm cluster (version v1.13.1) across 8 nodes (7 running RHEL 7.x, and one running Ubuntu 18.04.2; docker version 1.13.1 w/ API ver 1.26 on RHEL and 18.09.5 w/ API ver 1.39 on Ubuntu).

Everything had been working very well until the other day, when the master node locked up after docker-current ate up memory and left the machine unresponsive, which necessitated a reboot.

Now that everything is back up and running, I started testing the cluster again. However, flaky behavior has appeared: service names are not being resolved as hostnames (e.g., svc_name.default) in some pods that use them to communicate between pods, and when I submit a service/deployment, the deployment gets stuck on ContainerCreating. If I restart the kubelet on the node the pod is scheduled to, the next attempt goes through and deploys the pod without issue.

I just added resource limitations on the nodes with limited memory/cpu, via --system-reserved=cpu=500m,memory=1Gi in /etc/systemd/system/kubelet.service.d/10-kubeadm.conf, but that hasn't helped at all.
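For reference, a sketch of that drop-in change, assuming the kubeadm-default environment-variable names (the exact variable names vary across kubeadm versions):

```ini
# /etc/systemd/system/kubelet.service.d/10-kubeadm.conf (excerpt, assumed layout)
[Service]
# Reserve cpu/memory for system daemons so the node keeps headroom
# for the kubelet and OS instead of handing everything to pods.
Environment="KUBELET_EXTRA_ARGS=--system-reserved=cpu=500m,memory=1Gi"
```

After editing, `systemctl daemon-reload` followed by `systemctl restart kubelet` is needed for the flag to take effect.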

I am monitoring the cluster using MetricsServer and the dashboard and don't see anything unusual. I have also scoured the logs using journalctl, with nothing popping out.

I've checked DNS, as per the DNS debugging docs, and everything is fine. So I'm not sure why the service name as a hostname is not always being picked up, though I suspect there is some underlying issue that was introduced when the master node locked up.

I am tempted to just rebuild the cluster, but also hesitant to, especially if these issues can be resolved.

Any ideas? Nothing I've turned up while searching applies to this issue. We're close to a production run and the timing of this is not good.

EDIT

The following is a description of the pod failing to spin up, which now makes sense:

Events:
  Type     Reason       Age               From                      Message
  ----     ------       ----              ----                      -------
  Normal   Scheduled    25s               default-scheduler         Successfully assigned default/nlp-adapt-wf-wmw6r-2161497416 to bpb.X.X.X
  Warning  FailedMount  9s (x6 over 25s)  kubelet, bpb.X.X.X  MountVolume.SetUp failed for volume "docker-lib" : hostPath type check failed: /var/lib/docker is not a directory

The issue is that I changed the default location of the docker data directory on the node bpb.X.X.X, and apparently kubernetes has no way of knowing this.
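Based on the FailedMount event, the pod presumably mounts the docker directory with a hostPath volume along these lines (a reconstruction; only the volume name and path are confirmed by the event):

```yaml
volumes:
  - name: docker-lib
    hostPath:
      path: /var/lib/docker
      # With type Directory, the kubelet verifies that the path exists and is
      # a directory on the node before mounting; a relocated data directory
      # fails this check unless /var/lib/docker still resolves to a directory.
      type: Directory
```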

My Googling of this has not yielded any results of worth.

How do I let kubernetes know where the docker data on this node is now residing? Docker itself works fine on this node.

-- horcle_buzz
docker
kubernetes

1 Answer

5/1/2019

Creating a symlink from the new location of the docker data folder to /var/lib/docker seems to have fixed the issue.
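A minimal sketch of that symlink fix. Temp paths stand in for the real ones so the commands are safe to try; on the actual node the link target would be the relocated data directory (whatever "data-root" is set to in /etc/docker/daemon.json) and the link itself would be /var/lib/docker, with docker and the kubelet stopped while relinking:

```shell
# Stand-ins for the real paths (assumed; substitute the actual data-root
# and /var/lib/docker when running this on the node).
new_data_root=$(mktemp -d)        # stands in for the relocated docker data directory
link_path=$(mktemp -d)/docker     # stands in for /var/lib/docker

# Point the path the hostPath check expects at the real data directory.
ln -sfn "$new_data_root" "$link_path"

# The hostPath "Directory" type check follows symlinks, so this now passes.
test -d "$link_path" && echo "hostPath directory check passes"
```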

-- horcle_buzz
Source: StackOverflow