Running a kubeadm cluster (version v1.13.1) across 8 nodes (7 running RHEL 7.x and one running Ubuntu 18.04.2; docker version 1.13.1 with API version 1.26 on RHEL, 18.09.5 with API version 1.39 on Ubuntu).
Everything had been working very well until the other day, when docker-current on the master node ate up its memory and left the machine in a bad state, which necessitated a reboot.
Now that everything is back up and running, I started testing the cluster again. However, flaky behavior started happening: service names are not always being picked up as host names (as svc_name.default) in pods that use them to communicate between pods, and when I submit a service/deployment the pods get stuck in ContainerCreating. If I restart the kubelet on the node the pod is scheduled to, the next attempt deploys the pod without issue.
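For reference, the per-node workaround is just restarting the kubelet service on the affected node (this assumes a systemd-managed kubelet, which is what kubeadm sets up):

```
# On the node where the pod is stuck in ContainerCreating:
sudo systemctl restart kubelet
# then re-submit the deployment or wait for the next attempt
```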
I just added resource reservations on the nodes with limited memory/CPU, adding --system-reserved=cpu=500m,memory=1Gi to /etc/systemd/system/kubelet.service.d/10-kubeadm.conf, but that hasn't helped at all.
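For anyone comparing notes, the drop-in change looked roughly like this (a sketch; the exact environment variable the drop-in uses, e.g. KUBELET_EXTRA_ARGS, varies between kubeadm versions):

```ini
# /etc/systemd/system/kubelet.service.d/10-kubeadm.conf (excerpt)
[Service]
Environment="KUBELET_EXTRA_ARGS=--system-reserved=cpu=500m,memory=1Gi"
```

followed by systemctl daemon-reload and systemctl restart kubelet to pick it up.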
I am monitoring the cluster with MetricsServer and the dashboard and don't see anything unusual. I have also scoured the logs with journalctl, with nothing popping out.
I've checked DNS, as per the DNS debugging docs, and everything is fine. So I am not sure why the service name is not always being picked up as a host name, though I suspect some underlying issue was introduced when the master node locked up.
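The DNS check followed the usual pattern from the Kubernetes DNS debugging guide, roughly the following (svc_name and the default namespace here are placeholders for the actual service being resolved):

```
# Resolve a service name from inside the cluster using a throwaway pod
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never \
  -- nslookup svc_name.default

# Also confirm the DNS pods themselves are healthy
kubectl get pods -n kube-system -l k8s-app=kube-dns
```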
I am tempted to just rebuild the cluster, but I am hesitant to do so if these issues can be resolved.
Any ideas? Nothing I've turned up in searches applies to this issue. We're close to a production run, so the timing is not good.
EDIT
The following is from the description of the pod failing to spin up, which now makes sense:
Events:
  Type     Reason       Age                From                Message
  ----     ------       ----               ----                -------
  Normal   Scheduled    25s                default-scheduler   Successfully assigned default/nlp-adapt-wf-wmw6r-2161497416 to bpb.X.X.X
  Warning  FailedMount  9s (x6 over 25s)   kubelet, bpb.X.X.X  MountVolume.SetUp failed for volume "docker-lib" : hostPath type check failed: /var/lib/docker is not a directory
The issue is that I changed the default location of the docker data directory on the node bpb.X.X.X, but apparently Kubernetes has no way of knowing this on its own. My Googling has not yielded any results of worth. How do I let Kubernetes know where the docker data on this node now resides? Docker itself works fine on this node.
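The error makes sense given how such a volume is declared: a hostPath volume with type: Directory makes the kubelet verify that the path exists as a directory on the scheduled node before mounting. The volume in question presumably looks something like this (reconstructed from the event message, not the full spec):

```yaml
# Sketch of the volume that triggers the check
volumes:
  - name: docker-lib
    hostPath:
      path: /var/lib/docker   # kubelet checks this path on the scheduled node
      type: Directory         # mount fails if the path is not a directory there
```

One option, if the manifest can be varied per node, is to point path at the node's actual docker data directory.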
Creating a symlink at /var/lib/docker pointing to the new location of the docker data directory seems to have fixed the issue.
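A minimal sketch of the fix (the paths here are throwaway stand-ins; on the real node the link name is /var/lib/docker, the target is docker's relocated data directory, and docker and the kubelet should be stopped before relinking):

```shell
# Stand-ins: NEW_ROOT plays the relocated docker data dir,
# LINK plays /var/lib/docker.
NEW_ROOT=$(mktemp -d)
LINK="$(mktemp -d)/docker-lib"

# Create the symlink: ln -s <target> <link-name>
ln -s "$NEW_ROOT" "$LINK"

# The hostPath Directory check stat()s the path and follows symlinks,
# so the link satisfies it:
[ -d "$LINK" ] && echo "directory check passes"
```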