I have a preemptible node pool of size 1 on GKE. I've been running this node pool with size 1 for almost a month now. Every day the node restarts after 24 hours and rejoins the cluster. Today it restarted but did not rejoin the cluster.
Instead, I noticed that according to gcloud compute instances list
the underlying instance was running but not included in the output of kubectl get node
. I increased the node pool size to 2, whereupon a second instance was launched. That node immediately joined my GKE cluster and pods were scheduled onto it. The first node is still running according to gcloud
, but it won't join the cluster.
What's going on? How can I debug this this problem?
Update:
I SSHed into the instance and was immediately greeted with this excellent error message:
Broken (or in progress) Kubernetes node setup! Check the cluster initialization status
using the following commands:
Master instance:
- sudo systemctl status kube-master-installation
- sudo systemctl status kube-master-configuration
Node instance:
- sudo systemctl status kube-node-installation
- sudo systemctl status kube-node-configuration
The results of sudo systemctl status kube-node-installation
:
goto mark: ● kube-node-installation.service - Download and install k8s binaries and configurations
Loaded: loaded (/etc/systemd/system/kube-node-installation.service; enabled; vendor preset: disabled)
Active: active (exited) since Thu 2017-12-28 21:08:53 UTC; 6h ago
Process: 945 ExecStart=/home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
Process: 941 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
Process: 937 ExecStartPre=/usr/bin/curl --fail --retry 5 --retry-delay 3 --silent --show-error -H X-Google-Metadata-Request: True -o /home/kubernetes/bin/configure.sh http://metadata.google.internal/com
puteMetadata/v1/instance/attributes/configure-sh (code=exited, status=0/SUCCESS)
Process: 933 ExecStartPre=/bin/mount -o remount,exec /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Process: 930 ExecStartPre=/bin/mount --bind /home/kubernetes/bin /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Process: 925 ExecStartPre=/bin/mkdir -p /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Main PID: 945 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 4915)
Memory: 0B
CPU: 0
CGroup: /system.slice/kube-node-installation.service
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Downloading node problem detector.
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: % Total % Received % Xferd Average Speed Time Time Time Current
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Dload Upload Total Spent Left Speed
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: [158B blob data]
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: == Downloaded https://storage.googleapis.com/kubernetes-release/node-problem-detector/node-problem-detector-v0.4.1.tar.gz (SHA1 = a57a3fe
64cab8a18ec654f5cef0aec59dae62568) ==
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: cni-0799f5732f2a11b329d9e3d51b9c8f2e3759f2ff.tar.gz is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: kubernetes-manifests.tar.gz is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: mounter is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Done for installing kubernetes files
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Started Download and install k8s binaries and configurations.
And the result of sudo systemctl status kube-node-configuration
:
● kube-node-configuration.service - Configure kubernetes node
Loaded: loaded (/etc/systemd/system/kube-node-configuration.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2017-12-28 21:08:53 UTC; 6h ago
Process: 994 ExecStart=/home/kubernetes/bin/configure-helper.sh (code=exited, status=4)
Process: 990 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure-helper.sh (code=exited, status=0/SUCCESS)
Main PID: 994 (code=exited, status=4)
CPU: 33ms
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Starting Configure kubernetes node...
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Start to configure instance for kubernetes
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Configuring IP firewall rules
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Main process exited, code=exited, status=4/NOPERMISSION
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Failed to start Configure kubernetes node.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Unit entered failed state.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Failed with result 'exit-code'.
So it looks like kube-node-configuration
failed. I ran sudo systemctl restart kube-node-configuration
and now the status output is:
● kube-node-configuration.service - Configure kubernetes node
Loaded: loaded (/etc/systemd/system/kube-node-configuration.service; enabled; vendor preset: disabled)
Active: active (exited) since Fri 2017-12-29 03:41:36 UTC; 3s ago
Main PID: 20802 (code=exited, status=0/SUCCESS)
CPU: 1.851s
Dec 29 03:41:28 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Extend the docker.service configuration to set a higher pids limit
Dec 29 03:41:28 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Docker command line is updated. Restart docker to pick it up
Dec 29 03:41:30 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start kubelet
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Using kubelet binary at /home/kubernetes/bin/kubelet
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start kube-proxy static pod
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start node problem detector
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Using node problem detector binary at /home/kubernetes/bin/node-problem-detector
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Prepare containerized mounter
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Done for the configuration for kubernetes
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Started Configure kubernetes node.
...and the node joined the cluster :). But, my original question stands: what happened?