I have a Kubernetes cluster with one master and four nodes. kube-proxy was working fine on all four nodes, and I could reach a service through any node regardless of where its pod was running; i.e., http://node1:30000 through http://node4:30000 all returned the same response.
After I restarted node4 with shutdown -r now, it came back up, but I noticed that it was no longer responding to requests. This is the command I am running:
curl http://node4:30000
If I run it from my PC, or from any other node in the cluster -- node1 through node3, or master -- I get:
curl: (7) Failed to connect to node4 port 30000: Connection timed out
However, if I run it from node4, it responds successfully. This leads me to believe that kube-proxy is running fine, but something is preventing external connections.
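One way to confirm that only node4 is affected is to probe the same NodePort on every node from a machine outside the cluster. The following is a minimal sketch under stated assumptions, not something taken from the original setup: the function names describe_curl_exit and probe_nodeport are hypothetical, and the 5-second --connect-timeout is an arbitrary choice. With --connect-timeout set, curl reports a timeout as exit code 28; without it, an OS-level connect timeout surfaces as exit code 7, as in the error above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for probing one NodePort on several nodes.

# Map a curl exit code to a short human-readable description.
describe_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    6)  echo "could not resolve host" ;;
    7)  echo "failed to connect" ;;
    28) echo "timed out" ;;
    *)  echo "curl exit code $1" ;;
  esac
}

# Probe http://<host>:<port> and print "<host>:<port> <result>".
probe_nodeport() {
  local host="$1" port="$2" rc=0
  curl --silent --output /dev/null --connect-timeout 5 "http://${host}:${port}" || rc=$?
  echo "${host}:${port} $(describe_curl_exit "$rc")"
}

# Usage (node names and port taken from the question):
#   for n in node1 node2 node3 node4; do probe_nodeport "$n" 30000; done
```

Running the loop once from a PC and once from node4 itself should make the difference between local and remote access visible at a glance.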
When I run kubectl describe node node4, the output looks normal:
Name:                   node4
Labels:                 beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
                        kubernetes.io/hostname=node4
Taints:                 <none>
CreationTimestamp:      Tue, 21 Feb 2017 15:21:17 -0400
Phase:
Conditions:
  Type            Status  LastHeartbeatTime                 LastTransitionTime                Reason                      Message
  ----            ------  -----------------                 ------------------                ------                      -------
  OutOfDisk       False   Wed, 22 Feb 2017 08:03:40 -0400   Tue, 21 Feb 2017 15:21:18 -0400   KubeletHasSufficientDisk    kubelet has sufficient disk space available
  MemoryPressure  False   Wed, 22 Feb 2017 08:03:40 -0400   Tue, 21 Feb 2017 15:21:18 -0400   KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure    False   Wed, 22 Feb 2017 08:03:40 -0400   Tue, 21 Feb 2017 15:21:18 -0400   KubeletHasNoDiskPressure    kubelet has no disk pressure
  Ready           True    Wed, 22 Feb 2017 08:03:40 -0400   Tue, 21 Feb 2017 15:21:28 -0400   KubeletReady                kubelet is posting ready status. AppArmor enabled
Addresses:              10.6.81.64,10.6.81.64,node4
Capacity:
  alpha.kubernetes.io/nvidia-gpu:  0
  cpu:                             2
  memory:                          4028748Ki
  pods:                            110
Allocatable:
  alpha.kubernetes.io/nvidia-gpu:  0
  cpu:                             2
  memory:                          4028748Ki
  pods:                            110
System Info:
  Machine ID:                 dbc0bb6ba10acae66b1061f958220ade
  System UUID:                4229186F-AA5C-59CE-E5A2-258C1BBE9D2C
  Boot ID:                    a3968e6c-eba3-498c-957f-f29283af1cff
  Kernel Version:             4.4.0-63-generic
  OS Image:                   Ubuntu 16.04.1 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://1.13.0
  Kubelet Version:            v1.5.2
  Kube-Proxy Version:         v1.5.2
ExternalID:             node4
Non-terminated Pods:    (27 in total)
  Namespace    Name              CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------    ----              ------------  ----------  ---------------  -------------
  << application pods listed here >>
  kube-system  kube-proxy-0p3lj  0 (0%)        0 (0%)      0 (0%)           0 (0%)
  kube-system  weave-net-uqmr1   20m (1%)      0 (0%)      0 (0%)           0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  20m (1%)      0 (0%)      0 (0%)           0 (0%)
Is there anything specific I need to do to bring a node back online after a system restart?
My team solved this by downgrading Docker from 1.13 to 1.12. The problem appears to be related to this issue:
https://github.com/kubernetes/kubernetes/issues/40182
After the downgrade to Docker 1.12, everything is working again.
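If downgrading is not an option, it may be worth checking the default policy of the iptables FORWARD chain on the affected node. This is an assumption based on the linked issue and the Docker 1.13 change that sets the FORWARD policy to DROP, not something verified on this cluster. That change would also be consistent with the symptom: requests made from node4 itself do not pass through FORWARD, while requests arriving from other machines do. The sketch below uses a hypothetical helper, forward_policy, that extracts the policy from the output of iptables -S FORWARD.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `iptables -S FORWARD` output on stdin and
# print the chain's default policy (the line looks like "-P FORWARD DROP").
forward_policy() {
  awk '$1 == "-P" && $2 == "FORWARD" { print $3; exit }'
}

# On the affected node (requires root):
#   sudo iptables -S FORWARD | forward_policy
#
# If this prints DROP, a commonly cited workaround is to reset the policy:
#   sudo iptables -P FORWARD ACCEPT
# Note that this setting is not persistent and may be reset when Docker restarts.
```

Setting the policy to ACCEPT loosens the node's forwarding rules, so whether it is acceptable depends on the network the cluster runs in.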