We have a small private Kubernetes cluster, and everything was working until this morning. As of this morning, only kubectl is working and no application traffic is going through.
I can launch new deployments, delete them, etc., and I can see that the pods are up and running,
but when I try to access them via HTTP, AMQP, etc., I can't.
I watched our nginx logs while trying to open the homepage: nothing loaded in the browser and no entry appeared in the nginx access log, which means no traffic is reaching nginx.
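To be concrete, these are the kinds of checks I can run (the names below are placeholders, assuming the nginx Service sits in the default namespace; our actual names differ):

```bash
# Does the nginx Service still have endpoints?
kubectl get svc,endpoints nginx -n default

# Find the pod IP and try to reach the pod directly from one of the nodes
kubectl get pods -o wide -n default -l app=nginx
curl -m 5 http://<pod-ip>:80/        # <pod-ip> is a placeholder

# Try the Service's ClusterIP from a node as well
curl -m 5 http://<cluster-ip>:80/    # <cluster-ip> is a placeholder
```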
We are using Weave Net as our CNI.
I checked the DNS logs and also tested DNS directly, and it is working. I don't know where to start looking to solve this problem. Any suggestions?
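For completeness, this is roughly how I test DNS from inside the cluster (busybox:1.28 is just the image I use because its nslookup works):

```bash
# Resolve a well-known Service name from a throwaway pod
kubectl run dns-test --rm -it --image=busybox:1.28 --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local
```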
After a few hours the problem mostly resolved itself, and I can now access my applications again, but I want to ask a closely related question:
Is there a way to tell whether a problem like this comes from the network outside the cluster or from the cluster networking itself (the internal Kubernetes network)? I am asking because in the past I had a problem with the Kubernetes DNS, and this time I suspected something was wrong with the CNI.
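To make the question concrete, the kind of test I have in mind (sketched with placeholder names) is running the same request once from inside the cluster and once from outside:

```bash
# From inside the cluster: pod-to-Service and pod-to-pod traffic
kubectl run net-test --rm -it --image=busybox:1.28 --restart=Never -- sh
# then, inside that pod:
#   wget -qO- -T 5 http://<service-name>.<namespace>.svc.cluster.local/
#   ping -c 3 <pod-ip-on-another-node>

# From outside the cluster: the same application through its external address
curl -m 5 http://<external-address>/
```

My assumption is that if pod-to-pod traffic across nodes fails, the CNI is the prime suspect, and if the in-cluster tests pass but the external request fails, the problem is more likely in the ingress, load balancer, or the network in front of the cluster. Is that a reasonable way to separate the two?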
Now I see this error in weave:
ERRO: 2019/09/27 11:10:03.358321 Captured frame from MAC (d2:14:2a:47:62:d9) to (02:01:5b:b9:8e:fd) associated with another peer 4a:8d:75:d7:59:ff(serflex-argus-2)
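I saw this on the master; to check how often it happens on the other nodes I tail the weave container logs across the DaemonSet (assuming the default name=weave-net label):

```bash
kubectl logs -n kube-system -l name=weave-net -c weave --tail=100
```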
And my environment:
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:50Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider or hardware configuration: in-house private cluster consisting of 5 nodes, set up with kubeadm.
OS (e.g: cat /etc/os-release): All machines are running Ubuntu 18.04.3
Kernel (e.g. uname -a): Linux k8s-master 4.15.0-62-generic #69-Ubuntu SMP Wed Sep 4 20:55:53 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
weave status:
/home/weave # ./weave --local status
Version: 2.5.2 (up to date; next check at 2019/09/27 15:12:49)
Service: router
Protocol: weave 1..2
Name: 02:01:5b:b9:8e:fd(k8s-master)
Encryption: disabled
PeerDiscovery: enabled
Targets: 1
Connections: 5 (4 established, 1 failed)
Peers: 5 (with 20 established connections)
TrustedSubnets: none
Service: ipam
Status: ready
Docker version 19.03.2, build 6a30dfc
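Since the status shows one failed connection, the per-connection detail can be dumped the same way I took the status above, from inside the weave container (or via kubectl exec on a weave-net pod; the pod name below is a placeholder):

```bash
# inside the weave container on the master
./weave --local status connections
./weave --local status peers

# or from outside, via the weave-net DaemonSet pods
kubectl exec -n kube-system <weave-net-pod> -c weave -- /home/weave/weave --local status connections
```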
I couldn't find a solution to this problem and had to tear down the cluster and recreate it, this time with Calico. After running for a week there has been no problem.
The only thing I think could have caused the problem is the 200 MB memory limit on Weave: 4 out of 5 of my Weave pods were hitting that limit, and on Weave's GitHub I found an open issue about a memory leak. Because of this I decided to change the CNI.
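For reference, this is roughly how the limit and the pods' memory usage can be checked, and how the limit could be raised on the weave container for anyone who wants to keep Weave instead of switching (400Mi is just an example value):

```bash
# Are the weave pods restarting or sitting at their memory limit?
kubectl get pods -n kube-system -l name=weave-net
kubectl top pods -n kube-system -l name=weave-net     # needs metrics-server

# Inspect the current limit on the DaemonSet
kubectl -n kube-system describe ds weave-net | grep -i -A3 limits

# Raise the memory limit on the weave container (example value)
kubectl -n kube-system set resources ds weave-net -c weave --limits=memory=400Mi
```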