How to detect temporary network partition in Kubernetes?

11/9/2017

We have a Kubernetes cluster set up on AWS VPC with 10+ nodes. We encountered an incident where one node was not accessible to others and vice-versa for ~10 minutes. Finding this out took quite a lot of time.

Is there a tool for Kubernetes or AWS to detect these kind of network problems? Maybe something like a Daemon Set where each pod pings the others in the network and logs it when the ping fails.

-- Indrek
amazon-web-services
kubernetes

1 Answer

11/9/2017

If you are mostly interested in being alerted when such problem happens, I would set up monitoring system and hook it up with something like alertmanager. For collecting metrics, you can look at open source project such as Prometheus. Once you set this up, it is really easy to integrate it with Grafana (for dashboard) and alertmanager (for alerting based on rules you specify in Prometheus). And they are all open source projects.

https://prometheus.io/

-- kevink
Source: StackOverflow