I have a cluster on Google Cloud Container Engine with six n1-standard-1
machines.
I deployed several services and pods on this cluster, and sometimes they fail with FailedSync as the only reason
and no further explanation, so I have no idea why they fail. The virtual machines are not overloaded: only 6% of the CPU and less than 1Gi of memory are in use.
Here are some events from the describe command:
Pods filtered by "is system object: true"
have the same problem; some of them have more than 900 restarts in 4 days...
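To see which pods are restarting the most, the pod listing can be filtered by restart count. The following is only a sketch: restarts_at_least is a hypothetical helper name, not part of kubectl. It reads the table printed by kubectl get po on stdin and assumes the columns are separated by at least two spaces.

```shell
# Print "RESTARTS NAME" for every pod whose restart count is at least $1,
# highest count first. Reads `kubectl get po` table output on stdin.
# Columns are located by header name, so the extra NAMESPACE column
# printed with --all-namespaces does not matter.
restarts_at_least() {
  awk -F'  +' -v min="$1" '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME") n = i
        if ($i == "RESTARTS") r = i
      }
      next
    }
    # "$r + 0" keeps only the leading number of the RESTARTS field.
    n && r && ($r + 0) >= (min + 0) { print $r + 0, $n }
  ' | sort -rn
}

# Usage on a live cluster (not run here):
#   kubectl get po --all-namespaces | restarts_at_least 100
```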
Maybe I am missing something in my Kubernetes configuration, but I have no idea what...
Thanks for your help
I finally found the reason for the node failures. I was using a GlusterFS volume with the https://eventstore.org/ database, and I think the latency made it fail: I saw a lot of slow queries in the Event Store logs. I don't really know what happened, but since I switched to a persistent SSD disk in the same region as my cluster I have had no issues: 0 restarts for several days, and the nodes work like a charm.
I also isolated this database on a single node.
I think the best way to find the issue is to ssh to the node and run sudo docker logs $CONTAINER_ID
to see what happened to your applications.
You can tell which nodes your applications are deployed on with kubectl describe po $PO_NAME
or simply kubectl get po -o wide.
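The node lookup can be sketched as a small shell helper. This is only a sketch under stated assumptions: node_of_pod is a hypothetical name, not part of kubectl. It reads the output of kubectl get po -o wide on stdin and prints the NODE column for one pod, assuming the columns are separated by at least two spaces.

```shell
# Print the node a pod runs on, reading `kubectl get po -o wide` on stdin.
# The NODE column is located by its header name instead of a fixed
# position, because the set of columns differs between kubectl versions.
node_of_pod() {
  awk -F'  +' -v pod="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "NODE") col = i; next }
    $1 == pod && col { print $col }
  '
}

# Usage on a live cluster (not run here; the ssh step assumes the gcloud
# CLI is configured for the project and zone of the cluster):
#   NODE=$(kubectl get po -o wide | node_of_pod "$PO_NAME")
#   gcloud compute ssh "$NODE"
#   sudo docker ps -a                  # find the container ID on the node
#   sudo docker logs "$CONTAINER_ID"
```

Using docker ps -a rather than docker ps matters here, because a container that has already crashed is only listed with the -a flag.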