I am experiencing a very complicated issue with Kubernetes in my production environments losing all their Agent Nodes, they change from Ready to NotReady, all the pods change from Running to NodeLost state. I have discovered that Kubernetes is making intensive usage of disks:
My cluster is deployed using acs-engine 0.17.0 (and I tested previous versions too and the same happened).
On the other hand, we decided to deploy the Standard_DS2_VX VM series which contains Premium disks and we incresed the IOPS to 2000 (It was previously under 500 IOPS) and same thing happened. I am going to try with a higher number now.
Any help on this will be appreaciated.
It was a microservice exhauting resources and then Kubernetes just halt the nodes. We have worked on establishing resources/limits based so we can avoid the entire cluster disruption.