We are running several Kubernetes clusters on a few hundred VMs. A few VMs go down every week, and we bring them back up. Our metrics show that CPU and memory usage are low to moderate on these VMs when they go down. Other VM metrics (such as network traffic) don't show any unusual patterns either. There are no specific messages in /var/log/messages when the VMs go down.
Kubernetes version: 1.9
Linux kernel version: 4.1.12-124.19.5.el7uek.x86_64
Are there other logs or diagnostic information we can check to get to the root cause of the VM outages?
We usually also check the host's systemd journal, especially if kubelet runs as a systemd service.
There is a good tutorial on DigitalOcean explaining journald.
It might give you some clues as to why your Kubernetes nodes are crashing.
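As a rough sketch of what checking the journal could look like after a node comes back up, the script below builds a few `journalctl` queries against the previous boot (`-b -1`): the kubelet unit, kernel messages, and everything at priority `err` or worse. The unit name `kubelet` and the helper names `previous_boot_cmd` and `tail` are assumptions for illustration; adjust them to your setup. Note that journald only keeps logs across reboots when persistent storage is enabled (for example, when `/var/log/journal` exists), so on a node without it the previous boot may simply not be available.

```python
import shutil
import subprocess


def previous_boot_cmd(unit=None, kernel=False, priority=None, boot=-1):
    """Build a journalctl argument list for an earlier boot.

    boot=-1 means the boot before the current one, i.e. the one that
    ended with the outage. Helper name and defaults are hypothetical.
    """
    cmd = ["journalctl", "--no-pager", "-b", str(boot)]
    if unit:
        cmd += ["-u", unit]        # restrict to one systemd unit
    if kernel:
        cmd.append("-k")           # kernel ring buffer messages only
    if priority:
        cmd += ["-p", priority]    # this priority and more severe
    return cmd


def tail(cmd, lines=200):
    """Limit output to the last `lines` entries, closest to the crash."""
    return cmd + ["-n", str(lines)]


if __name__ == "__main__":
    queries = [
        tail(previous_boot_cmd(unit="kubelet")),  # assumes a unit named kubelet
        tail(previous_boot_cmd(kernel=True)),
        tail(previous_boot_cmd(priority="err")),
    ]
    for query in queries:
        print("$ " + " ".join(query))
        if shutil.which("journalctl"):
            # check=False: a missing previous boot should not abort the script
            subprocess.run(query, check=False)
```

Running `journalctl --list-boots` first shows which boots the journal actually retains. Looking at the last entries before the outage is usually the most useful part; if the journal just stops with no shutdown messages, that points more toward the hypervisor or the kernel than toward kubelet.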