We are using Google Kubernetes Engine on 1.9.6-gke.1, and have a cluster with several node pools for which we enable auto-scaling because the nodes in them have attached GPUs (NVIDIA P100s).
Sometimes we run jobs overnight via a Kubernetes Pod on a node that was brought up by a triggered auto-scaling event, and many hours later return to find that the pod has disappeared: it terminated in some unknown state, and because no other pod was scheduled to the node for 10 minutes, the node it ran on was drained and removed.
That is, once the node is gone the pod disappears from the perspective of the Kubernetes logs and control plane, i.e. from commands like kubectl get pods and kubectl describe pod. We would like to know the status of these pods at the time of termination, e.g. Completed, Error, or OOMKilled. Is there a way to have this pod lifecycle information logged in Google Cloud Platform, perhaps via Stackdriver or something else? If it's already available, where would we find it?
Note this is specifically for pods whose node is no longer in the cluster.
Thanks in advance!
There are two logs within Stackdriver Logging where you can check GKE activity. The first is called "GKE Cluster Operations", and the second is called "Container Logs".
The "GKE Cluster Operations" logs show all the operations that take place within the cluster, such as pod creation, container creation, and so on.
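As a rough illustration, in the Stackdriver Logs Viewer you can narrow these down with an advanced filter along these lines (the cluster and pod names here are placeholders, and the exact field names may vary by GKE version, so treat this as a starting point rather than an exact recipe):

```
resource.type="gke_cluster"
resource.labels.cluster_name="your-cluster-name"
logName:"logs/events"
jsonPayload.involvedObject.name="your-pod-name"
```

Kubernetes events exported this way can include the pod's termination reason even after the node itself has been removed, since they live in Stackdriver rather than on the node.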
The "Container Logs" record the output of a container. I created a simple job using the yaml file given here; after running the job, I went into the "Container Logs" and it showed the output of the container successfully.
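For the container output, a filter in the Logs Viewer might look like the following (again, the label values are placeholders; these resource labels match the legacy Stackdriver resource types used around GKE 1.9, and may differ if your cluster uses a newer logging integration):

```
resource.type="container"
resource.labels.cluster_name="your-cluster-name"
resource.labels.namespace_id="default"
resource.labels.pod_id="your-pod-name"
```

Because these entries are stored in Stackdriver, they remain queryable after the node and pod are gone from the cluster.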
In your case, you should be able to see the pod's status at termination in the "GKE Cluster Operations" logs within GCP.