We have been facing a rather strange issue where our pods in one of the GKE managed cloud composer cluster is unhealthy and this has resulted in multiple job failure on the past few weeks.
We are not sure for now, why would such pods enter this state and restart again after few attempts on the same node and work just like normal as if nothing ever happened.
How can this be troubleshooted?