Using TPU on GKE: Error recorded from infeed: Socket closed

5/16/2019

Once in a while, our TPUEstimator-based training job running on GKE with TPUs fails with:

    Error recorded from infeed: Socket closed
    An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed

I have two questions about that:

  1. What is happening here? I checked the pod's memory usage (roughly as shown below) and it did not spike, and the TPU allocated to the pod is still there as well.
  2. The job doesn't always surface the error to the pod: it continues to show as Running unless someone manually checks its state and restarts it. Is there a way to make it restart automatically? (A sketch of what we're considering is below.)
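
For context on question 1, this is roughly how we checked, using standard kubectl commands; the pod name here is hypothetical:

    # Container-level memory usage for the training pod.
    kubectl top pod tpu-training-pod-0 --containers

    # Confirm the TPU resource is still attached to the pod.
    kubectl describe pod tpu-training-pod-0 | grep -i tpu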
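
For question 2, here is a minimal sketch of the kind of automatic restart we have in mind: a Job with restartPolicy: OnFailure plus an exec liveness probe that fails when the training log goes stale. The name, image, log path, and thresholds below are all hypothetical, not from our actual manifest:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tpu-training              # hypothetical name
    spec:
      backoffLimit: 4
      template:
        spec:
          restartPolicy: OnFailure    # restart the container when it fails
          containers:
          - name: trainer             # hypothetical container name
            image: gcr.io/my-project/trainer:latest  # hypothetical image
            resources:
              limits:
                cloud-tpus.google.com/v2: 8   # assuming a v2-8 TPU
            # If the training log has not been written to in the last
            # 10 minutes, treat the container as hung so the kubelet
            # restarts it.
            livenessProbe:
              exec:
                command:
                - /bin/sh
                - -c
                - test -n "$(find /var/log/train.log -mmin -10)"
              initialDelaySeconds: 600
              periodSeconds: 120
              failureThreshold: 3

The idea is that the kubelet would restart the hung container even while the TensorFlow process keeps the pod in a Running state. Is that a reasonable approach, or is there a more idiomatic way to do this on GKE?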
-- Prashast
google-kubernetes-engine
tensorflow
tpu

0 Answers