Why does TensorFlow 2 stop saving checkpoints to S3 at random times in the middle of a training job?

4/8/2020

I have a custom model-training Python script that uses TF 2 to train the model and talks to S3 to save checkpoints and export the final model. I containerize the script with Docker and deploy it on a Kubernetes cluster using Kubeflow. At some random point during training, TensorFlow throws the following error:

There was no new checkpoint after the training. Eval status: no new checkpoint ('There was no new checkpoint after the training. Eval status: no new checkpoint',)
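The error text appears to come from the evaluation side of the tf.estimator.train_and_evaluate loop, which raises it when no newer checkpoint shows up in model_dir after training. For reference, here is a minimal sketch of that kind of setup with an S3 model_dir; the bucket name, model, and intervals are placeholders, not my actual script:

```python
import tensorflow as tf

# Placeholder S3 location; the real bucket and prefix differ.
MODEL_DIR = "s3://my-bucket/training/checkpoints"

def input_fn():
    # Toy dataset standing in for the real training data.
    ds = tf.data.Dataset.from_tensor_slices(
        ({"x": [[1.0], [2.0], [3.0]]}, [0, 1, 1])
    )
    return ds.repeat().batch(2)

feature_columns = [tf.feature_column.numeric_column("x")]

run_config = tf.estimator.RunConfig(
    model_dir=MODEL_DIR,
    save_checkpoints_secs=300,  # write a checkpoint to S3 every 5 minutes
    keep_checkpoint_max=5,
)

estimator = tf.estimator.LinearClassifier(
    feature_columns=feature_columns,
    config=run_config,
)

train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=10_000)
eval_spec = tf.estimator.EvalSpec(input_fn=input_fn, steps=10, throttle_secs=60)

# The evaluator raises "There was no new checkpoint after the training"
# when it cannot find a checkpoint in MODEL_DIR newer than the last one
# it already evaluated.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```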

Does anyone have any idea where this could come from? I'm at a loss trying to understand what the problem is here. I've tried the same thing with one of Kubeflow's tutorial examples and it behaves exactly the same way: after some time, for no apparent reason, the same error pops up... Thanks for any input.

-- Whynote
amazon-s3
kubeflow
kubernetes
python
tensorflow2.0

0 Answers