ResourceExhausted error while training nasnet_large on cifar10 on Kubernetes

8/7/2019

I am trying to train nasnet on cifar10 on Kubernetes. I am getting this error:

ResourceExhausted. "tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[32,1008,42,42] [[Node: cell_6/strided_slice = StridedSlice[Index=DT_INT32, T=DT_FLOAT, begin_mask=3, ellipsis_maseplica:0/task:0/device:GPU:0"](cell_6/Pad, gradients/cell_stem_1/strided_slice_grad/StridedSliceGrad-1-Layoptimizer, gradients/cell_stem_1/strided_slice_grad/StridedSliceGrad-3-LayoutOptimizer)]]"
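For scale, here is a rough sketch of how much memory the single tensor named in the message needs. It assumes 4 bytes per element, since the message reports T=DT_FLOAT, and it assumes the leading dimension (32) is the batch size. The helper name tensor_bytes is made up for illustration.

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Return the bytes needed for one dense tensor of the given shape.

    bytes_per_element=4 assumes float32 (DT_FLOAT in the error message).
    """
    return reduce(mul, shape, 1) * bytes_per_element


# Shape copied from the OOM message.
shape = (32, 1008, 42, 42)
size = tensor_bytes(shape)
print(f"{size} bytes = {size / 2**20:.1f} MiB")

# Assumption: the first dimension is the batch size, so halving the batch
# would halve this one allocation.
half_batch = tensor_bytes((16,) + shape[1:])
print(f"{half_batch} bytes = {half_batch / 2**20:.1f} MiB")
```

This is only one activation tensor. Training keeps many such tensors alive at once for the backward pass, so the total footprint is far larger than this figure.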

The repo here.

The command I used to train:

python train_image_classifier.py --train_dir=/tmp/train_logs --dataset_name=cifar10 --dataset_split_name=train --dataset_dir=tmp//data//cifar10 --model_name=nasnet_large

CUDA Version 9.0.176
TensorFlow (GPU) 1.9.0
Ubuntu 16.04

Pod resources:
CPU: 28
Memory: 64Gi
GPU (NVIDIA): 2

Question 1: What should I do to resolve this error?
Question 2: Alternatively, how can I train on a single GPU only, if I want to?

-- Titir Santra
kubernetes
tensorflow

0 Answers