Issue with Batch Prediction on Tensorflow Serving

9/26/2018

We are trying to deploy our model on Kubernetes using TensorFlow Serving. Earlier we deployed our model (SSD + Inception) on K8S with our own Docker base image that we built using Bazel. The K8S configuration was as follows:

Cluster size - 2 nodes
Per-node config - 20 GB memory, 2 GPUs, 8 vCPUs

Now we have changed our model and are using RetinaNet with ResNet-50. This time we are using the Docker base image from TensorFlow's Docker Hub (tensorflow/serving:latest-devel-gpu) with the same K8S configuration.

The problem is that earlier we were able to get predictions for 500 images per batch, and we could send these batches of 500 using multiple (unlimited) workers, but in the new deployment we are not able to send more than 100 images per batch. We are getting the following OOM error:

{'error': 'OOM when allocating tensor with shape[150,256,160,160] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc\n\t [[Node: FeatureExtractor/resnet_v1_50/resnet_v1_50/block1/unit_1/bottleneck_v1/conv3/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/resnet_v1_50/resnet_v1_50/block1/unit_1/bottleneck_v1/conv2/Relu6, FeatureExtractor/resnet_v1_50/block1/unit_1/bottleneck_v1/conv3/weights)]]\nHint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info

We checked K8S memory utilization as well, and it wasn't fully utilized (30% at most). Can anyone tell us why we are getting this out-of-memory error, and which memory TensorFlow is referring to here?

TIA

-- Arpit Tejan
google-cloud-platform
google-kubernetes-engine
kubernetes
tensorflow
tensorflow-serving

2 Answers

9/26/2018

It's not about how much RAM Kubernetes consumes.
It's about how much RAM you told Kubernetes your container would use vs. how much it actually uses.

-- samhain1138
Source: StackOverflow

9/27/2018

The problem is that you are OOMing the GPU, not the host RAM. Given the error message you posted, that single activation tensor has 150 * 256 * 160 * 160 ≈ 983 million float32 elements, which is about 3.7 GiB of GPU memory on its own; the forward pass allocates many such intermediate tensors on top of the model weights, so at that batch size you run well past the card's vRAM. Tesla cards come with either 12 or 16 GB of vRAM, and some very new ones (probably not yet available in any cloud) come with 32 GB, like the GV100, but that's a Quadro card.
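Spelled out as a quick back-of-the-envelope check (assuming 4-byte float32 elements, which is what the error message reports):

# Tensor from the OOM message: shape [150, 256, 160, 160], dtype float32
elements = 150 * 256 * 160 * 160          # ~983 million elements
size_bytes = elements * 4                 # 4 bytes per float32 element
print(f"{size_bytes / 1024**3:.2f} GiB")  # -> 3.66 GiB for this one tensor alone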

So, you have two options: either decrease the batch size (or another dimension of that huge tensor you are trying to allocate), or find the specific operation in your graph and force it to run in main memory with a device placement like:

with tf.device('/cpu:0'):
    # memory-hungry operation goes here; its tensors will be allocated in host RAM

However, this second method will only alleviate the problem, and you will OOM in some other part of the graph. Plus, by running the operation on the CPU you take a huge performance hit, without even counting the back-and-forth transfers of data between main memory and GPU memory.

So, summarizing, you should definitely consider decreasing one of the dimensions of that tensor: the batch size, one of the image dimensions (or both), or the number of channels.

The model you used before was probably not using so many output channels in its convolutional layers.
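If you go the batch-size route, the split can be done entirely on the client side, so the exported model stays untouched. A minimal sketch against the TensorFlow Serving REST API, assuming (hypothetically) the model is served under the name retinanet on the default REST port 8501:

import requests

URL = "http://localhost:8501/v1/models/retinanet:predict"  # hypothetical host and model name
BATCH_SIZE = 32  # small enough to fit in the GPU's vRAM; tune empirically

def predict_in_chunks(images):
    # images: a list of JSON-serializable image tensors (e.g. nested lists of pixel values)
    predictions = []
    for start in range(0, len(images), BATCH_SIZE):
        chunk = images[start:start + BATCH_SIZE]
        response = requests.post(URL, json={"instances": chunk})
        response.raise_for_status()
        predictions.extend(response.json()["predictions"])
    return predictions

Start with a small batch size and increase it until you approach the GPU's memory limit.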

-- marcyb5st
Source: StackOverflow