We are trying to deploy our model on Kubernetes using TensorFlow Serving. Earlier we deployed our model (SSD + Inception) on K8S with our own Docker base image that we built using Bazel. The K8S configuration was as follows:
Cluster size - 2 nodes
Per-node config - 20 GB memory, 2 GPUs, 8 vCPUs
Now we have changed our model and are using RetinaNet with ResNet50. This time we are using the Docker base image from TensorFlow's Docker Hub (tensorflow/serving:latest-devel-gpu) with the same K8S configuration.
The problem is that earlier we were able to get predictions for 500 images per batch, and we could send these 500-image batches from multiple workers (unlimited), but in the new deployment we cannot send more than 100 images per batch. We get an OOM error as follows:
{'error': 'OOM when allocating tensor with shape[150,256,160,160] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc\n\t [[Node: FeatureExtractor/resnet_v1_50/resnet_v1_50/block1/unit_1/bottleneck_v1/conv3/Conv2D = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FeatureExtractor/resnet_v1_50/resnet_v1_50/block1/unit_1/bottleneck_v1/conv2/Relu6, FeatureExtractor/resnet_v1_50/block1/unit_1/bottleneck_v1/conv3/weights)]]\nHint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info
We checked K8S memory utilization as well, and it wasn't fully utilized (maximum 30%). Can anyone tell us why we are getting this out-of-memory error, and which memory TensorFlow is referring to here?
TIA
It's not about how much RAM Kubernetes consumes. That metric is about how much RAM you told Kubernetes your container would use vs. how much it actually uses.
The real problem is that you are OOMing the GPU, not host RAM. Given the error message you posted, that single tensor alone is 150 * 256 * 160 * 160 elements * 4 bytes (float32) / 1024 / 1024 / 1024 ≈ 3.66 GiB of GPU memory, and TensorFlow has to keep many intermediate tensors like it on the GPU at once (plus the model weights and cuDNN workspace), with each concurrently served request needing its own set of activations. Tesla cards come with either 12 or 16 GB of vRAM, and some (probably not yet available in any cloud, as they are very new) come with 32 GB, like the GV100, but that's a Quadro card.
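As a quick sanity check of that arithmetic, here is a throwaway Python snippet (the shape comes from the error message; the rest is just the calculation):

# Size of the tensor from the OOM message, assuming float32 (4 bytes per element).
shape = (150, 256, 160, 160)
elements = 1
for dim in shape:
    elements *= dim
size_gib = elements * 4 / 1024 ** 3
print(round(size_gib, 2))  # ~3.66 GiB for this single intermediate tensor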
So, you have two options: either decrease the batch size (or any other dimension of that huge tensor you are trying to allocate), or find the specific operation in your graph and force it to run in main memory with something like
with tf.device('/cpu:0'):
    # operation goes here
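For context, here is a slightly fuller sketch of what that device pinning looks like in a TF 1.x-style graph (the shapes and variable name are made up for illustration, not taken from the RetinaNet graph):

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Made-up placeholder and weights, just to show where tf.device fits.
x = tf.placeholder(tf.float32, shape=[None, 160, 160, 256])
w = tf.get_variable('example_conv_weights', shape=[1, 1, 256, 1024])

with tf.device('/cpu:0'):
    # Ops created inside this block are placed on the CPU, so their
    # output tensors live in host RAM instead of GPU memory.
    y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')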
However, this second method will only alleviate the problem, and you will OOM in some other part of the graph. Plus, by running the operation on the CPU you'll take a huge performance hit, without even counting the back-and-forth transfers of data between main memory and GPU memory.
So, to summarize: you should definitely consider decreasing one of the dimensions of that tensor, be it the batch size, one of the image dimensions (or both), or the number of channels.
The model you used before was probably not using so many output channels in its convolutional layers.
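If you go the batch-size route, you don't have to change the model at all: the client can simply split the 500 images into smaller sub-batches before calling TensorFlow Serving. A minimal sketch, assuming the REST predict endpoint and a made-up server URL, model name, and payload format:

import json
import requests

# Hypothetical endpoint and sub-batch size; tune SUB_BATCH until the GPU stops OOMing.
SERVER_URL = 'http://tf-serving:8501/v1/models/retinanet:predict'
SUB_BATCH = 50

def predict_in_chunks(images):
    predictions = []
    for start in range(0, len(images), SUB_BATCH):
        chunk = images[start:start + SUB_BATCH]
        response = requests.post(SERVER_URL, data=json.dumps({'instances': chunk}))
        response.raise_for_status()
        predictions.extend(response.json()['predictions'])
    return predictions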