TensorFlow Serving on Docker: failed call to cuInit: CUresult(-1)

7/25/2018

When using the image tensorflow/serving:latest-devel-gpu on Kubernetes, the GPU isn't being used.

I'm not doing anything fancy with it; I simply pass a server.conf and the model files.

The default runtime is nvidia-docker, and my other GPU pod is able to use the GPU.
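
For reference, the launch boils down to something like the sketch below; the mount paths, port, and config file name are placeholders, and in the devel image the tensorflow_model_server binary may sit at a different path.

    # Illustrative launch of the serving image with the NVIDIA runtime.
    # Paths, port, and config file name are placeholders, not the real setup.
    docker run --rm --runtime=nvidia \
      -p 8500:8500 \
      -v /opt/models:/models \
      tensorflow/serving:latest-devel-gpu \
      tensorflow_model_server \
        --port=8500 \
        --model_config_file=/models/server.conf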

The only error in the log:

E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUresult(-1)

Something else interesting in the log:

I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
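
One way to check which libcuda the container can actually resolve, and whether the CUDA stubs directory is on the library path (the pod name below is a placeholder):

    # <serving-pod> is a placeholder; run the checks inside the serving container.
    kubectl exec -it <serving-pod> -- bash -c \
      'echo $LD_LIBRARY_PATH; ls -l /usr/local/cuda/lib64/stubs/; ldconfig -p | grep libcuda'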

-- aclokay
docker
kubernetes
tensorflow
tensorflow-serving

2 Answers

7/26/2018

There are several issues on the tracker: #394, #2882, #646.

In brief, there are several solutions that have worked; try one at a time (a quick verification sketch follows the list):

  1. Run:

    $ sudo apt-get install nvidia-modprobe  
    $ sudo reboot
  2. Run:

    $ nvidia-cuda-mps-server
  3. Run the following (driver_version_num was 384 in my case):

    $ sudo modinfo nvidia-<driver_version_num>-uvm
    $ sudo modprobe --force-modversion nvidia-<driver_version_num>-uvm
  4. Upgrade CUDA and cuDNN: I was on CUDA 8 with cuDNN 6.0
    and moved to CUDA 9 with cuDNN 7.0.
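
After applying any of options 1, 2 or 3 on the worker node, a quick sanity check (standard NVIDIA driver tooling, not part of the original steps) is:

    # Run on the worker node; output varies per setup.
    lsmod | grep -i nvidia      # nvidia and nvidia_uvm modules should be listed
    nvidia-smi                  # the driver should report the GPU without error
    ls -l /dev/nvidia*          # device nodes, including /dev/nvidia-uvm, should exist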

Since you run TensorFlow as a pod, I would guess that solutions 1, 2 and 3 should be applied to the worker node, while for solution 4 you may need to update your TensorFlow Docker image.

-- VAS
Source: StackOverflow

9/23/2018

Update your Dockerfile to remove the CUDA stub library, which can shadow the real libcuda.so.1 that the NVIDIA runtime mounts into the container:

RUN rm /usr/local/cuda/lib64/stubs/libcuda.so.1

Or extend the serving devel GPU image from Docker Hub with one extra line:

FROM tensorflow/serving:1.9.0-devel-gpu
RUN rm /usr/local/cuda/lib64/stubs/libcuda.so.1
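
A minimal way to build and smoke-test the patched image (the image tag below is a placeholder):

    # Build the patched image from the Dockerfile above.
    docker build -t tf-serving-gpu-nostub .

    # With the NVIDIA runtime, the stub should be gone and the loader should
    # now find the driver-provided libcuda.so.1.
    docker run --rm --runtime=nvidia --entrypoint bash tf-serving-gpu-nostub \
      -c 'ls -l /usr/local/cuda/lib64/stubs/; ldconfig -p | grep libcuda'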
-- Srini Reddy
Source: StackOverflow