TensorFlow distributed training on a Kubernetes cluster raises a cuDNN handle error

12/6/2019

When I try to run a distributed training job on a Kubernetes cluster using Kubeflow, it fails with CUDNN_STATUS_INTERNAL_ERROR.

Here are the details about the environment and logs:

- Host machine OS: ubuntu 18.04
- Host machine nvidia-driver: 435.21
- cuda: 10.1
- cudnn: 7.5.0
- kubeflow: 0.5.1
- kubernetes: 1.13.5
- docker: 18.03
- nvidia-docker2: 1.12
- tensorflow-gpu: 1.14.0

        with tf.compat.v1.train.MonitoredTrainingSession(
                context.getServer().target,
                is_chief=context.is_chief,
                config=context.getSessConfig(),
                checkpoint_dir=runSpace.getModelPath(),
                summary_dir=runSpace.getLogPath()) as sess:

Error logs:

] Successfully opened dynamic library libcudnn.so.7
2019-10-17 23:47:10.553186: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-10-17 23:47:10.558872: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
conv = tf.nn.conv2d(input=input_op, filter=filter, strides=[1, stride_h, stride_w, 1], padding='SAME')
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d
data_format=data_format, dilations=dilations, name=name)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above
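
The last log line suggests looking for warnings printed earlier. A minimal sketch of how that could be done from a saved log is shown here. The helper name `find_warnings_and_errors` is hypothetical, and the regular expression assumes the `timestamp: LEVEL source] message` layout seen in the two cuda_dnn.cc lines above. Lines that do not follow that layout, such as traceback lines, are skipped.

```python
import re

# Assumed layout, taken from the log lines in this post:
#   2019-10-17 23:47:10.553186: E tensorflow/.../cuda_dnn.cc:329] message
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>[IWEF]) (?P<source>[^\]\s]+)\] (?P<message>.*)$"
)


def find_warnings_and_errors(log_text):
    """Return (timestamp, level, source, message) for each W, E or F line."""
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        # Keep only warning (W), error (E) and fatal (F) records.
        if match and match.group("level") in ("W", "E", "F"):
            hits.append((
                match.group("timestamp"),
                match.group("level"),
                match.group("source"),
                match.group("message"),
            ))
    return hits


if __name__ == "__main__":
    sample = (
        "] Successfully opened dynamic library libcudnn.so.7\n"
        "2019-10-17 23:47:10.553186: E tensorflow/stream_executor/cuda/"
        "cuda_dnn.cc:329] Could not create cudnn handle: "
        "CUDNN_STATUS_INTERNAL_ERROR\n"
    )
    for hit in find_warnings_and_errors(sample):
        print(hit)
```

Running this over the full worker log, not just the excerpt, would show whether anything was reported before the first cuda_dnn.cc error.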

My session config sets gpu_options=tf.compat.v1.GPUOptions(allow_growth=True).
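
For reference, a session config with that option can be built roughly as follows in TensorFlow 1.14. This is only a sketch: `build_session_config` is a hypothetical helper, and it is an assumption that `context.getSessConfig()` returns something equivalent.

```python
def build_session_config():
    """Build a ConfigProto that lets TensorFlow grow GPU memory on demand."""
    # Imported inside the function so that importing this module
    # does not itself require TensorFlow to be installed.
    import tensorflow as tf

    # allow_growth=True: allocate GPU memory as it is needed instead of
    # reserving nearly all of it when the session starts.
    gpu_options = tf.compat.v1.GPUOptions(allow_growth=True)
    return tf.compat.v1.ConfigProto(gpu_options=gpu_options)
```

The returned object would be passed as `config=` to `MonitoredTrainingSession`, in the same position where the snippet in this post passes `context.getSessConfig()`.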

I cleaned the .nv cache and tried downgrading to tensorflow 1.10, cuda 9.0, and cudnn 7.1.2, but the problem still exists.

Any response will help! Thanks!

-- gfyulx
cudnn
kubernetes
python
tensorflow

0 Answers