I want to run a Docker container that uses a GPU (it runs a CNN to detect objects in a video), and then run that container on Kubernetes.
I can run the container with Docker alone without problems, but when I try to run it on Kubernetes, the container fails to find the GPU.
I run it using this command:
kubectl exec -it namepod /bin/bash
This is the problem that I get:
kubectl exec -it tym-python-5bb7fcf76b-4c9z6 /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
root@tym-python-5bb7fcf76b-4c9z6:/opt# cd servicio/
root@tym-python-5bb7fcf76b-4c9z6:/opt/servicio# python3 TM_Servicev2.py
Try to load cfg: /opt/darknet/cfg/yolov4.cfg, weights: /opt/yolov4.weights, clear = 0
CUDA status Error: file: ./src/dark_cuda.c : () : line: 620 : build time: Jul 30 2021 - 14:05:34
CUDA Error: no CUDA-capable device is detected
python3: check_error: Unknown error -1979678822
root@tym-python-5bb7fcf76b-4c9z6:/opt/servicio#
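Since the same image works under plain Docker, a quick first check (a diagnostic sketch, not from the original post; it assumes `nvidia-smi` is present in the image, as it is in CUDA base images) is whether the device is visible inside the Kubernetes pod at all:

```shell
# Run inside the failing pod's shell: if this errors out,
# Kubernetes never exposed the GPU to the container, and the
# problem is in the cluster setup rather than in the CNN code.
nvidia-smi
```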
EDIT: I followed all the steps in the NVIDIA Docker 2 guide and installed the NVIDIA device plugin for Kubernetes.
However, when I deploy the pod it stays in "Pending" and never actually starts. I don't get an error anymore, but it never runs. The pod appears like this:
gpu-pod 0/1 Pending 0 3m19s
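A GPU pod stuck in Pending usually means the scheduler cannot find a node that advertises the `nvidia.com/gpu` resource. These two commands (a diagnostic sketch, assuming the pod name `gpu-pod` shown above) typically reveal the cause:

```shell
# The Events section at the bottom will say why scheduling failed,
# e.g. "0/1 nodes are available: 1 Insufficient nvidia.com/gpu".
kubectl describe pod gpu-pod

# Should print a number greater than 0 for each GPU node; if it
# prints nothing, the device plugin has not registered the GPU
# with the kubelet yet.
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
```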
EDIT 2:
I ended up reinstalling everything, and now my pod appears as Completed rather than Running, like this:
default gpu-operator-test 0/1 Completed 0 62m
Answering Wiktor: when I run this command:
kubectl describe pod gpu-operator-test
I get:
Name:         gpu-operator-test
Namespace:    default
Priority:     0
Node:         pdi-mc/192.168.0.15
Start Time:   Mon, 09 Aug 2021 12:09:51 -0500
Labels:       <none>
Annotations:  cni.projectcalico.org/containerID: 968e49d27fb3d86ed7e70769953279271b675177e188d52d45d7c4926bcdfbb2
              cni.projectcalico.org/podIP:
              cni.projectcalico.org/podIPs:
Status:       Succeeded
IP:           192.168.10.81
IPs:
  IP:  192.168.10.81
Containers:
  cuda-vector-add:
    Container ID:   docker://d49545fad730b2ec3ea81a45a85a2fef323edc82e29339cd3603f122abde9cef
    Image:          nvidia/samples:vectoradd-cuda10.2
    Image ID:       docker-pullable://nvidia/samples@sha256:4593078cdb8e786d35566faa2b84da1123acea42f0d4099e84e2af0448724af1
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 09 Aug 2021 12:10:29 -0500
      Finished:     Mon, 09 Aug 2021 12:10:30 -0500
    Ready:          False
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9ktgq (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-9ktgq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:        BestEffort
Node-Selectors:   <none>
Tolerations:      node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                  node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:           <none>
I'm using this configuration file to create the pod:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "nvidia/samples:vectoradd-cuda10.2"
    resources:
      limits:
        nvidia.com/gpu: 1
Addressing two topics here:
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
This means you used a deprecated form of the kubectl exec command. The proper syntax is:
$ kubectl exec (POD | TYPE/NAME) [-c CONTAINER] [flags] -- COMMAND [args...]
See the kubectl exec reference for more details.
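Applied to the pod name from the question, the non-deprecated form would be:

```shell
# The "--" separates kubectl's own flags from the command
# to run inside the container.
kubectl exec -it tym-python-5bb7fcf76b-4c9z6 -- /bin/bash
```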
The gpu-operator-test pod is expected to run to completion: you can see that the pod's Status is Succeeded, and also:
State: Terminated
Reason: Completed
Exit Code: 0
Exit Code: 0
means that the specified container command completed successfully.
More details can be found in the official docs.
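In other words, the vectoradd sample is a one-shot test, so Completed is its normal final state. To get a pod you can keep exec-ing into (for example, for the object-detection service), the container needs a long-running command. A minimal sketch (the pod name, image tag, and command here are illustrative, not from the original post):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-shell
spec:
  restartPolicy: Never
  containers:
  - name: cuda-shell
    image: nvidia/cuda:10.2-base
    # Keep the container alive so "kubectl exec" has something to attach to.
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

Once it is Running, `kubectl exec -it gpu-shell -- nvidia-smi` should confirm the GPU is visible inside the container.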