One pod cannot access its GPU in a k8s cluster

1/16/2022

My cluster has 2 nodes with GPUs and a small master node.

The cluster nodes run the OS image Container-Optimized OS from Google.

I've installed the GPU driver installer daemon, and its pods are running:

 kubectl get pods -A
kube-system   nvidia-driver-installer-7rpff                                    1/1     Running   1          3d3h
kube-system   nvidia-driver-installer-97jg5                                    1/1     Running   1          3d3h

I have added taints to run the application on the GPU nodes and this is working correctly.

Each node runs a single pod of the application, which should be able to access the GPU.

I don't know why, but only one of the pods is actually able to access its GPU.

The other instance (which is running on its dedicated node with a GPU) spits out this message:

RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.

LD_LIBRARY_PATH is set correctly (identical in both pods, and every path is valid). The GPU also appears in the hardware listing in both pods:

H/W path    Device  Class          Description
==============================================
                    system         Computer
/0                  bus            Motherboard
...
/0/100/4            display        TU104GL [Tesla T4] <<<<<<<<<<<<<<<
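
To compare the two pods side by side, a small script along these lines could be run in each one. This is only a minimal sketch: it assumes Python is available in the image (the error above looks like a PyTorch message), and the function name gpu_env_report is hypothetical. It reports the CUDA-related environment variables, the NVIDIA device nodes visible under /dev, and any LD_LIBRARY_PATH entries that do not exist inside the container.

```python
import glob
import json
import os


def gpu_env_report(environ=None, dev_dir="/dev"):
    """Collect CUDA-related environment variables and visible NVIDIA device nodes."""
    environ = os.environ if environ is None else environ
    keys = ("CUDA_VISIBLE_DEVICES", "LD_LIBRARY_PATH", "PATH")
    report = {key: environ.get(key) for key in keys}
    # Device files such as /dev/nvidia0 and /dev/nvidiactl are expected on a working GPU pod.
    report["device_nodes"] = sorted(glob.glob(os.path.join(dev_dir, "nvidia*")))
    # List library path entries that are not directories inside this container.
    paths = (environ.get("LD_LIBRARY_PATH") or "").split(os.pathsep)
    report["missing_library_dirs"] = [p for p in paths if p and not os.path.isdir(p)]
    return report


if __name__ == "__main__":
    print(json.dumps(gpu_env_report(), indent=2))
```

Running it in the working pod and in the failing pod and diffing the two JSON outputs should show whether the difference lies in the environment or in the device files exposed to the container.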

The Helm values are:

replicaCount: 2

image:
  repository: ""
  pullPolicy: Always
  tag: ""

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""

podAnnotations: {}

podSecurityContext: {}
  # fsGroup: 2000

securityContext: {}
  # capabilities:
  #   drop:
  #   - ALL
  # readOnlyRootFilesystem: true
  # runAsNonRoot: true
  # runAsUser: 1000

service:
  type: NodePort
  port: 3030

ingress:
  enabled: true
  className: ""
  annotations:
    kubernetes.io/ingress.global-static-ip-name: xxx
  hosts:
    - host: staging-xxxx.com
      paths:
        - path: /
          pathType: Prefix

resources:
  # We usually recommend not to specify default resources and to leave this as a conscious
  # choice for the user. This also increases chances charts run on environments with little
  # resources, such as Minikube. If you do want to specify resources, uncomment the following
  # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  limits:
    cpu: 2
    memory: 16Gi
    nvidia.com/gpu: 1
  requests:
    cpu: 2
    memory: 8Gi

autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 100
  targetCPUUtilizationPercentage: 80
  # targetMemoryUtilizationPercentage: 80

nodeSelector:
  cloud.google.com/gke-nodepool: xxxx-staging-0-autoscale-0

tolerations: {}

affinity: {}
-- Avner Barr
kubernetes
kubernetes-helm
