My cluster has 2 nodes with GPUs and a small master node. The cluster nodes are running:
OS Image: Container-Optimized OS from Google
I've installed the NVIDIA GPU driver DaemonSet, and its pods are running:
kubectl get pods -A
NAMESPACE     NAME                            READY   STATUS    RESTARTS   AGE
kube-system   nvidia-driver-installer-7rpff   1/1     Running   1          3d3h
kube-system   nvidia-driver-installer-97jg5   1/1     Running   1          3d3h
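To confirm the driver DaemonSet actually exposed the GPUs to the scheduler, I check each node's allocatable `nvidia.com/gpu` count. A small stdlib sketch that parses `kubectl get nodes -o json` output (the node names and the sample JSON are made up for illustration):

```python
import json

def gpu_allocatable(nodes_json: str) -> dict:
    """Map node name -> allocatable nvidia.com/gpu count (0 if absent)."""
    nodes = json.loads(nodes_json)["items"]
    return {
        n["metadata"]["name"]: int(n["status"]["allocatable"].get("nvidia.com/gpu", "0"))
        for n in nodes
    }

# Hypothetical `kubectl get nodes -o json` output, trimmed to the relevant fields.
sample = json.dumps({"items": [
    {"metadata": {"name": "gpu-node-a"},
     "status": {"allocatable": {"cpu": "2", "nvidia.com/gpu": "1"}}},
    {"metadata": {"name": "gpu-node-b"},
     "status": {"allocatable": {"cpu": "2"}}},  # driver not ready: no GPU resource
]})

print(gpu_allocatable(sample))  # {'gpu-node-a': 1, 'gpu-node-b': 0}
```

A node whose count is 0 never had the resource advertised by the device plugin, even if the driver-installer pod on it shows Running.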
I have added taints so the application runs only on the GPU nodes, and this is working correctly.
Each GPU node runs a single pod of the application, which should be able to access the GPU.
I don't know why, but only one of the pods is actually able to access its GPU.
The other instance (which is running on its dedicated node with a GPU) spits out this message:
RuntimeError: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.
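To narrow down the "unknown error", I dump the CUDA-related environment and check whether the NVIDIA device nodes were injected into the container at all. A minimal stdlib diagnostic (the /dev path is a parameter only so it can be pointed elsewhere for testing):

```python
import glob
import os

def cuda_env_report(dev_dir: str = "/dev") -> dict:
    """Collect the facts a 'CUDA unknown error' usually hinges on."""
    return {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
        # Device nodes the device plugin must inject into the container.
        "nvidia_devices": sorted(glob.glob(os.path.join(dev_dir, "nvidia*"))),
    }

for key, value in cuda_env_report().items():
    print(f"{key}: {value}")
```

If `nvidia_devices` comes back empty in the failing pod but populated in the working one, the problem is in device injection rather than in the application.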
LD_LIBRARY_PATH
is set correctly (identical in both pods, and the paths are valid). The GPU shows up in both pods when I list the hardware (lshw):
H/W path Device Class Description
==============================================
system Computer
/0 bus Motherboard
...
/0/100/4 display TU104GL [Tesla T4] <<<<<<<<<<<<<<<
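Beyond the paths merely existing, I also verify that each LD_LIBRARY_PATH entry actually contains a libcuda shared object. A quick stdlib check (the sample path string is illustrative; in the pods it would come from `os.environ["LD_LIBRARY_PATH"]`):

```python
import os

def check_ld_library_path(ld_path: str) -> dict:
    """For each entry in an LD_LIBRARY_PATH-style string, report whether the
    directory exists and whether it holds a libcuda shared object."""
    result = {}
    for entry in filter(None, ld_path.split(":")):
        exists = os.path.isdir(entry)
        has_libcuda = exists and any(
            name.startswith("libcuda.so") for name in os.listdir(entry)
        )
        result[entry] = {"exists": exists, "has_libcuda": has_libcuda}
    return result

# /usr/local/nvidia/lib64 is where the GKE driver installer mounts the libraries.
print(check_ld_library_path("/usr/local/nvidia/lib64:/no/such/dir"))
```

lshw seeing the PCI device only proves the hardware is passed through; the driver userspace libraries can still be missing from the failing container.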
My Helm values.yaml is:
replicaCount: 2

image:
  repository: ""
  pullPolicy: Always
  tag: ""

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""

podAnnotations: {}

podSecurityContext: {}
  # fsGroup: 2000

securityContext: {}
  # capabilities:
  #   drop:
  #   - ALL
  # readOnlyRootFilesystem: true
  # runAsNonRoot: true
  # runAsUser: 1000

service:
  type: NodePort
  port: 3030

ingress:
  enabled: true
  className: ""
  annotations:
    kubernetes.io/ingress.global-static-ip-name: xxx
  hosts:
    - host: staging-xxxx.com
      paths:
        - path: /
          pathType: Prefix

resources:
  # We usually recommend not to specify default resources and to leave this as a conscious
  # choice for the user. This also increases chances charts run on environments with little
  # resources, such as Minikube. If you do want to specify resources, uncomment the following
  # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
  limits:
    cpu: 2
    memory: 16Gi
    nvidia.com/gpu: 1
  requests:
    cpu: 2
    memory: 8Gi

autoscaling:
  enabled: false
  minReplicas: 1
  maxReplicas: 100
  targetCPUUtilizationPercentage: 80
  # targetMemoryUtilizationPercentage: 80

nodeSelector:
  cloud.google.com/gke-nodepool: xxxx-staging-0-autoscale-0

tolerations: []

affinity: {}
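For reference, since the values file above leaves tolerations empty: GKE automatically applies the `nvidia.com/gpu=present:NoSchedule` taint to GPU node pools, so a matching toleration in the values file would look like this (a sketch, not necessarily what my cluster needs):

```yaml
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
```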