I'm trying to set up Kubernetes with NVIDIA GPU nodes/slaves. I followed the guide at https://docs.nvidia.com/datacenter/kubernetes-install-guide/index.html and was able to get the node to join the cluster. I then tried the kubeadm example pod below:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-base
      command: ["sleep"]
      args: ["100000"]
      extendedResourceRequests: ["nvidia-gpu"]
  extendedResources:
    - name: "nvidia-gpu"
      resources:
        limits:
          nvidia.com/gpu: 1
      affinity:
        required:
          - key: "nvidia.com/gpu-memory"
            operator: "Gt"
            values: ["8000"]
The pod fails to schedule, and the kubectl events show:
4s 2m 14 gpu-pod.15487ec0ea0a1882 Pod Warning FailedScheduling default-scheduler 0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 PodToleratesNodeTaints.
I'm using AWS EC2 instances: an m5.large for the master node and a g2.8xlarge for the slave node. Describing the GPU node also shows "nvidia.com/gpu: 4". Can anybody tell me if I'm missing any steps or configuration?
According to the AWS G2 documentation, a g2.8xlarge instance has the following resources: 4 GPUs with 4 GB of GPU memory each, 32 vCPUs, and 60 GB of RAM.

As noted in the comments, the 60 GB is standard system RAM and is used for regular computation on the CPUs. The GPU memory (4 GB per GPU, across 4 GPUs) is what nvidia/cuda containers use for GPU computation.
In your case, the pod requests more than 8000 MB (8 GB) of GPU memory per GPU via the nvidia.com/gpu-memory affinity, but each GPU on your server has only 4 GB. The cluster therefore has no node that can satisfy the request, and the pod cannot be scheduled. Either reduce the GPU memory requirement in the pod spec or use an instance type with more GPU memory per GPU.
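For example, here is a minimal sketch of the same pod spec with the affinity threshold lowered to an illustrative 3000 MB, which the 4 GB GPUs on a g2.8xlarge can satisfy (assuming, as in the NVIDIA guide's example, that nvidia.com/gpu-memory is expressed in MB):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-base
      command: ["sleep"]
      args: ["100000"]
      extendedResourceRequests: ["nvidia-gpu"]
  extendedResources:
    - name: "nvidia-gpu"
      resources:
        limits:
          nvidia.com/gpu: 1
      affinity:
        required:
          # Require more than 3000 MB of GPU memory per GPU instead of 8000,
          # so the 4 GB GPUs on a g2.8xlarge pass the check.
          - key: "nvidia.com/gpu-memory"
            operator: "Gt"
            values: ["3000"]

Dropping the affinity block entirely also works if any GPU is acceptable. If the workload genuinely needs more than 8 GB of GPU memory per GPU, the g2 family cannot provide it; a p3 instance type (16 GB per V100 GPU) would satisfy the original constraint.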