Running an example pod on Kubernetes with Nvidia GPU nodes

8/7/2018

I'm trying to set up Kubernetes with Nvidia GPU nodes/slaves. I followed the guide at https://docs.nvidia.com/datacenter/kubernetes-install-guide/index.html and was able to get the node to join the cluster. I then tried the example pod below:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-base
      command: ["sleep"]
      args: ["100000"]
      extendedResourceRequests: ["nvidia-gpu"]
  extendedResources:
    - name: "nvidia-gpu"
      resources:
        limits:
          nvidia.com/gpu: 1
      affinity:
        required:
          - key: "nvidia.com/gpu-memory"
            operator: "Gt"
            values: ["8000"]

The pod fails to schedule, and the kubectl events show:

4s          2m           14        gpu-pod.15487ec0ea0a1882        Pod                                          Warning   FailedScheduling        default-scheduler            0/2 nodes are available: 1 Insufficient nvidia.com/gpu, 1 PodToleratesNodeTaints.

I'm using AWS EC2 instances: m5.large for the master node and g2.8xlarge for the slave node. Describing the node also shows "nvidia.com/gpu: 4". Can anybody tell me if I'm missing any steps or configuration?

-- Aditya Abinash
kubernetes
nvidia-docker
tensorflow

1 Answer

8/13/2018

According to the AWS G2 documentation, g2.8xlarge servers have the following resources:

  • Four NVIDIA GRID GPUs, each with 1,536 CUDA cores and 4 GB of video memory and the ability to encode either four real-time HD video streams at 1080p or eight real-time HD video streams at 720P.
  • 32 vCPUs.
  • 60 GiB of memory.
  • 240 GB (2 x 120) of SSD storage.

As noted in the comments, the 60 GiB is ordinary system RAM, used for regular computation. g2.8xlarge instances have 4 GPUs with 4 GB of GPU memory each, and it is this GPU memory that nvidia/cuda containers use for their calculations.

In your case, the pod requests more than 8 GB of GPU memory per GPU (the nvidia.com/gpu-memory affinity with operator Gt and value 8000), but each GPU on your instance has only 4 GB, so the scheduler cannot find a node that satisfies the request and the pod stays unscheduled. Try reducing the GPU-memory requirement in the pod spec, or use an instance type with a larger amount of GPU memory.
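
If it helps, here is a minimal sketch of the same pod spec with the GPU-memory requirement lowered so that a 4 GB GPU qualifies. It keeps the extended-resource scheme from the NVIDIA install guide you followed; the threshold of 2000 is only an illustrative value, assuming the nvidia.com/gpu-memory attribute is reported in MB as in your original spec:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-base
      command: ["sleep"]
      args: ["100000"]
      extendedResourceRequests: ["nvidia-gpu"]
  extendedResources:
    - name: "nvidia-gpu"
      resources:
        limits:
          nvidia.com/gpu: 1              # one of the four GPUs on g2.8xlarge
      affinity:
        required:
          - key: "nvidia.com/gpu-memory"
            operator: "Gt"
            values: ["2000"]             # illustrative threshold below the ~4,000 MB per GRID GPU

If any GPU is acceptable, you could also drop the affinity block entirely, or request nvidia.com/gpu: 1 directly under the container's resources.limits as in the standard device-plugin examples. The "1 PodToleratesNodeTaints" part of the event most likely refers to the tainted master node, not the GPU node.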

-- Artem Golenyaev
Source: StackOverflow