I want to have an instance group that scales from 0 to x nodes, but any pod that requests a GPU stays unschedulable with Insufficient nvidia.com/gpu. Does someone see what I'm doing wrong here? This is on Kubernetes v1.9.6 with cluster-autoscaler 1.1.2.
I have two instance groups: one with CPUs, and a new one called gpus that I want to be able to scale down to 0 nodes. kops edit ig gpus shows:
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2018-05-31T09:27:31Z
  labels:
    kops.k8s.io/cluster: ci.k8s.local
  name: gpus
spec:
  cloudLabels:
    instancegroup: gpus
    k8s.io/cluster-autoscaler/enabled: ""
  image: ami-4450543d
  kubelet:
    featureGates:
      DevicePlugins: "true"
  machineType: p2.xlarge
  maxPrice: "0.5"
  maxSize: 3
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: gpus
  role: Node
  rootVolumeOptimization: true
  subnets:
  - eu-west-1c
And the autoscaler deployment has:
spec:
  containers:
  - command:
    - ./cluster-autoscaler
    - --v=4
    - --stderrthreshold=info
    - --cloud-provider=aws
    - --skip-nodes-with-local-storage=false
    - --nodes=0:3:gpus.ci.k8s.local
    env:
    - name: AWS_REGION
      value: eu-west-1
    image: k8s.gcr.io/cluster-autoscaler:v1.1.2
Now I try to deploy a simple GPU test:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: simple-gpu-test
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: "simplegputest"
    spec:
      containers:
      - name: "nvidia-smi-gpu"
        image: "nvidia/cuda:8.0-cudnn5-runtime"
        resources:
          limits:
            nvidia.com/gpu: 1 # requesting 1 GPU
        volumeMounts:
        - mountPath: /usr/local/nvidia
          name: nvidia
        command: [ "/bin/bash", "-c", "--" ]
        args: [ "while true; do nvidia-smi; sleep 5; done;" ]
      volumes:
      - hostPath:
          path: /usr/local/nvidia
        name: nvidia
I expect the instance group to go from 0 to 1, but the autoscaler logs show:
I0605 11:27:29.865576 1 scale_up.go:54] Pod default/simple-gpu-test-6f48d9555d-l9822 is unschedulable
I0605 11:27:29.961051 1 scale_up.go:86] Upcoming 0 nodes
I0605 11:27:30.005163 1 scale_up.go:146] Scale-up predicate failed: PodFitsResources predicate mismatch, cannot put default/simple-gpu-test-6f48d9555d-l9822 on template-node-for-gpus.ci.k8s.local-5829202798403814789, reason: Insufficient nvidia.com/gpu
I0605 11:27:30.005262 1 scale_up.go:175] No pod can fit to gpus.ci.k8s.local
I0605 11:27:30.005324 1 scale_up.go:180] No expansion options
I0605 11:27:30.005393 1 static_autoscaler.go:299] Calculating unneeded nodes
I0605 11:27:30.008919 1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"simple-gpu-test-6f48d9555d-l9822", UID:"3416d787-68b3-11e8-8e8f-0639a6e973b0", APIVersion:"v1", ResourceVersion:"12429157", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added)
I0605 11:27:30.031707 1 leaderelection.go:199] successfully renewed lease kube-system/cluster-autoscaler
When I start a node by setting the minimum to 1, I see that it has the capacity:
Capacity:
  cpu:             4
  memory:          62884036Ki
  nvidia.com/gpu:  1
  pods:            110
and labels:
Labels:  beta.kubernetes.io/arch=amd64
         beta.kubernetes.io/instance-type=p2.xlarge
         beta.kubernetes.io/os=linux
         failure-domain.beta.kubernetes.io/region=eu-west-1
         failure-domain.beta.kubernetes.io/zone=eu-west-1c
         kops.k8s.io/instancegroup=gpus
         kubernetes.io/role=node
         node-role.kubernetes.io/node=
         spot=true
The required tag is present on the AWS Auto Scaling group:
{
    "ResourceId": "gpus.ci.k8s.local",
    "ResourceType": "auto-scaling-group",
    "Key": "k8s.io/cluster-autoscaler/node-template/label/kops.k8s.io/instancegroup",
    "Value": "gpus",
    "PropagateAtLaunch": true
}
Finally, when I set the minimum pool size to 1, it scales from 1 to 3 automatically; it just doesn't go from 0 to 1.
Is there some way I can inspect the template to see why it doesn't have the resource?
Also make sure there is a non-zero EC2 instance limit for the desired instance types (here p2.xlarge) in the AWS account; with a limit of zero, no scale-up can ever launch an instance.
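As a quick sanity check, the account-wide instance limit can be read with the AWS CLI (a sketch; the region is taken from the question, and this shows only the overall on-demand cap, not per-type limits):

```
aws ec2 describe-account-attributes \
    --attribute-names max-instances \
    --region eu-west-1
```

If the reported value is 0, the Auto Scaling group can never launch a node regardless of the autoscaler configuration.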
Cluster Autoscaler is a standalone program that adjusts the size of a Kubernetes cluster to meet current needs; it can manage GPU resources provided by the cloud provider in the same manner as CPU and memory.
Based on the cluster-autoscaler documentation, on AWS it is possible to scale a node group to 0 (and, obviously, from 0), assuming that all scale-down conditions are met.
Going back to your question: on AWS, if you are using nodeSelector, you need to tag the ASG so the autoscaler can build a node template, using tag keys with the prefix "k8s.io/cluster-autoscaler/node-template/label/". Please note that Kubernetes and AWS GPU support require different labels.
For example, for a node label of foo=bar, you would tag the ASG with:
{
    "ResourceId": "foo.example.com",
    "ResourceType": "auto-scaling-group",
    "Key": "k8s.io/cluster-autoscaler/node-template/label/foo",
    "Value": "bar",
    "PropagateAtLaunch": true
}
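Note that the failing predicate in your logs is on an extended resource (nvidia.com/gpu), not a label. Later cluster-autoscaler releases support an analogous tag prefix for resource hints, "k8s.io/cluster-autoscaler/node-template/resources/"; whether the 1.1.2 template builder honors it is not guaranteed, so treat this as a sketch rather than a confirmed fix:

```
{
    "ResourceId": "gpus.ci.k8s.local",
    "ResourceType": "auto-scaling-group",
    "Key": "k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu",
    "Value": "1",
    "PropagateAtLaunch": true
}
```

With such a tag in place, the node template built for scale-from-0 would advertise one nvidia.com/gpu in its capacity, which is exactly what the PodFitsResources check reported as missing.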