How to make cluster auto-scaling work on GKE/DigitalOcean for a Job kind with varying requested memory?

3/22/2020

I have a 1-node Kubernetes cluster on DigitalOcean (1 CPU / 2 GB RAM per node)
and a 3-node cluster on Google Cloud (1 CPU / 2 GB RAM per node). I ran two jobs separately on each cloud platform with auto-scaling enabled.

The first job had a memory request of 200Mi:

apiVersion: batch/v1
kind: Job
metadata:
  name: scaling-test
spec:
  parallelism: 16
  template:
    metadata:
      name: scaling-test
    spec:
      containers:
        - name: debian
          image: debian
          command: ["/bin/sh","-c"]
          args: ["sleep 300"]
          resources:
            requests:
              cpu: "100m"
              memory: "200Mi"
      restartPolicy: Never

More nodes (1 CPU / 2 GB RAM each) were added to the cluster automatically, and after the job completed the extra nodes were deleted automatically.
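
For anyone reproducing this, the job can be applied and the scale-up watched roughly like so (a minimal sketch; the file name is assumed, and the cluster-autoscaler-status ConfigMap is published by the autoscaler on GKE but may not exist on every setup):

kubectl apply -f scaling-test.yaml     # file name assumed
kubectl get nodes -w                   # watch nodes being added and removed
kubectl -n kube-system describe configmap cluster-autoscaler-status   # autoscaler's view of its node groups, if present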

After that, I ran a second job with a memory request of 4500Mi:

apiVersion: batch/v1
kind: Job
metadata:
  name: scaling-test2
spec:
  parallelism: 3
  template:
    metadata:
      name: scaling-test2
    spec:
      containers:
        - name: debian
          image: debian
          command: ["/bin/sh","-c"]
          args: ["sleep 5"]
          resources:
            requests:
              cpu: "100m"
              memory: "4500Mi"
      restartPolicy: Never

When I checked later, the job's pods remained in the Pending state. I checked the pod's Events log and saw the following errors:

0/5 nodes are available: 5 Insufficient memory    **source: default-scheduler**
pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient memory **source: cluster-autoscaler**
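
For context, the "Insufficient memory" message can be checked against what each node can actually offer to pods; a 2 GB node has well under 2Gi allocatable after system reservations, so a 4500Mi request cannot fit on it. A rough sketch (the node name is a placeholder):

kubectl get nodes -o custom-columns=NAME:.metadata.name,ALLOCATABLE_MEM:.status.allocatable.memory
kubectl describe node <node-name> | grep -A 7 Allocatable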

The cluster did not auto-scale to satisfy the resources requested by the job. Is this possible using Kubernetes?

-- Doe
digital-ocean
google-kubernetes-engine
kubernetes

1 Answer

3/22/2020

CA doesn't add nodes to the cluster if it wouldn't make a pod schedulable. It only considers adding nodes to node groups for which it was configured, and the nodes it adds are the same size as the existing ones in that group. So one reason it doesn't scale up the cluster may be that the pod's request is too large to fit on any node it could add (e.g. a 4500Mi memory request won't fit on a 1 CPU / 2 GB node). Another possible reason is that all suitable node groups are already at their maximum size.
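
If this is GKE, a common way around it (not from the answer above, just a sketch with assumed names) is to add a separate auto-scaled node pool whose machine type is large enough for the 4500Mi request, so the autoscaler has a node group it can actually grow for those pods. DigitalOcean's managed Kubernetes offers the equivalent via an additional node pool with a larger Droplet size and auto-scaling enabled.

# Node pool and cluster names are hypothetical; e2-standard-2 (2 vCPU / 8 GB RAM)
# leaves enough allocatable memory for a 4500Mi request after system reservations.
gcloud container node-pools create big-mem-pool \
  --cluster my-cluster \
  --machine-type e2-standard-2 \
  --num-nodes 1 \
  --enable-autoscaling --min-nodes 0 --max-nodes 3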

-- Arghya Sadhu
Source: StackOverflow