Cluster-autoscaler not scaling up from 0 on Azure with ACS-Engine

9/20/2018

I am using acs-engine to build a Kubernetes cluster in Azure, with VMSS (virtual machine scale sets) for the agent pools. After the cluster is up, I add the cluster-autoscaler to manage 2 dedicated agent pools, 1 CPU and 1 GPU. Scale-down and scale-up both work as long as the scale set still has running VMs in it. Both scale sets are allowed to scale down to 0, and I have set them up in acs-engine with taints and custom labels. Once a scale set has scaled down to 0, however, I cannot get the autoscaler to spin a node back up when a new pod is scheduled. I am not sure what I'm doing wrong, or whether I am missing some config, label, taint, etc. I only started using Kubernetes recently.

Below are the logs from the autoscaler, the kubectl describe output for the pod, my acs-engine JSON, and the pod definition.

Output from kubectl logs -n kube-system cluster-autoscaler-5967b96496-jnvjr

I0920 16:11:14.925761       1 scale_up.go:249] Pod default/my-test-pod is unschedulable
I0920 16:11:14.999323       1 utils.go:196] Pod my-test-pod can't be scheduled on k8s-pool2-24760778-vmss, predicate failed: GeneralPredicates predicate mismatch, cannot put default/my-test-pod on template-node-for-k8s-pool2-24760778-vmss-6220731686255962863, reason: node(s) didn't match node selector
I0920 16:11:14.999408       1 utils.go:196] Pod my-test-pod can't be scheduled on k8s-pool3-24760778-vmss, predicate failed: GeneralPredicates predicate mismatch, cannot put default/my-test-pod on template-node-for-k8s-pool3-24760778-vmss-3043543739698957784, reason: node(s) didn't match node selector
I0920 16:11:14.999442       1 scale_up.go:376] No expansion options

Output from kubectl describe pod my-test-pod

Name:               my-test-pod
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             <none>
Annotations:        kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"my-test-pod","namespace":"default"},"spec":{"affinity":{"nodeAffinity":{"preferred...
Status:             Pending
IP:
Containers:
  my-test-pod:
    Image:      ubuntu:latest
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      -ec
      while :; do echo '.'; sleep 5; done
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qzm6s (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-qzm6s:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-qzm6s
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  agentpool=pool2
                 environment=DEV
                 hardware=cpu-spec
                 node-template=k8s-pool2-24760778-vmss
                 vmSize=Standard_D4s_v3
Tolerations:     dedicated=pool2:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                 From                Message
  ----     ------             ----                ----                -------
  Warning  FailedScheduling   2m (x273 over 17m)  default-scheduler   0/3 nodes are available: 3 node(s) didn't match node selector.
  Normal   NotTriggerScaleUp  2m (x89 over 17m)   cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added)
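
For readers unfamiliar with the check behind "didn't match node selector": a nodeSelector matches a node only when every key/value pair in it appears among the node's labels. Below is a minimal Python sketch of that rule, not the scheduler's or the autoscaler's actual code. The pod selector is copied from the describe output above; the pool3-style label values are made up for illustration (the real values come from Terraform variables that are not shown), and the helper names are hypothetical.

```python
def selector_matches(node_selector, node_labels):
    """True only if every selector key/value pair is present on the node."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())


def missing_labels(node_selector, node_labels):
    """Selector entries the node does not satisfy (absent or different value)."""
    return {k: v for k, v in node_selector.items() if node_labels.get(k) != v}


# Node-Selectors from the `kubectl describe pod my-test-pod` output above.
pod_selector = {
    "agentpool": "pool2",
    "environment": "DEV",
    "hardware": "cpu-spec",
    "node-template": "k8s-pool2-24760778-vmss",
    "vmSize": "Standard_D4s_v3",
}

# ASSUMPTION: invented labels for a pool3-like node, for illustration only.
pool3_like_labels = {
    "agentpool": "pool3",
    "environment": "DEV",
    "hardware": "gpu-spec",
    "vmSize": "Standard_NC6",
    "dedicatedOnly": "true",
}

print(selector_matches(pod_selector, pool3_like_labels))        # False
print(sorted(missing_labels(pod_selector, pool3_like_labels)))
# ['agentpool', 'hardware', 'node-template', 'vmSize']
```

Feeding it the labels of a real node (for example from kubectl get nodes --show-labels) shows which selector entries a given node fails. It does not reveal which labels the autoscaler put on its template node for an empty scale set.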

acs-engine config file (a template rendered with Terraform, hence the ${...} placeholders)

{
    "apiVersion": "vlabs",
    "properties": {
      "orchestratorProfile": {
        "orchestratorType": "Kubernetes",
        "orchestratorRelease": "1.11",
        "kubernetesConfig": {
          "networkPlugin": "azure",
          "clusterSubnet": "${cidr}",
          "privateCluster": {
            "enabled": true
          },
          "addons": [
            {
              "name": "nvidia-device-plugin",
              "enabled": true
            },
            {
              "name": "cluster-autoscaler",
              "enabled": true,
              "config": {
                "minNodes": "0",
                "maxNodes": "2",
                "image": "gcr.io/google-containers/cluster-autoscaler:1.3.1"
              }
            }
          ]
        }
      },
      "masterProfile": {
        "count": ${master_vm_count},
        "dnsPrefix": "${dns_prefix}",
        "vmSize": "${master_vm_size}",
        "storageProfile": "ManagedDisks",
        "vnetSubnetId": "${pool_subnet_id}",
        "firstConsecutiveStaticIP": "${first_master_ip}",
        "vnetCidr": "${cidr}"
      },
      "agentPoolProfiles": [
        {
          "name": "pool3",
          "count": ${dedicated_vm_count},
          "vmSize": "${dedicated_vm_size}",
          "storageProfile": "ManagedDisks",
          "OSDiskSizeGB": 31,
          "vnetSubnetId": "${pool_subnet_id}",
          "customNodeLabels": {
              "vmSize":"${dedicated_vm_size}",
              "dedicatedOnly": "true",
              "environment":"${environment}",
              "hardware": "${dedicated_spec}"
          },
          "availabilityProfile": "VirtualMachineScaleSets",
          "scaleSetEvictionPolicy": "Delete",
          "kubernetesConfig": {
            "kubeletConfig": {
              "--register-with-taints": "dedicated=pool3:NoSchedule"
            }
          }
        },
        {
          "name": "pool2",
          "count": ${pool2_vm_count},
          "vmSize": "${pool2_vm_size}",
          "storageProfile": "ManagedDisks",
          "OSDiskSizeGB": 31,
          "vnetSubnetId": "${pool_subnet_id}",
          "availabilityProfile": "VirtualMachineScaleSets",
          "customNodeLabels": {
              "vmSize":"${pool2_vm_size}",
              "environment":"${environment}",
              "hardware": "${pool_spec}"
          },
          "kubernetesConfig": {
            "kubeletConfig": {
              "--register-with-taints": "dedicated=pool2:NoSchedule"
            }
          }
        },
        {
          "name": "pool1",
          "count": ${pool1_vm_count},
          "vmSize": "${pool1_vm_size}",
          "storageProfile": "ManagedDisks",
          "OSDiskSizeGB": 31,
          "vnetSubnetId": "${pool_subnet_id}",
          "availabilityProfile": "VirtualMachineScaleSets",
          "customNodeLabels": {
              "vmSize":"${pool1_vm_size}",
              "environment":"${environment}",
              "hardware": "${pool_spec}"
          }
        }
      ],
      "linuxProfile": {
        "adminUsername": "${admin_user}",
        "ssh": {
          "publicKeys": [
            {
              "keyData": "${ssh_key}"
            }
          ]
        }
      },
      "servicePrincipalProfile": {
        "clientId": "${service_principal_client_id}",
        "secret": "${service_principal_client_secret}"
      }
    }
  }

Pod config file

apiVersion: v1
kind: Pod
metadata:
  name: my-test-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: vmSize
            operator: In
            values:
              - Standard_D4s_v3
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: hardware
                operator: In
                values:
                - cpu-spec
  nodeSelector:
    agentpool: pool2
    hardware: cpu-spec
    vmSize: Standard_D4s_v3
    environment: DEV
    node-template: k8s-pool2-24760778-vmss
  tolerations:
    - key: dedicated
      operator: Equal
      value: pool2
      effect: NoSchedule
  containers:
    - name: my-test-pod
      image: ubuntu:latest
      command: ["/bin/bash", "-ec", "while :; do echo '.'; sleep 5; done"]
  restartPolicy: Never
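
As a sanity check on the taint side, here is a rough Python sketch of how a toleration is compared with a taint, under the usual Kubernetes rules: the effect must match unless the toleration leaves it empty, and the key and value must match unless the operator is Exists. The taint strings are the --register-with-taints values from the acs-engine config above. The helper names are hypothetical, and the parser handles a single key=value:Effect entry only, not a comma-separated list.

```python
def parse_taint(spec):
    """Parse one 'key=value:Effect' entry into a dict."""
    key_value, effect = spec.rsplit(":", 1)
    key, _, value = key_value.partition("=")
    return {"key": key, "value": value, "effect": effect}


def tolerates(toleration, taint):
    """Simplified toleration/taint match following the documented rules."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False  # an empty effect on the toleration matches any effect
    key = toleration.get("key")
    if toleration.get("operator", "Equal") == "Exists":
        return not key or key == taint["key"]  # empty key + Exists matches all
    return key == taint["key"] and toleration.get("value", "") == taint["value"]


# Toleration from the pod config above.
pod_toleration = {
    "key": "dedicated",
    "operator": "Equal",
    "value": "pool2",
    "effect": "NoSchedule",
}

print(tolerates(pod_toleration, parse_taint("dedicated=pool2:NoSchedule")))  # True
print(tolerates(pod_toleration, parse_taint("dedicated=pool3:NoSchedule")))  # False
```

Under these rules the pod tolerates the pool2 taint but not the pool3 one, which is consistent with the logs reporting a node selector mismatch rather than a taint problem.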

I've tried variations of the nodeAffinity, nodeSelector, and tolerations, adding and removing them, all with the same outcome.

After the cluster is up, I do add pool2 to the autoscaler. In searching the Internet for a solution, I keep running across posts about a node-template label, I think in the form of k8s.io/autoscaler/cluster-autoscaler/node-template/label/value, but that seems to be needed for AWS.

Can anyone provide me any direction with this on Azure?

Thank you.

-- J. Crippen
azure
kubernetes

1 Answer

9/24/2018

Update.

I have figured out the answer to this. By removing the requiredDuringSchedulingIgnoredDuringExecution node affinity rule and using only the preferredDuringSchedulingIgnoredDuringExecution rule, the autoscaler properly spins up a new VM in the scale set.

apiVersion: v1
kind: Pod
metadata:
  name: my-test-pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: hardware
                operator: In
                values:
                - cpu-spec
  nodeSelector:
    agentpool: pool2
    hardware: cpu-spec
    vmSize: Standard_D4s_v3
    environment: DEV
    node-template: k8s-pool2-24760778-vmss
  tolerations:
    - key: dedicated
      operator: Equal
      value: pool2
      effect: NoSchedule
  containers:
    - name: my-test-pod
      image: ubuntu:latest
      command: ["/bin/bash", "-ec", "while :; do echo '.'; sleep 5; done"]
  restartPolicy: Never
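
To illustrate the difference between the two rule types, here is a small Python sketch, not Kubernetes code: a required rule acts as a hard filter on nodes, while a preferred rule only adds weight when ranking nodes that already passed. Only the In operator is modeled, nodeSelector is left out, and both sets of node labels are invented for the example, so this is not a diagnosis of what the autoscaler's template node looked like.

```python
def expr_matches(expr, labels):
    """Evaluate one matchExpressions entry; only the In operator is modeled."""
    if expr["operator"] != "In":
        raise NotImplementedError("sketch only handles operator: In")
    return labels.get(expr["key"]) in expr["values"]


def term_matches(term, labels):
    """All expressions must hold; an empty term matches nothing in this sketch."""
    exprs = term.get("matchExpressions", [])
    return bool(exprs) and all(expr_matches(e, labels) for e in exprs)


def passes_required(node_affinity, labels):
    """Hard filter: if a required rule exists, at least one term must match."""
    required = node_affinity.get("requiredDuringSchedulingIgnoredDuringExecution")
    if required is None:
        return True
    return any(term_matches(t, labels) for t in required["nodeSelectorTerms"])


def preferred_score(node_affinity, labels):
    """Soft ranking: sum the weights of the preferences that match."""
    preferred = node_affinity.get("preferredDuringSchedulingIgnoredDuringExecution", [])
    return sum(p["weight"] for p in preferred if term_matches(p["preference"], labels))


hardware_expr = {"key": "hardware", "operator": "In", "values": ["cpu-spec"]}
vmsize_expr = {"key": "vmSize", "operator": "In", "values": ["Standard_D4s_v3"]}

# nodeAffinity as in the working pod spec (preferred rule only).
preferred_only = {
    "preferredDuringSchedulingIgnoredDuringExecution": [
        {"weight": 1, "preference": {"matchExpressions": [hardware_expr]}}
    ]
}
# nodeAffinity as in the original pod spec (required rule added).
with_required = dict(
    preferred_only,
    requiredDuringSchedulingIgnoredDuringExecution={
        "nodeSelectorTerms": [{"matchExpressions": [vmsize_expr]}]
    },
)

# ASSUMPTION: hypothetical node labels, chosen only to show both outcomes.
unlabeled_node = {"kubernetes.io/hostname": "example-node"}
labeled_node = {"vmSize": "Standard_D4s_v3", "hardware": "cpu-spec"}

print(passes_required(with_required, unlabeled_node))   # False
print(passes_required(preferred_only, unlabeled_node))  # True
print(preferred_score(preferred_only, unlabeled_node))  # 0
print(preferred_score(preferred_only, labeled_node))    # 1
```

A node that fails a required rule is rejected outright, whereas a node that fails a preferred rule is still eligible and just scores lower.
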
-- J. Crippen
Source: StackOverflow