Deploying AKS cluster with ARM template sporadically fails with PutNetworkSecurityGroupOperation error

12/10/2018

I am deploying an AKS cluster using an Azure template. Most of the time deploying the AKS cluster succeeds. But sometimes, with the same inputs the deployment fails with Operation PutNetworkSecurityGroupOperation (XXXXXXXX) was canceled and superseded by operation PutNetworkSecurityGroupOperation. The azure template and deployment error are included below. What could cause this issue?

Template

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "resourceGroupName": {
      "type": "string",
      "metadata": {
        "description": "The resource group name."
      }
    },
    "subscriptionId": {
      "type": "string",
      "metadata": {
        "description": "The subscription id."
      }
    },
    "region": {
      "type": "string",
      "metadata": {
        "description": "The region of AKS resource."
      }
    },
    "gbPerNode": {
      "type": "int",
      "defaultValue": 20,
      "metadata": {
        "description": "Disk size (in GB) to provision for each of the agent pool nodes. This value ranges from 0 to 1023. Specifying 0 will apply the default disk size for that agentVMSize."
      },
      "minValue": 1,
      "maxValue": 1023
    },
    "numNodes": {
      "type": "int",
      "defaultValue": 3,
      "metadata": {
        "description": "The number of agent nodes for the cluster."
      },
      "minValue": 1,
      "maxValue": 50
    },
    "machineType": {
      "type": "string",
      "defaultValue": "Standard_D2_v2",
      "metadata": {
        "description": "The size of the Virtual Machine."
      }
    },
    "servicePrincipalClientId": {
      "metadata": {
        "description": "Client ID (used by cloudprovider)"
      },
      "type": "securestring"
    },
    "servicePrincipalClientSecret": {
      "metadata": {
        "description": "The Service Principal Client Secret."
      },
      "type": "securestring"
    },
    "osType": {
      "type": "string",
      "defaultValue": "Linux",
      "allowedValues": [
        "Linux"
      ],
      "metadata": {
        "description": "The type of operating system."
      }
    },
    "kubernetesVersion": {
      "type": "string",
      "defaultValue": "1.11.5",
      "metadata": {
        "description": "The version of Kubernetes."
      }
    },
    "maxPods": {
      "type": "int",
      "defaultValue": 30,
      "metadata": {
        "description": "Maximum number of pods that can run on a node."
      }
    }
  },
  "variables": {
    "deploymentEventTopic": "deploymenteventtopic",
    "resourceGroupName": "[parameters('resourceGroupName')]",
    "omswsName": "[concat('omsws-', parameters('resourceGroupName'))]",
    "clustername": "cluster"
  },
  "resources": [
    {
      "apiVersion": "2018-03-31",
      "type": "Microsoft.ContainerService/managedClusters",
      "location": "[parameters('region')]",
      "name": "[variables('clustername')]",
      "properties": {
        "kubernetesVersion": "[parameters('kubernetesVersion')]",
        "enableRBAC": true,
        "dnsPrefix": "clust",
        "addonProfiles": {
          "httpApplicationRouting": {
            "enabled": true
          },
          "omsagent": {
            "enabled": false
          }
        },
        "agentPoolProfiles": [
          {
            "name": "agentpool",
            "osDiskSizeGB": "[parameters('gbPerNode')]",
            "count": "[parameters('numNodes')]",
            "vmSize": "[parameters('machineType')]",
            "osType": "[parameters('osType')]",
            "storageProfile": "ManagedDisks"
          }
        ],
        "servicePrincipalProfile": {
          "ClientId": "[parameters('servicePrincipalClientId')]",
          "Secret": "[parameters('servicePrincipalClientSecret')]"
        },
        "networkProfile": {
          "networkPlugin": "kubenet"
        }
      }
    }
  ]
}

Error

{
   "code":"DeploymentFailed",
   "message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details.",
   "details":[
      {
         "code":"Conflict",
         "message":"{\r\n \"status\": \"Failed\",\r\n \"error\": {\r\n \"code\": \"ResourceDeploymentFailure\",\r\n \"message\": \"The resource operation completed with terminal provisioning state 'Failed'.\",\r\n \"details\": [\r\n {\r\n \"code\": \"Canceled\",\r\n \"message\": \"Operation was canceled.\",\r\n \"details\": [\r\n {\r\n \"code\": \"Canceled\",\r\n \"message\": \"Operation was canceled.\",\r\n \"details\": [\r\n {\r\n \"code\": \"CanceledAndSupersededDueToAnotherOperation\",\r\n \"message\": \"Operation PutNetworkSecurityGroupOperation (XXXXXXX) was canceled and superseded by operation PutNetworkSecurityGroupOperation (XXXXX).\"\r\n }\r\n ]\r\n }\r\n ]\r\n }\r\n ]\r\n }\r\n}"
      }
   ]
}
-- ilooner
azure
azure-kubernetes
azure-resource-manager

1 Answer

4/6/2019

The cause seems to be deploying the AKS cluster with httpApplicationRouting routing enabled. To work around the issue I deploy the cluster without httpApplicationRouting in the template and then programmatically enable it after the cluster is deployed using the azure java sdk.

final KubernetesCluster kCluster = serviceManager.kubernetesClusters()
  .getByResourceGroup(resourceGroupName, deploymentName);

final Map<String, ManagedClusterAddonProfile> addonProfileMap = new HashMap<>();
addonProfileMap.put("httpApplicationRouting", 
   new ManagedClusterAddonProfile().withEnabled(true));

kCluster.update()
  .withAddOnProfiles(addonProfileMap)
  .apply();

I opened a support ticket with Azure and the support engineer confirmed this is a bug that the AKS team is working on fixing. So there should be a resolution soon if you don't want to implement the workaround.

-- ilooner
Source: StackOverflow