GitLab job reports "Job succeeded" but did not finish (create / delete Azure AKS)

10/6/2019

I am using a GitLab runner to create an AKS cluster on the fly and also to delete the previous one.

Unfortunately these jobs take a while, and I have now frequently experienced that the job stops suddenly (in the range of 5+ minutes after an az aks delete or az aks create call).

This happens in GitLab, and after several retries it usually works once.

After some googling I found that before_script and after_script might have an impact, but even after removing them there was no difference.

Are there any runner rules or particular settings that might need to be changed? It would be more understandable if it stopped with a timeout error, but it reports the job as succeeded even though it did not finish running through all the lines. Below is the stage causing the issue:

create-kubernetes-az:
  stage: create-kubernetes-az
  image: microsoft/azure-cli:latest
#  when: manual
  script:
    # REQUIRE CREATED SERVICE PRINCIPAL
    - az login --service-principal -u ${AZ_PRINC_USER} -p ${AZ_PRINC_PASSWORD} --tenant ${AZ_PRINC_TENANT}
    # Create Resource Group
    - az group create --name ${AZ_RESOURCE_GROUP} --location ${AZ_RESOURCE_LOCATION}
# ERROR HAPPENS HERE # Delete Kubernetes Cluster // SOMETIMES STOPS AFTER THIS
    - az aks delete --resource-group ${AZ_RESOURCE_GROUP} --name ${AZ_AKS_TEST_CLUSTER} --yes
# OR HERE # Create Kubernetes Cluster // SOMETIMES STOPS AFTER THIS
    - az aks create --name ${AZ_AKS_TEST_CLUSTER} --resource-group ${AZ_RESOURCE_GROUP} --node-count ${AZ_AKS_TEST_NODECOUNT} --service-principal ${AZ_PRINC_USER} --client-secret ${AZ_PRINC_PASSWORD} --generate-ssh-keys 
    # Get kubectl
    - az aks install-cli
    # Get Login Credentials
    - az aks get-credentials --name ${AZ_AKS_TEST_CLUSTER} --resource-group ${AZ_RESOURCE_GROUP}
    # Install Helm and Tiller on Azure Cloud Shell
    - curl https://raw.githubusercontent.com/kubernetes/helm/master/scripts/get > get_helm.sh
    - chmod 700 get_helm.sh
    - ./get_helm.sh
    - helm init
    - kubectl create serviceaccount --namespace kube-system tiller
    - kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
    - kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'
    # Create a namespace for your ingress resources
    - kubectl create namespace ingress-basic
    # Wait 1 minute
    - sleep 60
    # Use Helm to deploy an NGINX ingress controller
    - helm install stable/nginx-ingress --namespace ingress-basic --set controller.replicaCount=2 --set controller.nodeSelector."beta\.kubernetes\.io/os"=linux --set defaultBackend.nodeSelector."beta\.kubernetes\.io/os"=linux
    # Test by get public IP
    - kubectl get service
    - kubectl get service -l app=nginx-ingress --namespace ingress-basic
    #- while [ "$(kubectl get service -l app=nginx-ingress --namespace ingress-basic | grep pending)" == "pending" ]; do echo "Updating"; sleep 1 ; done && echo "Finished"
    - while [ "$(kubectl get service -l app=nginx-ingress --namespace ingress-basic -o jsonpath='{.items[*].status.loadBalancer.ingress[*].ip}')" == "" ]; do echo "Updating"; sleep 10 ; done && echo "Finished"
    # Add Ingress Ext IP / Alternative
    - KUBip=$(kubectl get service -l app=nginx-ingress --namespace ingress-basic -o jsonpath='{.items[*].status.loadBalancer.ingress[*].ip}')
    - echo $KUBip
    # Add DNS name - TODO - GitLab env variables are not working here
    - DNSNAME="bl-test"
    # Get the resource-id of the public ip
    - PUBLICIPID=$(az network public-ip list --query "[?ipAddress!=null]|[?contains(ipAddress, '$KUBip')].[id]" --output tsv)
    - echo $PUBLICIPID
    - az network public-ip update --ids $PUBLICIPID --dns-name $DNSNAME
    #Install CertManager Console
    # Install the CustomResourceDefinition resources separately
    - kubectl apply -f https://raw.githubusercontent.com/jetstack/cert-manager/release-0.8/deploy/manifests/00-crds.yaml
    # Create the namespace for cert-manager
    - kubectl create namespace cert-manager
    # Label the cert-manager namespace to disable resource validation
    - kubectl label namespace cert-manager certmanager.k8s.io/disable-validation=true
    # Add the Jetstack Helm repository
    - helm repo add jetstack https://charts.jetstack.io
    # Update your local Helm chart repository cache
    - helm repo update
    # Install the cert-manager Helm chart
    - helm install --name cert-manager --namespace cert-manager --version v0.8.0 jetstack/cert-manager
    # Run Command issuer.yaml  
    - sed 's/_AZ_AKS_ISSUER_NAME_/'"${AZ_AKS_ISSUER_NAME}"'/g; s/_BL_DEV_E_MAIL_/'"${BL_DEV_E_MAIL}"'/g' infrastructure/kubernetes/cluster-issuer.yaml > cluster-issuer.yaml;
    - kubectl apply -f cluster-issuer.yaml
    # Run Command ingress.yaml  
    - sed 's/_BL_AZ_HOST_/'"beautylivery-test.${AZ_RESOURCE_LOCATION}.${AZ_AKS_HOST}"'/g; s/_AZ_AKS_ISSUER_NAME_/'"${AZ_AKS_ISSUER_NAME}"'/g' infrastructure/kubernetes/ingress.yaml > ingress.yaml;
    - kubectl apply -f ingress.yaml 
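
For reference, the sed templating in the last steps assumes placeholder tokens such as _AZ_AKS_ISSUER_NAME_ inside the repository's YAML templates. A minimal, self-contained sketch of that substitution follows; the template content and variable values are illustrative, not the actual infrastructure/kubernetes/cluster-issuer.yaml from the repo:

```shell
# Illustrative values (assumptions, not the real project settings)
AZ_AKS_ISSUER_NAME="letsencrypt-test"
BL_DEV_E_MAIL="dev@example.com"

# Stand-in for infrastructure/kubernetes/cluster-issuer.yaml:
# a template with placeholder tokens to be replaced by sed.
cat > cluster-issuer.yaml.tpl <<'EOF'
apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer
metadata:
  name: _AZ_AKS_ISSUER_NAME_
spec:
  acme:
    email: _BL_DEV_E_MAIL_
EOF

# Same substitution pattern as in the job script:
# each placeholder is swapped for the environment variable's value.
sed 's/_AZ_AKS_ISSUER_NAME_/'"${AZ_AKS_ISSUER_NAME}"'/g; s/_BL_DEV_E_MAIL_/'"${BL_DEV_E_MAIL}"'/g' \
  cluster-issuer.yaml.tpl > cluster-issuer.yaml
```

After this, kubectl apply -f cluster-issuer.yaml applies the rendered manifest. Note that values containing a / would break the sed expressions as written, since / is the substitution delimiter.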

And the result

Running with gitlab-runner 12.3.0 (a8a019e0)
  on runner-gitlab-runner-676b494b6b-b5q6h gzi97H3Q
Using Kubernetes namespace: gitlab-managed-apps
Using Kubernetes executor with image microsoft/azure-cli:latest ...
Waiting for pod gitlab-managed-apps/runner-gzi97h3q-project-14628452-concurrent-0l8wsx to be running, status is Pending
Waiting for pod gitlab-managed-apps/runner-gzi97h3q-project-14628452-concurrent-0l8wsx to be running, status is Pending
Running on runner-gzi97h3q-project-14628452-concurrent-0l8wsx via runner-gitlab-runner-676b494b6b-b5q6h...
Fetching changes with git depth set to 50...
Initialized empty Git repository in /builds/****/*******/.git/
Created fresh repository.
From https://gitlab.com/****/********
 * [new branch]      Setup-Kubernetes -> origin/Setup-Kubernetes
Checking out d2ca489b as Setup-Kubernetes...

Skipping Git submodules setup
$ function create_secret() { # collapsed multi-line command
$ echo "current time $(TZ=Europe/Berlin date +"%F %T")"
current time 2019-10-06 09:00:50
$ az login --service-principal -u ${AZ_PRINC_USER} -p ${AZ_PRINC_PASSWORD} --tenant ${AZ_PRINC_TENANT}
[
  {
    "cloudName": "AzureCloud",
    "id": "******",
    "isDefault": true,
    "name": "Nutzungsbasierte Bezahlung",
    "state": "Enabled",
    "tenantId": "*******",
    "user": {
      "name": "http://*****",
      "type": "servicePrincipal"
    }
  }
]
$ az group create --name ${AZ_RESOURCE_GROUP} --location ${AZ_RESOURCE_LOCATION}
{
  "id": "/subscriptions/*********/resourceGroups/*****",
  "location": "francecentral",
  "managedBy": null,
  "name": "******",
  "properties": {
    "provisioningState": "Succeeded"
  },
  "tags": null,
  "type": "Microsoft.Resources/resourceGroups"
}
$ az aks delete --resource-group ${AZ_RESOURCE_GROUP} --name ${AZ_AKS_TEST_CLUSTER} --yes
Running after script...
$ echo "current time $(TZ=Europe/Berlin date +"%F %T")"
current time 2019-10-06 09:05:55
Job succeeded

Is there a way to have it run completely? And successfully, in the best case?

UPDATE: The idea: I am trying to automate the process of setting up a complete Kubernetes cluster with SSL and DNS management, having everything set up fast and ready for different use cases and different environments in the future. I also want to learn how to do things better :)

NEW UPDATE:

Added a solution (see the answer below).

-- Bliv_Dev
azure
bash
command-line-interface
gitlab
kubernetes

1 Answer

10/8/2019

I added a small workaround, since I expect to need to run this every once in a while.

It seems the az aks wait command did the trick for me for now, and the preceding command needs --no-wait in order to continue.

# Delete Kubernetes Cluster 
  - az aks delete --resource-group ${AZ_RESOURCE_GROUP} --name ${AZ_AKS_TEST_CLUSTER} --no-wait --yes
  - az aks wait --deleted -g ${AZ_RESOURCE_GROUP} -n ${AZ_AKS_TEST_CLUSTER} --updated --interval 60 --timeout 1800
  # Create Kubernetes Cluster  
  - az aks create --name ${AZ_AKS_TEST_CLUSTER} --resource-group ${AZ_RESOURCE_GROUP} --node-count ${AZ_AKS_TEST_NODECOUNT} --service-principal ${AZ_PRINC_USER} --client-secret ${AZ_PRINC_PASSWORD} --generate-ssh-keys --no-wait
  - az aks wait --created -g ${AZ_RESOURCE_GROUP} -n ${AZ_AKS_TEST_CLUSTER} --updated --interval 60 --timeout 1800
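
The same wait-with-timeout pattern can be sketched as a plain shell helper. wait_for_state below is hypothetical (it is not from the post or the az CLI); it polls a command until its output matches an expected value, roughly what az aks wait does internally with --interval and --timeout:

```shell
# Hypothetical helper: poll a command until its output equals the expected
# value, checking every $interval seconds, giving up after $timeout seconds.
wait_for_state() {
  cmd=$1; expected=$2; interval=$3; timeout=$4
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # Run the probe command; ignore its stderr (e.g. "not found" while deleting)
    state=$($cmd 2>/dev/null)
    if [ "$state" = "$expected" ]; then
      echo "reached state: $expected"
      return 0
    fi
    echo "still waiting (state: ${state:-unknown})"
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out after ${timeout}s" >&2
  return 1
}

# In the pipeline this could poll the cluster's provisioningState, e.g.:
# wait_for_state "az aks show -g $AZ_RESOURCE_GROUP -n $AZ_AKS_TEST_CLUSTER --query provisioningState -o tsv" Succeeded 60 1800
```

One plausible benefit of the --no-wait plus explicit-wait approach is that the polling keeps writing progress to the job log instead of leaving a single CLI call silent for many minutes.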
-- Bliv_Dev
Source: StackOverflow