When deploying an image to an AKS cluster, the image pull from our ACR (Premium SKU) is very slow, even for "small" images of ~150 MB.
Both the AKS resource and the ACR resource are in the Canada East region.
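For context, the SKU and colocation can be double-checked with something like the following (the registry name matches the placeholder used in this post; the cluster and resource group names are hypothetical placeholders):

# Confirm the registry SKU and region
az acr show --name myacr01 --query "{sku: sku.name, location: location}" --output table

# Confirm the cluster region (names here are illustrative)
az aks show --name myaks01 --resource-group my-rg --query location --output tsv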
Here is an example:
root@076fff2831b2:/tmp# kubectl describe pod application-service-59bcf96874-pvrmb
Name:           application-service-59bcf96874-pvrmb
Namespace:      default
Priority:       0
Node:           aks-41067869-1/10.255.13.163
Start Time:     Tue, 11 Feb 2020 18:15:53 -0500
Labels:         app.kubernetes.io/instance=application-service
                app.kubernetes.io/name=application-service
                pod-template-hash=59bcf96874
Annotations:    <none>
Status:         Running
IP:             10.255.13.175
IPs:            <none>
Controlled By:  ReplicaSet/application-service-59bcf96874
Containers:
  application-service:
    Container ID:   docker://0e86526a293d9055d482a09f043f0be68c594244fe4216f8fb190bc2caf6b65b
    Image:          myacr01.azurecr.io/microservices/application-service:0.0.6
    Image ID:       docker-pullable://myacr01.azurecr.io/microservices/application-service@sha256:cfbb3ffa7adc52da9cc0b8d7f78376076ea712025b59df8e406c559d369f4085
    Port:           3000/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 11 Feb 2020 18:35:00 -0500
      Finished:     Tue, 11 Feb 2020 18:35:00 -0500
    Ready:          False
    Restart Count:  5
    Liveness:       http-get https://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get https://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      PORT:       3000
      undefined:  undefined
    Mounts:
      /kvmnt from application-service-kv-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from application-service-token-9jk8j (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            False
  ContainersReady  False
  PodScheduled     True
Volumes:
  application-service-kv-volume:
    Type:       FlexVolume (a generic volume resource that is provisioned/attached using an exec based plugin)
    Driver:     azure/kv
    FSType:
    SecretRef:  &LocalObjectReference{Name:kvcreds,}
    ReadOnly:   false
    Options:    map[keyvaultname:testIt2 keyvaultobjectnames:APPLICATION-SVC-SQLDB-CS;INGESTION-CONSUMER-EHB-CS;INGESTION-PRODUCER-EHB-CS keyvaultobjecttypes:secret;secret;secret tenantid:REMOVED usepodidentity:false usevmmanagedidentity:false]
  application-service-token-9jk8j:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  application-service-token-9jk8j
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                    From                     Message
  ----     ------     ----                   ----                     -------
  Normal   Scheduled  20m                    default-scheduler        Successfully assigned default/application-service-59bcf96874-pvrmb to aks-41067869-1
  Normal   Pulling    20m                    kubelet, aks-41067869-1  Pulling image "myacr01.azurecr.io/microservices/application-service:0.0.6"
  Normal   Pulled     4m39s                  kubelet, aks-41067869-1  Successfully pulled image "myacr01.azurecr.io/microservices/application-service:0.0.6"
  Normal   Started    3m36s (x4 over 4m33s)  kubelet, aks-41067869-1  Started container application-service
  Warning  BackOff    3m4s (x11 over 4m30s)  kubelet, aks-41067869-1  Back-off restarting failed container
  Normal   Pulled     2m52s (x4 over 4m32s)  kubelet, aks-41067869-1  Container image "myacr01.azurecr.io/microservices/application-service:0.0.6" already present on machine
  Normal   Created    2m51s (x5 over 4m33s)  kubelet, aks-41067869-1  Created container application-service
Some details were modified/removed for privacy reasons.
However, the thing to note is the ~15 minutes needed to go from "Pulling" to "Pulled" for an image from the ACR: the Pulling event is 20m old while the Pulled event is only 4m39s old, a gap of roughly 15 minutes.
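To time the pull in isolation from the crash loop, probes, and volume mounts, one option would be to pull the image manually on an affected node, e.g. over SSH. A minimal sketch, assuming Docker is the node's container runtime and the node is already authenticated to the registry:

# On the AKS node itself (e.g. via SSH):
docker rmi myacr01.azurecr.io/microservices/application-service:0.0.6   # drop the cached copy, if present
time docker pull myacr01.azurecr.io/microservices/application-service:0.0.6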
This issue is occurring daily. The Insights blade of the AKS resource shows a maximum of 26% node CPU and 14.32% node memory utilization over the last 7 days, so the nodes do not appear to be resource-constrained.
How can we go about troubleshooting this further to determine the possible causes of these delays?
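For instance, would the registry's built-in health check plus a cluster-wide scan for other slow pulls be reasonable first steps? A sketch, assuming the Azure CLI and Docker are available on the client machine (`myacr01` is the placeholder name from above):

# Validate client-side connectivity and registry health
az acr check-health --name myacr01

# Look for other slow or failing pulls across the cluster
kubectl get events --all-namespaces --sort-by=.lastTimestamp | grep -i pull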
Any help is greatly appreciated.
Thanks!