Using Kubernetes v1.16 on AWS, I am facing a weird issue while trying to reduce the time it takes to start a pod on a newly spawned node.
By default, a node AMI does not contain any pre-cached Docker images, so when a pod is scheduled onto it, its first job is to pull the Docker image.
Pulling a large Docker image can take a while, so the pod takes a long time to start.
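For reference, the pull time is visible in the pod events; a rough sketch of how to check it ("my-gpu-pod" is a placeholder name):

# Inspect how long the image pull takes on a fresh node; "my-gpu-pod" is a placeholder.
kubectl describe pod my-gpu-pod | grep -E "Pulling|Pulled"
# Compare the ages of the "Pulling" and "Pulled" events to estimate the pull duration
# (newer kubelets also print the duration directly in the "Pulled" message).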
Recently I came up with the idea of pre-pulling my large Docker image right into the AMI, so that when a pod is scheduled onto it, it won't have to download the image. It turns out a lot of people have been doing this for a while; it is known as "baking" the AMI.
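For this to help, the kubelet has to actually use the cached copy; a hedged sketch of the check ("my-image" is a placeholder):

# On a node booted from the baked AMI, confirm the image is already in the local cache;
# "my-image" is a placeholder for the real repository name.
docker images | grep my-image
# In the pod spec, keep imagePullPolicy at IfNotPresent (the default for non-:latest tags)
# so the kubelet uses the cached copy instead of pulling again.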
My issue is that when I generate an AMI with my large image baked into it and use this AMI, everything seems to work as expected: the Docker image is not downloaded because it is already present, so the pod starts in about 1 second. However, the pod itself then runs roughly 1000 times slower than when the Docker image is not pre-pulled into the AMI.
What I am seeing: if I don't pre-pull my Docker image, everything runs normally. It is only when I pre-pull it into a newly generated AMI that the pod starts within a second but the container then runs slower than ever before.
My Docker image uses GPU resources and is based on the tensorflow/tensorflow:1.14.0-gpu-py3 image. The problem seems to be related to the combination of nvidia-docker and TensorFlow on a GPU-enabled EC2 instance.
If anyone has an idea where this extreme runtime latency comes from, it would be much appreciated.
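One way to take Kubernetes out of the picture is to run the image directly with Docker on the node and check what TensorFlow sees; a sketch, where the ECR URI is a placeholder:

# Run the image directly on the node, outside Kubernetes, to confirm TensorFlow sees the GPU.
# The ECR URI below is a placeholder.
docker run --rm <account_id>.dkr.ecr.<region>.amazonaws.com/<image>:<tag> \
  python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"
# With nvidia-docker2 configured as the default runtime, a /device:GPU:0 entry should appear.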
EDIT #1
Since then, I have switched to Packer to build my AMI. Here is my template file:
{
  "builders": [
    {
      "type": "amazon-ebs",
      "access_key": "{{user `aws_access_key`}}",
      "secret_key": "{{user `aws_secret_key`}}",
      "ami_name": "compute-{{user `environment_name`}}-{{timestamp}}",
      "region": "{{user `region`}}",
      "instance_type": "{{user `instance`}}",
      "ssh_username": "admin",
      "source_ami_filter": {
        "filters": {
          "virtualization-type": "hvm",
          "name": "debian-stretch-hvm-x86_64-gp2-*",
          "root-device-type": "ebs"
        },
        "owners": "379101102735",
        "most_recent": true
      }
    }
  ],
  "provisioners": [
    {
      "execute_command": "sudo env {{ .Vars }} {{ .Path }}",
      "scripts": [
        "ami/setup_vm.sh"
      ],
      "type": "shell",
      "environment_vars": [
        "ENVIRONMENT_NAME={{user `environment_name`}}",
        "AWS_ACCOUNT_ID={{user `aws_account_id`}}",
        "AWS_REGION={{user `region`}}",
        "AWS_ACCESS_KEY_ID={{user `aws_access_key`}}",
        "AWS_SECRET_ACCESS_KEY={{user `aws_secret_key`}}",
        "DOCKER_IMAGE_NAME={{user `docker_image_name`}}"
      ]
    }
  ],
  "post-processors": [
    {
      "type": "manifest",
      "output": "ami/manifest.json",
      "strip_path": true
    }
  ],
  "variables": {
    "aws_access_key": "{{env `AWS_ACCESS_KEY_ID`}}",
    "aws_secret_key": "{{env `AWS_SECRET_ACCESS_KEY`}}",
    "environment_name": "",
    "region": "eu-west-1",
    "instance": "g4dn.xlarge",
    "aws_account_id": "",
    "docker_image_name": ""
  }
}
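For reference, the template above can be built with a command along these lines (the variable values and the template path are placeholders; the AWS keys come from the environment per the variables block):

# Build the AMI from the template; all values below are example placeholders.
packer build \
  -var "environment_name=staging" \
  -var "aws_account_id=123456789012" \
  -var "docker_image_name=my-image" \
  ami/template.json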
And here is the associated script that configures the AMI for Docker and NVIDIA Docker:
#!/bin/bash
cd /tmp
export DEBIAN_FRONTEND=noninteractive
export APT_LISTCHANGES_FRONTEND=noninteractive
# docker
apt-get update
apt-get install -y apt-transport-https ca-certificates curl gnupg2 software-properties-common
curl -fsSL https://download.docker.com/linux/debian/gpg | apt-key add -
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian stretch stable"
apt-get update
apt-get install -y docker-ce
usermod -a -G docker $USER
# graphical drivers
apt-get install -y software-properties-common
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/440.64/NVIDIA-Linux-x86_64-440.64.run
bash NVIDIA-Linux-x86_64-440.64.run -sZ
# nvidia-docker
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
apt-get update
apt-get install -y nvidia-container-toolkit
apt-get install -y nvidia-docker2
cat > /etc/docker/daemon.json <<EOL
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOL
systemctl restart docker
# enable nvidia-persistenced service
cat > /etc/systemd/system/nvidia-persistenced.service <<EOL
[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target
[Service]
Type=forking
PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
Restart=always
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced
[Install]
WantedBy=multi-user.target
EOL
systemctl enable nvidia-persistenced
# prepull docker
apt-get install -y python3-pip
pip3 install awscli --upgrade
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
docker pull $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$DOCKER_IMAGE_NAME:$ENVIRONMENT_NAME
# Clean up
apt-get -y autoremove
apt-get -y clean
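As a sanity check (a sketch, not part of the script above), an instance launched from the resulting AMI can be inspected like this (the CUDA image tag is only an example and may need adjusting):

# On an instance booted from the baked AMI:
docker images                                      # the pre-pulled ECR image should be listed
nvidia-smi                                         # driver 440.64 should report the GPU
docker run --rm nvidia/cuda:10.0-base nvidia-smi   # GPU reachable through the nvidia runtime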
Anyway, as soon as I add this line:
docker pull $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/$DOCKER_IMAGE_NAME:$ENVIRONMENT_NAME
I face the same weird issue: when pods are scheduled on nodes booted from this AMI, Kubernetes reports "image already present on machine", so it doesn't pull the image again, but the container is then extremely slow when using TensorFlow. For example, tf.Session() takes about a minute to run.
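To narrow this down, the GPU state on an affected node can be compared with that inside the pod; a rough diagnostic sketch ("my-pod" is a placeholder):

# On a node booted from the baked AMI:
nvidia-smi -q | grep -i -E "persistence mode|performance state|clocks throttle"
systemctl status nvidia-persistenced
# And from inside the slow pod ("my-pod" is a placeholder):
kubectl exec my-pod -- nvidia-smi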
EDIT #2
Adding extra information about what is executed in the pod:
Dockerfile
FROM tensorflow/tensorflow:1.14.0-gpu-py3
COPY main.py .
CMD ["python", "main.py"]
main.py
import tensorflow as tf

config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True
session = tf.Session(graph=tf.Graph(), config=config)
With only those lines, the tf.Session initialization takes up to a minute when the image is pre-pulled vs. about 1 second when it is not.
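To quantify this outside Kubernetes, the session creation can be timed directly through Docker on both a baked node and a node that pulls the image fresh (the ECR URI is a placeholder):

# Time tf.Session() creation directly with Docker, bypassing Kubernetes entirely.
time docker run --rm <account_id>.dkr.ecr.eu-west-1.amazonaws.com/<image>:<tag> \
  python -c "import tensorflow as tf; tf.Session(config=tf.ConfigProto(allow_soft_placement=True))"
# If the same slowdown shows up here, it points at the baked AMI / NVIDIA runtime
# rather than at anything Kubernetes-specific.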