I use custom images (AMIs) configured for machine learning on GPU-enabled EC2 instances.
This means cuda, libcudnn6, nvidia-docker, etc. are all properly set up on them.
However, when Kops starts new nodes from these AMIs (I use cluster-autoscaler), it replaces my properly configured docker installation.
How can I prevent that?
For now I run a custom script on startup that re-installs nvidia-docker properly, but that's obviously not ideal.
Kops will only install docker if there's a difference between the version it expects to use and the version that is already installed on the node.
Note that Kops will downgrade docker if the installed version is higher than what it expects!
So the solution to my problem was to have a pre-installed version that matches spec.docker.version. For this we had to downgrade docker to 17.03.2 and nvidia-docker to 2.0.3+docker17.03.2-1.
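For reference, here is a minimal sketch of the relevant part of the cluster spec (what you see with `kops edit cluster`). Only the docker block is shown. The version value is the one from our setup, so use whatever your AMI actually ships with.

```yaml
# Fragment of the kops Cluster spec; all other fields omitted.
spec:
  docker:
    # Must match the docker version pre-installed on the AMI,
    # otherwise Kops reinstalls (or downgrades) docker on node startup.
    version: 17.03.2
```

You can check what is installed on the image with `docker version` before baking the AMI.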