I have a Dockerfile that describes a classic data science machine with pandas, sklearn, etc. installed on it. I want to instantiate the resulting image on Google Cloud through Google Container Engine, passing it my Python package and some arguments as parameters, when a certain notification arrives.
I have to run my Python package on 100 different datasets, each with distinct access keys, etc. My dream would be to instantiate 100 containers from my Docker image through Google Container Engine and feed them my 100 distinct datasets and parameters, so they can produce their outputs quickly.
Another option would be to instantiate a single container and feed it each dataset and its parameters one by one, but that seems much slower to me.
My questions are:
1- Is one of these solutions more doable or realistic than the other?
2- Is there a third solution to instantiate this image in a smart way, in order to make my calculations fast and not too costly?
You can take advantage of Kubernetes. Kubernetes is an open source container cluster manager. It schedules any number of container replicas across a group of node instances. A master instance exposes the Kubernetes API, through which tasks are defined. Kubernetes spawns containers on nodes to handle the defined tasks.
The number and type of containers can be dynamically modified according to need. An agent (a kubelet) on each node instance monitors containers and restarts them if necessary.
Kubernetes is optimized for Google Cloud Platform, but can run on any physical or virtual machine.
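Inside each container, your package can pick up the task it was assigned from environment variables set at scheduling time. A minimal sketch of such an entrypoint (the variable names `DATASET_URI` and `ACCESS_KEY` and the `run` helper are assumptions, not part of any Kubernetes API):

```python
import os

def run(dataset_uri, access_key):
    # Placeholder for your actual work: fetch the dataset using the
    # access key, run the analysis, and write the output somewhere.
    return "processed %s" % dataset_uri

if __name__ == "__main__":
    # Each container replica reads its own per-dataset parameters from
    # the environment; the scheduler injects different values per pod.
    dataset_uri = os.environ.get("DATASET_URI", "")
    access_key = os.environ.get("ACCESS_KEY", "")
    print(run(dataset_uri, access_key))
```

The same image then serves all 100 tasks; only the environment differs between replicas.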
Both solutions would work fine; you are really just asking whether you should parallelize your workload (which is mostly up to you).
If you want to run your workload in parallel, you will need more computing resources. If you want to be able to run it in parallel on demand (when the signal arrives), you will either need to have these resources ready (idling) or instantiate them dynamically (which is cheaper since you are only paying for the compute when you are using it).
You could have a controller process that accepts the signal, scales up (or creates) a Google Container Engine cluster to have the desired number of nodes, and then submits N pods to the system to perform your work. Each pod can be parameterized using environment variables (you'd need to synthesize these on the fly). Then collect your output and scale down (or delete) the cluster when you are done.
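To make the parameterization concrete, here is one way the controller could synthesize the per-pod environment variables: build one pod manifest per dataset as a plain dict and submit each one (e.g. via the Kubernetes API). The image name, bucket paths, key values, and `worker-N` naming scheme below are all made-up placeholders for illustration:

```python
def make_pod_manifest(index, dataset_uri, access_key,
                      image="gcr.io/my-project/my-package:latest"):
    """Build a Kubernetes pod manifest (as a plain dict) for one dataset."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "worker-%d" % index},
        "spec": {
            "restartPolicy": "Never",  # run-to-completion batch task
            "containers": [{
                "name": "worker",
                "image": image,
                # Per-dataset parameters are injected as environment
                # variables, so every pod runs the same image.
                "env": [
                    {"name": "DATASET_URI", "value": dataset_uri},
                    {"name": "ACCESS_KEY", "value": access_key},
                ],
            }],
        },
    }

# Hypothetical list of 100 (dataset, key) pairs.
datasets = [("gs://bucket/data-%03d.csv" % i, "key-%03d" % i)
            for i in range(100)]
pods = [make_pod_manifest(i, uri, key)
        for i, (uri, key) in enumerate(datasets)]
```

Each manifest could then be serialized to JSON or YAML and submitted with `kubectl create -f -`, or posted directly through a Kubernetes client library; when every pod has completed, the controller collects the outputs and tears the cluster down.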