Kubernetes (GKE/AWS/Azure) Scaling for Large Jobs

1/16/2020

I am looking for some advice, and I would be eternally grateful if anyone would be able to point me in the right direction.

I have a Docker container that I use to do machine-learning-based object detection/tracking across sets of video frames. Currently, I start up an EC2 instance with this Docker container and send batches of approximately 30 frames in serial fashion. Of course, this is prohibitively slow.

I would like to set up a Kubernetes system that can go from zero running containers to 50+ and then immediately back down to the minimum required. Each container needs about 8 GB of RAM due to the model size, but can run on CPU. Each would need to run for about one minute to process the incoming images in parallel and then terminate, scaling down to zero active containers once the video processing is complete. In summary: send small batches of 30 frames to the cluster, have it scale up massively, and then scale down immediately when done.

I was able to set up a Kubernetes cluster on Google Cloud, but I cannot figure out how to make it scale all the way down to zero quickly after the job terminates. Having so many containers running after the job is done would be very expensive.

Would anybody be able to point me in the right direction? Can I do this with GKE? Is there a different service I should try?

Many thanks in advance for your help.

N

-- agile
amazon-web-services
azure
google-cloud-platform
google-kubernetes-engine
kubernetes

1 Answer

1/16/2020

If I've understood your task correctly, what you're looking for is Parallel Processing with Kubernetes. With this feature of Kubernetes, you can run a job with multiple pods running in parallel, and those pods are terminated when the job is done.
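As a rough sketch, a Job manifest along these lines would run a batch across many pods at once (the job name, image, and pod counts below are placeholders, not from your question, and ttlSecondsAfterFinished may require the TTLAfterFinished feature gate on older clusters):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: frame-detection              # placeholder name
    spec:
      parallelism: 50                    # run up to 50 pods at the same time
      completions: 50                    # total pods to run for this batch
      ttlSecondsAfterFinished: 60        # delete the finished Job object after a minute
      template:
        spec:
          containers:
          - name: detector
            image: gcr.io/my-project/detector:latest   # hypothetical image
            resources:
              requests:
                memory: "8Gi"            # each pod needs ~8 GB for the model
          restartPolicy: Never

How each pod picks up its batch of frames is up to you; the second link below describes the fine-grained work-queue pattern for exactly that.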

You can read more in the following documentation:

https://kubernetes.io/docs/tasks/job/parallel-processing-expansion/

https://kubernetes.io/docs/tasks/job/fine-parallel-processing-work-queue/
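As for getting the cluster itself back to zero: the Job takes care of terminating pods, but the underlying nodes are billed separately. On GKE you can put this workload on a node pool with cluster autoscaling enabled and a minimum size of zero, so the nodes are removed once the pods finish. A sketch, assuming a cluster named my-cluster and a machine type with enough memory for an 8 GB pod (both placeholders):

    gcloud container node-pools create detector-pool \
        --cluster=my-cluster \
        --machine-type=n1-standard-4 \
        --enable-autoscaling --min-nodes=0 --max-nodes=50

One caveat: cluster-autoscaler scale-down is not instantaneous; by default it removes nodes only after they have been underutilized for a while (on the order of 10 minutes), so "immediately" down to zero nodes is not quite achievable, but it does get you to zero without manual intervention.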

-- Shahed Mehbub
Source: StackOverflow