My app is a bash script that runs tesseract with GNU parallel. I need to process about 50 GB of data, and it is too slow on one VM. I need the power of cluster computing, but I don't want to set up multiple VMs myself; I just want to launch my app (along with the data files) on a Google cluster (Kubernetes?). I'm not very clear on these concepts. If someone could guide me, that would be great.
It might be a challenge to learn all the container orchestration details from scratch when you are only concerned with this one use case.
While GNU Parallel is nice on a single machine, there don't seem to be many starter kits for using it in distributed mode in the cloud.
I would consider Google Dataflow rather than spinning up a K8s cluster. It allocates and cleans up resources easily and lets you avoid managing VMs and learning an orchestration framework.
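To give a feel for what that could look like, here is a rough sketch using the Apache Beam Python SDK, which is what Dataflow runs. It is not a tested setup for your data: the helper names (`tesseract_command`, `ocr_one`, `run`), the `/tmp/ocr` directory, and the `ocr_index` output prefix are all made up for illustration, and it assumes each image path is readable as a local file and that the `tesseract` binary is installed wherever the code runs (on Dataflow that would likely mean a custom worker image).

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path: str, out_dir: str) -> list[str]:
    """Build the tesseract CLI call for one image (hypothetical helper).

    tesseract takes an input file and an output base name, and appends
    ".txt" to that base name itself.
    """
    out_base = Path(out_dir) / Path(image_path).stem
    return ["tesseract", image_path, str(out_base)]


def ocr_one(image_path: str, out_dir: str = "/tmp/ocr") -> str:
    """Run tesseract on one local image and return the text file's path."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cmd = tesseract_command(image_path, out_dir)
    subprocess.run(cmd, check=True)  # raises if tesseract exits non-zero
    return cmd[2] + ".txt"


def run(image_paths, pipeline_args=None):
    """Fan the images out over Beam workers, one OCR call per image."""
    # Imported here so the helpers above can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ListImages" >> beam.Create(image_paths)
            | "OCR" >> beam.Map(ocr_one)
            # Writes the list of produced text files, not the OCR text itself.
            | "WriteIndex" >> beam.io.WriteToText("ocr_index")
        )
```

With no extra arguments this runs locally on Beam's default runner, which is a cheap way to try it on a handful of files. To run on Dataflow you would pass pipeline arguments such as `--runner=DataflowRunner` together with your project, region, and a `gs://` temp location, and you would still need to get the images from Cloud Storage onto the workers, which this sketch does not cover.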