I'm attempting to deploy a parallelized data processing task that runs many containers of the same Docker image, each with a different, incrementing environment variable. The image is set up to read env vars in order to determine which segment of a larger list to process.
Background: I was originally using a bash script that passed an incrementing env var down to docker run commands, but now I'd like a better way to manage and monitor all the containers. I've only used Kubernetes for application services, but it seems like it may also be a good fit for orchestrating my multi-container task.
I'm wondering whether this sort of dynamic environment variable passing is possible within Kubernetes YAML configs, as I'd prefer declarative config over a shell script. I'm also not sure of the best approach in Kubernetes: multiple separate pods, a multi-container pod, or replicas in some way.
I'm open to suggestions; I know other tools like Terraform may also be helpful for this sort of programmatic infrastructure.
What about using the Parallel Processing Using a Work Queue pattern for your Kubernetes Job pods, with .spec.parallelism? In that pattern every pod is identical and pulls its work items from the queue instead of being handed a different environment variable. Although having a separate service for the work queue may be a little too much, depending on what you are trying to do.
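To make the trade-off concrete, here is a minimal sketch of a work-queue style Job; the image name, queue address, and env var name are placeholders I've made up, not values from the question:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: list-processor
spec:
  parallelism: 3      # run three identical worker pods at once
  completions: 3      # the Job succeeds once three pods have completed
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: example.com/list-processor:latest   # placeholder image
        env:
        - name: QUEUE_ADDR                          # placeholder: workers pull their items from here
          value: "work-queue.default.svc.cluster.local:6379"
```

The cost is that the image has to be changed to fetch its segment from the queue rather than reading it from an incrementing environment variable.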
Another idea is to use Helm's templating power to create the Kubernetes manifests. I created a sample Helm chart to give an idea of templating parallel processing; see the git repo - helm-parallel-jobs. Once you have the repo cloned, you can install the chart for parallel processing like this. The Job template is the same as the one used in the Kubernetes documentation. As seen in the output below, three different environment variable values - apple, banana, cherry - are provided, which creates 3 different pods with those environment variables passed to them.
[root@jr]# helm install --set envs='{apple,banana,cherry}' --name jobs ./helm-parallel-jobs/example/parallel-jobs
NAME: jobs
LAST DEPLOYED: Sun Aug 26 16:29:23 2018
NAMESPACE: default
STATUS: DEPLOYED
RESOURCES:
==> v1/Job
NAME                 DESIRED  SUCCESSFUL  AGE
process-item-apple   1        0           0s
process-item-banana  1        0           0s
process-item-cherry  1        0           0s

==> v1/Pod(related)
NAME                       READY  STATUS             RESTARTS  AGE
process-item-apple-dr6st   0/1    ContainerCreating  0         0s
process-item-banana-d2wwq  0/1    ContainerCreating  0         0s
process-item-cherry-wvlxz  0/1    ContainerCreating  0         0s
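To give an idea of what the templating looks like, a chart template along these lines (a sketch based on the Job template in the Kubernetes docs, not a verbatim copy of the repo) ranges over the envs list and stamps out one Job per item:

```yaml
# templates/job.yaml -- sketch; assumes the chart receives a list under .Values.envs
{{- range .Values.envs }}
---
apiVersion: batch/v1
kind: Job
metadata:
  name: process-item-{{ . }}
  labels:
    jobgroup: jobexample
spec:
  template:
    metadata:
      labels:
        jobgroup: jobexample
    spec:
      restartPolicy: Never
      containers:
      - name: c
        image: busybox                # stand-in for your real processing image
        command: ["sh", "-c", "echo Processing item {{ . }} && sleep 5"]
        env:
        - name: ITEM                  # the per-Job environment variable
          value: "{{ . }}"
{{- end }}
```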
My understanding is you'd like to do something like https://kubernetes.io/docs/tasks/job/parallel-processing-expansion/ where jobs are created from a template, one for each data item in a list. But you don't want it to be shell-scripted.
I imagine Helm could be used to replace the shell-scripted expansion of the Job template; it has a range function, so a chart could be set up to create a Job for each entry in a section of a values.yaml (see the sketch below). So it could occupy a space similar to what you suggested for Terraform. Ansible could also be an option.
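For example (a sketch with a hypothetical chart layout), the list of work items could live in values.yaml rather than being passed on the command line with --set, and a range loop like the one shown in the other answer would render one Job per entry:

```yaml
# values.yaml -- hypothetical; each entry becomes one Job via the chart's range loop
envs:
  - apple
  - banana
  - cherry
```

A plain helm install of the chart would then create all three Jobs with no shell-side expansion.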
However, the direction of travel of this question seems to be towards batch scheduling, and I wonder if your jobs will evolve to have dependencies between them, etc. If so, Helm and Kubernetes: Is there a barrier equivalent for jobs? and https://www.quora.com/Is-Kubernetes-suited-for-long-running-batch-jobs may help. Currently Kubernetes has facilities for running batch jobs, and the tooling to let a batch scheduling system run or be built on top of it, but it doesn't itself contain an out-of-the-box batch scheduling system. So people are currently using a range of different approaches to suit their needs.