What are some best practices for deploying multiple containers with related environment variables?

8/25/2018

I'm attempting to deploy a parallelized data processing task that uses many containers of the same Docker image, each running with a different, incrementing environment variable. The image is set up to read env vars in order to determine which segment of a larger list to process.

Background: I was originally using a bash script that passed an incrementing env var down to docker run commands, but now I'd like a better way to manage and monitor all the containers. I've only had experience using Kubernetes for application services, but it seems like it may be a better way to orchestrate my multi-container task as well.

I'm wondering if this sort of dynamic environment variable passing is possible within Kubernetes YAML configs, as I'd prefer declarative config over a shell script. I'm also not sure of the best approach in Kubernetes: multiple separate pods, a multi-container pod, or replicas in some way.

I'm open to suggestions; I know other tools like Terraform may also be helpful for this sort of programmatic infrastructure.

-- ev-dev
data-processing
data-science
docker
kubernetes
terraform

2 Answers

8/26/2018

What about using Parallel Processing Using a Work Queue for passing different work items to your k8s Job pods with .spec.parallelism? Although having a separate service for the work queue may be a little too much, depending on what you are trying to do.
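
With the work-queue approach, a single Job with .spec.parallelism runs several identical worker pods, and each worker asks the queue which segment to process instead of getting a unique environment variable. A minimal sketch; the image name, queue address, and env var name here are assumptions, not anything Kubernetes prescribes:

    # One Job; .spec.parallelism controls how many identical worker pods run at once.
    # Each worker pulls its work item from the queue rather than from a per-pod env var.
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: segment-workers
    spec:
      parallelism: 3                          # number of concurrent worker pods
      template:
        spec:
          containers:
          - name: worker
            image: my-processing-image:latest # placeholder image
            env:
            - name: QUEUE_URL                 # assumed: workers fetch segments from this queue service
              value: "amqp://work-queue.default.svc.cluster.local"
          restartPolicy: Never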

The other idea is to use Helm's templating power to create the k8s manifest files. I created a sample Helm chart to give an idea of templating parallel processing; see the git repo helm-parallel-jobs. Once you have the git repo cloned, you can install the Helm chart for parallel processing like this. The Job template is the same as the one used in the Kubernetes documentation. As seen in the output below, three different environment variable values (apple, banana, cherry) are provided, which creates 3 different pods with those values passed to them as environment variables.

    [root@jr]# helm install --set envs='{apple,banana,cherry}'  --name jobs ./helm-parallel-jobs/example/parallel-jobs
    NAME:   jobs
    LAST DEPLOYED: Sun Aug 26 16:29:23 2018
    NAMESPACE: default
    STATUS: DEPLOYED

    RESOURCES:
    ==> v1/Job
    NAME                 DESIRED  SUCCESSFUL  AGE
    process-item-apple   1        0           0s
    process-item-banana  1        0           0s
    process-item-cherry  1        0           0s

    ==> v1/Pod(related)
    NAME                       READY  STATUS             RESTARTS  AGE
    process-item-apple-dr6st   0/1    ContainerCreating  0         0s
    process-item-banana-d2wwq  0/1    ContainerCreating  0         0s
    process-item-cherry-wvlxz  0/1    ContainerCreating  0         0s
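
For reference, the templates/job.yaml in a chart like that would look roughly as below. This is only a sketch modelled on the Job from the Kubernetes parallel-processing-expansion docs; the image and command are placeholders, and the actual file in the linked repo may differ:

    # templates/job.yaml - renders one Job per entry in .Values.envs
    {{- range .Values.envs }}
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: process-item-{{ . }}
    spec:
      template:
        spec:
          containers:
          - name: worker
            image: busybox                    # placeholder; use your processing image
            command: ["sh", "-c", "echo processing {{ . }}"]
            env:
            - name: ITEM                      # the per-Job environment variable
              value: "{{ . }}"
          restartPolicy: Never
    {{- end }}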
-- Jay Rajput
Source: StackOverflow

8/26/2018

My understanding is that you'd like to do something like https://kubernetes.io/docs/tasks/job/parallel-processing-expansion/, where Jobs are created from a template, one for each data item in a list, but you don't want it to be shell-scripted.

I imagine Helm could be used to template the Job: it has a range function, so a chart could be set up to create a Job for each entry in a section of a values.yaml. It could therefore occupy a space similar to what you suggested for Terraform. Ansible could also be an option.
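
As a rough illustration of that idea (the key name "items" and the entries here are made up), the list would live in values.yaml and the chart's Job template would range over it to emit one Job per entry:

    # values.yaml - hypothetical section the Job template would `range` over
    items:
      - segment-0
      - segment-1
      - segment-2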

However, the direction of travel of this question seems to be towards batch scheduling. I am wondering whether your jobs will evolve to have dependencies between them, etc. If so, "Helm and Kubernetes: Is there a barrier equivalent for jobs?" and https://www.quora.com/Is-Kubernetes-suited-for-long-running-batch-jobs may help here. Currently Kubernetes has facilities for running batch jobs, and the tooling to let a batch scheduling system run on it or be built on top of it, but it doesn't itself contain an out-of-the-box batch scheduling system, so people are currently using a range of different approaches to suit their needs.

-- Ryan Dawson
Source: StackOverflow