Kubernetes: parallelize multiple samples in a directory

12/10/2018

I was able to get a Kubernetes job up and running on AKS. It uses a Docker Hub image to process a biological sample and then uploads the output to blob storage; this is done with a bash command that I provide in the args section of my YAML file. However, I have 20 samples and would like to spin up 20 nodes so that I can process the samples in parallel (one sample per node). How do I send each sample to a different node? The "parallelism" option in the YAML file processes all 20 samples on each of the 20 nodes, which is not what I want.

Thank you for the help.

-- Tony
azure-aks
docker
kubernetes

2 Answers

12/10/2018

How and where are the samples stored? You could load them (or a pointer to each actual sample) into a queue such as Kafka and let the application retrieve each sample once and upload the result to blob storage after the computation. You can then even ensure that if a computation fails, another pod will pick the sample up and restart the computation.
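
As a rough sketch of how this queue idea could look on the Kubernetes side, the Job below starts 20 worker pods and leaves `completions` unset, which is the work-queue style of Job: the pods coordinate through the queue, and the Job counts as done once a pod has exited successfully and all pods have terminated. The Job name, image, environment variable, and queue address are hypothetical placeholders, and the sketch assumes the container itself pulls one sample at a time from the queue and exits when the queue is empty.

```yaml
# Sketch only: every name, the image, and the queue address are hypothetical.
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-processor
spec:
  parallelism: 20            # run up to 20 worker pods at the same time
  # completions is intentionally unset: the workers decide via the queue
  # when all samples have been processed.
  template:
    spec:
      restartPolicy: OnFailure   # restart a worker container that fails
      containers:
      - name: worker
        image: example/sample-processor:latest   # hypothetical image
        env:
        - name: QUEUE_ADDRESS                    # hypothetical; read by the worker
          value: "queue.default.svc.cluster.local:9092"
```

With this layout, which pod handles which sample is decided by the queue rather than by the manifest, so no per-sample YAML is needed.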

-- Alessandro Vozza
Source: StackOverflow

12/10/2018

If you want each instance of the job to run on a different node, you can use a DaemonSet. That is exactly what it does: it provisions one pod per worker node. For example:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd-elasticsearch
        image: k8s.gcr.io/fluentd-elasticsearch:1.20
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/

Another way of doing that is to use pod anti-affinity:

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: "app"
                operator: In
                values:
                - zk
          topologyKey: "kubernetes.io/hostname"

The requiredDuringSchedulingIgnoredDuringExecution field tells the Kubernetes scheduler that it must never co-locate two pods that carry the label app=zk within the domain defined by the topologyKey. The topologyKey kubernetes.io/hostname indicates that the domain is an individual node. Using different rules, labels, and selectors, you can extend this technique to spread your pods across physical, network, and power failure domains.
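
To show where such a fragment would sit for the batch use case in the question, here is a minimal sketch of a Job whose pod template carries the anti-affinity rule. The Job name, the `app: sample-processor` label, and the image are hypothetical. Note that this only spreads the pods across nodes; each pod still needs some way to learn which sample to process, and with fewer than 20 schedulable nodes the extra pods would stay Pending while the others run.

```yaml
# Sketch only: the name, label value, and image are hypothetical.
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-processor
spec:
  parallelism: 20      # up to 20 pods at once
  completions: 20      # the Job finishes after 20 successful pods
  template:
    metadata:
      labels:
        app: sample-processor      # label matched by the rule below
    spec:
      restartPolicy: Never
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - sample-processor
            topologyKey: kubernetes.io/hostname   # domain = a single node
      containers:
      - name: worker
        image: example/sample-processor:latest    # hypothetical image
```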

-- 4c74356b41
Source: StackOverflow