Running Beam Dataflow jobs from Kubernetes

6/26/2018

I am curious to know whether a Beam Dataflow job can be run from Kubernetes. I can see a lot of Spring Data Flow jobs run from Kubernetes, but not Beam Dataflow.

I did find one example: https://github.com/sanderploegsma/beam-scheduling-kubernetes/blob/master/kubernetes/cronjob.yml

But it doesn't explain how to pass args such as:

args: ["--runner=DataflowRunner --project=$project --gcpTempLocation=$gcptemp"]
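From what I can tell, in a Kubernetes container spec each flag is normally its own list element, and Kubernetes can expand $(VAR) references in args from the container's env. A minimal sketch of what I mean — the image name, project ID and bucket below are placeholders, not real values:

containers:
- name: beam-job
  image: my-beam-app            # placeholder image name
  env:
  - name: PROJECT
    value: my-gcp-project       # placeholder: your GCP project ID
  - name: GCP_TEMP
    value: gs://my-bucket/tmp   # placeholder: a GCS temp location
  args:
  - "--runner=DataflowRunner"
  - "--project=$(PROJECT)"           # $(VAR) is expanded by Kubernetes from the env above
  - "--gcpTempLocation=$(GCP_TEMP)"

(A single space-separated string, as in my example above, would be handed to the container as one argument, which is probably not what I want.)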

Expanding on this from https://streambench.wordpress.com/2018/06/07/set-up-the-direct-runner-for-beam/, I want to deploy this part on Kubernetes:

beam_app_direct:
  container_name: "beam_direct_app"
  image: "beam_direct_app"
  build:
    context: .
    dockerfile: ./Dockerfile-direct
  environment:
    - YOUR_ENV_PARAMETER=42
    - ANOTHER_ENV_PARAMETER=abc
    - ...
  links:
    - ...
  # volumes:
  #   - ./your-beam-app:/usr/src/your-beam-app
  command: "bash ./init.sh"

but I have no idea how it can be deployed.
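As far as I understand, Kubernetes has no equivalent of Compose's build: key, so the image would first have to be built (docker build -f Dockerfile-direct ...) and pushed to a registry, then referenced from a manifest. I imagine the service above would translate to something like this — the registry path is a guess on my part:

apiVersion: batch/v1
kind: Job
metadata:
  name: beam-direct-app
spec:
  template:
    spec:
      containers:
      - name: beam-direct-app
        image: gcr.io/YOUR_PROJECT/beam_direct_app:latest  # assumption: built from Dockerfile-direct and pushed here
        env:
        - name: YOUR_ENV_PARAMETER
          value: "42"
        - name: ANOTHER_ENV_PARAMETER
          value: "abc"
        command: ["bash", "./init.sh"]
      restartPolicy: Never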

Updating with more details.

My cronjob.yaml file:

apiVersion: batch/v1
kind: Job
metadata:
      name: "cronjob"
spec:
    template:
    spec:
    containers:
     - name: campaignjob
      image: cronjob
      build:
      context: .
      dockerfile: ./Dockerfile
      command: "bash ./init.sh"
  restartPolicy: Never

kubectl apply -f cronjob.yaml --validate=false

I am getting the following error:

The Job "cronjob" is invalid: * spec.template.spec.containers: Required value * spec.template.spec.restartPolicy: Unsupported value: "Always": supported values: OnFailure, Never

Update: I am very surprised. I realised it is just a case of a wrong YAML file, but even after 4 days there is not a single comment. I even sent this issue to the Google team, but they are asking me to use another technology.
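For the record, both validation errors seem to come from indentation: containers must sit under spec.template.spec, and restartPolicy must sit at that same level (misplaced, it falls back to the default Always, which Jobs do not accept). The build:, context: and dockerfile: keys are Docker Compose concepts that Kubernetes does not understand, so the image has to be built and pushed beforehand. Something like this should validate, assuming the image already exists in a registry:

apiVersion: batch/v1
kind: Job
metadata:
  name: cronjob
spec:
  template:
    spec:
      containers:
      - name: campaignjob
        image: cronjob            # assumption: this image is already built and pushed to a registry
        command: ["bash", "./init.sh"]
      restartPolicy: Never

If a scheduled run is actually intended, kind: CronJob (batch/v1beta1 on clusters of that era) with a schedule field would be needed instead of a plain Job.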

-- user1115163
apache-beam
docker
google-kubernetes-engine
java
yaml

1 Answer

7/11/2018

From the GitHub link you provided, the job would have to run on the master node. Within GKE, you do not have access to the master node, as it is a managed service.

I would suggest using Google Cloud Dataflow, which is built to run the kind of jobs you describe.

-- Jason
Source: StackOverflow