How to schedule jobs in a Spark cluster using Kubernetes

1/26/2017

I am rather new to both Spark and Kubernetes, but I am trying to understand how this can work in a production environment. I am planning to use Kubernetes to deploy a Spark cluster. I will then use Spark Streaming to process data from Kafka and write the results to a database. In addition, I am planning to set up a scheduled Spark batch job that runs every night.

1. How do I schedule the nightly batch runs? I understand that Kubernetes has a cron-like feature (see documentation), but from my understanding this schedules container deployments. My containers will already be up and running (since I use the Spark cluster for Spark Streaming); I just want to submit a job to the cluster every night.

2. Where do I store the Spark Streaming application(s) (there might be many), and how do I start them? Should I separate the Spark container from the Spark Streaming application (i.e. should the container only contain a clean Spark node, with the Spark Streaming application kept in persistent storage and the job pushed to the container using kubectl)? Or should my Dockerfile clone the Spark Streaming application from a repository and be responsible for starting it?
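To make the second option concrete, this is roughly what I imagine running the baked-in application image as a long-lived Deployment would look like. The image name, master URL and main class below are only placeholders, and I am not sure whether this is the idiomatic approach:

```yaml
apiVersion: apps/v1                # or extensions/v1beta1 on older clusters
kind: Deployment
metadata:
  name: streaming-app
spec:
  replicas: 1                      # a single driver; executors come from the Spark cluster
  selector:
    matchLabels:
      app: streaming-app
  template:
    metadata:
      labels:
        app: streaming-app
    spec:
      containers:
      - name: driver
        image: my-registry/streaming-app:latest   # placeholder: Spark plus the application jar baked in
        command:
        - spark-submit
        - --master
        - spark://spark-master:7077                # placeholder: service name of the existing Spark master
        - --class
        - com.example.StreamingApp                 # placeholder main class
        - /opt/app/streaming-app.jar               # jar copied into the image by the Dockerfile
```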

I have tried looking through the documentation, but I am unsure how to set this up. Any link or reference that answers my questions is highly appreciated.

-- Cleared
apache-spark
docker
kubernetes
spark-streaming

1 Answer

3/29/2017

You should absolutely use the CronJob resource for the nightly batch runs. See also these repos, which help with bootstrapping Spark on Kubernetes:

https://github.com/ramhiser/spark-kubernetes

https://github.com/navicore/spark-on-kubernetes
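Something along these lines should work as a starting point for the nightly submission. The image, schedule, master URL, class and jar path below are only placeholders, and the apiVersion depends on your cluster version:

```yaml
apiVersion: batch/v1beta1          # or batch/v2alpha1 on older clusters
kind: CronJob
metadata:
  name: spark-nightly-batch
spec:
  schedule: "0 2 * * *"            # every night at 02:00
  concurrencyPolicy: Forbid        # don't start a new run while the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: submit
            image: my-registry/spark-driver:latest   # placeholder: an image with spark-submit installed
            command:
            - spark-submit
            - --master
            - spark://spark-master:7077              # placeholder: service name of the existing Spark master
            - --class
            - com.example.NightlyBatchJob            # placeholder main class
            - /opt/jobs/nightly-batch.jar            # jar baked into the image (or mounted from a volume)
```

The CronJob only spins up a small, short-lived submitter pod on each run; the Spark cluster itself stays up and simply receives the job.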

-- diclophis
Source: StackOverflow