I am rather new to both Spark and Kubernetes, but I am trying to understand how this can work in a production environment. I am planning to use Kubernetes to deploy a Spark cluster. I will then use Spark Streaming to process data from Kafka and output the results to a database. Furthermore, I am planning to set up a scheduled Spark batch job that runs every night.
1. How do I schedule the nightly batch runs? I understand that Kubernetes has a cron-like feature (see documentation), but from my understanding that is for scheduling container deployments. My containers will already be up and running (since I use the Spark cluster for Spark Streaming); I just want to submit a job to the cluster every night.
2. Where do I store the Spark Streaming application(s) (there might be many), and how do I start them? Do I separate the Spark container from the Spark Streaming application (i.e. should the container only contain a clean Spark node, with the Spark Streaming application kept in persistent storage and the job pushed to the container using kubectl)? Or should my Dockerfile clone my Spark Streaming application from a repository and be responsible for starting it?
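To make the second option concrete, is something like this the idea? (The base image, jar path, class name, and master service name below are placeholders I made up, not a working setup.)

```dockerfile
# Hypothetical base image that already has Spark installed
FROM apache/spark:latest

# Bake the streaming application jar into the image at build time
COPY target/streaming-app.jar /opt/app/streaming-app.jar

# Submit the streaming job to the cluster when the container starts
ENTRYPOINT ["/opt/spark/bin/spark-submit", \
            "--master", "spark://spark-master:7077", \
            "--class", "com.example.StreamingApp", \
            "/opt/app/streaming-app.jar"]
```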
I have tried looking through the documentation, but I am unsure how to set this up. Any link or reference that answers my questions would be highly appreciated.
You should absolutely use the CronJob resource for the nightly batch runs... see also these repos for helping bootstrap Spark on k8s.
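A minimal sketch of what that CronJob could look like: a small client container that runs `spark-submit` against your already-running cluster on a nightly schedule. The image name, job class, jar path, and master service address are assumptions you would replace with your own (and on older clusters the `apiVersion` may need to be `batch/v1beta1`):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: spark-nightly-batch
spec:
  schedule: "0 2 * * *"        # every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: spark-submit
            # Hypothetical client image with Spark installed
            image: my-registry/spark-client:latest
            command:
            - /opt/spark/bin/spark-submit
            - --master
            - spark://spark-master:7077      # assumed service name of your Spark master
            - --class
            - com.example.NightlyBatchJob    # hypothetical job class
            - /opt/jobs/nightly-batch.jar    # jar baked into the client image
```

This way the streaming cluster stays up permanently, and the CronJob only spins up a short-lived pod whose sole purpose is to submit the batch job and exit.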