I'm learning Spark, and I'm confused about running a Docker container that contains Spark code on a Kubernetes cluster.
I read that Spark utilizes multiple nodes (servers) and can run code on different nodes in order to complete jobs faster (and to use the memory of each node when the data is too big).
On the other hand, I read that a Kubernetes pod (which contains Docker containers) runs on one node.
For example, I'm running the following Spark code from Docker:
num = [1, 2, 3, 4, 5]
num_rdd = sc.parallelize(num)
double_rdd = num_rdd.map(lambda x: x * 2)
Some notes and reminders (from my understanding):
With the map command, each value of the num array is mapped to a different Spark node (slave worker).
A k8s pod runs on one node.
Do the Spark slave workers run on different nodes, and is this how the pod that runs the code above communicates with those nodes in order to utilize the Spark framework?

When you run Spark on Kubernetes, you have a few ways to set things up. The most common way is to set Spark to run in client mode.
Basically, Spark can run on Kubernetes in a Pod. The application itself, given the endpoint of the k8s masters, is then able to spawn its own worker Pods, as long as everything is correctly configured.
What this setup needs is the Spark application deployed on Kubernetes (usually with a StatefulSet, but that's not a must) along with a headless ClusterIP Service, which is required so that the worker Pods can communicate with the master application that spawned them.
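To make the headless Service concrete, here is a minimal sketch that builds such a manifest as a Python dict and prints it as JSON (which `kubectl apply -f` also accepts). The names `spark-driver` and `spark`, the `app` label, and the ports 7078/7079 are illustrative assumptions, not values taken from the setup described above.

```python
import json


def headless_service(name, namespace, selector,
                     driver_port=7078, block_manager_port=7079):
    """Build a headless Service manifest (clusterIP: None) as a dict.

    All arguments are hypothetical examples; the ports only need to match
    whatever the Spark application is configured to listen on.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # "None" is what makes the Service headless: DNS resolves
            # straight to the Pod IP instead of a virtual cluster IP.
            "clusterIP": "None",
            # Must match the labels on the Pod running the Spark application.
            "selector": selector,
            "ports": [
                {"name": "driver", "port": driver_port,
                 "targetPort": driver_port},
                {"name": "blockmanager", "port": block_manager_port,
                 "targetPort": block_manager_port},
            ],
        },
    }


manifest = headless_service("spark-driver", "spark", {"app": "spark-driver"})
print(json.dumps(manifest, indent=2))
```

With a Service like this, the worker Pods can reach the application through a stable DNS name. The application would then need `spark.driver.port` and `spark.driver.blockManager.port` set to the same port numbers.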
You also need to give the Spark application the correct configuration, such as the k8s masters endpoint, the Pod name and other parameters.
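As a rough illustration of what that configuration can look like from PySpark, the sketch below collects the relevant Spark properties in a dict and applies them to a `SparkSession` builder. The property names are standard Spark settings; the helper names `client_mode_conf` and `build_session`, the image, the Service name, and the `svc.cluster.local` DNS suffix (the usual default cluster domain) are assumptions made for the example.

```python
def client_mode_conf(api_server, image, driver_service, namespace="default",
                     executors=2, driver_pod_name=None, driver_port=7078):
    """Collect Spark properties for client mode on Kubernetes.

    Every argument value is a placeholder to be replaced with your own.
    """
    conf = {
        # The k8s:// prefix tells Spark to talk to a Kubernetes API server.
        "spark.master": f"k8s://{api_server}",
        "spark.submit.deployMode": "client",
        "spark.kubernetes.namespace": namespace,
        # Image used for the worker (executor) Pods.
        "spark.kubernetes.container.image": image,
        "spark.executor.instances": str(executors),
        # Address the workers use to call back to the application;
        # here, the DNS name of a headless Service (assumed domain suffix).
        "spark.driver.host": f"{driver_service}.{namespace}.svc.cluster.local",
        "spark.driver.port": str(driver_port),
    }
    if driver_pod_name:
        # Lets Spark tie the worker Pods to the Pod running the application.
        conf["spark.kubernetes.driver.pod.name"] = driver_pod_name
    return conf


def build_session(conf, app_name="double-numbers"):
    """Create a SparkSession from the properties above (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Inside the Pod, something like `build_session(client_mode_conf("https://kubernetes.default.svc", "example/spark-py:latest", "spark-driver", driver_pod_name="spark-driver-0"))` would then give a session whose `sparkContext` can run the `parallelize`/`map` code from the question on the spawned workers.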
There are other ways to set up Spark. There's no obligation to spawn worker Pods: you can run all the stages of your code locally. The configuration is easy, and if you have small jobs with a small amount of data to process you don't need workers.
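For the local case, a minimal sketch could look like the following. It uses Spark's `local[N]` / `local[*]` master URLs, which run everything inside the current process, so no worker Pods are involved. The helper names `local_master` and `double_locally` are made up for this example.

```python
def local_master(cores=None):
    """Return a master URL that keeps all Spark work in this process.

    local[*] uses as many threads as there are cores; local[N] uses N.
    """
    return "local[*]" if cores is None else f"local[{cores}]"


def double_locally(values, cores=None):
    """Run the question's map example without any worker Pods."""
    from pyspark.sql import SparkSession  # imported lazily; needs pyspark

    spark = (
        SparkSession.builder
        .master(local_master(cores))
        .appName("local-double")
        .getOrCreate()
    )
    try:
        rdd = spark.sparkContext.parallelize(values)
        # collect() brings the results back to the application in order.
        return rdd.map(lambda x: x * 2).collect()
    finally:
        spark.stop()
```

Calling `double_locally([1, 2, 3, 4, 5])` runs the same `parallelize` and `map` steps as the code in the question, just on threads of a single container instead of on separate nodes.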
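Or you can run the Spark application outside the Kubernetes cluster, so not in a Pod, while giving it the Kubernetes master endpoints so that it can still spawn workers on the cluster. (In Spark's own terminology this is still client mode; cluster mode is when spark-submit launches the application itself in a Pod on the cluster.)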
You can find a lot more info in the Spark documentation, which explains almost everything needed to set things up (https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode). You can also read about StatefulSets and their use of headless ClusterIP Services here (https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/).