I am new to Spark. I am trying to get Spark running on Kubernetes using the stable/spark Helm chart. I can see that it spins up one master and two workers by default and exposes port 8080 on a ClusterIP service.
What I have done is expose port 8080 via an ELB so I can see the UI.
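For reference, exposing the UI can also be done with a plain LoadBalancer service; a rough sketch using kubectl (the deployment name spark-master is an assumption and depends on what the chart actually creates):

```shell
# Hypothetical: expose the master's UI port through a cloud load balancer.
# "spark-master" is a placeholder; check the resource names the chart created.
kubectl expose deployment spark-master \
  --name=spark-master-ui \
  --type=LoadBalancer \
  --port=8080 --target-port=8080
```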
My question is: do I always have to bake the jar or PySpark code into the image used to spin up the master, or do I have other options as well?
I don't want to use Kubernetes as the cluster manager for Spark. I am trying to see whether there is a way to host Spark as an application on Kubernetes and submit jobs to it as if it were a standalone cluster with worker nodes.
So instead of using:

spark-submit \
  ...
  --master k8s://https://KUBECLUSTER-DNS-ADDRESS

I want to do:

spark-submit \
  ...
  --master spark://SPARK-MASTER-ELB-DNS

Also, I am trying to avoid baking the job into the Spark Docker image.
I don't want to use k8s as Cluster Manager for spark. I am trying to see if there is a way to host spark as an application on k8s and submit jobs to it as it is a standalone cluster with worker nodes.
You can use either client or cluster deploy mode.
client:

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://IP-ADDRESS-OF-MASTER:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000
cluster:

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://IP-ADDRESS-OF-MASTER:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000
Also, I am trying to avoid baking the job in the spark docker image.
The only way is to use client mode. Your driver will run on whatever machine you run spark-submit from, and that machine needs to have all the bits required to execute your job. The only downside is that you might be susceptible to network latency if the client is not co-located with your Kubernetes cluster.
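If that latency matters, one workaround is to run spark-submit from a throwaway pod inside the cluster, so the client-mode driver sits next to the workers. A rough sketch, where the image name and the master's service DNS name are assumptions you would replace with your own:

```shell
# Hypothetical: start an interactive pod from a Spark image and submit from there,
# so the client-mode driver is co-located with the workers.
kubectl run spark-client --rm -it --image=my-registry/spark-base:latest -- bash

# ...then, inside the pod (the master service name "spark-master" is an assumption):
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://spark-master:7077 \
  /path/to/examples.jar \
  1000
```

The job code itself still has to be reachable from that pod (e.g. mounted, copied in, or fetched at startup), but it no longer has to be baked into the master or worker images.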
With cluster mode you will have to bake everything into your container image, because your driver can start on any of the containers/pods that are workers in your cluster.
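For completeness, "baking everything in" for cluster mode typically just means layering the job jar onto a Spark base image; a minimal sketch, where the image names and jar path are placeholders:

```shell
# Hypothetical: build an image containing the job jar, then push it so
# worker pods can start the driver in cluster mode.
cat > Dockerfile <<'EOF'
FROM my-registry/spark-base:latest
COPY target/my-job.jar /opt/spark/jars/my-job.jar
EOF
docker build -t my-registry/spark-with-job:latest .
docker push my-registry/spark-with-job:latest
```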