How to use helm chart for spark on k8s

4/17/2019

I am new to Spark. I am trying to get Spark running on k8s using the Helm chart stable/spark. I can see that it spins up one master and two workers by default and exposes port 8080 on a ClusterIP service.

Now what I have done is expose port 8080 via an ELB so I can see the UI.

Question is: do I always have to bake the jar or PySpark code into the image I use to spin up the master, or do I have other options as well?

I don't want to use k8s as the cluster manager for Spark. I am trying to see if there is a way to host Spark as an application on k8s and submit jobs to it as if it were a standalone cluster with worker nodes.

So instead of using:

spark-submit \
...
--master k8s://https://KUBECLUSTER-DNS-ADDRESS

I want to do:

spark-submit \
...
--master spark://SPARK-MASTER-ELB-DNS

Also, I am trying to avoid baking the job into the Spark Docker image.

-- devnull
apache-spark
kubernetes
kubernetes-helm

1 Answer

4/18/2019

I don't want to use k8s as the cluster manager for Spark. I am trying to see if there is a way to host Spark as an application on k8s and submit jobs to it as if it were a standalone cluster with worker nodes.

You can use client or cluster mode.

client:

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://IP-ADDRESS-OF-MASTER:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

cluster:

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://IP-ADDRESS-OF-MASTER:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

Also, I am trying to avoid baking the job into the Spark Docker image.

The only way is to use client mode. Basically, your driver will run on whatever machine you execute spark-submit from, and that machine will need to have everything required to run your job. The only downside is that you may be susceptible to network latency if the client is not co-located with your Kubernetes cluster.
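With the master reachable from outside the cluster, the client-mode submit the question asks about looks like this. A sketch only: the ELB DNS name below is a hypothetical placeholder, and port 7077 is the standalone master's default RPC port (not the 8080 web UI port). The command is built and printed rather than executed, since actually running it requires a live cluster:

```shell
# Hypothetical DNS name of the ELB that exposes the standalone master's
# RPC port 7077 (adjust to your actual load balancer address).
MASTER_URL="spark://spark-master-elb.example.com:7077"

# Client mode: the driver runs on this machine, so the application jar
# only needs to exist locally -- nothing is baked into the image.
CMD="./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master ${MASTER_URL} \
  --deploy-mode client \
  --executor-memory 2G \
  /path/to/examples.jar 1000"

# Print the assembled command so the sketch can be inspected without a cluster.
echo "${CMD}"
```

The key point is that --master uses the spark:// scheme against the exposed standalone master, not k8s://, so Spark's standalone cluster manager handles scheduling while Kubernetes only hosts the master and worker pods.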

With cluster mode you will have to bake your job into the container image, because the driver can start on any of the worker containers/pods in your cluster.
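For the client-mode route, the master's RPC port (7077) has to be reachable from the submitting machine, not just the 8080 UI port. A minimal sketch of a LoadBalancer Service that would expose it; the selector label here is an assumption, so match it against your actual master pod (kubectl get pods --show-labels):

```yaml
# Hypothetical Service exposing the standalone master's RPC port via an ELB.
apiVersion: v1
kind: Service
metadata:
  name: spark-master-submit
spec:
  type: LoadBalancer
  selector:
    component: spark-master   # assumed label; verify on your Helm release
  ports:
    - name: submit
      port: 7077
      targetPort: 7077
```

On AWS, a Service of type LoadBalancer provisions an ELB automatically, and its DNS name is what you would plug into spark://... for spark-submit.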

-- Rico
Source: StackOverflow