Spark on K8s: Issues loading jar

3/29/2019

I am trying to run a sample Spark application (provided in the Spark examples jar) on Kubernetes and trying to understand the behavior. In this process, I did the following:

  1. Built a running Kubernetes cluster with 3 nodes (1 master and 2 workers) with adequate resources (10 cores, 64 GB memory, 500 GB disk). Note that I don't have internet access on my nodes.
  2. Installed the Spark distribution spark-2.3.3-bin-hadoop2.7.
  3. As there is no internet access on the nodes, I preloaded a Spark image (gcr.io/cloud-solutions-images/spark:v2.3.0-gcs) into Docker on the node running the Kubernetes master.
  4. Ran spark-submit against k8s as follows:
./bin/spark-submit --master k8s://https://test-k8:6443 \
                   --deploy-mode cluster \
                   --name spark-pi \
                   --class org.apache.spark.examples.SparkPi \
                   --conf spark.executor.instances=5 \
                   --conf spark.kubernetes.container.image=gcr.io/cloud-solutions-images/spark:v2.3.0-gcs \
                   --conf spark.kubernetes.driver.pod.name=spark-pi-driver \
                   --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
                   local:///opt/spark/examples/jars/spark-examples_2.11-2.3.3.jar

However, it fails with the following error:

Error: Could not find or load main class org.apache.spark.examples.SparkPi

Regarding the above, I have the following questions:

  1. Do we need to provide Kubernetes with a distribution of Spark, and is that what we are doing with the following setting?
--conf spark.kubernetes.container.image=gcr.io/cloud-solutions-images/spark:v2.3.0-gcs
  2. If I have my own Spark application, say one that processes events from Kafka, what should be my approach?

Any help in debugging the above error and answering my follow-up questions is appreciated.

-- Cheater
apache-spark
kubernetes

1 Answer

3/29/2019

spark.kubernetes.container.image should be an image that has both the Spark binaries and the application code. In my case, since I don't have internet access from my nodes, doing the following let the Spark driver pick up the correct jar.

So, this is what I did:

  1. On my local computer, I ran a Docker build:
docker build -t spark_pi_test:v1.0 -f kubernetes/dockerfiles/spark/Dockerfile .

The above built a Docker image on my local computer.

  2. Saved the built Docker image as a tarball:
docker save spark_pi_test:v1.0 > spark_pi_test_v1.0.tar
  3. scp'd the tarball to all 3 Kubernetes nodes.
  4. Ran docker load on the tarball on all 3 Kubernetes nodes:
docker load < spark_pi_test_v1.0.tar

Then I submitted the Spark job as follows:

./bin/spark-submit --master k8s://https://test-k8:6443 \
                   --deploy-mode cluster \
                   --name spark-pi \
                   --class org.apache.spark.examples.SparkPi \
                   --conf spark.executor.instances=5 \
                   --conf spark.kubernetes.container.image=spark_pi_test:v1.0 \
                   --conf spark.kubernetes.driver.pod.name=spark-pi-driver \
                   --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
                   local:///opt/spark/examples/jars/spark-examples_2.11-2.3.3.jar 100000

The jar path above is the path inside the Docker container. For reference, the Dockerfile is at https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile
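To address the second follow-up question (running your own application, for example one that processes events from Kafka), the approach is the same: build your own image on top of a Spark base image so that your application jar is baked into the container, then refer to it with a local:// path. The sketch below is only an illustration under assumptions; the jar name my-kafka-app.jar, the main class com.example.KafkaApp, the image tag my-kafka-app:v1.0, and the /opt/spark/jars/ target path are all placeholders, not part of the original setup.

# Hypothetical Dockerfile: layer your application jar onto a Spark base image.
# The base image tag, jar name, and target path below are placeholders.
FROM spark_pi_test:v1.0
COPY target/my-kafka-app.jar /opt/spark/jars/my-kafka-app.jar

Build, save, scp, and docker load this image exactly as in the steps above (since there is no registry to pull from), then point spark-submit at the in-container jar path, roughly:

docker build -t my-kafka-app:v1.0 .
./bin/spark-submit --master k8s://https://test-k8:6443 \
                   --deploy-mode cluster \
                   --name my-kafka-app \
                   --class com.example.KafkaApp \
                   --conf spark.executor.instances=5 \
                   --conf spark.kubernetes.container.image=my-kafka-app:v1.0 \
                   --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
                   local:///opt/spark/jars/my-kafka-app.jar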

-- Cheater
Source: StackOverflow