Why doesn't the pyspark driver download jar files to local storage?

11/5/2019

I am using spark-on-k8s-operator to deploy Spark 2.4.4 on Kubernetes. However, I'm pretty sure this question is about Spark itself, not about a Kubernetes deployment of it.

I include several files when I deploy a job to the kubernetes cluster, including jars, pyfiles, and a main application file. In spark-on-k8s, this is done via a config file:

spec:
  mainApplicationFile: "s3a://project-folder/jobs/test/db_read_k8.py"
  deps:
    jars:
      - "s3a://project-folder/jars/mysql-connector-java-8.0.17.jar"
    pyFiles:
      - "s3a://project-folder/pyfiles/pyspark_jdbc.zip"

This would be equivalent to

spark-submit \
   --jars s3a://project-folder/jars/mysql-connector-java-8.0.17.jar \
   --py-files s3a://project-folder/pyfiles/pyspark_jdbc.zip \
   s3a://project-folder/jobs/test/db_read_k8.py

In spark-on-k8s, there is a sparkapplication kubernetes pod that manages the submitted spark jobs, and that pod spark-submits to a driver pod (which then interacts with the worker pods). My issue occurs on the driver pod. Once the driver receives the spark-submit command, it goes about its business and pulls the required files from AWS S3, as expected. Except it does not pull the jar file:

spark-kubernetes-driver 19/11/05 17:01:19 INFO SparkContext: Added JAR s3a://project-folder/jars/mysql-connector-java-8.0.17.jar at s3a://sezzle-spark/jars/mysql-connector-java-8.0.17.jar with timestamp 1572973279830
spark-kubernetes-driver 19/11/05 17:01:19 INFO SparkContext: Added file s3a://project-folder/jobs/test/db_read_k8.py at s3a://sezzle-spark/jobs/test/db_read_k8.py with timestamp 1572973279872
spark-kubernetes-driver 19/11/05 17:01:19 INFO Utils: Fetching s3a://project-folder/jobs/test/db_read_k8.py to /var/data/spark-f54f76a6-8f2b-4bd5-9644-c406aecac2dd/spark-42e3cd23-55c5-4099-a6af-455efb5dc4f2/userFiles-ae47c908-d0f0-4ff5-aee6-4dadc5c9b95f/fetchFileTemp1013256051456720708.tmp
spark-kubernetes-driver 19/11/05 17:01:19 INFO SparkContext: Added file s3a://project-folder/pyfiles/pyspark_jdbc.zip at s3a://sezzle-spark/pyfiles/pyspark_jdbc.zip with timestamp 1572973279962
spark-kubernetes-driver 19/11/05 17:01:20 INFO Utils: Fetching s3a://project-folder/pyfiles/pyspark_jdbc.zip to /var/data/spark-f54f76a6-8f2b-4bd5-9644-c406aecac2dd/spark-42e3cd23-55c5-4099-a6af-455efb5dc4f2/userFiles-ae47c908-d0f0-4ff5-aee6-4dadc5c9b95f/fetchFileTemp6740168219531159007.tmp

All three required files are "added," but only the main and pyfiles are "fetched." Looking through the driver pod, I can't find the jar file anywhere; it just doesn't get downloaded locally. This, of course, crashes my application, because the mysql driver isn't on the classpath.
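For reference, here is a minimal sketch of the kind of JDBC read that dies this way (the connection details below are hypothetical placeholders, not from the job above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db_read_k8").getOrCreate()

# A standard JDBC load; com.mysql.cj.jdbc.Driver lives inside the jar
# that never gets fetched to the driver.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql.example.com:3306/mydb")  # hypothetical
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "some_table")  # hypothetical
    .load()
)
# Without mysql-connector-java on the driver classpath, .load() raises
# java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver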

Why doesn't spark download jar files to the driver's local filesystem the way it does for the pyfiles and python main?

-- kingledion
apache-spark
kubernetes
pyspark

1 Answer

11/6/2019

PySpark's dependency management is a bit unclear and not well documented.

If your problem is only with adding the .jar, I would recommend using --packages ... instead (the spark-operator should have an analogous option).
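For example, a sketch of the spark-submit form, assuming the connector's Maven coordinates mysql:mysql-connector-java:8.0.17 (matching the jar version in the question); with --packages, Spark resolves the coordinates through Ivy and downloads the jar for the driver itself, rather than relying on the fetch step that is skipping your jar:

spark-submit \
   --packages mysql:mysql-connector-java:8.0.17 \
   --py-files s3a://project-folder/pyfiles/pyspark_jdbc.zip \
   s3a://project-folder/jobs/test/db_read_k8.py

On the operator side, --packages corresponds to the spark.jars.packages setting, which can be passed through the CRD's sparkConf map (newer spark-on-k8s-operator versions also expose a deps.packages list, but check whether your CRD version has it):

spec:
  mainApplicationFile: "s3a://project-folder/jobs/test/db_read_k8.py"
  sparkConf:
    "spark.jars.packages": "mysql:mysql-connector-java:8.0.17"
  deps:
    pyFiles:
      - "s3a://project-folder/pyfiles/pyspark_jdbc.zip"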

Hope it'll work for you.

-- Aliaksandr Sasnouskikh
Source: StackOverflow