Does Spark on Kubernetes support the --py-files argument?

4/11/2019

I'm trying to run a PySpark job using Kubernetes. Both the main script and the py-files are hosted on Google Cloud Storage. If I launch the job in local mode:

spark-submit \
--master local \
--deploy-mode client \
--repositories "http://central.maven.org/maven2/" \
--packages "org.postgresql:postgresql:42.2.2" \
--py-files https://storage.googleapis.com/foo/some_dependencies.zip \
https://storage.googleapis.com/foo/script.py some args

It works fine. But if I try the same using Kubernetes:

spark-submit \
--master k8s://https://xx.xx.xx.xx  \
--deploy-mode cluster \
--conf spark.kubernetes.container.image=gcr.io/my-spark-image \
--repositories "http://central.maven.org/maven2/" \
--packages "org.postgresql:postgresql:42.2.2" \
--py-files https://storage.googleapis.com/foo/some_dependencies.zip \
https://storage.googleapis.com/foo/script.py  some args

Then the main script runs, but it can't find the modules from the dependencies zip. I know I can copy all the files into the Docker image, but I would prefer doing it this way.
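
For context, the main script imports from the zip roughly like this (a sketch only; the module and function names are hypothetical, not from the actual job):

# script.py -- hypothetical sketch of the main script
import sys
from pyspark.sql import SparkSession

# 'deps.utils' is assumed to be a package inside some_dependencies.zip
from deps.utils import run_job

if __name__ == "__main__":
    spark = SparkSession.builder.appName("py-files-test").getOrCreate()
    run_job(spark, sys.argv[1:])
    spark.stop()

Locally the import succeeds; on Kubernetes it fails with an ImportError.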

Is this possible? Am I missing something?

Thanks

-- Pablo
apache-spark
kubernetes
pyspark

2 Answers

4/11/2019

Actually --py-files can be used to distribute dependencies to executors. Can you show the errors you get? Do you add your zips (via SparkContext.addPyFile) before importing them in the main .py?
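
For reference, here is a minimal sketch of that approach, reusing the zip URL from the question; the module name inside the zip is hypothetical:

from pyspark import SparkContext

sc = SparkContext(appName="py-files-test")

# Make the zip available on the driver's and every executor's Python path at runtime.
sc.addPyFile("https://storage.googleapis.com/foo/some_dependencies.zip")

# Import only after addPyFile; 'some_module' is a hypothetical module inside the zip.
import some_module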

-- machine424
Source: StackOverflow

7/8/2019

ENV: spark 2.4.3

UPDATED answer:

In https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management, the docs say:

Note that using application dependencies from the submission client’s local file system is currently not yet supported.

OLDER answer:

I am facing the same issue. I don't think the files in --py-files get distributed to the driver and executors. I submit a Python file to the K8s cluster with the following command:

bin/spark-submit \
--master k8s://https://1.1.1.1:6443 \
--deploy-mode cluster \
--name spark-test \
--conf spark.kubernetes.container.image=xxx.com/spark-py:v2.4.3 \
--py-files /xxx/spark-2.4.3-bin-hadoop2.7/spark_test1.py \
http://example.com/spark/__main__.py

I got logs in driver pod:

+ PYTHONPATH='/opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-*.zip:file:///xxx/spark-2.4.3-bin-hadoop2.7/spark_test1.py'

I got errors like following:

Traceback (most recent call last):
  File "/tmp/spark-5e76171d-c5a7-49c6-acd2-f48fdaeeb62a/__main__.py", line 1, in <module>
    from spark_test1 import main
ImportError: No module named spark_test1

From the errors, the main Python file does get uploaded and distributed to the driver. For --py-files, though, PYTHONPATH contains the exact same path I passed on the command line, so I don't think those files ever get uploaded to that path in the driver and executor pods.

I tried replacing the local path to spark_test1.py with an HTTP URL. The PYTHONPATH changed accordingly, but the error is the same.
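
A small debugging sketch that can help here, assuming it is placed at the very top of the submitted __main__.py, just prints what the driver process actually sees:

import os
import sys

# Check whether the --py-files entry was uploaded, or is still the
# literal submit-side path that appears in PYTHONPATH.
print("PYTHONPATH:", os.environ.get("PYTHONPATH", "<not set>"))
for p in sys.path:
    print("sys.path entry:", p)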

-- lephix
Source: StackOverflow