I'm trying to run a PySpark job on Kubernetes. Both the main script and the py-files are hosted on Google Cloud Storage. If I launch the job with a local master:
spark-submit \
--master local \
--deploy-mode client \
--repositories "http://central.maven.org/maven2/" \
--packages "org.postgresql:postgresql:42.2.2" \
--py-files https://storage.googleapis.com/foo/some_dependencies.zip \
https://storage.googleapis.com/foo/script.py some args
It works fine. But if I try the same using Kubernetes:
spark-submit \
--master k8s://https://xx.xx.xx.xx \
--deploy-mode cluster \
--conf spark.kubernetes.container.image=gcr.io/my-spark-image \
--repositories "http://central.maven.org/maven2/" \
--packages "org.postgresql:postgresql:42.2.2" \
--py-files https://storage.googleapis.com/foo/some_dependencies.zip \
https://storage.googleapis.com/foo/script.py some args
Then the main script runs, but it can't find the modules in the dependencies zip. I know I could copy all the files into the Docker image, but I would prefer doing it this way.
Is this possible? Am I missing something?
Thanks
Actually, --py-files can be used to distribute dependencies to the executors. Can you share the errors you get? Do you add your zips (SparkContext.addPyFile) in the main .py?
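For example, something like this at the top of the main .py (a minimal sketch; some_dependencies is a placeholder for whatever module lives inside your zip):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# addPyFile accepts local paths as well as HDFS/HTTP/HTTPS/FTP URIs;
# it ships the zip to the executors and puts it on the driver's sys.path
spark.sparkContext.addPyFile("https://storage.googleapis.com/foo/some_dependencies.zip")

import some_dependencies  # placeholder: a module packaged inside the zip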
ENV: spark 2.4.3
UPDATED answer:
In https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management, the docs say:
Note that using application dependencies from the submission client’s local file system is currently not yet supported.
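On Spark 2.4, a workaround that should work is the one the question already mentions: build the dependencies into the container image and reference them with the local:// scheme, which tells Spark the files are already present inside the container. A sketch, assuming the files were copied to /opt/spark/work-dir in the image (that path is just a placeholder):
spark-submit \
--master k8s://https://xx.xx.xx.xx \
--deploy-mode cluster \
--conf spark.kubernetes.container.image=gcr.io/my-spark-image \
--py-files local:///opt/spark/work-dir/some_dependencies.zip \
local:///opt/spark/work-dir/script.py some args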
OLDER answer:
I am facing the same issue. I don't think files passed via --py-files get distributed to the driver and executors. I submit a Python file to the K8s cluster with the following command:
bin/spark-submit \
--master k8s://https://1.1.1.1:6443 \
--deploy-mode cluster \
--name spark-test \
--conf spark.kubernetes.container.image=xxx.com/spark-py:v2.4.3 \
--py-files /xxx/spark-2.4.3-bin-hadoop2.7/spark_test1.py \
http://example.com/spark/__main__.py
I got the following logs in the driver pod:
+ PYTHONPATH='/opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-*.zip:file:///xxx/spark-2.4.3-bin-hadoop2.7/spark_test1.py'
I got errors like the following:
Traceback (most recent call last):
File "/tmp/spark-5e76171d-c5a7-49c6-acd2-f48fdaeeb62a/__main__.py", line 1, in <module>
from spark_test1 import main
ImportError: No module named spark_test1
From the errors, the main Python file does get uploaded and distributed to the driver. For --py-files, though, PYTHONPATH contains the exact same path I passed on the command line, which makes me think those files never get uploaded to that path in the driver and executor pods.
I tried replacing the local path to spark_test1.py with an HTTP URL. The PYTHONPATH changed accordingly, but the error is the same.
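To see what the driver pod actually receives, a quick diagnostic at the top of the main .py can help (just a sketch that prints paths, no job logic):
import os
import sys

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Directory where Spark stages files added through addFile/addPyFile
root = SparkFiles.getRootDirectory()
print("SparkFiles root:", root)
if os.path.isdir(root):
    print("Contents:", os.listdir(root))

# What the Python interpreter can actually import from
print("sys.path:", sys.path)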