pyspark spark-submit with Python dependencies from S3

9/24/2019

I am trying to submit a simple PySpark job with external dependencies to my k8s cluster. Mind you, if I put the PySpark application files in the Spark image and reference them using local:///, it works fine.
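Roughly, the working local:/// variant looks like this (the in-image paths here are just my illustration, and I've trimmed the flags to the relevant ones):

    ./spark-submit \
      --master k8s://https://myk8s.example.com/k8s/clusters/cluster01 \
      --conf spark.kubernetes.container.image=pyspark2.4.4:v1 \
      --conf spark.kubernetes.namespace=random \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=airflow \
      --deploy-mode cluster \
      --name k8s_app01 \
      --py-files local:///opt/spark/work-dir/aws_utils.py,local:///opt/spark/work-dir/dataset.py \
      local:///opt/spark/work-dir/python_model.py

With everything on S3 instead, this is the command I run: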

    ./spark-submit \
      --master k8s://https://myk8s.example.com/k8s/clusters/cluster01 \
      --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
      --conf spark.rpc.message.maxSize=1024 \
      --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
      --conf spark.kubernetes.container.image=pyspark2.4.4:v1 \
      --conf spark.kubernetes.namespace=random \
      --conf spark.kubernetes.container.image.pullPolicy=Always \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=airflow \
      --num-executors 5 \
      --total-executor-cores 4 \
      --executor-memory 5g \
      --driver-memory 2g \
      --name k8s_app01 \
      --verbose \
      --queue root.default \
      --deploy-mode cluster \
      --class org.apache.spark.deploy.PythonRunner \
      s3a://s3bucket/pyspark/preprocess/python_model.py \
      --py-files s3a://s3bucket/pyspark/preprocess/aws_utils.py s3a://s3bucket/pyspark/preprocess/dataset.py

It always fails, saying it cannot find the dependency:

    19/09/24 17:02:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Traceback (most recent call last):
      File "/tmp/spark-8c40f730-75f7-4f18-b0d4-8dfca03e07d2/python_model.py", line 1, in <module>
        from aws_utils import *
    ModuleNotFoundError: No module named 'aws_utils'
    log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).

Technically, --py-files should automatically download the files and add them to PYTHONPATH, but somehow that is not happening. Note that the primary application is also located in S3; judging from the traceback, it did get downloaded (it ran from the driver's /tmp staging directory), so fetching from s3a itself works, yet the --py-files dependencies are missing.
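As far as I understand the spark-submit docs, all options, including --py-files (which expects a comma-separated list), have to come before the primary resource; anything after the main .py file is passed through to the application as arguments. So I would have expected the canonical form to look like this (untested sketch with the same S3 paths, non-essential flags trimmed):

    ./spark-submit \
      --master k8s://https://myk8s.example.com/k8s/clusters/cluster01 \
      --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
      --conf spark.kubernetes.container.image=pyspark2.4.4:v1 \
      --conf spark.kubernetes.namespace=random \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=airflow \
      --deploy-mode cluster \
      --name k8s_app01 \
      --py-files s3a://s3bucket/pyspark/preprocess/aws_utils.py,s3a://s3bucket/pyspark/preprocess/dataset.py \
      s3a://s3bucket/pyspark/preprocess/python_model.py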

-- devnull
amazon-s3
kubernetes
pyspark
python
python-3.x

0 Answers