I am trying to submit a simple PySpark job with external dependencies to my Kubernetes cluster. Note that if I bake the PySpark application and its dependencies into the Spark image and reference them with local:/// paths, everything works fine.
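For reference, the working variant looks roughly like this (the /opt/spark/work-dir paths are just where I copied the files in my image, not anything required):

./spark-submit --master k8s://https://myk8s.example.com/k8s/clusters/cluster01 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=pyspark2.4.4:v1 \
  --conf spark.kubernetes.namespace=random \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=airflow \
  --name k8s_app01 \
  --py-files local:///opt/spark/work-dir/aws_utils.py,local:///opt/spark/work-dir/dataset.py \
  local:///opt/spark/work-dir/python_model.py

The problem starts when I move the application and its dependencies to S3: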
./spark-submit --master k8s://https://myk8s.example.com/k8s/clusters/cluster01 \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
  --conf spark.rpc.message.maxSize=1024 \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.kubernetes.container.image=pyspark2.4.4:v1 \
  --conf spark.kubernetes.namespace=random \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=airflow \
  --num-executors 5 \
  --total-executor-cores 4 \
  --executor-memory 5g \
  --driver-memory 2g \
  --name k8s_app01 \
  --verbose \
  --queue root.default \
  --deploy-mode cluster \
  --class org.apache.spark.deploy.PythonRunner \
  s3a://s3bucket/pyspark/preprocess/python_model.py \
  --py-files s3a://s3bucket/pyspark/preprocess/aws_utils.py s3a://s3bucket/pyspark/preprocess/dataset.py
The job always fails, saying it can't find the dependency:
19/09/24 17:02:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "/tmp/spark-8c40f730-75f7-4f18-b0d4-8dfca03e07d2/python_model.py", line 1, in <module>
from aws_utils import *
ModuleNotFoundError: No module named 'aws_utils'
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
As far as I understand, --py-files should automatically download the listed files and add them to the PYTHONPATH, but somehow that is not happening. Note that the primary application file is also located in S3.
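For completeness, this is the usage I took from the spark-submit documentation, where --py-files is a comma-separated list passed as an option before the primary application file (same S3 paths as above, just a sketch of what I expected to work):

./spark-submit --master k8s://https://myk8s.example.com/k8s/clusters/cluster01 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=pyspark2.4.4:v1 \
  --py-files s3a://s3bucket/pyspark/preprocess/aws_utils.py,s3a://s3bucket/pyspark/preprocess/dataset.py \
  s3a://s3bucket/pyspark/preprocess/python_model.py

What am I missing to get the S3-hosted dependencies picked up on the Kubernetes driver and executors?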