I have spent days now trying to figure out a dependency issue I'm experiencing with (Py)Spark running on Kubernetes. I'm using the spark-on-k8s-operator and Spark's Google Cloud connector.
When I submit my Spark job without any extra dependencies using sparkctl create sparkjob.yaml with the .yaml file below, it works like a charm.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-job
  namespace: my-namespace
spec:
  type: Python
  pythonVersion: "3"
  hadoopConf:
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
    "fs.gs.project.id": "our-project-id"
    "fs.gs.system.bucket": "gcs-bucket-name"
    "google.cloud.auth.service.account.enable": "true"
    "google.cloud.auth.service.account.json.keyfile": "/mnt/secrets/keyfile.json"
  mode: cluster
  image: "image-registry/spark-base-image"
  imagePullPolicy: Always
  mainApplicationFile: ./sparkjob.py
  deps:
    jars:
      - https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.11/2.4.5/spark-sql-kafka-0-10_2.11-2.4.5.jar
  sparkVersion: "2.4.5"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 2.4.5
    serviceAccount: spark-operator-spark
    secrets:
      - name: "keyfile"
        path: "/mnt/secrets"
        secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: our-project-id
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.5
    secrets:
      - name: "keyfile"
        path: "/mnt/secrets"
        secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: our-project-id
The Docker image spark-base-image is built from this Dockerfile:

FROM gcr.io/spark-operator/spark-py:v2.4.5
RUN rm $SPARK_HOME/jars/guava-14.0.1.jar
ADD https://repo1.maven.org/maven2/com/google/guava/guava/28.0-jre/guava-28.0-jre.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop2-2.0.1/gcs-connector-hadoop2-2.0.1-shaded.jar $SPARK_HOME/jars
ENTRYPOINT [ "/opt/entrypoint.sh" ]
The main application file is uploaded to GCS when the application is submitted, then fetched from there and copied into the driver pod when the application starts. The problem begins whenever I want to supply my own Python module deps.zip as a dependency, so that I can use it in my main application file sparkjob.py.
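For context on why a zip archive works as a Python dependency at all: Python can import modules directly from a zip archive on sys.path, which is the mechanism Spark relies on for spark.deps.pyFiles / --py-files. A minimal local sketch (the helpers module name is made up for illustration):

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny deps.zip containing a hypothetical helper module,
# mimicking the dependency archive shipped via spark.deps.pyFiles.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "deps.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("helpers.py", "def greet():\n    return 'hello from deps.zip'\n")

# Spark puts each pyFiles entry on sys.path of the driver and executors;
# the same mechanism can be reproduced locally:
sys.path.insert(0, zip_path)
import helpers

print(helpers.greet())  # hello from deps.zip
```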
Here's what I have tried so far:
1. I added the following lines to spark.deps in sparkjob.yaml:

   pyFiles:
     - ./deps.zip

which resulted in the operator not even being able to submit the Spark application, with the error

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found

./deps.zip is successfully uploaded to the GCS bucket along with the main application file, but while the main application file can be fetched from GCS without issues (I can see this in the logs of jobs without dependencies, as defined above), ./deps.zip somehow cannot be fetched from there. I also tried adding the gcs-connector jar to the spark.deps.jars list explicitly, but nothing changed.
2. I added ./deps.zip to the base Docker image used for the driver and executor pods by adding

   COPY ./deps.zip /mnt/

to the above Dockerfile, and declared the dependency in sparkjob.yaml via

   pyFiles:
     - local:///mnt/deps.zip

This time the Spark job can be submitted and the driver pod starts, but I get a file:/mnt/deps.zip not found error when the Spark context is initialized. I also tried additionally setting ENV SPARK_EXTRA_CLASSPATH=/mnt/ in the Dockerfile, without success. I even tried to explicitly mount the whole /mnt/ directory into the driver and executor pods using volume mounts, but that didn't work either.
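For completeness, the volume-mount variant would presumably have looked something like this in the SparkApplication spec (the volume name and the emptyDir choice are my assumptions; as noted, this did not resolve the error):

```yaml
spec:
  volumes:
    - name: deps-volume
      emptyDir: {}
  driver:
    volumeMounts:
      - name: deps-volume
        mountPath: /mnt
```

One thing worth noting about this approach: a volume mounted at /mnt shadows whatever the image already has at that path, so it would hide both a deps.zip baked into the image and the /mnt/secrets keyfile directory, which may explain why it made things worse rather than better.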
Edit:
My workaround (2), adding the dependencies to the Docker image and setting ENV SPARK_EXTRA_CLASSPATH=/mnt/ in the Dockerfile, actually worked! It turns out the tag didn't update and I had been using an old version of the Docker image all along. Duh.
I still don't know why the (more elegant) solution 1 via the gcs-connector isn't working, but it might be related to https://stackoverflow.com/questions/62408395/mountvolume-setup-failed-for-volume-spark-conf-volume
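The stale-tag trap can be sidestepped by never reusing a mutable tag. One option is deriving the tag from the build context's contents, so any change produces a new tag; this tagging scheme is my own suggestion, not part of the setup above:

```shell
# Write an example Dockerfile so the snippet is self-contained; in practice
# you would hash your real Dockerfile (or the whole build context).
printf 'FROM gcr.io/spark-operator/spark-py:v2.4.5\n' > Dockerfile.example

# Derive a short content-based tag: any change to the file yields a new
# tag, so a cached or stale image can never be pulled by accident.
TAG=$(sha256sum Dockerfile.example | cut -c1-12)
echo "image-registry/spark-base-image:${TAG}"

# Then build and push under that tag (commented out here):
# docker build -t "image-registry/spark-base-image:${TAG}" .
# docker push "image-registry/spark-base-image:${TAG}"
```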
Use the Google Cloud Storage path to the Python dependencies, since they are uploaded there:

spec:
  deps:
    pyFiles:
      - gs://gcs-bucket-name/deps.zip
If the zip file contains jars which you will always require while running your Spark job: facing a similar issue, I just added

FROM gcr.io/spark-operator/spark-py:v2.4.5
COPY mydepjars/ /opt/spark/jars/

and everything gets loaded within my Spark session. This could be one way to do it.