I am trying to copy a PySpark job into a Spark Docker image to be deployed on a Kubernetes cluster. My Dockerfile looks like the following:
ARG base_img=spark-base:latest
FROM $base_img
WORKDIR /
# Reset to root to run installation tasks
USER 0
RUN mkdir ${SPARK_HOME}/python
RUN mkdir /py_files
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    pip3 install --upgrade pip setuptools && \
    # Remove the pip and apt caches to save space
    rm -r /root/.cache && rm -rf /var/cache/apt/*
COPY python/pyspark ${SPARK_HOME}/python/pyspark
COPY python/lib ${SPARK_HOME}/python/lib
COPY /python/py_files /py_files
WORKDIR /opt/spark/work-dir
ENTRYPOINT [ "/opt/entrypoint.sh" ]
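
For completeness, the image is built and pushed roughly like this (a sketch; XXX stands for the same image reference passed to spark-submit below, and the build runs from the directory that contains the python/ folder):

docker build -t XXX --build-arg base_img=spark-base:latest .
docker push XXX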
Afterwards I try to submit the job to the Kubernetes cluster like so:
$SPARK_HOME/bin/spark-submit \
--master k8s://https://XXX \
--deploy-mode cluster \
--name spark-pi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=XXX \
--files local:///py_files/config/config_extraction_prediction.yml \
--py-files local:///py_files/src/ml_pipeline-0.0.1-py3.8.egg \
local:///py_files/src/main_kubernetes.py
But I get the following error:
python3: can't open file '/py_files/src/main_kubernetes.py': [Errno 2] No such file or directory
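
That message shows up in the driver pod's log; I pulled it with something along these lines (the pod name is a placeholder, since Spark generates it from the app name):

kubectl get pods                  # the driver pod is named after --name, e.g. spark-pi-...-driver
kubectl logs <driver-pod-name>    # placeholder for the generated driver pod name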
From what I can tell, the COPY statement should be correct, but I have no way to prove it because the container exits immediately after starting. I tried changing the ENTRYPOINT to "/bin/sh" and running ls, but that didn't work. I also tried to enter the container with docker exec, but that didn't work either.
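
In other words, I would expect something along these lines to show whether the files actually made it into the image, overriding the entrypoint at run time rather than in the Dockerfile (XXX again stands for the image):

docker run --rm --entrypoint /bin/sh XXX -c "ls -R /py_files"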
Does anyone have an idea?