Copy job into Spark Docker container: [Errno 2] No such file or directory

6/22/2021

I am trying to copy a PySpark job into a Spark Docker image that will be deployed on a Kubernetes cluster. My Dockerfile looks like the following:

ARG base_img=spark-base:latest

FROM $base_img
WORKDIR /

# Reset to root to run installation tasks
USER 0

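# Create the directories that the PySpark sources and my job files will be copied into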
RUN mkdir ${SPARK_HOME}/python
RUN mkdir /py_files
RUN apt-get update && \
    apt-get install -y python3 python3-pip && \
    pip3 install --upgrade pip setuptools && \
    # Remove the caches to save space
    rm -r /root/.cache && rm -rf /var/cache/apt/*

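# Copy the PySpark sources and libraries from the Spark distribution, plus my own job files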
COPY python/pyspark ${SPARK_HOME}/python/pyspark
COPY python/lib ${SPARK_HOME}/python/lib
COPY /python/py_files /py_files

WORKDIR /opt/spark/work-dir
ENTRYPOINT [ "/opt/entrypoint.sh" ]
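For reference, this is roughly how I build and push the image (the tag is a placeholder for the same XXX image name I pass to spark-submit below):

# build the image from the Dockerfile above and push it to my registry
docker build -t XXX -f Dockerfile .
docker push XXX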

Afterwards I try to submit the job to the Kubernetes cluster like so:

$SPARK_HOME/bin/spark-submit \
--master k8s://https://XXX \
--deploy-mode cluster \
--name spark-pi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=XXX \
--files local:///py_files/config/config_extraction_prediction.yml \
--py-files local:///py_files/src/ml_pipeline-0.0.1-py3.8.egg \
local:///py_files/src/main_kubernetes.py

But I get the following error:

python3: can't open file '/py_files/src/main_kubernetes.py': [Errno 2] No such file or directory

From what I can tell, the COPY statements should be correct, but I have no way to prove it because the container exits immediately after starting. I tried changing the ENTRYPOINT to "/bin/sh" and running ls, but that didn't work. I also tried to get into the container with docker exec, but that didn't work either.
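This is roughly what I tried (the image name XXX and the container id are placeholders):

# override the entrypoint to get a shell and list the copied files
docker run --rm -it --entrypoint /bin/sh XXX
ls /py_files/src

# exec into the running container (fails because it has already exited)
docker exec -it <container-id> ls /py_files/src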

Does anyone have an idea?

-- Lorenz
apache-spark
docker
dockerfile
kubernetes
python

0 Answers