NumPy and other library dependencies for a Spark application on Kubernetes

11/13/2018

I am running a PySpark application (Spark v2.4.0) on Kubernetes. My application depends on the numpy and tensorflow modules. Please suggest a way to add these dependencies to the Spark executors.

I have checked the documentation: remote dependencies can be included using --py-files, --jars, etc., but nothing is mentioned about library dependencies.

-- Lakshman Battini
apache-spark
kubernetes

1 Answer

11/17/2018

I found a way to add library dependencies to Spark applications on K8s and thought I'd share it here.

Add the installation commands for the required dependencies to the Dockerfile and rebuild the Spark image. When you submit the Spark job, the new containers will be instantiated with the dependencies installed as well.

Dockerfile (/{spark_folder_path}/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile) contents:

RUN apk add --no-cache python && \
    apk add --no-cache python3 && \
    python -m ensurepip && \
    python3 -m ensurepip && \
    # Remove ensurepip: pip is already installed on the image,
    # so ensurepip adds no functionality and only takes up 1.6MB
    rm -r /usr/lib/python*/ensurepip && \
    pip install --upgrade pip setuptools && \
    # Python 3 packages can be installed the same way with pip3
    pip install numpy && \
    # Remove the pip cache to save space in the image
    rm -r /root/.cache
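
After editing the Dockerfile, rebuild the image with the docker-image-tool.sh script that ships with Spark 2.4, push it to your registry, and reference it from spark-submit. A minimal sketch; the registry name, tag, and application path below are placeholders, not values from the original post:

# Build and push the JVM, PySpark, and R images; the PySpark
# image is named <registry>/spark-py:<tag>
./bin/docker-image-tool.sh -r <registry> -t v2.4.0-deps build
./bin/docker-image-tool.sh -r <registry> -t v2.4.0-deps push

# Submit the job using the rebuilt PySpark image
bin/spark-submit \
    --master k8s://https://<k8s-apiserver>:<port> \
    --deploy-mode cluster \
    --name my-pyspark-app \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.container.image=<registry>/spark-py:v2.4.0-deps \
    local:///opt/spark/examples/src/main/python/pi.py

Both the driver and executor pods are created from the image named in spark.kubernetes.container.image, so the modules installed in the Dockerfile are available on every executor.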
-- Lakshman Battini
Source: StackOverflow