Submit a PySpark application from a Kubernetes pod

6/16/2021

Use case: read a stream from Kafka and store it in Parquet files using Spark, then open these Parquet files and generate a graph using GraphFrames.

Infra: I have a Bitnami Spark setup on Kubernetes, connected to Kafka.

The goal is to call spark-submit inside a Kubernetes pod, so that all the code runs in Kubernetes and I don't have to install Spark outside of it.

Without Kubernetes, I do the job in the Spark master container:

docker cp ./Spark/Python_code/edge_stream.py spark_spark_1:/opt/bitnami/spark/edge_stream.py
docker cp ./Spark/Python_code/config.json spark_spark_1:/opt/bitnami/spark/config.json
docker exec spark_spark_1 \
    spark-submit \
    --master spark://0.0.0.0:7077 \
    --deploy-mode client \
    --conf spark.cores.max=1 \
    --conf spark.executor.memory=1g \
    --conf spark.eventLog.enabled=true \
    --conf spark.eventLog.dir=/tmp/spark-events \
    --conf spark.eventLog.rolling.maxFileSize=256m \
    /opt/bitnami/spark/edge_stream.py

Is it possible to do the same job in Kubernetes?

Best regards

-- Sebastien Warichet
apache-kafka
docker
kubernetes
pyspark

3 Answers

7/2/2021
-- Akhil Jain
Source: StackOverflow

8/3/2021

I managed to create the job, but only when I did not use the "command" field. To have a functional YAML file, I had to pass my command via "args" instead. If anyone has an explanation, I'd be glad to hear it :-)

Thanks

apiVersion: batch/v1
kind: Job
metadata:
  name: apao-spark-vertex-job
spec:
  template:
    metadata:
      name: apao-spark-vertex-job
    spec:
      containers:
      - name: apao-spark
        image: apao_spark
        imagePullPolicy: IfNotPresent
        # Passing spark-submit through "args" keeps the image entrypoint in place.
        args:
        - spark-submit
        - --class
        - VertexStreamApp
        - --master
        - spark://apao-service-spark-master-svc:7077
        - --deploy-mode
        - cluster
        - --conf
        - spark.cores.max=1
        - --conf
        - spark.executor.cores=1
        - --conf
        - spark.executor.memory=1g
        - /tmp/app/vertex-stream-project_2.12-1.0.jar
      restartPolicy: Never
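
A plausible explanation (not verified against this particular image): in Kubernetes, "command" overrides the container image's ENTRYPOINT, while "args" only overrides its CMD. The Bitnami Spark images run an entrypoint script that sets up the Spark environment before executing whatever follows, so command: ["spark-submit", ...] bypasses that setup, whereas passing spark-submit through "args" leaves the entrypoint intact.

Assuming the manifest above is saved as apao-spark-vertex-job.yaml, a minimal way to run the Job and read the spark-submit output (the driver itself runs on a worker here because of --deploy-mode cluster):

kubectl apply -f apao-spark-vertex-job.yaml
kubectl wait --for=condition=complete --timeout=300s job/apao-spark-vertex-job
kubectl logs job/apao-spark-vertex-job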
-- Sebastien Warichet
Source: StackOverflow

6/16/2021

Using the kubectl exec command:

minikube kubectl -- exec my-spark-master-0 -- spark-submit \
    --master spark://0.0.0.0:7077 \
    --deploy-mode client \
    --conf spark.cores.max=1 \
    --conf spark.executor.memory=1g \
    --conf spark.eventLog.enabled=true \
    --conf spark.eventLog.dir=/tmp/spark-events \
    --conf spark.eventLog.rolling.maxFileSize=256m \
    ../Python/edge_stream.py
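
The script path at the end resolves inside the container, so the file has to be copied into the pod first. A sketch mirroring the docker cp steps from the question, assuming the same my-spark-master-0 pod and that tar is available in the image (kubectl cp requires it):

minikube kubectl -- cp ./Spark/Python_code/edge_stream.py my-spark-master-0:/opt/bitnami/spark/edge_stream.py
minikube kubectl -- cp ./Spark/Python_code/config.json my-spark-master-0:/opt/bitnami/spark/config.json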
-- Sebastien Warichet
Source: StackOverflow