Use case: read a stream from Kafka, store it in Parquet files using Spark, then open these Parquet files and generate a graph using GraphFrames.
Infra: I have a Bitnami Spark deployment on Kubernetes connected to Kafka.
The goal is to run spark-submit inside a Kubernetes pod, so that all the code runs in Kubernetes and I don't have to install Spark outside of it.
Without Kubernetes, I do the job inside the Spark master container:
docker cp ./Spark/Python_code/edge_stream.py spark_spark_1:/opt/bitnami/spark/edge_stream.py
docker cp ./Spark/Python_code/config.json spark_spark_1:/opt/bitnami/spark/config.json
docker exec spark_spark_1 \
spark-submit \
--master spark://0.0.0.0:7077 \
--deploy-mode client \
--conf spark.cores.max=1 \
--conf spark.executor.memory=1g \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=/tmp/spark-events \
--conf spark.eventLog.rolling.maxFileSize=256m \
/opt/bitnami/spark/edge_stream.py
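For reference, a minimal sketch of what a script like edge_stream.py could look like for this use case; the topic name, broker address, schema, and paths are assumptions, not the author's actual code:

# edge_stream.py (sketch) -- needs the spark-sql-kafka and graphframes
# packages on the classpath (e.g. via spark-submit --packages)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType
from graphframes import GraphFrame

spark = SparkSession.builder.appName("edge_stream").getOrCreate()

# Hypothetical schema for edge records arriving on Kafka
edge_schema = StructType([
    StructField("src", StringType()),
    StructField("dst", StringType()),
])

# Read the stream from Kafka (topic and broker are assumptions)
kafka_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "edges")
    .load())

edges = (kafka_df
    .select(from_json(col("value").cast("string"), edge_schema).alias("e"))
    .select("e.src", "e.dst"))

# Persist the stream as Parquet files
query = (edges.writeStream
    .format("parquet")
    .option("path", "/tmp/parquet/edges")
    .option("checkpointLocation", "/tmp/checkpoints/edges")
    .start())
query.awaitTermination(60)  # let the stream run for a while
query.stop()

# Re-open the Parquet files and build a graph with GraphFrames
edges_df = spark.read.parquet("/tmp/parquet/edges")
vertices_df = (edges_df.select(col("src").alias("id"))
    .union(edges_df.select(col("dst").alias("id")))
    .distinct())
g = GraphFrame(vertices_df, edges_df)
print(g.vertices.count(), g.edges.count())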
Is it possible to do the same job in Kubernetes?
Best regards
Have you explored spark operator on k8s? https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md
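With the operator, the submission becomes a SparkApplication custom resource instead of a plain Job. A rough sketch, where the image name, application file path, Spark version, and service account are assumptions to adapt:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: edge-stream
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "apao_spark"
  mainApplicationFile: "local:///opt/bitnami/spark/edge_stream.py"
  sparkVersion: "3.1.1"   # assumed version
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "1g"
    serviceAccount: spark
  executor:
    cores: 1
    instances: 1
    memory: "1g"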
I managed to create the job, but only when I don't use the "command" field. To get a functional YAML file, I had to pass my command as arguments instead (see the note after the manifest below). If anyone has an explanation, I'm all ears :-)
Thanks
apiVersion: batch/v1
kind: Job
metadata:
  name: apao-spark-vertex-job
spec:
  template:
    metadata:
      name: apao-spark-vertex-job
    spec:
      containers:
        - name: apao-spark
          image: apao_spark
          imagePullPolicy: IfNotPresent
          args:
            - "spark-submit"
            - "--class"
            - "VertexStreamApp"
            - "--master"
            - "spark://apao-service-spark-master-svc:7077"
            - "--deploy-mode"
            - "cluster"
            - "--conf"
            - "spark.cores.max=1"
            - "--conf"
            - "spark.executor.cores=1"
            - "--conf"
            - "spark.executor.memory=1g"
            - "/tmp/app/vertex-stream-project_2.12-1.0.jar"
      restartPolicy: Never
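A possible explanation, not verified against this particular image: in a pod spec, "command" overrides the image's ENTRYPOINT while "args" overrides its CMD. Putting everything in "command" therefore bypasses the Bitnami entrypoint script that prepares the Spark environment, whereas "args" lets that entrypoint run and then exec spark-submit. If "command" is really needed, pointing it at the entrypoint explicitly should be roughly equivalent (entrypoint path assumed):

      containers:
        - name: apao-spark
          image: apao_spark
          imagePullPolicy: IfNotPresent
          # "command" replaces ENTRYPOINT, "args" replaces CMD
          command: [ "/opt/bitnami/scripts/spark/entrypoint.sh" ]
          args: [ "spark-submit", "--class", "VertexStreamApp", "--master", "spark://apao-service-spark-master-svc:7077", "--deploy-mode", "cluster", "--conf", "spark.cores.max=1", "--conf", "spark.executor.cores=1", "--conf", "spark.executor.memory=1g", "/tmp/app/vertex-stream-project_2.12-1.0.jar" ]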
Using the exec command of kubectl:
minikube kubectl -- exec my-spark-master-0 -- spark-submit \
--master spark://0.0.0.0:7077 \
--deploy-mode client \
--conf spark.cores.max=1 \
--conf spark.executor.memory=1g \
--conf spark.eventLog.enabled=true \
--conf spark.eventLog.dir=/tmp/spark-events \
--conf spark.eventLog.rolling.maxFileSize=256m \
../Python/edge_stream.py
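If the script and config are not already baked into the image, they can be copied into the pod first, mirroring the docker cp step above (source paths taken from that step, destination path assumed to match it):

minikube kubectl -- cp ./Spark/Python_code/edge_stream.py my-spark-master-0:/opt/bitnami/spark/edge_stream.py
minikube kubectl -- cp ./Spark/Python_code/config.json my-spark-master-0:/opt/bitnami/spark/config.json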