Here is my setup: a Kubernetes cluster running Airflow, which submits the Spark job to the same Kubernetes cluster. The job runs fine, but the containers are supposed to die once the job is done, yet they are still hanging there.
The DAG is baked into the Airflow Docker image because somehow I am not able to sync the DAGs from S3; for some reason the cron won't run.
Here is my SparkSubmitOperator task:
    spark_submit_task = SparkSubmitOperator(
        task_id='spark_submit_job_from_airflow',
        conn_id='k8s_spark',
        java_class='com.dom.rom.mainclass',
        application='s3a://some-bucket/jars/demo-jar-with-dependencies.jar',
        application_args=['300000'],
        total_executor_cores='8',
        executor_memory='20g',
        num_executors='9',
        name='mainclass',
        verbose=True,
        driver_memory='10g',
        conf={
            'spark.hadoop.fs.s3a.aws.credentials.provider': 'com.amazonaws.auth.InstanceProfileCredentialsProvider',
            'spark.rpc.message.maxSize': '1024',
            'spark.hadoop.fs.s3a.impl': 'org.apache.hadoop.fs.s3a.S3AFileSystem',
            'spark.kubernetes.container.image': 'dockerhub/spark-image:v0.1',
            'spark.kubernetes.namespace': 'random',
            'spark.kubernetes.container.image.pullPolicy': 'IfNotPresent',
            'spark.kubernetes.authenticate.driver.serviceAccountName': 'airflow-spark',
        },
        dag=dag,
    )
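The snippet references a dag object that isn't shown. For completeness, here is a minimal sketch of the surrounding DAG definition it assumes; the dag_id, start date, and schedule are placeholders I made up, and the import path is the Airflow 2.x apache-spark provider (older 1.x versions expose the operator under airflow.contrib.operators.spark_submit_operator instead):

    from datetime import datetime

    from airflow import DAG
    # On Airflow 1.x the import would instead be:
    # from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    dag = DAG(
        dag_id='spark_k8s_demo',          # hypothetical dag_id
        start_date=datetime(2021, 1, 1),  # assumed start date
        schedule_interval='@daily',       # assumed schedule
        catchup=False,
    )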
Figured out the problem: it was my mistake, I wasn't closing the Spark session in the job, so the driver never exited and the pods kept running. Added the following at the end of the job:

    session.stop();
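For context, a minimal sketch of where that call sits in the driver main class; the package and class name come from the java_class above, everything else (builder options, job body) is assumed. Putting the stop in a finally block ensures the driver JVM exits even if the job throws, which is what lets Kubernetes tear down the driver and executor pods:

    package com.dom.rom;

    import org.apache.spark.sql.SparkSession;

    public class mainclass {
        public static void main(String[] args) {
            SparkSession session = SparkSession.builder()
                    .appName("mainclass")
                    .getOrCreate();
            try {
                // ... actual job logic goes here ...
            } finally {
                // Without this the driver keeps running after the job
                // finishes, so the driver/executor pods never terminate.
                session.stop();
            }
        }
    }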