On forceful deletion of a Spark driver pod, the driver is not restarted

5/1/2020

I have a Spark streaming job that I am trying to submit with the spark-on-k8s-operator. I have set the restart policy to Always. However, when I manually delete the driver pod, the driver is not restarted. My YAML:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: test-v2
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "com/test:v1.0"
  imagePullPolicy: Never
  mainClass: com.test.TestStreamingJob
  mainApplicationFile: "local:///opt/spark-2.4.5/work-dir/target/scala-2.12/test-assembly-0.1.jar"
  sparkVersion: "2.4.5"
  restartPolicy:
    type: Always
  volumes:
    - name: "test-volume"
      hostPath:
        path: "/tmp"
        type: Directory
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 2.4.5
    serviceAccount: spark
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"
    terminationGracePeriodSeconds: 60
  executor:
    cores: 1
    instances: 2
    memory: "512m"
    labels:
      version: 2.4.5
    volumeMounts:
      - name: "test-volume"
        mountPath: "/tmp"

Spark version: 2.4.5
apiVersion: "sparkoperator.k8s.io/v1beta2"

Steps I followed:

1. Create the resource via kubectl apply -f examples/spark-test.yaml. The pod is created successfully.
2. Delete the driver pod manually.

Expected behavior: A new driver pod is started, as per the restart policy.

Actual behavior: The driver and executor pods were deleted, and the driver was not restarted.

Environment: Testing this with Docker on Mac, with 4 CPUs and 8 GB of memory.

Logs from spark-operator: {FAILING driver pod failed with ExitCode: 143, Reason: Error}
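
For context, exit code 143 in that log line conventionally means 128 + 15, i.e. the process was terminated by SIGTERM, which is the signal Kubernetes sends to a container when its pod is deleted. So the operator sees the manually deleted driver as a failed run. A minimal shell sketch of that arithmetic (the variable names are illustrative and not taken from the operator):

```shell
# Exit code reported by the operator for the deleted driver pod.
exit_code=143

# By shell convention, exit codes above 128 mean "killed by signal (code - 128)".
signal_number=$((exit_code - 128))

# Ask the shell for the name of that signal number.
signal_name=$(kill -l "$signal_number")

# prints: exit code 143 = 128 + 15 (SIGTERM)
echo "exit code $exit_code = 128 + $signal_number (SIG$signal_name)"
```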

-- JDev
apache-spark
google-kubernetes-engine
kubernetes
pyspark
spark-streaming

1 Answer

5/3/2020

There was a bug in the Spark-on-K8s operator that has now been fixed, and I can see the manually deleted driver being restarted. Basically, the code was not handling default values for the restart policy fields:

https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/pull/898

Alternatively, set the following values explicitly so that the defaults are not needed:

restartPolicy:
    type: Always
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 3
    onSubmissionFailureRetryInterval: 10
-- JDev
Source: StackOverflow