I have deployed a Spark application in cluster mode on Kubernetes. The Spark application pod is restarted almost every hour. The driver log has these messages before the restart:
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 1 on x.x.x.x: The executor with id 1 was deleted by a user or the framework.
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 2 on y.y.y.y: The executor with id 2 was deleted by a user or the framework.
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 1 (epoch 0)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, x.x.x.x, 44879, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 0)
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 2 (epoch 1)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, y.y.y.y, 46191, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 1)
20/07/11 13:34:02 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
20/07/11 13:34:16 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
And the Executor log has:
20/07/11 15:55:01 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
20/07/11 15:55:01 INFO MemoryStore: MemoryStore cleared
20/07/11 15:55:01 INFO BlockManager: BlockManager stopped
20/07/11 15:55:01 INFO ShutdownHookManager: Shutdown hook called
How can I find out what is causing the executors to be deleted?
Deployment:
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 0 max surge
Pod Template:
Labels: app=test
chart=test-2.0.0
heritage=Tiller
product=testp
release=test
service=test-spark
Containers:
test-spark:
Image: test-spark:2df66df06c
Port: <none>
Host Port: <none>
Command:
/spark/bin/start-spark.sh
Args:
while true; do sleep 30; done;
Limits:
memory: 4Gi
Requests:
memory: 4Gi
Liveness: exec [/spark/bin/liveness-probe.sh] delay=300s timeout=1s period=30s #success=1 #failure=10
Environment:
JVM_ARGS: -Xms256m -Xmx1g
KUBERNETES_MASTER: https://kubernetes.default.svc
KUBERNETES_NAMESPACE: test-spark
IMAGE_PULL_POLICY: Always
DRIVER_CPU: 1
DRIVER_MEMORY: 2048m
EXECUTOR_CPU: 1
EXECUTOR_MEMORY: 2048m
EXECUTOR_INSTANCES: 2
KAFKA_ADVERTISED_HOST_NAME: kafka.default:9092
ENRICH_KAFKA_ENRICHED_EVENTS_TOPICS: test-events
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: test-spark-5c5997b459 (1/1 replicas created)
Events: <none>
I don't know exactly how you configured your application pod, but to stop it from restarting you can set the following in your deployment YAML; the pod will then never be restarted, and you can debug it from there.
restartPolicy: Never
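For reference, here is a minimal sketch of where that field sits (the pod name is illustrative; the image and memory values are copied from your deployment). Note that a Deployment's pod template only accepts restartPolicy: Always, so while debugging you would typically run the same container as a standalone Pod:

# Minimal standalone debug Pod (name is illustrative, image taken from the deployment above)
apiVersion: v1
kind: Pod
metadata:
  name: test-spark-debug
  namespace: test-spark
spec:
  restartPolicy: Never          # do not restart the pod after the container exits
  containers:
    - name: test-spark
      image: test-spark:2df66df06c
      command: ["/spark/bin/start-spark.sh"]
      resources:
        requests:
          memory: 4Gi
        limits:
          memory: 4Gi

Once you have finished debugging, delete the Pod and switch back to the Deployment.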
I did some quick research on running Spark on Kubernetes, and it seems that by design Spark terminates the executor pods once they have finished running the Spark application. Quoted from the official Spark documentation:
When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
Therefore, I believe there is nothing to worry about with the restarts, as long as your Spark instance still manages to start executor pods as and when required.
Reference: https://spark.apache.org/docs/2.4.5/running-on-kubernetes.html#how-it-works
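If you still want to confirm what actually removed the executor pods, the standard kubectl tooling will usually show it. A rough sketch, assuming the test-spark namespace from your deployment and the spark-role=executor label that Spark on Kubernetes normally puts on executor pods:

# List the executor pods belonging to the application
kubectl get pods -n test-spark -l spark-role=executor

# Describe one of them to see its termination reason and recent events
kubectl describe pod <executor-pod-name> -n test-spark

# Namespace events often record what deleted or evicted a pod (controller, kubelet, user)
kubectl get events -n test-spark --sort-by=.lastTimestamp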