Spark on Kubernetes: Executor pods silently get killed

8/18/2020

I am running a Spark job on kubernetes, and with larger amounts of data, I frequently get "Executor lost", and executors are being killed, with jobs failing. I already did a kubectl logs -f on all the executor pods running, but I never see any exception being thrown (I'd expect something like OutOfMemoryError or the like). The pods just stop calculating out of a sudden, and are then directly removed, so they don't even stay in Error state to be able to dig around and troubleshoot. They just disappear.

How should I troubleshoot this? It seems to me that Kubernetes itself kills the pods because I think they exceed some boundaries, but to my understanding, the pods then should be in Evicted state (or shouldn't they?)

It seems to be related to memory usage because when I turn up spark.executor.memory my job runs to completion (but then with much less executors resulting in much lower speed).

When running the job with local[*] as the master it runs to completion even with much lower memory settings.

Followup 1

I started the job with only one executor and did a kubectl logs -f on the executor pod, and watched the output of the driver (running in client mode). First, there is the "Executor lost" message on the driver, then the executor pod just exits with no exception or error message whatsoever.

Followup 2

When the executor dies, the log of the driver looks like this:

20/08/18 10:36:40 INFO TaskSchedulerImpl: Removed TaskSet 15.0, whose tasks have all completed, from pool
20/08/18 10:36:40 INFO TaskSetManager: Starting task 3.0 in stage 18.0 (TID 1554, 10.244.1.64, executor 1, partition 3, NODE_LOCAL, 7717 bytes)
20/08/18 10:36:40 INFO DAGScheduler: ShuffleMapStage 15 (parquet at DataTasks.scala:208) finished in 5.913 s
20/08/18 10:36:40 INFO DAGScheduler: looking for newly runnable stages
20/08/18 10:36:40 INFO DAGScheduler: running: Set(ShuffleMapStage 18)
20/08/18 10:36:40 INFO DAGScheduler: waiting: Set(ShuffleMapStage 20, ShuffleMapStage 21, ResultStage 22)
20/08/18 10:36:40 INFO DAGScheduler: failed: Set()
20/08/18 10:36:40 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on 10.244.1.64:43809 (size: 159.0 KiB, free: 2.2 GiB)
20/08/18 10:36:40 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to 10.93.111.35:20221
20/08/18 10:36:41 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to 10.93.111.35:20221
20/08/18 10:36:49 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Disabling executor 1.
20/08/18 10:36:49 INFO DAGScheduler: Executor lost: 1 (epoch 12)
20/08/18 10:36:49 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
20/08/18 10:36:49 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, 10.244.1.64, 43809, None)
20/08/18 10:36:49 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
20/08/18 10:36:49 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 12)

On the executor, it looks like this:

20/08/18 10:36:40 INFO Executor: Running task 3.0 in stage 18.0 (TID 1554)
20/08/18 10:36:40 INFO TorrentBroadcast: Started reading broadcast variable 11 with 1 pieces (estimated total size 4.0 MiB)
20/08/18 10:36:40 INFO MemoryStore: Block broadcast_11_piece0 stored as bytes in memory (estimated size 159.0 KiB, free 2.2 GiB)
20/08/18 10:36:40 INFO TorrentBroadcast: Reading broadcast variable 11 took 7 ms
20/08/18 10:36:40 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 457.3 KiB, free 2.2 GiB)
20/08/18 10:36:40 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 1, fetching them
20/08/18 10:36:40 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@node01.maas:34271)
20/08/18 10:36:40 INFO MapOutputTrackerWorker: Got the output locations
20/08/18 10:36:40 INFO ShuffleBlockFetcherIterator: Getting 30 (142.3 MiB) non-empty blocks including 30 (142.3 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
20/08/18 10:36:40 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 3.082897 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 5.132359 ms
20/08/18 10:36:41 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 3, fetching them
20/08/18 10:36:41 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@node01.maas:34271)
20/08/18 10:36:41 INFO MapOutputTrackerWorker: Got the output locations
20/08/18 10:36:41 INFO ShuffleBlockFetcherIterator: Getting 0 (0.0 B) non-empty blocks including 0 (0.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
20/08/18 10:36:41 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 6.770762 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 3.150645 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 2.81799 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 2.989827 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 3.024777 ms
20/08/18 10:36:41 INFO CodeGenerator: Code generated in 4.32011 ms

Then, the executor exits.

The strange thing is: Stage 18.0 starts with Task 3.0, not with 1.0

Followup 3

I now changed the executor log level to DEBUG and I noticed something interesting, hust before the executor exits:

20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@4ef2dc4a
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 64.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@4ef2dc4a
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 128.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d 20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 64.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 256.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d 20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 128.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d 20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 512.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d 20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 256.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d 20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 1024.0 KiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 512.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d 20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 2.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 1024.0 KiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 acquired 4.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:26 DEBUG TaskMemoryManager: Task 1155 release 2.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 acquired 8.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 release 4.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 acquired 16.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 release 8.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 acquired 32.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:27 DEBUG TaskMemoryManager: Task 1155 release 16.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:29 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:30 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:30 DEBUG TaskMemoryManager: Task 1155 release 32.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:34 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:34 DEBUG TaskMemoryManager: Task 1155 acquired 128.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d 20/08/18 14:19:34 DEBUG TaskMemoryManager: Task 1155 release 64.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:36 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:36 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:37 DEBUG TaskMemoryManager: Task 1155 acquired 256.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d 20/08/18 14:19:37 DEBUG TaskMemoryManager: Task 1155 release 128.0 MiB from org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d 20/08/18 14:19:37 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:38 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:38 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:39 DEBUG TaskMemoryManager: Task 1155 acquired 64.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d
20/08/18 14:19:39 DEBUG TaskMemoryManager: Task 1155 acquired 512.0 MiB for org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@5050038d

I gave the executor 4GB memory via spark.executor.memory, and those allocations add up to 1344MB. With 4GB memory and the default memory split settings, 40% are 1400MB.

Could I somehow limit how much memory the UnsafeExternalSorter takes?

Followup 4

I ran into a rare occasion where for some reason, Spark did not kill the "finished" executors and I saw that the pods were OOMKilled. It seems that spark.executor.memory sets both the requested memory of the pod AND the memory configuration in the Spark executor.

-- rabejens
apache-spark
kubernetes

1 Answer

11/11/2020

The Followup 4 was the answer. I ran the job again with kubectl get pod -w and I saw the executor pods getting OOMKilled. I am now running with spark.kubernetes.memoryOverheadFactor=0.5 and spark.memory.fraction=0.2, adjusting spark.executor.memory so high it barely starts one executor per node, and I set spark.executor.cores to the number of cores per node minus 1. That way, it runs.

I also tweaked my algorithm because it had a large partition skew and has to do some calculations that cannot easily be parallelized which caused a LOT of shuffling.

-- rabejens
Source: StackOverflow