Spark Standalone on Kubernetes - application marked finished after a master failure followed by a driver failure

6/11/2018

I am trying to achieve high availability of the Spark master using ZooKeeper, combined with Spark driver resiliency via metadata checkpointing to GlusterFS.
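
For reference, master HA is enabled through Spark's standard ZooKeeper recovery properties, roughly as below (a minimal sketch; the ZooKeeper address zk-0.zk-headless:2181 is a placeholder, not the actual endpoint):

# spark-env.sh on each master pod; the ZooKeeper address is a placeholder
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk-0.zk-headless:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"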

Some information:

  • Using Spark 2.2.0 (prebuilt binary)
  • Submitting a streaming app with --deploy-mode cluster and --supervise from a separate Spark client pod (example submit command after this list)
  • Spark components on Kubernetes are StatefulSets, for dynamic volume provisioning (previously ReplicationControllers/Deployments)
  • Created 3 GlusterFS-backed shared PVCs - spark-master-pvc, spark-worker-pvc, spark-ckp-pvc
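
The job is submitted roughly as follows (a sketch; the master address, class name, and jar path are taken from the logs and stack trace below):

bin/spark-submit \
  --master spark://10.1.170.81:7077 \
  --deploy-mode cluster \
  --supervise \
  --class org.merlin.spark.SparkKafkaStreamingWithGluster \
  file:///opt/spark/jars/spark-0.0.1-SNAPSHOT.jar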

I have successfully handled the following scenarios: master failure only, driver failure only, consecutive master and driver failure, and driver failure followed by master failure. But this scenario fails: submit a job -> master failure (recovery works fine) -> driver failure, i.e. worker pod failure.

The new ALIVE master's log:

18/06/11 10:23:16 INFO ZooKeeperLeaderElectionAgent: We have gained leadership
18/06/11 10:23:16 INFO Master: I have been elected leader! New state: RECOVERING
18/06/11 10:23:16 INFO Master: Trying to recover app: app-20180611102123-0001
18/06/11 10:23:16 INFO Master: Trying to recover worker: worker-20180611101834-10.1.53.142-36203
18/06/11 10:23:16 INFO Master: Trying to recover worker: worker-20180611102123-10.1.170.85-39447
18/06/11 10:23:16 INFO Master: Trying to recover worker: worker-20180611101834-10.1.185.87-38235
18/06/11 10:23:16 INFO TransportClientFactory: Successfully created connection to /10.1.53.142:36203 after 7 ms (0 ms spent in bootstraps)
18/06/11 10:23:16 INFO TransportClientFactory: Successfully created connection to /10.1.185.87:38235 after 3 ms (0 ms spent in bootstraps)
18/06/11 10:23:16 INFO TransportClientFactory: Successfully created connection to /10.1.53.142:38994 after 12 ms (0 ms spent in bootstraps)
18/06/11 10:23:16 INFO TransportClientFactory: Successfully created connection to /10.1.170.85:39447 after 7 ms (0 ms spent in bootstraps)
18/06/11 10:23:16 INFO Master: Application has been re-registered: app-20180611102123-0001
18/06/11 10:23:16 INFO Master: Worker has been re-registered: worker-20180611102123-10.1.170.85-39447
18/06/11 10:23:16 INFO Master: Worker has been re-registered: worker-20180611101834-10.1.53.142-36203
18/06/11 10:23:16 INFO Master: Worker has been re-registered: worker-20180611101834-10.1.185.87-38235
18/06/11 10:23:16 INFO Master: Recovery complete - resuming operations!
18/06/11 10:24:37 INFO Master: Received unregister request from application app-20180611102123-0001
18/06/11 10:24:37 INFO Master: Removing app app-20180611102123-0001
18/06/11 10:24:37 INFO Master: 10.1.53.142:38994 got disassociated, removing it.
18/06/11 10:24:37 INFO Master: 10.1.53.142:38994 got disassociated, removing it.
18/06/11 10:24:37 WARN Master: Got status update for unknown executor app-20180611102123-0001/0
18/06/11 10:24:37 WARN Master: Got status update for unknown executor app-20180611102123-0001/1
18/06/11 10:24:38 INFO Master: 10.1.53.142:36203 got disassociated, removing it.
18/06/11 10:24:38 INFO Master: Removing worker worker-20180611101834-10.1.53.142-36203 on 10.1.53.142:36203
18/06/11 10:24:38 INFO Master: Re-launching driver-20180611102017-0000
18/06/11 10:24:38 INFO Master: Launching driver driver-20180611102017-0000 on worker worker-20180611101834-10.1.185.87-38235
18/06/11 10:24:38 INFO Master: 10.1.53.142:59142 got disassociated, removing it.
18/06/11 10:24:38 INFO Master: 10.1.53.142:36203 got disassociated, removing it.
18/06/11 10:24:38 INFO Master: 10.1.53.142:36203 got disassociated, removing it.
18/06/11 10:24:43 INFO Master: Registering worker 10.1.53.143:35156 with 8 cores, 30.3 GB RAM

The driver remains in a halted state. Driver error log:

log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/06/11 19:32:14 INFO SecurityManager: Changing view acls to: root
18/06/11 19:32:14 INFO SecurityManager: Changing modify acls to: root
18/06/11 19:32:14 INFO SecurityManager: Changing view acls groups to: 
18/06/11 19:32:14 INFO SecurityManager: Changing modify acls groups to: 
18/06/11 19:32:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
18/06/11 19:32:15 INFO Utils: Successfully started service 'Driver' on port 40594.
18/06/11 19:32:15 INFO WorkerWatcher: Connecting to worker spark://Worker@10.1.185.87:38235
18/06/11 19:32:15 INFO TransportClientFactory: Successfully created connection to /10.1.185.87:38235 after 44 ms (0 ms spent in bootstraps)
18/06/11 19:32:15 INFO WorkerWatcher: Successfully connected to spark://Worker@10.1.185.87:38235
18/06/11 19:32:15 INFO CheckpointReader: Checkpoint files found: file:/ckp/checkpoint-1528712675000,file:/ckp/checkpoint-1528712675000.bk,file:/ckp/checkpoint-1528712670000,file:/ckp/checkpoint-1528712670000.bk,file:/ckp/checkpoint-1528712665000,file:/ckp/checkpoint-1528712665000.bk,file:/ckp/checkpoint-1528712660000,file:/ckp/checkpoint-1528712660000.bk,file:/ckp/checkpoint-1528712655000,file:/ckp/checkpoint-1528712655000.bk
18/06/11 19:32:15 INFO CheckpointReader: Attempting to load checkpoint from file file:/ckp/checkpoint-1528712675000
18/06/11 19:32:15 INFO Checkpoint: Checkpoint for time 1528712675000 ms validated
18/06/11 19:32:15 INFO CheckpointReader: Checkpoint successfully loaded from file file:/ckp/checkpoint-1528712675000
18/06/11 19:32:15 INFO CheckpointReader: Checkpoint was generated at time 1528712675000 ms
18/06/11 19:32:15 INFO SparkContext: Running Spark version 2.2.0
18/06/11 19:32:15 INFO SparkContext: Submitted application: SparkStreamingWithCheckPointAndZK
18/06/11 19:32:15 INFO SecurityManager: Changing view acls to: root
18/06/11 19:32:15 INFO SecurityManager: Changing modify acls to: root
18/06/11 19:32:15 INFO SecurityManager: Changing view acls groups to: 
18/06/11 19:32:15 INFO SecurityManager: Changing modify acls groups to: 
18/06/11 19:32:15 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
18/06/11 19:32:15 INFO Utils: Successfully started service 'sparkDriver' on port 46544.
18/06/11 19:32:15 INFO SparkEnv: Registering MapOutputTracker
18/06/11 19:32:15 INFO SparkEnv: Registering BlockManagerMaster
18/06/11 19:32:15 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/06/11 19:32:15 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/06/11 19:32:16 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-623c4b9e-8045-4a19-a746-96a3b23c1184
18/06/11 19:32:16 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
18/06/11 19:32:16 INFO SparkEnv: Registering OutputCommitCoordinator
18/06/11 19:32:16 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/06/11 19:32:16 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.1.185.87:4040
18/06/11 19:32:16 INFO SparkContext: Added JAR file:///opt/spark/jars/spark-0.0.1-SNAPSHOT.jar at spark://10.1.185.87:46544/jars/spark-0.0.1-SNAPSHOT.jar with timestamp 1528745536460
18/06/11 19:32:16 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://10.1.170.81:7077...
18/06/11 19:32:36 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://10.1.170.81:7077...
18/06/11 19:32:56 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://10.1.170.81:7077...
18/06/11 19:33:16 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
18/06/11 19:33:16 WARN StandaloneSchedulerBackend: Application ID is not initialized yet.
18/06/11 19:33:16 INFO SparkUI: Stopped Spark web UI at http://10.1.185.87:4040
18/06/11 19:33:16 INFO StandaloneSchedulerBackend: Shutting down all executors
18/06/11 19:33:16 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46323.
18/06/11 19:33:16 INFO NettyBlockTransferService: Server created on 10.1.185.87:46323
18/06/11 19:33:16 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/06/11 19:33:16 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
18/06/11 19:33:16 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.1.185.87, 46323, None)
18/06/11 19:33:16 WARN StandaloneAppClient$ClientEndpoint: Drop UnregisterApplication(null) because has not yet connected to master
18/06/11 19:33:16 INFO BlockManagerMasterEndpoint: Registering block manager 10.1.185.87:46323 with 366.3 MB RAM, BlockManagerId(driver, 10.1.185.87, 46323, None)
18/06/11 19:33:16 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.1.185.87, 46323, None)
18/06/11 19:33:16 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.1.185.87, 46323, None)
18/06/11 19:33:16 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/06/11 19:33:16 INFO MemoryStore: MemoryStore cleared
18/06/11 19:33:16 INFO BlockManager: BlockManager stopped
18/06/11 19:33:16 INFO BlockManagerMaster: BlockManagerMaster stopped
18/06/11 19:33:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/06/11 19:33:16 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:524)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:141)
at org.apache.spark.streaming.StreamingContext$anonfun$getOrCreate$1.apply(StreamingContext.scala:829)
at org.apache.spark.streaming.StreamingContext$anonfun$getOrCreate$1.apply(StreamingContext.scala:829)
at scala.Option.map(Option.scala:146)
at org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:829)
at org.apache.spark.streaming.api.java.JavaStreamingContext$.getOrCreate(JavaStreamingContext.scala:626)
at org.apache.spark.streaming.api.java.JavaStreamingContext.getOrCreate(JavaStreamingContext.scala)
at org.merlin.spark.SparkKafkaStreamingWithGluster.main(SparkKafkaStreamingWithGluster.java:42)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
18/06/11 19:33:16 INFO SparkContext: SparkContext already stopped.

Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:524)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:141)
at org.apache.spark.streaming.StreamingContext$anonfun$getOrCreate$1.apply(StreamingContext.scala:829)
at org.apache.spark.streaming.StreamingContext$anonfun$getOrCreate$1.apply(StreamingContext.scala:829)
at scala.Option.map(Option.scala:146)
at org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:829)
at org.apache.spark.streaming.api.java.JavaStreamingContext$.getOrCreate(JavaStreamingContext.scala:626)
at org.apache.spark.streaming.api.java.JavaStreamingContext.getOrCreate(JavaStreamingContext.scala)
at org.merlin.spark.SparkKafkaStreamingWithGluster.main(SparkKafkaStreamingWithGluster.java:42)
... 6 more
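
For context, line 42 of the application (per the stack trace) is the standard checkpoint-recovery entry point. A minimal sketch of that pattern, assuming the /ckp directory and app name from the log (the 5-second batch interval matches the checkpoint cadence; the actual Kafka stream logic is omitted):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SparkKafkaStreamingWithGluster {
    public static void main(String[] args) throws Exception {
        final String checkpointDir = "/ckp"; // GlusterFS-backed mount (spark-ckp-pvc)

        // Standard recovery pattern: restore the StreamingContext from checkpoint
        // metadata if present, otherwise build a fresh one via the factory function.
        JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir, () -> {
            SparkConf conf = new SparkConf().setAppName("SparkStreamingWithCheckPointAndZK");
            JavaStreamingContext ctx = new JavaStreamingContext(conf, Durations.seconds(5));
            ctx.checkpoint(checkpointDir);
            // ... Kafka DStream definition and processing omitted ...
            return ctx;
        });

        jssc.start();
        jssc.awaitTermination();
    }
}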

Am I choosing the right resource controller, i.e. Kubernetes StatefulSets, for Spark? I'm new to this environment; any help will be highly appreciated.

-- Sayak Ghosh
apache-spark
apache-spark-standalone
kubernetes
kubernetes-statefulset
spark-streaming

1 Answer

6/13/2018

It seems your driver is not able to reach the master node. Here is the relevant log line:

18/06/11 19:33:16 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.

Try to telnet to that IP and port from your client machine to verify connectivity.
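
For example, using the master address from the driver's log:

telnet 10.1.170.81 7077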

-- Kaushal
Source: StackOverflow