Spark - Master: got disassociated, removing it

7/17/2019

I am deploying a Spark cluster with 1 Master node and 3 Worker nodes. Within moments of deploying them, the Master starts spamming its logs with the following messages:

19/07/17 12:56:51 INFO Master: I have been elected leader! New state: ALIVE
19/07/17 12:56:56 INFO Master: Registering worker 172.26.140.209:35803 with 1 cores, 2.0 GB RAM
19/07/17 12:56:57 INFO Master: 172.26.140.163:59146 got disassociated, removing it.
19/07/17 12:56:58 INFO Master: 172.26.140.132:56252 got disassociated, removing it.
19/07/17 12:56:58 INFO Master: 172.26.140.194:62135 got disassociated, removing it.
19/07/17 12:57:02 INFO Master: Registering worker 172.26.140.169:44249 with 1 cores, 2.0 GB RAM
19/07/17 12:57:02 INFO Master: 172.26.140.163:59202 got disassociated, removing it.
19/07/17 12:57:03 INFO Master: 172.26.140.132:56355 got disassociated, removing it.
19/07/17 12:57:03 INFO Master: 172.26.140.194:62157 got disassociated, removing it.
19/07/17 12:57:07 INFO Master: 172.26.140.163:59266 got disassociated, removing it.
19/07/17 12:57:08 INFO Master: 172.26.140.132:56376 got disassociated, removing it.
19/07/17 12:57:08 INFO Master: Registering worker 172.26.140.204:43921 with 1 cores, 2.0 GB RAM
19/07/17 12:57:08 INFO Master: 172.26.140.194:62203 got disassociated, removing it.
19/07/17 12:57:12 INFO Master: 172.26.140.163:59342 got disassociated, removing it.
19/07/17 12:57:13 INFO Master: 172.26.140.132:56392 got disassociated, removing it.
19/07/17 12:57:13 INFO Master: 172.26.140.194:62268 got disassociated, removing it.
19/07/17 12:57:17 INFO Master: 172.26.140.163:59417 got disassociated, removing it.
19/07/17 12:57:18 INFO Master: 172.26.140.132:56415 got disassociated, removing it.
19/07/17 12:57:18 INFO Master: 172.26.140.194:62296 got disassociated, removing it.
19/07/17 12:57:22 INFO Master: 172.26.140.163:59472 got disassociated, removing it.
19/07/17 12:57:23 INFO Master: 172.26.140.132:56483 got disassociated, removing it.
19/07/17 12:57:23 INFO Master: 172.26.140.194:62323 got disassociated, removing it.

The Worker nodes seem to be connected to the Master correctly and are logging the following:

19/07/17 12:56:56 INFO Utils: Successfully started service 'sparkWorker' on port 35803.
19/07/17 12:56:56 INFO Worker: Starting Spark worker 172.26.140.209:35803 with 1 cores, 2.0 GB RAM
19/07/17 12:56:56 INFO Worker: Running Spark version 2.4.3
19/07/17 12:56:56 INFO Worker: Spark home: /opt/spark
19/07/17 12:56:56 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
19/07/17 12:56:56 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://spark-worker-0.spark-worker-service.default.svc.cluster.local:8081
19/07/17 12:56:56 INFO Worker: Connecting to master spark-master-service.default.svc.cluster.local:7077...
19/07/17 12:56:56 INFO TransportClientFactory: Successfully created connection to spark-master-service.default.svc.cluster.local/10.0.179.236:7077 after 49 ms (0 ms spent in bootstraps)
19/07/17 12:56:56 INFO Worker: Successfully registered with master spark://172.26.140.196:7077

But the Master keeps logging the "got disassociated" message, for three separate addresses, every 5 seconds.

What is strange is that the IP addresses listed in the Master's logs all belong to the kube-proxy pods:

kube-system   kube-proxy-5vp9r                                     1/1     Running            0          39h     172.26.140.163   aks-agentpool-31454219-2   <none>           <none>
kube-system   kube-proxy-kl695                                     1/1     Running            0          39h     172.26.140.132   aks-agentpool-31454219-1   <none>           <none>
kube-system   kube-proxy-xgjws                                     1/1     Running            0          39h     172.26.140.194   aks-agentpool-31454219-0   <none>           <none>
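(For reference, the table above is the output of kubectl get pods --all-namespaces -o wide; the IP column there matches the addresses appearing in the Master's log.)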

My questions are twofold:

1) Why are the kube-proxy pods connecting to the Master? Or why does the Master think that the kube-proxy pods are taking part in this cluster?

2) What setting do I need to change in order to clear this message from my log files?

Here are the contents of my spark-defaults.conf file:

spark.master=spark://spark-master-service:7077
spark.submit.deploy-mode=cluster
spark.executor.cores=1
spark.driver.memory=500m
spark.executor.memory=500m
spark.eventLog.enabled=true
spark.eventLog.dir=/mnt/eventLog
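For reference, a driver that builds its session without passing a master URL picks that value up from spark.master in this file. A minimal sketch of such a driver (the app name is just an illustration):

from pyspark.sql import SparkSession

# No explicit .master(...) call: the master URL falls back to
# spark.master in spark-defaults.conf, i.e. spark://spark-master-service:7077.
spark = SparkSession.builder.appName('example-app').getOrCreate()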

I cannot find any meaningful reason why this is occurring, and any assistance would be greatly appreciated.

-- Sage
apache-spark
kubernetes

1 Answer

12/18/2019

I had the same problem with my Spark cluster in Kubernetes; I saw it with both Spark 2.4.3 and Spark 2.4.4, and with Kubernetes 1.16.0 and 1.13.0.

Here is the solution.

I was originally creating my Spark session like this:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Kubernetes-Spark-app').getOrCreate()

The issue was resolved by using the cluster IP of the Spark master instead:

spark = SparkSession.builder.master('spark://10.0.106.83:7077').appName('Kubernetes-Spark-app').getOrCreate()

This works with the following Helm chart:

helm install microsoft/spark --generate-name     
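If you would rather not hard-code the address, one option is to resolve the Service's cluster IP through cluster DNS when the driver starts. This is only a sketch; it assumes the master Service is called spark-master-service in the default namespace (as in the question) and listens on port 7077:

import socket
from pyspark.sql import SparkSession

# Resolve the Service name to its cluster IP once, then hand the raw IP
# to Spark instead of the hostname. The Service name, namespace, and port
# are assumptions; adjust them to whatever your chart created.
master_ip = socket.gethostbyname('spark-master-service.default.svc.cluster.local')

spark = (SparkSession.builder
         .master('spark://{}:7077'.format(master_ip))
         .appName('Kubernetes-Spark-app')
         .getOrCreate())

Alternatively, read the cluster IP off kubectl get svc for the master Service and paste it in, as in the snippet above.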
-- ckloan
Source: StackOverflow