Kubernetes standalone Spark: spark-shell works on slave, not on master: Initial job has not accepted any resources

8/29/2018

I am testing Spark 2.3.1 (standalone) on a bare-metal Kubernetes cluster. I have a cluster with two virtual machines, both with 8 GB of RAM and 2 cores. I have deployed a Spark cluster with one master node and two slave nodes. The node logs look correct, and the workers are correctly registered with the master:

kubectl exec spark-master cat /opt/spark/logs/spark-logs

[Master logs screenshot]

kubectl exec spark-worker1 cat /opt/spark/logs/spark-logs

[Worker logs screenshot]

And according to the GUI, the workers appear to be ready and able to communicate with the master.
[Spark GUI screenshot]

I have opened the following ports on the Spark containers (a sketch of the matching deployment.yml section follows the list):

  • 7077, for the workers to reach the master
  • 7078, the worker's RPC port
  • 37421, spark.driver.port, from spark-defaults.conf
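
For reference, a minimal sketch of how those ports might be declared in the deployment.yml (the container name and image tag here are assumptions, not copied from my actual manifest):

    # deployment.yml (excerpt, hypothetical names)
    containers:
      - name: spark-master
        image: spark:2.3.1            # assumed image tag
        ports:
          - containerPort: 7077       # standalone master RPC port
          - containerPort: 7078       # worker RPC port
          - containerPort: 37421      # spark.driver.port from spark-defaults.conf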

I then tried to execute a basic Spark job by launching spark-shell from the container with spark-shell --master spark://spark-master:7077, using sc.makeRDD(List(1,2,4,4)).count as the job.
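
The full session, assuming Spark lives under /opt/spark (consistent with the log path above), looks roughly like this:

    # launched from inside the master container
    kubectl exec -it spark-master -- /opt/spark/bin/spark-shell --master spark://spark-master:7077

    scala> sc.makeRDD(List(1,2,4,4)).count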

If I use spark-shell from within a slave node, the code is executed and I get a result. However, if I launch the shell from the master, I get the following error message:

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

While googling this error message, I came across this issue on GitHub. I'm pretty sure this is a networking issue, because the job runs when started from a worker node. When launched from within the master container, the job reaches the workers, but it looks like the workers are unable to answer back to the driver running on the master. The worker logs look like this.
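
My understanding is that in client mode the driver advertises an address (spark.driver.host) that the executors must be able to reach back on; if the driver picks the wrong interface, their replies never arrive and the job sits at "Initial job has not accepted any resources". A sketch of the driver-side settings one would pin down in spark-defaults.conf (the hostname value is an assumption):

    # spark-defaults.conf (excerpt)
    # port the driver listens on (already set, per the list above)
    spark.driver.port           37421
    # address the executors use to call the driver back; must be routable from the workers
    spark.driver.host           spark-master
    # interface the driver actually binds to inside the pod
    spark.driver.bindAddress    0.0.0.0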

All the ports the workers use to communicate with the master are opened in the deployment.yml, and the firewall is disabled.
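
One way to rule out filtering is to check that a worker can open a TCP connection back to the driver port (a quick probe, assuming nc is available in the worker image):

    kubectl exec spark-worker1 -- nc -zv spark-master 37421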

Has anyone experienced this situation before?

-- jugo
apache-spark
kubernetes

1 Answer

8/30/2018

Docker was installed on my laptop. The driver was starting up bound to the Docker NAT interface. Hence, the workers weren't able to answer back to the driver, because they were trying to reach the Docker vEthernet IP. Disabling Docker solved the problem.
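
As an alternative to disabling Docker entirely, explicitly pinning the address the driver advertises should have the same effect (the hostname here is an assumption; use whatever address the workers can actually route back to):

    /opt/spark/bin/spark-shell --master spark://spark-master:7077 \
      --conf spark.driver.host=spark-master \
      --conf spark.driver.bindAddress=0.0.0.0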

-- jugo
Source: StackOverflow