I am testing Spark 2.3.1 (standalone) on a bare-metal Kubernetes cluster. The cluster consists of two virtual machines, each with 8 GB of RAM and 2 cores. I have deployed one master node and two slave nodes. The node logs look correct, and the workers are correctly registered with the master:
kubectl exec spark-master cat /opt/spark/logs/spark-logs
kubectl exec spark-worker1 cat /opt/spark/logs/spark-logs
And, according to the GUI, workers appear to be ready and able to communicate with the master.
[Screenshot: Spark GUI]
I have opened the following port on the spark container:
I then tried to execute a basic Spark job by launching spark-shell from the container with spark-shell --master spark://spark-master:7077 and running
sc.makeRDD(List(1,2,4,4)).count
as the job.
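When the networking is healthy, the whole round trip looks roughly like this (spark-master is the service name used above; the printed result is simply what count returns for a four-element list):

spark-shell --master spark://spark-master:7077
scala> sc.makeRDD(List(1,2,4,4)).count
res0: Long = 4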
If I use spark-shell from within a slave node, the code is executed and I get a result. However, if I launch the shell from the master, I get the following error message:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
While googling this error message, I came across this issue on GitHub. I'm pretty sure this is a networking issue, because the job does start on a worker node. The job, when launched from within the master container, reaches the workers, but it looks like the workers are not able to answer back to the master. The logs on the workers look like this.
All the ports the workers use to communicate with the master are opened in the deployment.yml, and the firewall is disabled.
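One detail that is easy to miss when opening ports explicitly: besides the master port (7077 by default) and the web UIs (8080 for the master, 8081 for the workers), the executors also connect back to the driver on ports that Spark picks at random unless they are pinned. A minimal sketch, assuming the arbitrary values 40000 and 40001 are free in the pod, of how they could be fixed so they can be declared in deployment.yml as well:

spark-shell --master spark://spark-master:7077 \
  --conf spark.driver.port=40000 \
  --conf spark.blockManager.port=40001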
Has anyone already experienced this situation?
Docker was installed on my laptop. The driver was starting on the Docker NAT interface, so the workers weren't able to answer back to the driver, because they were trying to reach the Docker vEthernet IP. Disabling Docker solved the problem.
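For anyone who hits the same symptom but cannot simply disable Docker: the root cause is that the driver advertises an address the workers cannot route back to, so an alternative is to tell the driver explicitly which address to use. A minimal sketch, where 192.168.1.10 is a placeholder for an IP of the driver machine that the workers can actually reach:

export SPARK_LOCAL_IP=192.168.1.10
spark-shell --master spark://spark-master:7077 --conf spark.driver.host=192.168.1.10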