Spark in Kubernetes Connection Refused

5/29/2020

I am trying to deploy a Spark job in a Kubernetes cluster (running on AWS EKS). I deploy a pod that executes spark-submit in client mode. The pod becomes the driver pod and then begins to launch executor pods. The executor pods try to connect to the driver but fail, causing the executors to crash. Here is the error message from the executor log:

Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: data-loom-stats/10.135.131.239:9902
Caused by: java.net.ConnectException: Connection refused

The driver pod is exposed through a headless Kubernetes service (per the recommendations in the Spark docs: https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode-networking). The service exposes the driver under the DNS name data-loom-stats. Based on the error message, DNS resolution appears to be working, since the name is correctly translated to the pod IP address 10.135.131.239. To see what is happening on the driver end, I opened a shell in the running driver container and used netstat to list the listening ports:
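For context, a headless service of the kind the Spark docs describe looks roughly like the sketch below. The service name and port match the post; the pod label selector is an assumption, since the actual manifest isn't shown:

```yaml
# Hypothetical manifest based on details in the post; the selector label is assumed.
apiVersion: v1
kind: Service
metadata:
  name: data-loom-stats       # DNS name the executors resolve
spec:
  clusterIP: None             # headless: DNS returns the pod IP directly
  selector:
    app: data-loom-stats      # assumed label on the driver pod
  ports:
    - name: driver-rpc
      port: 9902              # the driver RPC port from the error message
      targetPort: 9902
```

Since the service is headless, the executors connect straight to the pod IP (10.135.131.239), which matches what the error message shows.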

[root@data-loom-stats-7496b69994-9t8zs work-dir]# netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address      Foreign Address  State   PID/Program name
tcp    0      0     0.0.0.0:4040       0.0.0.0:*        LISTEN  673/java
tcp    0      0     127.0.0.1:40077    0.0.0.0:*        LISTEN  673/java
tcp    0      0     127.0.0.1:9902     0.0.0.0:*        LISTEN  673/java
tcp    0      0     0.0.0.0:41267      0.0.0.0:*        LISTEN  673/java

As you can see, port 9902 is bound to the loopback address, while port 4040 (the Spark UI) is bound to 0.0.0.0. Since the executor pods are not stable, I did some testing from another pod that is. I was able to curl port 4040:

/merida/src # curl -v http://10.135.131.239:4040
* Trying 10.135.131.239:4040...
* TCP_NODELAY set
* Connected to 10.135.131.239 (10.135.131.239) port 4040 (#0)
> GET / HTTP/1.1
> Host: 10.135.131.239:4040
> User-Agent: curl/7.67.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 302 Found
< Date: Fri, 29 May 2020 22:50:46 GMT
< Location: http://10.135.131.239:4040/jobs/
< Content-Length: 0
< Server: Jetty(9.3.z-SNAPSHOT)
<
* Connection #0 to host 10.135.131.239 left intact

But trying to connect to port 9902 gives the same connection refused error seen in the executor log:

/merida/src # curl -v http://10.135.131.239:9902
* Trying 10.135.131.239:9902...
* TCP_NODELAY set
* connect to 10.135.131.239 port 9902 failed: Connection refused
* Failed to connect to 10.135.131.239 port 9902: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 10.135.131.239 port 9902: Connection refused

So it appears that the driver's address/port binding needs to be fixed. Is this conclusion correct? If so, is this something I can fix in the Kubernetes manifest, or is it caused by something in the Spark configuration?
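For reference, the Spark properties that govern this split between what the driver advertises and what it binds to are spark.driver.host, spark.driver.port, and spark.driver.bindAddress. The fragment below is a sketch of how they might be passed, not a verified fix; the values are taken from the post and the rest of the command is elided:

```shell
# Sketch only. spark.driver.bindAddress controls the interface the driver's
# listening sockets bind to, while spark.driver.host is the address executors
# are told to connect back to (here, the headless service name).
spark-submit \
  --deploy-mode client \
  --conf spark.driver.host=data-loom-stats \
  --conf spark.driver.port=9902 \
  --conf spark.driver.bindAddress=0.0.0.0 \
  ...
```

If bindAddress is left unset, Spark binds to the value of spark.driver.host, which could explain a loopback binding depending on how that name resolves inside the pod.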

I can supply more information to help identify the root cause.

-- Conrad Mukai
apache-spark
kubernetes

0 Answers