Network timeout initializing Spark context in Kubernetes (standalone driver)

4/29/2020

I'm getting this error when I try to run a Spark program from a driver pod (running standalone in client mode, not via spark-submit):

20/04/29 02:14:46 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://sparkrunner-0.sparkrunner:4040
20/04/29 02:14:46 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
20/04/29 02:14:46 DEBUG Config: Trying to configure client from Kubernetes config...
20/04/29 02:14:46 DEBUG Config: Did not find Kubernetes config at: [/root/.kube/config]. Ignoring.
20/04/29 02:14:46 DEBUG Config: Trying to configure client from service account...
20/04/29 02:14:46 DEBUG Config: Found service account host and port: 10.96.0.1:443
20/04/29 02:14:46 DEBUG Config: Found service account ca cert at: [/var/run/secrets/kubernetes.io/serviceaccount/ca.crt].
20/04/29 02:14:46 DEBUG Config: Found service account token at: [/var/run/secrets/kubernetes.io/serviceaccount/token].
20/04/29 02:14:46 DEBUG Config: Trying to configure client namespace from Kubernetes service account namespace path...
20/04/29 02:14:46 DEBUG Config: Found service account namespace at: [/var/run/secrets/kubernetes.io/serviceaccount/namespace].
20/04/29 02:14:46 DEBUG Config: Trying to configure client from Kubernetes config...
20/04/29 02:14:46 DEBUG Config: Did not find Kubernetes config at: [/root/.kube/config]. Ignoring.
20/04/29 02:14:46 DEBUG Config: Trying to configure client from service account...
20/04/29 02:14:46 DEBUG Config: Found service account host and port: 10.96.0.1:443
20/04/29 02:14:46 DEBUG Config: Found service account ca cert at: [/var/run/secrets/kubernetes.io/serviceaccount/ca.crt].
20/04/29 02:14:46 DEBUG Config: Found service account token at: [/var/run/secrets/kubernetes.io/serviceaccount/token].
20/04/29 02:14:46 DEBUG Config: Trying to configure client namespace from Kubernetes service account namespace path...
20/04/29 02:14:46 DEBUG Config: Found service account namespace at: [/var/run/secrets/kubernetes.io/serviceaccount/namespace].
20/04/29 02:14:57 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: External scheduler cannot be instantiated
        at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2934)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:548)
        at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2578)
        at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:896)
        at scala.Option.getOrElse(Option.scala:138)
        at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:887)
        at sparkrunner.sparklibs.SparkSystem$.<init>(SparkSystem.scala:22)
        at sparkrunner.sparklibs.SparkSystem$.<clinit>(SparkSystem.scala)
        at sparkrunner.actors.RecipeManager$$anonfun$receive$1.applyOrElse(RecipeManager.scala:41)
        at akka.actor.Actor.aroundReceive(Actor.scala:534)
        at akka.actor.Actor.aroundReceive$(Actor.scala:532)
        at sparkrunner.actors.RecipeManager.aroundReceive(RecipeManager.scala:20)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:573)
        at akka.actor.ActorCell.invoke(ActorCell.scala:543)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:269)
        at akka.dispatch.Mailbox.run(Mailbox.scala:230)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:242)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
        at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [sparkrunner-0]  in namespace: [default]  failed.
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:237)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:170)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$driverPod$1(ExecutorPodsAllocator.scala:59)
        at scala.Option.map(Option.scala:163)
        at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:58)
        at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:113)
        at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2928)
        ... 20 more
Caused by: java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at okhttp3.internal.platform.Platform.connectSocket(Platform.java:129)
        at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.java:247)
        at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:167)
        at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:258)
        at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:135)
        at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:114)
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:111)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
        at okhttp3.RealCall.execute(RealCall.java:93)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:411)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:372)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:337)
        at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:318)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:833)
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:226)
        ... 26 more
20/04/29 02:14:57 DEBUG AbstractLifeCycle: stopping Server@68d79eec{STARTED}[9.4.z-SNAPSHOT]
20/04/29 02:14:57 DEBUG Server: doStop Server@68d79eec{STOPPING}[9.4.z-SNAPSHOT]
20/04/29 02:14:57 DEBUG QueuedThreadPool: ran SparkUI-59-acceptor-0@2b94b939-ServerConnector@79ce3216{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/04/29 02:14:57 DEBUG AbstractHandlerContainer: Graceful shutdown Server@68d79eec{STOPPING}[9.4.z-SNAPSHOT] by 
20/04/29 02:14:57 DEBUG AbstractLifeCycle: stopping Spark@79ce3216{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/04/29 02:14:57 DEBUG AbstractLifeCycle: stopping SelectorManager@Spark@79ce3216{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/04/29 02:14:57 DEBUG AbstractLifeCycle: stopping ManagedSelector@8993a98{STARTED} id=3 keys=0 selected=0 updates=0
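For context, the SparkSession is created inside the pod roughly like this (a minimal sketch, not my exact SparkSystem.scala; the master URL, image, and service account names are illustrative of my setup):

import org.apache.spark.sql.SparkSession

object SparkSystem {
  // Client mode: this pod is the driver, so master points straight at the k8s API server.
  val spark: SparkSession = SparkSession.builder()
    .appName("sparkrunner")
    .master("k8s://https://kubernetes.default.svc:32768")
    .config("spark.kubernetes.container.image", "spark:3.0.0-preview2")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    // Spark looks this pod up by name at startup (the ExecutorPodsAllocator frame above).
    .config("spark.kubernetes.driver.pod.name", "sparkrunner-0")
    .getOrCreate()
}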

Running Spark 3.0.0-preview2 on minikube (macOS).

➜  kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-26T06:16:15Z", GoVersion:"go1.14", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:50:46Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}

I've set up the cluster as described here:

https://spark.apache.org/docs/latest/running-on-kubernetes.html

It appears the Kubernetes client is unable to communicate with the API server, and I'm trying to understand why.
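To isolate whether it's Spark or the underlying client that can't get through, the failing call can be replicated outside Spark (a sketch, assuming the fabric8 kubernetes-client 4.x that Spark 3.0 bundles; the pod and namespace names are from my setup):

import io.fabric8.kubernetes.client.DefaultKubernetesClient

object ApiCheck extends App {
  // Auto-configures from the mounted service account, exactly as the log above shows Spark doing.
  val client = new DefaultKubernetesClient()
  try {
    // The same GET that ExecutorPodsAllocator issues in the stack trace.
    val pod = client.pods().inNamespace("default").withName("sparkrunner-0").get()
    val phase = Option(pod).map(_.getStatus.getPhase).getOrElse("not found")
    println(s"Reached ${client.getMasterUrl}; driver pod phase: $phase")
  } finally {
    client.close()
  }
}

If this also times out, the problem is connectivity to the API server rather than anything Spark-specific.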

Here are the things I've checked:

  • The k8s API host/port the driver submits to is correct (taken from kubectl cluster-info)

  • DNS is working (a throwaway debug pod can ping the driver pod, and there are no DNS resolution errors in the logs)

  • The RBAC "spark" role is enabled and is being passed by the driver

  • No iptables rules or other network policies are in use on the cluster

Any ideas on what else I can try to debug the issue?

-- user7654493
apache-spark
kubernetes

1 Answer

4/29/2020

It appears the issue here has to do with the k8s API address as reported by:

kubectl cluster-info

That command reports an address which, used as the Spark master URL, becomes:

k8s://https://kubernetes.default.svc:32768

The address that actually makes the client-mode setup work is the internal one (the ClusterIP of the kubernetes service in the default namespace):

k8s://https://10.96.0.1:443

I'm not sure whether the originally reported address is a proxy or an artifact of minikube, but with the internal address things have started working again.
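In Spark terms the fix amounts to pointing the master at the in-cluster address (a sketch; the object name is hypothetical, 10.96.0.1:443 is the ClusterIP of the kubernetes service on this minikube, and the DNS name kubernetes.default.svc:443 should resolve to the same endpoint):

import org.apache.spark.sql.SparkSession

object FixedSparkSystem {
  // Point the driver at the in-cluster API server address instead of the
  // externally reported one; 443 is the in-cluster API service port.
  val spark: SparkSession = SparkSession.builder()
    .appName("sparkrunner")
    .master("k8s://https://10.96.0.1:443")
    .getOrCreate()
}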

-- user7654493
Source: StackOverflow