I'm getting the error below when I run a Spark program from a driver pod (running standalone in client mode, not via spark-submit):
20/04/29 02:14:46 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://sparkrunner-0.sparkrunner:4040
20/04/29 02:14:46 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
20/04/29 02:14:46 DEBUG Config: Trying to configure client from Kubernetes config...
20/04/29 02:14:46 DEBUG Config: Did not find Kubernetes config at: [/root/.kube/config]. Ignoring.
20/04/29 02:14:46 DEBUG Config: Trying to configure client from service account...
20/04/29 02:14:46 DEBUG Config: Found service account host and port: 10.96.0.1:443
20/04/29 02:14:46 DEBUG Config: Found service account ca cert at: [/var/run/secrets/kubernetes.io/serviceaccount/ca.crt].
20/04/29 02:14:46 DEBUG Config: Found service account token at: [/var/run/secrets/kubernetes.io/serviceaccount/token].
20/04/29 02:14:46 DEBUG Config: Trying to configure client namespace from Kubernetes service account namespace path...
20/04/29 02:14:46 DEBUG Config: Found service account namespace at: [/var/run/secrets/kubernetes.io/serviceaccount/namespace].
20/04/29 02:14:46 DEBUG Config: Trying to configure client from Kubernetes config...
20/04/29 02:14:46 DEBUG Config: Did not find Kubernetes config at: [/root/.kube/config]. Ignoring.
20/04/29 02:14:46 DEBUG Config: Trying to configure client from service account...
20/04/29 02:14:46 DEBUG Config: Found service account host and port: 10.96.0.1:443
20/04/29 02:14:46 DEBUG Config: Found service account ca cert at: [/var/run/secrets/kubernetes.io/serviceaccount/ca.crt].
20/04/29 02:14:46 DEBUG Config: Found service account token at: [/var/run/secrets/kubernetes.io/serviceaccount/token].
20/04/29 02:14:46 DEBUG Config: Trying to configure client namespace from Kubernetes service account namespace path...
20/04/29 02:14:46 DEBUG Config: Found service account namespace at: [/var/run/secrets/kubernetes.io/serviceaccount/namespace].
20/04/29 02:14:57 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: External scheduler cannot be instantiated
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$createTaskScheduler(SparkContext.scala:2934)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:548)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2578)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:896)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:887)
at sparkrunner.sparklibs.SparkSystem$.<init>(SparkSystem.scala:22)
at sparkrunner.sparklibs.SparkSystem$.<clinit>(SparkSystem.scala)
at sparkrunner.actors.RecipeManager$anonfun$receive$1.applyOrElse(RecipeManager.scala:41)
at akka.actor.Actor.aroundReceive(Actor.scala:534)
at akka.actor.Actor.aroundReceive$(Actor.scala:532)
at sparkrunner.actors.RecipeManager.aroundReceive(RecipeManager.scala:20)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:573)
at akka.actor.ActorCell.invoke(ActorCell.scala:543)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:269)
at akka.dispatch.Mailbox.run(Mailbox.scala:230)
at akka.dispatch.Mailbox.exec(Mailbox.scala:242)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [Pod] with name: [sparkrunner-0] in namespace: [default] failed.
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:237)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:170)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$driverPod$1(ExecutorPodsAllocator.scala:59)
at scala.Option.map(Option.scala:163)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:58)
at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:113)
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$createTaskScheduler(SparkContext.scala:2928)
... 20 more
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at okhttp3.internal.platform.Platform.connectSocket(Platform.java:129)
at okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.java:247)
at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:167)
at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:258)
at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:135)
at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:114)
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:111)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
at okhttp3.RealCall.execute(RealCall.java:93)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:411)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:372)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:337)
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:318)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:833)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:226)
... 26 more
20/04/29 02:14:57 DEBUG AbstractLifeCycle: stopping Server@68d79eec{STARTED}[9.4.z-SNAPSHOT]
20/04/29 02:14:57 DEBUG Server: doStop Server@68d79eec{STOPPING}[9.4.z-SNAPSHOT]
20/04/29 02:14:57 DEBUG QueuedThreadPool: ran SparkUI-59-acceptor-0@2b94b939-ServerConnector@79ce3216{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/04/29 02:14:57 DEBUG AbstractHandlerContainer: Graceful shutdown Server@68d79eec{STOPPING}[9.4.z-SNAPSHOT] by
20/04/29 02:14:57 DEBUG AbstractLifeCycle: stopping Spark@79ce3216{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/04/29 02:14:57 DEBUG AbstractLifeCycle: stopping SelectorManager@Spark@79ce3216{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/04/29 02:14:57 DEBUG AbstractLifeCycle: stopping ManagedSelector@8993a98{STARTED} id=3 keys=0 selected=0 updates=0
Running Spark 3.0.0-preview2 on minikube (macOS).
➜ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-26T06:16:15Z", GoVersion:"go1.14", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:50:46Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
I set up the cluster as described here:
https://spark.apache.org/docs/latest/running-on-kubernetes.html
It appears the Kubernetes client is unable to communicate with the API server, and I'm trying to understand why.
Here are the things I've checked:
- The k8s host/port the driver submits the job to is correct (from kubectl cluster-info)
- DNS is working (a random debug pod can ping the driver pod, and there are no DNS resolution errors in the logs)
- The RBAC "spark" role is enabled and is being passed by the driver
- No iptables rules or other network policies are in use on the cluster
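One more check that can be run from inside the driver pod itself: hit the API server directly with the mounted service-account credentials. This is only a sketch; it assumes curl is available in the image and uses the standard service-account mount path, and it skips itself when not running inside a pod.

```shell
#!/bin/sh
# Sketch: probe the API server from inside the driver pod using the
# service-account credentials that Kubernetes mounts into every pod.
SA=/var/run/secrets/kubernetes.io/serviceaccount
API="https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}"
if [ -f "$SA/token" ]; then
  # Ask for the driver pod itself, the same GET the fabric8 client is making.
  STATUS=$(curl -s -o /dev/null -w '%{http_code}' --cacert "$SA/ca.crt" \
    -H "Authorization: Bearer $(cat "$SA/token")" \
    "$API/api/v1/namespaces/default/pods/sparkrunner-0")
  echo "API server responded with HTTP $STATUS"
else
  # No token mounted, so we are not inside a pod; nothing to check.
  STATUS=skipped
  echo "no service-account token mounted; skipping check"
fi
```

A timeout here (rather than a 200 or 403) would confirm it's a connectivity problem, not an RBAC one.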
Any ideas on what else I can try to debug the issue?
It turns out the issue has to do with the Kubernetes API address as reported by:
kubectl cluster-info
That command yields this address:
k8s://https://kubernetes.default.svc:32768
The address that actually makes the client-mode cluster work is the internal one:
k8s://https://10.96.0.1:443
I'm not sure whether the originally returned address is a proxy or an artifact of minikube, but things have started working again.
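For anyone hitting the same thing: inside any pod, Kubernetes injects the in-cluster API endpoint as environment variables, so the master URL can be derived instead of hard-coded. A minimal sketch, where the fallback values are just the ones from my cluster, for illustration:

```shell
#!/bin/sh
# Build the k8s:// master URL from the env vars the kubelet injects into
# every pod. The fallbacks are the in-cluster values from this question.
HOST="${KUBERNETES_SERVICE_HOST:-10.96.0.1}"
PORT="${KUBERNETES_SERVICE_PORT:-443}"
MASTER="k8s://https://${HOST}:${PORT}"
echo "$MASTER"
```

The resulting value is what I now set as spark.master when building the SparkSession in the driver.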