I am trying to run a simple wordcount application in Spark on Kubernetes, and I am getting the following error:
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [Pod] with name: [spark-wordcount-1545506479587-driver] in namespace: [non-default-namespace] failed.
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62)
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:71)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:228)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:184)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$anonfun$1.apply(ExecutorPodsAllocator.scala:57)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$anonfun$1.apply(ExecutorPodsAllocator.scala:55)
at scala.Option.map(Option.scala:146)
at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:55)
at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89)
at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$createTaskScheduler(SparkContext.scala:2788)
... 20 more
Caused by: java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
I have followed all the steps in the RBAC setup. The only thing I could not do was create the clusterrolebinding spark-role, since I don't have access to the default namespace. Instead, I created a rolebinding:
kubectl create rolebinding spark-role --clusterrole=edit --serviceaccount=non-default-namespace:spark --namespace=non-default-namespace
I am using the following spark-submit command:
spark-submit \
--verbose \
--master k8s://<cluster-ip>:<port> \
--deploy-mode cluster --supervise \
--name spark-wordcount \
--class WordCount \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-test \
--conf spark.kubernetes.driver.limit.cores=1 \
--conf spark.kubernetes.executor.limit.cores=1 \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.container.image=<image> \
--conf spark.kubernetes.namespace=non-default-namespace \
--conf spark.kubernetes.driver.pod.name=spark-wordcount-driver \
local:///opt/spark/work-dir/spark-k8s-1.0-SNAPSHOT.jar
Update: I was able to fix the first SocketTimeoutException. I did not have the network policy defined, so the driver and executors were not able to talk to each other, which is why the connection was timing out. I changed the network policy from default-deny-all to allow-all for ingress and egress, and the timeout exception went away. However, I am still getting the "Operation: [get] for kind: [Pod] ... failed" error, now with the following exception:
Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again
Any suggestion or help will be appreciated.
This is because your DNS is unable to resolve kubernetes.default.svc, which in turn could be an issue with your networking and iptables.
Run this on a specific node:
kubectl run -it --rm --restart=Never --image=infoblox/dnstools:latest dnstools
and then check, from inside that pod:
nslookup kubernetes.default.svc
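If the dnstools image cannot be pulled, roughly the same check can be made from any pod that has a Python interpreter. The following is only a minimal sketch and not part of the original steps; the helper name can_resolve is made up for illustration, and it only reports whether the name resolves, not why it fails.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if the resolver configured for this pod/host can resolve hostname."""
    try:
        # getaddrinfo uses the system resolver (e.g. /etc/resolv.conf inside a pod)
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Raised for resolution failures such as "Try again" or "Name or service not known"
        return False


# Example (run inside a pod in the affected namespace):
#   can_resolve("kubernetes.default.svc")
```

A False result for kubernetes.default.svc from inside a pod points at cluster DNS or pod networking rather than at Spark itself.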
Edit: I had this issue because, in my case, flannel was using a different network (10.244.x.x) than the one my Kubernetes cluster was configured with (172.x.x.x).
I had blindly applied the default manifest from https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml, in which the pod network is set to 10.244.x.x. To fix it, I downloaded the file, changed it to the correct pod network, and applied it.
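This kind of mismatch can be sanity-checked before applying the manifest. Below is a minimal sketch using Python's standard ipaddress module. The function name and the 172.16.0.0/16 value are placeholders assumed for illustration (the real cluster range above is only given as 172.x.x.x), while 10.244.0.0/16 is the network in the default kube-flannel.yml.

```python
import ipaddress


def cidr_mismatch(cni_network: str, cluster_pod_cidr: str) -> bool:
    """Return True when the CNI network is NOT contained in the cluster's pod CIDR."""
    cni = ipaddress.ip_network(cni_network)
    cluster = ipaddress.ip_network(cluster_pod_cidr)
    # subnet_of is True when every address in cni also lies in cluster
    return not cni.subnet_of(cluster)


# Flannel's default network versus a hypothetical cluster pod CIDR (assumed value)
print(cidr_mismatch("10.244.0.0/16", "172.16.0.0/16"))  # True: the networks disagree
```

If the function returns True for your values, the Network entry in the flannel manifest needs to be changed to match the pod CIDR the cluster was initialized with.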