How to fix error 'Operation [get] for kind [pod] with name [spark-wordcount-driver] in namespace [non-default-namespace] failed'

12/24/2018

I am trying to run a simple wordcount application in Spark on Kubernetes, and I am getting the following issue:

Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [spark-wordcount-1545506479587-driver]  in namespace: [non-default-namespace]  failed.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:71)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:228)
    at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:184)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$anonfun$1.apply(ExecutorPodsAllocator.scala:57)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$anonfun$1.apply(ExecutorPodsAllocator.scala:55)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:55)
    at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89)
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$createTaskScheduler(SparkContext.scala:2788)
    ... 20 more
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)

I have followed all the steps mentioned in the RBAC setup. The only thing I could not do was create the clusterrolebinding spark-role, since I don't have access to the default namespace. Instead I created a rolebinding:

kubectl create rolebinding spark-role --clusterrole=edit --serviceaccount=non-default-namespace:spark --namespace=non-default-namespace
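As a sanity check, the permissions granted by a rolebinding like this can be verified with `kubectl auth can-i` before submitting. This is a sketch assuming the service account is named `spark` in `non-default-namespace`, as in the rolebinding command above; note that whatever name is used here must also match the `spark.kubernetes.authenticate.driver.serviceAccountName` value passed to spark-submit.

```shell
# Create the service account if it does not exist yet (assumed name: spark)
kubectl create serviceaccount spark --namespace=non-default-namespace

# Verify that the service account is allowed to get pods in its namespace
kubectl auth can-i get pods \
  --as=system:serviceaccount:non-default-namespace:spark \
  --namespace=non-default-namespace
```

If the second command prints `no`, the driver will fail with a permissions error when it tries to look up its own pod.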

I am using the following spark-submit command:

spark-submit \
 --verbose \
 --master k8s://<cluster-ip>:<port> \
 --deploy-mode cluster --supervise \
 --name spark-wordcount \
 --class WordCount \
 --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-test \
 --conf spark.kubernetes.driver.limit.cores=1 \
 --conf spark.kubernetes.executor.limit.cores=1 \
 --conf spark.executor.instances=1 \
 --conf spark.kubernetes.container.image=<image> \
 --conf spark.kubernetes.namespace=non-default-namespace \
 --conf spark.kubernetes.driver.pod.name=spark-wordcount-driver \
 local:///opt/spark/work-dir/spark-k8s-1.0-SNAPSHOT.jar

Update: I was able to fix the first SocketTimeoutException issue. I did not have a network policy defined, so the driver and executors were not able to talk to each other, which is why it was timing out. I changed the network policy from default-deny-all to allow-all for ingress and egress, and the timeout exception went away. However, I am still getting the 'Operation [get] for kind [pod]' error, with the following exception:
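For reference, an allow-all policy along the lines of what resolved the timeout might look like the following. This is a sketch: the policy name and namespace are assumptions, and in a real cluster you would normally scope it more tightly than allow-all.

```yaml
# Hypothetical allow-all policy; adjust the namespace to match your cluster.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all
  namespace: non-default-namespace
spec:
  podSelector: {}      # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - {}               # allow all inbound traffic
  egress:
    - {}               # allow all outbound traffic
```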

Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again

Any suggestion or help will be appreciated.

-- hp2326
apache-spark
docker
kube-dns
kubernetes

1 Answer

5/19/2019

This is because your DNS is unable to resolve kubernetes.default.svc, which in turn could be an issue with your networking and iptables.

Run this on the specific node:

kubectl run -it --rm --restart=Never --image=infoblox/dnstools:latest dnstools   

and check:

nslookup  kubernetes.default.svc
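If the lookup fails, it can also help to confirm that the cluster DNS pods themselves are healthy. This is a sketch assuming CoreDNS or kube-dns deployed in kube-system with the default `k8s-app=kube-dns` label:

```shell
# Check that the DNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Inspect their logs for resolution errors
kubectl logs -n kube-system -l k8s-app=kube-dns
```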

Edit: In my case I had this issue because flannel was using a different network (10.244.x.x) while my Kubernetes cluster was configured with a 172.x.x.x network.

I had blindly applied the default manifest from https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml, in which the pod network is configured to 10.244.x.x. To fix it, I downloaded the file, changed it to the correct pod network, and applied it.
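The fix described above can be sketched as follows. The 172.16.0.0/16 value is only an example; use the pod CIDR your cluster was actually initialized with (e.g. the `--pod-network-cidr` passed to `kubeadm init`):

```shell
# Download the default flannel manifest
curl -LO https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

# Edit the net-conf.json section of the ConfigMap so "Network" matches your
# cluster's pod CIDR, e.g. change
#   "Network": "10.244.0.0/16"
# to
#   "Network": "172.16.0.0/16"   (example value)

# Apply the corrected manifest
kubectl apply -f kube-flannel.yml
```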

-- user303730
Source: StackOverflow