Kubernetes GKE Error dialing backend: EOF on random exec command

6/20/2018

On GKE we are experiencing a random error with the API. From time to time we get "Error dialing backend: EOF".

We use Jenkins on top of Kubernetes to manage our builds, and every now and then a job is killed with this error:

Executing shell script inside container [protobuf] of pod [kubernetes-bad0aa993add416e80bdc1e66d1b30fc-536045ac8bbe]
java.net.ProtocolException: Expected HTTP 101 response but was '500 Internal Server Error'
    at com.squareup.okhttp.ws.WebSocketCall.createWebSocket(WebSocketCall.java:123)
    at com.squareup.okhttp.ws.WebSocketCall.access$000(WebSocketCall.java:40)
    at com.squareup.okhttp.ws.WebSocketCall$1.onResponse(WebSocketCall.java:98)
    at com.squareup.okhttp.Call$AsyncCall.execute(Call.java:177)
    at com.squareup.okhttp.internal.NamedRunnable.run(NamedRunnable.java:33)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

This case looks a lot like: https://gitlab.com/gitlab-org/gitlab-runner/issues/3247

Many audit log entries show URLs like:

permission:  "io.k8s.core.v1.pods.exec.create"     
resource:  "core/v1/namespaces/default/pods/pubsub-6132c0bc-2542-46a2-8041-c865f238698d-4ccc0-c1nkz-lqg5x/exec/pubsub-6132c0bc-2542-46a2-8041-c865f238698d-4ccc0-c1nkz-lqg5x"     

and

permission:  "io.k8s.core.v1.pods.exec.get"     
resource:  "core/v1/namespaces/default/pods/pubsub-a5a21f14-0bd1-4338-87b1-8658c3bbc7ad-9gm4n-8nz14/exec"     

But I don't understand why this error occurs on Kubernetes...
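
I suppose the same pods/exec call can be reproduced by hand to rule out the Jenkins plugin itself; a minimal sketch, with the pod name taken from the audit log above (any running pod and any command available in its default container would do):

    # Manually exercise the same pods/exec API path the Jenkins plugin uses,
    # to see whether it also fails with "Error dialing backend: EOF".
    kubectl exec pubsub-a5a21f14-0bd1-4338-87b1-8658c3bbc7ad-9gm4n-8nz14 -- echo hello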

Update:

These errors can be cross-checked with two metrics from kube-state-metrics:

- ssh_tunnel_open_count
- ssh_tunnel_open_fail_count

In my case, the number of failed SSH tunnel opens grows once more than 200 SSH tunnels are open.
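
A minimal way to look at those counters directly, assuming they are exposed on the API server's /metrics endpoint and your credentials can read it:

    # Dump the API server metrics and keep only the SSH tunnel counters
    # (ssh_tunnel_open_count / ssh_tunnel_open_fail_count).
    kubectl get --raw /metrics | grep -E 'ssh_tunnel_open(_fail)?_count'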

For information, we ran some tests on GKE:

- switched from a zonal to a regional cluster
- switched to the new VPC-native networking (previously called alias IPs)

But this did not solve the problem.
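
For reference, a sketch of the kind of gcloud invocation that creates such a regional, VPC-native cluster (cluster name, region and node count are placeholders):

    # Recreate the cluster as regional + VPC-native (alias IPs).
    gcloud container clusters create my-cluster \
        --region europe-west1 \
        --enable-ip-alias \
        --num-nodes 1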

After disabling autoscaling on the node pool, we no longer get the error.
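
For reference, a sketch of the gcloud call that turns autoscaling off on a node pool (cluster, pool and zone names are placeholders):

    # Disable the cluster autoscaler for a single node pool.
    gcloud container clusters update my-cluster \
        --no-enable-autoscaling \
        --node-pool default-pool \
        --zone europe-west1-b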

-- user3669002
eof
google-kubernetes-engine
kubernetes
networking

0 Answers