stackdriver-metadata-agent-cluster-level fails repeatedly with [2a00:1450:400c:c09::5f]:443: i/o timeout

4/2/2020

I have a GKE cluster running 1.14.10-gke.27 in the europe-west1-d zone.

Over the last couple of days I have seen a lot of restarts of the stackdriver-metadata-agent-cluster-level pod in the kube-system namespace, with these errors:

I0402 16:39:12.688053       1 main.go:142] All resources are being watched, agent has started successfully
I0402 16:39:12.688108       1 main.go:145] No statusz port provided; not starting a server
I0402 16:39:29.383562       1 retry.go:80] call failed with err=rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp [2a00:1450:400c:c09::5f]:443: i/o timeout", retrying.
I0402 16:39:29.383667       1 retry.go:80] call failed with err=rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp [2a00:1450:400c:c09::5f]:443: i/o timeout", retrying.
I0402 16:39:30.483072       1 retry.go:80] call failed with err=rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp [2a00:1450:400c:c09::5f]:443: i/o timeout", retrying.
I0402 16:39:30.783091       1 retry.go:80] call failed with err=rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp [2a00:1450:400c:c09::5f]:443: i/o timeout", retrying.
I0402 16:40:09.186357       1 binarylog.go:265] rpc: flushed binary log to ""
I0402 16:41:29.383025       1 binarylog.go:265] rpc: flushed binary log to ""
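
To summarize log lines like the ones above, here is a minimal sketch that counts the dial timeouts per target and reports whether each target is an IPv4 or IPv6 address. The function name `summarize_timeouts` and the regular expression are illustrative assumptions based only on the log format shown here, not part of the agent or of any Google tooling.

```python
import ipaddress
import re
from collections import Counter

# Matches the dial target in lines such as:
#   ... dial tcp [2a00:1450:400c:c09::5f]:443: i/o timeout ...
# Group 1 is a bracketed IPv6 host, group 2 a bare IPv4 host, group 3 the port.
DIAL_RE = re.compile(
    r"dial tcp (?:\[([0-9a-fA-F:]+)\]|([0-9.]+)):(\d+): i/o timeout"
)


def summarize_timeouts(lines):
    """Count i/o timeouts per (address, port, IP version)."""
    counts = Counter()
    for line in lines:
        match = DIAL_RE.search(line)
        if match is None:
            continue  # not a dial timeout line
        host = match.group(1) or match.group(2)
        addr = ipaddress.ip_address(host)
        counts[(str(addr), int(match.group(3)), addr.version)] += 1
    return counts


# Two lines shortened from the log above, used only as sample input.
SAMPLE = [
    'I0402 16:39:29.383562       1 retry.go:80] call failed with err=rpc error: '
    'code = Unavailable desc = "transport: Error while dialing dial tcp '
    '[2a00:1450:400c:c09::5f]:443: i/o timeout", retrying.',
    'I0402 16:40:09.186357       1 binarylog.go:265] rpc: flushed binary log to ""',
]

for (address, port, version), count in summarize_timeouts(SAMPLE).items():
    print(f"IPv{version} {address}:{port} timed out {count} time(s)")
```

In this log every failing dial goes to the same IPv6 address on port 443, so one thing worth checking is whether the nodes have working IPv6 connectivity at all.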

logs screenshot

Is this a Google network issue?

-- Sergey Gals
gcloud
google-kubernetes-engine
kubernetes

1 Answer

4/3/2020

I'm posting this as an answer because it contains quite a lot of code, which would be unreadable in comments. Once we manage to figure out the solution I will edit it.

Could you run these Stackdriver Logging queries and post the output in your question as a code sample (select the text and press Ctrl+K)?

resource.type="k8s_container"
resource.labels.project_id="<project-id>"
resource.labels.location="<location e.g. us-central1-c>"
resource.labels.cluster_name="<cluster-name>"
resource.labels.namespace_name="kube-system"
labels.k8s-pod/app="stackdriver-metadata-agent"
labels.k8s-pod/cluster-level="true"
"oom"

resource.type="k8s_container"
resource.labels.project_id="<project-id>"
resource.labels.location="<location e.g. us-central1-c>"
resource.labels.cluster_name="<cluster-name>"
resource.labels.namespace_name="kube-system"
labels.k8s-pod/app="stackdriver-metadata-agent"
labels.k8s-pod/cluster-level="true"
severity>=WARNING
sourceLocation.file!="reflector.go"

Please don't post the output as a screenshot, since text in an image can't be searched.
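
The two queries share their first seven lines and differ only in the trailing terms, so they can be generated from one template. This is a minimal sketch; `build_filter`, `OOM_TERMS` and `WARNING_TERMS` are hypothetical names, and `my-project` and `my-cluster` are placeholder values to replace with your own.

```python
# Trailing terms of the first query: search for OOM messages.
OOM_TERMS = ['"oom"']

# Trailing terms of the second query: warnings and above, excluding reflector.go.
WARNING_TERMS = ['severity>=WARNING', 'sourceLocation.file!="reflector.go"']


def build_filter(project_id, location, cluster_name, extra_terms):
    """Return a log filter for the cluster-level metadata agent, one term per line."""
    base = [
        'resource.type="k8s_container"',
        f'resource.labels.project_id="{project_id}"',
        f'resource.labels.location="{location}"',
        f'resource.labels.cluster_name="{cluster_name}"',
        'resource.labels.namespace_name="kube-system"',
        'labels.k8s-pod/app="stackdriver-metadata-agent"',
        'labels.k8s-pod/cluster-level="true"',
    ]
    return "\n".join(base + list(extra_terms))


# Placeholder project and cluster names; the zone is the one from the question.
for terms in (OOM_TERMS, WARNING_TERMS):
    print(build_filter("my-project", "europe-west1-d", "my-cluster", terms))
    print()
```

Each printed filter can be pasted into the logs viewer as is.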

-- mario
Source: StackOverflow