Linkerd control plane pods don't come up on EKS

10/21/2019

I have a vanilla EKS cluster at Kubernetes 1.14, deployed with Terraform, with RBAC enabled and nothing else installed into the cluster. I just executed linkerd install | kubectl apply -f -.

After that completed, I waited about 4 minutes for things to stabilize. Running kubectl get pods -n linkerd shows the following:

linkerd-destination-8466bdc8cc-5mt5f      2/2     Running   0          4m20s
linkerd-grafana-7b9b6b9bbf-k5vc2          1/2     Running   0          4m19s
linkerd-identity-6f78cd5596-rhw72         2/2     Running   0          4m21s
linkerd-prometheus-64df8d5b5c-8fz2l       2/2     Running   0          4m19s
linkerd-proxy-injector-6775949867-m7vdn   1/2     Running   0          4m19s
linkerd-sp-validator-698479bcc8-xsxnk     1/2     Running   0          4m19s
linkerd-tap-64b854cdb5-45c2h              2/2     Running   0          4m18s
linkerd-web-bdff9b64d-kcfss               2/2     Running   0          4m20s

For some reason linkerd-proxy-injector, linkerd-sp-validator, linkerd-controller, and linkerd-grafana are not fully started.

Any ideas as to what I should check? The linkerd check command hangs.
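
To narrow things down, the per-container state can be inspected directly. A rough sketch, using a pod name from the listing above and assuming the sidecar container is named linkerd-proxy, as in the log lines below:

kubectl -n linkerd describe pod linkerd-proxy-injector-6775949867-m7vdn
kubectl -n linkerd logs linkerd-proxy-injector-6775949867-m7vdn -c linkerd-proxy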

The logs for the linkerd-controller show:

linkerd-controller-68d7f67bc4-kmwfw linkerd-proxy ERR! [   335.058670s] admin={bg=identity} linkerd2_proxy::app::identity Failed to certify identity: grpc-status: Unknown, grpc-message: "the request could not be dispatched in a timely fashion"

and

linkerd-proxy ERR! [   350.060965s] admin={bg=identity} linkerd2_proxy::app::identity Failed to certify identity: grpc-status: Unknown, grpc-message: "the request could not be dispatched in a timely fashion"
time="2019-10-18T21:57:49Z" level=info msg="starting admin server on :9996"

Deleting the pods and restarting the deployments results in different components becoming ready, but the entire control plane never becomes fully ready.
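
For reference, bouncing the control plane amounted to something like this (the Deployments recreate the pods on their own):

kubectl -n linkerd delete pods --all
kubectl -n linkerd get pods -w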

-- cpretzer
kubernetes
linkerd

1 Answer

10/21/2019

A Linkerd community member answered with:

Which VPC CNI version do you have installed? I ask because of:

- https://github.com/aws/amazon-vpc-cni-k8s/issues/641
- https://github.com/mogren/amazon-vpc-cni-k8s/commit/7b2f7024f19d041396f9c05996b70d057f96da11
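
One way to check, assuming the default aws-node DaemonSet in kube-system, is to read the image tag off the DaemonSet; the tag on the amazon-k8s-cni image is the VPC CNI version:

kubectl -n kube-system describe daemonset aws-node | grep Image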

And after testing, this was the solution:

Sure enough, downgrading the AWS VPC CNI to v1.5.3 fixed everything in my cluster.

Not sure why, but it does. It seems that admission controllers are not working with v1.5.4.

So the solution is to use AWS VPC CNI v1.5.3 until the root cause in AWS VPC CNI v1.5.4 is determined.
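
A sketch of the downgrade; the manifest URL below is an assumption based on the versioned config files published in the amazon-vpc-cni-k8s repo, so double-check it against the repo for your cluster before applying:

kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.5.3/config/v1.5/aws-k8s-cni.yaml
kubectl -n kube-system rollout status daemonset aws-node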

-- cpretzer
Source: StackOverflow