Creating an Amazon EKS cluster with jenkins-x and cluster-autoscaler: ingress fails on an even number of nodes

12/3/2018

I am creating an Amazon EKS cluster using jenkins-x with:

jx create cluster eks -n demo --node-type=t3.xlarge --nodes=1 --nodes-max=5 --nodes-min=1 --skip-installation

After that, I add the cluster-autoscaler IAM policy for auto-discovery and the required tags on the autoscaling group and the created instance, according to this guide.
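
For reference, the tagging step looks roughly like this with the AWS CLI; the ASG name is a placeholder and the exact tag keys for auto-discovery depend on the cluster-autoscaler version, so this is a sketch rather than the guide's literal commands:

# tag the autoscaling group so the autoscaler can auto-discover it
# (<asg-name> and the tag keys are assumptions; check the guide for your version)
aws autoscaling create-or-update-tags --tags \
  ResourceId=<asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true \
  ResourceId=<asg-name>,ResourceType=auto-scaling-group,Key=kubernetes.io/cluster/demo,Value=owned,PropagateAtLaunch=true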

I add the RBAC roles for tiller and the autoscaler with this file (kubectl create -f rbac-config.yaml):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: tiller
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tiller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: tiller
    namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: autoscaler
    namespace: kube-system

I install tiller:

helm init --service-account tiller

and install the cluster autoscaler:

helm install stable/cluster-autoscaler -f cluster-autoscaler-values.yaml --name cluster-autoscaler --namespace kube-system
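
For reference, a minimal cluster-autoscaler-values.yaml for the stable/cluster-autoscaler chart looks roughly like this; the exact keys differ between chart versions, so this is a sketch rather than my literal file:

# use auto-discovery against the cluster created above
autoDiscovery:
  clusterName: demo
# placeholder; use the region the cluster actually runs in
awsRegion: us-east-1
# reuse the service account from rbac-config.yaml instead of letting
# the chart create its own
rbac:
  create: false
  serviceAccountName: autoscaler
# CA bundle path on Amazon Linux worker nodes
sslCertPath: /etc/ssl/certs/ca-bundle.crt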

Then I install the jenkins-x system:

jx install --provider=eks --domain=mydomain.com --default-environment-prefix=demo --skip-setup-tiller

I just accept all the defaults for the prompts (nginx-ingress is created for me).

Then I create a default spring-boot-rest-prometheus app:

jx create quickstart

again, accepting all the defaults. This works fine: the application is picked up and compiled by Jenkins, which I can see at:

http://jenkins.jx.mydomain.com

and I can reach the app through:

http://spring-boot-rest-prometheus.jx-staging.mydomain.com

Then I run a test to see if the autoscaler is working correctly, so I open charts/spring-boot-rest-prometheus/values.yaml, change replicaCount: 1 to replicaCount: 8, then commit and push. This kicks off the Jenkins pipeline and spins up a new node, because the autoscaler sees that there are not enough CPU resources on the first node.
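
To watch the scale-up while the pipeline runs, something like the following works; the autoscaler deployment name is an assumption, so check what the chart actually created:

# watch new nodes joining the cluster
kubectl get nodes -w

# follow the autoscaler's scaling decisions
# (deployment name is an assumption; see kubectl -n kube-system get deploy)
kubectl -n kube-system logs -f deploy/cluster-autoscaler-aws-cluster-autoscaler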

After the second node has come up, I cannot reach Jenkins and the app anymore via the domain names. So for some reason, my ingress is not working anymore.

I have played around with this a lot, manually changing the desired number of nodes directly in EC2: when there is an even number of nodes the domains are not reachable, and when there is an odd number of nodes the domains are reachable.

I do not think this is related to the autoscaler, because the scale-up and scale-down are working fine, and the problem is also there if I manually change the desired number of nodes.

What causes the ingress to fail for an even number of nodes? How can I investigate this issue further?

Logs and descriptors for all ingress parts are posted here.
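
A minimal set of checks of the ingress path from the Kubernetes side looks roughly like this; the kube-system namespace and the nginx-ingress label are assumptions, since jx chooses them at install time:

# the ingress controller's LoadBalancer Service and the ELB it created
kubectl get svc --all-namespaces -o wide | grep LoadBalancer

# ingress controller pods and their logs
# (namespace and label selector are assumptions; adjust to what jx installed)
kubectl -n kube-system get pods -l app=nginx-ingress
kubectl -n kube-system logs -l app=nginx-ingress --tail=100

# the Ingress resources jx created for jenkins and the app
kubectl get ingress --all-namespaces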

-- Martijn Burger
amazon-eks
amazon-web-services
jenkins-x
kubernetes

2 Answers

12/14/2018

FWIW, I seem to have run into this issue:

https://github.com/kubernetes/kubernetes/issues/64148

Still checking with AWS Support if that's the case for EKS also, but it seems very plausible.

-- Martijn Burger
Source: StackOverflow

12/3/2018

You can debug this by looking at the AWS ASG (AutoScaling Group) and the load balancer (ELB) target instances.

You can see that the instances are being added to the ASG:

(screenshot: instances added to the ASG)

Then you can see in your load balancer that the instances are in service:

(screenshot: ELB instances in service)

It could be that some of the instances are not in service when there is an even number of them. Do they happen to be in a different availability zone? Are the 'odd'-numbered ones being removed from the ELB? Is traffic not being forwarded to them?
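
Something along these lines with the AWS CLI shows the same information from the command line; the ASG and ELB names are placeholders you can read off the console:

# instances registered in the autoscaling group and their state
aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names <asg-name> \
  --query 'AutoScalingGroups[].Instances[].[InstanceId,AvailabilityZone,LifecycleState]'

# health of the instances behind the (classic) ELB that fronts nginx-ingress
aws elb describe-instance-health --load-balancer-name <elb-name>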

-- Rico
Source: StackOverflow