How to troubleshoot why the Endpoints in my service don't get updated?

6/30/2018

I have a Kubernetes cluster running on the Google Kubernetes Engine.

I have a deployment that I manually scaled up (by editing the HPA object) from 100 replicas to 300 replicas to do some load testing. When I was load testing the deployment by sending HTTP requests to the service, it seemed that not all pods were getting an equal amount of traffic; only around 100 pods showed that they were processing traffic (based on their CPU load and our custom metrics). So my suspicion was that the service was not load balancing the requests among all the pods equally.
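
(As a rough way to verify how evenly the traffic was spread, comparing per-pod CPU usage with kubectl top is a possible sketch, assuming the metrics pipeline is available in the cluster; the namespace and label below match the resources shown further down:)

$ kubectl top pods -n production -l app=my-app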

If I checked the deployment, I could see that all 300 replicas were ready.

$ k get deploy my-app --show-labels
NAME                DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE       LABELS
my-app              300       300       300          300         21d       app=my-app

On the other hand, when I checked the service, I saw this:

$ k describe svc my-app
Name:              my-app
Namespace:         production
Labels:            app=my-app
Selector:          app=my-app
Type:              ClusterIP
IP:                10.40.9.201
Port:              http  80/TCP
TargetPort:        http/TCP
Endpoints:         10.36.0.5:80,10.36.1.5:80,10.36.100.5:80 + 114 more...
Port:              https  443/TCP
TargetPort:        https/TCP
Endpoints:         10.36.0.5:443,10.36.1.5:443,10.36.100.5:443 + 114 more...
Session Affinity:  None
Events:            <none>

What was strange to me is this part

Endpoints:         10.36.0.5:80,10.36.1.5:80,10.36.100.5:80 + 114 more...

I was expecting to see 300 endpoints there. Is that assumption correct?

(I also found this post, which is about a similar issue, but there the author was experiencing only a few minutes of delay until the endpoints were updated, whereas for me it didn't change even after half an hour.)

How could I troubleshoot what was going wrong? I read that this is done by the Endpoints controller, but I couldn't find any info about where to check its logs.
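
(For context, my understanding is that the Endpoints controller runs inside kube-controller-manager, and on GKE the control plane is managed by Google, so its logs aren't directly accessible. A rough cluster-side check instead is to compare the number of pods matching the selector with the number of addresses in the Endpoints object, along these lines, using the names from the output above:)

$ kubectl get pods -n production -l app=my-app -o name | wc -l
$ kubectl get endpoints my-app -n production \
    -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}' | wc -l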

Update: We managed to reproduce this a couple more times. Sometimes it was less severe, for example, 381 endpoints instead of 445. One interesting thing we noticed is that when we retrieved the details of the endpoints:

$ k describe endpoints my-app
Name:         my-app
Namespace:    production
Labels:       app=my-app
Annotations:  <none>
Subsets:
  Addresses:          10.36.0.5,10.36.1.5,10.36.10.5,...
  NotReadyAddresses:  10.36.199.5,10.36.209.5,10.36.239.2,...

Then a bunch of IPs were "stuck" in the NotReadyAddresses state (not the ones that were "missing" from the service, though; even if I summed the number of IPs in Addresses and NotReadyAddresses, that was still less than the total number of ready pods). I don't know if this is related at all, and I couldn't find much info online about this NotReadyAddresses field.
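
(One sketch for digging into this further: each entry under notReadyAddresses in the Endpoints object carries a targetRef pointing at the backing pod, so the stuck IPs can be mapped back to pod names and then inspected individually:)

$ kubectl get endpoints my-app -n production \
    -o jsonpath='{range .subsets[*].notReadyAddresses[*]}{.ip}{" -> "}{.targetRef.name}{"\n"}{end}'
$ kubectl describe pod <one-of-the-pod-names> -n production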

-- Mark Vincze
google-kubernetes-engine
kubernetes

2 Answers

7/7/2018

I'm referring to your first attempt with 300 pods.

I would check the following (a few example commands are sketched after the list):

  • kubectl get po -l app=my-app to see if you get a 300-item list. Your deployment says you have 300 available pods, which makes your issue very interesting to analyze.

  • whether your pod/deployment manifest defines resource requests and limits; this helps the scheduler place pods.

  • whether some of your nodes have taints incompatible with your pod/deployment manifest

  • whether your pod/deploy manifest has liveness and readiness probes (please post them)

  • whether you have defined a ResourceQuota object that limits the creation of pods/deployments
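
Roughly, these checks map to commands like the following (namespace and labels taken from the question; adjust as needed):

$ kubectl get po -n production -l app=my-app | grep -c Running
$ kubectl get deploy my-app -n production \
    -o jsonpath='{.spec.template.spec.containers[*].resources}'
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
$ kubectl get deploy my-app -n production \
    -o jsonpath='{.spec.template.spec.containers[*].readinessProbe}'
$ kubectl get resourcequota -n production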

-- Nicola Ben
Source: StackOverflow

10/18/2018

It turned out that this was caused by using preemptible VMs in our node pools; it doesn't happen if the nodes are not preemptible.
We couldn't figure out more details of the root cause, but using preemptible VMs as nodes is not an officially supported scenario anyway, so we switched to regular VMs.
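
(For anyone hitting the same thing: as far as I know, GKE labels preemptible nodes with cloud.google.com/gke-preemptible=true, so a quick way to check whether a node pool is affected is something like:)

$ kubectl get nodes -l cloud.google.com/gke-preemptible=true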

-- Mark Vincze
Source: StackOverflow