We are running a Kubernetes cluster in GKE (Google Kubernetes Engine, version 1.13.10). It is a regional cluster that started with two nodes per zone (six nodes in total). We have several services running on this cluster, including some web services and a Kerberos service.
Recently we changed the number of nodes per zone from two to three (so we now have nine nodes). When we did this, the Kerberos service became inaccessible.
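For reference, we resized the node pool with something like the following (the cluster name, node pool name and region here are placeholders, not our real values):

# Resize the regional cluster's node pool from 2 to 3 nodes per zone.
gcloud container clusters resize CLUSTER_NAME \
    --node-pool default-pool \
    --num-nodes 3 \
    --region europe-west1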
Some detail: the Kerberos service runs as three pods in a StatefulSet behind two Services (one UDP, one TCP) that share a static IP address. Both Services are of type LoadBalancer and use externalTrafficPolicy: Local so that we can more easily log the client's IP address.
When we added the extra nodes, the Kerberos Service logged the following events:
Type    Reason               Age                From                Message
----    ------               ---                ----                -------
Normal  UpdatedLoadBalancer  53m (x2 over 56m)  service-controller  Updated load balancer with new hosts
The pods kept running, but the Services' external endpoint was no longer accessible: connecting to it with telnet showed nothing listening at the other end. Restarting the pods solved the problem.
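Roughly what we did to check and then recover (the exact commands are from memory, so treat this as a sketch):

# Probe the Kerberos TCP port on the load balancer IP; after the resize
# the connection never completed.
telnet 35.101.23.134 88

# "Restarting the pods" meant deleting them and letting the StatefulSet
# recreate them; this brought the endpoint back.
kubectl -n kdc delete pod -l app=kdc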
Here is the definition for the TCP Service:
apiVersion: v1
kind: Service
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: kdc.example.org
  name: kdc-tcp
  namespace: kdc
spec:
  clusterIP: 10.8.18.71
  externalTrafficPolicy: Local
  healthCheckNodePort: 32447
  loadBalancerIP: 35.101.23.134
  ports:
  - name: kerberos-tcp
    nodePort: 32056
    port: 88
    protocol: TCP
    targetPort: 88
  selector:
    app: kdc
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - ip: 35.101.23.134
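Since the Service uses externalTrafficPolicy: Local, the GCP load balancer should only send traffic to nodes that pass the kube-proxy health check on healthCheckNodePort (32447 above). This is the kind of check we would run to see which nodes the load balancer considers healthy; NODE_IP is a placeholder for a node's internal IP, not a value from our setup:

# kube-proxy serves an HTTP health check for this Service on every node.
# A node running a kdc pod should return 200 with "localEndpoints" >= 1;
# a node without a local pod returns 503 and the load balancer skips it.
curl -i http://NODE_IP:32447/healthz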
Why would adding some extra nodes cause this to happen? How can we avoid this problem in the future?