TL;DR: An nginx-ingress-controller affects another LoadBalancer service on a different domain once every ~5 requests.
I have a weird situation with Kubernetes on GCE, and I am stuck. I don't know if I have a configuration error or if I have stumbled upon a (very severe) bug in k8s.
I have two LoadBalancer services, each with their own static IP and a DNS record pointing to them.
One LoadBalancer points (through its selector) directly to a Deployment running my API webserver; this is api.domain.com. This API cannot be behind an ingress controller due to a complex client-side certificate authentication scheme, which is not (yet) possible with the nginx ingress.
The other LoadBalancer service points to an NGINX ingress controller, which serves my website at site.domain.com. I use a standard nginx-default-backend to serve the 404s from the ingress controller.
The issue is that when I load the API (at api.domain.com) in a browser, once every 3 or 4 refreshes the 404 is served from nginx-default-backend.
So once every 5 times or so, a page from a totally different domain (site.domain.com, 234.234.234.234) is served on my API domain (api.domain.com, 123.123.123.123). I don't understand how this can happen.
Once I remove the nginx-ingress-controller, the API functions normally again. I'm really puzzled.
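To give a rough sense of the failure rate, a loop like the one below should make the pattern visible (no client certificate here, so the "successful" status codes will vary with how the API handles the handshake, but the stray 404 from nginx-default-backend turns up every few requests):

# Rough reproduction from outside the cluster (adjust the path as needed)
for i in $(seq 1 20); do
  curl -sk -o /dev/null -w "%{http_code}\n" https://api.domain.com/
done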
For the API:
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  type: LoadBalancer
  loadBalancerIP: 123.123.123.123
  selector:
    app: api
  ports:
  - port: 443
And for the website:
apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress-lb
  labels:
    app: nginx-ingress-lb
spec:
  type: LoadBalancer
  loadBalancerIP: 234.234.234.234
  ports:
  - port: 443
    name: https
  selector:
    # Selects nginx-ingress-controller pods
    app: nginx-ingress-controller
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx-ingress-controller
  labels:
    app: nginx-ingress-controller
spec:
  replicas: 1
  template:
    metadata:
      name: nginx-ingress-controller
      labels:
        app: nginx-ingress-controller
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.9.0-beta.17
        name: nginx-ingress-controller
        readinessProbe:
          httpGet:
            path: /healthz
            port: 10254
            scheme: HTTP
        livenessProbe:
          httpGet:
            path: /healthz
            port: 10254
            scheme: HTTP
          initialDelaySeconds: 10
          timeoutSeconds: 1
        ports:
        - containerPort: 443
          hostPort: 443
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        args:
        - /nginx-ingress-controller
        - --default-backend-service=$(POD_NAMESPACE)/nginx-default-backend
        - --publish-service=$(POD_NAMESPACE)/nginx-ingress-lb
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: ingress
  namespace: development
spec:
  tls:
  - hosts:
    - site.domain.com
    secretName: "site.domain.com-tls"
  rules:
  - host: "site.domain.com"
    http:
      paths:
      - backend:
          serviceName: website
          servicePort: http
What I have checked so far:
I have checked my DNS records using host -a; they are both correct. I checked for name collisions in the selectors using kubectl get po -l app=website; no collisions. I have checked the bound IP addresses:
> kubectl get svc
NAME                    TYPE           CLUSTER-IP     EXTERNAL-IP       PORT(S)
api                     LoadBalancer   10.3.240.197   123.123.123.123   443:32126/TCP
nginx-default-backend   ClusterIP      10.3.253.16    <none>            80/TCP
nginx-ingress-lb        LoadBalancer   10.3.245.191   234.234.234.234   443:31051/TCP
website                 ClusterIP      10.3.254.180   <none>            80/TCP

> kubectl get ingress
NAME      HOSTS             ADDRESS           PORTS
ingress   site.domain.com   234.234.234.234   80, 443
> host api.domain.com
api.domain.com has address 123.123.123.123
> host site.domain.com
site.domain.com has address 234.234.234.234
All looks good to me.
Am I doing something wrong or is there something seriously wrong with k8s or nginx-ingress?
This was an interesting one.
I spent some time drawing diagrams and hypothesising about why the error was occurring, but the best answer comes from the GLBC README:
Don't start 2 instances of the controller in a single cluster, they will fight each other
I believe this behaviour comes from the GCE load balancer forwarding rules conflicting with the nginx-ingress-controller (or vice versa :) ).
From what I can tell, the GCE load balancer forwarding rules forward traffic to the cluster hosts on the same port number they accept it on, i.e. :443 in your example.
In the nginx-ingress-controller definition:
ports:
- containerPort: 443
  hostPort: 443
We see that the nginx-ingress pods are listening on the hosts at :443, but the GCE load balancer is also forwarding to the hosts at :443.
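You can see both sides of this from GCE directly (a sketch; the resource and region names below are placeholders for whatever GKE generated for your services):

# Both LoadBalancer services get a forwarding rule that sends :443
# straight to the node pool, backed by a target pool of cluster nodes
gcloud compute forwarding-rules list

# Inspect the target pool behind the api service's rule; it lists the
# nodes themselves, not the pods (<api-target-pool>/<region> are placeholders)
gcloud compute target-pools describe <api-target-pool> --region <region>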
Imagine your API pods are deployed on some subset of the cluster nodes, say 3 out of 4. Then 3 times out of 4 the GCE load balancer directs traffic to a host with a listening API pod - success! But the 4th request routes to a node on port :443 with no API pod running. However, an nginx-ingress-controller pod is listening there, and it responds to the request with a 404.
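You can check whether this matches your cluster by comparing which nodes run API pods with which nodes run the ingress controller (plain kubectl, nothing specific to this setup):

# Nodes hosting API pods vs. nodes hosting the ingress controller;
# a node in the second list but not the first can serve the stray 404
kubectl get pods -o wide -l app=api
kubectl get pods -o wide -l app=nginx-ingress-controller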
So the issue is not really one of DNS resolution as it might appear.
The quote below, from the k8s services shortcomings, seems to support my theory: the NodePort values are unused, hence port forwarding is happening on the same port.
This is not strictly required on all cloud providers (e.g. Google Compute Engine does not need to allocate a NodePort to make LoadBalancer work, but AWS does)
GCE forwarding rule creation: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/gce/gce_loadbalancer_external.go
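One possible mitigation, if my reading is right, is to stop binding the ingress controller to the host's :443 and let the nginx-ingress-lb Service (which --publish-service already points at) do the exposing. A sketch, not something I have verified against your cluster:

# In the nginx-ingress-controller Deployment: drop the hostPort so nginx
# only listens inside the pod; traffic still reaches it through the
# nginx-ingress-lb LoadBalancer Service that selects these pods
ports:
- containerPort: 443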