I've done quite a bit of searching and can't find anyone who describes a resolution to this problem.
I'm getting intermittent 111 Connection refused errors on my Kubernetes clusters. About 90% of my requests succeed and the other 10% fail; if you refresh the page, a previously failed request will then succeed. I have 2 different Kubernetes clusters with the exact same setup, and both show the errors.
This question looks very close to what I am experiencing (I did install my setup onto a new cluster, but the same problem persisted): https://stackoverflow.com/questions/58401610/kubernetes-clusterip-intermittent-502-connection-refused
Setup
Cluster Setup
A Kubernetes nginx ingress controller serves web traffic into the cluster: https://kubernetes.github.io/ingress-nginx/deploy/#gce-gke
From there I have 2 Ingresses defined that route traffic based on the request host:
1. Stage Ingress
2. Prod Ingress
Ingress
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: potr-tms-ingress-{{ .Values.environment }}
  namespace: {{ .Values.environment }}
  labels:
    app: potr-tms-{{ .Values.environment }}
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/from-to-www-redirect: "true"
    # this line below doesn't seem to have an effect
    # nginx.ingress.kubernetes.io/service-upstream: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "100M"
    cert-manager.io/cluster-issuer: "letsencrypt-{{ .Values.environment }}"
spec:
  rules:
  - host: {{ .Values.ingress_host }}
    http:
      paths:
      - path: /
        backend:
          serviceName: potr-tms-service-{{ .Values.environment }}
          servicePort: 8000
  tls:
  - hosts:
    - {{ .Values.ingress_host }}
    - www.{{ .Values.ingress_host }}
    secretName: potr-tms-{{ .Values.environment }}-tls
These ingresses route to 2 services that I have defined for prod and stage:
Service
apiVersion: v1
kind: Service
metadata:
  name: potr-tms-service-{{ .Values.environment }}
  namespace: {{ .Values.environment }}
  labels:
    app: potr-tms-{{ .Values.environment }}
spec:
  type: ClusterIP
  ports:
  - name: potr-tms-service-{{ .Values.environment }}
    port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: potr-tms-{{ .Values.environment }}
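For reference, the ingress controller proxies directly to the pod IPs listed in this Service's Endpoints object (one ready address per replica), so that list is where I'd expect a gap to show up. A sketch of what a healthy one should look like for the stage environment (the pod IPs below are made up):
apiVersion: v1
kind: Endpoints
metadata:
  name: potr-tms-service-stage   # Endpoints share the Service's name
  namespace: stage
subsets:
- addresses:                     # only Running and Ready pods land here
  - ip: 10.52.1.23               # made-up example pod IP
  - ip: 10.52.2.47               # made-up example pod IP
  ports:
  - name: potr-tms-service-stage
    port: 8000                   # the Service's targetPort
    protocol: TCP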
These 2 services route to deployments that I have for both prod and stage:
Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: potr-tms-deployment-{{ .Values.environment }}
  namespace: {{ .Values.environment }}
  labels:
    app: potr-tms-{{ .Values.environment }}
spec:
  replicas: {{ .Values.deployment_replicas }}
  selector:
    matchLabels:
      app: potr-tms-{{ .Values.environment }}
  strategy:
    type: RollingUpdate
  template:
    metadata:
      annotations:
        rollme: {{ randAlphaNum 5 | quote }}
      labels:
        app: potr-tms-{{ .Values.environment }}
    spec:
      containers:
      - command: ["gunicorn", "--bind", ":8000", "config.wsgi"]
      # - command: ["python", "manage.py", "runserver", "0.0.0.0:8000"]
        envFrom:
        - secretRef:
            name: potr-tms-secrets-{{ .Values.environment }}
        image: gcr.io/potrtms/potr-tms-{{ .Values.environment }}:latest
        name: potr-tms-{{ .Values.environment }}
        ports:
        - containerPort: 8000
        resources:
          requests:
            cpu: 200m
            memory: 512Mi
      restartPolicy: Always
      serviceAccountName: "potr-tms-service-account-{{ .Values.environment }}"
status: {}
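One thing worth noting: the Deployment has no readinessProbe, so a new pod can start receiving traffic as soon as its container starts, and the rollme annotation forces every pod to be replaced on each deploy. A sketch of a probe, assuming the app answers on / over port 8000 (the path and timings are guesses, not tested values):
# Sketch only -- goes under the container in the pod template;
# the probe path and timings are assumptions, not tested values.
readinessProbe:
  httpGet:
    path: /                  # assumes the app returns a 2xx/3xx here
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
# And under spec.strategy, to keep old pods serving until new ones are Ready:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 1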
Error
This is the error that I'm seeing inside my ingress controller logs:
[screenshot of the ingress controller log showing the intermittent 111: Connection refused errors]
This seems pretty clear: if my deployment pods were failing or showing errors, they would be marked unavailable and the Service would not route traffic to them. To try to debug this I increased my deployment resources and replica counts. The amount of web traffic to this app is pretty low, though, ~10 users.
What I've Tried
1. Using a completely different ingress controller: https://github.com/kubernetes/ingress-nginx
2. Increasing deployment resources / replica counts (seems to have no effect)
3. Installing my whole setup on a brand-new cluster (same results)
4. Restarting the ingress controller / deleting and reinstalling it
5. Potentially this could be a Gunicorn problem. To test, I tried starting my pods with python manage.py runserver; the problem remained.
Update
Raising the pod counts seems to have helped a little bit.
Some requests still fail, though.
Did you find a solution to this? I am seeing something very similar on a minikube setup.
In my case, I believe I also see the nginx controller restarting after the 502. The 502 is intermittent; frequently the first access fails, then a reload works.
The best idea I've found so far is to increase the Nginx timeout parameter, but I have not tried that yet. Still trying to search out all options.
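For reference, those timeouts are exposed as ingress-nginx annotations; something like this is what I have in mind (added under metadata.annotations of the Ingress, values untested):
# Untested -- example values in seconds
nginx.ingress.kubernetes.io/proxy-connect-timeout: "10"
nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
nginx.ingress.kubernetes.io/proxy-read-timeout: "120"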
I was not able to figure out why these connection errors happen, but I did find a workaround that seems to solve the problem for our users.
Inside your ingress config, add the annotation:
nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "10"
I set it to 10 just to make sure it retried enough times, as I was fairly confident our services were working. You could probably get away with 2 or 3.
Here's my full ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: potr-tms-ingress-{{ .Values.environment }}
  namespace: {{ .Values.environment }}
  labels:
    app: potr-tms-{{ .Values.environment }}
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/from-to-www-redirect: "true"
    # nginx.ingress.kubernetes.io/service-upstream: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "100M"
    nginx.ingress.kubernetes.io/client-body-buffer-size: "100m"
    nginx.ingress.kubernetes.io/proxy-max-temp-file-size: "1024m"
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "10"
    cert-manager.io/cluster-issuer: "letsencrypt-{{ .Values.environment }}"
spec:
  rules:
  - host: {{ .Values.ingress_host }}
    http:
      paths:
      - path: /
        backend:
          serviceName: potr-tms-service-{{ .Values.environment }}
          servicePort: 8000
  tls:
  - hosts:
    - {{ .Values.ingress_host }}
    - www.{{ .Values.ingress_host }}
    secretName: potr-tms-{{ .Values.environment }}-tls
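If the retries alone don't cover it, there is also a related annotation, nginx.ingress.kubernetes.io/proxy-next-upstream, that controls which upstream failures trigger a retry. I haven't needed it myself, so treat this as a sketch:
# Sketch only -- widens the conditions that make nginx retry the next pod
nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502"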