I have an on-prem Kubernetes cluster, and I want to deploy a Docker registry to it from which the cluster nodes can pull images. I have tried several ways of addressing the service: a NodePort, a LoadBalancer provided by MetalLB in Layer 2 mode, its Flannel network IP (the IP that, by default, would be on the 10.244.0.0/16 network), and its cluster IP (the IP that, by default, would be on the 10.96.0.0/12 network). In every case, connecting to the registry via Docker failed.
I ran cURL against the IP and found that, while the requests were resolving as expected, the TCP dial step consistently took 63.15 ± 0.05 seconds, after which the HTTP(S) request itself completed in a time that falls within the margin of error of the TCP dial measurement. This is consistent across deployments with firewall rules ranging from a relatively strict set to nothing except the rules added directly by Kubernetes. It is consistent across network architectures ranging from a single physical server with VMs for all cluster nodes to distinct physical hardware for each node and a physical switch. As mentioned above, it is consistent across the means by which the service is exposed. It is also consistent regardless of whether I expose the registry through an ingress-nginx service or expose it directly.
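The split between dial time and request time described above can also be measured without cURL. The following is a minimal Python sketch, not the exact method used for the numbers above: `time_tcp_connect` and `compare_endpoints` are hypothetical helper names, and any addresses passed in are placeholders for the cluster IP, LoadBalancer IP, or node IP and NodePort being tested. The timeout is deliberately longer than the observed 63-second delay so that a slow dial is measured rather than aborted.

```python
import socket
import time


def time_tcp_connect(host: str, port: int, timeout: float = 120.0) -> float:
    """Return the seconds spent establishing a TCP connection to host:port.

    When host is a literal IP address no DNS lookup is involved, so the
    result is essentially the duration of the TCP handshake.
    """
    start = time.perf_counter()
    # create_connection returns only once the handshake has completed
    # (or raises OSError on failure or timeout).
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed


def compare_endpoints(endpoints, timeout: float = 120.0):
    """Time the TCP dial for several labelled (host, port) endpoints.

    endpoints maps a label (for example "cluster-ip" or "nodeport") to a
    (host, port) tuple; the labels and addresses are supplied by the caller.
    """
    return {
        label: time_tcp_connect(host, port, timeout=timeout)
        for label, (host, port) in endpoints.items()
    }
```

Running `compare_endpoints` from a cluster node and again from a machine outside the cluster, with the same set of endpoints, gives a side-by-side view of which paths show the delay.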
Further, when I deploy another pod to the cluster, I am able to reach the pod at its cluster IP without any delays, but I do encounter an identical delay when trying to reach it at its external LoadBalancer IP or at a NodePort. No delays besides expected network latency are encountered when trying to reach the registry from a machine that is not a node on the cluster, e.g., using the LoadBalancer or NodePort.
As a practical matter, my main question is: what is the "correct" way to do what I am attempting? As an academic matter, I would also like to know the source of the very long, very consistent delay that I have been seeing.
My deployment YAML file is included below for reference. The ingress controller is ingress-nginx.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-pv-claim
  namespace: docker-registry
  labels:
    app: registry
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: docker-registry
  namespace: docker-registry
spec:
  replicas: 1
  selector:
    matchLabels:
      app: docker-registry
  template:
    metadata:
      labels:
        app: docker-registry
    spec:
      containers:
        - name: docker-registry
          image: registry:2.7.1
          env:
            - name: REGISTRY_HTTP_ADDR
              value: ":5000"
            - name: REGISTRY_STORAGE_FILESYSTEM_ROOTDIRECTORY
              value: "/var/lib/registry"
          ports:
            - name: http
              containerPort: 5000
          volumeMounts:
            - name: image-store
              mountPath: "/var/lib/registry"
      volumes:
        - name: image-store
          persistentVolumeClaim:
            claimName: registry-pv-claim
---
kind: Service
apiVersion: v1
metadata:
  name: docker-registry
  namespace: docker-registry
  labels:
    app: docker-registry
spec:
  selector:
    app: docker-registry
  ports:
    - name: http
      port: 5000
      targetPort: 5000
---
apiVersion: v1
items:
  - apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      annotations:
        nginx.ingress.kubernetes.io/proxy-body-size: "0"
        nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
        nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
        kubernetes.io/ingress.class: docker-registry
      name: docker-registry
      namespace: docker-registry
    spec:
      rules:
        - host: example-registry.com
          http:
            paths:
              - backend:
                  serviceName: docker-registry
                  servicePort: 5000
                path: /
      tls:
        - hosts:
            - example-registry.com
          secretName: tls-secret
For future visitors: this issue appears to be related to Flannel.
The problem is described in these issues:
https://github.com/kubernetes/kubernetes/issues/88986
https://github.com/coreos/flannel/issues/1268
A workaround is given here:
https://github.com/kubernetes/kubernetes/issues/86507#issuecomment-595227060
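As for why the delay is so consistently about 63 seconds, one plausible reading is that the initial SYN packets are lost and the connection only succeeds on a late retransmission. The sketch below is arithmetic only and rests on assumptions not stated in the question: a 1-second initial retransmission timeout that doubles after each unanswered SYN, and 6 SYN retries (the usual Linux default for `net.ipv4.tcp_syn_retries`). The function name `syn_retransmit_schedule` is hypothetical, and the linked issues remain the authority on the actual cause.

```python
def syn_retransmit_schedule(initial_rto: float = 1.0, retries: int = 6) -> list[float]:
    """Return when each retransmitted SYN is sent, in seconds after the first SYN.

    Assumes the retransmission timeout starts at initial_rto and doubles
    after every unanswered attempt (exponential backoff).
    """
    times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(retries):
        elapsed += rto       # wait out the current timeout
        times.append(elapsed)
        rto *= 2             # back off before the next attempt
    return times


schedule = syn_retransmit_schedule()
print(schedule)      # [1.0, 3.0, 7.0, 15.0, 31.0, 63.0]
print(schedule[-1])  # 63.0, i.e. 1 + 2 + 4 + 8 + 16 + 32
```

Under these assumptions the last retransmission goes out 63 seconds after the first SYN, which is close to the 63.15 ± 0.05 seconds measured in the question.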