I have a multi-deployment application on K8s, and DNS suddenly started failing randomly for one of the components (deployer). From inside the deployer pod, if I run curl against the service name (or service IP) of another component (bridge), it randomly fails with:
curl -v http://bridge:9998
* Could not resolve host: bridge
* Expire in 200 ms for 1 (transfer 0x555f0636fdd0)
* Closing connection 0
curl: (6) Could not resolve host: bridge
But if I use the IP of the bridge pod, it resolves and connects:
curl -v http://10.36.0.25:9998
* Expire in 0 ms for 6 (transfer 0x558d6c3eadd0)
* Trying 10.36.0.25...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x558d6c3eadd0)
* Connected to 10.36.0.25 (10.36.0.25) port 9998 (#0)
> GET / HTTP/1.1
> Host: 10.36.0.25:9998
> User-Agent: curl/7.64.0
> Accept: */*
>
< HTTP/1.1 200 OK
< X-Powered-By: Express
< Accept-Ranges: bytes
< Cache-Control: public, max-age=0
< Last-Modified: Mon, 08 Apr 2019 14:06:42 GMT
< ETag: W/"179-169fd45c550"
< Content-Type: text/html; charset=UTF-8
< Content-Length: 377
< Date: Wed, 11 Sep 2019 08:25:24 GMT
< Connection: keep-alive
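For reference, this is roughly how I compare the lookups from inside the deployer pod (a sketch; the default namespace and the 10.96.0.10 cluster DNS address are assumptions, not values taken from my cluster, and it assumes nslookup is available in the image):
# See which nameserver and search domains the pod is actually using
cat /etc/resolv.conf
# Try the short name and the fully qualified service name
nslookup bridge
nslookup bridge.default.svc.cluster.local
# Query the cluster DNS service directly, bypassing the search path
nslookup bridge.default.svc.cluster.local 10.96.0.10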
And here is my deployer YAML file:
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    Process: deployer
  creationTimestamp: null
  labels:
    io.kompose.service: deployer
  name: deployer
spec:
  ports:
  - name: "8004"
    port: 8004
    targetPort: 8004
  selector:
    io.kompose.service: deployer
status:
  loadBalancer: {}
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    Process: deployer
  creationTimestamp: null
  labels:
    io.kompose.service: deployer
  name: deployer
spec:
  replicas: 1
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        io.kompose.service: deployer
    spec:
      containers:
      - args:
        - bash
        - -c
        - lttng create && python src/rest.py
        env:
        - name: CONFIG_OVERRIDE
          value: {{ .Values.CONFIG_OVERRIDE | quote }}
        - name: WWS_RTMP_SERVER_URL
          value: {{ .Values.WWS_RTMP_SERVER_URL | quote }}
        - name: WWS_DEPLOYER_DEFAULT_SITE
          value: {{ .Values.WWS_DEPLOYER_DEFAULT_SITE | quote }}
        image: {{ .Values.image }}
        name: deployer
        readinessProbe:
          exec:
            command:
            - ls
            - /tmp
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 8004
        resources:
          requests:
            cpu: 0.1
            memory: 250Mi
          limits:
            cpu: 2
            memory: 5Gi
      restartPolicy: Always
      imagePullSecrets:
      - name: deployersecret
status: {}
As I mentioned, this happens for just this component; I ran the exact same command from inside other pods and it works properly. Any idea how I can solve this issue?
Since people are getting this wrong, let me describe the situation in more detail: the YAML above belongs to the component that is facing the problem (the other components work properly), and the curl command is what I run from inside this troubled pod. If I run the exact same command from within another pod, it resolves. Below are the Deployment and Service of the target, for your information:
apiVersion: v1
kind: Service
metadata:
  annotations:
    Process: bridge
  creationTimestamp: null
  labels:
    io.kompose.service: bridge
  name: bridge
spec:
  ports:
  - name: "9998"
    port: 9998
    targetPort: 9998
  - name: "9226"
    port: 9226
    targetPort: 9226
  - name: 9226-udp
    port: 9226
    protocol: UDP
    targetPort: 9226
  selector:
    io.kompose.service: bridge
status:
  loadBalancer: {}
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    Process: bridge
  creationTimestamp: null
  labels:
    io.kompose.service: bridge
  name: bridge
spec:
  replicas: 1
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        io.kompose.service: bridge
    spec:
      containers:
      - args:
        - bash
        - -c
        - npm run startDebug
        env:
        - name: NODE_ENV
          value: {{ .Values.NODE_ENV | quote }}
        image: {{ .Values.image }}
        name: bridge
        readinessProbe:
          httpGet:
            port: 9998
          initialDelaySeconds: 3
          periodSeconds: 15
        ports:
        - containerPort: 9998
        - containerPort: 9226
        - containerPort: 9226
          protocol: UDP
        resources:
          requests:
            cpu: 0.1
            memory: 250Mi
          limits:
            cpu: 2
            memory: 5Gi
      restartPolicy: Always
      imagePullSecrets:
      - name: bridgesecret
status: {}
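For completeness, these are the kinds of checks that can be run from outside the pods to confirm that the bridge Service has endpoints and that the cluster DNS is healthy (a sketch; the k8s-app=kube-dns label assumes a standard CoreDNS/kube-dns install in kube-system):
# Confirm the Service exists and has endpoints backing it
kubectl get svc bridge
kubectl get endpoints bridge
# Confirm the cluster DNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns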
Your service has port 8004 open, while you are sending curl to port 9998:
curl -v http://bridge:9998
Due to this mismatch, I think it's not working.
Also, you have exposed the service as a LoadBalancer, so from outside the cluster you have to use the IP address of the LoadBalancer to access the service.
If you want to resolve it internally within the cluster itself, you can use the service name, like
http://bridge:9998
From the outside internet, you can only access it through the load balancer.
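For example, inside the cluster both of these forms should work (assuming the bridge Service lives in the default namespace):
curl -v http://bridge:9998
curl -v http://bridge.default.svc.cluster.local:9998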
By defining "targetPort: 8004" you are publishing your service on this port. Why are you trying to curl the service on another port, 9998?
The problem was the image that I was using. The troubled component and one other component were using an image based on Python 2.7 (with different configurations), and both had DNS problems, while all the other components worked properly. I built an image based on Ubuntu and now everything is fine.
I think this might be related to the Go implementation that CoreDNS uses; for some reason the Python image couldn't work properly with that implementation. This is what one of my colleagues told me, and he had faced the same issue before on another project when he was working with Go.
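A rough way to confirm that the resolver inside the image is the problem, rather than the cluster DNS itself (a sketch; the pod names below are placeholders, and it assumes getent and python are available in the images):
# From the troubled, Python-2.7-based pod: the lookup fails intermittently
kubectl exec deployer-<pod-id> -- getent hosts bridge
kubectl exec deployer-<pod-id> -- python -c "import socket; print(socket.gethostbyname('bridge'))"
# From a healthy pod in the same namespace: the same lookup succeeds
kubectl exec bridge-<pod-id> -- getent hosts deployer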