DNS is not working for one deployment on K8s

9/11/2019

I have a multi-deployment application on K8s, and DNS has suddenly started failing randomly for one of the components (deployer). From inside the deployer pod, if I run a curl command against the service name or service IP of another component (bridge), I randomly get:

curl -v http://bridge:9998
* Could not resolve host: bridge
* Expire in 200 ms for 1 (transfer 0x555f0636fdd0)
* Closing connection 0
curl: (6) Could not resolve host: bridge

But if I use the IP of the bridge pod directly, it connects without any problem:

curl -v http://10.36.0.25:9998
* Expire in 0 ms for 6 (transfer 0x558d6c3eadd0)
*   Trying 10.36.0.25...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x558d6c3eadd0)
* Connected to 10.36.0.25 (10.36.0.25) port 9998 (#0)
> GET / HTTP/1.1
> Host: 10.36.0.25:9998
> User-Agent: curl/7.64.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< X-Powered-By: Express
< Accept-Ranges: bytes
< Cache-Control: public, max-age=0
< Last-Modified: Mon, 08 Apr 2019 14:06:42 GMT
< ETag: W/"179-169fd45c550"
< Content-Type: text/html; charset=UTF-8
< Content-Length: 377
< Date: Wed, 11 Sep 2019 08:25:24 GMT
< Connection: keep-alive
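
Since the pod IP works but the service name does not, the failure looks like name resolution inside this pod rather than connectivity. For reference, a minimal sketch of the checks that can narrow this down (the pod name, the default namespace and the presence of nslookup in the image are assumptions):

# Which nameserver and search domains is the troubled pod actually using?
kubectl exec -it <deployer-pod> -- cat /etc/resolv.conf

# Do the short name and the fully qualified service name resolve from inside it?
kubectl exec -it <deployer-pod> -- nslookup bridge
kubectl exec -it <deployer-pod> -- nslookup bridge.default.svc.cluster.local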

And my deployer yaml file:

---
apiVersion: v1
kind: Service
metadata:
  annotations:
    Process: deployer
  creationTimestamp: null
  labels:
    io.kompose.service: deployer
  name: deployer
spec:
  ports:
  - name: "8004"
    port: 8004
    targetPort: 8004
  selector:
    io.kompose.service: deployer
status:
  loadBalancer: {}
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    Process: deployer
  creationTimestamp: null
  labels:
    io.kompose.service: deployer
  name: deployer
spec:
  replicas: 1
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        io.kompose.service: deployer
    spec:
      containers:
      - args:
        - bash
        - -c
        - lttng create && python src/rest.py
        env:
        - name: CONFIG_OVERRIDE
          value: {{ .Values.CONFIG_OVERRIDE | quote}}
        - name: WWS_RTMP_SERVER_URL
          value: {{ .Values.WWS_RTMP_SERVER_URL | quote}}
        - name: WWS_DEPLOYER_DEFAULT_SITE
          value: {{ .Values.WWS_DEPLOYER_DEFAULT_SITE | quote}}
        image: {{ .Values.image }}
        name: deployer
        readinessProbe:
          exec:
            command:
            - ls
            - /tmp
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 8004
        resources:
          requests:
            cpu: 0.1
            memory: 250Mi
          limits:
            cpu: 2
            memory: 5Gi
      restartPolicy: Always
      imagePullSecrets:
      - name: deployersecret
status: {}

As I mentioned, this happens only for this component; I ran the exact same command from inside other pods and it works properly. Any idea how I can solve this issue?
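
For completeness, here is a rough sketch of what can be checked on the cluster DNS side while the failures happen (the k8s-app=kube-dns label is the usual CoreDNS default, so it is an assumption):

# Are the cluster DNS pods healthy, and do they log errors during the failures?
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

# Does the bridge service have endpoints (it should, since the pod IP itself works)?
kubectl get endpoints bridge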

Update

Since people are misreading the situation, let me describe it in more detail: the YAML above belongs to the component that is facing this problem (the other components work properly), and the curl command is what I run from inside this troubled pod. If I run the exact same command from within another pod, the name resolves. Below are the deployment and service of the target (bridge), for reference:

apiVersion: v1
kind: Service
metadata:
  annotations:
    Process: bridge
  creationTimestamp: null
  labels:
    io.kompose.service: bridge
  name: bridge
spec:
  ports:
  - name: "9998"
    port: 9998
    targetPort: 9998
  - name: "9226"
    port: 9226
    targetPort: 9226
  - name: 9226-udp
    port: 9226
    protocol: UDP
    targetPort: 9226
  selector:
    io.kompose.service: bridge
status:
  loadBalancer: {}
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    Process: bridge
  creationTimestamp: null
  labels:
    io.kompose.service: bridge
  name: bridge
spec:
  replicas: 1
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        io.kompose.service: bridge
    spec:
      containers:
      - args:
        - bash
        - -c
        - npm run startDebug
        env:
        - name: NODE_ENV
          value: {{ .Values.NODE_ENV | quote }}
        image: {{ .Values.image }}
        name: bridge
        readinessProbe:
          httpGet:
            port: 9998
          initialDelaySeconds: 3
          periodSeconds: 15
        ports:
        - containerPort: 9998
        - containerPort: 9226
        - containerPort: 9226
          protocol: UDP
        resources:
          requests:
            cpu: 0.1
            memory: 250Mi
          limits:
            cpu: 2
            memory: 5Gi
      restartPolicy: Always
      imagePullSecrets:
      - name: bridgesecret
status: {}
-- AVarf
dns
kubernetes

3 Answers

9/11/2019

Your service has port 8004 open,

while you are sending the curl request to port 9998:

curl -v http://bridge:9998

I think it is not working due to this mismatch.

Also, since you have exposed the service as a LoadBalancer, from outside the cluster you have to use the load balancer's IP address to access the service.

If you want to resolve it internally within the cluster itself, you can use the service name, like:

http://bridge:9998

From the outside internet you can only access it through the load balancer.
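
Just to illustrate, for cluster-internal access the name can also be fully qualified (assuming the bridge service is in the default namespace):

# Short name works from the same namespace; the fully qualified form works from any namespace.
curl -v http://bridge:9998
curl -v http://bridge.default.svc.cluster.local:9998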

-- Harsh Manvar
Source: StackOverflow

9/11/2019

By defining "targetPort: 8004", you are publishing your service on this port. Why are you trying to curl the service on another port, 9998?

-- Maryam Tavakkoli
Source: StackOverflow

9/11/2019

The problem was the image that I was using. The troubled component and one other component were using an image based on python2.7 (with different configurations), and both had DNS problems, while all the other components worked properly. I built an image based on Ubuntu and now everything is good.

I think this might be related to the Go implementation that CoreDNS uses: for some reason, the python image can't work properly with that implementation. This is what one of my colleagues told me; he had faced the same issue before on another project when he was working with Go.
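
If someone wants to check whether they are hitting the same kind of resolver problem before rebuilding their image, one quick test is to use the fully qualified service name with a trailing dot, which skips the search-list expansion entirely (the default namespace is an assumption here):

# If this resolves reliably while the short name fails intermittently, the image's
# resolver is likely mishandling the cluster search list / ndots setting.
curl -v http://bridge.default.svc.cluster.local.:9998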

-- AVarf
Source: StackOverflow