I'm having issues with internal DNS/service resolution within Kubernetes and I can't seem to track the problem down. I have an api-gateway pod running Kong, which calls other services by their internal service name, i.e. srv-name.staging.svc.cluster.local. This was working fine up until recently, when I attempted to deploy three more services into two namespaces, staging and production.
The first service works as expected when calling booking-service.staging.svc.cluster.local; however, the same code doesn't seem to work in the production service, and the other two services don't work in either namespace.
The behavior I'm getting is a timeout. If I curl these services from my gateway pod, they all time out, apart from the first service deployed (booking-service.staging.svc.cluster.local). When I call these services from another container within the same pod, they work as expected.
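Roughly, the checks I'm running from the gateway pod look like this (pod and service names here are illustrative):

kubectl exec -it <api-gateway-pod> -- sh
# works:
curl -v http://booking-service.staging.svc.cluster.local
# times out:
curl -v http://booking-service.production.svc.cluster.local
curl -v http://<other-service>.staging.svc.cluster.local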
I have Node services set up for each service I wish to expose to the client side.
Here's an example Kubernetes deployment and service:
---
# API
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: {{SRV_NAME}}
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: {{SRV_NAME}}
    spec:
      containers:
        - name: booking-api
          image: microhq/micro:kubernetes
          args:
            - "api"
            - "--handler=rpc"
          env:
            - name: PORT
              value: "8080"
            - name: ENV
              value: {{ENV}}
            - name: MICRO_REGISTRY
              value: "kubernetes"
          ports:
            - containerPort: 8080
        - name: {{SRV_NAME}}
          image: eu.gcr.io/{{PROJECT_NAME}}/{{SRV_NAME}}:latest
          imagePullPolicy: Always
          command: [
            "./service",
            "--selector=static"
          ]
          env:
            - name: MICRO_REGISTRY
              value: "kubernetes"
            - name: ENV
              value: {{ENV}}
            - name: DB_HOST
              value: {{DB_HOST}}
            - name: VERSION
              value: "{{VERSION}}"
            - name: MICRO_SERVER_ADDRESS
              value: ":50051"
          ports:
            - containerPort: 50051
              name: srv-port
---
apiVersion: v1
kind: Service
metadata:
  name: booking-service
spec:
  ports:
    - name: api-http
      port: 80
      targetPort: 8080
      protocol: TCP
  selector:
    app: booking-api
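With this Service in the staging namespace, the gateway should be able to reach the booking API on port 80 via the cluster DNS name (which forwards to targetPort 8080 on the selected pods), e.g.:

curl http://booking-service.staging.svc.cluster.local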
I'm using go-micro (https://github.com/micro/go-micro) with the Kubernetes pre-configuration. Again, this works absolutely fine in one case but not in the others, which leads me to believe it's not code related. It also works fine locally.
When I do an nslookup from another pod, it resolves the name and finds the cluster IP for the internal Node service as expected. When I attempt to curl that IP address, I get the same timeout behavior.
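Concretely, that check looks something like this (the resolved cluster IP is a placeholder):

nslookup booking-service.staging.svc.cluster.local
# resolves to the service's cluster IP, e.g. 10.x.x.x
curl -v --connect-timeout 5 http://<cluster-ip>
# hangs and eventually times out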
I'm using Kubernetes 1.8 on Google Cloud.
I don't understand why you think this is an issue with internal DNS/service resolution within Kubernetes, since the DNS lookup works but querying the resolved IP gives a connection timeout.
It seems more like an issue with the connectivity between pods than a DNS issue, so I would focus the troubleshooting in that direction, but correct me if I'm wrong.
Can you perform the classic networking troubleshooting (ping, telnet, traceroute) from a pod towards the IP given by the DNS lookup, and from one of the containers that is timing out towards one of the other pods, and update the question with the results?
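Something along these lines, for example (pod names and IPs are placeholders to adjust to your setup):

kubectl exec -it <gateway-pod> -- sh
ping <cluster-ip-from-nslookup>
traceroute <cluster-ip-from-nslookup>
# the Service listens on port 80, the containers on 8080/50051
telnet <cluster-ip-from-nslookup> 80
# and directly against a backing pod IP (kubectl get pods -o wide shows the pod IPs)
telnet <pod-ip> 8080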