I'm using Google Compute Engine to run a Kubernetes cluster. It runs fine without any issues, but occasionally (not very often) Kubernetes service discovery fails. Structurally, I'm using a ReplicationController and a Kubernetes Service to distribute the load.
I verified that none of the nodes in the cluster were restarted. Does anyone have any opinion on this? Also, what is the best practice to avoid such a rare scenario?
Your problem likely occurs when a pod is starting up or terminating and traffic is still being directed to its container. You should use a livenessProbe to health check the container (and consider a readinessProbe as well, since that is what controls whether a pod receives Service traffic; see the sketch after the example below). You can do health checks three ways: an HTTP GET, opening a TCP socket, or running a command. HTTP GETs must return a success status code and commands must exit with status 0; for sockets, the probe is considered successful if the connection can be opened. For example:
apiVersion: v1
kind: ReplicationController
metadata:
  name: my-nginx
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            # Path to probe; should be cheap, but representative of typical behavior
            path: /index.html
            port: 80
          initialDelaySeconds: 30
          timeoutSeconds: 1
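For completeness, the other two probe mechanisms mentioned above look like this (a minimal sketch; the port and the file checked by the command are illustrative):

        # TCP socket probe: succeeds if the port accepts a connection
        livenessProbe:
          tcpSocket:
            port: 80
          initialDelaySeconds: 30
          timeoutSeconds: 1

        # Exec probe: succeeds if the command exits with status 0
        livenessProbe:
          exec:
            command:
            - cat
            - /tmp/healthy
          initialDelaySeconds: 30
          timeoutSeconds: 1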
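Since the symptom here is traffic reaching pods while they start up or shut down, a readinessProbe on the same container is worth adding too: a pod is only included in the Service's endpoints while its readiness probe passes, so new and terminating pods are kept out of rotation. A minimal sketch, reusing the path and port from the example above (the delay value is illustrative):

        readinessProbe:
          httpGet:
            path: /index.html
            port: 80
          initialDelaySeconds: 5
          timeoutSeconds: 1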