I have a Kubernetes cluster set up on AWS. When I make a call to elasticsearch-client.default.svc.cluster.local from a pod, I occasionally get an UnknownHostException. It must have something to do with name resolution, because hitting the service IP directly works fine.
Note: I already have the kube-dns autoscaler enabled, and I manually tried with as many as 6 kube-dns pods, so I don't think it is a DNS pod scaling issue.
When I set the kube-dns ConfigMap's upstream nameserver values to the Google nameservers (8.8.8.8 and 8.8.4.4), I do not get the issue. I assume the failures are caused by API rate limiting done by AWS on Route 53, but I don't know why the name resolution requests would go to the AWS nameservers in the first place.
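For what it's worth, the behaviour can be checked independently of Elasticsearch with a plain resolution loop run from inside a pod; something like this sketch (the class name is just for illustration) succeeds most of the time but sporadically fails for the very same name:

import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsLookupLoop {
    public static void main(String[] args) throws InterruptedException {
        String host = "elasticsearch-client.default.svc.cluster.local";
        for (int i = 1; i <= 50; i++) {
            try {
                // Same name the application resolves; prints the resolved address on success.
                System.out.println(i + ": " + InetAddress.getByName(host).getHostAddress());
            } catch (UnknownHostException e) {
                // An intermittent failure here points at cluster DNS rather than the service itself.
                System.out.println(i + ": lookup failed - " + e.getMessage());
            }
            Thread.sleep(500);
        }
    }
}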
I also faced a similar issue with my custom Kubernetes cluster, MySQL, and Solr. The kube-dns checks suggested by the official debugging tutorial (https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/) were all fine, so I had to apply the following retry logic to the data source and the Solr client:
...
import org.apache.commons.dbcp.BasicDataSource;
...
public class CommunicationSafeDataSource extends BasicDataSource {

    private static final Logger LOGGER = LoggerFactory.getLogger(CommunicationSafeDataSource.class);

    @Override
    public Connection getConnection() throws SQLException {
        // Retry up to 10 times; a transient DNS failure surfaces as a CommunicationsException.
        for (int i = 1; i <= 10; i++) {
            try {
                return super.getConnection();
            } catch (Exception e) {
                if ((e instanceof CommunicationsException) || (e.getCause() instanceof CommunicationsException)) {
                    LOGGER.warn("Communication exception occurred, retry " + i);
                    try {
                        // Linear back-off: wait 1s, 2s, 3s, ... between attempts.
                        Thread.sleep(i * 1000);
                    } catch (InterruptedException ie) {
                        //
                    }
                } else {
                    throw e;
                }
            }
        }
        throw new IllegalStateException("Cannot get connection");
    }
}
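Usage stays the same as with a plain BasicDataSource; here is a minimal sketch (the driver class, URL, and credentials are placeholders, not my actual settings):

CommunicationSafeDataSource dataSource = new CommunicationSafeDataSource();
dataSource.setDriverClassName("com.mysql.jdbc.Driver");
dataSource.setUrl("jdbc:mysql://mysql.default.svc.cluster.local:3306/mydb");
dataSource.setUsername("user");
dataSource.setPassword("secret");

// getConnection() now retries transient communication failures before giving up.
try (Connection connection = dataSource.getConnection()) {
    // use the connection as usual
}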
...
import org.apache.solr.client.solrj.impl.HttpSolrClient;
...
public class CommunicationSafeSolrClient extends HttpSolrClient {

    private static final Logger LOGGER = LoggerFactory.getLogger(CommunicationSafeSolrClient.class);

    protected CommunicationSafeSolrClient(Builder builder) {
        super(builder);
    }

    @Override
    protected NamedList<Object> executeMethod(HttpRequestBase method, ResponseParser processor, boolean isV2Api)
            throws SolrServerException {
        // Retry up to 10 times; transient DNS failures surface as UnknownHostException or ConnectException.
        for (int i = 1; i <= 10; i++) {
            try {
                return super.executeMethod(method, processor, isV2Api);
            } catch (Exception e) {
                if ((e instanceof UnknownHostException) || (e.getCause() instanceof UnknownHostException)
                        || (e instanceof ConnectException) || (e.getCause() instanceof ConnectException)) {
                    LOGGER.warn("Communication exception occurred, retry " + i);
                    try {
                        // Linear back-off: wait 1s, 2s, 3s, ... between attempts.
                        Thread.sleep(i * 1000);
                    } catch (InterruptedException ie) {
                        //
                    }
                } else {
                    throw e;
                }
            }
        }
        throw new IllegalStateException("Cannot execute method");
    }
}
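Because the constructor is protected, one simple way to construct the client (just a sketch; the Solr URL is a placeholder) is a small static factory added to the class, after which it is used like any other HttpSolrClient and every request goes through the retrying executeMethod():

// Static factory inside CommunicationSafeSolrClient:
public static CommunicationSafeSolrClient create(String baseSolrUrl) {
    return new CommunicationSafeSolrClient(new HttpSolrClient.Builder(baseSolrUrl));
}

// Caller side:
CommunicationSafeSolrClient client =
        CommunicationSafeSolrClient.create("http://solr.default.svc.cluster.local:8983/solr/mycore");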
Here's a good write-up that may be related to your problems; also check out this one by Weaveworks.
Basically, a number of issues have been created in the Kubernetes GitHub issue tracker over the last year that have to do with various DNS latencies/problems from within a cluster.
Worth mentioning, although not a fix for every DNS-related problem, is that CoreDNS has been generally available since Kubernetes 1.11 and is (or will be) the default, replacing kube-dns as the default DNS add-on for clusters.
Here are a couple of issues that might be related to the problem you're experiencing:
Hopefully this helps you move forward.