Occasional UnknownHostException for a service within Kubernetes

8/13/2018

I have a Kubernetes cluster set up on AWS. When I call elasticsearch-client.default.svc.cluster.local from a pod, I occasionally get an UnknownHostException. It must have something to do with name resolution, because hitting the service IP directly works fine.

Note: I already have the kube-dns autoscaler enabled, and I also tried manually scaling to about 6 kube-dns pods, so I don't think the cause is DNS pod scaling.

When I set upstreamNameservers in the kube-dns ConfigMap to Google's nameservers (8.8.8.8 and 8.8.4.4), the issue goes away. I assume the cause is API rate limiting done by AWS on Route 53, but I don't know why the name resolution request would go to the AWS nameservers.
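To put a number on "occasionally", a small probe run from inside the pod could count how many lookups fail. This is only a sketch: DnsProbe, Resolver, and countFailures are hypothetical names, and the 100 attempts with a 100 ms pause are arbitrary. It sets the JVM's networkaddress.cache.ttl and networkaddress.cache.negative.ttl security properties to 0 so that cached answers do not hide failures.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

public class DnsProbe {

  /** Hypothetical lookup abstraction, so the probe can be exercised without a network. */
  @FunctionalInterface
  public interface Resolver {
    InetAddress[] resolve(String host) throws UnknownHostException;
  }

  /** Performs up to `attempts` lookups and returns how many failed with UnknownHostException. */
  public static int countFailures(Resolver resolver, String host, int attempts, long pauseMillis) {
    int failures = 0;
    for (int i = 0; i < attempts; i++) {
      try {
        resolver.resolve(host);
      } catch (UnknownHostException e) {
        failures++;
        System.err.println("Lookup " + (i + 1) + " failed: " + e.getMessage());
      }
      try {
        if (pauseMillis > 0) {
          Thread.sleep(pauseMillis);
        }
      } catch (InterruptedException ie) {
        // Restore the interrupt flag and stop probing early.
        Thread.currentThread().interrupt();
        break;
      }
    }
    return failures;
  }

  public static void main(String[] args) {
    // Disable the JVM's positive and negative DNS caches before the first lookup,
    // so every attempt goes to the resolver instead of a cached result.
    Security.setProperty("networkaddress.cache.ttl", "0");
    Security.setProperty("networkaddress.cache.negative.ttl", "0");

    String host = args.length > 0 ? args[0] : "elasticsearch-client.default.svc.cluster.local";
    int attempts = 100; // arbitrary sample size
    int failures = countFailures(InetAddress::getAllByName, host, attempts, 100);
    System.out.println(failures + " of " + attempts + " lookups failed for " + host);
  }
}
```

Running it once with the default kube-dns configuration and once with the Google upstream nameservers would show whether the failure rate really changes with that setting.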

-- Vidhyashankar Madheswaraswamy
kops
kube-dns
kubernetes

2 Answers

3/12/2020

I also faced a similar issue with my custom Kubernetes cluster, MySQL, and Solr. The kube-dns checks suggested by the official DNS debugging guide (https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/) all passed, so I had to apply the following retry logic to the data source and the Solr client:

...
import org.apache.commons.dbcp.BasicDataSource;
...
public class CommunicationSafeDataSource extends BasicDataSource {
  private static final Logger LOGGER = LoggerFactory.getLogger(CommunicationSafeDataSource.class);

  @Override
  public Connection getConnection() throws SQLException {
    for (int i = 1; i <= 10; i++) {
      try {
        return super.getConnection();
      } catch (Exception e) {
        if ((e instanceof CommunicationsException) || (e.getCause() instanceof CommunicationsException)) {
          LOGGER.warn("Communication exception occurred, retry " + i);
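          try {
            // Linear backoff: wait 1s, 2s, ... before the next attempt.
            Thread.sleep(i * 1000L);
          } catch (InterruptedException ie) {
            // Restore the interrupt flag and stop retrying instead of swallowing the interrupt.
            Thread.currentThread().interrupt();
            throw new SQLException("Interrupted while waiting to retry", ie);
          }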
        } else {
          throw e;
        }
      }
    }

    throw new IllegalStateException("Cannot get connection");
  }
}

...
import org.apache.solr.client.solrj.impl.HttpSolrClient;
...
public class CommunicationSafeSolrClient extends HttpSolrClient {

  private static final Logger LOGGER = LoggerFactory.getLogger(CommunicationSafeSolrClient.class);

  protected CommunicationSafeSolrClient(Builder builder) {
    super(builder);
  }

  @Override
  protected NamedList<Object> executeMethod(HttpRequestBase method, ResponseParser processor, boolean isV2Api)
      throws SolrServerException {
    for (int i = 1; i <= 10; i++) {
      try {
        return super.executeMethod(method, processor, isV2Api);
      } catch (Exception e) {
        if ((e instanceof UnknownHostException) || (e.getCause() instanceof UnknownHostException)
        || (e instanceof ConnectException) || (e.getCause() instanceof ConnectException)) {
          LOGGER.warn("Communication exception occurred, retry " + i);
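          try {
            // Linear backoff: wait 1s, 2s, ... before the next attempt.
            Thread.sleep(i * 1000L);
          } catch (InterruptedException ie) {
            // Restore the interrupt flag and stop retrying instead of swallowing the interrupt.
            Thread.currentThread().interrupt();
            throw new SolrServerException("Interrupted while waiting to retry", ie);
          }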
        } else {
          throw e;
        }
      }
    }

    throw new IllegalStateException("Cannot execute method");
  }
}
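
Both classes repeat the same loop, so the retry logic could be pulled into one helper. The following is a minimal sketch rather than the code I actually run: Retry, withLinearBackoff, and causedBy are hypothetical names, and the attempt count and base delay are parameters chosen by the caller. Unlike the classes above, it walks the whole cause chain and rethrows the last failure instead of an IllegalStateException.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

public final class Retry {

  private Retry() {}

  /**
   * Calls `action` up to `maxAttempts` times, sleeping attempt * baseDelayMillis
   * between attempts, and retrying only when `retryable` accepts the exception.
   */
  public static <T> T withLinearBackoff(Callable<T> action, Predicate<Exception> retryable,
                                        int maxAttempts, long baseDelayMillis) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return action.call();
      } catch (Exception e) {
        if (!retryable.test(e)) {
          throw e; // not a communication problem: fail immediately
        }
        last = e;
        if (attempt < maxAttempts) {
          try {
            Thread.sleep(attempt * baseDelayMillis);
          } catch (InterruptedException ie) {
            // Restore the interrupt flag and give up instead of swallowing the interrupt.
            Thread.currentThread().interrupt();
            throw ie;
          }
        }
      }
    }
    throw last; // every attempt failed with a retryable exception
  }

  /** Returns true if `t` or any exception in its cause chain is an instance of `type`. */
  public static boolean causedBy(Throwable t, Class<? extends Throwable> type) {
    for (Throwable c = t; c != null; c = c.getCause()) {
      if (type.isInstance(c)) {
        return true;
      }
    }
    return false;
  }
}
```

With such a helper, getConnection() could delegate to something like Retry.withLinearBackoff(() -> super.getConnection(), e -> Retry.causedBy(e, CommunicationsException.class), 10, 1000), after adapting the checked exception types to the overridden method's signature.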
-- Alex Rewa
Source: StackOverflow

8/13/2018

Here's a good write-up that may be related to your problem; also check out this one by Weaveworks.

Basically, a number of issues have been opened on the Kubernetes GitHub issue tracker over the last year that have to do with various DNS latencies and problems from within a cluster.

Worth mentioning, although it is not a fix for every DNS-related problem: CoreDNS has been generally available since Kubernetes 1.11 and is, or will become, the default DNS add-on for clusters, replacing kube-dns.

Here are a couple of issues that might be related to the problem you're experiencing:

#47142

#45976

#56903

Hopefully this helps you move forward.

-- mikejoh
Source: StackOverflow