ElasticSearch Service becomes randomly unavailable from NodeJS API on Kubernetes 1.6

5/15/2017

We are currently troubleshooting an issue with a particular Kubernetes Service (LoadBalancer) for an Elasticsearch cluster that seems to become unavailable quite randomly, producing ECONNRESET errors on the client side. The setup is the following:

  • Kubernetes 1.6 on AWS deployed with Kops
  • Elasticsearch 2.4 cluster (Fabric8 discovery plugin deployment). It exposes itself as a Service / LoadBalancer
  • NodeJS API deployment acting as the client, connecting to the above Elasticsearch Service using the request module

After running for a while (not long, around 10-15 minutes), and not necessarily under heavy load (around 1 request per second), the Elasticsearch Service seems to become unavailable, the symptom being ECONNRESET errors on the API side:

{ Error: connect ECONNRESET 100.69.12.100:9200
    at Object.exports._errnoException (util.js:953:11)
    at exports._exceptionWithHostPort (util.js:976:20)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1080:14)
  cause:
   { Error: connect ECONNRESET 100.69.12.100:9200
       at Object.exports._errnoException (util.js:953:11)
       at exports._exceptionWithHostPort (util.js:976:20)
       at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1080:14)
     code: 'ECONNRESET',
     errno: 'ECONNRESET',
     syscall: 'connect',
     address: '100.69.12.100',
     port: 9200 },
  isOperational: true,
  code: 'ECONNRESET',
  errno: 'ECONNRESET',
  syscall: 'connect',
  address: '100.69.12.100',
  port: 9200 }

The error happens quite randomly: as noted, no high load is needed, and there is no consistent interval. At the exact times of those ECONNRESETs, the kube-dns addon logs the creation (recreation?) of the Elasticsearch Service (and other Services):

I0515 08:11:49.941166       1 dns.go:264] New service: elasticsearch
I0515 08:11:49.941226       1 dns.go:462] Added SRV record &{Host:elasticsearch.prod.svc.cluster.local. Port:9200 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0515 08:11:49.941275       1 dns.go:462] Added SRV record &{Host:elasticsearch.prod.svc.cluster.local. Port:9300 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}

These lines repeat for every Kubernetes Service, roughly every 5 minutes, continuously. I am unsure what they mean, whether this is normal behaviour, or whether it is related to the failures we are observing.

The interaction between the API and Elasticsearch over the network is basically a set of search queries (up to 30/50 in parallel). No errors are logged on the Elasticsearch side during these interactions.
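For context, the request-based code issues queries roughly like the sketch below (the index name, query body and service hostname are placeholders, not the actual application code):

// Rough sketch of the traffic pattern: N parallel search requests via the
// request module. Index name, query and host are hypothetical.
const request = require('request');

function search() {
  return new Promise((resolve, reject) => {
    request({
      method: 'POST',
      url: 'http://elasticsearch.prod.svc.cluster.local:9200/some-index/_search',
      json: true,
      body: { query: { match_all: {} } }
    }, (err, res, body) => (err ? reject(err) : resolve(body)));
  });
}

// Up to 30-50 of these run in parallel per API request.
Promise.all(Array.from({ length: 30 }, () => search()))
  .then((results) => console.log('got', results.length, 'responses'))
  .catch((err) => console.error(err)); // the ECONNRESETs surface here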

This setup worked fine for months; the recent changes were:

  • Kubernetes upgraded from version 1.4 to 1.6 using Kops
  • Added CPU / memory limits to the Elasticsearch cluster, following Elastic's recommendations
  • Added 30 extra initial calls to Elasticsearch, which are cached after the first call

Actions attempted:

  • Reduced and increased the resource limits (CPU/memory) on the Elasticsearch side to see whether the behaviour changed. No effect on the ECONNRESETs
  • Increased the thread pool and queue size for search on the Elasticsearch side. It handles more load, but the problem itself manifests under low load

We are considering rolling back to Kubernetes 1.4 and removing all limits. Any pointers or information would be highly appreciated.

UPDATE 1: Some extra info about the elasticsearch Service:

$ kubectl get svc elasticsearch
NAME            CLUSTER-IP       EXTERNAL-IP   PORT(S)                         AGE
elasticsearch   100.65.113.208   [AWS ELB]     9200:30295/TCP,9300:32535/TCP   6d

$ kubectl describe svc elasticsearch
  Name:         elasticsearch
  Namespace:        ppe
  Labels:           <none>
  Annotations:      kubernetes.io/change-cause=kubectl create --filename=app/elasticsearch/manifest/ --recursive=true --record=true --namespace=ppe --context=
  Selector:     component=elasticsearch,provider=fabric8,type=client
  Type:         LoadBalancer
  IP:           100.65.113.208
  LoadBalancer Ingress: [AWS ELB]
  Port:         http    9200/TCP
  NodePort:     http    30295/TCP
  Endpoints:        100.96.1.25:9200,100.96.1.26:9200,100.96.2.24:9200 + 5 more...
  Port:         transport   9300/TCP
  NodePort:     transport   32535/TCP
  Endpoints:        100.96.1.25:9300,100.96.1.26:9300,100.96.2.24:9300 + 5 more...
  Session Affinity: None
  Events:           <none>

UPDATE 2

I was able to mitigate the issue for now by using the official Elasticsearch NodeJS client for the search queries and the communication with the Service. The API code was previously using the request module to invoke the Elasticsearch REST API directly.

I am still investigating, as the underlying issue is still there, but it does not seem to manifest itself when using the NodeJS client with the following config:

const elasticSearch = require('elasticsearch');

const elasticSearchClient = new elasticSearch.Client({
  host: config.endpoints.elasticsearch,
  apiVersion: '2.3',
  maxRetries: 5,
  requestTimeout: 15000,
  deadTimeout: 30000,
  keepAlive: true
});
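For reference, search queries through this client look roughly like the following (a minimal sketch; the index name and query body are placeholders):

// Hypothetical example query; the real index and body are application-specific.
elasticSearchClient.search({
  index: 'some-index',
  body: { query: { match_all: {} } }
}).then((resp) => {
  console.log('hits:', resp.hits.total);
}).catch((err) => {
  console.error(err); // with the request module this is where ECONNRESET appeared
});

Presumably the client's keep-alive connection pooling, maxRetries and dead-node handling mask the sporadic resets that a single bare request call surfaces immediately, but I have not confirmed that yet.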

UPDATE 3

We observed this behaviour not only with this NodeJS API communicating with Elasticsearch, but also with the kibana-logging Kubernetes service and with elasticdump:

Mon, 15 May 2017 13:32:31 GMT | sent 1000 objects to destination elasticsearch, wrote 1000
Mon, 15 May 2017 13:32:42 GMT | got 1000 objects from source elasticsearch (offset: 24000)
Mon, 15 May 2017 13:32:44 GMT | Error Emitted => failed to parse json (message: "Unexpected token E in JSON at position 0") - source: "Error: 'read tcp 172.20.33.123:46758->100.96.4.15:9200: read: connection reset by peer'\nTrying to reach: 'http://100.96.4.15:9200/ghs.products-2017-05-13/_bulk'"

Those two tools and the original NodeJS API all use the request NPM module to communicate with Elasticsearch. Interestingly, the Elasticsearch NodeJS client (https://github.com/elastic/elasticsearch-js) does not seem to use request.

Still investigating, but this could end up being an issue / incompatibility between the exposed Elasticsearch Kubernetes Service and the request NodeJS module.
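If the request module has to stay, one workaround worth trying (purely an untested sketch, under the assumption that the resets mainly hit freshly opened connections) is to reuse keep-alive sockets and retry on ECONNRESET, roughly mimicking the keepAlive / maxRetries behaviour of the official client:

// Untested sketch: keep-alive agent plus a simple retry on ECONNRESET.
// The agent settings and retry count are assumptions, not a confirmed fix.
const http = require('http');
const request = require('request');

const keepAliveAgent = new http.Agent({ keepAlive: true, maxSockets: 50 });

function searchWithRetry(options, retriesLeft = 3) {
  return new Promise((resolve, reject) => {
    request(Object.assign({ agent: keepAliveAgent, json: true }, options),
      (err, res, body) => {
        if (err && err.code === 'ECONNRESET' && retriesLeft > 0) {
          // Retry the same request on a (possibly new) keep-alive socket.
          return resolve(searchWithRetry(options, retriesLeft - 1));
        }
        return err ? reject(err) : resolve(body);
      });
  });
}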

-- David Fernandez
amazon-web-services
elasticsearch
kube-dns
kubernetes
node.js
