We are currently troubleshooting an issue with a particular Kubernetes Service (LoadBalancer) for an ElasticSearch cluster that seems to become unavailable quite randomly, giving ECONNRESET on the client side. The setup is a NodeJS API that queries ElasticSearch through its REST API using the request module, with ElasticSearch exposed through a Kubernetes Service of type LoadBalancer.
After running for long enough (not much, around 10-15 minutes), and not necessarily under a lot of load (around 1 request per second), the ElasticSearch Service seems to become unavailable, the symptoms being ECONNRESET errors on the API side:
{ Error: connect ECONNRESET 100.69.12.100:9200
    at Object.exports._errnoException (util.js:953:11)
    at exports._exceptionWithHostPort (util.js:976:20)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1080:14)
  cause:
   { Error: connect ECONNRESET 100.69.12.100:9200
       at Object.exports._errnoException (util.js:953:11)
       at exports._exceptionWithHostPort (util.js:976:20)
       at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1080:14)
     code: 'ECONNRESET',
     errno: 'ECONNRESET',
     syscall: 'connect',
     address: '100.69.12.100',
     port: 9200 },
  isOperational: true,
  code: 'ECONNRESET',
  errno: 'ECONNRESET',
  syscall: 'connect',
  address: '100.69.12.100',
  port: 9200 }
This error happens quite randomly: no high load is needed, as indicated, and there is no defined interval between occurrences. At the exact times of those ECONNRESET errors, the kube-dns addon logs the creation (recreation?) of the ElasticSearch Service (and other services):
I0515 08:11:49.941166 1 dns.go:264] New service: elasticsearch
I0515 08:11:49.941226 1 dns.go:462] Added SRV record &{Host:elasticsearch.prod.svc.cluster.local. Port:9200 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
I0515 08:11:49.941275 1 dns.go:462] Added SRV record &{Host:elasticsearch.prod.svc.cluster.local. Port:9300 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:}
These lines repeat for every Kubernetes Service, nearly every 5 minutes, continuously. I am quite clueless as to what they mean, and whether this is normal behaviour or definitely related to the failures we are observing.
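Something we still need to check is whether the Service and its Endpoints are actually being torn down and recreated at those moments, or whether kube-dns is merely re-syncing its records. For that we plan to leave the following watches running while reproducing the errors (namespace taken from the kube-dns records above):

$ kubectl get endpoints elasticsearch --namespace=prod --watch
$ kubectl get events --namespace=prod --watch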
The interaction between the API and ElasticSearch over the network is basically a set of search queries (up to 30/50 in parallel). No error logs are observed on the ElasticSearch side during these interactions.
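For reference, each of those queries is a plain HTTP call against the Service made through the request module, roughly along these lines (a minimal sketch: the index name, query body and config module path are made up; config.endpoints.elasticsearch is the real Service endpoint used by the API):

const request = require('request');
const config = require('./config'); // hypothetical path; exposes endpoints.elasticsearch, e.g. 'elasticsearch:9200'

// One search query against the ElasticSearch Service; the API fires up to ~30-50 of these in parallel.
function search(index, query, callback) {
  request({
    method: 'POST',
    url: 'http://' + config.endpoints.elasticsearch + '/' + index + '/_search',
    json: { query: query }  // request serialises the body and parses the JSON response
  }, function (err, response, body) {
    callback(err, body);
  });
}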
This worked fine before, for months; the changes made since then were the Kubernetes cluster upgrade (from 1.4) and the introduction of resource limits.
The actions attempted so far have not resolved it, and we are considering rolling back to Kubernetes 1.4 and removing any resource limits. Any pointers or information are highly appreciated.
UPDATE 1: Some extra info around the elasticsearch Service:
$ kubectl get svc elasticsearch
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
elasticsearch 100.65.113.208 [AWS ELB] 9200:30295/TCP,9300:32535/TCP 6d
$ kubectl describe svc elasticsearch
Name: elasticsearch
Namespace: ppe
Labels: <none>
Annotations: kubernetes.io/change-cause=kubectl create --filename=app/elasticsearch/manifest/ --recursive=true --record=true --namespace=ppe --context=
Selector: component=elasticsearch,provider=fabric8,type=client
Type: LoadBalancer
IP: 100.65.113.208
LoadBalancer Ingress: [AWS ELB]
Port: http 9200/TCP
NodePort: http 30295/TCP
Endpoints: 100.96.1.25:9200,100.96.1.26:9200,100.96.2.24:9200 + 5 more...
Port: transport 9300/TCP
NodePort: transport 32535/TCP
Endpoints: 100.96.1.25:9300,100.96.1.26:9300,100.96.2.24:9300 + 5 more...
Session Affinity: None
Events: <none>
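Since the Service has a ClusterIP and NodePorts in front of the pod endpoints, one thing we intend to try is hitting the different layers separately while the errors occur, to narrow down where the resets happen (IPs/ports taken from the output above; <node-ip> is a placeholder for any worker node):

$ curl -s http://100.96.1.25:9200/      # one pod endpoint, bypassing the Service
$ curl -s http://100.65.113.208:9200/   # the ClusterIP (kube-proxy / iptables path)
$ curl -s http://<node-ip>:30295/       # the NodePort used by the ELB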
UPDATE 2
I was able to mitigate the issue for now by using the official ElasticSearch NodeJS client for the search queries and the communication with the Service. The code in the API had been using the request module to invoke the ElasticSearch REST API directly.
I am still investigating, as the issue is still there, but it seems it does not manifest itself when using the NodeJS client with the following config:
const elasticSearch = require('elasticsearch');

const elasticSearchClient = elasticSearch.Client({
  host: config.endpoints.elasticsearch,
  apiVersion: '2.3',
  maxRetries: 5,
  requestTimeout: 15000,
  deadTimeout: 30000,
  keepAlive: true
});
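With that client, the same searches go through its connection pool instead of ad-hoc request calls; roughly like this (the index and query are made up for illustration):

elasticSearchClient.search({
  index: 'products',
  body: { query: { match: { name: 'foo' } } }
}).then(function (result) {
  console.log(result.hits.total);
}, function (err) {
  console.error(err);  // retries and dead-host handling are done by the client per the config above
});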
UPDATE 3
We observed this behaviour not only with this NodeJS API communicating with ElasticSearch, but also with the kibana-logging Kubernetes service and with elasticdump:
Mon, 15 May 2017 13:32:31 GMT | sent 1000 objects to destination elasticsearch, wrote 1000
Mon, 15 May 2017 13:32:42 GMT | got 1000 objects from source elasticsearch (offset: 24000)
Mon, 15 May 2017 13:32:44 GMT | Error Emitted => failed to parse json (message: "Unexpected token E in JSON at position 0") - source: "Error: 'read tcp 172.20.33.123:46758->100.96.4.15:9200: read: connection reset by peer'\nTrying to reach: 'http://100.96.4.15:9200/ghs.products-2017-05-13/_bulk'"
Those two components and the original NodeJS API all use the request NPM module to communicate with ElasticSearch. Interestingly, the ElasticSearch NodeJS client (https://github.com/elastic/elasticsearch-js) does not seem to use request.
Still investigating, but this could end up being an issue / incompatibility between the ElasticSearch Kubernetes Service as exposed and the request NodeJS module.
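One difference worth noting is that the official client keeps persistent connections (keepAlive: true in the config above), whereas request does not reuse connections by default. A way to test whether that is the distinguishing factor would be to enable keep-alive on the request side as well, for example (a sketch we have not verified yet; config.endpoints.elasticsearch as in the earlier snippet):

const http = require('http');
const request = require('request');

// Reuse TCP connections instead of opening a new one per query.
const keepAliveAgent = new http.Agent({ keepAlive: true });

request({
  method: 'POST',
  url: 'http://' + config.endpoints.elasticsearch + '/products/_search',
  json: { query: { match_all: {} } },
  agent: keepAliveAgent  // or simply `forever: true`, which uses request's built-in keep-alive agent
}, function (err, response, body) { /* ... */ });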