Kubernetes Kibana operator failures and Nginx ingress timeouts

11/25/2020

I just started implementing a Kubernetes cluster on an Azure Linux VM. I'm very new to all this. The cluster runs on a small VM (2 cores, 16 GB). I set up the ECK stack following their online tutorial, plus an Nginx Ingress controller to expose it.

Most of the day, everything runs fine: I can access the Kibana dashboard, run Elastic queries, and Nginx works. But about once a day, something causes the Kibana Endpoint backing the Kibana Service to lose its IP address, so the Service can no longer route to the container. When this happens, the Kibana pod has a status of Running but shows 0/1 ready. It never triggers any restarts, so the Kibana dashboard becomes inaccessible. I've tried to reproduce this by shutting down the Docker container and by force-killing the pod, but I can't reproduce it reliably.

Looking at the logs on the Kibana pod, there are a bunch of timeout errors. The Nginx logs say it can't find the Endpoint for the Service, so that looks like a plausible source. Has anyone encountered this? Does anyone know a reliable way to prevent it?
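
In case it's useful, these are the standard kubectl commands I use to inspect that state (the gwam-kb-http name and the gwam label come from my setup below):

# Show the Endpoint object backing the Kibana Service
kubectl describe endpoints gwam-kb-http -n default

# Check readiness and recent events on the Kibana pod
kubectl describe pod -n default -l kibana.k8s.elastic.co/name=gwam

# Tail the Kibana pod logs for the timeout errors
kubectl logs -n default -l kibana.k8s.elastic.co/name=gwam --tail=50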

This should probably be a separate question, but the other issue this causes is that it completely blocks all Nginx Ingress traffic. New requests never appear in the logs, and the logs stop entirely after the message about the missing endpoint. As a result, every URL the Ingress normally serves times out and the whole cluster becomes unusable from outside. Deleting the Nginx controller pod fixes it, but the pod never restarts on its own. Can someone explain why an issue like this would completely block Nginx, and why the Nginx pod can't detect this and restart itself?

Edit:

The Nginx logs end with this:

W1126 16:20:31.517113       6 controller.go:950] Service "default/gwam-kb-http" does not have any active Endpoint.
W1126 16:20:34.848942       6 controller.go:950] Service "default/gwam-kb-http" does not have any active Endpoint.
W1126 16:21:52.555873       6 controller.go:950] Service "default/gwam-kb-http" does not have any active Endpoint.

Any further requests time out and do not appear in the logs.

I don't have the logs for the Kibana pod anymore, but they were just repeated timeouts against the Kibana service default/gwam-kb-http (the same service as in the Nginx logs above). This caused the readiness probe to fail and show 0/1 Running, but it did not trigger a restart of the pod.
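
From what I understand, a failing readiness probe only removes the pod from the Service's Endpoints; only a liveness probe triggers a container restart. If I wanted restarts, a rough sketch through the ECK podTemplate might look like the following (the /api/status path, the thresholds, and the kibana container name are my assumptions, not part of the tutorial YAML):

apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: gwam
spec:
  version: 7.10.0
  count: 1
  elasticsearchRef:
    name: gwam
  podTemplate:
    spec:
      containers:
      - name: kibana
        # Hypothetical liveness probe: restart the container when the
        # Kibana status endpoint stops answering over self-signed HTTPS
        livenessProbe:
          httpGet:
            path: /api/status
            port: 5601
            scheme: HTTPS
          initialDelaySeconds: 60
          periodSeconds: 20
          failureThreshold: 3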

Kibana Endpoints when everything is normal:

Name:         gwam-kb-http
Namespace:    default
Labels:       common.k8s.elastic.co/type=kibana
              kibana.k8s.elastic.co/name=gwam
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2020-11-26T16:27:20Z
Subsets:
  Addresses:          10.244.0.6
  NotReadyAddresses:  <none>
  Ports:
    Name   Port  Protocol
    ----   ----  --------
    https  5601  TCP

Events:  <none>

When I run into this issue, Addresses is empty and the pod IP is listed under NotReadyAddresses.
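
Two quick jsonpath queries (plain kubectl) show which list the pod IP is in:

# Ready addresses (empty when the problem occurs)
kubectl get endpoints gwam-kb-http -o jsonpath='{.subsets[*].addresses[*].ip}'

# Not-ready addresses (where the pod IP shows up during the problem)
kubectl get endpoints gwam-kb-http -o jsonpath='{.subsets[*].notReadyAddresses[*].ip}'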

I'm using the very basic YAML from the ECK setup tutorial:

Elasticsearch (no problems here):

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: gwam
spec:
  version: 7.10.0
  nodeSets:
  - name: default
    count: 3
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 2Gi
        storageClassName: elasticsearch

Kibana:

apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: gwam
spec:
  version: 7.10.0
  count: 1
  elasticsearchRef:
    name: gwam

Ingress for the Kibana service:

kind: Ingress
apiVersion: extensions/v1beta1
metadata:
  name: nginx-ingress-secure-backend-no-rewrite
  annotations: 
    kubernetes.io/ingress.class: nginx
    nginx.org/proxy-connect-timeout: "30s"
    nginx.org/proxy-read-timeout: "20s"
    nginx.org/proxy-send-timeout: "60s"
    nginx.org/client-max-body-size: "4m"
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
spec:
  tls: 
  - hosts:
    - <internal company site>
    secretName: gwam-tls-secret
  rules:
    - host: <internal company site>
      http:
        paths:
          - path: /
            backend:
              serviceName: gwam-kb-http
              servicePort: 5601

Some more environment details:

Kubernetes version: 1.19.3
OS: Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-1031-azure x86_64)

Edit 2:

Seems like I'm getting some kind of network error here. None of my pods can do an nslookup for kubernetes.default. All the networking pods are running, but after enabling logging on CoreDNS, I'm seeing the following:

[ERROR] plugin/errors: 2 1699910358767628111.9001703618875455268. HINFO: read udp 10.244.0.69:35222->10.234.44.20:53: i/o timeout
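
For the lookup test, something along the lines of the Kubernetes DNS-debugging docs reproduces the failure from inside a pod (the dnsutils image name is from those docs; any image with nslookup would do):

# Run a throwaway pod and query cluster DNS from inside it
kubectl run -it --rm dnsutils \
  --image=gcr.io/kubernetes-e2e-test-images/dnsutils:1.3 \
  --restart=Never -- nslookup kubernetes.default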

I'm using Flannel for my network. I'm thinking of resetting the cluster and switching to Calico, and of increasing nf_conntrack_max, as some answers suggest.
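
For reference, the nf_conntrack_max bump those answers describe would be something like this (131072 is an arbitrary example value, not a recommendation):

# Check current conntrack usage against the limit
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# Raise the limit
sysctl -w net.netfilter.nf_conntrack_max=131072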

-- Mark Sherer
elasticsearch
kibana
kubernetes
nginx-ingress

1 Answer

11/30/2020

This ended up being a very simple mistake on my part. I thought it was a pod or DNS issue, but it was just a general network issue: IP forwarding was turned off on the VM. I turned it on with:

sysctl -w net.ipv4.ip_forward=1

and added net.ipv4.ip_forward=1 to /etc/sysctl.conf so it persists across reboots.
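
To double-check, the running value and the persisted config can be verified with:

# Should print net.ipv4.ip_forward = 1
sysctl net.ipv4.ip_forward

# Reload /etc/sysctl.conf and echo the settings it applies
sysctl -p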

-- Mark Sherer
Source: StackOverflow