NEG says Pods are 'unhealthy', but actually the Pods are healthy

9/26/2019

I'm trying to set up gRPC load balancing with Ingress on GCP, and for this I referenced this example. The example demonstrates gRPC load balancing in two ways (one with an Envoy sidecar, and the other with an HTTP mux that handles both gRPC and the HTTP health check in the same Pod). However, the Envoy proxy example doesn't work for me.

What confuses me is that the Pods are running and healthy (confirmed with kubectl describe and kubectl logs):

$ kubectl get pods
NAME                             READY   STATUS    RESTARTS   AGE
fe-deployment-757ffcbd57-4w446   2/2     Running   0          4m22s
fe-deployment-757ffcbd57-xrrm9   2/2     Running   0          4m22s


$ kubectl describe pod fe-deployment-757ffcbd57-4w446
Name:               fe-deployment-757ffcbd57-4w446
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc/10.128.0.64
Start Time:         Thu, 26 Sep 2019 16:15:18 +0900
Labels:             app=fe
                    pod-template-hash=757ffcbd57
Annotations:        kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container fe-envoy; cpu request for container fe-container
Status:             Running
IP:                 10.56.1.29
Controlled By:      ReplicaSet/fe-deployment-757ffcbd57
Containers:
  fe-envoy:
    Container ID:  docker://b4789909494f7eeb8d3af66cb59168e009c582d412d8ca683a7f435559989421
    Image:         envoyproxy/envoy:latest
    Image ID:      docker-pullable://envoyproxy/envoy@sha256:9ef9c4fd6189fdb903929dc5aa0492a51d6783777de65e567382ac7d9a28106b
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /usr/local/bin/envoy
    Args:
      -c
      /data/config/envoy.yaml
    State:          Running
      Started:      Thu, 26 Sep 2019 16:15:19 +0900
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        100m
    Liveness:     http-get https://:fe/_ah/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get https://:fe/_ah/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /data/certs from certs-volume (rw)
      /data/config from envoy-config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-c7nqc (ro)
  fe-container:
    Container ID:  docker://a533224d3ea8b5e4d5e268a616d73762b37df69f434342459f35caa8fac32dab
    Image:         salrashid123/grpc_only_backend
    Image ID:      docker-pullable://salrashid123/grpc_only_backend@sha256:ebfac594116445dd67aff7c9e7a619d73222b60947e46ef65ee6d918db3e1f4b
    Port:          50051/TCP
    Host Port:     0/TCP
    Command:
      /grpc_server
    Args:
      --grpcport
      :50051
      --insecure
    State:          Running
      Started:      Thu, 26 Sep 2019 16:15:20 +0900
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        100m
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-c7nqc (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  certs-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fe-secret
    Optional:    false
  envoy-config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      envoy-configmap
    Optional:  false
  default-token-c7nqc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-c7nqc
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                    From                                                          Message
  ----     ------     ----                   ----                                                          -------
  Normal   Scheduled  4m25s                  default-scheduler                                             Successfully assigned default/fe-deployment-757ffcbd57-4w446 to gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc
  Normal   Pulled     4m25s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  Container image "envoyproxy/envoy:latest" already present on machine
  Normal   Created    4m24s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  Created container
  Normal   Started    4m24s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  Started container
  Normal   Pulling    4m24s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  pulling image "salrashid123/grpc_only_backend"
  Normal   Pulled     4m24s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  Successfully pulled image "salrashid123/grpc_only_backend"
  Normal   Created    4m24s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  Created container
  Normal   Started    4m23s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  Started container
  Warning  Unhealthy  4m10s (x2 over 4m20s)  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  4m9s (x2 over 4m19s)   kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-l7vc  Liveness probe failed: HTTP probe failed with statuscode: 503


$ kubectl describe pod fe-deployment-757ffcbd57-xrrm9
Name:               fe-deployment-757ffcbd57-xrrm9
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9/10.128.0.22
Start Time:         Thu, 26 Sep 2019 16:15:18 +0900
Labels:             app=fe
                    pod-template-hash=757ffcbd57
Annotations:        kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container fe-envoy; cpu request for container fe-container
Status:             Running
IP:                 10.56.0.23
Controlled By:      ReplicaSet/fe-deployment-757ffcbd57
Containers:
  fe-envoy:
    Container ID:  docker://255dd6cab1e681e30ccfe158f7d72540576788dbf6be60b703982a7ecbb310b1
    Image:         envoyproxy/envoy:latest
    Image ID:      docker-pullable://envoyproxy/envoy@sha256:9ef9c4fd6189fdb903929dc5aa0492a51d6783777de65e567382ac7d9a28106b
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /usr/local/bin/envoy
    Args:
      -c
      /data/config/envoy.yaml
    State:          Running
      Started:      Thu, 26 Sep 2019 16:15:19 +0900
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        100m
    Liveness:     http-get https://:fe/_ah/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get https://:fe/_ah/health delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /data/certs from certs-volume (rw)
      /data/config from envoy-config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-c7nqc (ro)
  fe-container:
    Container ID:  docker://f6a0246129cc89da846c473daaa1c1770d2b5419b6015098b0d4f35782b0a9da
    Image:         salrashid123/grpc_only_backend
    Image ID:      docker-pullable://salrashid123/grpc_only_backend@sha256:ebfac594116445dd67aff7c9e7a619d73222b60947e46ef65ee6d918db3e1f4b
    Port:          50051/TCP
    Host Port:     0/TCP
    Command:
      /grpc_server
    Args:
      --grpcport
      :50051
      --insecure
    State:          Running
      Started:      Thu, 26 Sep 2019 16:15:20 +0900
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        100m
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-c7nqc (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  certs-volume:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fe-secret
    Optional:    false
  envoy-config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      envoy-configmap
    Optional:  false
  default-token-c7nqc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-c7nqc
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                   From                                                          Message
  ----     ------     ----                  ----                                                          -------
  Normal   Scheduled  5m8s                  default-scheduler                                             Successfully assigned default/fe-deployment-757ffcbd57-xrrm9 to gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9
  Normal   Pulled     5m8s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  Container image "envoyproxy/envoy:latest" already present on machine
  Normal   Created    5m7s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  Created container
  Normal   Started    5m7s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  Started container
  Normal   Pulling    5m7s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  pulling image "salrashid123/grpc_only_backend"
  Normal   Pulled     5m7s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  Successfully pulled image "salrashid123/grpc_only_backend"
  Normal   Created    5m7s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  Created container
  Normal   Started    5m6s                  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  Started container
  Warning  Unhealthy  4m53s (x2 over 5m3s)  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  4m52s (x2 over 5m2s)  kubelet, gke-ingress-grpc-loadbal-default-pool-92d3aed5-52l9  Liveness probe failed: HTTP probe failed with statuscode: 503


$ kubectl get services
NAME             TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)           AGE
fe-srv-ingress   NodePort       10.123.5.165   <none>         8080:30816/TCP    6m43s
fe-srv-lb        LoadBalancer   10.123.15.36   35.224.69.60   50051:30592/TCP   6m42s
kubernetes       ClusterIP      10.123.0.1     <none>         443/TCP           2d2h


$ kubectl describe service fe-srv-ingress
Name:                     fe-srv-ingress
Namespace:                default
Labels:                   type=fe-srv
Annotations:              cloud.google.com/neg: {"ingress": true}
                          cloud.google.com/neg-status:
                            {"network_endpoint_groups":{"8080":"k8s1-963b7b91-default-fe-srv-ingress-8080-e459b0d2"},"zones":["us-central1-a"]}
                          kubectl.kubernetes.io/last-applied-configuration:
                            {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{"cloud.google.com/neg":"{\"ingress\": true}","service.alpha.kubernetes.io/a...
                          service.alpha.kubernetes.io/app-protocols: {"fe":"HTTP2"}
Selector:                 app=fe
Type:                     NodePort
IP:                       10.123.5.165
Port:                     fe  8080/TCP
TargetPort:               8080/TCP
NodePort:                 fe  30816/TCP
Endpoints:                10.56.0.23:8080,10.56.1.29:8080
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason  Age    From            Message
  ----    ------  ----   ----            -------
  Normal  Create  6m47s  neg-controller  Created NEG "k8s1-963b7b91-default-fe-srv-ingress-8080-e459b0d2" for default/fe-srv-ingress-8080/8080 in "us-central1-a".
  Normal  Attach  6m40s  neg-controller  Attach 2 network endpoint(s) (NEG "k8s1-963b7b91-default-fe-srv-ingress-8080-e459b0d2" in zone "us-central1-a")

but the NEG says they are unhealthy (so the Ingress also reports the backend as unhealthy).
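
(For reference, the per-endpoint health the NEG reports can be inspected through the backend service the Ingress creates; the backend-service name below is a placeholder for the auto-generated one in your project.)

$ gcloud compute backend-services list
$ gcloud compute backend-services get-health <auto-generated-backend-service> --global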

I couldn't find what caused this. Does anyone know how to solve it?

Test environment:

  1. GKE, 1.13.7-gke.8 (VPC enabled)
  2. Default HTTP(s) load balancer on Ingress

YAML files I used (the same as in the example mentioned above):

envoy-configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-configmap
  labels:
    app: fe
data:
  config: |-
    ---
    admin:
      access_log_path: /dev/null
      address:
        socket_address:
          address: 127.0.0.1
          port_value: 9000
    node:
      cluster: service_greeter
      id: test-id
    static_resources:
      listeners:
      - name: listener_0
        address:
          socket_address: { address: 0.0.0.0, port_value: 8080 }
        filter_chains:
        - filters:
          - name: envoy.http_connection_manager
            config:
              stat_prefix: ingress_http
              codec_type: AUTO
              route_config:
                name: local_route
                virtual_hosts:
                - name: local_service
                  domains: ["*"]
                  routes:
                  - match:
                      path: "/echo.EchoServer/SayHello"
                    route: { cluster: local_grpc_endpoint  }
              http_filters:
              - name: envoy.lua
                config:
                  inline_code: |
                    package.path = "/etc/envoy/lua/?.lua;/usr/share/lua/5.1/nginx/?.lua;/etc/envoy/lua/" .. package.path
                    function envoy_on_request(request_handle)

                      if request_handle:headers():get(":path") == "/_ah/health" then
                        local headers, body = request_handle:httpCall(
                        "local_admin",
                        {
                          [":method"] = "GET",
                          [":path"] = "/clusters",
                          [":authority"] = "local_admin"
                        },"", 50)


                        str = "local_grpc_endpoint::127.0.0.1:50051::health_flags::healthy"
                        if string.match(body, str) then
                          request_handle:respond({[":status"] = "200"},"ok")
                        else
                          request_handle:logWarn("Envoy healthcheck failed")     
                          request_handle:respond({[":status"] = "503"},"unavailable")
                        end
                      end
                    end              
              - name: envoy.router
                typed_config: {}
          tls_context:
            common_tls_context:
              tls_certificates:
                - certificate_chain:
                    filename: "/data/certs/tls.crt"
                  private_key:
                    filename: "/data/certs/tls.key"
      clusters:
      - name: local_grpc_endpoint
        connect_timeout: 0.05s
        type:  STATIC
        http2_protocol_options: {}
        lb_policy: ROUND_ROBIN
        common_lb_config:
          healthy_panic_threshold:
            value: 50.0   
        health_checks:
          - timeout: 1s
            interval: 5s
            interval_jitter: 1s
            no_traffic_interval: 5s
            unhealthy_threshold: 1
            healthy_threshold: 3
            grpc_health_check:
              service_name: "echo.EchoServer"
              authority: "server.domain.com"
        hosts:
        - socket_address:
            address: 127.0.0.1
            port_value: 50051
      - name: local_admin
        connect_timeout: 0.05s
        type:  STATIC
        lb_policy: ROUND_ROBIN
        hosts:
        - socket_address:
            address: 127.0.0.1
            port_value: 9000
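
The Lua filter above answers /_ah/health by calling Envoy's admin /clusters endpoint and string-matching the healthy flag of local_grpc_endpoint. A rough way to check both by hand (assuming curl is available in the envoyproxy/envoy image, which may depend on the tag) is:

$ kubectl exec fe-deployment-757ffcbd57-4w446 -c fe-envoy -- \
    curl -sk https://127.0.0.1:8080/_ah/health
$ kubectl exec fe-deployment-757ffcbd57-4w446 -c fe-envoy -- \
    curl -s http://127.0.0.1:9000/clusters | grep local_grpc_endpoint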

fe-deployment.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: fe-deployment
  labels:
    app: fe
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: fe
    spec:
      containers:

      - name: fe-envoy
        image: envoyproxy/envoy:latest
        imagePullPolicy: IfNotPresent
        livenessProbe:
          httpGet:
            path: /_ah/health
            scheme: HTTPS
            port: fe
        readinessProbe:
          httpGet:
            path: /_ah/health
            scheme: HTTPS
            port: fe
        ports:
        - name: fe
          containerPort: 8080
          protocol: TCP               
        command: ["/usr/local/bin/envoy"]
        args: ["-c", "/data/config/envoy.yaml"]
        volumeMounts:
        - name: certs-volume
          mountPath: /data/certs
        - name: envoy-config-volume
          mountPath: /data/config

      - name: fe-container
        image: salrashid123/grpc_only_backend  # Runs a gRPC server (secure or insecure) on the port passed via --grpcport (:50051). Port 50051 is also EXPOSEd in the Dockerfile.
        imagePullPolicy: Always         
        ports:
        - containerPort: 50051
          protocol: TCP                 
        command: ["/grpc_server"]
        args: ["--grpcport", ":50051", "--insecure"]

      volumes:
        - name: certs-volume
          secret:
            secretName: fe-secret
        - name: envoy-config-volume
          configMap:
             name: envoy-configmap
             items:
              - key: config
                path: envoy.yaml
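
The grpc_health_check in the ConfigMap above calls the standard grpc.health.v1.Health/Check service on the backend at 127.0.0.1:50051. A quick manual check (assuming grpcurl is installed locally and the server exposes gRPC reflection; otherwise point grpcurl at health.proto) looks like:

$ kubectl port-forward deployment/fe-deployment 50051:50051 &
$ grpcurl -plaintext -d '{"service": "echo.EchoServer"}' \
    localhost:50051 grpc.health.v1.Health/Check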

fe-srv-ingress.yaml

---
apiVersion: v1
kind: Service
metadata:
  name: fe-srv-ingress
  labels:
    type: fe-srv
  annotations:
    service.alpha.kubernetes.io/app-protocols: '{"fe":"HTTP2"}'
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: NodePort 
  ports:
  - name: fe
    port: 8080
    protocol: TCP
    targetPort: 8080       
  selector:
    app: fe

fe-ingress.yaml

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: fe-ingress
  annotations:
    kubernetes.io/ingress.allow-http: "false"
spec:
  tls:
  - hosts:
    - server.domain.com
    secretName: fe-secret
  rules:
  - host: server.domain.com  
    http:
      paths:
      - path: /echo.EchoServer/*
        backend:
          serviceName: fe-srv-ingress
          servicePort: 8080
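
On GKE the Ingress controller also mirrors the backend health into the ingress.kubernetes.io/backends annotation, so the same unhealthy state should be visible with:

$ kubectl describe ingress fe-ingress
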
-- isbee
google-cloud-platform
google-kubernetes-engine
kubernetes
kubernetes-health-check

2 Answers

1/16/2020

I had to allow traffic from the IP ranges specified as the health-check source in the documentation (130.211.0.0/22 and 35.191.0.0/16); see https://cloud.google.com/kubernetes-engine/docs/how-to/standalone-neg. I had to allow it both for the default network and for the new (regional) network the cluster lives in. Once I added these firewall rules, the health checks could reach the Pods exposed in the NEG, which is used as a regional backend within a backend service of our HTTP(S) load balancer.

There may be a more restrictive firewall setup, but I just cut corners and allowed everything from the IP ranges declared as the health-check source ranges on the page referenced above.
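
For reference, a minimal sketch of such a rule (the rule name is a placeholder, and 8080 is the NEG endpoint port from the question; a tighter setup would also restrict target tags):

$ gcloud compute firewall-rules create allow-gclb-health-checks \
    --network=default \
    --direction=INGRESS \
    --action=ALLOW \
    --rules=tcp:8080 \
    --source-ranges=130.211.0.0/22,35.191.0.0/16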

-- Kote Isaev
Source: StackOverflow

9/27/2019

A GCP committer says this is essentially a bug, so there is no way to fix it at this time.

The related issue is this, and a pull request is now in progress.

-- isbee
Source: StackOverflow