Google Endpoints on GKE (ESP + gRPC) - how to debug this further?

1/12/2020

After following Getting started with Cloud Endpoints on GKE and deploying the API and backend, everything looks like it's working (spoiler alert: it's not):

  • My application's deployed on a cluster.
  • A single pod is deployed in Kubernetes.
  • The pod is running two containers (ESP and my gRPC service):
    1. gcr.io/endpoints-release/endpoints-runtime:1
    2. gcr.io/my-app/server:latest

...both are running; kubectl get pods yields:

NAME                          READY   STATUS    RESTARTS   AGE
my-app-855d8447f-nnvz8        2/2     Running   0          24m

The question "Need improved/HTTPS-centric GKE Ingress + ESP + gRPC example" was very helpful in getting me this far.

Bad news: HTTP requests from my Angular app (running on http://localhost:4200) result in 500 Internal Server Error.

I ran through Troubleshooting Cloud Endpoints in GKE and couldn't see anything that applied (there is nothing obviously wrong).

OK. Time to debug.

Looking at the ESP (Docker container) logs with kubectl logs -f my-app-855d8447f-nnvz8 -c esp, I can see the preflight request (OPTIONS) and subsequent HTTP POST request.

INFO:Fetching an access token from the metadata service
INFO:Fetching the service config ID from the rollouts service
INFO:Fetching the service configuration from the service management service
INFO:Attribute zone: us-central1-a
INFO:Attribute project_id: my-cloud-project
INFO:Attribute kube_env: KUBE_ENV

nginx: [warn] Using trusted CA certificates file: /etc/nginx/trusted-ca-certificates.crt

143.159.55.199 - - [12/Jan/2020:17:34:11 +0000] "OPTIONS /v1/documents?key=Ei2ZvlV5zASIkuRf4LRl&format=JSON&showDefaults=false HTTP/1.1" 204 0 "http://localhost:4200/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
143.159.55.199 - - [12/Jan/2020:17:34:11 +0000] "POST /v1/documents?key=Ei2ZvlV5zASIkuRf4LRl&format=JSON&showDefaults=false HTTP/1.1" 500 242 "http://localhost:4200/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"

The CORS (OPTIONS) request results in HTTP 204; the subsequent POST, however, returns HTTP 500. So there's clearly a problem here.

Despite plenty of logging, there are no logs from my gRPC server; kubectl logs -f my-app-855d8447f-nnvz8 -c server yields nothing. So:

  • The Kubernetes pod is healthy and both containers are running.
  • The ESP is starting and successfully using the metadata service
  • CORS is working (it's wide-open while I'm developing)
  • The gRPC service is running but not being hit (no logs)

Somewhere between the CORS (OPTIONS) request and the POST, something goes wrong and the gRPC service never gets called.

I can't find any errors. How can I debug what's going on between ESP/nginx and my gRPC server?
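One idea I have is to bypass ESP entirely and call the gRPC service directly over a kubectl port-forward, to confirm whether the backend itself responds. A rough sketch of what I mean (the stub and method names below are placeholders, not my real generated code):

# First, in another terminal:
#   kubectl port-forward my-app-855d8447f-nnvz8 8080:8080
# Then call the backend directly, with ESP out of the picture.
import grpc

import myservice_pb2
import myservice_pb2_grpc


def call_backend_directly():
    # Plaintext channel straight to the server container's port.
    with grpc.insecure_channel('localhost:8080') as channel:
        stub = myservice_pb2_grpc.MyServiceStub(channel)
        try:
            response = stub.CreateDocument(myservice_pb2.CreateDocumentRequest())
            print('backend responded:', response)
        except grpc.RpcError as err:
            # Any failure here carries a gRPC status code and details,
            # rather than ESP's opaque HTTP 500.
            print('backend error:', err.code(), err.details())


if __name__ == '__main__':
    call_backend_directly()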

For completeness, here's my deployment yaml (with added comments):

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-app
  namespace: default
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: esp
        image: gcr.io/endpoints-release/endpoints-runtime:1
        # The default value for the backend address is http://127.0.0.1:8080
        args: [
          "--http_port=8081",

          # My gRPC service is running on port 8080 in the same pod.
          "--backend=grpc://127.0.0.1:8080",
          "--service=my-project.endpoints.my-project-dev.cloud.goog",
          "--rollout_strategy=managed",

          # Makes ESP return code 200 for location /healthz,
          # instead of forwarding the request to the backend.
          "--healthz=healthz",

          # Wide-open CORS during development.
          "--cors_preset=basic",
          "--cors_allow_origin=*",
        ]
        # HEALTH CHECK START
        # Used by ingress to perform health check
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8081
        ports:
          - containerPort: 8081
        # HEALTH CHECK END
      - name: server
        image: gcr.io/my-project-dev/server:latest
        imagePullPolicy: Always
        ports:
          - containerPort: 8080

The CORS response headers from the ESP/nginx container are:

Access-Control-Allow-Headers: DNT,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Range,Authorization
Access-Control-Allow-Methods: GET, POST, PUT, PATCH, DELETE, OPTIONS
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Content-Length,Content-Range
Content-Type: application/json
Date: Sun, 12 Jan 2020 17:34:11 GMT
Server: nginx
Transfer-Encoding: chunked
Via: 1.1 google

I don't expect anyone to magic-up an answer, but right now I'm unsure how to debug this any further when everything looks good.

-- Jack
docker
google-kubernetes-engine
kubernetes
nginx

1 Answer

1/13/2020

A new day and a solution. I'll briefly explain the solution, but focus more on how I discovered it and what I would do differently.

Problem

The "Compute Engine default service account" was missing some permissions. Getting started with Cloud Endpoints on GKE does not use Datastore. My app does. Simply going to IAM & admin > IAM, and adding the "Cloud Datastore User" role to the service account fixed my problem.

Lessons

1. Logging in __init__ of the Python gRPC server doesn't provide as much coverage as it seems. AFAICT, the main gRPC service module was being called/loaded in some way, but with the missing permissions the from google.cloud import datastore line meant my Cloud log statements were never executed:

import logging
from concurrent import futures

import grpc

# This import requires the "Cloud Datastore User" role.
from google.cloud import datastore

import myservice_pb2_grpc

logger = logging.getLogger(__name__)  # configured with CloudLoggingHandler elsewhere


class MyServer(myservice_pb2_grpc.MyServiceServicer):
    """Implements the gRPC interface to handle client calls."""

    def __init__(self):
        # Logging using CloudLoggingHandler was never executed.
        logger.info('__init__')


def serve():
    """Runs a gRPC server to handle client calls."""
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    myservice_pb2_grpc.add_MyServiceServicer_to_server(MyServer(), server)

    # ESP forwards to grpc://127.0.0.1:8080, so listen on port 8080.
    server.add_insecure_port('[::]:8080')
    server.start()
    server.wait_for_termination()


if __name__ == '__main__':
    serve()
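What I'd do differently: emit a log line at module import time and at the top of serve(), before anything application-specific can get in the way. A rough sketch (using a plain stdlib logger here; my real code would use the same CloudLoggingHandler-backed logger, but anything written to stdout also shows up in the container logs, which is all I needed):

# Sketch of the extra logging that would have made the failure visible sooner.
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info('module imported')  # emitted as soon as this module loads


def serve():
    # Logging before the servicer is constructed distinguishes
    # "server never started" from "server started but never called".
    logger.info('serve() starting')


if __name__ == '__main__':
    serve()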

2. While the Stackdriver logs did include a very helpful error message, it was hidden by the default Container-log filters:

resource.type="container"
resource.labels.cluster_name="standard-cluster-2"
resource.labels.namespace_id="default"
resource.labels.project_id="my-project-dev"
resource.labels.zone:"us-central1-a"
resource.labels.container_name=("esp" OR "server")
resource.labels.pod_id="my-app-855d8447f-nnvz8"

Seeing my error required a resource.type of api. Moreover, only the project_id of the resource labels was relevant. I strongly recommend viewing api logs for your project. Don't just view container logs:

resource.type="api"
resource.labels.project_id="json-schema-dev"

In my case, the exact error could be found with Stackdriver filters set to

resource.type="api"
resource.labels.project_id="json-schema-dev"
jsonPayload.log_message:("Exception calling application: 403 Missing or insufficient permissions.")
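If you prefer searching from code to clicking around the console, the same filter can be run with the Cloud Logging client library; a rough sketch (assuming google-cloud-logging is installed and the caller can read the project's logs):

# Run the same Stackdriver/Cloud Logging filter from Python.
from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project='json-schema-dev')

log_filter = (
    'resource.type="api" '
    'jsonPayload.log_message:"Exception calling application: '
    '403 Missing or insufficient permissions."'
)

for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.payload)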

3. Given that I couldn't find the error in the Stackdriver logs, where did I find it? The browser console. The humble browser console. I was looking only at the network tab of the Developer Console, seeing the HTTP 500 response, and going straight to the Cloud Console.

4. View the full error message. The error received by the browser may render as:

headers: HttpHeaders {normalizedNames: Map(0), lazyUpdate: null, lazyInit: ƒ}
status: 500
statusText: "Internal Server Error"
url:"http://34.107.225.145/v1/documents:inferkey=123&format=JSON&showDefaults=false"
ok: false
name: "HttpErrorResponse"
message: "Http failure response for http://123.456.789.342 500 Internal Server Error".
error: {code: 2, message:...

I didn't expand the error property, which was truncated with an ellipsis. Upon expanding it I got:

code: 2
message: "Exception calling application: 403 Missing or insufficient permissions."

Summary

  • Logging in __init__ may not be enough.
    • Consider logging in serve()
  • Merely importing a module for which you have insufficient permissions may prevent log statements from being executed.
  • Check api logs, not just container logs.
    • Kubernetes links to Container logs for convenience, but the default filters are decidedly inconvenient.
  • Spend more time looking at the client error, and don't be so quick to move to the server.
    • Expand the console error message as much as possible (go beyond the response code)
-- Jack
Source: StackOverflow