After following Getting started with Cloud Endpoints on GKE and deploying the API and backend, everything looks like it's working (spoiler alert: it's not):
gcr.io/endpoints-release/endpoints-runtime:1
gcr.io/my-app/server:latest
Both containers are running; kubectl get pods yields:
NAME READY STATUS RESTARTS AGE
my-app-855d8447f-nnvz8 2/2 Running 0 24m
The question "Need improved/HTTPS-centric GKE Ingress + ESP + gRPC example" was very helpful in getting me this far.
Bad news: HTTP requests from my Angular app (running on http://localhost:4200) result in 500 Internal Server Error.
I ran through Troubleshooting Cloud Endpoints in GKE and couldn't see anything that applied (there is nothing obviously wrong).
OK. Time to debug.
Looking at the ESP container logs with kubectl logs -f my-app-855d8447f-nnvz8 -c esp, I can see the preflight (OPTIONS) request and the subsequent POST request.
INFO:Fetching an access token from the metadata service
INFO:Fetching the service config ID from the rollouts service
INFO:Fetching the service configuration from the service management service
INFO:Attribute zone: us-central1-a
INFO:Attribute project_id: my-cloud-project
INFO:Attribute kube_env: KUBE_ENV
nginx: [warn] Using trusted CA certificates file: /etc/nginx/trusted-ca-certificates.crt
143.159.55.199 - - [12/Jan/2020:17:34:11 +0000] "OPTIONS /v1/documents?key=Ei2ZvlV5zASIkuRf4LRl&format=JSON&showDefaults=false HTTP/1.1" 204 0 "http://localhost:4200/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
143.159.55.199 - - [12/Jan/2020:17:34:11 +0000] "POST /v1/documents?key=Ei2ZvlV5zASIkuRf4LRl&format=JSON&showDefaults=false HTTP/1.1" 500 242 "http://localhost:4200/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
The CORS preflight (OPTIONS) request returns HTTP 204, but the POST request returns HTTP 500, so there's clearly a problem here.
Despite plenty of logging, there are no logs from my gRPC server; kubectl logs -f my-app-855d8447f-nnvz8 -c server yields nothing. So somewhere between the CORS preflight and the POST request something goes wrong, and the gRPC service is never called. I can't find any errors. How can I debug what's going on between ESP/nginx and my gRPC server?
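One thing I could try is to bypass ESP and call the gRPC backend directly. A sketch, assuming grpcurl is installed and the server has gRPC reflection enabled; the service and method names here are hypothetical placeholders:

# In one terminal: forward local port 8080 to the server container in the pod.
kubectl port-forward pod/my-app-855d8447f-nnvz8 8080:8080

# In another terminal: list the services the backend exposes
# (needs reflection; otherwise pass -proto with the .proto file).
grpcurl -plaintext localhost:8080 list

# Call the backend directly to see whether it errors on its own.
grpcurl -plaintext -d '{}' localhost:8080 my.package.MyService/MyMethod

If the backend answers here, the problem is somewhere between ESP and the backend (or in what the backend does with the request).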
For completeness, here's my deployment yaml (with added comments):
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-app
  namespace: default
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: esp
        image: gcr.io/endpoints-release/endpoints-runtime:1
        # The default value for the backend address is http://127.0.0.1:8080
        args: [
          "--http_port=8081",
          # My gRPC service is running on port 8080 in the same pod.
          "--backend=grpc://127.0.0.1:8080",
          "--service=my-project.endpoints.my-project-dev.cloud.goog",
          "--rollout_strategy=managed",
          # Makes ESP return code 200 for location /healthz,
          # instead of forwarding the request to the backend.
          "--healthz=healthz",
          # Wide-open CORS during development.
          "--cors_preset=basic",
          "--cors_allow_origin=*",
        ]
        # HEALTH CHECK START
        # Used by ingress to perform health check
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8081
        ports:
        - containerPort: 8081
        # HEALTH CHECK END
      - name: server
        image: gcr.io/my-project-dev/server:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
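Since --healthz=healthz makes ESP answer the health check itself, a quick sanity check (a sketch; 34.107.225.145 is the ingress IP my Angular app is calling, taken from the error further down) is:

curl -i http://34.107.225.145/healthz

A 200 here only confirms that the ingress and ESP are up; it says nothing about the gRPC backend behind them.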
The CORS response headers from the ESP/nginx container are:
Access-Control-Allow-Headers: DNT,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Range,Authorization
Access-Control-Allow-Methods: GET, POST, PUT, PATCH, DELETE, OPTIONS
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Content-Length,Content-Range
Content-Type: application/json
Date: Sun, 12 Jan 2020 17:34:11 GMT
Server: nginx
Transfer-Encoding: chunked
Via: 1.1 google
I don't expect anyone to magic up an answer, but right now I'm unsure how to debug this any further when everything looks good.
A new day and a solution. I'll briefly explain the solution, but focus more on how I discovered it and what I would do differently.
Problem
The "Compute Engine default service account" was missing some permissions. Getting started with Cloud Endpoints on GKE does not use Datastore. My app does. Simply going to IAM & admin > IAM, and adding the "Cloud Datastore User" role to the service account fixed my problem.
Lessons
1. Logging in __init__ of the Python gRPC server doesn't provide as much cover as it seems. AFAICT, the main gRPC service module was being called/loaded in some way, but the insufficient permissions meant that from google.cloud import datastore prevented my Cloud Logging statements from ever being executed:
from concurrent import futures

import grpc

# This requires the "Cloud Datastore User" role.
from google.cloud import datastore


class MyServer(myservice_pb2_grpc.MyServiceServicer):
    """Implements the GRPC interface to handle client calls."""

    def __init__(self):
        # Logging using CloudLoggingHandler was never executed.
        logger.info('__init__')


def serve():
    """Runs a gRPC server to handle client calls."""
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    myservice_pb2_grpc.add_MyServiceServicer_to_server(MyServer(), server)
    # Port 8080 matches ESP's default backend address (http://127.0.0.1:8080).
    server.add_insecure_port('[::]:8080')
    server.start()
    server.wait_for_termination()


if __name__ == '__main__':
    serve()
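What would have helped is logging that runs before anything that can fail. A minimal sketch, assuming the standard library logging module writing to stderr (which lands in kubectl logs ... -c server regardless of Cloud Logging permissions):

import logging
import sys

# Configure plain stderr logging at module level, before any risky imports,
# so a failure during import or start-up is still visible in container logs.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logger = logging.getLogger(__name__)
logger.info('server module loading')

try:
    # Datastore calls made later need the "Cloud Datastore User" role.
    from google.cloud import datastore
except Exception:
    logger.exception('failed to import google.cloud.datastore')
    raise

logger.info('server module loaded')

With this in place, a logger.info in serve() (rather than only in __init__) would also have shown whether the server actually started.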
2. While the Container logs did include a very helpful error message, it was hidden by the default Stackdriver filters:
resource.type="container"
resource.labels.cluster_name="standard-cluster-2"
resource.labels.namespace_id="default"
resource.labels.project_id="my-project-dev"
resource.labels.zone:"us-central1-a"
resource.labels.container_name=("esp" OR "server")
resource.labels.pod_id="my-app-855d8447f-nnvz8"
Seeing my error required a resource.type of api. Moreover, only the project_id of the resource labels was relevant. I strongly recommend viewing api logs for your project; don't just view container logs:
resource.type="api"
resource.labels.project_id="json-schema-dev"
In my case, the exact error could be found with the Stackdriver filters set to:
resource.type="api"
resource.labels.project_id="json-schema-dev"
jsonPayload.log_message:("Exception calling application: 403 Missing or insufficient permissions.")
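The same query can be run from the command line; a sketch with gcloud, using the project ID from the filter above:

gcloud logging read \
    'resource.type="api" AND jsonPayload.log_message:"Missing or insufficient permissions"' \
    --project=json-schema-dev --limit=10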
3. Given that I couldn't find the error in the Stackdriver logs, where did I find it? The browser console. The humble browser console. I was looking only at the network tab of the Developer Console, seeing the HTTP 500 response, and going straight to the Cloud Console.
4. View the full error message. The error received by the browser may render as:
headers: HttpHeaders {normalizedNames: Map(0), lazyUpdate: null, lazyInit: ƒ}
status: 500
statusText: "Internal Server Error"
url:"http://34.107.225.145/v1/documents:inferkey=123&format=JSON&showDefaults=false"
ok: false
name: "HttpErrorResponse"
message: "Http failure response for http://123.456.789.342 500 Internal Server Error".
error: {code: 2, message:...
I didn't expand the error property, which was truncated with an ellipsis. Upon expanding it I got:
code: 2
message: "Exception calling application: 403 Missing or insufficient permissions."
Summary
1. Logging in __init__ may not be enough; log in serve() as well.
2. View api logs, not just container logs.