We are running production workloads with Istio 1.1.4. During a specific timeframe, the request latency reported to the telemetry component for client-invoked traffic increased from 50-60ms to 6-7 seconds, and at the same time we started observing 500 (Internal Server Error) response codes from Envoy.
We are trying to understand in which cases Envoy returns 500. The only thing I could find in the documentation and source code is that a 500 is returned if the response body must be buffered and it exceeds the buffer limit. This is certainly not the case for us: the 500s occurred for a health check endpoint, among other endpoints, and its response body is very small.
What are the cases where Envoy will return 500? What should we investigate as the root cause of the issue?
Can you please provide the status code reported by each of the following?
a) Log entry b) Telemetry c) Prometheus and Grafana
Please check whether all three show the response code as 500, or whether any of them deviates.
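For the log entry part, a small script along these lines may help tally what the sidecar access logs actually recorded. This is only a rough sketch: it assumes the proxy writes Envoy's default access log format (`[START_TIME] "METHOD PATH PROTOCOL" RESPONSE_CODE RESPONSE_FLAGS ...`), which your mesh may have overridden, and the sample lines, the `/healthz` path, and the `tally_responses` helper are made up for illustration.

```python
import re
from collections import Counter

# Matches the beginning of Envoy's default access log format (an assumption;
# a custom log format configured in the mesh would need a different pattern):
# [START_TIME] "METHOD PATH PROTOCOL" RESPONSE_CODE RESPONSE_FLAGS ...
LINE_RE = re.compile(
    r'^\[(?P<start>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<code>\d{3}) (?P<flags>\S+)'
)


def tally_responses(lines):
    """Count access log entries per (path, response code, response flags)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything that is not an access log entry
        counts[(match["path"], match["code"], match["flags"])] += 1
    return counts


# Hypothetical sample lines, invented for illustration only.
SAMPLE_LINES = [
    '[2019-05-01T10:00:00.000Z] "GET /healthz HTTP/1.1" 200 - 0 2 55 54 "-" "probe" "id-1" "svc:8080" "10.0.0.5:8080"',
    '[2019-05-01T10:00:07.000Z] "GET /healthz HTTP/1.1" 500 - 0 0 6500 6499 "-" "probe" "id-2" "svc:8080" "10.0.0.5:8080"',
    'not an access log line',
]

if __name__ == "__main__":
    for (path, code, flags), n in sorted(tally_responses(SAMPLE_LINES).items()):
        print(path, code, flags, n)
```

The response flags column is worth comparing alongside the code: a 500 logged with a flag of `-` generally points at the upstream application having produced the response, while a non-empty flag suggests Envoy recorded a condition of its own. The counts can then be compared with what telemetry and Prometheus/Grafana report for the same window.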