GKE streaming large file download fails with partial response

11/14/2019

I have an app hosted on GKE which, among many tasks, serves a zip file to clients. These zip files are constructed on the fly from many individual files on Google Cloud Storage.
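
A minimal sketch of this kind of on-the-fly zip streaming (not my actual code - Flask, the google-cloud-storage client, and the bucket/object names below are just illustrative placeholders): each object is streamed from Cloud Storage into a zip entry, and whatever bytes the archive has produced so far are flushed out to the client as it goes.

```python
import io
import zipfile

from flask import Flask, Response
from google.cloud import storage

app = Flask(__name__)
client = storage.Client()


class ChunkBuffer(io.RawIOBase):
    """Write-only sink that collects bytes so they can be flushed to the HTTP response."""

    def __init__(self):
        self._chunks = []

    def writable(self):
        return True

    def write(self, data):
        self._chunks.append(bytes(data))
        return len(data)

    def drain(self):
        data, self._chunks = b"".join(self._chunks), []
        return data


@app.route("/download")
def download():
    bucket = client.bucket("my-bucket")             # hypothetical bucket name
    names = ["reports/a.csv", "reports/b.csv"]      # hypothetical object names

    def generate():
        sink = ChunkBuffer()
        # ZipFile accepts a non-seekable sink (it falls back to data descriptors),
        # so the archive can be produced incrementally.
        with zipfile.ZipFile(sink, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
            for name in names:
                with zf.open(name, mode="w") as entry:
                    bucket.blob(name).download_to_file(entry)  # stream the object into the entry
                yield sink.drain()                             # flush what has been produced so far
        yield sink.drain()                                     # central directory written on close

    return Response(generate(), mimetype="application/zip")
```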

The issue I'm facing is that when these zips get particularly large, the connection fails randomly partway through (anywhere between 1.4 GB and 2.5 GB). There doesn't seem to be any pattern with timing either - it could happen anywhere between 2 and 8 minutes in.

AFAIK, the connection is being dropped somewhere between the load balancer and my app. Is the GKE ingress (load balancer) known to close long/large connections?

GKE setup:

  • HTTP(S) load balancer ingress
  • NodePort backend service
  • Deployment (my app)

More details/debugging steps:

  • I can't reproduce it locally (without Kubernetes).
  • The load balancer logs statusDetails: "backend_connection_closed_after_partial_response_sent" while the response has a 200 status code. Googling this turned up nothing helpful.
  • Directly accessing the pod and downloading using kubectl port-forward worked successfully.
  • My app logs that the request was cancelled (by the requester).
  • I can verify none of the files are corrupt (I can download them all directly from storage).
-- Taylor Graham
google-cloud-load-balancer
google-cloud-platform
google-kubernetes-engine
kubernetes
kubernetes-ingress

1 Answer

11/18/2019

I believe your "backend_connection_closed_after_partial_response_sent" issue is caused by the WebSocket connection being killed by the back-end prematurely. You can see the documentation on WebSocket proxying in nginx - it explains the nature of this process. In short, by default the WebSocket connection is killed after 10 minutes.

Why does it work when you download the file directly from the pod? Because you're bypassing the load balancer and the WebSocket connection is kept alive properly. When you proxy WebSocket traffic, things start to break, because WebSocket relies on hop-by-hop headers which are not proxied.

A similar case was discussed here. It was resolved by sending ping frames from the back-end to the client.

In my opinion your best shot is to do the same. I've found many cases with similar issues where WebSocket traffic was proxied, and most of them suggest using pings, because a ping resets the connection's idle timer and keeps the connection alive.
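
As an illustration only - your back-end would have to expose the stream over WebSocket for this to apply, and the handler, file path and port below are placeholders - here is a minimal sketch using the third-party Python `websockets` package, which sends ping frames from the server at a fixed interval so the idle timer of anything in front of it keeps getting reset:

```python
import asyncio
import websockets  # third-party: pip install websockets

CHUNK = 64 * 1024

async def serve_zip(ws):
    # Recent versions of the library pass a single connection object;
    # older versions also pass a `path` argument.
    # Illustrative handler: stream an already-built archive in chunks.
    with open("/tmp/archive.zip", "rb") as f:   # hypothetical file
        while chunk := f.read(CHUNK):
            await ws.send(chunk)

async def main():
    async with websockets.serve(
        serve_zip,
        "0.0.0.0",
        8080,
        ping_interval=30,   # server sends a ping frame every 30 s (default is 20 s)
        ping_timeout=30,    # drop the connection if no pong arrives within 30 s
    ):
        await asyncio.Future()  # run forever

asyncio.run(main())
```

The important knob is `ping_interval`: keep it comfortably below whatever idle timeout the proxy or load balancer in front of the back-end enforces.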

Here's more about pinging the client using WebSocket, and about timeouts.

I work for Google and this is as far as I can help you - if this doesn't resolve your issue, you'll have to contact GCP support.

-- W_B
Source: StackOverflow