GKE Node.js app: Error during WebSocket handshake

10/28/2020

I am quite new to Google Kubernetes Engine, and to Kubernetes in general. I've created a "production-ish" Kubernetes cluster running my old Node.js app that uses socket.io. To do that I followed Google's "Deploying a containerized web application" how-to, and then set up a Load Balancer with an Ingress that uses a Managed Certificate, loosely following another guide, Using Google-managed SSL certificates (the "Setting up the managed certificate" part). This left me with a cluster using one node pool with three instance groups, each containing 1-2 nodes.

The backend came up and the frontends were able to connect to it. The problem is with WebSockets: the frontend gets the error WebSocket connection to 'wss://mycooldomain.com/socket.io/?EIO=3&transport=websocket&sid=afskjaisfhf-afasfoiaofis' failed: Error during WebSocket handshake: Unexpected response code: 400, which I've been trying to figure out all day.

The latter of the two guides describes creating a NodePort Service and a Managed Certificate, and then an Ingress that links the two together. To fix the problem, I decided to attach a different backend config (a BackendConfig) to the load balancer:

apiVersion: cloud.google.com/v1beta1
kind: BackendConfig
metadata:
  name: my-cool-backendconfig
  namespace: my-cool-namespace
spec:
  timeoutSec: 60
  connectionDraining:
    drainingTimeoutSec: 30
  sessionAffinity:
    affinityType: "CLIENT_IP"

The reason for creating this was to try different timeout values in order to keep the WebSocket connection alive. I've also tried values like timeoutSec: 20000 or drainingTimeoutSec: 3000. The sessionAffinity part also came from many StackOverflow threads and GitHub issues.
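For reference, one of those variants looked roughly like this (the same BackendConfig, only with the larger values mentioned above):

apiVersion: cloud.google.com/v1beta1
kind: BackendConfig
metadata:
  name: my-cool-backendconfig
  namespace: my-cool-namespace
spec:
  timeoutSec: 20000            # much larger backend timeout
  connectionDraining:
    drainingTimeoutSec: 3000   # much larger draining timeout
  sessionAffinity:
    affinityType: "CLIENT_IP"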

That config then had to be applied to my NodePort Service:

apiVersion: v1
kind: Service
metadata:
  namespace: my-cool-namespace
  name: my-cool-nodeport
  labels:
    app: my-cool-app
  annotations:
    cloud.google.com/backend-config: '{"ports": {"80":"my-cool-backendconfig"}}'
spec:
  selector:
    app: my-cool-app
  type: NodePort
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080

And to my Ingress, if I understood correctly:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  namespace: my-cool-namespace
  name: my-cool-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: my-cool-global-ip
    networking.gke.io/managed-certificates: my-cool-certificate
    cloud.google.com/backend-config: '{"default": "my-cool-backendconfig"}'
spec:
  backend:
    serviceName: my-cool-nodeport
    servicePort: 80

After trying different timeout values, I've noticed that the Error during WebSocket handshake: Unexpected response code: 400 error does not necessarily happen on every socket.emit(); it is rather intermittent, depending, I guess, on whether the load balancer has allowed the connection through.

Although Google's guides mention using larger timeout values, even the most extreme ones (such as timeoutSec: 20000, as described above) don't really help establish stable WebSocket connections; they still end up throwing the error occasionally.

Looking at the problem from the backend/frontend Node apps' standpoint, I've only gone as far as changing the socket.io config to try to establish the websocket connection first, before falling back to polling:

const http = require('http');
// 'app' is the existing HTTP request handler (e.g. an Express app)
const server = http.createServer(app);
const io = require('socket.io').listen(server);
// try the websocket transport first, then fall back to long polling
io.set('transports', ['websocket', 'polling']);

Which didn't help either.
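For completeness, the frontend connects roughly like this (a minimal sketch, not my exact client code; the URL and options are placeholders):

// socket.io-client side: connect through the HTTPS load balancer,
// preferring the websocket transport over long polling
const socket = require('socket.io-client')('https://mycooldomain.com', {
  transports: ['websocket', 'polling']
});
socket.on('connect', () => {
  console.log('connected via', socket.io.engine.transport.name);
});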

How do I make it work without throwing the error every now and then?

Bonus question: I've noticed that a lot of users with the same or a similar problem use the Nginx Ingress controller. Is that necessary for proper load balancing at all, or is it only for real production environments?

-- Pablo Kirilo
google-kubernetes-engine
kubernetes
load-balancing
socket.io
websocket

1 Answer

11/10/2020

If it fails only in some of the cases, this might be because you're losing session affinity. For starters, read the session affinity docs. Based on what you're saying, you might be losing it because you have multiple nodes. Try scaling down to one node and see what happens. If the same issue persists, check how many replicas run on each node and try reducing them to one replica per node; perhaps you're losing session affinity at the replica level. You will probably be able to scale up again as long as you keep one replica per node.
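For example, you could temporarily scale things down like this (the cluster, node pool and deployment names below are placeholders, adjust them to your setup):

# run the node pool on a single node
gcloud container clusters resize my-cluster --node-pool my-pool --num-nodes 1

# run a single replica of the app while testing
kubectl scale deployment my-cool-app --replicas=1 -n my-cool-namespace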

-- unsame
Source: StackOverflow