I have a microservice (let's call it the publisher) which generates events for a particular resource. There's a web app which shows an up-to-date list of that resource, but this view doesn't talk to the microservice directly. There's a view service in between which calls the publisher whenever the web app requests the list of resources.
To show the latest updates, the web app currently uses long polling: it calls the view service, and the view service calls the publisher. (The view service does more than that, e.g. collecting data from other sources, but that's probably not relevant here.) The publisher also publishes every update on all the resources to a pubsub topic, which the view service is currently not using.
To reduce the API calls and make the system more efficient, I am trying to replace the long polling with websockets. In theory it works like this: the web app subscribes to events from the view service, the view service listens for resource-update events from the publisher, and whenever there's a message on the pubsub topic, it sends an event to the web app.
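For concreteness, here's a minimal sketch of the flow I have in mind, assuming Google Cloud Pub/Sub and FastAPI (the project, subscription, and attribute names are illustrative):

```python
import asyncio

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from google.cloud import pubsub_v1

app = FastAPI()
subscriber = pubsub_v1.SubscriberClient()
# resource_id -> websockets connected to THIS pod
listeners: dict[str, set[WebSocket]] = {}


@app.on_event("startup")
async def start_pubsub_listener() -> None:
    loop = asyncio.get_running_loop()
    sub_path = subscriber.subscription_path("my-project", "resource-updates-sub")

    def on_message(message: pubsub_v1.subscriber.message.Message) -> None:
        resource_id = message.attributes.get("resource_id", "")
        # Pub/Sub callbacks run on a worker thread, so hop onto the event loop.
        for ws in list(listeners.get(resource_id, ())):
            asyncio.run_coroutine_threadsafe(ws.send_bytes(message.data), loop)
        message.ack()  # acked even when no local listener matched; this is the problem below

    subscriber.subscribe(sub_path, callback=on_message)


@app.websocket("/ws/{resource_id}")
async def resource_events(websocket: WebSocket, resource_id: str) -> None:
    await websocket.accept()
    listeners.setdefault(resource_id, set()).add(websocket)
    try:
        while True:
            await websocket.receive_text()  # keep the connection open
    except WebSocketDisconnect:
        listeners[resource_id].discard(websocket)
```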
The problem I am experiencing with websockets is that the view service is deployed on Kubernetes and can therefore have multiple pods running at the same time. Different instances of the web app may be connected to different pods, so it can happen that a pubsub message is received by pod A while the websocket listener which needs that resource is connected to pod B. If pod A acks and discards the message because it can't find a matching listener, we lose the event and the web app never gets the updated data. If pod A nacks the message so that another pod which could benefit from it can pick it up, there may be no pod at all with a matching websocket listener, and the message will keep circulating, blocking the pubsub queue forever.
The first solution which came to mind is to create a different subscription for each pod, so every pod receives every event at least once and nothing blocks the pubsub queue. The challenge in this approach is that pods can die at any time, leaving their subscriptions abandoned; after a few weeks I'd be dealing with tons of abandoned subscriptions overflowing with messages.
Another solution is to store the pubsub messages in a database, have the pods periodically query it for events matching their listeners, and delete each message once delivered. But that doesn't solve the problem of events with no listener at all, and I don't want to add a database just for this issue (the current long polling is a much better architecture than that).
A third solution is to implement the websockets inside the publisher itself, but this is the least desirable option: the codebase there is really huge and no one likes adding new functionality to it.
The final solution is to always run just one pod of the view service, but that defeats the purpose of having a microservice on Kubernetes, and it won't scale.
Is there any recommended way, or any other way, to connect the pubsub events to the web app using websockets without adding unnecessary complexity? I would love an example if there's one available anywhere.
There is no easy solution. First of all, in the websocket pattern, each pod is responsible for sending events to the web apps connected to it, and therefore for gathering the correct backend events for those connections. In this design, each pod needs to filter out the messages it has to deliver.
The most naive implementation is to duplicate all the messages in every pod (and therefore in every subscription), but it's not really efficient and it costs money, in addition to the time spent discarding all the irrelevant messages.
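In Google Cloud Pub/Sub terms, that naive fan-out might look like this sketch, with one subscription per pod, named after the pod so the name is unique (project and topic names are illustrative):

```python
import os

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path("my-project", "resource-updates")
# Inside a Kubernetes pod, HOSTNAME is the pod name.
sub_path = subscriber.subscription_path(
    "my-project", f"view-ws-{os.environ['HOSTNAME']}"
)
# Every pod gets its own copy of every message and must discard
# the ones with no local websocket listener.
subscriber.create_subscription(request={"name": sub_path, "topic": topic_path})
```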
We can imagine a more efficient solution: create, on each pod, a list of subscriptions, one per open web-app channel, and set a filter parameter on each of them. Of course, the publisher needs to add an attribute to its messages so the subscriptions can filter on it.
When the session is over, the subscription must be deleted.
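Here is a sketch of that per-session lifecycle, assuming Google Cloud Pub/Sub subscription filters; all names, and the `resource_id` attribute, are illustrative. Note that a subscription's filter cannot be changed after creation, and messages that don't match the filter are acked by the service itself, so the pod never receives them.

```python
# The publisher must set the attribute being filtered on, e.g.:
#   publisher.publish(topic_path, data, resource_id="42")
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()


def open_session(project: str, topic: str, session_id: str, resource_id: str) -> str:
    """Create one filtered subscription for a newly opened websocket channel."""
    sub_path = subscriber.subscription_path(project, f"view-ws-{session_id}")
    subscriber.create_subscription(
        request={
            "name": sub_path,
            "topic": subscriber.topic_path(project, topic),
            # Non-matching messages never reach this pod.
            "filter": f'attributes.resource_id = "{resource_id}"',
        }
    )
    return sub_path


def close_session(sub_path: str) -> None:
    """Delete the subscription when the websocket session ends."""
    subscriber.delete_subscription(request={"subscription": sub_path})
```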
In case of a crash, I propose this pattern: store, in a database, the subscription ID, the filter content (the web-app session), and the ID of the pod in charge of the filtering and delivery. Then detect pod crashes, or run a scheduled job, to check that every pod registered in the database is still running. If a pod is in the database but not running, delete all of its subscriptions.
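A sketch of that scheduled garbage collection, assuming the Kubernetes Python client and a bookkeeping table of (subscription_id, pod_name) rows; `list_rows()` and `delete_row()` are hypothetical accessors for that table:

```python
from google.cloud import pubsub_v1
from kubernetes import client, config


def cleanup_orphaned_subscriptions(project: str, namespace: str = "default") -> None:
    config.load_incluster_config()  # the cleanup job itself runs inside the cluster
    pods = client.CoreV1Api().list_namespaced_pod(
        namespace, label_selector="app=view-service"
    )
    running = {pod.metadata.name for pod in pods.items}

    subscriber = pubsub_v1.SubscriberClient()
    for sub_id, pod_name in list_rows():  # hypothetical DB accessor
        if pod_name not in running:
            sub_path = subscriber.subscription_path(project, sub_id)
            subscriber.delete_subscription(request={"subscription": sub_path})
            delete_row(sub_id)  # hypothetical
```

As a backstop, Pub/Sub subscriptions can also be created with an expiration policy (`expiration_policy.ttl`, minimum 24 hours), so a subscription whose pod died and was never cleaned up eventually deletes itself.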
If you are able to detect a pod crash in real time, you can redistribute its active web-app sessions across the other running pods, or onto the newly created one.
As you can see, the design isn't simple and requires controls, checks, and garbage collection.