Architecture for ML jobs platform

8/25/2021

I'm building a platform to run ML jobs. Jobs will be started from an interface. I'm making a service for each type of job. Sometimes, a service S1 might first need to make a request to another service S2 and get its output before running its own job.

Each service is split into two Kubernetes deployments:

  • one that pulls messages from a topic, checks them and persists them to a database (D1)
  • one that reads requests from the database, runs the actual job, updates the request state in the database and then answers the client (D2)

Here is the flow:

  • the interface publishes a PubSub message to a topic T1
  • D1 pulls the message from T1 and persists a request to the database
  • D2 sees the new request in the database, runs it, then updates its state in the database and answers the client
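The flow above can be sketched end to end with in-memory stand-ins for the PubSub topic and the database (the topic name, request fields and state values here are illustrative, not from any real deployment):

```python
import queue

# In-memory stand-ins for topic T1 and the database (illustrative only).
t1 = queue.Queue()   # topic T1
database = {}        # request_id -> {"payload": ..., "state": ...}
responses = []       # where D2's answers to the client end up

def interface_submit(request_id, payload):
    """The interface publishes a message to topic T1."""
    t1.put({"id": request_id, "payload": payload})

def d1_pull_and_persist():
    """D1: pull a message from T1, check it, persist it as a pending request."""
    msg = t1.get()
    assert "id" in msg and "payload" in msg  # minimal validation
    database[msg["id"]] = {"payload": msg["payload"], "state": "PENDING"}

def d2_run_pending_jobs():
    """D2: find pending requests, run the job, update state, answer the client."""
    for request_id, req in database.items():
        if req["state"] != "PENDING":
            continue
        result = req["payload"].upper()  # placeholder for the actual ML job
        req["state"] = "DONE"
        responses.append({"id": request_id, "result": result})

interface_submit("req-1", "train model")
d1_pull_and_persist()
d2_run_pending_jobs()
```

In a real system D1 and D2 would be long-running loops in separate pods; the point of the sketch is only the hand-off through the topic and the state transition in the database.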

To answer the client, D2 has two options:

  • push a message to a PubSub topic T2 that the client checks continuously. An id is passed in both request and response so that only the matching client pulls its message from the topic.
  • use a callback provided by the client to make a POST request
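The first option hinges on that correlation id: each client must inspect messages on T2 and only keep the one addressed to it. A minimal sketch (topic and field names are made up), which also shows the awkward part of the pattern: a client that pulls someone else's message has to put it back:

```python
import queue

t2 = queue.Queue()  # response topic T2 (in-memory stand-in)

def d2_answer(request_id, result):
    """D2 publishes the result to T2, tagged with the request's id."""
    t2.put({"id": request_id, "result": result})

def client_poll(my_id):
    """Client checks T2 and only keeps the message matching its own id,
    re-publishing anything addressed to another client."""
    while True:
        msg = t2.get()
        if msg["id"] == my_id:
            return msg["result"]
        t2.put(msg)  # not ours: put it back for the right client

d2_answer("req-42", {"accuracy": 0.93})
d2_answer("req-7", {"accuracy": 0.88})
result = client_poll("req-7")
```

With a real PubSub service, per-client subscriptions with a filter on the id would avoid the re-publish dance, but the correlation id is still what ties response to request.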

What do you think about this architecture? Does the usage of PubSub make sense? Also, does it make sense to split each service into two deployments (one that deals with the request, one that runs the actual job)?

-- user16317357
architecture
kubernetes

1 Answer

12/6/2021
  • the interface publishes a PubSub message to a topic T1
  • D1 pulls the message from T1 and persists a request to the database

If there's only one database, I'm not sure I see much advantage in using a topic (implying pub/sub). Another approach would be to use a queue: the interface creates jobs into the queue, then you can have any number of workers processing it. Depending on the situation you may not even need the database at all - if all the data needed can be in the message in the queue.
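That queue-based alternative might look like this (a sketch with Python's standard library standing in for a real message queue; the worker count and job contents are made up). Everything the job needs travels in the message itself, so no database is involved:

```python
import queue
import threading

jobs = queue.Queue()   # the job queue the interface writes into
results = []
results_lock = threading.Lock()

def worker():
    """Each worker pulls jobs until it sees the shutdown sentinel."""
    while True:
        job = jobs.get()
        if job is None:  # shutdown sentinel
            jobs.task_done()
            return
        output = job["payload"] * 2  # placeholder for the actual work
        with results_lock:
            results.append((job["id"], output))
        jobs.task_done()

# The interface enqueues jobs; any number of workers can drain the queue.
for i in range(5):
    jobs.put({"id": i, "payload": i})

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
for _ in workers:
    jobs.put(None)
for w in workers:
    w.join()
```

Scaling out is then just adding workers (or pod replicas), with the queue doing the load balancing.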

use a callback provided by the client to make a POST request

That's better if you can do it, on the assumption that there's only one consumer for the event; pub/sub is more for broadcasting out to multiple consumers. Polling works but is really inefficient and has limits on how much it can scale.
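The callback option can be sketched entirely with the standard library: the client exposes an HTTP endpoint, and D2 POSTs the result to it when the job finishes (the URL path and JSON fields are illustrative):

```python
import http.server
import json
import threading
import urllib.request

received = []

class CallbackHandler(http.server.BaseHTTPRequestHandler):
    """Client-side endpoint that D2 calls back when the job is done."""
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        received.append(json.loads(self.rfile.read(length)))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):  # silence request logging
        pass

# The client's callback server, bound to an ephemeral port.
server = http.server.HTTPServer(("127.0.0.1", 0), CallbackHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
callback_url = f"http://127.0.0.1:{server.server_port}/callback"

# D2, having finished the job, POSTs the result straight to the callback.
body = json.dumps({"id": "req-1", "result": "done"}).encode()
req = urllib.request.Request(callback_url, data=body,
                             headers={"Content-Type": "application/json"})
urllib.request.urlopen(req).close()
server.shutdown()
```

One push per result, no polling loop, and the correlation id in the body still lets the client match the response to its request.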

Also, does it make sense to split each service into two deployments (one that deals with the request, one that runs the actual job)?

Having separate deployables makes sense if they are built by different teams with different release cadences, or if you need to scale them independently; otherwise it may not be necessary.

-- Adrian K
Source: StackOverflow