I am re-designing our current ETL pipeline for file ingestion, where Spring Batch containers will be deployed on Kubernetes.
Each file type is processed by a different job, which will be triggered on demand via HTTP by a dispatcher application listening on an SQS queue to which S3 pushes messages on each new-object-created event. So when N files of the same type have to be processed, the dispatcher has to make N HTTP requests to the same job.
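To illustrate, here is a rough sketch of the dispatcher I have in mind (the queue URL, the jobs base URL and the file-type-to-endpoint convention are placeholders, not actual code):

    // Hypothetical dispatcher sketch: one SQS queue receives all S3 "object created"
    // events, and one HTTP call is made per file to the job handling that file type.
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import software.amazon.awssdk.services.sqs.SqsClient;
    import software.amazon.awssdk.services.sqs.model.*;

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class Dispatcher {

        private static final String QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/file-events"; // placeholder
        private static final String JOBS_BASE_URL = "http://batch-jobs.internal";                               // placeholder

        public static void main(String[] args) throws Exception {
            SqsClient sqs = SqsClient.create();
            HttpClient http = HttpClient.newHttpClient();
            ObjectMapper mapper = new ObjectMapper();

            while (true) {
                ReceiveMessageResponse response = sqs.receiveMessage(ReceiveMessageRequest.builder()
                        .queueUrl(QUEUE_URL).maxNumberOfMessages(10).waitTimeSeconds(20).build());

                for (Message message : response.messages()) {
                    // Direct S3 -> SQS notifications carry the object key under Records[0].s3
                    // (URL-encoding of the key is ignored in this sketch)
                    JsonNode record = mapper.readTree(message.body()).path("Records").path(0);
                    String key = record.path("s3").path("object").path("key").asText();

                    // Hypothetical convention: "orders-2020-06-18.csv" -> type "orders" -> /jobs/ordersJob
                    String fileType = key.substring(0, key.indexOf('-'));
                    HttpRequest trigger = HttpRequest.newBuilder(URI.create(JOBS_BASE_URL + "/jobs/" + fileType + "Job"))
                            .header("Content-Type", "application/json")
                            .POST(HttpRequest.BodyPublishers.ofString("{\"fileName\":\"" + key + "\"}"))
                            .build();
                    http.send(trigger, HttpResponse.BodyHandlers.ofString());

                    // Delete the message only after the job has been triggered
                    sqs.deleteMessage(DeleteMessageRequest.builder()
                            .queueUrl(QUEUE_URL).receiptHandle(message.receiptHandle()).build());
                }
            }
        }
    }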
Considering that one instance can process only one file at a time, and with some tolerance before scaling up (i.e. if job A has to process two files in the whole day and they both arrive at the same time, it's acceptable to process them sequentially without having to scale), how can I scale the job instances based on this scenario?
I could query the job execution table to know how many jobs of the same type are running and apply some logic to spin up more instances, but it doesn't look like an elegant solution.
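For reference, this is the kind of check I mean, using Spring Batch's JobExplorer (the job name and the naive scaling rule are only placeholders):

    import org.springframework.batch.core.JobExecution;
    import org.springframework.batch.core.explore.JobExplorer;
    import org.springframework.stereotype.Component;

    import java.util.Set;

    @Component
    public class RunningJobMonitor {

        private final JobExplorer jobExplorer;

        public RunningJobMonitor(JobExplorer jobExplorer) {
            this.jobExplorer = jobExplorer;
        }

        // Counts currently running executions of a given job (e.g. "ordersJob")
        // by querying the Spring Batch meta-data tables through JobExplorer.
        public int runningInstances(String jobName) {
            Set<JobExecution> running = jobExplorer.findRunningJobExecutions(jobName);
            return running.size();
        }

        // Naive rule sketch: scale out only when the backlog exceeds a per-job tolerance.
        public boolean shouldScaleUp(String jobName, int pendingFiles, int tolerance) {
            return pendingFiles + runningInstances(jobName) > tolerance;
        }
    }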
Currently all the jobs are within the same application (even the dispatcher) and we scale based on the number of SQS messages, but in the new architecture this would require a queue per job (i.e. per file type), while I would like to centralise the logic for triggering jobs on demand in a separate dispatcher application.
Here are some recommendations about how I would do it:
- Make each job responsible for a single file type (orders type -> ordersJob, etc)
- Make each file a distinct job instance of that job (orders-2020-06-18.csv -> job instance for this file), aka the file name (+ hash if needed) is an identifying job parameter
- Have the dispatcher send a JobLaunchRequest for each file, as described in Launching Batch Jobs through Messages in the Spring Batch reference documentation (see the sketch after this list)
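For illustration, a minimal sketch of building such a JobLaunchRequest from an incoming message, along the lines of the example in that documentation section (the ordersJob bean and the input.file.name parameter name are illustrative, not prescriptive):

    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.JobParametersBuilder;
    import org.springframework.batch.integration.launch.JobLaunchRequest;
    import org.springframework.integration.annotation.Transformer;
    import org.springframework.messaging.Message;

    public class FileMessageToJobRequest {

        private final Job ordersJob; // the job registered for the "orders" file type

        public FileMessageToJobRequest(Job ordersJob) {
            this.ordersJob = ordersJob;
        }

        @Transformer
        public JobLaunchRequest toRequest(Message<String> message) {
            String fileName = message.getPayload(); // e.g. "orders-2020-06-18.csv"
            // The file name is an identifying job parameter, so re-submitting the same
            // file maps to the same job instance instead of creating a duplicate run.
            return new JobLaunchRequest(ordersJob,
                    new JobParametersBuilder()
                            .addString("input.file.name", fileName) // identifying by default
                            .toJobParameters());
        }
    }

A JobLaunchingGateway (also covered in that section) can then consume the JobLaunchRequest and launch the job; because the file name is identifying, you get restartability for failed files and protection against accidental duplicate processing.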
Regarding "Currently all the jobs are within the same application": I would recommend packaging each job in its own jar (and container), for all the good reasons of making one thing do one thing and do it well.
Finally, I see you tagged your question with kubernetes, so the following thread might be of interest to you: https://stackoverflow.com/questions/60924076/batch-processing-on-kubernetes