Scale spring-batch instances executed from within a web container

6/18/2020

I am re-designing our current ETL for file ingestion where spring-batch containers will be deployed on Kubernetes.

Each file type is processed by a different job, which will be triggered on demand via HTTP by a dispatcher application listening on an SQS queue to which S3 pushes a message on each new-object-created event. So when N files of the same type have to be processed, the dispatcher has to make N HTTP requests to the same job.

Considering that one instance can process only one file at a time, and allowing some tolerance before scaling up (i.e. if job A has to process two files in the whole day and they both arrive at the same time, it's acceptable to process them sequentially without having to scale), how can I scale the job instances based on this scenario?

I could query the job execution table to find out how many jobs of the same type are running and apply some logic to spin up more instances, but it doesn't look like an elegant solution.

Currently all the jobs are within the same application (even the dispatcher) and we are scaling based on the number of SQS messages, but in the new architecture this would require a queue per job (i.e. per file type), while I would like to centralise the logic for triggering the job on demand into a separate dispatcher application.

-- gzp___
amazon-web-services
kubernetes
scalability
spring
spring-batch

1 Answer

6/18/2020

Here are some recommendations about how I would do it:

  • Create a job per file type (for example orders type -> ordersJob, etc)
  • Create a job instance per file (orders-2020-06-18.csv -> job instance for this file, i.e. the file name (+ hash if needed) is an identifying job parameter)
  • Use a single queue for job requests (no need for a queue per job type). You can use the same mechanism of JobLaunchRequest described in Launching Batch Jobs through Messages
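
To make the three points above concrete, here is a minimal, dependency-free sketch of how a dispatcher could turn an S3 object key into a job name plus identifying parameters. Everything in it is an assumption for illustration: the class name FileJobRequest, the parameter keys fileName and fileHash, and the convention that the file type is the part of the file name before the first '-'. It is a plain-Java stand-in for the idea, not Spring Batch's own JobLaunchRequest.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a job launch request, kept free of Spring dependencies.
final class FileJobRequest {
    final String jobName;
    final Map<String, String> identifyingParameters;

    private FileJobRequest(String jobName, Map<String, String> identifyingParameters) {
        this.jobName = jobName;
        this.identifyingParameters = identifyingParameters;
    }

    // Assumption: the file type is the part of the file name before the first '-',
    // e.g. "orders-2020-06-18.csv" -> type "orders" -> job "ordersJob".
    static FileJobRequest fromObjectKey(String objectKey, String contentHash) {
        // Drop any S3 "folder" prefix; lastIndexOf returns -1 when there is no '/'.
        String fileName = objectKey.substring(objectKey.lastIndexOf('/') + 1);
        int dash = fileName.indexOf('-');
        if (dash <= 0) {
            throw new IllegalArgumentException("Cannot derive file type from: " + fileName);
        }
        String fileType = fileName.substring(0, dash);

        // The file name (plus an optional hash) identifies the job instance.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("fileName", fileName);
        if (contentHash != null && !contentHash.isEmpty()) {
            params.put("fileHash", contentHash);
        }
        return new FileJobRequest(fileType + "Job", params);
    }
}
```

In a real Spring Batch application these values would presumably be copied into a JobParametersBuilder and wrapped in a JobLaunchRequest before being put on the single request queue; with this mapping, two different files of the same type yield the same job name but different job instances.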

Currently all the jobs are within the same application

I would recommend packaging each job in its own jar (container), for all the good reasons of making one thing do one thing and do it well, see more details in:

Finally, I see you tagged your question with kubernetes so the following thread might be of interest to you: https://stackoverflow.com/questions/60924076/batch-processing-on-kubernetes

-- Mahmoud Ben Hassine
Source: StackOverflow