How to process a large number of files on multiple computers using multiple processes?

10/19/2017

I have hundreds of binary files varying in size from 5 MB to 500 MB, and a Python script that takes one file as input and outputs a small .txt file. Processing takes about 10 minutes for a 250 MB file.

To process them as quickly as possible, I have 10 (local) servers with 20 cores each. What would be the best way to split this job so that I can add more hardware later? I'm sure this has been done a million times before, so there should be an open-source solution for it.
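For scale, here is a minimal per-server sketch of what I mean by "multiple processes", assuming script.py can be called as `python script.py <input_file> <output_dir>` and that the inputs sit under /mnt/shrd_drive (both the command line and the directory layout are assumptions on my part):

```python
# Rough per-server sketch (untested). Assumptions: script.py is invoked as
# "python script.py <input_file> <output_dir>", and inputs live on the shared
# mount with a .bin extension; adjust both to the real interface.
import subprocess
import sys
from multiprocessing import Pool
from pathlib import Path

INPUT_DIR = Path("/mnt/shrd_drive/input")    # assumed layout
OUTPUT_DIR = Path("/mnt/shrd_drive/output")  # assumed layout

def process_one(path):
    # Run one file through the existing script in its own process.
    return subprocess.call([sys.executable, "script.py", str(path), str(OUTPUT_DIR)])

if __name__ == "__main__":
    files = sorted(INPUT_DIR.glob("*.bin"))
    with Pool(processes=20) as pool:  # one worker per core
        pool.map(process_one, files)
```

That covers the 20 cores on one box; my real question is how to spread the file list across the 10 servers without writing my own scheduler.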

I was thinking of Kubernetes, because its Docker containers can easily isolate the dependencies of script.py, combined with putting all the binary files on a single network-shared drive mounted on every server at /mnt/shrd_drive, from which they can all read.
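As a crude alternative to a real scheduler, I also sketched a pull-style worker that every server could run, claiming files off the shared mount by renaming them into a per-host directory. The directory layout and the script.py command line below are made up, and I haven't verified that rename is reliably atomic over our NFS setup:

```python
# Naive pull-style worker (untested). Each server claims files by renaming them
# from a shared "todo" directory into its own claim directory; rename on the
# same filesystem should be atomic, though I haven't verified that for NFS.
# The directory names and the script.py command line are assumptions.
import os
import socket
import subprocess
import sys
from pathlib import Path

TODO = Path("/mnt/shrd_drive/todo")
CLAIMED = Path("/mnt/shrd_drive/claimed") / socket.gethostname()
OUTPUT = Path("/mnt/shrd_drive/output")
CLAIMED.mkdir(parents=True, exist_ok=True)

def claim_next():
    # Try to grab any unprocessed file; if another server renames it first,
    # the rename raises OSError and we move on to the next candidate.
    for candidate in TODO.iterdir():
        target = CLAIMED / candidate.name
        try:
            os.rename(candidate, target)
            return target
        except OSError:
            continue
    return None

if __name__ == "__main__":
    while True:
        path = claim_next()
        if path is None:
            break
        subprocess.call([sys.executable, "script.py", str(path), str(OUTPUT)])
```

Each server would combine this with the per-core Pool above, but it feels like reinventing a job queue, which is why I'm asking whether Kubernetes (or something else) already solves this.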

-- Dusan J.
cloud
infrastructure
kubernetes
python
scalability

0 Answers