Parallel queries in Python, with a file and several pods in Kubernetes

11/19/2020

I'm developing a Python project that will be hosted on Kubernetes on Google Cloud. The idea is to read a file with millions of rows, where each row is the input key for a query against an API:

import requests

def getEndpoint(line):
    # one query per line: the line is the key sent in the JSON body
    payload = {"keyQuery": line}
    headers = {'Content-Type': 'application/json'}
    # url is the API endpoint (defined elsewhere in my project)
    response = requests.request("POST", url, headers=headers, json=payload, verify=False)
    return response.text

with open('file.txt', 'r') as fileOpen:
    for line in fileOpen:
        getEndpoint(line.strip())

I want to run my application on several Kubernetes pods because I want scalability, that is, multiple queries running at the same time. However, with this code structure, each pod would end up iterating over the file from the beginning, re-reading lines that have already been queried, which is not what I want. Two ideas came up:

1) Split the file and distribute each part among the pods. (Example: for a 100-line file with 10 pods, each pod would read 10 lines; see the sketch after this list.)

2) Before running the application, build a consumption queue from the lines of the file, so that all pods read from the queue instead of from the file directly.
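
For option 1, this is one way I picture partitioning the lines, as a minimal sketch assuming each pod knows its own index and the total number of pods (POD_INDEX and NUM_PODS are hypothetical environment variables, e.g. derived from a StatefulSet ordinal):

import os

pod_index = int(os.environ["POD_INDEX"])   # hypothetical: this pod's index, 0..NUM_PODS-1
num_pods = int(os.environ["NUM_PODS"])     # hypothetical: total number of pods

with open('file.txt', 'r') as fileOpen:
    for i, line in enumerate(fileOpen):
        # each pod only handles the lines assigned to it
        if i % num_pods == pod_index:
            getEndpoint(line.strip())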

Option 2 seems more scalable and faster to me. But I would like suggestions on the best way to run the queries using a file as the reference. I may want to run, for example, 1 million queries in 24 hours.
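
For option 2, this is roughly what I picture, as a minimal sketch assuming a Redis instance reachable by all pods (the host name "redis" and the key "file-lines" are placeholders):

import redis

r = redis.Redis(host="redis", port=6379)   # assumption: a Redis service reachable by all pods

# producer (run once): push every line of the file onto a list used as a queue
with open('file.txt', 'r') as fileOpen:
    for line in fileOpen:
        r.rpush("file-lines", line.strip())

# consumer (run in every pod): pop lines until the queue is drained
while True:
    item = r.blpop("file-lines", timeout=30)
    if item is None:        # nothing arrived within the timeout, assume the queue is empty
        break
    _, raw_line = item
    getEndpoint(raw_line.decode("utf-8"))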

-- Guilherme Duarte
kubernetes
python
request

1 Answer

11/19/2020

I think you should work with an asynchronous task worker such as Celery or Dramatiq.

The concept is to send the tasks (your query lines) to a message broker (Redis or RabbitMQ) and then let the workers consume the tasks (receive a query line and make a request to the API).
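
For example, with Celery and Redis as the broker it could look roughly like this (a minimal sketch; the broker URL and the endpoint URL are placeholders, not taken from your project):

import requests
from celery import Celery

app = Celery("queries", broker="redis://redis:6379/0")   # placeholder broker URL

url = "https://api.example.com/query"   # placeholder: use your real endpoint

@app.task
def query_api(line):
    # one task per file line; worker pods execute this
    response = requests.post(url, json={"keyQuery": line}, verify=False)
    return response.text

# producer: read the file once and enqueue one task per line
if __name__ == "__main__":
    with open("file.txt", "r") as f:
        for line in f:
            query_api.delay(line.strip())

The worker pods would then run something like "celery -A tasks worker", while a one-off job runs the producer to fill the queue.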

Also, Dramatiq and Celery have features like retrying a task if it fails.
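
In Celery, for instance, retries can be enabled directly on the task decorator (these are real Celery task options; the values are just examples):

@app.task(autoretry_for=(requests.RequestException,), retry_backoff=True, max_retries=5)
def query_api(line):
    ...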

Not sure if that is what you want, but you can explore this repo: https://github.com/matiaslindgren/celery-kubernetes-example

-- zzob
Source: StackOverflow