Non-stop workers in Celery

6/3/2021

I'm creating a distributed web crawler that crawls multiple social media platforms at the same time. The system is designed to distribute the available resources across the different platforms based on their current post rates.

For example, if social media 1 has 10 new posts per hour and social media 2 has 5 posts per hour, two crawlers focus on social media 1 and one crawler focuses on social media 2 (assuming we are allowed only three crawlers in total).

I have decided to implement this project with Celery, Flask, and RabbitMQ, with Kubernetes as the resource manager.

I have some questions regarding the deployment:

How can I tell Celery to keep a fixed number of tasks in RabbitMQ? This crawler should never stop crawling and should create new tasks based on each social media's post rate (gathered from previous crawling data). The problem is that I don't have a task submitter for this process. Usually there is a submitter that pushes tasks to Celery, but there is no such component in this project. We have a list of social media and the number of workers each one needs (stored in Postgres), and we need Celery to put a new task in RabbitMQ as soon as a task finishes.

I have tried submitting a new task at the end of every job (the crawling process), but this approach has a problem and does not scale: the newly submitted job ends up at the back of the RabbitMQ queue.
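
Roughly, what I tried looks like this; the task and function names are placeholders, not my real code:

```python
from celery import Celery

app = Celery("crawler", broker="amqp://guest@rabbitmq//")

def do_crawl(social_media_id):
    ...  # the actual crawling work

@app.task
def crawl(social_media_id):
    do_crawl(social_media_id)
    # Re-submit the same job when it finishes. The new message goes to the
    # back of the RabbitMQ queue, which is exactly the problem above.
    crawl.delay(social_media_id)
```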

I need a system that manages the free workers and assigns tasks to them immediately. The system should check which workers are free and which are busy, look at the post rates in the database, and give a task to a worker. I think using RabbitMQ (or even Redis) might not be a good fit because they are message brokers that assign a queued task to a worker, but here I don't want to have a queue at all; I want to start a task immediately when a free worker is found. The main reason queueing is not good is that the task should be decided when the job starts, not before that.

-- Leo
celery
kubernetes
rabbitmq

1 Answer

6/3/2021

My insights on your problem:

I need a system that manages the free workers and assigns tasks to them immediately.

-- Celery does this job for you

The system should check which workers are free and which are busy, look at the post rates in the database, and give a task to a worker.

Celery is a task distribution system; it will distribute the tasks as you expect.

I think using RabbitMQ (or even Redis) might not be a good fit because they are message brokers that assign a queued task to a worker

With Celery you definitely need a broker. The broker just holds your messages; Celery polls the queues and distributes the tasks to the right workers (with priorities, timeouts, soft handling, retries).
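
If the concern is that a task gets queued up long before anyone is free to run it, Celery's prefetch and acknowledgement settings help: with a prefetch multiplier of 1 and late acks, a worker only reserves the next message when it is actually free. A minimal sketch (the broker URL is just an example):

```python
from celery import Celery

app = Celery("crawler", broker="amqp://guest@rabbitmq//")

# Each worker process reserves at most one message at a time ...
app.conf.worker_prefetch_multiplier = 1
# ... and acknowledges it only after the task has finished, so messages
# stay in the queue until a worker is genuinely free to pick them up.
app.conf.task_acks_late = True
```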

but here I don't want to have a queue at all; I want to start a task immediately when a free worker is found. The main reason queueing is not good is that the task should be decided when the job starts, not before that.

This sounds like a chain reaction: triggering a new job once the previous one is done. If that is the case, you don't even need Celery or a distributed producer-consumer system.

Identify the problem:

1. Do you need a periodic task to be executed at a point in time? Go with a cron job or celery beat (a cron-style Celery scheduler); a minimal beat configuration is sketched after this list.
2. Do you need multiple tasks to run without blocking the other running tasks? You need a producer-consumer system: Celery (an out-of-the-box solution) or native Python consumers on RabbitMQ/Redis.
3. If the same task should trigger the new task, there is no need for multiple workers; what would we gain from multiple workers if the work is just a single thread?
4. Outcome: Celery, RabbitMQ, and Kubernetes are a good combo for a distributed orchestrated system, a webhook model, or a recursive Python script.
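
For point 1, a minimal celery beat configuration could look like this (the broker URL, schedule entry name, task name, and interval are only illustrative; a sketch of the task itself follows further down):

```python
from celery import Celery

app = Celery("crawler", broker="amqp://guest@rabbitmq//")

# Run a "producer"/parent task periodically via celery beat.
app.conf.beat_schedule = {
    "refresh-crawl-queue": {
        "task": "tasks.schedule_crawls",
        "schedule": 60.0,  # seconds; tune this to your post rates
    },
}
```

The beat process would then run alongside the workers, e.g. `celery -A tasks beat` and `celery -A tasks worker`, assuming the module is named `tasks`.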


Reply to your comment below, @alavi:

One way of doing it: write a periodic job (it can run every second, minute, hour, or whatever rate) using celery beat, which acts as a producer or parent task. It can iterate over all the media sites from the DB and spawn a new crawling task for each. The work status can be maintained in the DB, and new tasks can be spawned based on that status. For a start, this parent task could check whether the previous job is still running, or check the progress of the last task and decide based on that progress; you could even think about splitting the crawl job into micro-tasks triggered from the parent job. You can collect more requirements as you go, during development or from performance measurements.
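
A rough sketch of that parent/producer task, under the assumption that Postgres stores, per site, how many crawlers it should get; `get_media_sites` and `count_running_crawls` are stand-ins for your own queries, not real functions:

```python
from celery import Celery

app = Celery("crawler", broker="amqp://guest@rabbitmq//")

def get_media_sites():
    """Placeholder for a Postgres query returning sites and their desired worker counts."""
    return []

def count_running_crawls(site_id):
    """Placeholder for a query counting crawl jobs currently marked as running."""
    return 0

@app.task(name="tasks.schedule_crawls")
def schedule_crawls():
    # Parent task, triggered periodically by celery beat (see the schedule above).
    for site in get_media_sites():
        missing = site.desired_workers - count_running_crawls(site.id)
        # Spawn only as many crawl tasks as the site's post rate calls for.
        for _ in range(missing):
            crawl.delay(site.id)

@app.task(name="tasks.crawl")
def crawl(site_id):
    # Crawl one batch of posts and record status/progress in the DB so the
    # parent task can make its decision on the next run.
    ...
```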

-- Thomas John
Source: StackOverflow