I'm running a Kubernetes cluster with three Celery pods, using a single Redis pod as the message queue. Celery version 4.1.0, Python 3.6.3, and a standard Redis pod from Helm.
When there is a sudden influx of tasks, the Celery pods stop processing tasks altogether. They handle the first few tasks fine, but eventually stop working and my tasks hang.
My tasks follow this format:
@app.task(bind=True)
def my_task(self, some_param):
    result = get_data(some_param)
    if result != expectation:
        # Retry without raising, waiting 5 seconds before the next attempt
        self.retry(throw=False, countdown=5)
And are generally queued as follows:
from my_code import my_task
my_task.apply_async(queue='worker', kwargs=celery_params)
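For context, the app object behind these tasks is assumed to be configured along these lines (a minimal sketch; the real module is myapp.implementation.celery_app, and the Redis URL is a placeholder for the in-cluster service name):

from celery import Celery

# Sketch of the Celery app configuration; broker/backend point at the Redis pod.
app = Celery(
    'myapp',
    broker='redis://redis:6379/0',
    backend='redis://redis:6379/0',
)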
The relevant portion of my deployment.yaml:
command: ["celery", "worker", "-A", "myapp.implementation.celery_app", "-Q", "http"]
The only difference between this cluster and my local setup, which I manage with docker-compose, is that the cluster runs the prefork pool while locally I run the eventlet pool so that I can put together a code coverage report. I've tried running eventlet on the cluster as well, but I see no difference in the results; the tasks still hang.
Is there something I'm missing about running a Celery worker in Kubernetes? Is there a bug that could be affecting my results? Are there any good ways to break into the cluster to see what's actually happening with this issue?
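For what it's worth, one way to peek at what the workers think they are doing is Celery's inspect API, run from any pod that can reach the broker (a sketch, assuming the same app module as in the worker command above):

from myapp.implementation.celery_app import app

# Ask the running workers what they are doing right now.
inspector = app.control.inspect()

print(inspector.active())    # tasks currently executing
print(inspector.reserved())  # tasks prefetched from the queue but not yet started
print(inspector.stats())     # per-worker stats (pool type, concurrency, etc.)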
Running the Celery tasks without apply_async allowed me to debug this issue, and it showed that there was a concurrency logic error in the tasks themselves. I highly recommend this method of debugging Celery tasks.
Instead of:
from my_code import my_task
celery_params = {'key': 'value'}
my_task.apply_async(queue='worker', kwargs=celery_params)
I used:
from my_code import my_task
celery_params = {'key': 'value'}
my_task(**celery_params)
This allowed me to locate the concurrency issue. After I had found the bug, I converted the code back to an asynchronous call with apply_async.
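A related option, if you want to keep the apply_async call sites unchanged while debugging, is Celery's task_always_eager setting, which makes apply_async execute the task locally and synchronously (a sketch; not what I used here):

from myapp.implementation.celery_app import app
from my_code import my_task

# With eager mode on, apply_async() runs the task in the calling process
# instead of sending it to the broker, so tracebacks stay local.
app.conf.task_always_eager = True
app.conf.task_eager_propagates = True  # re-raise exceptions from inside the task

my_task.apply_async(queue='worker', kwargs={'key': 'value'})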