We have a lot of long-running, memory/CPU-intensive jobs that are run with Celery on Kubernetes on Google Cloud Platform. However, we have big problems with scaling, retrying, monitoring, alerting, and guaranteed delivery. We want to move from Celery to a more advanced framework.
There is a comparison at https://github.com/argoproj/argo/issues/849, but it's not enough for us to decide.
The candidates we are considering are Airflow and Argoproj. Our DAGs are not that complicated. Which of these frameworks should we choose?
Idiomatic Airflow isn't really designed to execute long-running jobs by itself. Rather, Airflow is meant to act as a facilitator: it kicks off compute jobs in another service (this is done with Operators) while monitoring the status of those jobs (this is done with Sensors).
Given your example, each compute task would be initiated with the appropriate Operator for the service being used (Airflow has GCP hooks that simplify this), and the corresponding Sensor would determine when the task has completed so that it no longer blocks downstream tasks that depend on it.
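As a minimal sketch of that Operator/Sensor pattern for your setup (heavy jobs on Kubernetes, output on GCP): an operator launches the long-running job as a pod, and a sensor waits for its output before downstream tasks run. The image name, namespace, bucket, and object path below are placeholders, and the exact import paths and which operator/sensor you pick depend on your Airflow provider versions and the GCP service involved.

```python
# Sketch only: identifiers like the image, namespace, bucket, and object
# path are hypothetical; import paths can differ between provider versions.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

with DAG(
    dag_id="heavy_job_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Operator: kick off the long-running, CPU/memory-intensive job as a pod
    # on the Kubernetes cluster instead of running it inside Airflow itself.
    run_heavy_job = KubernetesPodOperator(
        task_id="run_heavy_job",
        name="heavy-job",
        namespace="jobs",
        image="gcr.io/my-project/heavy-job:latest",  # placeholder image
        cmds=["python", "process.py"],
        get_logs=True,
    )

    # Sensor: block downstream tasks until the job's output lands in GCS.
    wait_for_output = GCSObjectExistenceSensor(
        task_id="wait_for_output",
        bucket="my-output-bucket",               # placeholder bucket
        object="results/{{ ds }}/done.marker",   # placeholder object path
        poke_interval=60,
    )

    run_heavy_job >> wait_for_output
```

The point of the split is that Airflow workers only poll for status; the heavy compute runs elsewhere, which is what gives you the retry, alerting, and monitoring hooks around it.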
While I'm not intimately familiar with the details of Argoproj, it appears to be less of a "scheduling system" like Airflow and more of a system for orchestrating and actually executing much of the compute itself.