Is there a way to downscale pods only when a message is processed (the pod finished its task) with the HorizontalPodAutoscaler in Kubernetes?

6/11/2019

I've set up the Kubernetes Horizontal Pod Autoscaler with custom metrics using the Prometheus adapter https://github.com/DirectXMan12/k8s-prometheus-adapter. Prometheus is monitoring RabbitMQ, and I'm watching the rabbitmq_queue_messages metric. The messages from the queue are picked up by the pods, which then do some processing that can last for several hours.

The scale-up and scale-down is working based on the number of messages in the queue.

The problem: when a pod finishes the processing and acks the message, that lowers the number of messages in the queue, which in turn triggers the Autoscaler to terminate a pod. If I have multiple pods doing the processing and one of them finishes, if I'm not mistaken, Kubernetes could terminate a pod that is still processing its own message. This wouldn't be desirable, as all the processing that pod has done would be lost.

Is there a way to overcome this, or another way this could be achieved?

Here is the Autoscaler configuration:

kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2beta1
metadata:
  name: sample-app-rabbitmq
  namespace: monitoring
spec:
  scaleTargetRef:
    # the Deployment to scale
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Object
    object:
      target:
        kind: Service
        name: rabbitmq-cluster
      metricName: rabbitmq_queue_messages_ready
      targetValue: 5
-- Zarko
autoscaling
kubernetes
rabbitmq

2 Answers

7/5/2019

The Horizontal Pod Autoscaler is not designed for long-running tasks and will not be a good fit here. If you need to spawn one long-running processing task per message, I'd take one of these two approaches:

  • Use a task queue such as Celery. It is designed to solve your exact problem: have a queue of tasks that needs to be distributed to workers, and ensure that the tasks run to completion. Kubernetes even provides an official example of this setup.
  • If you don't want to introduce another component such as Celery, you can spawn a Kubernetes Job for every incoming message yourself. Kubernetes will make sure that the Job runs to completion at least once, rescheduling the pod if it dies, and so on. In this case you will need to write a small consumer that reads RabbitMQ messages and creates a Job for each of them; a sketch of such a Job follows this list.
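
As a rough illustration of the second approach, here is a minimal Job manifest such a consumer could submit for each message. The image, the Job name, and the MESSAGE_ID variable are placeholders for this sketch, not part of the original setup:

apiVersion: batch/v1
kind: Job
metadata:
  name: process-message-42        # hypothetical; derive from the message ID
spec:
  backoffLimit: 4                 # retry a failed pod a few times
  template:
    spec:
      restartPolicy: Never        # let the Job controller handle retries
      containers:
      - name: worker
        image: registry.example.com/message-worker:latest  # placeholder image
        env:
        - name: MESSAGE_ID        # hypothetical: tells the worker which message to process
          value: "42"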

In both cases, make sure you also have the Cluster Autoscaler enabled, so that new nodes are provisioned automatically when your current nodes cannot handle the load.

-- Shnatsel
Source: StackOverflow

6/14/2019

You could consider an approach using a preStop hook.

As per the documentation (Container states, Define postStart and preStop handlers):

Before a container enters into Terminated, preStop hook (if any) is executed.

So you can use something like this in your deployment's container spec:

lifecycle:
  preStop:
    exec:
      command: ["your script"]

Update:

  1. After some research I would like to provide more information: there is an interesting project, KEDA. A sketch of a KEDA ScaledObject is shown at the end of this answer.

    KEDA allows for fine grained autoscaling (including to/from zero) for event driven Kubernetes workloads. KEDA serves as a Kubernetes Metrics Server and allows users to define autoscaling rules using a dedicated Kubernetes custom resource definition. KEDA can run on both the cloud and the edge, integrates natively with Kubernetes components such as the Horizontal Pod Autoscaler, and has no external dependencies.

  2. For the main question: "Kubernetes could terminate a pod that is still doing the processing of its own message".

    As per the documentation:

    "Deployment is a higher-level concept that manages ReplicaSets and provides declarative updates to Pods along with a lot of other useful features"

A Deployment is backed by a ReplicaSet. As per the ReplicaSet controller code, there is a function getPodsToDelete. In combination with filteredPods it gives the result: "This ensures that we delete pods in the earlier stages whenever possible."

So, as a proof of concept:

You can create a deployment with an init container. The init container should check if there is a message in the queue and exit as soon as at least one message appears. This allows the main container to start, take, and process that message. In this case there will be two kinds of pods: those which are processing a message and consuming CPU, and those which are still starting, idle, and waiting for the next message. The starting pods will be deleted first when the HPA decides to decrease the number of replicas in the deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: complete
  name: complete
spec:
  replicas: 5
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: complete
  template:
    metadata:
      labels:
        app: complete
    spec:
      hostname: c1
      containers:
      - name: complete
        command:
        - "bash"
        args:
        - "-c"
        # stands in for the actual message processing (15-30 s of "work")
        - "wa=$(shuf -i 15-30 -n 1) && echo $wa && sleep $wa"
        image: ubuntu
        imagePullPolicy: IfNotPresent
        resources: {}
      initContainers:
      - name: wait-for
        image: ubuntu
        # stands in for polling the queue until a message is available
        command: ['bash', '-c', 'sleep 30']
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
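
Regarding point 1 above, KEDA can express this queue-driven scaling directly. A minimal sketch of a ScaledObject for a RabbitMQ queue, assuming a recent KEDA release (keda.sh/v1alpha1 API); the queue name, the namespace, and the RABBITMQ_HOST environment variable are illustrative, not from the original setup:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sample-app-scaler        # illustrative name
  namespace: monitoring
spec:
  scaleTargetRef:
    name: sample-app             # the Deployment from the question
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: rabbitmq
    metadata:
      queueName: sample-queue    # illustrative queue name
      mode: QueueLength          # scale on the number of ready messages
      value: "5"
      hostFromEnv: RABBITMQ_HOST # env var on the target pods holding the amqp:// URL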

Hope this helps.

-- Hanx
Source: StackOverflow