How to make fault-tolerant data processing k8s pods?

4/30/2020

We have an image processing pipeline on GKE which is fed from a GCP Pub/Sub topic, which in turn is fed by bucket notifications, i.e.

image upload > bucket > notification > topic > pods consume files off topic.

This scales nicely, but occasionally pods die or get scaled down, taking with them the data they consumed from the topic. Is there a container design pattern to make sure the file gets processed even if the pod dies unexpectedly?

(Sorting out what was missed is kind of a nightmare when you're dealing with millions of image files).

-- CpILL
google-kubernetes-engine
kubernetes

1 Answer

4/30/2020

Yeah, I just had a good long think about it and came up with a two-queue solution with what I'm going to call the Accountant pod/container (the idea is based on double-entry bookkeeping):

  1. Jobs get posted to 2 queues, Q1 and Q2
  2. Workers process Q1, popping items until it is empty
  3. When Q1 is empty, an accountant pod goes through Q2 checking the expected output for each job. Anything missing is put back on both Q1 and Q2 (after Q2 is depleted).
  4. Repeat until the accountant posts nothing back to the queues.
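The steps above can be sketched with plain in-memory queues standing in for a real message queue like Pub/Sub. This is a minimal illustration, not production code; all the function names are mine, and the `crash_on` parameter just simulates a pod dying mid-job:

```python
from collections import deque

def submit(job, q1, q2):
    """Step 1: every job is posted to both Q1 and Q2."""
    q1.append(job)
    q2.append(job)

def worker(q1, results, crash_on=None):
    """Step 2: pop Q1 until it is empty. `crash_on` simulates a pod
    dying: the job is consumed from Q1 but its output never appears."""
    crash_on = set(crash_on or ())
    while q1:
        job = q1.popleft()
        if job in crash_on:
            crash_on.remove(job)  # lose this job once, like a dead pod
            continue
        results[job] = f"done-{job}"

def accountant(q1, q2, results):
    """Step 3: with Q1 empty, drain Q2 and check the expected output
    for each job; re-submit anything missing. Returns count re-queued."""
    missing = []
    while q2:
        job = q2.popleft()
        if job not in results:
            missing.append(job)
    for job in missing:
        submit(job, q1, q2)
    return len(missing)

def run(jobs, crash_on=()):
    """Step 4: repeat worker + accountant until nothing is re-queued."""
    q1, q2, results = deque(), deque(), {}
    for job in jobs:
        submit(job, q1, q2)
    worker(q1, results, crash_on=crash_on)  # first pass may lose jobs
    while accountant(q1, q2, results):
        worker(q1, results)                 # retry passes
    return results
```

So even if `"b"` is lost on the first pass, the accountant spots the missing output and re-queues it until everything checks out. With a real broker you'd compare Q2 against the actual output bucket/table rather than an in-memory dict.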

I call it the Double-Entry/Accountant Design Pattern :)

I think this can be applied to most data processing queue systems.

The only flaw I see in it is if the accountant itself dies (but it should be a lightweight job, just checking input vs output). I guess then you could have N queues with N-1 accountants, depending on how certain you want to be (though coordinating more than one accountant might be tricky).

-- CpILL
Source: StackOverflow