Can Google Cloud Dataflow be used as a task queue to process multiple items of data in parallel?

9/18/2018

We are currently evaluating our options on Google Cloud Platform for a solution that works this way. We are expecting a lot of messages from our application, and we intend to queue these transactions using Google Cloud Pub/Sub. A typical message can contain multiple JSON objects, like this:

{
  "groupId": "3003030330",
  "groupTitle": "Multiple Payments Processing",
  "transactions": [
    {"id": "3030303", "amount": "2000", "to": "XXXX-XXX"},
    {"id": "3030304", "amount": "5000", "to": "XXXX-XXX"},
    {"id": "3030304", "amount": "5000", "to": "XXXX-XXX"}
  ]
}

We need to pass each of these transactions to our payment gateway synchronously and in parallel using Google Cloud Dataflow, then collate the responses into a different PCollection and write it to another Pub/Sub topic. My question is whether Google Cloud Dataflow is the most efficient and scalable solution to this problem, or whether we should instead use the Kubernetes HorizontalPodAutoscaler to scale based on the number of messages in the Pub/Sub subscription. Any ideas and thoughts would be appreciated.
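For reference, here is a minimal Apache Beam (Python) sketch of the kind of pipeline we have in mind. The topic names are placeholders, and call_gateway stands in for our real payment-gateway client; this is only an illustration, not working code we already have:

import json

import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions


def call_gateway(txn):
    # Placeholder for the synchronous payment-gateway call; returns a
    # per-transaction response.
    return {"id": txn["id"], "status": "OK"}


class CallGatewayPerTransaction(beam.DoFn):
    def process(self, message):
        group = json.loads(message.decode("utf-8"))
        for txn in group["transactions"]:
            # Emit each gateway response keyed by groupId so the group's
            # responses can be collated again downstream.
            yield group["groupId"], call_gateway(txn)


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadGroups" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/payments-in")
        | "CallGateway" >> beam.ParDo(CallGatewayPerTransaction())
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "CollateByGroup" >> beam.GroupByKey()
        | "Encode" >> beam.Map(lambda kv: json.dumps(
            {"groupId": kv[0], "responses": list(kv[1])}).encode("utf-8"))
        | "WriteResponses" >> beam.io.WriteToPubSub(
            topic="projects/my-project/topics/payment-responses")
    )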

-- I.Tyger
google-cloud-dataflow
kubernetes

2 Answers

9/19/2018

I would go with Dataflow, or the new Streaming Engine, which I assume is something like Apache Spark Streaming or Apache Flink under the hood. The downside to all of this is probably GCP vendor lock-in.

Although there are several tools for Kubernetes and it would probably work fine, there is an extra cost associated with maintaining your environment(s): for example, making sure your pods/deployments run smoothly, and learning/investing in running Spark/Flink streaming on your cluster. Also, Kubernetes has not been battle-tested in many production big-data pipelines. The upside of this solution is no vendor lock-in.

My two cents.

-- Rico
Source: StackOverflow

9/19/2018

By default, Cloud Dataflow can autoscale from 1 to 1000 instances, each with 4 vCPUs, 15 GB of memory and a 420 GB Persistent Disk, so if you have enough quota, you can scale up to 4,000 cores, 15,000 GB of memory and 420 TB of storage.

But there is currently a beta release of the Streaming Engine, which provides more responsive autoscaling in response to variations in incoming data volume by moving pipeline execution out of the worker VMs and into the Cloud Dataflow service backend. Because of this, it works best with smaller worker machine types and uses less storage space.
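If it helps, here is a rough sketch (not a definitive configuration) of how those knobs might look when launching a streaming pipeline with the Beam Python SDK. The project, region, bucket, machine type and worker cap are placeholders, and enable_streaming_engine is the experiment flag used to opt in to the beta:

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder project
    region="us-central1",                 # placeholder region
    temp_location="gs://my-bucket/tmp",   # placeholder bucket
    streaming=True,
    # Autoscaling: Dataflow chooses the worker count between 1 and this cap.
    max_num_workers=1000,
    autoscaling_algorithm="THROUGHPUT_BASED",
    # Streaming Engine beta: execution moves to the Dataflow service
    # backend, so a smaller worker machine type is usually enough.
    machine_type="n1-standard-2",
    experiments=["enable_streaming_engine"],
)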

-- Héctor Neri
Source: StackOverflow