ETL on Google Cloud - (Dataflow vs. Spring Batch) -> BigQuery

8/11/2017

I am considering BigQuery as my data warehouse requirement. Right now, I have my data in google cloud (cloud SQL and BigTable). I have exposed my REST APIs to retrieve data from both. Now, I would like to retrieve data from these APIs, do the ETL and load the data into BigQuery. I am evaluating 2 options of ETL (daily frequency of job for hourly data) right now:-

  1. Use JAVA Spring Batch and create microservice and use Kubernetes as deployment environment. Will it scale?
  2. Use Cloud DataFlow for ETL

Then use BigQuery batch insert API (for initial load) and streaming insert API (for incremental load when new data available in source) to load BigQuery denormalized schema.

Please let me know your opinions.

-- Abhay
google-bigquery
google-cloud-dataflow
kubernetes
microservices
spring-batch

1 Answer

8/21/2017

Without knowing your data volumes, specifically how much new or diff data you have per day and how you are doing paging with your REST APIs - here is my guidance...

If you go down the path of a using Spring Batch you are more than likely going to have to come up with your own sharding mechanism: how will you divide up REST calls to instantiate your Spring services? You will also be in the Kub management space and will have to handle retries with the streaming API to BQ.

If you go down the Dataflow route you will have to write a some transform code to call your REST API and peform the paging to populate your PCollection destined for BQ. With the recent addition of Dataflow templates you could: create a pipeline that is triggered every N hours and parameterize your REST call(s) to just pull data ?since=latestCall. From there you could execute BigQuery writes. I recommend doing this in batch mode as 1) it will scale better if you have millions of rows 2) be less cumbersome to manage (during non-active times).

Since Cloud Dataflow has built in re-try logic for BiqQuery and provides consistency across all input and output collections -- my vote is for Dataflow in this case.

How big are your REST call results in record count?

-- Eric Schmidt
Source: StackOverflow