How to download, transform and upload multiple files in parallel using Google Kubernetes Engine?

6/13/2019

I have a large collection of data stored in a Google Cloud Storage bucket with the following structure: gs://project_garden/plant_logs/2019/01/01/humidity/plant001/hour.gz. I want to create a Kubernetes Job that downloads all of it, parses it, and uploads the parsed files to BigQuery in parallel. So far I have managed to do this locally, without any parallelism, by writing a Python script that takes a date interval as input and loops over each plant, running gsutil -m cp -r to download, gunzip to extract, and pandas to transform. I want to do the same thing in parallel for each plant using Kubernetes. Is it possible to parallelise the process by defining a Job that passes a different plant ID to each pod and downloads the files for that plant?

-- chris_user
google-bigquery
google-cloud-platform
google-cloud-storage
kubernetes
python

1 Answer

6/14/2019

A direct upload from Kubernetes to BigQuery is not possible; you can only load data into BigQuery [1] using the following methods:

  • From Cloud Storage
  • From other Google services, such as Google Ad Manager and Google Ads
  • From a readable data source (such as your local machine)
  • By inserting individual records using streaming inserts
  • Using DML statements to perform bulk inserts
  • Using a BigQuery I/O transform in a Cloud Dataflow pipeline to write data to BigQuery
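As an illustration of the first method (loading from Cloud Storage), here is a minimal sketch using the google-cloud-bigquery Python client. It assumes the parsed files are CSV with a header row and have already been written back to the bucket; the "parsed" prefix, the helper names, and the table ID format are hypothetical and not part of the original question.

```python
def parsed_files_uri(bucket, prefix, plant_id):
    """Build a wildcard URI matching all parsed CSV files for one plant.

    The layout <bucket>/<prefix>/<plant_id>/*.csv is an assumption made
    for this sketch; adjust it to wherever the parsed files are written.
    """
    return f"gs://{bucket}/{prefix}/{plant_id}/*.csv"


def load_parsed_files(uri, table_id):
    """Start a BigQuery load job from Cloud Storage and wait for it to finish.

    table_id is expected in the form "project.dataset.table".
    """
    # Imported here so the URI helper can be used without the client installed.
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumes each file starts with a header row
        autodetect=True,      # let BigQuery infer the schema
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # blocks until the job completes; raises on failure
```

A call such as load_parsed_files(parsed_files_uri("project_garden", "parsed", "plant001"), "my_project.my_dataset.humidity") would then append one plant's files to the table; the project, dataset, and table names here are placeholders.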

As mentioned in the previous comment, the easiest solution would be to upload the data using Dataflow. You can find a template for loading text from Google Cloud Storage (GCS) into BigQuery in link [2].

If you have to use Google Kubernetes Engine (GKE), you will need to perform the following steps:

  1. Read the data from GCS with GKE. You can find an example of how to mount a bucket in your containers in link [3].
  2. Parse the data with your code, as described in your question.
  3. Upload the data from GCS to BigQuery. More info in link [4].
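To give each pod its own plant, one option is a Kubernetes Job in Indexed completion mode, where every pod receives a distinct JOB_COMPLETION_INDEX environment variable. This mode is available in newer Kubernetes releases, so check that your cluster version supports it. The sketch below shows how a worker could map that index to a plant ID and build the GCS prefixes for a date interval, following the path layout from the question. The plant list and all helper names are hypothetical, and the actual download and pandas transform are left out.

```python
import gzip
import os
from datetime import date, timedelta

BUCKET = "project_garden"  # bucket name taken from the question
PLANT_IDS = ["plant001", "plant002", "plant003"]  # hypothetical list of plants


def plant_for_index(index, plant_ids=PLANT_IDS):
    """Map a pod's completion index to the plant it should process."""
    return plant_ids[index]


def day_prefixes(plant_id, start, end, metric="humidity"):
    """List one GCS prefix per day in [start, end] for the given plant."""
    prefixes = []
    day = start
    while day <= end:
        prefixes.append(
            f"gs://{BUCKET}/plant_logs/{day:%Y/%m/%d}/{metric}/{plant_id}/"
        )
        day += timedelta(days=1)
    return prefixes


def read_gzip_lines(raw):
    """Decompress gzipped bytes in memory and return the text lines."""
    return gzip.decompress(raw).decode("utf-8").splitlines()


if __name__ == "__main__":
    # Indexed Jobs set JOB_COMPLETION_INDEX; default to 0 for local runs.
    index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
    plant = plant_for_index(index)
    for prefix in day_prefixes(plant, date(2019, 1, 1), date(2019, 1, 3)):
        print(prefix)  # each prefix would be downloaded and parsed here
```

With this approach the Job's completions count would equal the number of plants, and its parallelism setting controls how many pods run at once.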

[1] https://cloud.google.com/bigquery/docs/loading-data

[2] https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#gcstexttobigquerystream

[3] https://github.com/maciekrb/gcs-fuse-sample

[4] https://cloud.google.com/bigquery/docs/loading-data-cloud-storage

-- Ernesto U
Source: StackOverflow