I have a large collection of data stored in a Google Cloud Storage bucket with the following structure: gs://project_garden/plant_logs/2019/01/01/humidity/plant001/hour.gz. What I want is to make a Kubernetes Job which downloads all of it, parses it and uploads the parsed files to BigQuery in parallel. So far I've managed to do this locally, without any parallelism, by writing a Python script that takes a date interval as input and loops over each of the plants, running gsutil -m cp -r for the download, gunzip for extraction and pandas for the transformation. I want to do the same thing, but in parallel for each plant, using Kubernetes. Is it possible to parallelise the process by defining a Job that passes a different plant ID to each pod and downloads the files for that plant?
A direct upload from Kubernetes to BigQuery is not possible; you can only load data into BigQuery using one of the supported methods listed in [1] (for example, batch load jobs from Cloud Storage or from a local file, streaming inserts, or a Dataflow pipeline).
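For instance, a batch load job from Cloud Storage can be started with the google-cloud-bigquery Python client. This is only a minimal sketch, assuming the parsed files are CSVs already written to a bucket; the project, dataset, table and paths are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination table: project.dataset.table
table_id = "my-project.plant_garden.humidity"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema for this sketch
)

# Load parsed CSV files that were written to GCS beforehand.
load_job = client.load_table_from_uri(
    "gs://project_garden/parsed/plant001/*.csv",
    table_id,
    job_config=job_config,
)
load_job.result()  # block until the load job completes
```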
As mentioned in the previous comment, the easiest solution would be to upload the data using Dataflow; you can find a provided template that loads text files from Google Cloud Storage (GCS) into BigQuery in [2].
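If you go that route, the provided template can also be launched programmatically through the Dataflow REST API. A rough sketch with the Python API client follows; the project, bucket, schema and UDF paths are placeholders, and the exact parameter names should be checked against [2] for the template version you use:

```python
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")

# Launch the provided "Cloud Storage Text to BigQuery" template.
# All values below are placeholders; see [2] for the parameters the
# template actually expects.
request = dataflow.projects().templates().launch(
    projectId="my-project",
    gcsPath="gs://dataflow-templates/latest/GCS_Text_to_BigQuery",
    body={
        "jobName": "plant-logs-to-bigquery",
        "parameters": {
            "inputFilePattern": "gs://project_garden/plant_logs/2019/01/01/humidity/*/*.gz",
            "JSONPath": "gs://my-bucket/schemas/humidity_schema.json",
            "javascriptTextTransformGcsPath": "gs://my-bucket/udf/transform.js",
            "javascriptTextTransformFunctionName": "transform",
            "outputTable": "my-project:plant_garden.humidity",
            "bigQueryLoadingTemporaryDirectory": "gs://my-bucket/tmp",
        },
    },
)
print(request.execute())
```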
If you have to use Google Kubernetes Engine (GKE) you will need to perform the following steps (a per-plant worker along these lines is sketched below):

1. Read the files from the bucket inside your pods, either by mounting the bucket with gcsfuse (see the sample in [3]) or by downloading them with gsutil or the client libraries.
2. Parse and transform the data, then write the resulting files back to GCS.
3. Load the parsed files from GCS into BigQuery [4].
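As a minimal sketch of what each pod could run, assuming the Job spec passes the plant ID and date prefix as environment variables (the env var names, output paths and CSV format below are all assumptions, not part of your setup):

```python
import gzip
import io
import os

import pandas as pd
from google.cloud import storage

# Hypothetical env vars, set per pod in the Kubernetes Job spec.
PLANT_ID = os.environ["PLANT_ID"]        # e.g. "plant001"
DATE_PREFIX = os.environ["DATE_PREFIX"]  # e.g. "2019/01/01"

gcs = storage.Client()
bucket = gcs.bucket("project_garden")

frames = []
prefix = f"plant_logs/{DATE_PREFIX}/humidity/{PLANT_ID}/"
for blob in bucket.list_blobs(prefix=prefix):
    # Download each gzipped hourly file and decompress it in memory.
    raw = gzip.decompress(blob.download_as_bytes())
    # Replace this with your existing pandas transformation; CSV assumed here.
    frames.append(pd.read_csv(io.BytesIO(raw)))

if frames:
    df = pd.concat(frames, ignore_index=True)
    # Step 2: write the parsed output back to GCS so it can be loaded
    # into BigQuery with a load job as in [4] (see the sketch above).
    out_blob = bucket.blob(f"parsed/{DATE_PREFIX}/{PLANT_ID}.csv")
    out_blob.upload_from_string(df.to_csv(index=False), content_type="text/csv")
```

Running this in parallel is then a matter of creating one Job (or one pod template) per plant ID that sets these environment variables, for example generated from a manifest template in your deployment script.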
[1] https://cloud.google.com/bigquery/docs/loading-data
[2] https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#gcstexttobigquerystream
[3] https://github.com/maciekrb/gcs-fuse-sample
[4] https://cloud.google.com/bigquery/docs/loading-data-cloud-storage