I have a large collection of data stored in a Google Cloud Storage bucket with the following structure: gs://project_garden/plant_logs/2019/01/01/humidity/plant001/hour.gz. What I want is to make a Kubernetes Job which downloads all of it, parses it and uploads the parsed files to BigQuery in parallel. So far I've managed to do this locally, without any parallelism, by writing a Python script that takes a date interval as input and loops over each of the plants, running gsutil -m cp -r for the download, gunzip for extraction and pandas for the transformation. I want to do the same thing, but in parallel for each plant, using Kubernetes. Is it possible to parallelise the process by defining a Job that passes a different plant ID to each pod and downloads the files for that plant?
A direct upload from Kubernetes to BigQuery is not possible; you can only load data into BigQuery using one of the supported methods listed in [1] (for example, batch load jobs from Cloud Storage or from a local file, streaming inserts, or a Dataflow pipeline).
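For instance, a batch load job from Cloud Storage can be started with the google-cloud-bigquery Python client. This is only a minimal sketch, assuming the parsed files are CSVs already written to a bucket; the project, dataset, table and paths are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination table: project.dataset.table
table_id = "my-project.plant_garden.humidity"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema for this sketch
)

# Load parsed CSV files that were written to GCS beforehand.
load_job = client.load_table_from_uri(
    "gs://project_garden/parsed/plant001/*.csv",
    table_id,
    job_config=job_config,
)
load_job.result()  # block until the load job completes
```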
As mentioned in the previous comment, the easiest solution would be to upload the data using Dataflow; you can find a provided template that loads text files from Google Cloud Storage (GCS) into BigQuery in [2].
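If you go that route, the provided template can also be launched programmatically through the Dataflow REST API. A rough sketch with the Python API client follows; the project, bucket, schema and UDF paths are placeholders, and the exact parameter names should be checked against [2] for the template version you use:

```python
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")

# Launch the provided "Cloud Storage Text to BigQuery" template.
# All values below are placeholders; see [2] for the parameters the
# template actually expects.
request = dataflow.projects().templates().launch(
    projectId="my-project",
    gcsPath="gs://dataflow-templates/latest/GCS_Text_to_BigQuery",
    body={
        "jobName": "plant-logs-to-bigquery",
        "parameters": {
            "inputFilePattern": "gs://project_garden/plant_logs/2019/01/01/humidity/*/*.gz",
            "JSONPath": "gs://my-bucket/schemas/humidity_schema.json",
            "javascriptTextTransformGcsPath": "gs://my-bucket/udf/transform.js",
            "javascriptTextTransformFunctionName": "transform",
            "outputTable": "my-project:plant_garden.humidity",
            "bigQueryLoadingTemporaryDirectory": "gs://my-bucket/tmp",
        },
    },
)
print(request.execute())
```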
If you have to use Google Kubernetes Engine (GKE) you will need to perform the following steps (a per-plant worker along these lines is sketched below):

1. Read the files from the bucket inside your pods, either by mounting the bucket with gcsfuse (see the sample in [3]) or by downloading them with gsutil or the client libraries.
2. Parse and transform the data, then write the resulting files back to GCS.
3. Load the parsed files from GCS into BigQuery [4].
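As a minimal sketch of what each pod could run, assuming the Job spec passes the plant ID and date prefix as environment variables (the env var names, output paths and CSV format below are all assumptions, not part of your setup):

```python
import gzip
import io
import os

import pandas as pd
from google.cloud import storage

# Hypothetical env vars, set per pod in the Kubernetes Job spec.
PLANT_ID = os.environ["PLANT_ID"]        # e.g. "plant001"
DATE_PREFIX = os.environ["DATE_PREFIX"]  # e.g. "2019/01/01"

gcs = storage.Client()
bucket = gcs.bucket("project_garden")

frames = []
prefix = f"plant_logs/{DATE_PREFIX}/humidity/{PLANT_ID}/"
for blob in bucket.list_blobs(prefix=prefix):
    # Download each gzipped hourly file and decompress it in memory.
    raw = gzip.decompress(blob.download_as_bytes())
    # Replace this with your existing pandas transformation; CSV assumed here.
    frames.append(pd.read_csv(io.BytesIO(raw)))

if frames:
    df = pd.concat(frames, ignore_index=True)
    # Step 2: write the parsed output back to GCS so it can be loaded
    # into BigQuery with a load job as in [4] (see the sketch above).
    out_blob = bucket.blob(f"parsed/{DATE_PREFIX}/{PLANT_ID}.csv")
    out_blob.upload_from_string(df.to_csv(index=False), content_type="text/csv")
```

Running this in parallel is then a matter of creating one Job (or one pod template) per plant ID that sets these environment variables, for example generated from a manifest template in your deployment script.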
[1] https://cloud.google.com/bigquery/docs/loading-data
[2] https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#gcstexttobigquerystream
[3] https://github.com/maciekrb/gcs-fuse-sample
[4] https://cloud.google.com/bigquery/docs/loading-data-cloud-storage