Running data processing tasks on Google Buckets in GCP

10/3/2018

We have a lot of big files (on the order of gigabytes) in our Google Cloud Storage bucket. I would like to process these files and generate new ones. To be specific, they are JSON files from which I want to extract one field and join several files into one.

I could write some scripts running as pods in Kubernetes that connect to the bucket and stream the data from there and back. But I find that ugly. Is there something made specifically for processing data in buckets?

-- Vojtěch
google-cloud-platform
google-cloud-storage
kubernetes

1 Answer

10/3/2018

Smells like a Big Data problem.

Use Big Data software like Apache Spark to process the huge files. Since the data is already in Google Cloud, I would recommend Google Cloud Dataproc. Big Data on K8S is still a work in progress, so I would recommend leaving K8S aside for now and perhaps revisiting Big Data on K8S down the line. More on Big Data on K8S (here and here).
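As a rough illustration of what the Spark job could look like, here is a minimal PySpark sketch. The bucket paths and the field name (user_id) are placeholders for whatever your actual data uses; on a Dataproc cluster the GCS connector is preinstalled, so gs:// paths can be read and written directly.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("extract-and-join").getOrCreate()

    # Read all the JSON files from the bucket (hypothetical path).
    df = spark.read.json("gs://my-bucket/input/*.json")

    # Keep only the field of interest (hypothetical field name).
    result = df.select("user_id")

    # coalesce(1) merges everything into a single output file; drop it if the
    # extracted data is still too large for one file.
    result.coalesce(1).write.mode("overwrite").json("gs://my-bucket/output/")

    spark.stop()

A script like this could then be submitted to the cluster with something like gcloud dataproc jobs submit pyspark extract.py --cluster=my-cluster.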

With your solution (K8S and hand-written code), all the fault tolerance has to be handled manually. With Apache Spark, fault tolerance (a node going down, network failures, etc.) is taken care of automatically.

To conclude, I would recommend forgetting about K8S for now and focusing on Big Data tooling to solve the problem.

-- Praveen Sripati
Source: StackOverflow