I have about 100 million json files (10 TB), each with a particular field containing a bunch of text, for which I would like to perform a simple substring search and return the filenames of all the relevant json files. They're all currently stored on Google Cloud Storage. Normally for a smaller number of files I might just spin up a VM with many CPUs and run multiprocessing via Python, but alas this is a bit too much.
I want to avoid spending too much time setting up infrastructure like a Hadoop server, or loading all of that into some MongoDB database. My question is: what would be a quick and dirty way to perform this task? My original thoughts were to set up something on Kubernetes with some parallel processing running Python scripts, but I'm open to suggestions and don't really have a clue how to go about this.
Easier would be to just load the GCS data into Big Query and just run your query from there.
Send your data to AWS S3 and use Amazon Athena.
The Kubernetes option would be set up a cluster in GKE and install Presto in it with a lot of workers, use a hive metastore with GCS and query from there. (Presto doesn't have direct GCS connector yet, afaik) -- This option seems more elaborate.
Hope it helps!