What is the best way to get (stream) data from BigQuery to R (RStudio Server in Docker)

5/14/2015

I have a number of large tables in Google BigQuery containing data to be processed in R. I am running RStudio via Docker on Google Cloud Platform, using Container Engine.

I have tested a few routes with a table of 38 million rows (three columns), 862 MB in size in BigQuery.

The first route I tested was using the R package bigrquery. This option was preferred, as data can be queried directly from BigQuery and data acquisition can be incorporated in R loops. Unfortunately, this option is very slow: loading the table takes close to an hour.
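For reference, a minimal sketch of this route (using bigrquery's query_exec() interface as it was around this time; argument names and order differ between package versions, and the project id and table name are placeholders):

    # Pull the whole table through the BigQuery REST API with bigrquery.
    library(bigrquery)

    project <- "my-gcp-project"  # placeholder project id
    sql <- "SELECT * FROM [mydataset.mytable]"

    # max_pages = Inf fetches all result pages instead of only the first few
    df <- query_exec(sql, project = project, max_pages = Inf)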

The second option I tried was exporting the BigQuery table to a CSV file on Google Cloud Storage (approx. 1 minute) and using the public link to import it into RStudio (another 5 minutes). This route entails quite some manual handling, which is undesirable to say the least.
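The manual steps could be scripted, though. A sketch, assuming the Cloud SDK (the bq and gsutil command-line tools) is installed and authenticated inside the container; the bucket and table names are placeholders:

    # Export the table to Cloud Storage, copy it locally, and read it in.
    # The wildcard lets BigQuery shard large exports into multiple files.
    system("bq extract mydataset.mytable gs://my-bucket/mytable-*.csv")
    system("gsutil cp 'gs://my-bucket/mytable-*.csv' /tmp/")

    files <- list.files("/tmp", pattern = "^mytable-.*\\.csv$", full.names = TRUE)
    df <- do.call(rbind, lapply(files, read.csv))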

In the Google Cloud Console I noticed that VM instances can be granted access to BigQuery (via API access scopes). RStudio can also be configured to have root access in its Docker container.
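One way to verify from inside the container what the instance's service account is allowed to do is to ask the GCE metadata server (the endpoint and the Metadata-Flavor header are standard on Compute Engine; httr is assumed to be installed):

    # List the OAuth scopes granted to the VM's default service account.
    library(httr)
    res <- GET(
      "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes",
      add_headers("Metadata-Flavor" = "Google")
    )
    cat(content(res, as = "text"))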

So, finally, my question: is there a way to use this backdoor to enable fast, automated data transfer from BigQuery into an R data frame? Or are there other ways to achieve this goal?

Any help is highly appreciated!


Edit:

I have loaded the same table into a MySQL database hosted in Google Cloud SQL; this time it took only about 20 seconds to load the same amount of data into R. So some kind of pipeline from BigQuery through Cloud SQL could be an option too.
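A sketch of reading via this route, assuming the RMySQL package and a Cloud SQL instance reachable from the container; host, credentials, and table name are placeholders:

    # Read the table from Cloud SQL over a plain MySQL connection.
    library(RMySQL)

    con <- dbConnect(MySQL(),
                     host     = "173.194.xxx.xxx",  # Cloud SQL IP (placeholder)
                     user     = "rstudio",
                     password = "secret",
                     dbname   = "mydb")
    df <- dbGetQuery(con, "SELECT * FROM mytable")
    dbDisconnect(con)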

-- Guus
docker
google-bigquery
google-kubernetes-engine
r
rstudio-server
