How to put a dataset on a gcloud Kubernetes cluster?

4/5/2018

I have a Kubernetes cluster initialized on Google Cloud, and I'm using a Dask Client on my local machine to connect to it, but I can't find any documentation on how to get my dataset onto the cluster.
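For context, the connection is just a dask.distributed Client pointed at the scheduler the cluster exposes, roughly like this (the scheduler address is a placeholder):

from dask.distributed import Client

# connect to the Dask scheduler service exposed by the Kubernetes cluster
client = Client('tcp://<scheduler-external-ip>:8786')
print(client)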

I originally tried just running the computation with the dataset loaded into my local machine's RAM, but that means every partition gets shipped over the network to the workers, and the cluster only runs at about 2% CPU utilization while performing the task.

Is there a way to put the dataset onto the Kubernetes cluster so I can get 100% CPU utilization?

-- Brendan Martin
dask
dask-distributed
google-cloud-platform
kubernetes

1 Answer

4/5/2018

Many people store data on a cloud object store, such as Amazon S3 or Google Cloud Storage.

If you're interested in Dask in particular, these data stores are supported in most of the data ingestion functions by using a protocol prefix like the following:

import dask.dataframe as dd

# the gcs:// prefix tells Dask to read directly from Google Cloud Storage
df = dd.read_csv('gcs://bucket/2018-*-*.csv')

You will also need the relevant Python library installed to access this cloud storage (gcsfs in this case). See http://dask.pydata.org/en/latest/remote-data-services.html#known-storage-implementations for more information.
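As a rough sketch (the bucket name, file pattern, and scheduler address are placeholders), you would install gcsfs alongside Dask on both the client and the worker images, have the workers read the CSVs straight from GCS, and then persist the result so the data lives in the cluster's distributed memory instead of being shipped from your laptop:

# pip install dask gcsfs  -- on the client and in every worker image
import dask.dataframe as dd
from dask.distributed import Client

client = Client('tcp://<scheduler-address>:8786')  # placeholder address

# workers read the files directly from Google Cloud Storage
df = dd.read_csv('gcs://bucket/2018-*-*.csv')

# persist keeps the loaded partitions in the cluster's memory for later work
df = client.persist(df)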

-- MRocklin
Source: StackOverflow