I have a Google Cloud Kubernetes cluster initialized, and I'm using a Dask Client on my local machine to connect to it, but I can't find any documentation on how to upload my dataset to the cluster.
I originally tried loading the dataset into my local RAM and running Dask from there, but that sends the data over the network, and the cluster only runs at about 2% utilization while performing the task.
Is there a way to put the dataset onto the Kubernetes cluster so I can get 100% CPU utilization?
Many people store their data on a cloud object store, such as Amazon S3 or Google Cloud Storage.
For Dask in particular, these stores are supported by most of the data ingestion functions through a URL protocol prefix, like the following:
import dask.dataframe as dd

# The gcs:// prefix tells Dask to read the files directly from Google Cloud Storage
df = dd.read_csv('gcs://bucket/2018-*-*.csv')
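As a rough sketch of how this plays out with your Kubernetes cluster (the scheduler address below is a placeholder for wherever your Dask scheduler is exposed), you can persist the dataframe so the workers pull their partitions straight from GCS into cluster memory instead of shipping data from your laptop:

from dask.distributed import Client
import dask.dataframe as dd

client = Client('tcp://<scheduler-address>:8786')  # placeholder: your cluster's scheduler endpoint

df = dd.read_csv('gcs://bucket/2018-*-*.csv')
df = client.persist(df)  # workers read their own partitions from GCS and keep them in cluster memory

After the persist completes, subsequent computations run against data already resident on the workers, which is what you need to get the CPUs fully busy.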
You will also need to have the relevant Python library installed to access the cloud store (gcsfs in this case). See http://dask.pydata.org/en/latest/remote-data-services.html#known-storage-implementations for more information.
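If authentication comes up, the storage_options argument of the ingestion functions is forwarded to gcsfs. A minimal sketch, assuming a publicly readable bucket ('anon'); a private bucket would need real credentials passed the same way (see the gcsfs docs for the accepted token values):

import dask.dataframe as dd

# storage_options is passed through to gcsfs; 'anon' is enough for public data,
# private buckets need real credentials (e.g. a service-account key) here instead.
df = dd.read_csv('gcs://bucket/2018-*-*.csv',
                 storage_options={'token': 'anon'})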