What is the suggested workflow when working on a Kubernetes cluster using Dask?

3/6/2019

I have set up a Kubernetes cluster using Kubernetes Engine on GCP to work on some data preprocessing and modelling using Dask. I installed Dask using Helm following these instructions.

Right now, I see that there are two folders, work and examples.

I was able to execute the contents of the notebooks in the examples folder, confirming that everything is working as expected.

My questions now are as follows:

  • What is the suggested workflow to follow when working on a cluster? Should I just create a new notebook under work and begin prototyping my data preprocessing scripts?
  • How can I ensure that my work doesn't get erased whenever I upgrade my Helm deployment? Would you just manually move the files to a bucket every time you upgrade (which seems tedious)? Or would you create a simple VM instance, prototype there, then move everything to the cluster when running on the full dataset?

I'm new to working with data in a distributed environment in the cloud so any suggestions are welcome.

-- PollPenn
dask
kubernetes

1 Answer

3/10/2019

What is the suggested workflow to follow when working on a cluster?

There are many workflows that work well for different groups. There is no single blessed workflow.

Should I just create a new notebook under work and begin prototyping my data preprocessing scripts?

Sure, that would be fine.

How can I ensure that my work doesn't get erased whenever I upgrade my Helm deployment?

You might save your data to some more permanent store, like cloud storage, or a git repository hosted elsewhere.
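As a rough sketch of the cloud storage option, the notebook folder could be mirrored to a bucket with `gsutil rsync` before each upgrade. This assumes `gsutil` is installed and authenticated inside the notebook container, which the question does not confirm. The folder name `work` comes from the question; the bucket URI and the function names are hypothetical placeholders.

```python
import subprocess


def build_sync_command(local_dir, bucket_uri):
    """Build a gsutil command that mirrors local_dir into bucket_uri.

    -m runs transfers in parallel; rsync -r recurses into subfolders.
    """
    return ["gsutil", "-m", "rsync", "-r", local_dir, bucket_uri]


def backup(local_dir="work", bucket_uri="gs://my-bucket/work"):
    """Copy the notebook folder to the bucket (bucket name is a placeholder).

    Assumes gsutil is on PATH and has write access to the bucket.
    """
    subprocess.run(build_sync_command(local_dir, bucket_uri), check=True)
```

Calling `backup()` from a notebook cell before running `helm upgrade` would keep a copy of the work outside the pod. Pushing the same folder to a git repository is an equally reasonable alternative.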

Would you just manually move them to a bucket every time you upgrade (which seems tedious)?

Yes, that would work (and yes, it is tedious).

Or would you create a simple VM instance, prototype there, then move everything to the cluster when running on the full dataset?

Yes, that would also work.
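One way to make that move painless is to write the code so that only the scheduler address changes between the VM and the cluster. The sketch below assumes the cluster exposes its scheduler address through a `DASK_SCHEDULER_ADDRESS` environment variable; check your own deployment, since that is an assumption here. The helper names are hypothetical.

```python
import os


def scheduler_address(env=None):
    """Return the remote scheduler address if one is configured, else None.

    DASK_SCHEDULER_ADDRESS is assumed to be set on the cluster and unset on the VM.
    """
    env = os.environ if env is None else env
    return env.get("DASK_SCHEDULER_ADDRESS") or None


def make_client():
    """Connect to the remote scheduler if configured, otherwise start a local cluster."""
    # Imported lazily so the helper above can be used without dask installed.
    from dask.distributed import Client

    address = scheduler_address()
    return Client(address) if address else Client()
```

With this in place, the same preprocessing notebook can run against a local cluster on a small sample and against the Kubernetes workers on the full dataset.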

In Summary

The Helm chart includes a Jupyter notebook server for convenience and easy testing, but it is no substitute for a full-fledged, long-term, persistent productivity suite. For that you might consider a project like JupyterHub (which handles the problems you list above) or one of the many enterprise-targeted variants on the market today. It would be easy to use Dask alongside any of those.

-- MRocklin
Source: StackOverflow