I have set up a Kubernetes cluster using Kubernetes Engine on GCP to work on some data preprocessing and modelling using Dask. I installed Dask using Helm following these instructions.
Right now, I see that there are two folders, work
and examples
I was able to execute the contents of the notebooks in the example
folder confirming that everything is working as expected.
My questions now are as follows
work
and begin prototyping my data preprocessing scripts?I'm new to working with data in a distributed environment in the cloud so any suggestions are welcome.
What are the suggested workflow to follow when working on a cluster?
There are many workflows that work well for different groups. There is no single blessed workflow.
Should I just create a new notebook under work and begin prototyping my data preprocessing scripts?
Sure, that would be fine.
How can I ensure that my work doesn't get erased whenever I upgrade my Helm deployment?
You might save your data to some more permanent store, like cloud storage, or a git repository hosted elsewhere.
Would you just manually move them to a bucket every time you upgrade (which seems tedious)?
Yes, that would work (and yes, it is)
or would you create a simple vm instance, prototype there, then move everything to the cluster when running on the full dataset?
Yes, that would also work.
The Helm chart includes a Jupyter notebook server for convenience and easy testing, but it is no substitute for a full fledged long-term persistent productivity suite. For that you might consider a project like JupyterHub (which handles the problems you list above) or one of the many enterprise-targeted variants on the market today. It would be easy to use Dask alongside any of those.