What is the suggested workflow when working on a Kubernetes cluster using Dask?

3/6/2019

I have set up a Kubernetes cluster using Kubernetes Engine on GCP to work on some data preprocessing and modelling using Dask. I installed Dask using Helm following these instructions.

Right now, I see that there are two folders, work and examples.

I was able to execute the contents of the notebooks in the examples folder, confirming that everything is working as expected.

My questions now are as follows:

What is the suggested workflow to follow when working on a cluster? Should I just create a new notebook under work and begin prototyping my data preprocessing scripts?
How can I ensure that my work doesn't get erased whenever I upgrade my Helm deployment? Would you just manually move the notebooks to a bucket every time you upgrade (which seems tedious)? Or would you create a simple VM instance, prototype there, then move everything to the cluster when running on the full dataset?

I'm new to working with data in a distributed environment in the cloud so any suggestions are welcome.

-- PollPenn

dask

kubernetes

1 Answer

3/10/2019

What is the suggested workflow to follow when working on a cluster?

There are many workflows that work well for different groups. There is no single blessed workflow.

Should I just create a new notebook under work and begin prototyping my data preprocessing scripts?

Sure, that would be fine.

How can I ensure that my work doesn't get erased whenever I upgrade my Helm deployment?

You might save your data to some more permanent store, like cloud storage, or a git repository hosted elsewhere.

Would you just manually move them to a bucket every time you upgrade (which seems tedious)?

Yes, that would work (and yes, it is tedious).
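If you go this route, a small script can take some of the tedium out of it. The following is only a minimal sketch, not something the Helm chart provides: `backup_notebooks` is a hypothetical helper that copies every notebook under a source folder into a destination folder, which is assumed to be somewhere that survives an upgrade (a mounted bucket, or a local clone of a git repository you then push).

```python
import shutil
from pathlib import Path


def backup_notebooks(src_dir, dest_dir):
    """Copy every .ipynb file under src_dir into dest_dir, keeping subfolders.

    Hypothetical helper: both paths are assumptions about your own setup.
    Returns the list of files written.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    copied = []
    for notebook in sorted(src.rglob("*.ipynb")):
        rel = notebook.relative_to(src)
        # Skip Jupyter's autosave copies.
        if ".ipynb_checkpoints" in rel.parts:
            continue
        target = dest / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(notebook, target)
        copied.append(target)
    return copied
```

You could call something like `backup_notebooks("work", "/path/to/persistent/backup")` before each upgrade, where the destination path is a placeholder for whatever persistent location you choose.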

or would you create a simple vm instance, prototype there, then move everything to the cluster when running on the full dataset?

Yes, that would also work.

In Summary

The Helm chart includes a Jupyter notebook server for convenience and easy testing, but it is no substitute for a full-fledged, long-term, persistent productivity suite. For that you might consider a project like JupyterHub (which handles the problems you list above) or one of the many enterprise-targeted variants on the market today. It would be easy to use Dask alongside any of those.
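As a rough illustration of what "alongside" could look like, a notebook running anywhere that can reach the cluster can connect to the Dask scheduler with `dask.distributed.Client`. This is a sketch under assumptions: `scheduler_address` and `connect` are hypothetical helpers, and the host name you pass in depends on how your own deployment exposes the scheduler service. Port 8786 is Dask's default scheduler port.

```python
def scheduler_address(host, port=8786):
    """Build a Dask scheduler address; 8786 is Dask's default scheduler port."""
    return f"tcp://{host}:{port}"


def connect(host, port=8786):
    """Return a dask.distributed Client connected to the given scheduler.

    The host is an assumption about your deployment, e.g. the name or IP
    of the scheduler service exposed by your cluster.
    """
    # Imported lazily so the address helper works without Dask installed.
    from dask.distributed import Client

    return Client(scheduler_address(host, port))
```

From a JupyterHub session you would then call `connect(...)` with your scheduler's host and submit work through the returned client as usual.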

-- MRocklin

Source: StackOverflow
