I am testing dask.distributed for big data and machine learning related tasks. I've watched videos, read blog posts, and tried to understand the library documentation. But I am confused: every source I found used Jupyter Notebook/Lab/Hub. Do I have to use Jupyter Notebook/Lab/Hub in order to run Dask on a Kubernetes cluster? Can't I build a Kubernetes cluster with 2 laptops and run Dask on them without anything Jupyter-related?
Why? Because I want to use my own server (a Kubernetes cluster) to serve users my own web page (Flask in the background).
No, you don't. Jupyter is just the most common setup for working with Dask, and JupyterLab has nice extensions for visualizing task graphs as they execute. But for just orchestrating Dask workers on Kubernetes, I'd have a look at dask-kubernetes. That's the library we're using at Saturn Cloud to deploy Dask for our enterprise customers.
These lines from the docs should be sufficient to get you started:
from dask_kubernetes import KubeCluster
cluster = KubeCluster.from_yaml('worker-spec.yml')
cluster.adapt(minimum=1, maximum=100)  # or dynamically scale based on current workload
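If you'd rather keep everything in Python instead of maintaining a separate worker-spec.yml, dask_kubernetes also provides make_pod_spec for building the worker pod spec inline. A minimal sketch (the image tag and resource sizes below are placeholder assumptions, not recommendations):

from dask_kubernetes import KubeCluster, make_pod_spec

# Build the worker pod spec in Python rather than a YAML file.
# Image and resource values here are illustrative placeholders.
pod_spec = make_pod_spec(
    image='daskdev/dask:latest',
    memory_limit='4G', memory_request='4G',
    cpu_limit=1, cpu_request=1,
)
cluster = KubeCluster(pod_spec)
cluster.adapt(minimum=1, maximum=100)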
It's important to understand that KubeCluster works by attaching a PeriodicCallback to the asyncio event loop, which means you definitely want to make sure it doesn't get garbage collected. You can pass the cluster instance directly into the distributed Client, or grab the scheduler address and communicate that way.
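A quick sketch of both options:

from dask.distributed import Client

# Option 1: hand the cluster object to the client directly;
# the client holds a reference, so the cluster isn't garbage collected
client = Client(cluster)

# Option 2: connect by address (make sure `cluster` itself stays referenced)
client = Client(cluster.scheduler_address)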
I see no Jupyter notebooks here. Jupyter notebooks are convenient for data science folks, but they're not a requirement for using these tools. You can still import dask.distributed into your Flask application like any other Python package, containerize it, and ship it to your Kubernetes cluster as a service. It's all up to you as the developer.
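As a rough sketch of that idea, assuming a Dask scheduler is already running as a service in your cluster (the scheduler address, the square() task, and the route below are all made-up placeholders):

from flask import Flask, jsonify
from dask.distributed import Client

app = Flask(__name__)

# One long-lived client per process; the scheduler service name and
# port are placeholders for whatever your cluster actually exposes.
client = Client('tcp://dask-scheduler:8786')

def square(x):
    return x * x

@app.route('/square/<int:x>')
def compute_square(x):
    future = client.submit(square, x)        # run on the Dask workers
    return jsonify(result=future.result())   # block until the result arrives

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)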