Can Dask on KubeCluster/HelmCluster distribute compute/data to containers that transform and return data for subsequent processing?

6/2/2021

I'm still trying to understand how Dask and Kubernetes can work with other containers, so I was hoping someone could say, in general terms, whether the following could work. Specifically, I don't have a clear picture of if/how Dask distributed can distribute data to spun-up pods in the cluster running another container that parses that data and returns it back to Dask for subsequent functions. Here, the 'other' containers are compiled programs that transform the data.

Something akin to the below:

import dask
from dask import delayed, compute
from dask.distributed import Client
from dask_kubernetes import KubeCluster

cluster = KubeCluster('worker-spec.yml')
cluster.scale(10)
client = Client(cluster)

@delayed
def run_transformer(raw_data):
    # Hand the raw data to the transformer container and collect the result
    transformed_data = run_transformer_container(raw_data)
    return transformed_data

@delayed
def upload_to_s3(transformed_data):
    success = True
    [...]
    return success

raw_data = [string1, string2, ..., stringN]

# Build the task graph: each item is transformed, then its result is uploaded
output = []
for x in raw_data:
    f = run_transformer(x)
    g = upload_to_s3(f)
    output.append(g)

# Execute on the cluster; results are the success flags returned by upload_to_s3
results = compute(*output)

The idea is that Dask delayed handles distributing the N tasks to the 10 worker nodes. Each worker then passes the raw_data contents (likely a string or possibly a pickled object) to a spooled-up pod on that worker node, whose container ingests and transforms the data and returns the parsed result (via the unspecified run_transformer_container function, however that would work) before it is uploaded to S3.
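For what it's worth, the only concrete mechanism I can picture for run_transformer_container (purely an assumption on my part) is baking the compiled transformer into the Dask worker image and shelling out to it from inside the delayed task, rather than launching a separate pod per task. The binary path and the stdin/stdout interface below are hypothetical placeholders:

import subprocess

def run_transformer_container(raw_data):
    # Hypothetical: assumes the compiled transformer is installed at
    # /opt/transformer/transform inside the worker image, reads raw data
    # from stdin, and writes the transformed data to stdout.
    result = subprocess.run(
        ['/opt/transformer/transform'],
        input=raw_data.encode('utf-8'),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode('utf-8')

If that assumption is wrong and the transformer really has to live in its own pod, I'm not sure how the worker would hand data to it, which is the heart of my question.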

-- StuckOnDiffyQ
dask
dask-distributed
dask-kubernetes
docker
kubernetes

0 Answers