I'm still trying to understand how Dask and Kubernetes can work with other containers, so I was hoping someone could say, in general terms, whether the following could work. Specifically, I don't quite understand if/how Dask distributed can distribute data to pods spun up in the cluster that run another container, which parses that data and returns it to Dask for subsequent functions. Here, the 'other' containers are compiled programs that transform the data.
Something akin to the below:
import dask
from dask import delayed, compute
from dask.distributed import Client
from dask_kubernetes import KubeCluster

cluster = KubeCluster('worker-spec.yml')
cluster.scale(10)
client = Client(cluster)

@delayed
def run_transformer(raw_data):
    # Hand the raw data to the transformer container and get the result back
    transformed_data = run_transformer_container(raw_data)
    return transformed_data

@delayed
def upload_to_s3(transformed_data):
    # Push the transformed data to S3 (see the sketch further down)
    success = True
    [...]
    return success

raw_data = [string1, string2, ..., stringN]

output = []
for x in raw_data:
    f = run_transformer(x)
    g = upload_to_s3(f)
    output.append(g)

transformed_data = compute(output)
Here, the idea is that Dask delayed handles distributing the N tasks to the 10 worker nodes. Each worker would then pass the raw_data contents (likely a string or, possibly, a pickled object) to a pod spun up on that worker node, whose container would ingest and transform the data and return the parsed result (via the unspecified run_transformer_container function, however that would work) before it is uploaded to S3.
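To make this more concrete, the simplest version I can imagine skips the separate pod entirely and bakes the compiled transformer into the Dask worker image, so run_transformer_container just shells out to the binary on whichever worker the task lands on. This is only a rough sketch of that idea; the /opt/transformer/transform path and the stdin/stdout contract are placeholders I made up:

import subprocess

def run_transformer_container(raw_data):
    """Run the compiled transformer over raw_data and return its output.

    Assumes the transformer binary is baked into the Dask worker image at
    /opt/transformer/transform (placeholder path) and that it reads raw
    data on stdin and writes the transformed data to stdout.
    """
    result = subprocess.run(
        ["/opt/transformer/transform"],      # placeholder binary path
        input=raw_data.encode("utf-8"),      # raw_data assumed to be a string
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,                          # raise if the transformer exits non-zero
    )
    return result.stdout.decode("utf-8")

If the transformer really does have to run in its own pod, then I assume the task would instead need to talk to the Kubernetes API (or to a long-running transformer service) from inside the worker, which is a lot more plumbing than the sketch above.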
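For completeness, here is one possible body for the upload_to_s3 step referenced in the code above, assuming boto3 is available on the workers; the bucket and key are placeholders:

import boto3
from dask import delayed

@delayed
def upload_to_s3(transformed_data):
    # Bucket and key are placeholders; in practice they would probably be
    # derived from the input (e.g. an ID carried alongside the data).
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="my-bucket",
        Key="transformed/some-key",
        Body=transformed_data.encode("utf-8"),
    )
    return True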