`dask-kubernetes` scheduler - worker on AWS

3/12/2018

I've been trying to set up a dask.distributed cluster using kubernetes. Setting up the kube cluster itself is pretty straightforward; the problem I'm currently struggling with is that I can't get the local scheduler to connect to the workers. The workers can connect to the scheduler, but they advertise an address inside the kube network that is not accessible to the scheduler, which runs outside that network.

Following the examples from the dask-kubernetes docs, I got a kube cluster running on AWS and (on a separate AWS machine) started a notebook with the local dask.distributed scheduler. The scheduler launches a number of workers on the kube cluster, but it cannot connect to said workers because the workers are on a different network: the internal kube network.

The network setup looks like the following:

  • notebook server running on 192.168.0.0/24
  • kube cluster EC2 instances also on 192.168.0.0/24
  • kube pods on 100.64.0.0/16

The dask scheduler runs on 192.168.0.0/24 but the dask workers are on 100.64.0.0/16 - how do I connect the two? Should I run the scheduler in a kube pod as well, edit routing tables, or try to figure out the host machines' IP addresses from the workers?

The workers are able to connect to the scheduler, but on the scheduler side I get errors of the form

distributed.scheduler - ERROR - Failed to connect to worker 'tcp://100.96.2.4:40992': Timed out trying to connect to 'tcp://100.96.2.4:40992' after 3.0 s: connect() didn't finish in time
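
To illustrate, a plain-socket check from the notebook machine (using the worker address reported in the error above) shows the same unreachability, independent of dask:

import socket

# Illustrative check from the notebook machine, using the worker
# address and port from the scheduler error above.
addr = ('100.96.2.4', 40992)
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(3.0)
try:
    sock.connect(addr)
    print('worker reachable')
except OSError as exc:
    print('worker unreachable:', exc)
finally:
    sock.close()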

I'm not looking for a list of possible things I could do; I'm looking for the recommended way of setting this up, specifically in relation to dask.distributed.

I set up the kube cluster using kops.

https://dask-kubernetes.readthedocs.io/en/latest/

-- Matti Lyra
amazon-web-services
dask
dask-kubernetes
dask.distributed
kubernetes

1 Answer

3/12/2018

I've typically used dask-kubernetes from within the Kubernetes cluster, though obviously this isn't ideal for everyone.
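
For reference, the in-cluster pattern from the dask-kubernetes docs looks roughly like this (a sketch assuming a worker pod spec saved as worker-spec.yml; from inside the cluster the pod network is directly routable, so none of the address questions below come up):

from dask.distributed import Client
from dask_kubernetes import KubeCluster

# Sketch of the in-cluster setup: the worker pod template is assumed
# to live in worker-spec.yml, per the dask-kubernetes docs.
cluster = KubeCluster.from_yaml('worker-spec.yml')
cluster.scale(3)  # launch three worker pods

client = Client(cluster)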

Networks can vary. My guess is that the address the scheduler chooses by default is not visible from your Kubernetes network. If you do have an address that your workers can connect to, you can specify it with the ip= keyword argument:

from dask_kubernetes import KubeCluster

cluster = KubeCluster(ip='scheduler-address-visible-to-workers')

If there is a network interface that you know to be visible, you can generalize this as follows:

from distributed.utils import get_ip_interface

ip = get_ip_interface('eth0')  # replace 'eth0' with your visible network interface
cluster = KubeCluster(ip=ip)   # pass the discovered address through to the cluster

On UNIX-based systems you can usually find a list of suitable interfaces with the ifconfig command. Look through that list for an address similar to the addresses you're seeing on the workers.
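
If you'd rather stay in Python, a small sketch using psutil (which distributed already depends on, and uses internally for interface lookups) lists each interface with its IPv4 addresses:

import socket

import psutil  # already installed wherever distributed is

# Print each interface name with its IPv4 addresses; pick the one
# whose address looks routable from the worker/pod network.
for name, addrs in psutil.net_if_addrs().items():
    for addr in addrs:
        if addr.family == socket.AF_INET:
            print(name, addr.address)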

If neither of these is possible then I recommend raising an issue at https://github.com/dask/dask-kubernetes/issues/new

-- MRocklin
Source: StackOverflow