I've been trying to set up a dask.distributed cluster using Kubernetes. Setting up the kube cluster itself is pretty straightforward; the problem I am currently struggling with is that I can't get the local scheduler to connect to the workers. Workers can connect to the scheduler, but they advertise an address inside the kube network that is not accessible to the scheduler running outside the kube network.
Following the examples from the dask-kubernetes docs I got a kube cluster running on AWS and (on a separate AWS machine) started a notebook with the local dask.distributed scheduler. The scheduler launches a number of workers on the kube cluster, but it cannot connect to said workers because the workers are on a different network: the internal kube network.
The network setup looks like the following:

- the notebook machine (running the scheduler) on 192.168.0.0/24
- kube cluster EC2 instances also on 192.168.0.0/24
- kube pods on 100.64.0.0/16

The dask scheduler runs on 192.168.0.0/24 but the dask workers are on 100.64.0.0/16 - how do I connect the two? Should I run the scheduler in a kube pod as well, edit routing tables, or try to figure out the host machines' IP addresses on the workers?
The workers are able to connect to the scheduler, but on the scheduler side I get errors of the form:
distributed.scheduler - ERROR - Failed to connect to worker 'tcp://100.96.2.4:40992': Timed out trying to connect to 'tcp://100.96.2.4:40992' after 3.0 s: connect() didn't finish in time
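For reference, the same timeout is reproducible outside dask with a plain TCP connect; the helper below is just an illustration, not a dask API:

```python
import socket

def can_connect(host, port, timeout=3.0):
    # Attempt a raw TCP connection, mirroring what the scheduler does
    # before giving up with the "Timed out trying to connect" error.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# From outside the kube network this fails for the worker address, e.g.:
# can_connect('100.96.2.4', 40992)
```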
I'm not looking for a list of possible things I could do; I'm looking for the recommended way of setting this up, specifically in relation to dask.distributed.
I set up the kube cluster using kops.
I've typically used dask-kubernetes from within the Kubernetes cluster, though obviously this isn't ideal for everyone.
Networks can vary. My guess is that the IP address chosen by default is not visible to your Kubernetes network. If you do have an address to which your workers can connect, you can specify it with the ip= keyword argument:
cluster = KubeCluster(ip='scheduler-address-visible-to-workers')
If there is a network interface that you know to be visible, you can generalize this as follows:

from distributed.utils import get_ip_interface

ip = get_ip_interface('eth0')  # replace 'eth0' with your visible network interface
cluster = KubeCluster(ip=ip)
On UNIX-based systems you can usually find a list of suitable interfaces with the ifconfig command. You might look through that list for an address that is similar to the addresses you're seeing on the workers.
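If you'd rather enumerate candidate interfaces from Python instead of ifconfig, the standard library can list the interface names (a stdlib-only sketch; Unix, Python 3.3+):

```python
import socket

# List the machine's network interface names (Unix, Python 3.3+).
# Inspect each candidate with `ifconfig <name>` or `ip addr show <name>`
# and pick the one whose address the workers can reach.
for index, name in socket.if_nameindex():
    print(index, name)
```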
If neither of these is possible then I recommend raising an issue at https://github.com/dask/dask-kubernetes/issues/new