I'm running TPOT with Dask on a Kubernetes cluster on GCP. The cluster has 24 cores and 120 GB of memory across 4 Kubernetes nodes. My Kubernetes YAML is:
apiVersion: v1
kind: Service
metadata:
  name: daskd-scheduler
  labels:
    app: daskd
    role: scheduler
spec:
  ports:
  - port: 8786
    targetPort: 8786
    name: scheduler
  - port: 8787
    targetPort: 8787
    name: bokeh
  - port: 9786
    targetPort: 9786
    name: http
  - port: 8888
    targetPort: 8888
    name: jupyter
  selector:
    app: daskd
    role: scheduler
  type: LoadBalancer
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: daskd-scheduler
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: daskd
        role: scheduler
    spec:
      containers:
      - name: scheduler
        image: uyogesh/daskml-tpot-gcpfs  # CHANGE THIS TO BE YOUR DOCKER HUB IMAGE
        imagePullPolicy: Always
        command: ["/opt/conda/bin/dask-scheduler"]
        resources:
          requests:
            cpu: 1
            memory: 20000Mi  # set aside some extra resources for the scheduler
        ports:
        - containerPort: 8786
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: daskd-worker
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: daskd
        role: worker
    spec:
      containers:
      - name: worker
        image: uyogesh/daskml-tpot-gcpfs  # CHANGE THIS TO BE YOUR DOCKER HUB IMAGE
        imagePullPolicy: Always
        command: [
          "/bin/bash",
          "-cx",
          "env && /opt/conda/bin/dask-worker $DASKD_SCHEDULER_SERVICE_HOST:$DASKD_SCHEDULER_SERVICE_PORT_SCHEDULER --nthreads 8 --nprocs 1 --memory-limit 5e9",
        ]
        resources:
          requests:
            cpu: 2
            memory: 20000Mi
My data is 4 million rows and 77 columns. Whenever I run fit on the TPOT classifier, it runs on the Dask cluster for a while and then crashes. The output log looks like:
KilledWorker:
("('gradientboostingclassifier-fit-1c9d29ce92072868462946c12335e5dd',
0, 4)", 'tcp://10.8.1.14:35499')
I tried increasing the threads per worker as suggested by the Dask distributed docs, yet the problem persists. Some observations I have made:
It takes longer to crash when n_jobs is smaller (for n_jobs=4 it ran for 20 minutes before crashing), whereas it crashes almost instantly for n_jobs=-1.
It actually starts working and produces an optimized model for smaller datasets; with 10,000 rows it works fine.
So my question is: what changes and modifications do I need to make to get this working? I assume it's doable, as I've heard Dask can handle even bigger data than mine.
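For reference, the driver code looks roughly like this (a minimal sketch; the scheduler address, data loading, and the TPOT settings other than n_jobs/use_dask are placeholders, not my exact values):

import pandas as pd
from dask.distributed import Client
from tpot import TPOTClassifier

# Connect to the dask-scheduler exposed by the daskd-scheduler Service
client = Client("SCHEDULER_EXTERNAL_IP:8786")

df = pd.read_csv("data.csv")  # placeholder for the 4M x 77 dataset
X, y = df.drop(columns="target"), df["target"]

# use_dask=True makes TPOT evaluate its candidate pipelines on the cluster
tpot = TPOTClassifier(generations=5, population_size=20,
                      n_jobs=-1, use_dask=True, verbosity=2)
tpot.fit(X, y)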
Best practices described on Dask's official documentation page say:
Kubernetes resource limits and requests should match the --memory-limit and --nthreads parameters given to the dask-worker command. Otherwise your workers may get killed by Kubernetes as they pack into the same node and overwhelm that nodes’ available memory, leading to KilledWorker errors.
In your case these configuration parameters' values don't match, from what I can see:
Kubernetes container memory request: 20000Mi (~20 GB) vs. dask-worker --memory-limit: 5e9 (5 GB)
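So either lower the Kubernetes memory request to roughly match the 5 GB --memory-limit, or keep the ~20 GB request and raise --memory-limit to match it. A sketch of the worker container with aligned values (the exact numbers are illustrative, not something I have tested on your cluster):

      containers:
      - name: worker
        image: uyogesh/daskml-tpot-gcpfs
        imagePullPolicy: Always
        command: [
          "/bin/bash",
          "-cx",
          # --nthreads matches the CPU request; --memory-limit (20e9 bytes) sits
          # just under the 20000Mi Kubernetes limit, so Dask starts spilling to
          # disk before Kubernetes OOM-kills the pod.
          "env && /opt/conda/bin/dask-worker $DASKD_SCHEDULER_SERVICE_HOST:$DASKD_SCHEDULER_SERVICE_PORT_SCHEDULER --nthreads 2 --nprocs 1 --memory-limit 20e9",
        ]
        resources:
          requests:
            cpu: 2
            memory: 20000Mi
          limits:
            cpu: 2
            memory: 20000Mi

Also note that --nthreads 8 with a CPU request of 2 oversubscribes the container; keeping the two in sync is part of the same recommendation from the docs.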