After uploading a Parquet file to my Kubernetes cluster for processing with Dask, I get a FileNotFoundError when trying to read it:
import dask.dataframe as dd

df = dd.read_parquet('home/jovyan/foo.parquet')
df.head()
Here is the full error:
FileNotFoundError: [Errno 2] No such file or directory: '/home/jovyan/user_engagement_anon.parquet/part.0.parquet'
I can see that the file does indeed exist, and relative to the working directory of my Jupyter notebook instance it's in the expected location.
I'm not sure if it matters, but to start the Dask client on my Kubernetes cluster I used the following code:
from dask.distributed import Client, progress

client = Client('dask-scheduler:8786', processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='1GB')
client
Furthermore, the same operation works fine on my local machine with the same Parquet file.
The problem was that I was installing Dask separately via a Helm release, so the Dask workers did not share the same filesystem as the Jupyter notebook. The read tasks behind read_parquet run on the workers, and the file only existed on the notebook pod.
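As a sanity check (not something I ran at the time, just a sketch of how to confirm this), you can ask the workers directly whether they can see the path; with mismatched filesystems the notebook reports True while every worker reports False:

import os
from dask.distributed import Client

client = Client('dask-scheduler:8786')

# Runs locally in the notebook; should return True
os.path.exists('/home/jovyan/foo.parquet')

# Runs the same check on every worker and returns a
# {worker_address: result} dict; False means that worker
# cannot see the notebook's filesystem
client.run(os.path.exists, '/home/jovyan/foo.parquet')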
To fix this, I used the dask-kubernetes Python library to create the workers, rather than a separate Helm release.
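For reference, a rough sketch of that approach with the classic dask-kubernetes KubeCluster API; the image name, resource values, and worker count are placeholder assumptions, and newer dask-kubernetes releases use an operator-based KubeCluster with a different signature:

from dask.distributed import Client
import dask.dataframe as dd
from dask_kubernetes import KubeCluster, make_pod_spec

# Pod template for the worker pods; image and resources are placeholders
pod_spec = make_pod_spec(
    image='daskdev/dask:latest',
    memory_limit='1G',
    memory_request='1G',
    cpu_limit=1,
    cpu_request=1,
)

# Create the worker pods from the notebook itself instead of a separate Helm release
cluster = KubeCluster(pod_spec)
cluster.scale(1)

client = Client(cluster)

df = dd.read_parquet('/home/jovyan/foo.parquet')
df.head()

The worker pods still need to be able to reach the same data as the notebook (for example via a shared volume in the pod spec), but creating them from the notebook environment made that setup straightforward in my case.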