python-snappy appears to be installed - Dask returns a ValueError.
Helm Config for jupyter and workers:
env:
- name: EXTRA_CONDA_PACKAGES
value: numba xarray s3fs python-snappy pyarrow ruamel.yaml -c conda-forge
- name: EXTRA_PIP_PACKAGES
value: dask-ml --upgrade
The containers shows python-snappy (via conda list)
The dataframe is loaded from a multi-part parquet file generated by Apache Drill:
files = ['s3://{}'.format(f) for f in fs.glob(path='{}/*.parquet'.format(filename))]
df = dd.read_parquet(files)
Running len(df)
on the dataframe returns:
distributed.utils - ERROR - Data is compressed as snappy but we don't have this installed
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/distributed/utils.py", line 622, in log_errors
yield
File "/opt/conda/lib/python3.6/site-packages/distributed/client.py", line 921, in _handle_report
six.reraise(*clean_exception(**msg))
File "/opt/conda/lib/python3.6/site-packages/six.py", line 692, in reraise
raise value.with_traceback(tb)
File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 203, in read
msg = yield from_frames(frames, deserialize=self.deserialize)
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
return
File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 315, in wrapper
future.set_result(_value_from_stopiteration(e))
File "/opt/conda/lib/python3.6/site-packages/distributed/comm/utils.py", line 75, in from_frames
res = _from_frames()
File "/opt/conda/lib/python3.6/site-packages/distributed/comm/utils.py", line 61, in _from_frames
return protocol.loads(frames, deserialize=deserialize)
File "/opt/conda/lib/python3.6/site-packages/distributed/protocol/core.py", line 96, in loads
msg = loads_msgpack(small_header, small_payload)
File "/opt/conda/lib/python3.6/site-packages/distributed/protocol/core.py", line 171, in loads_msgpack
" installed" % str(header['compression']))
ValueError: Data is compressed as snappy but we don't have this installed
Can anyone please suggest a correct configuration here or remediation steps?
This error actually isn't coming from reading your parquet files, it's coming from how Dask compresses data between machines. You can probably resolve this by installing or not installing python-snappy
consistently on all of your client/scheduler/worker pods.
You should do either of the following:
jupyter
and worker
pods. If you're using pyarrow
then this is unnecessary, I believe that Arrow includes snappy at the C++ level.python-snappy
to your scheduler
podFWIW I personally recommend lz4
for inter-machine compression over snappy
.