Dask Dataframe "ValueError: Data is compressed as snappy but we don't have this installed"

5/15/2018

python-snappy appears to be installed, yet Dask still raises a ValueError.

Helm Config for jupyter and workers:

env:
  - name: EXTRA_CONDA_PACKAGES
    value: numba xarray s3fs python-snappy pyarrow ruamel.yaml -c conda-forge
  - name: EXTRA_PIP_PACKAGES
    value: dask-ml --upgrade

The containers show python-snappy as installed (verified via conda list).

The dataframe is loaded from a multi-part Parquet dataset generated by Apache Drill:

import s3fs, dask.dataframe as dd
fs = s3fs.S3FileSystem()  # assumed: fs is an s3fs filesystem
files = ['s3://{}'.format(f) for f in fs.glob(path='{}/*.parquet'.format(filename))]
df = dd.read_parquet(files)

Running len(df) on the dataframe fails with:

distributed.utils - ERROR - Data is compressed as snappy but we don't have this installed
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/distributed/utils.py", line 622, in log_errors
    yield
  File "/opt/conda/lib/python3.6/site-packages/distributed/client.py", line 921, in _handle_report
    six.reraise(*clean_exception(**msg))
  File "/opt/conda/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/opt/conda/lib/python3.6/site-packages/distributed/comm/tcp.py", line 203, in read
    msg = yield from_frames(frames, deserialize=self.deserialize)
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    return
  File "/opt/conda/lib/python3.6/site-packages/tornado/gen.py", line 315, in wrapper
    future.set_result(_value_from_stopiteration(e))
  File "/opt/conda/lib/python3.6/site-packages/distributed/comm/utils.py", line 75, in from_frames
    res = _from_frames()
  File "/opt/conda/lib/python3.6/site-packages/distributed/comm/utils.py", line 61, in _from_frames
    return protocol.loads(frames, deserialize=deserialize)
  File "/opt/conda/lib/python3.6/site-packages/distributed/protocol/core.py", line 96, in loads
    msg = loads_msgpack(small_header, small_payload)
  File "/opt/conda/lib/python3.6/site-packages/distributed/protocol/core.py", line 171, in loads_msgpack
    " installed" % str(header['compression']))
ValueError: Data is compressed as snappy but we don't have this installed

Can anyone suggest a correct configuration or remediation steps?

-- fmcmac
dask
kubernetes-helm
pyarrow
python

1 Answer

5/15/2018

This error isn't actually coming from reading your parquet files; it comes from how Dask compresses data as it moves between machines. You can probably resolve it by installing (or not installing) python-snappy consistently across all of your client/scheduler/worker pods.
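A quick way to confirm the mismatch is to check whether snappy is importable in each process. A minimal sketch, assuming client is a distributed.Client already connected to your Helm-deployed scheduler:

def has_snappy():
    # Report whether python-snappy is importable in this process
    try:
        import snappy  # noqa: F401
        return True
    except ImportError:
        return False

print(client.run(has_snappy))               # dict of worker address -> result
print(client.run_on_scheduler(has_snappy))  # the scheduler process
print(has_snappy())                         # the local client/jupyter process

If the scheduler reports False while the client and workers report True (or the reverse), you have found the inconsistency.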

You should do one of the following:

  1. Remove python-snappy from the list of conda packages for your jupyter and worker pods. If you're using pyarrow this is unnecessary anyway; I believe Arrow includes snappy support at the C++ level.
  2. Add python-snappy to your scheduler pod as well (a values sketch follows this list).
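For option 2, a sketch of the corresponding Helm values; this assumes the scheduler container uses the same daskdev image entrypoint that honors EXTRA_CONDA_PACKAGES on your jupyter and worker pods:

scheduler:
  env:
    - name: EXTRA_CONDA_PACKAGES
      value: python-snappy -c conda-forge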

FWIW I personally recommend lz4 for inter-machine compression over snappy.
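If you switch, install lz4 on every pod (and drop python-snappy), then point Dask's inter-machine compression at it. A minimal sketch, assuming a Dask version where the setting lives under distributed.comm.compression (older releases used a top-level compression key in the distributed config file):

import dask

# Compress messages between client, scheduler, and workers with lz4
dask.config.set({"distributed.comm.compression": "lz4"})

Since each process compresses outgoing messages based on its own configuration, set the same key on every pod, e.g. via a dask config YAML file or (in recent versions) the DASK_DISTRIBUTED__COMM__COMPRESSION environment variable.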

-- MRocklin
Source: StackOverflow