New Dask Arrow-based strings cause test failures #335

@jakirkham

(Edited by @m-albert)

When pyarrow is installed, dask by default converts dataframe columns of dtype object to pyarrow strings (see dask/dask#10139 (comment)).

This causes test failures (e.g. test_dask_image/test_ndmeasure/test_find_objects.py::test_3d_find_objects), because the following code declares object-dtype columns in its meta:

meta = dd.utils.make_meta([(i, object) for i in range(ndim)])
if isinstance(df1, Delayed):
    df1 = dd.from_delayed(df1, meta=meta)

dd.from_delayed(df1, meta=meta).compute().dtypes

Working install:

0 object
1 object
2 object
dtype: object

Failing install:

0 string[pyarrow]
1 string[pyarrow]
2 string[pyarrow]
dtype: object

The failing test came up when releasing v2023.08.0 in conda-forge/dask-image-feedstock#14.

@jakirkham found that pyarrow is installed with the conda distribution of dask, but not when installing via pip, where it is only part of the [complete] target.

@jakirkham also found that the conflicting behaviour described above can be turned off through the dask configuration.

He did this for the tests performed by the dask-image conda feedstock on v2023.08.0.
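
For reference, a minimal sketch of how the conversion might be switched off from Python. The issue does not name the option; the key "dataframe.convert-string" used here is an assumption based on dask's configuration, and the exact setting used in the feedstock may differ.

```python
import dask

# Assumption: "dataframe.convert-string" is the option that controls the
# automatic object -> string[pyarrow] conversion (not stated in this issue).
with dask.config.set({"dataframe.convert-string": False}):
    # Inside this block, dask dataframe code should leave object columns
    # untouched; here we only read the setting back to confirm it applied.
    convert_string = dask.config.get("dataframe.convert-string")
    print(convert_string)
```

Using dask.config.set as a context manager keeps the change local to the block. For a whole test run, the same option can probably be set globally, for example through dask's environment-variable configuration (DASK_DATAFRAME__CONVERT_STRING), though that is likewise an assumption here.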
