shoyer (Member) commented Sep 16, 2025

Zarr stores written with Xarray now consistently use a default Zarr fill value of NaN for float variables, for both Zarr v2 and v3. All other dtypes still use the Zarr default fill_value of zero. To customize, explicitly set encoding in Dataset.to_zarr, e.g., encoding=dict.fromkeys(ds.data_vars, {'fill_value': 0}).

Fixes #10646

Zarr stores written with Xarray now consistently use a default Zarr fill value
of ``NaN`` for float variables, for both Zarr v2 and v3. All
other dtypes still use the Zarr default ``fill_value`` of zero. To customize,
explicitly set encoding in :py:meth:`~Dataset.to_zarr`, e.g.,
``encoding=dict.fromkeys(ds.data_vars, {'fill_value': 0})``.

Fixes pydata#10646
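
The encoding override mentioned in the description can be sketched in plain Python. This is only an illustration: `data_var_names` is a hypothetical stand-in for `list(ds.data_vars)`, and the `to_zarr` call is left as a comment because no real dataset is constructed here. The sketch also shows one property of `dict.fromkeys` worth knowing: every key maps to the same dict object.

```python
# Hypothetical variable names standing in for list(ds.data_vars).
data_var_names = ["temperature", "pressure"]

# dict.fromkeys maps every name to the *same* dict object.
encoding = dict.fromkeys(data_var_names, {"fill_value": 0})

# A comprehension gives each variable its own dict, which is safer if the
# per-variable encodings are modified later.
encoding_independent = {name: {"fill_value": 0} for name in data_var_names}

# With a real dataset, the mapping would be passed along the lines of:
#   ds.to_zarr(store, encoding=encoding)
```

Sharing one dict is harmless as long as the encoding is only read, which is the case in the one-liner from the description.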
github-actions bot added the topic-backends, topic-zarr (Related to zarr storage library), and io labels on Sep 16, 2025

shoyer (Member Author) commented Sep 23, 2025

@dcherian @rabernat Any concerns here? This should be a quick review.

rabernat (Contributor) left a comment

LGTM.

But I think there is still a need to rethink how we handle missing data more broadly across Xarray and Zarr.

# For floating point data, Xarray defaults to a fill_value
# of NaN (unlike Zarr, which uses zero):
# https://github.com/pydata/xarray/issues/10646
fill_value = v.dtype.type(np.nan)

Contributor

I'm not sure there is any point in using a typed NaN, since ultimately it just ends up as JSON. Somewhat related to zarr-developers/zarr-python#3478

Member Author

Good point. Probably best to just use plain np.nan then.
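
To illustrate why the dtype of the NaN is lost once it reaches array metadata, here is a minimal sketch. It assumes the Zarr convention of storing non-finite float fill values as strings in JSON; `encode_fill_value` is a hypothetical helper written for this example, not a function from Zarr or Xarray.

```python
import json
import math


def encode_fill_value(value):
    """Hypothetical helper: map a fill value to a JSON-safe representation."""
    if isinstance(value, float):
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "Infinity" if value > 0 else "-Infinity"
    return value


# Whatever float width the NaN started with, the metadata only records "NaN".
metadata = json.dumps({"fill_value": encode_fill_value(float("nan"))})
```

The serialized metadata carries no information about float width, so under this assumption a typed NaN and a plain `np.nan` would be written identically.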

with self.temp_dir() as (d, store):
    inputs.to_zarr(store, compute=False)
    with open_dataset(store) as on_disk:
        assert np.isnan(on_disk.variables["floats"].encoding["_FillValue"])

Contributor

Not a blocker, but it seems a bit unnecessary to have _FillValue set at all in this scenario, doesn't it? We can just write NaNs directly?

Member Author

Agreed. Created #10781 so we don't forget about this.
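
The reviewer's point can be sketched with a simplified model of fill-value masking. This is an assumption-laden illustration in plain Python, not Xarray's actual decoding code: `mask_fill_value` is a hypothetical function that replaces entries matching the fill value with NaN.

```python
import math


def mask_fill_value(values, fill_value):
    """Hypothetical sketch: replace entries equal to fill_value with NaN."""

    def is_fill(x):
        # NaN never compares equal to itself, so it needs an explicit check.
        if math.isnan(fill_value):
            return math.isnan(x)
        return x == fill_value

    return [math.nan if is_fill(x) else x for x in values]


# With a sentinel fill value, masking changes the data.
decoded_sentinel = mask_fill_value([1.0, -9999.0, 3.0], -9999.0)

# With a NaN fill value, masking maps NaN to NaN: the data is unchanged.
decoded_nan = mask_fill_value([1.0, math.nan, 3.0], math.nan)
```

In this model a NaN fill value makes the masking step a no-op, which is why recording `_FillValue` at all looks redundant when NaNs can be written directly.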

shoyer enabled auto-merge (squash) September 24, 2025 06:24
Successfully merging this pull request may close these issues.

Unwritten Zarr v3 arrays values should default to NaN