Use a default fill_value of NaN for floats in Zarr v3 #10757
Zarr stores written with Xarray now consistently use a default Zarr fill value of ``NaN`` for float variables, for both Zarr v2 and v3. All other dtypes still use the Zarr default ``fill_value`` of zero. To customize, explicitly set encoding in :py:meth:`~Dataset.to_zarr`, e.g., ``encoding=dict.fromkeys(ds.data_vars, {'fill_value': 0})``. Fixes pydata#10646
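As a minimal sketch of the customization mentioned above, the snippet below builds the per-variable encoding mapping with `dict.fromkeys`. The variable names are hypothetical stand-ins for `ds.data_vars`, and the `to_zarr` call is shown only as a comment because it assumes an existing dataset and store.

```python
# Hypothetical variable names standing in for ds.data_vars.
data_vars = ["temperature", "pressure"]

# dict.fromkeys maps every name to the *same* encoding dict object.
encoding = dict.fromkeys(data_vars, {"fill_value": 0})

# An equivalent comprehension gives each variable its own dict, which is
# safer if the per-variable encodings are modified later.
encoding_independent = {name: {"fill_value": 0} for name in data_vars}

# Assumed usage with a real Dataset and store:
# ds.to_zarr(store, encoding=encoding)
```

Because `dict.fromkeys` shares one value object across all keys, the comprehension form may be preferable when individual variables need different settings.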
LGTM.
But I think there is still a need to rethink how we handle missing data more broadly across Xarray and Zarr.
xarray/backends/zarr.py
Outdated
# For floating point data, Xarray defaults to a fill_value
# of NaN (unlike Zarr, which uses zero):
# https://github.com/pydata/xarray/issues/10646
fill_value = v.dtype.type(np.nan)
I'm not sure whether there is any point to use a typed nan, since ultimately it just ends up as JSON. Somewhat related to zarr-developers/zarr-python#3478
Good point. Probably best to just use plain np.nan then.
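To illustrate the point about typed versus plain NaN, here is a small sketch using NumPy and the standard-library `json` module. It shows only how Python's own JSON encoder treats the two values; how Zarr itself serializes fill values in its metadata is not shown here and is not assumed.

```python
import json

import numpy as np

typed_nan = np.float32(np.nan)  # dtype-specific NaN, as in v.dtype.type(np.nan)
plain_nan = np.nan              # plain Python float NaN

# Both values are NaN regardless of the wrapping type.
both_nan = bool(np.isnan(typed_nan)) and bool(np.isnan(plain_nan))

# Once converted to a Python float and JSON-encoded, the dtype is lost:
# both serialize to the same token.
typed_json = json.dumps(float(typed_nan))
plain_json = json.dumps(plain_nan)
```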
with self.temp_dir() as (d, store):
    inputs.to_zarr(store, compute=False)
    with open_dataset(store) as on_disk:
        assert np.isnan(on_disk.variables["floats"].encoding["_FillValue"])
Not a blocker, but it seems a bit unnecessary to have _FillValue set at all in this scenario, doesn't it? We can just write NaNs directly?
Agreed. Created #10781 so we don't forget about this.
whats-new.rst