
new default h5netcdf engine raises UnicodeEncodeError #10811

@veenstrajelmer

What happened?

When reading and then writing a file with water level observations downloaded from CMEMS, I get "UnicodeEncodeError: 'utf-8' codec can't encode characters in position 4-5: surrogates not allowed" when using the new default engine (h5netcdf).
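
For context, "surrogates not allowed" is the message Python gives when a str containing lone surrogate code points is encoded strictly as UTF-8. A minimal sketch of how such a string can arise, independent of xarray: it assumes (not confirmed for this file) that some text in the file is stored as non-UTF-8 bytes and was decoded with the "surrogateescape" error handler. The byte string below is made up for illustration.

```python
# Hypothetical bytes that are not valid UTF-8 (two Latin-1 accented letters).
raw = b"Test\xe9\xe8 value"

# Decoding with errors="surrogateescape" maps each undecodable byte
# to a lone surrogate code point instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")

# A strict UTF-8 encode of that string then fails.
try:
    text.encode("utf-8")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# The original bytes can be recovered with the same error handler.
roundtrip = text.encode("utf-8", errors="surrogateescape")
```

Here the failing range covers the two escaped bytes, which matches the shape of the "position 4-5" message in the report.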

What did you expect to happen?

This did not happen before, when netcdf4 was the default engine. I can easily work around it by setting the engine explicitly. However, I see the benefits of h5netcdf, and I would expect this file to work with that engine as well. Hopefully this can be resolved, either in xarray or in h5netcdf.

Minimal Complete Verifiable Example

import xarray as xr
file_cmems = r"dfmtools_cmems_ssh_retrieve_data_temporary_file_6.nc"
ds = xr.open_dataset(file_cmems, engine='h5netcdf')
ds.to_netcdf("temp_file.nc")

File to test with: dfmtools_cmems_ssh_retrieve_data_temporary_file_6.zip

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

UnicodeEncodeError                        Traceback (most recent call last)
File ~\AppData\Local\miniforge3\envs\dfm_tools_env\Lib\site-packages\spyder_kernels\customize\utils.py:209, in exec_encapsulate_locals(code_ast, globals, locals, exec_fun, filename)
    207     if filename is None:
    208         filename = "<stdin>"
--> 209     exec_fun(compile(code_ast, filename, "exec"), globals, None)
    210 finally:
    211     if use_locals_hack:
    212         # Cleanup code

File c:\data\checkouts\dfm_tools\tests\untitled2.py:10
      8 file_cmems = r"c:\Users\veenstra\Downloads\dfmtools_cmems_ssh_retrieve_data_temporary_file_6.nc"
      9 ds = xr.open_dataset(file_cmems, engine='h5netcdf')
---> 10 ds.to_netcdf("temp_file.nc")

File ~\AppData\Local\miniforge3\envs\dfm_tools_env\Lib\site-packages\xarray\core\dataset.py:2102, in Dataset.to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf, auto_complex)
   2099     encoding = {}
   2100 from xarray.backends.api import to_netcdf
-> 2102 return to_netcdf(  # type: ignore[return-value]  # mypy cannot resolve the overloads:(
   2103     self,
   2104     path,
   2105     mode=mode,
   2106     format=format,
   2107     group=group,
   2108     engine=engine,
   2109     encoding=encoding,
   2110     unlimited_dims=unlimited_dims,
   2111     compute=compute,
   2112     multifile=False,
   2113     invalid_netcdf=invalid_netcdf,
   2114     auto_complex=auto_complex,
   2115 )

File ~\AppData\Local\miniforge3\envs\dfm_tools_env\Lib\site-packages\xarray\backends\api.py:2107, in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf, auto_complex)
   2102 # TODO: figure out how to refactor this logic (here and in save_mfdataset)
   2103 # to avoid this mess of conditionals
   2104 try:
   2105     # TODO: allow this work (setting up the file for writing array data)
   2106     # to be parallelized with dask
-> 2107     dump_to_store(
   2108         dataset, store, writer, encoding=encoding, unlimited_dims=unlimited_dims
   2109     )
   2110     if autoclose:
   2111         store.close()

File ~\AppData\Local\miniforge3\envs\dfm_tools_env\Lib\site-packages\xarray\backends\api.py:2157, in dump_to_store(dataset, store, writer, encoder, encoding, unlimited_dims)
   2154 if encoder:
   2155     variables, attrs = encoder(variables, attrs)
-> 2157 store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)

File ~\AppData\Local\miniforge3\envs\dfm_tools_env\Lib\site-packages\xarray\backends\common.py:527, in AbstractWritableDataStore.store(self, variables, attributes, check_encoding_set, writer, unlimited_dims)
    523     writer = ArrayWriter()
    525 variables, attributes = self.encode(variables, attributes)
--> 527 self.set_attributes(attributes)
    528 self.set_dimensions(variables, unlimited_dims=unlimited_dims)
    529 self.set_variables(
    530     variables, check_encoding_set, writer, unlimited_dims=unlimited_dims
    531 )

File ~\AppData\Local\miniforge3\envs\dfm_tools_env\Lib\site-packages\xarray\backends\common.py:544, in AbstractWritableDataStore.set_attributes(self, attributes)
    534 """
    535 This provides a centralized method to set the dataset attributes on the
    536 data store.
   (...)
    541     Dictionary of key/value (attribute name / attribute) pairs
    542 """
    543 for k, v in attributes.items():
--> 544     self.set_attribute(k, v)

File ~\AppData\Local\miniforge3\envs\dfm_tools_env\Lib\site-packages\xarray\backends\netCDF4_.py:555, in NetCDF4DataStore.set_attribute(self, key, value)
    553     self.ds.setncattr_string(key, value)
    554 else:
--> 555     self.ds.setncattr(key, value)

File src\\netCDF4\\_netCDF4.pyx:3087, in netCDF4._netCDF4.Dataset.setncattr()

File src\\netCDF4\\_netCDF4.pyx:1858, in netCDF4._netCDF4._set_att()

File src\\netCDF4\\_netCDF4.pyx:6733, in netCDF4._netCDF4._strencode()

UnicodeEncodeError: 'utf-8' codec can't encode characters in position 4-5: surrogates not allowed
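
The traceback ends in set_attribute, so the offending text appears to be in a dataset-level attribute. A rough sketch for narrowing down which one, using a plain dict in place of ds.attrs: the helper names, the sample attributes, and the Latin-1 fallback codec are all assumptions made for illustration, not part of xarray or h5netcdf.

```python
def find_surrogate_attrs(attrs):
    """Return keys of string attributes that cannot be encoded as strict UTF-8."""
    bad = []
    for key, value in attrs.items():
        if isinstance(value, str):
            try:
                value.encode("utf-8")
            except UnicodeEncodeError:
                bad.append(key)
    return bad


def repair_attr(value, fallback="latin-1"):
    """Recover the escaped bytes and decode them with an assumed fallback codec.

    Only works for surrogates produced by the "surrogateescape" handler.
    """
    return value.encode("utf-8", errors="surrogateescape").decode(fallback)


# Hypothetical attributes; the second value contains two lone surrogates.
attrs = {"title": "sea level", "institution": "Test\udce9\udce8 value"}

bad_keys = find_surrogate_attrs(attrs)
fixed = {k: (repair_attr(v) if k in bad_keys else v) for k, v in attrs.items()}
```

With the real dataset one would presumably pass ds.attrs to find_surrogate_attrs and inspect the reported keys before deciding how to repair or drop them.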

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:06:27) [MSC v.1942 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 11
machine: AMD64
processor: Intel64 Family 6 Model 154 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: ('Dutch_Netherlands', '1252')
libhdf5: 1.14.4
libnetcdf: 4.9.2

xarray: 2025.9.1
pandas: 2.2.3
numpy: 2.2.6
scipy: 1.15.1
netCDF4: 1.7.2
pydap: 3.5.3
h5netcdf: 1.5.0
h5py: 3.12.1
zarr: 2.18.4
cftime: 1.6.4.post1
nc_time_axis: None
iris: None
bottleneck: 1.4.2
dask: 2025.1.0
distributed: None
matplotlib: 3.10.0
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.12.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 75.8.0
pip: 25.0
conda: None
pytest: 8.3.4
mypy: None
IPython: 8.31.0
sphinx: 8.1.3
