Description
xarray provides powerful resample and groupby methods for transforming data onto a target coordinate grid, but it has no concept of QC variables, so any transformation applied (e.g., mean, nearest) is performed naively and can end up using data that has been flagged as bad.
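To make the problem concrete, here is a small sketch with synthetic data. The variable names (temp, qc_temp) and the flag convention (0 = good, 1 = bad) are assumptions for illustration only; real ARM QC variables are bit-packed and would need decoding first.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Six samples at 10-minute spacing; the third one is a flagged spike.
time = pd.date_range("2024-01-01", periods=6, freq="10min")
ds = xr.Dataset(
    {
        "temp": ("time", [10.0, 10.0, 999.0, 12.0, 12.0, 12.0]),
        # Simplified, hypothetical flag: 0 = good, 1 = bad.
        "qc_temp": ("time", [0, 0, 1, 0, 0, 0]),
    },
    coords={"time": time},
)

# Naive resample: the flagged 999.0 is averaged into the first 30-minute bin.
naive = ds["temp"].resample(time="30min").mean()

# Manual workaround: mask bad samples to NaN before averaging.
# The mean skips NaN by default for float data.
masked = ds["temp"].where(ds["qc_temp"] == 0).resample(time="30min").mean()

print(naive.values)
print(masked.values)
```

The manual mask works, but every user has to remember to do it for every variable, and it says nothing about how many points in the bin were discarded.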
I think ACT would be a great place to host extensions to xarray's methods that do account for QC values in transformations. The proposed interface would be a series of methods that mirror the transformation types offered by the ARM Data Integrator (ADI) made available to ARM users in the PCM Interface:
- nearest neighbor
- bilinear interpolation
- bin averaging
- auto (picks between interpolation and averaging based on bin size) -- optional
The ADI library makes a few key decisions for QC-aware transformations that I think should be mirrored here:
- data values QC'd as bad are excluded from consideration in the transformation
  - for averaging, this is equivalent to marking these values as NaN
  - for nearest neighbor (interpolation), the nearest non-bad point(s) are used
- for averaging, if more than a threshold percentage of the values in a bin are bad, the output value is set to missing
  - likewise, if more than a second threshold percentage are indeterminate, the output value is set to missing
  - there are reasonable defaults set for each (I think these are 50% and 80%, respectively)
  - PCM/ADI goes a step further and lets you customize these thresholds for each variable, but I'm not sure the functionality/complexity trade-off is worth it for ACT
- optionally, an output summary QC variable is generated for each QC'd input variable
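The rules above could be sketched for a single bin roughly as follows. This is not ADI's code: the function names, the state codes (GOOD, INDETERMINATE, BAD), and the choice to keep indeterminate values in the average are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical per-sample QC states (real ARM QC is bit-packed).
GOOD, INDETERMINATE, BAD = 0, 1, 2


def qc_bin_average(values, states, bad_threshold=0.5, indeterminate_threshold=0.8):
    """Average one bin, excluding bad samples and applying both thresholds."""
    values = np.asarray(values, dtype=float)
    states = np.asarray(states)
    n = values.size
    if n == 0:
        return np.nan
    bad_frac = np.count_nonzero(states == BAD) / n
    indet_frac = np.count_nonzero(states == INDETERMINATE) / n
    # Too many bad or indeterminate samples: output is missing.
    if bad_frac > bad_threshold or indet_frac > indeterminate_threshold:
        return np.nan
    # Assumption: indeterminate samples still contribute to the mean.
    usable = values[states != BAD]
    usable = usable[~np.isnan(usable)]
    return float(usable.mean()) if usable.size else np.nan


def qc_nearest(target, coords, values, states, tolerance=np.inf):
    """Return the value of the nearest non-bad sample within tolerance."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    ok = np.asarray(states) != BAD
    if not ok.any():
        return np.nan
    dist = np.abs(coords[ok] - target)
    i = int(np.argmin(dist))
    return float(values[ok][i]) if dist[i] <= tolerance else np.nan


avg_ok = qc_bin_average([1, 2, 3, 100], [GOOD, GOOD, GOOD, BAD])
avg_too_bad = qc_bin_average([1, 2, 100, 100], [GOOD, BAD, BAD, BAD])
# The sample at coordinate 4 is closest to 5 but is bad, so 6 is used instead.
near = qc_nearest(5.0, [0, 4, 6, 10], [1.0, 2.0, 3.0, 4.0], [GOOD, BAD, GOOD, GOOD])
```

Whether indeterminate values should be averaged in, and whether the comparison should be strict, are exactly the kind of details that would need to be checked against ADI before settling the defaults.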
I think this could be implemented as a method applied to the xarray DatasetResample / DatasetGroupBy object returned by ds.resample / ds.groupby, e.g.:

    # Proposed API
    import act
    import xarray as xr

    ds = act.io.armfiles.read_netcdf(...)

    ds.resample(time="30min").apply(
        act.qc.transform.NearestNeighbor(tolerance="15min")
    )

    ds.resample(time="30min").apply(
        act.qc.transform.Interpolate(method="linear")
    )

    ds.groupby("time.hour").apply(
        act.qc.transform.BinAverage(
            bad_threshold=0.5,
            indeterminate_threshold=0.8,
            add_transform_qc=False,  # maybe also a roll-up QC option like PCM (4 bits instead of 10+)
        ),
    )

The transform functions/classes (NearestNeighbor, Interpolate, BinAverage) should take and return xarray Dataset objects. The input passed by the apply method contains all the points in the given bin, and the output is expected to be a 0-coord Dataset with scalar values for each data variable (metadata included).
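As a rough illustration of that contract, here is a minimal sketch of a callable that takes the Dataset for one bin and returns a reduced Dataset. Nothing here exists in ACT today: the class body, the qc_<name> naming convention, the "nonzero means bad" flag rule, and the fixed time dimension are all assumptions, and the indeterminate threshold is left out for brevity. The sketch uses map, which newer xarray releases document in place of apply.

```python
import numpy as np
import pandas as pd
import xarray as xr


class BinAverage:
    """Hypothetical QC-aware bin average; not part of ACT or xarray."""

    def __init__(self, bad_threshold=0.5, dim="time"):
        self.bad_threshold = bad_threshold
        self.dim = dim

    def __call__(self, ds: xr.Dataset) -> xr.Dataset:
        out = {}
        for name, da in ds.data_vars.items():
            if str(name).startswith("qc_"):
                continue  # QC variables are consumed, not averaged
            qc_name = f"qc_{name}"
            if qc_name in ds:
                bad = ds[qc_name] != 0  # assumption: nonzero flag means bad
                mean = da.where(~bad).mean(self.dim, keep_attrs=True)
                bad_frac = bad.astype(float).mean(self.dim)
                # Set the output to missing when too much of the bin is bad.
                mean = mean.where(bad_frac <= self.bad_threshold)
            else:
                mean = da.mean(self.dim, keep_attrs=True)
            out[name] = mean
        return xr.Dataset(out, attrs=ds.attrs)


time = pd.date_range("2024-01-01", periods=6, freq="10min")
ds = xr.Dataset(
    {
        "temp": ("time", [10.0, 10.0, 999.0, 12.0, 12.0, 12.0]),
        "qc_temp": ("time", [0, 0, 1, 0, 1, 1]),
    },
    coords={"time": time},
)

# First bin: 1 of 3 samples bad, so the mean of the good ones is kept.
# Second bin: 2 of 3 samples bad, so the output is set to missing.
result = ds.resample(time="30min").map(BinAverage(bad_threshold=0.5))
print(result["temp"].values)
```

Because each call only sees one bin, the same object should work unchanged with ds.groupby("time.hour"), which is the main appeal of hanging the transforms off xarray's own grouping machinery.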
I'm totally open to any changes/feedback. This could probably use several iterations of revisions to make it easier for users. Let me know what you think!