QC-aware transformations #703

@maxwelllevin

xarray provides powerful resample and groupby methods for transforming data onto a target coordinate grid, but it has no concept of QC variables, so any transformation it applies (e.g., mean, nearest) is performed naively and can end up using data that has been flagged as bad.

I think ACT would be a great place to host extensions to xarray's methods that account for QC values in transformations. The proposed interface would be a set of methods mirroring the transformation types offered by the ARM Data Integrator (ADI), which are available to ARM users through the PCM interface:

  • nearest neighbor
  • bilinear interpolation
  • bin averaging
  • auto (picks between interpolation and averaging based on bin size) -- optional

The ADI library makes a few key decisions for QC-aware transformations that I think should be mirrored here:

  • data values QC'd as bad are excluded from consideration in the transformation
    • for averaging, this is equivalent to treating those values as NaN
    • for nearest neighbor (and interpolation), the nearest non-bad point(s) are used
  • for averaging, if more than a threshold percentage of the values in a bin are bad, the output value is set to missing
    • likewise, if more than a second threshold percentage are indeterminate, the output value is set to missing
    • there are reasonable defaults for each (I think these are 50% and 80%, respectively)
    • PCM/ADI goes a step further and lets you customize these thresholds for each variable, but I'm not sure the functionality/complexity trade-off is worth it for ACT
  • optionally, an output summary QC variable is generated for each QC'd input variable
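
To make the rules above concrete, here is a minimal NumPy sketch of how a single bin could be handled. It is not ADI's or ACT's implementation. The function names `qc_bin_average` and `nearest_good` are hypothetical, the boolean `bad` / `indeterminate` masks are assumed to have been derived from the QC variable beforehand, and keeping indeterminate points in the average (they only count toward the second threshold) is an assumption.

```python
import numpy as np


def qc_bin_average(values, bad, indeterminate,
                   bad_threshold=0.5, indeterminate_threshold=0.8):
    """Average one bin, excluding bad points (hypothetical helper).

    Returns NaN when the fraction of bad or indeterminate points in the
    bin exceeds its threshold.
    """
    values = np.asarray(values, dtype=float)
    bad = np.asarray(bad, dtype=bool)
    indeterminate = np.asarray(indeterminate, dtype=bool)
    if values.size == 0:
        return np.nan
    frac_bad = bad.sum() / values.size
    frac_indeterminate = indeterminate.sum() / values.size
    if frac_bad > bad_threshold or frac_indeterminate > indeterminate_threshold:
        return np.nan
    # Bad points are dropped; existing NaNs are treated as missing too.
    keep = ~bad & np.isfinite(values)
    if not keep.any():
        return np.nan
    return float(values[keep].mean())


def nearest_good(times, values, bad, target, tolerance):
    """Value of the non-bad sample closest to `target` (hypothetical helper).

    Returns NaN if no non-bad sample lies within `tolerance` of `target`.
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    candidates = np.flatnonzero(~np.asarray(bad, dtype=bool))
    if candidates.size == 0:
        return np.nan
    distances = np.abs(times[candidates] - target)
    best = candidates[np.argmin(distances)]
    if abs(times[best] - target) > tolerance:
        return np.nan
    return float(values[best])
```

With one bad point out of four, `qc_bin_average` averages the remaining three. With three bad points out of four, it returns NaN because 75% exceeds the assumed 50% default.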

I think this could be implemented as a method applied to the xarray DatasetResample / DatasetGroupBy object returned by ds.resample / ds.groupby, e.g.:

# Proposed API

import act
import xarray as xr


ds = act.io.armfiles.read_netcdf(...)

ds.resample(time="30min").apply(
    act.qc.transform.NearestNeighbor(tolerance="15min")
)

ds.resample(time="30min").apply(
    act.qc.transform.Interpolate(method="linear")
)

ds.groupby("time.hour").apply(
    act.qc.transform.BinAverage(
        bad_threshold=0.5,
        indeterminate_threshold=0.8,
        add_transform_qc=False,  # maybe also a roll-up QC option like PCM (4 bits instead of 10+)
    ),
)

The transform functions/classes (NearestNeighbor, Interpolate, BinAverage) should take and return xarray Dataset objects. The input passed by the apply method contains all the points in a given bin, and the output is expected to be a Dataset with the binned dimension reduced away, holding a scalar value for each data variable (metadata included).
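
As a rough illustration of that Dataset-in, Dataset-out contract, the sketch below implements a stripped-down bin average as a callable class. Nothing here exists in ACT: the class is a stand-in for the proposed `act.qc.transform.BinAverage`, it assumes each QC variable is named `qc_<variable>`, and it treats any nonzero QC value as bad (real ARM QC variables are bit-packed, with per-bit assessments). It uses `.map`, which recent xarray versions prefer over the older `.apply` alias.

```python
import numpy as np
import pandas as pd
import xarray as xr


class BinAverage:
    """Hypothetical QC-aware bin average; not part of ACT."""

    def __init__(self, bad_threshold=0.5, dim="time"):
        self.bad_threshold = bad_threshold
        self.dim = dim

    def __call__(self, ds: xr.Dataset) -> xr.Dataset:
        out = {}
        for name, da in ds.data_vars.items():
            if name.startswith("qc_"):
                continue  # QC variables are consumed, not averaged
            qc_name = f"qc_{name}"
            if qc_name in ds:
                # Assumption: any nonzero QC value marks the point as bad.
                bad = ds[qc_name] != 0
            else:
                bad = xr.zeros_like(da, dtype=bool)
            frac_bad = bad.mean(self.dim)
            mean = da.where(~bad).mean(self.dim, skipna=True, keep_attrs=True)
            # Set the output to missing when too much of the bin is bad.
            out[name] = mean.where(frac_bad <= self.bad_threshold)
        return xr.Dataset(out, attrs=ds.attrs)


# Six 10-minute samples; the third one is flagged by its QC variable.
time = pd.date_range("2024-01-01", periods=6, freq="10min")
ds = xr.Dataset(
    {
        "temp": ("time", [1.0, 2.0, 300.0, 4.0, 5.0, 6.0]),
        "qc_temp": ("time", [0, 0, 1, 0, 0, 0]),
    },
    coords={"time": time},
)

# Each 30-minute bin is passed to BinAverage as its own Dataset.
result = ds.resample(time="30min").map(BinAverage())
```

In this example the flagged 300.0 is left out, so the first 30-minute bin averages only 1.0 and 2.0. A full version would also need to handle the indeterminate threshold and the optional output QC variable.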

I'm totally open to any changes/feedback. This could probably use several iterations of revisions to make it easier for users. Let me know what you think!

Labels: enhancement (new feature or request)
