Contributor

@luisheb commented May 14, 2025

Summary

This pull request introduces a complete framework for centering and scaling functional data in the scikit-fda library. These tools are particularly relevant in the context of vector-valued functional data and mixed datasets, where components may differ in units, scale, or variability.

Motivation

In classical functional data analysis, centering and scaling are often not required, since all functions are assumed to be sampled over the same domain and measured in the same units. However, when combining multiple functional components or integrating scalar and functional features, normalization becomes critical to ensure fair comparisons and effective learning.

This PR addresses this gap by providing:

  • Scikit-learn compatible transformers for centering and scaling.
  • Several statistical tools to compute meaningful scaling factors from the data.

Main Additions

Transformers

  • CenterScaler: Flexible transformer that applies user-defined or data-driven centering and scaling operations to FDataGrid or FDataBasis.
  • StandardScaler: Computes the mean and standard deviation of the dataset and uses them to standardize functional data, mimicking scikit-learn’s StandardScaler.
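As a rough illustration of what standardizing functional data can mean, the sketch below works on curves discretized over a common grid and uses only NumPy. It assumes pointwise centering and scaling (subtract the mean function, divide by the pointwise standard deviation); the function name `standardize_curves` and its `correction` argument are hypothetical and are not taken from this PR, whose transformers may scale differently (for example, by a single scalar).

```python
import numpy as np


def standardize_curves(values: np.ndarray, correction: int = 1) -> np.ndarray:
    """Standardize curves sampled on a common grid (hypothetical helper).

    ``values`` has shape (n_samples, n_points), one row per curve.
    """
    mean = values.mean(axis=0)  # pointwise mean function
    std = values.std(axis=0, ddof=correction)  # pointwise standard deviation
    return (values - mean) / std


grid = np.linspace(0.0, 1.0, 5)
# Three straight lines with different slopes and offsets.
curves = np.array([
    1.0 * grid + 0.0,
    2.0 * grid + 1.0,
    3.0 * grid + 5.0,
])
standardized = standardize_curves(curves)
```

After the call, every grid point has sample mean 0 and sample standard deviation 1 across the three curves.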

Summary Statistics (New Functions)

These utilities compute scalar summaries useful for centering and scaling:

  • individual_observation_mean: Integrated average of each function (vertical shift).
  • grand_mean: Global scalar mean of all functions.
  • root_integrated_sample_variance: A measure of variability: the square root of the integrated sample variance, where the mean function is subtracted before integration.
  • root_mean_square_l2: RMS of the L2 norm of each function (total magnitude without centering).
  • individual_root_mean_square_l2: Computes RMS individually per function.

These functions are available under the skfda.exploratory.stats module and complement the existing set of functional location and dispersion statistics.
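The following NumPy sketch shows one plausible reading of four of these statistics for curves sampled on a common grid. The formulas are assumptions inferred from the function names and the short descriptions above, not the definitions used in the PR; the `sketch_` names, the trapezoidal integration, and the `correction` argument are all hypothetical. The per-function RMS is left out because its normalization is not stated in the description.

```python
import numpy as np


def integrate(values: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Trapezoidal rule along the last axis."""
    widths = np.diff(grid)
    return np.sum(widths * (values[..., 1:] + values[..., :-1]) / 2.0, axis=-1)


def sketch_observation_means(values: np.ndarray, grid: np.ndarray) -> np.ndarray:
    # Integral of each curve divided by the length of the domain.
    return integrate(values, grid) / (grid[-1] - grid[0])


def sketch_grand_mean(values: np.ndarray, grid: np.ndarray) -> float:
    # Average of the per-curve integrated means.
    return float(sketch_observation_means(values, grid).mean())


def sketch_root_integrated_variance(
    values: np.ndarray, grid: np.ndarray, correction: int = 1
) -> float:
    # Pointwise sample variance (mean function removed), then integrated.
    pointwise_var = values.var(axis=0, ddof=correction)
    return float(np.sqrt(integrate(pointwise_var, grid)))


def sketch_rms_l2(values: np.ndarray, grid: np.ndarray) -> float:
    # Root of the average squared L2 norm, without centering.
    squared_norms = integrate(values**2, grid)
    return float(np.sqrt(squared_norms.mean()))


grid = np.linspace(0.0, 1.0, 101)
# Three constant curves with values 1, 2 and 3 on [0, 1].
curves = np.array([1.0, 2.0, 3.0])[:, np.newaxis] * np.ones_like(grid)

means = sketch_observation_means(curves, grid)
overall = sketch_grand_mean(curves, grid)
spread = sketch_root_integrated_variance(curves, grid)
magnitude = sketch_rms_l2(curves, grid)
```

For these constant curves the per-curve means are 1, 2 and 3, the grand mean is 2, the root integrated sample variance is 1, and the uncentered RMS is sqrt(14/3), which shows how centering changes the resulting scale factor.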

Documentation

  • Added a new documentation section: Scaling, under preprocessing/.
  • Describes when and why centering and scaling are relevant in functional and mixed settings.
  • Includes usage guidelines and mathematical definitions for each transformation and statistic.

Checklist before requesting a review

  • I have performed a self-review of my code
  • The code conforms to the style used in this package
  • The code is fully documented and typed (type-checked with Mypy)
  • I have added thorough tests for the new/changed functionality

@luisheb changed the title Feature/centering and scaling Centering and scaling May 14, 2025
@luisheb marked this pull request as ready for review June 18, 2025 15:40
References
----------

* J. Prothero, J. Hannig, and J. Marron. *New perspectives on centering*. The New England Journal of Statistics in Data Science, vol. 1, no. 2, 216–236, 2023.
Member

We would rather use sphinxcontrib-bibtex so that all references are formatted in a consistent way.
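For context, a minimal sketch of how sphinxcontrib-bibtex is usually enabled in a Sphinx `conf.py` is shown below. The file name `refs.bib` is a placeholder, and the project may already have an equivalent configuration in place.

```python
# Fragment of a Sphinx conf.py (illustrative only).
extensions = [
    "sphinx.ext.autodoc",
    "sphinxcontrib.bibtex",  # enables the :cite: and :footcite: roles
]

# BibTeX files, relative to the documentation source directory.
# "refs.bib" is a hypothetical name.
bibtex_bibfiles = ["refs.bib"]
```

With this in place, the reference would live as an entry in the `.bib` file and be cited from the documentation with the `:footcite:` role plus a `.. footbibliography::` directive (or `:cite:` with `.. bibliography::`), instead of being typed by hand.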

self.with_std = with_std
self.correction_ = correction

self.mean_: FData | None = None
Member

The parameters that end in underscore should be set only in fit.
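The convention referred to here is the scikit-learn one: `__init__` stores its arguments unchanged under the same names, and attributes with a trailing underscore are created only in `fit`. A minimal sketch follows; the class `SketchScaler` is hypothetical, operates on plain NumPy arrays, and deliberately does not inherit from scikit-learn so that it runs on its own.

```python
import numpy as np


class SketchScaler:
    """Hypothetical scaler illustrating the attribute naming convention."""

    def __init__(self, with_std: bool = True, correction: int = 1) -> None:
        # Constructor arguments: same name as the parameter, no trailing
        # underscore, no validation and no derived state.
        self.with_std = with_std
        self.correction = correction

    def fit(self, X: np.ndarray, y: object = None) -> "SketchScaler":
        # Learned state: trailing underscore, created here and only here.
        self.mean_ = X.mean(axis=0)
        self.scale_ = (
            X.std(axis=0, ddof=self.correction) if self.with_std else None
        )
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        if not hasattr(self, "mean_"):
            raise RuntimeError("Call fit before transform.")
        centered = X - self.mean_
        return centered if self.scale_ is None else centered / self.scale_


scaler = SketchScaler()
fitted_before = hasattr(scaler, "mean_")  # no learned state yet

X = np.array([[0.0, 1.0], [2.0, 5.0]])
transformed = scaler.fit(X).transform(X)
fitted_after = hasattr(scaler, "mean_")  # created by fit
```

Because `mean_` does not exist until `fit` runs, its presence doubles as the "is fitted" check, which is also how scikit-learn's own `check_is_fitted` utility detects fitted estimators.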

msg = "Cannot center with more than one sample"
raise ValueError(msg)
result = result - center
else:
Member

I do not understand the else cases. When are they applied?

msg = "Cannot center with more than one sample"
raise ValueError(msg)
result = result - center
else:
Member

I also do not understand the else cases here.



@pytest.fixture
def sample_fdgrid() -> Generator[FDataGrid, None, None]:
Member

Since when is this a Generator?
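To make the point concrete: a fixture that uses `return` is annotated with the type of the value it provides, and only a fixture that uses `yield` is a generator. The sketch below uses plain lists in place of `FDataGrid`; the helper and fixture names are hypothetical.

```python
from collections.abc import Iterator

import pytest


def build_sample() -> list[float]:
    """Plain helper so the data can also be built outside pytest."""
    return [0.0, 0.5, 1.0]


@pytest.fixture
def sample_returned() -> list[float]:
    # The body uses `return`, so the annotation is simply the value type.
    return build_sample()


@pytest.fixture
def sample_yielded() -> Iterator[list[float]]:
    # The body uses `yield`, so the function really is a generator; code
    # after the `yield` runs as teardown once the test has finished.
    data = build_sample()
    yield data
    data.clear()


def test_both_fixtures_provide_the_same_values(
    sample_returned: list[float],
    sample_yielded: list[float],
) -> None:
    # Tests receive the provided value in both cases, never a generator.
    assert sample_returned == sample_yielded == [0.0, 0.5, 1.0]
```

If the fixture under review has no teardown, the first form with a plain return type is the simpler choice.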

with pytest.raises(TypeError):
    root_integrated_sample_variance(
        "not an FData object",  # type: ignore[arg-type]
    )
Member

I am missing tests for FDataBasis and for the centering and scaling methods.

Co-authored-by: Carlos Ramos Carreño <[email protected]>