Centering and scaling #688
base: develop
| References | ||
| ---------- | ||
| * J. Prothero, J. Hannig, and J. Marron. *New perspectives on centering*. The New England Journal of Statistics in Data Science, vol. 1, no. 2, 216–236, 2023. |
We would rather use sphinxcontrib-bibtex so that all references are formatted in a consistent way.
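For illustration, a docstring that cites through sphinxcontrib-bibtex might look roughly like the sketch below. The `:footcite:t:` role and the `footbibliography` directive come from sphinxcontrib-bibtex; the function `center` and the BibTeX key `prothero+hannig+marron_2023_new` are hypothetical placeholders, and the key would have to match an entry in whichever `.bib` file the project lists in `bibtex_bibfiles`.

```python
def center(values):
    """Subtract the sample mean from a list of numbers.

    The centering conventions are discussed in
    :footcite:t:`prothero+hannig+marron_2023_new`.

    References:
        .. footbibliography::

    """
    # Hypothetical helper: it only exists to carry the docstring above.
    mean = sum(values) / len(values)
    return [v - mean for v in values]
```

With this layout the entry is rendered from the shared bibliography file, so its formatting matches every other reference in the documentation.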
| self.with_std = with_std | ||
| self.correction_ = correction | ||
| self.mean_: FData | None = None |
The parameters that end in underscore should be set only in fit.
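A minimal sketch of that convention, assuming the usual scikit-learn estimator layout: `__init__` stores only the constructor arguments, and every attribute ending in an underscore is created inside `fit`. The class `MeanCenterer` is a hypothetical pure-Python stand-in, not the transformer in this PR, and `RuntimeError` stands in for scikit-learn's `NotFittedError`.

```python
class MeanCenterer:
    """Toy transformer working on lists of equally long rows."""

    def __init__(self, with_mean=True):
        # Only hyperparameters are stored here, under their own names.
        self.with_mean = with_mean

    def fit(self, X):
        # Learned state gets a trailing underscore and is created in fit.
        n_samples = len(X)
        self.mean_ = [sum(column) / n_samples for column in zip(*X)]
        return self

    def transform(self, X):
        # The absence of mean_ is what signals an unfitted estimator.
        if not hasattr(self, "mean_"):
            raise RuntimeError("Call fit before transform.")
        if not self.with_mean:
            return [list(row) for row in X]
        return [[v - m for v, m in zip(row, self.mean_)] for row in X]
```

Before `fit` is called the instance has no `mean_` attribute at all, which is the property that checks such as scikit-learn's `check_is_fitted` rely on.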
| msg = "Cannot center with more than one sample" | ||
| raise ValueError(msg) | ||
| result = result - center | ||
| else: |
I do not understand the else cases. When are they applied?
| msg = "Cannot center with more than one sample" | ||
| raise ValueError(msg) | ||
| result = result - center | ||
| else: |
I also do not understand the else cases here.
| @pytest.fixture | ||
| def sample_fdgrid() -> Generator[FDataGrid, None, None]: |
Since when is this a Generator?
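To make the distinction concrete, here is a small sketch, assuming the fixture body simply returns its value (the body is not shown in this thread). A function that uses `return` is annotated with the returned type; only a function that uses `yield`, for example to run teardown code, is a generator. Both function names are hypothetical and the pytest decorator is left out so the snippet runs with the standard library alone.

```python
import inspect
from collections.abc import Iterator


def returning_fixture() -> list[float]:
    # Plain return: the annotation is the value's own type.
    return [0.0, 0.5, 1.0]


def yielding_fixture() -> Iterator[list[float]]:
    # yield makes this a generator function; code after the yield
    # would act as teardown when used as a pytest fixture.
    data = [0.0, 0.5, 1.0]
    yield data
    data.clear()


print(inspect.isgeneratorfunction(returning_fixture))
print(inspect.isgeneratorfunction(yielding_fixture))
```

Under that assumption, the fixture in the diff would be annotated `-> FDataGrid` rather than `Generator[FDataGrid, None, None]`.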
| with pytest.raises(TypeError): | ||
| root_integrated_sample_variance( | ||
| "not an FData object", # type: ignore[arg-type] | ||
| ) |
I am missing tests for FDataBasis and for the centering and scaling methods.
Co-authored-by: Carlos Ramos Carreño <[email protected]>
Summary
This pull request introduces a complete framework for centering and scaling functional data in the scikit-fda library. These tools are particularly relevant in the context of vector-valued functional data and mixed datasets, where components may differ in units, scale, or variability.
Motivation
In classical functional data analysis, centering and scaling are often not required, since all functions are assumed to be sampled over the same domain and measured in the same units. However, when combining multiple functional components or integrating scalar and functional features, normalization becomes critical to ensure fair comparisons and effective learning.
This PR addresses this gap by providing:
Main Additions
Transformers
* CenterScaler: Flexible transformer that applies user-defined or data-driven centering and scaling operations to FDataGrid or FDataBasis.
* StandardScaler: Computes the mean and standard deviation of the dataset and uses them to standardize functional data, mimicking scikit-learn’s StandardScaler.
Summary Statistics (New Functions)
These utilities compute scalar summaries useful for centering and scaling:
* individual_observation_mean: Integrated average of each function (vertical shift).
* grand_mean: Global scalar mean of all functions.
* root_integrated_sample_variance: A robust measure of variability, subtracting the mean function before integration.
* root_mean_square_l2: RMS of the L2 norm of each function (total magnitude without centering).
* individual_root_mean_square_l2: Computes RMS individually per function.
These functions are available under the skfda.exploratory.stats module and complement the existing set of functional location and dispersion statistics.
Documentation
preprocessing/.
Checklist before requesting a review
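As a rough illustration of the summary statistics listed under Main Additions, the sketch below computes four of them for curves sampled on a common grid, using the trapezoid rule. It is not the PR's implementation: the helper names, the division by the domain length in the integrated average, and the `correction` default of 0 are all assumptions made for the example.

```python
import math


def trapezoid(values, grid):
    """Trapezoid-rule integral of sampled values over the grid."""
    return sum(
        0.5 * (values[k] + values[k + 1]) * (grid[k + 1] - grid[k])
        for k in range(len(grid) - 1)
    )


def observation_means(samples, grid):
    """Integrated average of each curve (assumed: integral / domain length)."""
    length = grid[-1] - grid[0]
    return [trapezoid(x, grid) / length for x in samples]


def grand_mean(samples, grid):
    """Scalar mean of the per-curve integrated averages."""
    means = observation_means(samples, grid)
    return sum(means) / len(means)


def root_integrated_variance(samples, grid, correction=0):
    """Square root of the integrated pointwise sample variance."""
    n = len(samples)
    mean_fn = [sum(column) / n for column in zip(*samples)]
    var_fn = [
        sum((x[k] - mean_fn[k]) ** 2 for x in samples) / (n - correction)
        for k in range(len(grid))
    ]
    return math.sqrt(trapezoid(var_fn, grid))


def root_mean_square_l2(samples, grid):
    """Root of the average squared L2 norm, without centering."""
    squared_norms = [trapezoid([v * v for v in x], grid) for x in samples]
    return math.sqrt(sum(squared_norms) / len(samples))


# Two constant curves on [0, 1]: x_1(t) = 1 and x_2(t) = 3.
grid = [0.0, 0.5, 1.0]
samples = [[1.0, 1.0, 1.0], [3.0, 3.0, 3.0]]

print(observation_means(samples, grid))         # [1.0, 3.0]
print(grand_mean(samples, grid))                # 2.0
print(root_integrated_variance(samples, grid))  # 1.0
print(root_mean_square_l2(samples, grid))       # sqrt(5)
```

Subtracting the grand mean and dividing by one of the two scale measures gives a scalar centering and scaling of the whole sample, which is the kind of operation the description attributes to CenterScaler.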