diff --git a/CHANGELOG b/CHANGELOG index 426358f29..3ae370f43 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -13,6 +13,10 @@ The rules for CHANGELOG file: 0.3.0 (XXXX/XX/XX) ------------------ +- Add ``_BasePCov`` class (#248) +- Add ``PCovC`` class that inherits shared functionality from ``_BasePCov`` (#248) +- Add ``PCovC`` testing suite and examples (#248) +- Modify ``PCovR`` to inherit shared functionality from ``_BasePCov_`` (#248) - Update to sklearn >= 1.6.0 and scipy >= 1.15.0 (#239) - Fixed moved function import from scipy and bump scipy dependency to 1.15.0 (#236) - Fix rendering issues for `SparseKDE` and `QuickShift` (#236) diff --git a/docs/src/bibliography.rst b/docs/src/bibliography.rst index 896b24478..0a7eec358 100644 --- a/docs/src/bibliography.rst +++ b/docs/src/bibliography.rst @@ -45,3 +45,9 @@ References Michele Ceriotti, "Improving Sample and Feature Selection with Principal Covariates Regression" 2021 Mach. Learn.: Sci. Technol. 2 035038. https://iopscience.iop.org/article/10.1088/2632-2153/abfe7c. + +.. [Jorgensen2025] + Christian Jorgensen, Arthur Y. Lin, Rhushil Vasavada, and Rose K. Cersonsky, + "Interpretable Visualizations of Data Spaces for Classification Problems" + 2025 arXiv. 2503.05861. + https://doi.org/10.48550/arXiv.2503.05861. diff --git a/docs/src/conf.py b/docs/src/conf.py index 7c72f4cef..48e617faf 100644 --- a/docs/src/conf.py +++ b/docs/src/conf.py @@ -54,7 +54,14 @@ "sphinx_toggleprompt", ] -example_subdirs = ["pcovr", "selection", "regression", "reconstruction", "neighbors"] +example_subdirs = [ + "pcovr", + "pcovc", + "selection", + "regression", + "reconstruction", + "neighbors", +] sphinx_gallery_conf = { "filename_pattern": "/*", "examples_dirs": [f"../../examples/{p}" for p in example_subdirs], diff --git a/docs/src/getting-started.rst b/docs/src/getting-started.rst index 8d09e285b..3f9dd1d1e 100644 --- a/docs/src/getting-started.rst +++ b/docs/src/getting-started.rst @@ -37,10 +37,10 @@ Notebook Examples .. include:: examples/reconstruction/index.rst :start-line: 4 -.. _getting_started-pcovr: +.. _getting_started-hybrid: -Principal Covariates Regression -------------------------------- +Hybrid Mapping Techniques +------------------------- .. automodule:: skmatter.decomposition :noindex: @@ -50,3 +50,5 @@ Notebook Examples .. include:: examples/pcovr/index.rst :start-line: 4 +.. include:: examples/pcovc/index.rst + :start-line: 4 diff --git a/docs/src/index.rst b/docs/src/index.rst index 2095eab22..33858328d 100644 --- a/docs/src/index.rst +++ b/docs/src/index.rst @@ -33,7 +33,7 @@ .. only:: html - :ref:`getting_started-pcovr` + :ref:`getting_started-hybrid` .. image:: /examples/pcovr/images/thumb/sphx_glr_PCovR_thumb.png :alt: @@ -41,7 +41,7 @@ .. raw:: html -

Utilises a combination between a PCA-like and a LR-like loss +

PCovR and PCovC utilize a combination of a PCA-like and a LR-like loss to determine the decomposition matrix that projects features into the latent space

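A minimal sketch of how the mixing parameter :math:`\alpha` controls this trade-off
with the ``PCovC`` estimator (the dataset, scaling, and parameter values below are
illustrative only)::

    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler

    from skmatter.decomposition import PCovC

    X, y = load_iris(return_X_y=True)
    X = StandardScaler().fit_transform(X)

    # mixing=1.0 gives an essentially PCA-like projection, mixing=0.0 a purely
    # classification-driven one; intermediate values blend the two losses.
    for mixing in (0.0, 0.5, 1.0):
        pcovc = PCovC(mixing=mixing, n_components=2)
        T = pcovc.fit(X, y).transform(X)
        print(mixing, T.shape)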
diff --git a/docs/src/references/decomposition.rst b/docs/src/references/decomposition.rst index 8ae92be4b..0ee5caf9c 100644 --- a/docs/src/references/decomposition.rst +++ b/docs/src/references/decomposition.rst @@ -1,5 +1,5 @@ -Principal Covariates Regression (PCovR) -======================================= +Hybrid Mapping Techniques +========================= .. _PCovR-api: @@ -20,6 +20,26 @@ PCovR .. automethod:: inverse_transform .. automethod:: score +.. _PCovC-api: + +PCovC +----- + +.. autoclass:: skmatter.decomposition.PCovC + :show-inheritance: + :special-members: + + .. automethod:: fit + + .. automethod:: _fit_feature_space + .. automethod:: _fit_sample_space + + .. automethod:: transform + .. automethod:: predict + .. automethod:: inverse_transform + .. automethod:: decision_function + .. automethod:: score + .. _KPCovR-api: Kernel PCovR diff --git a/docs/src/tutorials.rst b/docs/src/tutorials.rst index 96381ea39..8c5402542 100644 --- a/docs/src/tutorials.rst +++ b/docs/src/tutorials.rst @@ -3,6 +3,7 @@ .. toctree:: examples/pcovr/index + examples/pcovc/index examples/selection/index examples/regression/index examples/reconstruction/index diff --git a/examples/pcovc/PCovC_Comparison.py b/examples/pcovc/PCovC_Comparison.py new file mode 100644 index 000000000..d3e769142 --- /dev/null +++ b/examples/pcovc/PCovC_Comparison.py @@ -0,0 +1,110 @@ +#!/usr/bin/env python +# coding: utf-8 + +""" +Comparing PCovC with PCA and LDA +================================ +""" +# %% +# + +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import load_breast_cancer +from sklearn.decomposition import PCA +from sklearn.discriminant_analysis import LinearDiscriminantAnalysis +from sklearn.linear_model import LogisticRegressionCV +from sklearn.preprocessing import StandardScaler + +from skmatter.decomposition import PCovC + + +plt.rcParams["image.cmap"] = "tab10" +plt.rcParams["scatter.edgecolors"] = "k" + +random_state = 0 + +# %% +# +# For this, we will use the :func:`sklearn.datasets.load_breast_cancer` dataset from +# ``sklearn``. + +X, y = load_breast_cancer(return_X_y=True) + +scaler = StandardScaler() +X_scaled = scaler.fit_transform(X) + +# %% +# +# PCA +# --- +# + +pca = PCA(n_components=2) + +pca.fit(X_scaled, y) +T_pca = pca.transform(X_scaled) + +fig, ax = plt.subplots() +scatter = ax.scatter(T_pca[:, 0], T_pca[:, 1], c=y) +ax.set(xlabel="PC$_1$", ylabel="PC$_2$") +ax.legend( + scatter.legend_elements()[0][::-1], + load_breast_cancer().target_names[::-1], + loc="upper right", + title="Classes", +) + +# %% +# +# LDA +# --- +# + +lda = LinearDiscriminantAnalysis(n_components=1) +lda.fit(X_scaled, y) + +T_lda = lda.transform(X_scaled) + +fig, ax = plt.subplots() +ax.scatter(T_lda[:], np.zeros(len(T_lda[:])), c=y) +ax.set(xlabel="LDA$_1$", ylabel="LDA$_2$") + +# %% +# +# PCovC +# ------------------- +# +# Below, we see the map produced +# by a PCovC model with :math:`\alpha` = 0.5 and a logistic +# regression classifier. 
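# %%
#
# As a reminder (notation as in the ``PCovC`` class documentation), the latent
# space is obtained from the eigendecomposition of
#
# .. math::
#     \mathbf{\tilde{K}} = \alpha \mathbf{X} \mathbf{X}^T
#     + (1 - \alpha) \mathbf{Z}\mathbf{Z}^T
#
# where :math:`\mathbf{Z}` contains the class confidence scores, so
# :math:`\alpha = 0.5` weights the PCA-like and classification-like terms equally.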
+ +mixing = 0.5 + +pcovc = PCovC( + mixing=mixing, + n_components=2, + random_state=random_state, + classifier=LogisticRegressionCV(), +) +pcovc.fit(X_scaled, y) + +T_pcovc = pcovc.transform(X_scaled) + +fig, ax = plt.subplots() +ax.scatter(T_pcovc[:, 0], T_pcovc[:, 1], c=y) +ax.set(xlabel="PCov$_1$", ylabel="PCov$_2$") + +# %% +# +# A side-by-side comparison of the +# three maps (PCA, LDA, and PCovC): + +fig, axs = plt.subplots(1, 3, figsize=(18, 5)) +axs[0].scatter(T_pca[:, 0], T_pca[:, 1], c=y) +axs[0].set_title("PCA") +axs[1].scatter(T_lda, np.zeros(len(T_lda)), c=y) +axs[1].set_title("LDA") +axs[2].scatter(T_pcovc[:, 0], T_pcovc[:, 1], c=y) +axs[2].set_title("PCovC") +plt.show() diff --git a/examples/pcovc/PCovC_Hyperparameters.py b/examples/pcovc/PCovC_Hyperparameters.py new file mode 100644 index 000000000..49846c687 --- /dev/null +++ b/examples/pcovc/PCovC_Hyperparameters.py @@ -0,0 +1,158 @@ +#!/usr/bin/env python +# coding: utf-8 + +""" +PCovC Hyperparameter Tuning +=========================== +""" +# %% +# + +import matplotlib.pyplot as plt +from matplotlib.colors import LinearSegmentedColormap +from sklearn.datasets import load_iris +from sklearn.decomposition import PCA +from sklearn.inspection import DecisionBoundaryDisplay +from sklearn.linear_model import LogisticRegressionCV, Perceptron, RidgeClassifierCV +from sklearn.preprocessing import StandardScaler +from sklearn.svm import LinearSVC + +from skmatter.decomposition import PCovC + + +plt.rcParams["image.cmap"] = "tab10" +plt.rcParams["scatter.edgecolors"] = "k" + +random_state = 10 +n_components = 2 + +# %% +# +# For this, we will use the :func:`sklearn.datasets.load_iris` dataset from +# ``sklearn``. + +X, y = load_iris(return_X_y=True) + +scaler = StandardScaler() +X_scaled = scaler.fit_transform(X) + +# %% +# +# PCA +# --- +# + +pca = PCA(n_components=n_components) + +pca.fit(X_scaled, y) +T_pca = pca.transform(X_scaled) + +fig, axis = plt.subplots() +scatter = axis.scatter(T_pca[:, 0], T_pca[:, 1], c=y) +axis.set(xlabel="PC$_1$", ylabel="PC$_2$") +axis.legend( + scatter.legend_elements()[0], + load_iris().target_names, + loc="lower right", + title="Classes", +) + +# %% +# +# Effect of Mixing Parameter :math:`\alpha` on PCovC Map +# ------------------------------------------------------ +# +# Below, we see how different :math:`\alpha` values for our PCovC model +# result in varying class distinctions between setosa, versicolor, +# and virginica on the PCovC map. + +n_mixing = 5 +mixing_params = [0, 0.25, 0.50, 0.75, 1] + +fig, axs = plt.subplots(1, n_mixing, figsize=(4 * n_mixing, 4), sharey="row") + +for id in range(0, n_mixing): + mixing = mixing_params[id] + + pcovc = PCovC( + mixing=mixing, + n_components=n_components, + random_state=random_state, + classifier=LogisticRegressionCV(), + ) + + pcovc.fit(X_scaled, y) + T = pcovc.transform(X_scaled) + + axs[id].set_xticks([]) + axs[id].set_yticks([]) + + axs[id].set_title(r"$\alpha=$" + str(mixing)) + axs[id].set_xlabel("PCov$_1$") + axs[id].scatter(T[:, 0], T[:, 1], c=y) + +axs[0].set_ylabel("PCov$_2$") + +fig.subplots_adjust(wspace=0) + +# %% +# +# Effect of PCovC Classifier on PCovC Map and Decision Boundaries +# --------------------------------------------------------------- +# +# Here, we see how a PCovC model (:math:`\alpha` = 0.5) fitted with +# different classifiers produces varying PCovC maps. In addition, +# we see the varying decision boundaries produced by the +# respective PCovC classifiers. 
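# %%
#
# A quick, illustrative sanity check before plotting: since ``PCovC`` also exposes
# a classifier interface, each fitted model can be scored directly (the exact
# accuracies depend on the data and the classifier settings).

for clf in (
    LogisticRegressionCV(random_state=random_state),
    LinearSVC(random_state=random_state),
):
    acc = (
        PCovC(
            mixing=0.5,
            n_components=n_components,
            random_state=random_state,
            classifier=clf,
        )
        .fit(X_scaled, y)
        .score(X_scaled, y)
    )
    print(f"{type(clf).__name__}: accuracy = {acc:.3f}")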
+ +mixing = 0.5 +fig, axs = plt.subplots(1, 4, figsize=(16, 4)) + +models = { + RidgeClassifierCV(): "Ridge Classification", + LogisticRegressionCV(random_state=random_state): "Logistic Regression", + LinearSVC(random_state=random_state): "Support Vector Classification", + Perceptron(random_state=random_state): "Single-Layer Perceptron", +} + +for id in range(0, len(models)): + model = list(models)[id] + + pcovc = PCovC( + mixing=mixing, + n_components=n_components, + random_state=random_state, + classifier=model, + ) + + pcovc.fit(X_scaled, y) + T = pcovc.transform(X_scaled) + + graph = axs[id] + graph.set_title(models[model]) + + DecisionBoundaryDisplay.from_estimator( + estimator=pcovc.classifier_, + X=T, + ax=graph, + response_method="predict", + grid_resolution=1000, + ) + + scatter = graph.scatter(T[:, 0], T[:, 1], c=y) + + graph.set_xlabel("PCov$_1$") + graph.set_xticks([]) + graph.set_yticks([]) + +axs[0].set_ylabel("PCov$_2$") +axs[0].legend( + scatter.legend_elements()[0], + load_iris().target_names, + loc="lower right", + title="Classes", + fontsize=8, +) + +fig.subplots_adjust(wspace=0.04) +plt.show() diff --git a/examples/pcovc/README.rst b/examples/pcovc/README.rst new file mode 100644 index 000000000..4018f7ffa --- /dev/null +++ b/examples/pcovc/README.rst @@ -0,0 +1,2 @@ +PCovC +===== diff --git a/src/skmatter/decomposition/__init__.py b/src/skmatter/decomposition/__init__.py index 415fcf54a..4fbb6d92c 100644 --- a/src/skmatter/decomposition/__init__.py +++ b/src/skmatter/decomposition/__init__.py @@ -3,12 +3,17 @@ in order to compress data or visualise trends in the dataset. In the archetypal method for this dimensionality reduction, principal components analysis (PCA), features are transformed into the latent space which best preserves the -variance of the original data. This module provides the Principal Covariates -Regression (PCovR), as introduced by [deJong1992]_, is a modification to PCA +variance of the original data. + +This module provides the Principal Covariates +Regression (PCovR), as introduced by [deJong1992]_, which is a modification to PCA that incorporates target information, such that the resulting embedding could be tuned using a mixing parameter α to improve performance in regression tasks (:math:`\alpha = 0` corresponding to linear regression and :math:`\alpha = 1` -corresponding to PCA). [Helfrecht2020]_ introduced the non-linear version, +corresponding to PCA). Also provided is Principal Covariates Classification (PCovC), +proposed in [Jorgensen2025]_, which can similarly be used for classification problems. + +[Helfrecht2020]_ introduced the non-linear version of PCovR, Kernel Principal Covariates Regression (KPCovR), where the mixing parameter α now interpolates between kernel ridge regression (:math:`\alpha = 0`) and kernel principal components analysis (KPCA, :math:`\alpha = 1`). @@ -20,16 +25,23 @@ a low-dimensional projection of the feature vectors that simultaneously minimises information loss and error in predicting the target properties using only the latent space vectors :math:`\mathbf{T}`. -* :ref:`KPCovR-api` the Kernel Principal Covariates Regression - a kernel-based variation on the +* :ref:`PCovC-api` the standard Principal Covariates Classification, proposed in + [Jorgensen2025]_. +* :ref:`KPCovR-api` the Kernel Principal Covariates Regression. + A kernel-based variation on the original PCovR method, proposed in [Helfrecht2020]_. 
""" +from ._pcov import _BasePCov + +from ._pcovr import PCovR +from ._pcovc import PCovC + from ._kernel_pcovr import KernelPCovR -from ._pcovr import ( - PCovR, - pcovr_covariance, - pcovr_kernel, -) -__all__ = ["pcovr_covariance", "pcovr_kernel", "PCovR", "KernelPCovR"] +__all__ = [ + "_BasePCov", + "PCovR", + "PCovC", + "KernelPCovR", +] diff --git a/src/skmatter/decomposition/_kernel_pcovr.py b/src/skmatter/decomposition/_kernel_pcovr.py index 093195674..e47e771fb 100644 --- a/src/skmatter/decomposition/_kernel_pcovr.py +++ b/src/skmatter/decomposition/_kernel_pcovr.py @@ -40,11 +40,13 @@ class KernelPCovR(_BasePCA, LinearModel): ---------- mixing : float, default=0.5 mixing parameter, as described in PCovR as :math:`{\alpha}` + n_components : int, float or str, default=None Number of components to keep. if n_components is not set all components are kept:: n_components == n_samples + svd_solver : {'auto', 'full', 'arpack', 'randomized'}, default='auto' If auto : The solver is selected by a default policy based on `X.shape` and @@ -62,6 +64,7 @@ class KernelPCovR(_BasePCA, LinearModel): 0 < n_components < min(X.shape) If randomized : run randomized SVD by the method of Halko et al. + regressor : {instance of `sklearn.kernel_ridge.KernelRidge`, `precomputed`, None}, default=None The regressor to use for computing the property predictions :math:`\hat{\mathbf{Y}}`. @@ -72,36 +75,47 @@ class KernelPCovR(_BasePCA, LinearModel): If `precomputed`, we assume that the `y` passed to the `fit` function is the regressed form of the targets :math:`{\mathbf{\hat{Y}}}`. + kernel : "linear" | "poly" | "rbf" | "sigmoid" | "cosine" | "precomputed" Kernel. Default="linear". + gamma : float, default=None Kernel coefficient for rbf, poly and sigmoid kernels. Ignored by other kernels. + degree : int, default=3 Degree for poly kernels. Ignored by other kernels. + coef0 : float, default=1 Independent term in poly and sigmoid kernels. Ignored by other kernels. + kernel_params : mapping of str to any, default=None Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels. + center : bool, default=False Whether to center any computed kernels + fit_inverse_transform : bool, default=False Learn the inverse transform for non-precomputed kernels. (i.e. learn to find the pre-image of a point) + tol : float, default=1e-12 Tolerance for singular values computed by svd_solver == 'arpack' and for matrix inversions. Must be of range [0.0, infinity). + n_jobs : int, default=None The number of parallel jobs to run. :obj:`None` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. + iterated_power : int or 'auto', default='auto' Number of iterations for the power method computed by svd_solver == 'randomized'. Must be of range [0, infinity). + random_state : int, :class:`numpy.random.RandomState` instance or None, default=None Used when the 'arpack' or 'randomized' solvers are used. Pass an int for reproducible results across multiple function calls. 
@@ -111,18 +125,23 @@ class KernelPCovR(_BasePCA, LinearModel): pt__: numpy.darray of size :math:`({n_{components}, n_{components}})` pseudo-inverse of the latent-space projection, which can be used to contruct projectors from latent-space + pkt_: numpy.ndarray of size :math:`({n_{samples}, n_{components}})` the projector, or weights, from the input kernel :math:`\mathbf{K}` to the latent-space projection :math:`\mathbf{T}` + pky_: numpy.ndarray of size :math:`({n_{samples}, n_{properties}})` the projector, or weights, from the input kernel :math:`\mathbf{K}` to the properties :math:`\mathbf{Y}` + pty_: numpy.ndarray of size :math:`({n_{components}, n_{properties}})` the projector, or weights, from the latent-space projection :math:`\mathbf{T}` to the properties :math:`\mathbf{Y}` + ptx_: numpy.ndarray of size :math:`({n_{components}, n_{features}})` the projector, or weights, from the latent-space projection :math:`\mathbf{T}` to the feature matrix :math:`\mathbf{X}` + X_fit_: numpy.ndarray of shape (n_samples, n_features) The data used to fit the model. This attribute is used to build kernels from new data. @@ -133,12 +152,10 @@ class KernelPCovR(_BasePCA, LinearModel): >>> from skmatter.decomposition import KernelPCovR >>> from skmatter.preprocessing import StandardFlexibleScaler as SFS >>> from sklearn.kernel_ridge import KernelRidge - >>> >>> X = np.array([[-1, 1, -3, 1], [1, -2, 1, 2], [-2, 0, -2, -2], [1, 0, 2, -1]]) >>> X = SFS().fit_transform(X) >>> Y = np.array([[0, -5], [-1, 1], [1, -5], [-3, 2]]) >>> Y = SFS(column_wise=True).fit_transform(Y) - >>> >>> kpcovr = KernelPCovR( ... mixing=0.1, ... n_components=2, @@ -248,6 +265,7 @@ def fit(self, X, Y, W=None): means and scaled. If features are related, the matrix should be scaled to have unit variance, otherwise :math:`\mathbf{X}` should be scaled so that each feature has a variance of 1 / n_features. + Y : numpy.ndarray, shape (n_samples, n_properties) Training data, where n_samples is the number of samples and n_properties is the number of properties @@ -256,6 +274,7 @@ def fit(self, X, Y, W=None): means and scaled. If features are related, the matrix should be scaled to have unit variance, otherwise :math:`\mathbf{Y}` should be scaled so that each feature has a variance of 1 / n_features. + W : numpy.ndarray, shape (n_samples, n_properties) Regression weights, optional when regressor=`precomputed`. 
If not passed, it is assumed that `W = np.linalg.lstsq(K, Y, self.tol)[0]` @@ -463,6 +482,7 @@ def score(self, X, y): ---------- X : numpy.ndarray independent (predictor) variable + Y : numpy.ndarray dependent (response) variable diff --git a/src/skmatter/decomposition/_pcov.py b/src/skmatter/decomposition/_pcov.py new file mode 100644 index 000000000..972f097dc --- /dev/null +++ b/src/skmatter/decomposition/_pcov.py @@ -0,0 +1,286 @@ +import numbers +import warnings + +import numpy as np +from numpy.linalg import LinAlgError +from scipy import linalg +from scipy.linalg import sqrtm as MatrixSqrt +from scipy.sparse.linalg import svds +from sklearn.decomposition._base import _BasePCA +from sklearn.decomposition._pca import _infer_dimension +from sklearn.linear_model._base import LinearModel +from sklearn.utils import check_random_state +from sklearn.utils._arpack import _init_arpack_v0 +from sklearn.utils.extmath import randomized_svd, stable_cumsum, svd_flip +from sklearn.utils.validation import check_is_fitted + +from skmatter.utils import pcovr_covariance, pcovr_kernel + + +class _BasePCov(_BasePCA, LinearModel): + def __init__( + self, + mixing=0.5, + n_components=None, + svd_solver="auto", + tol=1e-12, + space="auto", + iterated_power="auto", + random_state=None, + whiten=False, + ): + self.mixing = mixing + self.n_components = n_components + self.svd_solver = svd_solver + self.tol = tol + self.space = space + self.iterated_power = iterated_power + self.random_state = random_state + self.whiten = whiten + + def fit(self, X): + """Contains the common functionality for the PCovR and PCovC fit methods, + but leaves the rest of the functionality to the subclass. + """ + # saved for inverse transformations from the latent space, + # should be zero in the case that the features have been properly centered + self.mean_ = np.mean(X, axis=0) + + if np.max(np.abs(self.mean_)) > self.tol: + warnings.warn( + "This class does not automatically center data, and your data mean is" + " greater than the supplied tolerance.", + stacklevel=1, + ) + + if self.space is not None and self.space not in [ + "feature", + "sample", + "auto", + ]: + raise ValueError("Only feature and sample space are supported.") + + # Handle self.n_components==None + if self.n_components is None: + if self.svd_solver != "arpack": + self.n_components_ = min(X.shape) + else: + self.n_components_ = min(X.shape) - 1 + else: + self.n_components_ = self.n_components + + # Handle svd_solver + self.fit_svd_solver_ = self.svd_solver + if self.fit_svd_solver_ == "auto": + # Small problem or self.n_components_ == 'mle', just call full PCA + if max(X.shape) <= 500 or self.n_components_ == "mle": + self.fit_svd_solver_ = "full" + elif self.n_components_ >= 1 and self.n_components_ < 0.8 * min(X.shape): + self.fit_svd_solver_ = "randomized" + # This is also the case of self.n_components_ in (0,1) + else: + self.fit_svd_solver_ = "full" + + self.n_samples_in_, self.n_features_in_ = X.shape + self.space_ = self.space + if self.space_ is None or self.space_ == "auto": + if self.n_samples_in_ > self.n_features_in_: + self.space_ = "feature" + else: + self.space_ = "sample" + + def _fit_feature_space(self, X, Y, Yhat, compute_pty_=True): + Ct, iCsqrt = pcovr_covariance( + mixing=self.mixing, + X=X, + Y=Yhat, + rcond=self.tol, + return_isqrt=True, + ) + try: + Csqrt = np.linalg.lstsq(iCsqrt, np.eye(len(iCsqrt)), rcond=None)[0] + + # if we can avoid recomputing Csqrt, we should, but sometimes we + # run into a singular matrix, which is what we do 
here + except LinAlgError: + Csqrt = np.real(MatrixSqrt(X.T @ X)) + + if self.fit_svd_solver_ == "full": + U, S, Vt = self._decompose_full(Ct) + elif self.fit_svd_solver_ in ["arpack", "randomized"]: + U, S, Vt = self._decompose_truncated(Ct) + else: + raise ValueError(f"Unrecognized svd_solver='{self.fit_svd_solver_}'") + + self.singular_values_ = np.sqrt(S.copy()) + self.explained_variance_ = S / (X.shape[0] - 1) + self.explained_variance_ratio_ = ( + self.explained_variance_ / self.explained_variance_.sum() + ) + + S_sqrt = np.diagflat([np.sqrt(s) if s > self.tol else 0.0 for s in S]) + S_sqrt_inv = np.diagflat([1.0 / np.sqrt(s) if s > self.tol else 0.0 for s in S]) + + self.pxt_ = np.linalg.multi_dot([iCsqrt, Vt.T, S_sqrt]) + self.ptx_ = np.linalg.multi_dot([S_sqrt_inv, Vt, Csqrt]) + + if compute_pty_: + self.pty_ = np.linalg.multi_dot([S_sqrt_inv, Vt, iCsqrt, X.T, Y]) + + def _fit_sample_space(self, X, Y, Yhat, W, compute_pty_=True): + Kt = pcovr_kernel(mixing=self.mixing, X=X, Y=Yhat) + + if self.fit_svd_solver_ == "full": + U, S, Vt = self._decompose_full(Kt) + elif self.fit_svd_solver_ in ["arpack", "randomized"]: + U, S, Vt = self._decompose_truncated(Kt) + else: + raise ValueError( + "Unrecognized svd_solver='{0}'" "".format(self.fit_svd_solver_) + ) + + self.singular_values_ = np.sqrt(S.copy()) + self.explained_variance_ = S / (X.shape[0] - 1) + self.explained_variance_ratio_ = ( + self.explained_variance_ / self.explained_variance_.sum() + ) + + P = (self.mixing * X.T) + (1.0 - self.mixing) * W @ Yhat.T + S_sqrt_inv = np.diagflat([1.0 / np.sqrt(s) if s > self.tol else 0.0 for s in S]) + T = Vt.T @ S_sqrt_inv + + self.pxt_ = P @ T + self.ptx_ = T.T @ X + + if compute_pty_: + self.pty_ = T.T @ Y + + def inverse_transform(self, T): + if np.max(np.abs(self.mean_)) > self.tol: + warnings.warn( + "This class does not automatically un-center data, and your data mean " + "is greater than the supplied tolerance, so the inverse transformation " + "will be off by the original data mean.", + stacklevel=1, + ) + + return T @ self.ptx_ + + def transform(self, X=None): + check_is_fitted(self, ["pxt_", "mean_"]) + + return super().transform(X) + + def _decompose_truncated(self, mat): + if not 1 <= self.n_components_ <= min(self.n_samples_in_, self.n_features_in_): + raise ValueError( + "n_components=%r must be between 1 and " + "min(n_samples, n_features)=%r with " + "svd_solver='%s'" + % ( + self.n_components_, + min(self.n_samples_in_, self.n_features_in_), + self.svd_solver, + ) + ) + elif not isinstance(self.n_components_, numbers.Integral): + raise ValueError( + "n_components=%r must be of type int " + "when greater than or equal to 1, was of type=%r" + % (self.n_components_, type(self.n_components_)) + ) + elif self.svd_solver == "arpack" and self.n_components_ == min( + self.n_samples_in_, self.n_features_in_ + ): + raise ValueError( + "n_components=%r must be strictly less than " + "min(n_samples, n_features)=%r with " + "svd_solver='%s'" + % ( + self.n_components_, + min(self.n_samples_in_, self.n_features_in_), + self.svd_solver, + ) + ) + + random_state = check_random_state(self.random_state) + + if self.fit_svd_solver_ == "arpack": + v0 = _init_arpack_v0(min(mat.shape), random_state) + U, S, Vt = svds(mat, k=self.n_components_, tol=self.tol, v0=v0) + # svds doesn't abide by scipy.linalg.svd/randomized_svd + # conventions, so reverse its outputs. 
+ S = S[::-1] + # flip eigenvectors' sign to enforce deterministic output + U, Vt = svd_flip(U[:, ::-1], Vt[::-1]) + + # We have already eliminated all other solvers, so this must be "randomized" + else: + # sign flipping is done inside + U, S, Vt = randomized_svd( + mat, + n_components=self.n_components_, + n_iter=self.iterated_power, + flip_sign=True, + random_state=random_state, + ) + + return U, S, Vt + + def _decompose_full(self, mat): + if self.n_components_ == "mle": + if self.n_samples_in_ < self.n_features_in_: + raise ValueError( + "n_components='mle' is only supported " "if n_samples >= n_features" + ) + elif ( + not 0 <= self.n_components_ <= min(self.n_samples_in_, self.n_features_in_) + ): + raise ValueError( + "n_components=%r must be between 1 and " + "min(n_samples, n_features)=%r with " + "svd_solver='%s'" + % ( + self.n_components_, + min(self.n_samples_in_, self.n_features_in_), + self.svd_solver, + ) + ) + elif self.n_components_ >= 1: + if not isinstance(self.n_components_, numbers.Integral): + raise ValueError( + "n_components=%r must be of type int " + "when greater than or equal to 1, " + "was of type=%r" % (self.n_components_, type(self.n_components_)) + ) + + U, S, Vt = linalg.svd(mat, full_matrices=False) + + # flip eigenvectors' sign to enforce deterministic output + U, Vt = svd_flip(U, Vt) + + # Get variance explained by singular values + explained_variance_ = S / (self.n_samples_in_ - 1) + total_var = explained_variance_.sum() + explained_variance_ratio_ = explained_variance_ / total_var + + # Postprocess the number of components required + if self.n_components_ == "mle": + self.n_components_ = _infer_dimension( + explained_variance_, self.n_samples_in_ + ) + elif 0 < self.n_components_ < 1.0: + # number of components for which the cumulated explained + # variance percentage is superior to the desired threshold + # side='right' ensures that number of features selected + # their variance is always greater than self.n_components_ float + # passed. More discussion in issue: #15669 + ratio_cumsum = stable_cumsum(explained_variance_ratio_) + self.n_components_ = ( + np.searchsorted(ratio_cumsum, self.n_components_, side="right") + 1 + ) + return ( + U[:, : self.n_components_], + S[: self.n_components_], + Vt[: self.n_components_], + ) diff --git a/src/skmatter/decomposition/_pcovc.py b/src/skmatter/decomposition/_pcovc.py new file mode 100644 index 000000000..fe4434b4f --- /dev/null +++ b/src/skmatter/decomposition/_pcovc.py @@ -0,0 +1,447 @@ +import numpy as np +from sklearn import clone +from sklearn.discriminant_analysis import LinearDiscriminantAnalysis +from sklearn.linear_model import ( + LogisticRegression, + LogisticRegressionCV, + Perceptron, + RidgeClassifier, + RidgeClassifierCV, + SGDClassifier, +) +from sklearn.linear_model._base import LinearClassifierMixin +from sklearn.svm import LinearSVC +from sklearn.utils import check_array +from sklearn.utils.multiclass import check_classification_targets, type_of_target +from sklearn.utils.validation import check_is_fitted, validate_data + +from skmatter.decomposition import _BasePCov +from skmatter.utils import check_cl_fit + + +class PCovC(LinearClassifierMixin, _BasePCov): + r"""Principal Covariates Classification, as described in [Jorgensen2025]_, + determines a latent-space projection :math:`\mathbf{T}` + which minimizes a combined loss in supervised and unsupervised tasks. + + This projection is determined by the eigendecomposition of a modified gram + matrix :math:`\mathbf{\tilde{K}}` + + .. 
math:: + \mathbf{\tilde{K}} = \alpha \mathbf{X} \mathbf{X}^T + + (1 - \alpha) \mathbf{Z}\mathbf{Z}^T + + where :math:`\alpha` is a mixing parameter, :math:`\mathbf{X}` is an input matrix of shape + :math:`(n_{samples}, n_{features})`, and :math:`\mathbf{Z}` is a matrix of class confidence scores + of shape :math:`(n_{samples}, n_{classes})`. For :math:`(n_{samples} < n_{features})`, + this can be more efficiently computed using the eigendecomposition of a modified covariance matrix + :math:`\mathbf{\tilde{C}}` + + .. math:: + \mathbf{\tilde{C}} = \alpha \mathbf{X}^T \mathbf{X} + + (1 - \alpha) \left(\left(\mathbf{X}^T + \mathbf{X}\right)^{-\frac{1}{2}} \mathbf{X}^T + \mathbf{Z}\mathbf{Z}^T \mathbf{X} \left(\mathbf{X}^T + \mathbf{X}\right)^{-\frac{1}{2}}\right) + + For all PCovC methods, it is strongly suggested that :math:`\mathbf{X}` and + :math:`\mathbf{Y}` are centered and scaled to unit variance, otherwise the + results will change drastically near :math:`\alpha \to 0` and :math:`\alpha \to 1`. + This can be done with the companion preprocessing classes, where + + >>> from skmatter.preprocessing import StandardFlexibleScaler as SFS + >>> import numpy as np + >>> + >>> # Set column_wise to True when the columns are relative to one another, + >>> # False otherwise. + >>> scaler = SFS(column_wise=True) + >>> + >>> A = np.array([[1, 2], [2, 1]]) # replace with your matrix + >>> scaler.fit(A) + StandardFlexibleScaler(column_wise=True) + >>> A = scaler.transform(A) + + Parameters + ---------- + mixing: float, default=0.5 + mixing parameter, as described in PCovC as :math:`{\alpha}`, here named + to avoid confusion with regularization parameter `alpha` + + n_components : int, float or str, default=None + Number of components to keep. + if n_components is not set all components are kept:: + + n_components == min(n_samples, n_features) + + svd_solver : {'auto', 'full', 'arpack', 'randomized'}, default='auto' + If auto : + The solver is selected by a default policy based on `X.shape` and + `n_components`: if the input data is larger than 500x500 and the + number of components to extract is lower than 80% of the smallest + dimension of the data, then the more efficient 'randomized' + method is enabled. Otherwise the exact full SVD is computed and + optionally truncated afterwards. + If full : + run exact full SVD calling the standard LAPACK solver via + `scipy.linalg.svd` and select the components by postprocessing + If arpack : + run SVD truncated to n_components calling ARPACK solver via + `scipy.sparse.linalg.svds`. It requires strictly + 0 < n_components < min(X.shape) + If randomized : + run randomized SVD by the method of Halko et al. + + tol : float, default=1e-12 + Tolerance for singular values computed by svd_solver == 'arpack'. + Must be of range [0.0, infinity). + + space: {'feature', 'sample', 'auto'}, default='auto' + whether to compute the PCovC in `sample` or `feature` space + default=`sample` when :math:`{n_{samples} < n_{features}}` and + `feature` when :math:`{n_{features} < n_{samples}}` + + classifier: `estimator object` or `precomputed`, default=None + classifier for computing :math:`{\mathbf{Z}}`. The classifier should be one of + `sklearn.linear_model.LogisticRegression`, `sklearn.linear_model.LogisticRegressionCV`, + `sklearn.svm.LinearSVC`, `sklearn.discriminant_analysis.LinearDiscriminantAnalysis`, + `sklearn.linear_model.RidgeClassifier`, `sklearn.linear_model.RidgeClassifierCV`, + `sklearn.linear_model.SGDClassifier`, or `Perceptron`. 
If a pre-fitted classifier + is provided, it is used to compute :math:`{\mathbf{Z}}`. + Note that any pre-fitting of the classifier will be lost if `PCovC` is + within a composite estimator that enforces cloning, e.g., + `sklearn.pipeline.Pipeline` with model caching. + In such cases, the classifier will be re-fitted on the same + training data as the composite estimator. + If None, ``sklearn.linear_model.LogisticRegression()`` + is used as the classifier. + + iterated_power : int or 'auto', default='auto' + Number of iterations for the power method computed by + svd_solver == 'randomized'. + Must be of range [0, infinity). + + random_state : int, RandomState instance or None, default=None + Used when the 'arpack' or 'randomized' solvers are used. Pass an int + for reproducible results across multiple function calls. + + whiten : boolean, deprecated + + Attributes + ---------- + mixing: float, default=0.5 + mixing parameter, as described in PCovC as :math:`{\alpha}` + + tol: float, default=1e-12 + Tolerance for singular values computed by svd_solver == 'arpack'. + Must be of range [0.0, infinity). + + space: {'feature', 'sample', 'auto'}, default='auto' + whether to compute the PCovC in `sample` or `feature` space + default=`sample` when :math:`{n_{samples} < n_{features}}` and + `feature` when :math:`{n_{features} < n_{samples}}` + + n_components_ : int + The estimated number of components, which equals the parameter + n_components, or the lesser value of n_features and n_samples + if n_components is None. + + classifier : estimator object + The linear classifier passed for fitting. + + z_classifier_ : estimator object + The linear classifier fit between :math:`\mathbf{X}` and :math:`\mathbf{Y}`. + + classifier_ : estimator object + The linear classifier fit between :math:`\mathbf{T}` and :math:`\mathbf{Y}`. + + pxt_ : ndarray of size :math:`({n_{features}, n_{components}})` + the projector, or weights, from the input space :math:`\mathbf{X}` + to the latent-space projection :math:`\mathbf{T}` + + pxz_ : ndarray of size :math:`({n_{features}, })` or :math:`({n_{features}, n_{classes}})` + the projector, or weights, from the input space :math:`\mathbf{X}` + to the class confidence scores :math:`\mathbf{Z}` + + ptz_ : ndarray of size :math:`({n_{components}, })` or :math:`({n_{components}, n_{classes}})` + the projector, or weights, from the latent-space projection + :math:`\mathbf{T}` to the class confidence scores :math:`\mathbf{Z}` + + explained_variance_ : numpy.ndarray of shape (n_components,) + The amount of variance explained by each of the selected components. + Equal to n_components largest eigenvalues + of the PCovC-modified covariance matrix of :math:`\mathbf{X}`. + + singular_values_ : numpy.ndarray of shape (n_components,) + The singular values corresponding to each of the selected components. 
+ + Examples + -------- + >>> import numpy as np + >>> from skmatter.decomposition import PCovC + >>> from sklearn.preprocessing import StandardScaler + >>> X = np.array([[-1, 0, -2, 3], [3, -2, 0, 1], [-3, 0, -1, -1], [1, 3, 0, -2]]) + >>> X = StandardScaler().fit_transform(X) + >>> Y = np.array([0, 1, 2, 0]) + >>> pcovc = PCovC(mixing=0.1, n_components=2) + >>> pcovc.fit(X, Y) + PCovC(mixing=0.1, n_components=2) + >>> pcovc.transform(X) + array([[-0.4794854 , -0.46228114], + [ 1.9416966 , 0.2532831 ], + [-1.08744947, 0.89117784], + [-0.37476173, -0.6821798 ]]) + >>> pcovc.predict(X) + array([0, 1, 2, 0]) + """ # NoQa: E501 + + def __init__( + self, + mixing=0.5, + n_components=None, + svd_solver="auto", + tol=1e-12, + space="auto", + classifier=None, + iterated_power="auto", + random_state=None, + whiten=False, + ): + super().__init__( + mixing=mixing, + n_components=n_components, + svd_solver=svd_solver, + tol=tol, + space=space, + iterated_power=iterated_power, + random_state=random_state, + whiten=whiten, + ) + self.classifier = classifier + + def fit(self, X, Y, W=None): + r"""Fit the model with X and Y. Note that W is taken from the + coefficients of a linear classifier fit between X and Y to compute + Z: + + .. math:: + \mathbf{Z} = \mathbf{X} \mathbf{W} + + We then call either `_fit_feature_space` or `_fit_sample_space`, + using Z as our approximation of Y. Finally, we refit a classifier on + T and Y to obtain :math:`\mathbf{P}_{TZ}`. + + Parameters + ---------- + X : numpy.ndarray, shape (n_samples, n_features) + Training data, where n_samples is the number of samples and + n_features is the number of features. + + It is suggested that :math:`\mathbf{X}` be centered by its column- + means and scaled. If features are related, the matrix should be + scaled to have unit variance, otherwise :math:`\mathbf{X}` should + be scaled so that each feature has a variance of 1 / n_features. + + Y : numpy.ndarray, shape (n_samples,) + Training data, where n_samples is the number of samples. + + W : numpy.ndarray, shape (n_features, n_properties) + Classification weights, optional when classifier= `precomputed`. 
If + not passed, it is assumed that the weights will be taken from a + linear classifier fit between :math:`\mathbf{X}` and :math:`\mathbf{Y}` + """ + X, Y = validate_data(self, X, Y, y_numeric=False) + check_classification_targets(Y) + self.classes_ = np.unique(Y) + + super().fit(X) + + compatible_classifiers = ( + LogisticRegression, + LogisticRegressionCV, + LinearSVC, + LinearDiscriminantAnalysis, + RidgeClassifier, + RidgeClassifierCV, + SGDClassifier, + Perceptron, + ) + + if self.classifier not in ["precomputed", None] and not isinstance( + self.classifier, compatible_classifiers + ): + raise ValueError( + "Classifier must be an instance of `" + f"{'`, `'.join(c.__name__ for c in compatible_classifiers)}`" + ", or `precomputed`" + ) + + if self.classifier != "precomputed": + if self.classifier is None: + classifier = LogisticRegression() + else: + classifier = self.classifier + + self.z_classifier_ = check_cl_fit(classifier, X, Y) + W = self.z_classifier_.coef_.T.reshape(X.shape[1], -1) + + else: + # If precomputed, use default classifier to predict Y from T + classifier = LogisticRegression() + if W is None: + W = LogisticRegression().fit(X, Y).coef_.T + W = W.reshape(X.shape[1], -1) + + Z = X @ W + + if self.space_ == "feature": + self._fit_feature_space(X, Y, Z) + else: + self._fit_sample_space(X, Y, Z, W) + + # instead of using linear regression solution, refit with the + # classifier and steal weights to get pxz and ptz + self.classifier_ = clone(classifier).fit(X @ self.pxt_, Y) + + self.ptz_ = self.classifier_.coef_.T + self.pxz_ = self.pxt_ @ self.ptz_ + + if len(Y.shape) == 1 and type_of_target(Y) == "binary": + self.pxz_ = self.pxz_.reshape( + X.shape[1], + ) + self.ptz_ = self.ptz_.reshape( + self.n_components_, + ) + + self.components_ = self.pxt_.T # for sklearn compatibility + return self + + def _fit_feature_space(self, X, Y, Z): + r"""In feature-space PCovC, the projectors are determined by: + + .. math:: + \mathbf{\tilde{C}} = \alpha \mathbf{X}^T \mathbf{X} + + (1 - \alpha) \left(\left(\mathbf{X}^T + \mathbf{X}\right)^{-\frac{1}{2}} \mathbf{X}^T + \mathbf{Z}\mathbf{Z}^T \mathbf{X} \left(\mathbf{X}^T + \mathbf{X}\right)^{-\frac{1}{2}}\right) + + where + + .. math:: + \mathbf{P}_{XT} = (\mathbf{X}^T \mathbf{X})^{-\frac{1}{2}} + \mathbf{U}_\mathbf{\tilde{C}}^T + \mathbf{\Lambda}_\mathbf{\tilde{C}}^{\frac{1}{2}} + + .. math:: + \mathbf{P}_{TX} = \mathbf{\Lambda}_\mathbf{\tilde{C}}^{-\frac{1}{2}} + \mathbf{U}_\mathbf{\tilde{C}}^T + (\mathbf{X}^T \mathbf{X})^{\frac{1}{2}} + """ + return super()._fit_feature_space(X, Y, Z, compute_pty_=False) + + def _fit_sample_space(self, X, Y, Z, W): + r"""In sample-space PCovC, the projectors are determined by: + + .. math:: + \mathbf{\tilde{K}} = \alpha \mathbf{X} \mathbf{X}^T + + (1 - \alpha) \mathbf{Z}\mathbf{Z}^T + + where + + .. math:: + \mathbf{P}_{XT} = \left(\alpha \mathbf{X}^T + (1 - \alpha) + \mathbf{W} \mathbf{Z}^T\right) + \mathbf{U}_\mathbf{\tilde{K}} + \mathbf{\Lambda}_\mathbf{\tilde{K}}^{-\frac{1}{2}} + + .. math:: + \mathbf{P}_{TX} = \mathbf{\Lambda}_\mathbf{\tilde{K}}^{-\frac{1}{2}} + \mathbf{U}_\mathbf{\tilde{K}}^T \mathbf{X} + """ + return super()._fit_sample_space(X, Y, Z, W, compute_pty_=False) + + def inverse_transform(self, T): + r"""Transform data back to its original space. + + .. 
math:: + \mathbf{\hat{X}} = \mathbf{T} \mathbf{P}_{TX} + = \mathbf{X} \mathbf{P}_{XT} \mathbf{P}_{TX} + + Parameters + ---------- + T : ndarray, shape (n_samples, n_components) + Projected data, where n_samples is the number of samples + and n_components is the number of components. + + Returns + ------- + X_original : numpy.ndarray, shape (n_samples, n_features) + """ + return super().inverse_transform(T) + + def decision_function(self, X=None, T=None): + r"""Predicts confidence scores from X or T. + + .. math:: + \mathbf{Z} = \mathbf{T} \mathbf{P}_{TZ} + = \mathbf{X} \mathbf{P}_{XT} \mathbf{P}_{TZ} + = \mathbf{X} \mathbf{P}_{XZ} + + Parameters + ---------- + X : ndarray, shape(n_samples, n_features) + Original data for which we want to get confidence scores, + where n_samples is the number of samples and n_features is the + number of features. + T : ndarray, shape (n_samples, n_components) + Projected data for which we want to get confidence scores, + where n_samples is the number of samples and n_components is the + number of components. + + Returns + ------- + Z : numpy.ndarray, shape (n_samples,) or (n_samples, n_classes) + Confidence scores. For binary classification, has shape `(n_samples,)`, + for multiclass classification, has shape `(n_samples, n_classes)` + """ + check_is_fitted(self, attributes=["pxz_", "ptz_"]) + + if X is None and T is None: + raise ValueError("Either X or T must be supplied.") + + if X is not None: + X = validate_data(self, X, reset=False) + # Or self.classifier_.decision_function(X @ self.pxt_) + return X @ self.pxz_ + self.classifier_.intercept_ + else: + T = check_array(T) + return T @ self.ptz_ + self.classifier_.intercept_ + + def predict(self, X=None, T=None): + """Predicts the property labels using classification on T.""" + check_is_fitted(self, attributes=["pxz_", "ptz_"]) + + if X is None and T is None: + raise ValueError("Either X or T must be supplied.") + + if X is not None: + X = validate_data(self, X, reset=False) + return self.classifier_.predict(X @ self.pxt_) + else: + T = check_array(T) + return self.classifier_.predict(T) + + def transform(self, X=None): + """Apply dimensionality reduction to X. + + ``X`` is projected on the first principal components as determined by + the modified PCovC distances. + + Parameters + ---------- + X : numpy.ndarray, shape (n_samples, n_features) + New data, where n_samples is the number of samples + and n_features is the number of features. 
+ """ + return super().transform(X) diff --git a/src/skmatter/decomposition/_pcovr.py b/src/skmatter/decomposition/_pcovr.py index bc094a720..417a82c12 100644 --- a/src/skmatter/decomposition/_pcovr.py +++ b/src/skmatter/decomposition/_pcovr.py @@ -1,25 +1,15 @@ -import numbers -import warnings - import numpy as np -from numpy.linalg import LinAlgError -from scipy import linalg -from scipy.linalg import sqrtm as MatrixSqrt -from scipy.sparse.linalg import svds -from sklearn.decomposition._base import _BasePCA -from sklearn.decomposition._pca import _infer_dimension +from sklearn.base import MultiOutputMixin, RegressorMixin from sklearn.linear_model import LinearRegression, Ridge, RidgeCV -from sklearn.linear_model._base import LinearModel -from sklearn.utils import check_array, check_random_state -from sklearn.utils._arpack import _init_arpack_v0 -from sklearn.utils.extmath import randomized_svd, stable_cumsum, svd_flip +from sklearn.utils import check_array from sklearn.utils.validation import check_is_fitted, validate_data -from ..utils import check_lr_fit, pcovr_covariance, pcovr_kernel +from skmatter.decomposition import _BasePCov +from skmatter.utils import check_lr_fit -class PCovR(_BasePCA, LinearModel): - r"""Principal Covariates Regression, as described in [deJong1992]_ +class PCovR(RegressorMixin, MultiOutputMixin, _BasePCov): + r"""Principal Covariates Regression, as described in [deJong1992]_, determines a latent-space projection :math:`\mathbf{T}` which minimizes a combined loss in supervised and unsupervised tasks. @@ -67,11 +57,13 @@ class PCovR(_BasePCA, LinearModel): mixing: float, default=0.5 mixing parameter, as described in PCovR as :math:`{\alpha}`, here named to avoid confusion with regularization parameter `alpha` + n_components : int, float or str, default=None Number of components to keep. if n_components is not set all components are kept:: n_components == min(n_samples, n_features) + svd_solver : {'auto', 'full', 'arpack', 'randomized'}, default='auto' If auto : The solver is selected by a default policy based on `X.shape` and @@ -88,13 +80,16 @@ class PCovR(_BasePCA, LinearModel): min(X.shape) If randomized : run randomized SVD by the method of Halko et al. + tol : float, default=1e-12 Tolerance for singular values computed by svd_solver == 'arpack'. Must be of range [0.0, infinity). + space: {'feature', 'sample', 'auto'}, default='auto' whether to compute the PCovR in `sample` or `feature` space default=`sample` when :math:`{n_{samples} < n_{features}}` and `feature` when :math:`{n_{features} < n_{samples}}` + regressor: {`Ridge`, `RidgeCV`, `LinearRegression`, `precomputed`}, default=None regressor for computing approximated :math:`{\mathbf{\hat{Y}}}`. The regressor should be one `sklearn.linear_model.Ridge`, `sklearn.linear_model.RidgeCV`, or @@ -108,42 +103,52 @@ class PCovR(_BasePCA, LinearModel): regressed form of the targets :math:`{\mathbf{\hat{Y}}}`. If None, ``sklearn.linear_model.Ridge('alpha':1e-6, 'fit_intercept':False, 'tol':1e-12)`` is used as the regressor. + iterated_power : int or 'auto', default='auto' - Number of iterations for the power method computed by svd_solver == - 'randomized'. Must be of range [0, infinity). + Number of iterations for the power method computed by svd_solver == + 'randomized'. Must be of range [0, infinity). + random_state : int, :class:`numpy.random.RandomState` instance or None, default=None - Used when the 'arpack' or 'randomized' solvers are used. 
Pass an int for - reproducible results across multiple function calls. - whiten : bool, deprecated + Used when the 'arpack' or 'randomized' solvers are used. Pass an int for + reproducible results across multiple function calls. + + whiten : boolean, deprecated Attributes ---------- mixing: float, default=0.5 mixing parameter, as described in PCovR as :math:`{\alpha}` + tol: float, default=1e-12 Tolerance for singular values computed by svd_solver == 'arpack'. Must be of range [0.0, infinity). + space: {'feature', 'sample', 'auto'}, default='auto' whether to compute the PCovR in `sample` or `feature` space default=`sample` when :math:`{n_{samples} < n_{features}}` and `feature` when :math:`{n_{features} < n_{samples}}` + n_components_ : int The estimated number of components, which equals the parameter n_components, or the lesser value of n_features and n_samples if n_components is None. + pxt_ : numpy.ndarray of size :math:`({n_{samples}, n_{components}})` the projector, or weights, from the input space :math:`\mathbf{X}` to the latent-space projection :math:`\mathbf{T}` - pty_ : numpy.ndarray of size :math:`({n_{components}, n_{properties}})` - the projector, or weights, from the latent-space projection :math:`\mathbf{T}` - to the properties :math:`\mathbf{Y}` + pxy_ : numpy.ndarray of size :math:`({n_{samples}, n_{properties}})` the projector, or weights, from the input space :math:`\mathbf{X}` to the properties :math:`\mathbf{Y}` + + pty_ : numpy.ndarray of size :math:`({n_{components}, n_{properties}})` + the projector, or weights, from the latent-space projection :math:`\mathbf{T}` + to the properties :math:`\mathbf{Y}` + explained_variance_ : numpy.ndarray of shape (n_components,) The amount of variance explained by each of the selected components. - Equal to n_components largest eigenvalues of the PCovR-modified covariance matrix of :math:`\mathbf{X}`. + singular_values_ : numpy.ndarray of shape (n_components,) The singular values corresponding to each of the selected components. @@ -166,7 +171,7 @@ class PCovR(_BasePCA, LinearModel): [-1.02805338, 1.06736871], [ 0.98166504, -4.98307078], [-2.9963189 , 1.98238856]]) - """ + """ # NoQa: E501 def __init__( self, @@ -180,16 +185,16 @@ def __init__( random_state=None, whiten=False, ): - self.mixing = mixing - self.n_components = n_components - self.space = space - - self.whiten = whiten - self.svd_solver = svd_solver - self.tol = tol - self.iterated_power = iterated_power - self.random_state = random_state - + super().__init__( + mixing=mixing, + n_components=n_components, + svd_solver=svd_solver, + tol=tol, + space=space, + iterated_power=iterated_power, + random_state=random_state, + whiten=whiten, + ) self.regressor = regressor def fit(self, X, Y, W=None): @@ -206,6 +211,7 @@ def fit(self, X, Y, W=None): means and scaled. If features are related, the matrix should be scaled to have unit variance, otherwise :math:`\mathbf{X}` should be scaled so that each feature has a variance of 1 / n_features. + Y : numpy.ndarray, shape (n_samples, n_properties) Training data, where n_samples is the number of samples and n_properties is the number of properties @@ -217,51 +223,23 @@ def fit(self, X, Y, W=None): If the passed regressor = `precomputed`, it is assumed that Y is the regressed form of the properties, :math:`{\mathbf{\hat{Y}}}`. + W : numpy.ndarray, shape (n_features, n_properties) - Regression weights, optional when regressor=`precomputed`. If not + Regression weights, optional when regressor= `precomputed`. 
If not passed, it is assumed that `W = np.linalg.lstsq(X, Y, self.tol)[0]` """ X, Y = validate_data(self, X, Y, y_numeric=True, multi_output=True) + super().fit(X) - # saved for inverse transformations from the latent space, - # should be zero in the case that the features have been properly centered - self.mean_ = np.mean(X, axis=0) - - if np.max(np.abs(self.mean_)) > self.tol: - warnings.warn( - "This class does not automatically center data, and your data mean is" - " greater than the supplied tolerance.", - stacklevel=1, - ) + compatible_regressors = (LinearRegression, Ridge, RidgeCV) - if self.space is not None and self.space not in [ - "feature", - "sample", - "auto", - ]: - raise ValueError("Only feature and sample space are supported.") - - # Handle self.n_components==None - if self.n_components is None: - if self.svd_solver != "arpack": - self.n_components_ = min(X.shape) - else: - self.n_components_ = min(X.shape) - 1 - else: - self.n_components_ = self.n_components - - if not any( - [ - self.regressor is None, - self.regressor == "precomputed", - isinstance(self.regressor, LinearRegression), - isinstance(self.regressor, Ridge), - isinstance(self.regressor, RidgeCV), - ] + if self.regressor not in ["precomputed", None] and not isinstance( + self.regressor, compatible_regressors ): raise ValueError( - "Regressor must be an instance of " - "`LinearRegression`, `Ridge`, `RidgeCV`, or `precomputed`" + "Regressor must be an instance of `" + f"{'`, `'.join(r.__name__ for r in compatible_regressors)}`" + ", or `precomputed`" ) # Assign the default regressor @@ -275,7 +253,7 @@ def fit(self, X, Y, W=None): else: regressor = self.regressor - self.regressor_ = check_lr_fit(regressor, X, y=Y) + self.regressor_ = check_lr_fit(regressor, X, Y) W = self.regressor_.coef_.T.reshape(X.shape[1], -1) Yhat = self.regressor_.predict(X).reshape(X.shape[0], -1) @@ -284,26 +262,6 @@ def fit(self, X, Y, W=None): if W is None: W = np.linalg.lstsq(X, Yhat, self.tol)[0] - # Handle svd_solver - self.fit_svd_solver_ = self.svd_solver - if self.fit_svd_solver_ == "auto": - # Small problem or self.n_components_ == 'mle', just call full PCA - if max(X.shape) <= 500 or self.n_components_ == "mle": - self.fit_svd_solver_ = "full" - elif self.n_components_ >= 1 and self.n_components_ < 0.8 * min(X.shape): - self.fit_svd_solver_ = "randomized" - # This is also the case of self.n_components_ in (0,1) - else: - self.fit_svd_solver_ = "full" - - self.n_samples_in_, self.n_features_in_ = X.shape - self.space_ = self.space - if self.space_ is None or self.space_ == "auto": - if self.n_samples_in_ > self.n_features_in_: - self.space_ = "feature" - else: - self.space_ = "sample" - if self.space_ == "feature": self._fit_feature_space(X, Y.reshape(Yhat.shape), Yhat) else: @@ -349,41 +307,7 @@ def _fit_feature_space(self, X, Y, Yhat): \mathbf{X})^{-\frac{1}{2}} \mathbf{X}^T \mathbf{Y} """ - Ct, iCsqrt = pcovr_covariance( - mixing=self.mixing, - X=X, - Y=Yhat, - rcond=self.tol, - return_isqrt=True, - ) - try: - Csqrt = np.linalg.lstsq(iCsqrt, np.eye(len(iCsqrt)), rcond=None)[0] - - # if we can avoid recomputing Csqrt, we should, but sometimes we - # run into a singular matrix, which is what we do here - except LinAlgError: - Csqrt = np.real(MatrixSqrt(X.T @ X)) - - if self.fit_svd_solver_ == "full": - U, S, Vt = self._decompose_full(Ct) - elif self.fit_svd_solver_ in ["arpack", "randomized"]: - U, S, Vt = self._decompose_truncated(Ct) - else: - raise ValueError( - "Unrecognized svd_solver='{0}'" "".format(self.fit_svd_solver_) 
- ) - - self.singular_values_ = np.sqrt(S.copy()) - self.explained_variance_ = S / (X.shape[0] - 1) - self.explained_variance_ratio_ = ( - self.explained_variance_ / self.explained_variance_.sum() - ) - - S_sqrt = np.diagflat([np.sqrt(s) if s > self.tol else 0.0 for s in S]) - S_sqrt_inv = np.diagflat([1.0 / np.sqrt(s) if s > self.tol else 0.0 for s in S]) - self.pxt_ = np.linalg.multi_dot([iCsqrt, Vt.T, S_sqrt]) - self.ptx_ = np.linalg.multi_dot([S_sqrt_inv, Vt, Csqrt]) - self.pty_ = np.linalg.multi_dot([S_sqrt_inv, Vt, iCsqrt, X.T, Y]) + return super()._fit_feature_space(X, Y, Yhat) def _fit_sample_space(self, X, Y, Yhat, W): r"""In sample-space PCovR, the projectors are determined by: @@ -408,144 +332,7 @@ def _fit_sample_space(self, X, Y, Yhat, W): \mathbf{P}_{TY} = \mathbf{\Lambda}_\mathbf{\tilde{K}}^{-\frac{1}{2}} \mathbf{U}_\mathbf{\tilde{K}}^T \mathbf{Y} """ - Kt = pcovr_kernel(mixing=self.mixing, X=X, Y=Yhat) - - if self.fit_svd_solver_ == "full": - U, S, Vt = self._decompose_full(Kt) - elif self.fit_svd_solver_ in ["arpack", "randomized"]: - U, S, Vt = self._decompose_truncated(Kt) - else: - raise ValueError( - "Unrecognized svd_solver='{0}'" "".format(self.fit_svd_solver_) - ) - - self.singular_values_ = np.sqrt(S.copy()) - self.explained_variance_ = S / (X.shape[0] - 1) - self.explained_variance_ratio_ = ( - self.explained_variance_ / self.explained_variance_.sum() - ) - - P = (self.mixing * X.T) + (1.0 - self.mixing) * W @ Yhat.T - S_sqrt_inv = np.diagflat([1.0 / np.sqrt(s) if s > self.tol else 0.0 for s in S]) - T = Vt.T @ S_sqrt_inv - - self.pxt_ = P @ T - self.pty_ = T.T @ Y - self.ptx_ = T.T @ X - - def _decompose_truncated(self, mat): - if not 1 <= self.n_components_ <= min(self.n_samples_in_, self.n_features_in_): - raise ValueError( - "n_components=%r must be between 1 and " - "min(n_samples, n_features)=%r with " - "svd_solver='%s'" - % ( - self.n_components_, - min(self.n_samples_in_, self.n_features_in_), - self.svd_solver, - ) - ) - elif not isinstance(self.n_components_, numbers.Integral): - raise ValueError( - "n_components=%r must be of type int " - "when greater than or equal to 1, was of type=%r" - % (self.n_components_, type(self.n_components_)) - ) - elif self.svd_solver == "arpack" and self.n_components_ == min( - self.n_samples_in_, self.n_features_in_ - ): - raise ValueError( - "n_components=%r must be strictly less than " - "min(n_samples, n_features)=%r with " - "svd_solver='%s'" - % ( - self.n_components_, - min(self.n_samples_in_, self.n_features_in_), - self.svd_solver, - ) - ) - - random_state = check_random_state(self.random_state) - - if self.fit_svd_solver_ == "arpack": - v0 = _init_arpack_v0(min(mat.shape), random_state) - U, S, Vt = svds(mat, k=self.n_components_, tol=self.tol, v0=v0) - # svds doesn't abide by scipy.linalg.svd/randomized_svd - # conventions, so reverse its outputs. 
- S = S[::-1] - # flip eigenvectors' sign to enforce deterministic output - U, Vt = svd_flip(U[:, ::-1], Vt[::-1]) - - # We have already eliminated all other solvers, so this must be "randomized" - else: - # sign flipping is done inside - U, S, Vt = randomized_svd( - mat, - n_components=self.n_components_, - n_iter=self.iterated_power, - flip_sign=True, - random_state=random_state, - ) - - return U, S, Vt - - def _decompose_full(self, mat): - if self.n_components_ == "mle": - if self.n_samples_in_ < self.n_features_in_: - raise ValueError( - "n_components='mle' is only supported " "if n_samples >= n_features" - ) - elif ( - not 0 <= self.n_components_ <= min(self.n_samples_in_, self.n_features_in_) - ): - raise ValueError( - "n_components=%r must be between 1 and " - "min(n_samples, n_features)=%r with " - "svd_solver='%s'" - % ( - self.n_components_, - min(self.n_samples_in_, self.n_features_in_), - self.svd_solver, - ) - ) - elif self.n_components_ >= 1: - if not isinstance(self.n_components_, numbers.Integral): - raise ValueError( - "n_components=%r must be of type int " - "when greater than or equal to 1, " - "was of type=%r" % (self.n_components_, type(self.n_components_)) - ) - - U, S, Vt = linalg.svd(mat, full_matrices=False) - - # flip eigenvectors' sign to enforce deterministic output - U, Vt = svd_flip(U, Vt) - - # Get variance explained by singular values - explained_variance_ = S / (self.n_samples_in_ - 1) - total_var = explained_variance_.sum() - explained_variance_ratio_ = explained_variance_ / total_var - - # Postprocess the number of components required - if self.n_components_ == "mle": - self.n_components_ = _infer_dimension( - explained_variance_, self.n_samples_in_ - ) - elif 0 < self.n_components_ < 1.0: - # number of components for which the cumulated explained - # variance percentage is superior to the desired threshold - # side='right' ensures that number of features selected - # their variance is always greater than self.n_components_ float - # passed. More discussion in issue: #15669 - ratio_cumsum = stable_cumsum(explained_variance_ratio_) - self.n_components_ = ( - np.searchsorted(ratio_cumsum, self.n_components_, side="right") + 1 - ) - return ( - U[:, : self.n_components_], - S[: self.n_components_], - Vt[: self.n_components_], - ) + return super()._fit_sample_space(X, Y, Yhat, W) def inverse_transform(self, T): r"""Transform data back to its original space. @@ -562,17 +349,9 @@ def inverse_transform(self, T): Returns ------- - X_original ndarray, shape (n_samples, n_features) + X_original : numpy.ndarray, shape (n_samples, n_features) """ - if np.max(np.abs(self.mean_)) > self.tol: - warnings.warn( - "This class does not automatically un-center data, and your data mean " - "is greater than the supplied tolerance, so the inverse transformation " - "will be off by the original data mean.", - stacklevel=1, - ) - - return T @ self.ptx_ + return super().inverse_transform(T) def predict(self, X=None, T=None): """Predicts the property values using regression on X or T.""" @@ -600,8 +379,6 @@ def transform(self, X=None): New data, where n_samples is the number of samples and n_features is the number of features. 
""" - check_is_fitted(self, ["pxt_", "mean_"]) - return super().transform(X) def score(self, X, y, T=None): @@ -647,3 +424,8 @@ def score(self, X, y, T=None): np.linalg.norm(X - Xrec) ** 2.0 / np.linalg.norm(X) ** 2.0 + np.linalg.norm(y - ypred) ** 2.0 / np.linalg.norm(y) ** 2.0 ) + + def __sklearn_tags__(self): + tags = super().__sklearn_tags__() + tags.regressor_tags.poor_score = True + return tags diff --git a/src/skmatter/utils/__init__.py b/src/skmatter/utils/__init__.py index 2f0c6b969..70e32616c 100644 --- a/src/skmatter/utils/__init__.py +++ b/src/skmatter/utils/__init__.py @@ -8,12 +8,16 @@ Y_feature_orthogonalizer, Y_sample_orthogonalizer, ) + +from ._pcovc_utils import check_cl_fit + from ._pcovr_utils import ( check_krr_fit, check_lr_fit, pcovr_covariance, pcovr_kernel, ) + from ._progress_bar import ( get_progress_bar, no_progress_bar, diff --git a/src/skmatter/utils/_pcovc_utils.py b/src/skmatter/utils/_pcovc_utils.py new file mode 100644 index 000000000..ea55dd60a --- /dev/null +++ b/src/skmatter/utils/_pcovc_utils.py @@ -0,0 +1,67 @@ +from copy import deepcopy + +import numpy as np +from sklearn import clone +from sklearn.exceptions import NotFittedError +from sklearn.utils.validation import check_is_fitted, validate_data + + +def check_cl_fit(classifier, X, y): + """ + Checks that a (linear) classifier is fitted, and if not, + fits it with the provided data. + + Parameters + ---------- + classifier : object + sklearn-style classifier + X : array-like + Feature matrix with which to fit the classifier if it is not already fitted + y : array-like + Target values with which to fit the classifier if it is not already fitted + + Returns + ------- + fitted_classifier : object + The fitted classifier. If input classifier was already fitted and compatible + with the data, returns a deep copy. Otherwise returns a newly fitted classifier. + + Raises + ------ + ValueError + If the fitted classifiers's coefficients have a shape incompatible with the + number of features in X or the number of classes in y. 
+ """ + try: + check_is_fitted(classifier) + fitted_classifier = deepcopy(classifier) + + # Check compatibility with X + validate_data(fitted_classifier, X, y, reset=False, multi_output=True) + + # Check compatibility with the number of features in X and the number of + # classes in y + n_classes = len(np.unique(y)) + + if n_classes == 2: + if fitted_classifier.coef_.shape[0] != 1: + raise ValueError( + "For binary classification, expected classifier coefficients " + "to have shape (1, " + f"{X.shape[1]}) but got shape " + f"{fitted_classifier.coef_.shape}" + ) + else: + if fitted_classifier.coef_.shape[0] != n_classes: + raise ValueError( + "For multiclass classification, expected classifier coefficients " + "to have shape " + f"({n_classes}, {X.shape[1]}) but got shape " + f"{fitted_classifier.coef_.shape}" + ) + + except NotFittedError: + fitted_classifier = clone(classifier) + fitted_classifier.fit(X, y) + + return fitted_classifier diff --git a/tests/test_check_estimators.py b/tests/test_check_estimators.py index fc89ecdb4..1683b53f3 100644 --- a/tests/test_check_estimators.py +++ b/tests/test_check_estimators.py @@ -1,6 +1,6 @@ from sklearn.utils.estimator_checks import parametrize_with_checks -from skmatter.decomposition import KernelPCovR, PCovR +from skmatter.decomposition import KernelPCovR, PCovC, PCovR from skmatter.feature_selection import CUR as fCUR from skmatter.feature_selection import FPS as fFPS from skmatter.feature_selection import PCovCUR as fPCovCUR @@ -13,6 +13,7 @@ [ KernelPCovR(mixing=0.5), PCovR(mixing=0.5), + PCovC(mixing=0.5), fCUR(), fFPS(), fPCovCUR(), diff --git a/tests/test_pcovc.py b/tests/test_pcovc.py new file mode 100644 index 000000000..57e501270 --- /dev/null +++ b/tests/test_pcovc.py @@ -0,0 +1,566 @@ +import unittest +import warnings + +import numpy as np +from sklearn import exceptions +from sklearn.datasets import load_breast_cancer as get_dataset +from sklearn.decomposition import PCA +from sklearn.linear_model import LogisticRegression +from sklearn.naive_bayes import GaussianNB +from sklearn.preprocessing import StandardScaler +from sklearn.utils.validation import check_X_y + +from skmatter.decomposition import PCovC + + +class PCovCBaseTest(unittest.TestCase): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + self.model = ( + lambda mixing=0.5, classifier=LogisticRegression(), **kwargs: PCovC( + mixing=mixing, classifier=classifier, **kwargs + ) + ) + + self.error_tol = 1e-5 + + self.X, self.Y = get_dataset(return_X_y=True) + + scaler = StandardScaler() + self.X = scaler.fit_transform(self.X) + + def setUp(self): + pass + + +class PCovCErrorTest(PCovCBaseTest): + def test_against_pca(self): + """Tests that mixing = 1.0 corresponds to PCA.""" + pcovc = PCovC( + mixing=1.0, n_components=2, space="feature", svd_solver="full" + ).fit(self.X, self.Y) + + pca = PCA(n_components=2, svd_solver="full").fit(self.X) + + # tests that the SVD is equivalent + self.assertTrue(np.allclose(pca.singular_values_, pcovc.singular_values_)) + self.assertTrue(np.allclose(pca.explained_variance_, pcovc.explained_variance_)) + + T_pcovc = pcovc.transform(self.X) + T_pca = pca.transform(self.X) + + # tests that the projections are equivalent + self.assertLessEqual( + np.linalg.norm(T_pcovc @ T_pcovc.T - T_pca @ T_pca.T), 1e-8 + ) + + def test_simple_reconstruction(self): + """Check that PCovC with a full eigendecomposition at mixing=1 can fully + reconstruct the input matrix. 
+ """ + for space in ["feature", "sample", "auto"]: + with self.subTest(space=space): + pcovc = self.model( + mixing=1.0, n_components=self.X.shape[-1], space=space + ) + pcovc.fit(self.X, self.Y) + Xr = pcovc.inverse_transform(pcovc.transform(self.X)) + self.assertLessEqual( + np.linalg.norm(self.X - Xr) ** 2.0 / np.linalg.norm(self.X) ** 2.0, + self.error_tol, + ) + + def test_simple_prediction(self): + """ + Check that PCovC with a full eigendecomposition at mixing=0 + can fully reconstruct the input properties. + """ + for space in ["feature", "sample", "auto"]: + with self.subTest(space=space): + pcovc = self.model(mixing=0.0, n_components=2, space=space) + + pcovc.classifier.fit(self.X, self.Y) + Yhat = pcovc.classifier.predict(self.X) + + pcovc.fit(self.X, self.Y) + Yp = pcovc.predict(self.X) + self.assertLessEqual( + np.linalg.norm(Yp - Yhat) ** 2.0 / np.linalg.norm(Yhat) ** 2.0, + self.error_tol, + ) + + def test_cl_with_x_errors(self): + """ + Check that PCovC returns a non-null property prediction + and that the prediction error increases with `mixing` + """ + prev_error = -1.0 + + for mixing in np.linspace(0, 1, 11): + pcovc = self.model(mixing=mixing, n_components=2, tol=1e-12) + pcovc.fit(self.X, self.Y) + + Yp = pcovc.predict(X=self.X) + error = np.linalg.norm(self.Y - Yp) ** 2.0 / np.linalg.norm(self.Y) ** 2.0 + + with self.subTest(error=error): + self.assertFalse(np.isnan(error)) + with self.subTest(error=error, alpha=round(mixing, 4)): + self.assertGreaterEqual(error, prev_error - self.error_tol) + + prev_error = error + + def test_cl_with_t_errors(self): + """Check that PCovc returns a non-null property prediction from the latent space + projection and that the prediction error increases with `mixing`. + """ + prev_error = -1.0 + + for mixing in np.linspace(0, 1, 11): + pcovc = self.model(mixing=mixing, n_components=2, tol=1e-12) + pcovc.fit(self.X, self.Y) + + T = pcovc.transform(self.X) + Yp = pcovc.predict(T=T) + error = np.linalg.norm(self.Y - Yp) ** 2.0 / np.linalg.norm(self.Y) ** 2.0 + + with self.subTest(error=error): + self.assertFalse(np.isnan(error)) + with self.subTest(error=error, alpha=round(mixing, 4)): + self.assertGreaterEqual(error, prev_error - self.error_tol) + + prev_error = error + + def test_reconstruction_errors(self): + """Check that PCovC returns a non-null reconstructed X and that the + reconstruction error decreases with `mixing`. + """ + prev_error = 1.0 + + for mixing in np.linspace(0, 1, 11): + pcovc = self.model(mixing=mixing, n_components=2, tol=1e-12) + pcovc.fit(self.X, self.Y) + + Xr = pcovc.inverse_transform(pcovc.transform(self.X)) + error = np.linalg.norm(self.X - Xr) ** 2.0 / np.linalg.norm(self.X) ** 2.0 + + with self.subTest(error=error): + self.assertFalse(np.isnan(error)) + with self.subTest(error=error, alpha=round(mixing, 4)): + self.assertLessEqual(error, prev_error + self.error_tol) + + prev_error = error + + +class PCovCSpaceTest(PCovCBaseTest): + def test_select_feature_space(self): + """ + Check that PCovC implements the feature space version + when :math:`n_{features} < n_{samples}``. + """ + pcovc = self.model(n_components=2, tol=1e-12) + pcovc.fit(self.X, self.Y) + + self.assertTrue(pcovc.space_ == "feature") + + def test_select_sample_space(self): + """ + Check that PCovC implements the sample space version + when :math:`n_{features} > n_{samples}``. 
+ """ + pcovc = self.model(n_components=2, tol=1e-12) + + n_samples = self.X.shape[1] - 1 + pcovc.fit(self.X[:n_samples], self.Y[:n_samples]) + + self.assertTrue(pcovc.space_ == "sample") + + def test_bad_space(self): + """ + Check that PCovC raises a ValueError when a non-valid + space is designated. + """ + with self.assertRaises(ValueError): + pcovc = self.model(n_components=2, tol=1e-12, space="bad") + pcovc.fit(self.X, self.Y) + + def test_override_spaceselection(self): + """ + Check that PCovC implements the space provided in the + constructor, overriding that chosen by the input dimensions. + """ + pcovc = self.model(n_components=2, tol=1e-12, space="sample") + pcovc.fit(self.X, self.Y) + + self.assertTrue(pcovc.space_ == "sample") + + def test_spaces_equivalent(self): + """ + Check that the results from PCovC, regardless of the space, + are equivalent. + """ + for alpha in np.linspace(0.01, 0.99, 11): + with self.subTest(alpha=alpha, type="prediction"): + pcovc_ss = self.model( + n_components=2, mixing=alpha, tol=1e-12, space="sample" + ) + pcovc_ss.fit(self.X, self.Y) + + pcovc_fs = self.model( + n_components=2, mixing=alpha, tol=1e-12, space="feature" + ) + pcovc_fs.fit(self.X, self.Y) + + self.assertTrue( + np.allclose( + pcovc_ss.decision_function(self.X), + pcovc_fs.decision_function(self.X), + self.error_tol, + ) + ) + + with self.subTest(alpha=alpha, type="reconstruction"): + pcovc_ss = self.model( + n_components=2, mixing=alpha, tol=1e-12, space="sample" + ) + pcovc_ss.fit(self.X, self.Y) + + pcovc_fs = self.model( + n_components=2, mixing=alpha, tol=1e-12, space="feature" + ) + pcovc_fs.fit(self.X, self.Y) + self.assertTrue( + np.allclose( + pcovc_ss.inverse_transform(pcovc_ss.transform(self.X)), + pcovc_fs.inverse_transform(pcovc_fs.transform(self.X)), + self.error_tol, + ) + ) + + +class PCovCTestSVDSolvers(PCovCBaseTest): + def test_svd_solvers(self): + """ + Check that PCovC works with all svd_solver modes and assigns + the right n_components + """ + for solver in ["arpack", "full", "randomized", "auto"]: + with self.subTest(solver=solver): + pcovc = self.model(tol=1e-12, svd_solver=solver) + pcovc.fit(self.X, self.Y) + + if solver == "arpack": + self.assertTrue(pcovc.n_components_ == min(self.X.shape) - 1) + else: + self.assertTrue(pcovc.n_components_ == min(self.X.shape)) + + def test_bad_solver(self): + """ + Check that PCovC will not work with a solver that isn't in + ['arpack', 'full', 'randomized', 'auto'] + """ + for space in ["feature", "sample"]: + with self.assertRaises(ValueError) as cm: + pcovc = self.model(svd_solver="bad", space=space) + pcovc.fit(self.X, self.Y) + + self.assertEqual(str(cm.exception), "Unrecognized svd_solver='bad'" "") + + def test_good_n_components(self): + """Check that PCovC will work with any allowed values of n_components.""" + # this one should pass + pcovc = self.model(n_components=0.5, svd_solver="full") + pcovc.fit(self.X, self.Y) + + for svd_solver in ["auto", "full"]: + # this one should pass + pcovc = self.model(n_components=2, svd_solver=svd_solver) + pcovc.fit(self.X, self.Y) + + # this one should pass + pcovc = self.model(n_components="mle", svd_solver=svd_solver) + pcovc.fit(self.X, self.Y) + + def test_bad_n_components(self): + """Check that PCovC will not work with any prohibited values of n_components.""" + with self.assertRaises(ValueError) as cm: + pcovc = self.model( + n_components="mle", classifier=LogisticRegression(), svd_solver="full" + ) + pcovc.fit(self.X[:20], self.Y[:20]) + self.assertEqual( + 
str(cm.exception), + "n_components='mle' is only supported " "if n_samples >= n_features", + ) + + with self.subTest(type="negative_ncomponents"): + with self.assertRaises(ValueError) as cm: + pcovc = self.model(n_components=-1, svd_solver="auto") + pcovc.fit(self.X, self.Y) + + self.assertEqual( + str(cm.exception), + "n_components=%r must be between 1 and " + "min(n_samples, n_features)=%r with " + "svd_solver='%s'" + % ( + pcovc.n_components_, + min(self.X.shape), + pcovc.svd_solver, + ), + ) + with self.subTest(type="0_ncomponents"): + with self.assertRaises(ValueError) as cm: + pcovc = self.model(n_components=0, svd_solver="randomized") + pcovc.fit(self.X, self.Y) + + self.assertEqual( + str(cm.exception), + "n_components=%r must be between 1 and " + "min(n_samples, n_features)=%r with " + "svd_solver='%s'" + % ( + pcovc.n_components_, + min(self.X.shape), + pcovc.svd_solver, + ), + ) + with self.subTest(type="arpack_X_ncomponents"): + with self.assertRaises(ValueError) as cm: + pcovc = self.model(n_components=min(self.X.shape), svd_solver="arpack") + pcovc.fit(self.X, self.Y) + self.assertEqual( + str(cm.exception), + "n_components=%r must be strictly less than " + "min(n_samples, n_features)=%r with " + "svd_solver='%s'" + % ( + pcovc.n_components_, + min(self.X.shape), + pcovc.svd_solver, + ), + ) + + for svd_solver in ["auto", "full"]: + with self.subTest(type="pi_ncomponents"): + with self.assertRaises(ValueError) as cm: + pcovc = self.model(n_components=np.pi, svd_solver=svd_solver) + pcovc.fit(self.X, self.Y) + self.assertEqual( + str(cm.exception), + "n_components=%r must be of type int " + "when greater than or equal to 1, was of type=%r" + % (pcovc.n_components_, type(pcovc.n_components_)), + ) + + +class PCovCInfrastructureTest(PCovCBaseTest): + def test_nonfitted_failure(self): + """ + Check that PCovC will raise a `NonFittedError` if + `transform` is called before the pcovc is fitted + """ + pcovc = self.model(n_components=2, tol=1e-12) + with self.assertRaises(exceptions.NotFittedError): + _ = pcovc.transform(self.X) + + def test_no_arg_predict(self): + """ + Check that PCovC will raise a `ValueError` if + `predict` is called without arguments + """ + pcovc = self.model(n_components=2, tol=1e-12) + pcovc.fit(self.X, self.Y) + with self.assertRaises(ValueError): + _ = pcovc.predict() + + def test_centering(self): + """ + Check that PCovC raises a warning if + given uncentered data. + """ + pcovc = self.model(n_components=2, tol=1e-12) + X = self.X.copy() + np.random.uniform(-1, 1, self.X.shape[1]) + with warnings.catch_warnings(record=True) as w: + pcovc.fit(X, self.Y) + self.assertEqual( + str(w[0].message), + "This class does not automatically center data, and your data " + "mean is greater than the supplied tolerance.", + ) + + def test_T_shape(self): + """Check that PCovC returns a latent space projection consistent with + the shape of the input matrix. + """ + n_components = 5 + pcovc = self.model(n_components=n_components, tol=1e-12) + pcovc.fit(self.X, self.Y) + T = pcovc.transform(self.X) + self.assertTrue(check_X_y(self.X, T, multi_output=True)) + self.assertTrue(T.shape[-1] == n_components) + + def test_Z_shape(self): + """Check that PCovC returns an evidence matrix consistent with the + number of samples and the number of classes. 
+ """ + n_components = 5 + pcovc = self.model(n_components=n_components, tol=1e-12) + pcovc.fit(self.X, self.Y) + + # Shape (n_samples, ) for binary classifcation + Z = pcovc.decision_function(self.X) + + self.assertTrue(Z.ndim == 1) + self.assertTrue(Z.shape[0] == self.X.shape[0]) + + # Modify Y so that it now contains three classes + Y_multiclass = self.Y.copy() + Y_multiclass[0] = 2 + pcovc.fit(self.X, Y_multiclass) + n_classes = len(np.unique(Y_multiclass)) + + # Shape (n_samples, n_classes) for multiclass classification + Z = pcovc.decision_function(self.X) + + self.assertTrue(Z.ndim == 2) + self.assertTrue((Z.shape[0], Z.shape[1]) == (self.X.shape[0], n_classes)) + + def test_default_ncomponents(self): + pcovc = PCovC(mixing=0.5) + pcovc.fit(self.X, self.Y) + + self.assertEqual(pcovc.n_components_, min(self.X.shape)) + + def test_Y_Shape(self): + pcovc = self.model() + Y = np.vstack(self.Y) + pcovc.fit(self.X, Y) + + self.assertEqual(pcovc.pxz_.shape[0], self.X.shape[1]) + self.assertEqual(pcovc.ptz_.shape[0], pcovc.n_components_) + + def test_prefit_classifier(self): + classifier = LogisticRegression() + classifier.fit(self.X, self.Y) + pcovc = self.model(mixing=0.5, classifier=classifier) + pcovc.fit(self.X, self.Y) + + Z_classifier = classifier.decision_function(self.X).reshape(self.X.shape[0], -1) + W_classifier = classifier.coef_.T.reshape(self.X.shape[1], -1) + + Z_pcovc = pcovc.z_classifier_.decision_function(self.X).reshape( + self.X.shape[0], -1 + ) + W_pcovc = pcovc.z_classifier_.coef_.T.reshape(self.X.shape[1], -1) + + self.assertTrue(np.allclose(Z_classifier, Z_pcovc)) + self.assertTrue(np.allclose(W_classifier, W_pcovc)) + + def test_precomputed_classification(self): + classifier = LogisticRegression() + classifier.fit(self.X, self.Y) + + W = classifier.coef_.T.reshape(self.X.shape[1], -1) + pcovc1 = self.model(mixing=0.5, classifier="precomputed", n_components=1) + pcovc1.fit(self.X, self.Y, W) + t1 = pcovc1.transform(self.X) + + pcovc2 = self.model(mixing=0.5, classifier=classifier, n_components=1) + pcovc2.fit(self.X, self.Y) + t2 = pcovc2.transform(self.X) + + self.assertTrue(np.linalg.norm(t1 - t2) < self.error_tol) + + # Now check for match when W is not passed: + pcovc3 = self.model(mixing=0.5, classifier="precomputed", n_components=1) + pcovc3.fit(self.X, self.Y) + t3 = pcovc3.transform(self.X) + + self.assertTrue(np.linalg.norm(t3 - t2) < self.error_tol) + self.assertTrue(np.linalg.norm(t3 - t1) < self.error_tol) + + def test_classifier_modifications(self): + classifier = LogisticRegression() + pcovc = self.model(mixing=0.5, classifier=classifier) + + # PCovC classifier matches the original + self.assertTrue(classifier.get_params() == pcovc.classifier.get_params()) + + # PCovC classifier updates its parameters + # to match the original classifier + classifier.set_params(random_state=2) + self.assertTrue(classifier.get_params() == pcovc.classifier.get_params()) + + # Fitting classifier outside PCovC fits the PCovC classifier + classifier.fit(self.X, self.Y) + self.assertTrue(hasattr(pcovc.classifier, "coef_")) + + # PCovC classifier doesn't change after fitting + pcovc.fit(self.X, self.Y) + classifier.set_params(random_state=3) + self.assertTrue(hasattr(pcovc.classifier_, "coef_")) + self.assertTrue(classifier.get_params() != pcovc.classifier_.get_params()) + + def test_incompatible_classifier(self): + self.maxDiff = None + classifier = GaussianNB() + classifier.fit(self.X, self.Y) + pcovc = self.model(mixing=0.5, classifier=classifier) + + with 
self.assertRaises(ValueError) as cm: + pcovc.fit(self.X, self.Y) + self.assertEqual( + str(cm.exception), + "Classifier must be an instance of " + "`LogisticRegression`, `LogisticRegressionCV`, `LinearSVC`, " + "`LinearDiscriminantAnalysis`, `RidgeClassifier`, " + "`RidgeClassifierCV`, `SGDClassifier`, `Perceptron`, " + "or `precomputed`", + ) + + def test_none_classifier(self): + pcovc = PCovC(mixing=0.5, classifier=None) + + pcovc.fit(self.X, self.Y) + self.assertTrue(pcovc.classifier is None) + self.assertTrue(pcovc.classifier_ is not None) + + def test_incompatible_coef_shape(self): + classifier1 = LogisticRegression() + + # Modify Y to be multiclass + Y_multiclass = self.Y.copy() + Y_multiclass[0] = 2 + + classifier1.fit(self.X, Y_multiclass) + pcovc1 = self.model(mixing=0.5, classifier=classifier1) + + # Binary classification shape mismatch + with self.assertRaises(ValueError) as cm: + pcovc1.fit(self.X, self.Y) + self.assertEqual( + str(cm.exception), + "For binary classification, expected classifier coefficients " + "to have shape (1, %d) but got shape %r" + % (self.X.shape[1], classifier1.coef_.shape), + ) + + classifier2 = LogisticRegression() + classifier2.fit(self.X, self.Y) + pcovc2 = self.model(mixing=0.5, classifier=classifier2) + + # Multiclass classification shape mismatch + with self.assertRaises(ValueError) as cm: + pcovc2.fit(self.X, Y_multiclass) + self.assertEqual( + str(cm.exception), + "For multiclass classification, expected classifier coefficients " + "to have shape (%d, %d) but got shape %r" + % (len(np.unique(Y_multiclass)), self.X.shape[1], classifier2.coef_.shape), + ) + + +if __name__ == "__main__": + unittest.main(verbosity=2)
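
Note (not part of the patch): a minimal usage sketch of the new ``check_cl_fit`` helper, based on the behaviour implemented in ``src/skmatter/utils/_pcovc_utils.py`` above. The dataset, scaling, and classifier settings are illustrative assumptions, not taken from the diff.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    from skmatter.utils import check_cl_fit

    X, y = load_breast_cancer(return_X_y=True)
    X = StandardScaler().fit_transform(X)

    # Unfitted classifier: check_cl_fit clones it and fits the clone on (X, y).
    clf = LogisticRegression(max_iter=500)
    fitted = check_cl_fit(clf, X, y)
    print(fitted.coef_.shape)  # (1, n_features) for this binary problem

    # Already-fitted classifier: a deep copy is returned, provided its
    # coefficients are compatible with X and y; otherwise a ValueError is raised.
    clf.fit(X, y)
    copied = check_cl_fit(clf, X, y)
    print(np.allclose(copied.coef_, clf.coef_))  # True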
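
Note (not part of the patch): the decision-function shape convention that ``test_Z_shape`` exercises can be summarized with a short sketch: binary problems yield ``Z`` of shape ``(n_samples,)``, multiclass problems yield ``(n_samples, n_classes)``. The datasets and the explicitly passed ``LogisticRegression`` classifier are illustrative assumptions.

    from sklearn.datasets import load_breast_cancer, load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    from skmatter.decomposition import PCovC

    # Binary problem: Z has shape (n_samples,)
    X, y = load_breast_cancer(return_X_y=True)
    X = StandardScaler().fit_transform(X)
    pcovc = PCovC(mixing=0.5, n_components=2,
                  classifier=LogisticRegression(max_iter=500))
    pcovc.fit(X, y)
    print(pcovc.decision_function(X).shape)  # expected: (569,)

    # Multiclass problem: Z has shape (n_samples, n_classes)
    X, y = load_iris(return_X_y=True)
    X = StandardScaler().fit_transform(X)
    pcovc = PCovC(mixing=0.5, n_components=2,
                  classifier=LogisticRegression(max_iter=500))
    pcovc.fit(X, y)
    print(pcovc.decision_function(X).shape)  # expected: (150, 3)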