Add `split_col` (as a series/array) as optional argument to the `fit` method of `DropHighPSIFeatures`

**Is your feature request related to a problem? Please describe.**

`DropHighPSIFeatures` currently takes one of the input features and uses it to split the dataframe in two. If no `split_col` is provided, it falls back to the dataframe index.

This kind of behavior forces the user of this class to either:

1. Have `split_col` as a feature of the input dataframe.
2. Set `split_col` as dataframe index.

Neither option is ideal: more often than not, `split_col` is not a column to generate features from, it is just metadata. Pipeline components that run before `DropHighPSIFeatures` may not cope well with `split_col` being part of the input dataframe, so each of them has to be configured to explicitly ignore it.

Example of current workflow:

```python
from sklearn.pipeline import make_pipeline
from feature_engine.selection import DropFeatures, DropHighPSIFeatures
from sklearn.linear_model import LogisticRegression
import pandas as pd

df = pd.DataFrame(
    {
        "feat": [1, 2, 1, 2, 1] * 5,
        "date": ["2025-01-01", "2025-01-02", "2025-01-03", "2025-01-04", "2025-01-05"] * 5,
        "target": [1, 0, 1, 0, 1] * 5,
    }
)

pipeline = make_pipeline(
    DropHighPSIFeatures(
        threshold=0.95,
        split_col="date",
    ),
    # Here we are dropping the date column
    # because it is not needed for the model
    DropFeatures(features_to_drop=["date"]),
    LogisticRegression(),
)

pipeline.fit(df[["feat", "date"]], df["target"])
```

**Describe the solution you'd like**

To avoid this, the `fit` method of `DropHighPSIFeatures` could accept `split_col` (as a series or array) as an optional argument.

With this change, the user of `DropHighPSIFeatures` has the option of passing the split column as metadata using the new [sklearn metadata API](https://scikit-learn.org/stable/metadata_routing.html) ([more on it here](https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_metadata_routing.html#consuming-estimator)), without having to keep the split column as a feature of the input dataframe.

Example of workflow I would like to have:

```python
import sklearn
from sklearn.pipeline import make_pipeline
from feature_engine.selection import DropHighPSIFeatures
from sklearn.linear_model import LogisticRegression
import pandas as pd

sklearn.set_config(enable_metadata_routing=True)

df = pd.DataFrame(
    {
        "feat": [1, 2, 1, 2, 1] * 5,
        "date": ["2025-01-01", "2025-01-02", "2025-01-03", "2025-01-04", "2025-01-05"] * 5,
        "target": [1, 0, 1, 0, 1] * 5,
    }
)

pipeline = make_pipeline(
    # Here we tell sklearn that split_col needs to be passed only to the
    # fit method of DropHighPSIFeatures, and not to any of the other
    # pipeline components
    DropHighPSIFeatures(threshold=0.95).set_fit_request(split_col=True),
    LogisticRegression(),
)

# No need to keep the date column in the X dataframe anymore
pipeline.fit(df[["feat"]], df["target"], split_col=df["date"])
```

**Describe alternatives you've considered**
Setting `split_col` as the dataframe index is a potential workaround, but it feels more like a hack than a solution.
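For reference, this is roughly what the index workaround does to the input, using the toy data from the examples above. The sketch is pandas-only and does not call `DropHighPSIFeatures`; it just shows the dataframe the pipeline would receive.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "feat": [1, 2, 1, 2, 1] * 5,
        "date": ["2025-01-01", "2025-01-02", "2025-01-03", "2025-01-04", "2025-01-05"] * 5,
        "target": [1, 0, 1, 0, 1] * 5,
    }
)

# Move the split column out of the features and into the index
X = df[["feat"]].set_index(df["date"])

print(list(X.columns))    # only the real feature is left as a column
print(X.index.is_unique)  # the index now carries repeated dates
```

One reason this may feel like a hack is that the index stops being a row identifier and has to be preserved by every step that runs before the selector.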

**Additional context**
N/A


Add `split_col` (as a series/array) as optional argument to the `fit` method of `DropHighPSIFeatures` #860

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Add split_col (as a series/array) as optional argument to the fit method of DropHighPSIFeatures #860

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Add `split_col` (as a series/array) as optional argument to the `fit` method of `DropHighPSIFeatures` #860