Description
Is your feature request related to a problem? Please describe.
DropHighPSIFeatures currently takes one of the input features and uses it to split the dataframe in two. If no split_col column is provided, it uses the order of the dataframe, that is, its index.
This kind of behavior forces the user of this class to either:
- Have the split_col as a feature of the input dataframe.
- Set split_col as the dataframe index.
Neither option is ideal because, more often than not, split_col is not a column to generate features from but plain metadata. Pipeline components that run before DropHighPSIFeatures may not accept split_col as part of the input dataframe, so each of them has to be configured to ignore it explicitly.
Example of current workflow:
import pandas as pd
from feature_engine.selection import DropFeatures, DropHighPSIFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

df = pd.DataFrame(
    {
        "feat": [1, 2, 1, 2, 1] * 5,
        "date": ["2025-01-01", "2025-01-02", "2025-01-03", "2025-01-04", "2025-01-05"] * 5,
        "target": [1, 0, 1, 0, 1] * 5,
    }
)

pipeline = make_pipeline(
    DropHighPSIFeatures(
        threshold=0.95,
        split_col="date",
    ),
    # Here we are dropping the date column
    # because it is not needed for the model
    DropFeatures(features_to_drop=["date"]),
    LogisticRegression(),
)

pipeline.fit(df[["feat", "date"]], df["target"])
Describe the solution you'd like
To prevent this from happening, we might add split_col (in the form of a series or array) as a parameter of the fit method of DropHighPSIFeatures.
With this change, the user of DropHighPSIFeatures has the option of passing the split column as metadata using the new sklearn metadata routing API (more on it here), without having to keep the split column as a feature of the input dataframe.
Example of workflow I would like to have:
import pandas as pd
import sklearn
from feature_engine.selection import DropHighPSIFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sklearn.set_config(enable_metadata_routing=True)

df = pd.DataFrame(
    {
        "feat": [1, 2, 1, 2, 1] * 5,
        "date": ["2025-01-01", "2025-01-02", "2025-01-03", "2025-01-04", "2025-01-05"] * 5,
        "target": [1, 0, 1, 0, 1] * 5,
    }
)

pipeline = make_pipeline(
    DropHighPSIFeatures(
        threshold=0.95,
        # Here we tell sklearn that split_col needs to be passed
        # only to the fit method of DropHighPSIFeatures, and not to
        # the other pipeline components
    ).set_fit_request(split_col=True),
    LogisticRegression(),
)

# No need to keep the date column in the X dataframe anymore
pipeline.fit(df[["feat"]], df["target"], split_col=df["date"])
Describe alternatives you've considered
Setting split_col as the dataframe index is a potential solution to the issue, but it feels more like a hack than a solution.
Additional context
N/A