Validation set becomes subset of train set in SurvivalModel.fit() #149

@ShaunWolfe04

Bug Report: Data Leakage in auton_survival.estimators.SurvivalModel.fit

When weights is not passed (weights=None), the SurvivalModel.fit method trains the model on a dataset that contains the validation samples, leading to data leakage.

Cause:

The .fit method creates a correct train/validation split internally. However, the subsequent internal call to '_fit_dsm' is passed 'features', which is reassigned to the training split only inside an if statement that runs when 'weights' is not None. When 'weights=None', 'features' still refers to the original, full dataset, so the validation samples are also used for training.
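The pattern described above can be reproduced in isolation. The following is a minimal sketch, not the library's actual code: `split`, `rows_used_leaky`, and `rows_used_fixed` are hypothetical names, and the weighted branch is simplified to a plain reassignment. Each function returns the row indices that would reach the trainer and the row indices held out for validation.

```python
import pandas as pd


def split(data, vsize=0.2, seed=0):
    """Hypothetical helper: random train/validation split by index."""
    train = data.sample(frac=1 - vsize, random_state=seed)
    val = data[~data.index.isin(train.index)]
    return train, val


def rows_used_leaky(features, outcomes, weights=None, vsize=0.2):
    """Mimics the reported pattern: 'features' is narrowed only if weights are given."""
    data = features.join(outcomes)
    data_train, data_val = split(data, vsize)
    if weights is not None:
        # Simplified stand-in for the weighted branch (assumption).
        features = data_train[features.columns]
    # With weights=None, 'features' is still the full dataset here.
    return set(features.index), set(data_val.index)


def rows_used_fixed(features, outcomes, weights=None, vsize=0.2):
    """Possible fix: always narrow 'features' to the training split."""
    data = features.join(outcomes)
    data_train, data_val = split(data, vsize)
    features = data_train[features.columns]
    return set(features.index), set(data_val.index)


features = pd.DataFrame({"x": range(10)})
outcomes = pd.DataFrame({"time": range(10), "event": [0, 1] * 5})

leaky_train, leaky_val = rows_used_leaky(features, outcomes)
fixed_train, fixed_val = rows_used_fixed(features, outcomes)

print("leaky overlap:", sorted(leaky_train & leaky_val))
print("fixed overlap:", sorted(fixed_train & fixed_val))
```

In the leaky variant every validation row also appears among the training rows; in the fixed variant the two sets are disjoint.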

Impact:

This data leakage makes the reported validation loss unreliable and overly optimistic. Because the model has already been trained on the validation samples, the validation loss tends to follow the training loss, so early stopping on the validation set is much less likely to be triggered. This masks overfitting: models can appear more stable than they are while in reality they are deeply overfitting.
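To illustrate why a leaked validation set hides overfitting, here is a small sketch that is unrelated to the library's models. It uses a 1-nearest-neighbour memorizer on labels that are pure noise, so any apparent skill is overfitting. All names (`one_nn_error`, `clean_err`, `leaky_err`) are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = rng.integers(0, 2, size=400)  # labels are pure noise: nothing to learn

val_idx = np.arange(0, 100)
train_idx = np.arange(100, 400)


def one_nn_error(X_fit, y_fit, X_eval, y_eval):
    """Error rate of a 1-nearest-neighbour classifier that memorizes X_fit."""
    # Pairwise squared Euclidean distances, shape (n_eval, n_fit).
    d = ((X_eval[:, None, :] - X_fit[None, :, :]) ** 2).sum(axis=2)
    pred = y_fit[d.argmin(axis=1)]
    return float((pred != y_eval).mean())


# Proper split: the validation rows are never seen during fitting.
clean_err = one_nn_error(X[train_idx], y[train_idx], X[val_idx], y[val_idx])

# Leaky split: the model is fit on the full dataset, validation rows included.
leaky_err = one_nn_error(X, y, X[val_idx], y[val_idx])

print("validation error, proper split:", clean_err)
print("validation error, leaky split: ", leaky_err)
```

With the leaky split each validation point finds itself in the fitted data, so the validation error is zero even though the labels are random. A validation metric like this gives early stopping no signal to act on.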
