Refactor: Generalize dataset base classes & consolidate dynamic splitting logic #122

@aditya0by0

Description:
The dynamic splitting code in chebi.py and in the proteins repo’s data class is currently duplicated. The two implementations are effectively identical, so every fix or change has to be made in both places.

Proposed changes:

  1. Move the common code to a base class (e.g., DynamicDataset) that encapsulates the shared dynamic splitting logic.

    • Both ChEBI and protein dataset classes should inherit from this base class.
    • This will centralize changes and make maintenance easier.
  2. Refactor dataset hierarchy to be more generic:

    • Certain hyperparameters that are specific to ChEBI, such as

      chebi_version: int = 200

      in XYBaseDataModule, should be moved down into a ChEBI-specific base class rather than living in the generic base.

  3. Outcome:

    • Eliminate duplicate code between chebi.py and the proteins repo.
    • Improve maintainability by isolating dataset-specific configurations.
    • Make it easier to introduce new datasets without rewriting the splitting logic.
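A minimal sketch of what the proposed hierarchy could look like, to make the discussion concrete. Only DynamicDataset, XYBaseDataModule and chebi_version come from this issue. Everything else (ChEBIDataset, ProteinDataset, load_ids, dynamic_split, the ratio and seed parameters) is a hypothetical name chosen for illustration, and the plain seeded random split is only a stand-in for whatever splitting logic the two repos actually share.

```python
import random
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence


class XYBaseDataModule(ABC):
    """Stand-in for the generic base: holds no dataset-specific hyperparameters."""

    def __init__(self, batch_size: int = 32):
        self.batch_size = batch_size


class DynamicDataset(XYBaseDataModule):
    """Hypothetical shared base that owns the dynamic splitting logic."""

    def __init__(self, train_ratio: float = 0.8, val_ratio: float = 0.1,
                 seed: int = 42, **kwargs):
        super().__init__(**kwargs)
        if not 0 < train_ratio < 1 or not 0 <= val_ratio < 1:
            raise ValueError("ratios must lie in [0, 1)")
        if train_ratio + val_ratio > 1:
            raise ValueError("train_ratio + val_ratio must not exceed 1")
        self.train_ratio = train_ratio
        self.val_ratio = val_ratio
        self.seed = seed

    @abstractmethod
    def load_ids(self) -> Sequence[str]:
        """Return the identifiers of all samples (dataset-specific)."""

    def dynamic_split(self) -> Dict[str, List[str]]:
        # Assumption: a seeded random split stands in for the real logic.
        ids = sorted(self.load_ids())
        random.Random(self.seed).shuffle(ids)
        n_train = int(len(ids) * self.train_ratio)
        n_val = int(len(ids) * self.val_ratio)
        return {
            "train": ids[:n_train],
            "validation": ids[n_train:n_train + n_val],
            "test": ids[n_train + n_val:],
        }


class ChEBIDataset(DynamicDataset):
    """ChEBI-specific class: chebi_version lives here, not in the generic base."""

    def __init__(self, chebi_version: int = 200, **kwargs):
        super().__init__(**kwargs)
        self.chebi_version = chebi_version

    def load_ids(self) -> Sequence[str]:
        # Placeholder data; a real implementation would read the ChEBI release.
        return [f"CHEBI:{i}" for i in range(10)]


class ProteinDataset(DynamicDataset):
    """Protein-side class: reuses the split without any ChEBI parameters."""

    def load_ids(self) -> Sequence[str]:
        # Placeholder data for illustration only.
        return [f"P{i:05d}" for i in range(10)]
```

With a layout like this, a new dataset would only need to implement load_ids (or its real equivalent) to get the splitting behaviour, and ChEBI-only settings would no longer appear in the signature of the generic base.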
