
Detect Junk Datasets #29

@PGijsbers

Description

Heavily related to #16. For this item, we want a bot that can detect which datasets should be inspected for removal. A dataset should be slated for removal if it is clear that it was never intended to be shared publicly for use in ML experiments.

There are quite a few user-uploaded datasets on OpenML that should not be on the production server. These include datasets uploaded only to test the upload functionality, datasets superseded by newer versions after mistakes in the initial upload, and so on.


Besides a bad title and description, other indicators include having no tasks, or only tasks without runs; even a good title and description can be a signal when they duplicate those of an existing dataset. It may not always be obvious, and it is fine if the bot misses some of the poor-quality data. It is important that the bot has relatively high precision, as each flagged dataset will require a human to assess whether deactivation/deletion is warranted.

This is also true for studies.
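As a rough illustration, here is a minimal Python sketch of such indicator checks, assuming the bot has already collected per-dataset metadata (for example via the openml-python listing API). The `DatasetInfo` fields, keyword list, and thresholds are illustrative assumptions, not an existing OpenML interface.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetInfo:
    """Per-dataset metadata the bot is assumed to have collected (hypothetical)."""
    name: str
    description: str
    n_tasks: int                      # tasks defined on the dataset
    n_runs: int                       # runs across those tasks
    duplicate_of: List[int] = field(default_factory=list)  # ids with identical title/description


def junk_indicators(d: DatasetInfo) -> List[str]:
    """Return human-readable reasons why a dataset may be junk.

    Each reason doubles as a line in the per-dataset report, so the reviewer
    sees exactly what triggered the flag.
    """
    reasons = []
    if len(d.name.strip()) < 4 or re.search(r"\b(test|delete|tmp|asdf)\b", d.name, re.I):
        reasons.append(f"suspicious title: {d.name!r}")
    if len(d.description.strip()) < 20:
        reasons.append("description is missing or very short")
    if d.n_tasks == 0:
        reasons.append("no tasks are defined on the dataset")
    elif d.n_runs == 0:
        reasons.append("tasks exist but none of them have runs")
    if d.duplicate_of:
        reasons.append(f"title/description duplicate existing dataset(s) {d.duplicate_of}")
    return reasons


def flag(d: DatasetInfo, min_indicators: int = 2) -> bool:
    """Flag only datasets with multiple independent indicators."""
    return len(junk_indicators(d)) >= min_indicators
```

Requiring at least two independent indicators before flagging is one simple way to bias the bot towards precision over recall, per the requirement above; the exact threshold is a tuning choice.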

Besides flagging the dataset, the bot should be able to generate a small report explaining why the dataset may be considered for removal.
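Continuing the sketch above, the report could simply reuse the indicator strings, so the explanation shown to the reviewer matches what triggered the flag:

```python
def report(dataset_id: int, d: DatasetInfo) -> str:
    """Format a short per-dataset report to attach when flagging (illustrative)."""
    lines = [f"Dataset {dataset_id} ({d.name!r}) flagged for review:"]
    lines += [f"  - {reason}" for reason in junk_indicators(d)]
    return "\n".join(lines)
```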
