
Detect Junk Datasets #29

@PGijsbers

Description

Heavily related to #16. For this item, we want a bot that can detect which datasets should be inspected for removal. A dataset should be slated for removal if it is clear that it was never intended to be shared publicly for use in ML experiments.

There are quite a few user-uploaded datasets on OpenML that should not be on the production server. These include datasets uploaded only to test the upload functionality, datasets superseded by newer versions after mistakes in the initial upload, and so on.


Besides a bad title and description, other indicators include having no tasks, or only tasks without runs; even a good title and description can be a signal when they duplicate those of an existing dataset. It may not always be obvious, and it is fine if the bot misses some of the poor-quality data. It is important that the bot has relatively high precision, as each flagged dataset will require a human to assess whether deactivation/deletion is warranted.

This is also true for studies.
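As a rough illustration, here is a minimal Python sketch of such indicator checks, assuming the bot has already collected per-dataset metadata (for example via the openml-python listing API). The `DatasetInfo` fields, keyword list, and thresholds are illustrative assumptions, not an existing OpenML interface.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetInfo:
    """Per-dataset metadata the bot is assumed to have collected (hypothetical)."""
    name: str
    description: str
    n_tasks: int                      # tasks defined on the dataset
    n_runs: int                       # runs across those tasks
    duplicate_of: List[int] = field(default_factory=list)  # ids with identical title/description


def junk_indicators(d: DatasetInfo) -> List[str]:
    """Return human-readable reasons why a dataset may be junk.

    Each reason doubles as a line in the per-dataset report, so the reviewer
    sees exactly what triggered the flag.
    """
    reasons = []
    if len(d.name.strip()) < 4 or re.search(r"\b(test|delete|tmp|asdf)\b", d.name, re.I):
        reasons.append(f"suspicious title: {d.name!r}")
    if len(d.description.strip()) < 20:
        reasons.append("description is missing or very short")
    if d.n_tasks == 0:
        reasons.append("no tasks are defined on the dataset")
    elif d.n_runs == 0:
        reasons.append("tasks exist but none of them have runs")
    if d.duplicate_of:
        reasons.append(f"title/description duplicate existing dataset(s) {d.duplicate_of}")
    return reasons


def flag(d: DatasetInfo, min_indicators: int = 2) -> bool:
    """Flag only datasets with multiple independent indicators."""
    return len(junk_indicators(d)) >= min_indicators
```

Requiring at least two independent indicators before flagging is one simple way to bias the bot towards precision over recall, per the requirement above; the exact threshold is a tuning choice.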

Besides flagging the dataset, the bot should be able to generate a small report explaining why the dataset may be considered for removal.
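Continuing the sketch above, the report could simply reuse the indicator strings, so the explanation shown to the reviewer matches what triggered the flag:

```python
def report(dataset_id: int, d: DatasetInfo) -> str:
    """Format a short per-dataset report to attach when flagging (illustrative)."""
    lines = [f"Dataset {dataset_id} ({d.name!r}) flagged for review:"]
    lines += [f"  - {reason}" for reason in junk_indicators(d)]
    return "\n".join(lines)
```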
