The first online catalogue for Indonesian NLP datasets. This catalogue contains 200 datasets, each annotated with more than 25 metadata fields. You can view the list of all datasets on the website: https://indonlp.github.io
Title: IndoNLP: Metadata Sourcing for Indonesian and Indonesia Local Languages Data Resources
Authors:
Abstract:
No.: dataset number
Name: name of the dataset
Subsets: subsets of the dataset
Link: direct link to the dataset or instructions on how to download it
License: license of the dataset
Year: year the dataset/paper was published
Language: ar or multilingual
Dialect: region, e.g. ar-LEV (Arabic (Levant)); country, e.g. ar-EGY (Arabic (Egypt)); or type, e.g. ar-MSA (Arabic (Modern Standard Arabic))
Domain: social media, news articles, reviews, commentary, books, transcribed audio or other
Form: text, audio or sign language
Collection style: crawling, crawling and annotation (translation), crawling and annotation (other), machine translation, human translation, human curation or other
Description: short statement describing the dataset
Volume: the size of the dataset in numbers
Unit: unit of the volume; could be tokens, sentences, documents, MB, GB, TB, hours or other
Provider: company or university providing the dataset
Related Datasets: any datasets that are related to this dataset in terms of content
Paper Title: title of the paper
Paper Link: direct link to the paper PDF
Script: writing system, either Arab, Latn, Arab-Latn or other
Tokenized: whether the dataset is segmented using morphology: Yes or No
Host: the host website for the data, e.g. GitHub
Access: the data is either free, upon-request or with-fee
Cost: cost of the data if it is with-fee
Test split: whether the data contains a test split: Yes or No
Tasks: the tasks included in the dataset, separated by commas
Evaluation Set: the data is included in the evaluation suite by BigScience
Venue Title: the venue title, e.g. ACL
Citations: the number of citations
Venue Type: conference, workshop, journal or preprint
Venue Name: full name of the venue, e.g. Association for Computational Linguistics
authors: list of the paper authors, separated by commas
affiliations: list of the paper authors' affiliations, separated by commas
abstract: abstract of the paper
Added by: name of the person who added the entry
Notes: any extra notes on the dataset
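As a minimal sketch of how these fields might be used, the snippet below selects entries by access type and task. It assumes each entry is a dictionary keyed by the field names listed above (here `Name`, `Access` and `Tasks`); the helper names `split_field` and `find_entries` and the two sample entries are hypothetical placeholders, not real catalogue rows.

```python
from typing import Dict, Iterable, List

# Assumption: one catalogue entry is a flat mapping from field name to string value.
Entry = Dict[str, str]


def split_field(value: str) -> List[str]:
    """Split a comma-separated field (e.g. Tasks, authors) into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def find_entries(entries: Iterable[Entry], task: str, access: str = "free") -> List[Entry]:
    """Return the entries whose Access value matches and whose Tasks list contains `task`."""
    wanted_task = task.strip().lower()
    wanted_access = access.strip().lower()
    return [
        entry
        for entry in entries
        if entry.get("Access", "").strip().lower() == wanted_access
        and wanted_task in (t.lower() for t in split_field(entry.get("Tasks", "")))
    ]


# Hypothetical placeholder entries, for illustration only.
sample = [
    {"Name": "example-a", "Access": "free", "Tasks": "sentiment analysis, topic classification"},
    {"Name": "example-b", "Access": "upon-request", "Tasks": "machine translation"},
]

matches = find_entries(sample, task="sentiment analysis")
print([entry["Name"] for entry in matches])
```

Because the function only needs an iterable of dictionaries, it could also be applied to a loaded split of the catalogue, provided the column names there match the field names assumed here.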
You can access the annotated dataset using the datasets library:
from datasets import load_dataset
nusa_catalogue = load_dataset('')
nusa_catalogue['train'][0]
which gives the following output
The catalogue will be updated regularly. If you want to add a new dataset, use this form.
Will be added