The first online catalogue for Arabic NLP datasets.

ARBML

Last update: Dec 26, 2022

Overview

Masader

The first online catalogue for Arabic NLP datasets. This catalogue contains 200 datasets with more than 25 metadata annotations for each dataset. You can view the list of all datasets on the website: https://arbml.github.io/masader/

Title: Masader: Metadata Sourcing for Arabic Text and Speech Data Resources
Authors: Zaid Alyafeai, Maraim Masoud, Mustafa Ghaleb, Maged S. Al-shaibani
Paper: https://arxiv.org/abs/2110.06744

Abstract: The NLP pipeline has evolved dramatically in the last few years. The first step in the pipeline is to find suitable annotated datasets to evaluate the tasks we are trying to solve. Unfortunately, most of the published datasets lack metadata annotations that describe their attributes. Not to mention, the absence of a public catalogue that indexes all the publicly available datasets related to specific regions or languages. When we consider low-resource dialectical languages, for example, this issue becomes more prominent. In this paper we create Masader, the largest public catalogue for Arabic NLP datasets, which consists of 200 datasets annotated with 25 attributes. Furthermore, we develop a metadata annotation strategy that could be extended to other languages. We also make remarks and highlight some issues about the current status of Arabic NLP datasets and suggest recommendations to address them.

Metadata

No.: dataset number
Name: name of the dataset
Subsets: subsets of the dataset
Link: direct link to the dataset or instructions on how to download it
License: license of the dataset
Year: year the dataset/paper was published
Language: ar or multilingual
Dialect: region, such as ar-LEV: (Arabic (Levant)); country, such as ar-EGY: (Arabic (Egypt)); or type, such as ar-MSA: (Arabic (Modern Standard Arabic))
Domain: social media, news articles, reviews, commentary, books, transcribed audio or other
Form: text, audio or sign language
Collection style: crawling, crawling and annotation (translation), crawling and annotation (other), machine translation, human translation, human curation or other
Description: short statement describing the dataset
Volume: the size of the dataset as a number
Unit: unit of the volume; could be tokens, sentences, documents, MB, GB, TB, hours or other
Provider: company or university providing the dataset
Related Datasets: any datasets that are related to this dataset in terms of content
Paper Title: title of the paper
Paper Link: direct link to the paper PDF
Script: writing system; either Arab, Latn, Arab-Latn or other
Tokenized: whether the dataset is segmented using morphology: Yes or No
Host: the host website for the data, e.g. GitHub
Access: whether the data is free, upon-request or with-fee
Cost: cost of the data if it is with-fee
Test split: whether the data contains a test split: Yes or No
Tasks: the tasks included in the dataset, separated by commas
Evaluation Set: whether the data is included in the BigScience evaluation suite
Venue Title: the venue title, e.g. ACL
Citations: the number of citations
Venue Type: conference, workshop, journal or preprint
Venue Name: full name of the venue, e.g. Association for Computational Linguistics
authors: list of the paper authors, separated by commas
affiliations: list of the paper authors' affiliations, separated by commas
abstract: abstract of the paper
Added by: name of the person who added the entry
Notes: any extra notes on the dataset
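
Several of the fields above only accept a fixed set of values, so a new entry can be checked before it is added. The following is a minimal sketch of such a check, not part of the Masader codebase: the function name `validate_entry`, the `ALLOWED` table and the example row are hypothetical, and only the allowed values themselves are taken from the field descriptions above.

```python
# Allowed values copied from the field descriptions above.
# The table name and layout are assumptions made for this sketch.
ALLOWED = {
    "Language": {"ar", "multilingual"},
    "Form": {"text", "audio", "sign language"},
    "Access": {"free", "upon-request", "with-fee"},
    "Tokenized": {"Yes", "No"},
    "Test split": {"Yes", "No"},
    "Venue Type": {"conference", "workshop", "journal", "preprint"},
}


def validate_entry(entry: dict) -> list:
    """Return a list of human-readable problems found in one catalogue row."""
    problems = []
    for field, allowed in ALLOWED.items():
        value = entry.get(field)
        if value not in allowed:
            problems.append(f"{field}: {value!r} is not one of {sorted(allowed)}")
    # Cost is only meaningful when Access is with-fee.
    if entry.get("Access") == "with-fee" and not entry.get("Cost"):
        problems.append("Cost: required when Access is 'with-fee'")
    return problems


# Hypothetical row, for illustration only (not a real catalogue entry).
EXAMPLE_ENTRY = {
    "Name": "Example Corpus",
    "Language": "ar",
    "Form": "text",
    "Access": "free",
    "Cost": "",
    "Tokenized": "No",
    "Test split": "Yes",
    "Venue Type": "preprint",
}

if __name__ == "__main__":
    print(validate_entry(EXAMPLE_ENTRY))
```

Free-text fields such as Description or Notes are deliberately left unchecked in this sketch.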

Contribution

If you want to add a new dataset feel free to update the sheet. Please follow the instructions there for adding the entry.

Citation

@misc{alyafeai2021masader,
      title={Masader: Metadata Sourcing for Arabic Text and Speech Data Resources}, 
      author={Zaid Alyafeai and Maraim Masoud and Mustafa Ghaleb and Maged S. Al-shaibani},
      year={2021},
      eprint={2110.06744},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Comments

Front-end improvements
Problem

The website is very slow and hard to browse, and some of the code is hard to read. For example, card.js contains a lot of code that is not needed (I deleted some of it while reading and trying to understand the code).

Solution

I've redesigned the front end and made it easier to browse. Check it out.

Notes

I don't see the point of updating MaxRowLength every time someone enters a new data point in Google Sheets.

Performance is slow mainly because the Google Sheets request takes more than 3 s to load.

The site loads several external libraries, such as Axios and jQuery, and both slow it down. Because this is a client-side website (HTML, CSS, JS), built-in alternatives provide exactly the same functionality.

Thanks to @sudomaze for helping me with the review of the design.
opened by MutlaqAldhbuiub 13

Add `pre-commit`

Tools to include

[ ] black - code formatter
[ ] mypy - static type checker
[ ] pycln - remove unused imports
[ ] isort - sort imports

Skeleton to start from

# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.0.1
    hooks:
      - id: check-ast
      - id: check-builtin-literals
      - id: check-case-conflict
      - id: check-docstring-first
      - id: check-merge-conflict
      - id: check-json
      - id: check-toml
      - id: check-yaml
      - id: end-of-file-fixer
      - id: mixed-line-ending
      - id: trailing-whitespace
      - id: check-vcs-permalinks
      - id: check-shebang-scripts-are-executable
  - repo: https://github.com/pre-commit/pygrep-hooks
    rev: v1.9.0
    hooks:
      - id: python-check-mock-methods
      - id: python-no-log-warn
      - id: python-use-type-annotations
  - repo: https://github.com/hadialqattan/pycln
    rev: v1.2.5
    hooks:
      - id: pycln
        args: [--all]
  - repo: https://github.com/psf/black
    rev: 22.3.0
    hooks:
      - id: black
        additional_dependencies: ['click==8.0.4']
  - repo: https://github.com/PyCQA/isort
    rev: 5.10.1
    hooks:
      - id: isort
        args: ["--profile", "black"]

bug

opened by sudomaze 12

An API endpoint to fetch data and preprocess them
Per what @AliOsm has done in https://github.com/ARBML/masader-webservice, we will need to add the following to complete the required tasks for the paper submission:

[x] Get the entire datasets at once

[x] Get a single dataset by id

[x] Get explore page data in the right format

[x] Get graphs page data in the right format
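
The first two tasks above can be sketched as plain functions that an endpoint could call. This is only an illustration under stated assumptions, not the code in masader-webservice: the function names, the in-memory `DATASETS` rows and the use of the `No.` field as the id are all hypothetical.

```python
import json

# Hypothetical in-memory rows; a real service would load them from the catalogue.
DATASETS = [
    {"No.": 1, "Name": "Example A", "Dialect": "ar-MSA: (Arabic (Modern Standard Arabic))"},
    {"No.": 2, "Name": "Example B", "Dialect": "ar-EGY: (Arabic (Egypt))"},
]


def get_datasets(rows):
    """Return all datasets at once (as a copy, so callers cannot mutate the source list)."""
    return list(rows)


def get_dataset(rows, number):
    """Return the single dataset whose `No.` matches, or None if there is none."""
    for row in rows:
        if row["No."] == number:
            return row
    return None


def to_json(payload):
    """Serialize a response body; ensure_ascii=False keeps Arabic text readable."""
    return json.dumps(payload, ensure_ascii=False)


if __name__ == "__main__":
    print(to_json(get_dataset(DATASETS, 2)))
```

Doing the lookup and serialization on the server side is what lets the front end skip its own preprocessing step.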

enhancement
opened by sudomaze 8
feat(page): add filteration page
Added a filtering page.

You should consider the following

[ ] Make it possible to share a link with the filters applied, without reloading the page each time a filter is updated

[ ] Remove a bunch of boilerplate

[ ] Fix the long pagination buttons (should be one row)

[ ] Refactor the code to make it cleaner and shorter
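
For the first item in the list above, one common approach is to encode the active filters in the URL's query string so that the link itself carries the state. The sketch below shows only that encoding and decoding step in Python; the base URL and the filter names (`dialect`, `form`) are hypothetical, and avoiding the page reload would be handled in the browser, which is not shown here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical page address, used only for illustration.
BASE_URL = "https://example.org/filter"


def filters_to_url(filters: dict) -> str:
    """Encode the active filters as a query string appended to the page URL."""
    query = urlencode(sorted(filters.items()))
    return f"{BASE_URL}?{query}" if query else BASE_URL


def url_to_filters(url: str) -> dict:
    """Recover the filters from a shared link (first value per key)."""
    parsed = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in parsed.items()}


if __name__ == "__main__":
    link = filters_to_url({"form": "text", "dialect": "ar-EGY"})
    print(link)
    print(url_to_filters(link))
```

Sorting the filters before encoding means the same selection always produces the same link.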
opened by ghost 4
Reporting Functionality Protection

We need some way to prevent spam through the reporting functionality, either by using reCAPTCHA or by requiring users to authenticate with a GitHub account.
bug help wanted

opened by sudomaze 4
Improve the UI
Refactored the project structure and removed unused pages and scripts.

Redesigned the pages according to the org's brand.

Added an icon (favicon).

Issue: #6
opened by MutlaqAldhbuiub 4
Improve `Dialect` section of `Stats` page
Is your feature request related to a problem? Please describe.

The current pie chart on the stats page has no labels, which makes it hard to tell which slice belongs to which dialect. In addition, the pie chart shows all the dialects at once, making it hard to see how the dialects are split by region.

Describe the solution you'd like

Including labels for each piece of the pie chart.

Include other pie charts based on region (Gulf, Lev, North Africa, etc.).
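
The per-region charts requested above need the dialect counts grouped by region first. The following is a rough sketch of that grouping step only. The `REGION_OF` mapping and the short dialect codes in it are hypothetical placeholders; the catalogue's actual codes and regional grouping may differ.

```python
from collections import Counter

# Hypothetical mapping from dialect code to region, for illustration only.
REGION_OF = {
    "ar-SA": "Gulf",
    "ar-KW": "Gulf",
    "ar-LEV": "Levant",
    "ar-TN": "North Africa",
    "ar-MA": "North Africa",
}


def count_by_region(dialects):
    """Group dialect counts by region: one Counter (one pie chart) per region."""
    counts = {}
    for code in dialects:
        region = REGION_OF.get(code, "Other")
        counts.setdefault(region, Counter())[code] += 1
    return counts


if __name__ == "__main__":
    for region, per_dialect in count_by_region(["ar-SA", "ar-SA", "ar-TN", "ar-MSA"]).items():
        # Each key of per_dialect can be used as a slice label.
        print(region, dict(per_dialect))
```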

Relevant #93
enhancement
opened by sudomaze 3
Show datasets corresponding to a certain dialect

The dialects graph in https://arbml.github.io/masader/charting.html only shows the number of resources. It will be helpful if we can show the list of datasets that correspond to a chosen country.
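
One way to support this is to build an index from dialect to dataset names once, then look up the chosen dialect when the user clicks on the chart. The sketch below assumes each row has `Name` and `Dialect` fields as in the metadata list; the function name and the example rows are hypothetical.

```python
from collections import defaultdict


def index_by_dialect(rows):
    """Map each dialect value to the names of the datasets annotated with it."""
    index = defaultdict(list)
    for row in rows:
        index[row["Dialect"]].append(row["Name"])
    return dict(index)


# Hypothetical rows, for illustration only.
ROWS = [
    {"Name": "Example A", "Dialect": "ar-EGY: (Arabic (Egypt))"},
    {"Name": "Example B", "Dialect": "ar-MSA: (Arabic (Modern Standard Arabic))"},
    {"Name": "Example C", "Dialect": "ar-EGY: (Arabic (Egypt))"},
]

if __name__ == "__main__":
    print(index_by_dialect(ROWS)["ar-EGY: (Arabic (Egypt))"])
```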
enhancement

opened by zaidalyafeai 3
Moving front-end to `jekyll`
For this issue, we will refactor the front-end code under this branch https://github.com/ARBML/masader/tree/dev/move_to_jekyll, so use it as a base for any of the tasks below:

Core changes:

[x] Update js scripts to expect data from the API endpoint

[x] index.js

[x] card.js

[x] charts.js

[x] graphs.js

[ ] Ensure that the website is responsive on mobile screens

Additions

[ ] Add a simple filtering functionality for #17

enhancement
opened by sudomaze 3
[Plan] Updating Masader for Demo paper publication
Per ARBML's Discord group meeting, we agreed on the following tasks to improve Masader for the Demo paper publication.

Changes

Move the front-end to Jekyll to clean up the code and have it easy to contribute to

Build an API endpoint to fix the slow page loads caused by the front end fetching data from the database and preprocessing it (#22)

Add additional functionalities: better filtering, report data (#25)

Updating data entry (#24)

Todos

[x] #19

[x] #29

[x] #24

[x] #45

Pushed for later

Issue with DDoS for #25

~#18~

~#20~

Refactor the entire codebase to be under one framework and web app (Flask/FastAPI)
opened by sudomaze 3
White spaces in search query leads to no results
Describe the bug

Adding white space to the beginning or end of the search query (dataset name) prevents the search engine from matching it to datasets that already exist in the catalogue.

To Reproduce

Go to : https://arbml.github.io/masader/search

Type name of dataset with a white space (or multiple) at the beginning or end (example: https://arbml.github.io/masader/search?name=artest+)

No results returned

Expected behavior

The search engine should tolerate this kind of typo, since leading or trailing white space is most probably not part of any dataset name. It should return the dataset even if the user added a space to its name in the search query.
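
The expected behavior can be sketched as a normalization step applied to the query before matching. This is a minimal illustration, not the site's actual search code: the function names, the list of dataset names and the choice of exact, case-insensitive matching are assumptions. Note that a `+` in a query string decodes to a space, which is how the example URL above ends up with trailing white space.

```python
from urllib.parse import parse_qs, urlsplit


def query_name(url: str) -> str:
    """Extract the `name` query parameter and strip surrounding white space."""
    params = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return params.get("name", [""])[0].strip()


def search(names, url):
    """Return the dataset names that match the normalized query (assumed exact match)."""
    wanted = query_name(url).casefold()
    return [name for name in names if name.casefold() == wanted]


if __name__ == "__main__":
    url = "https://arbml.github.io/masader/search?name=artest+"
    print(repr(query_name(url)))
    # Hypothetical catalogue names, for illustration only.
    print(search(["ArTest", "Other"], url))
```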

Desktop:

OS: Windows 8

Browser: Chrome

bug
opened by MaramHasanain 1
Landing page's header

Rather than simple text for the landing page's header, use a more distinctive styling (not black and white), on mobile as well.

cc: @unifoliatus
enhancement good first issue

opened by sudomaze 1
