The first online catalogue for Arabic NLP datasets.

Overview

Masader

Masader is the first online catalogue for Arabic NLP datasets. It contains 200 datasets with more than 25 metadata annotations each. You can browse the full list on the website: https://arbml.github.io/masader/

Title: Masader: Metadata Sourcing for Arabic Text and Speech Data Resources
Authors: Zaid Alyafeai, Maraim Masoud, Mustafa Ghaleb, Maged S. Al-shaibani
Paper: https://arxiv.org/abs/2110.06744

Abstract: The NLP pipeline has evolved dramatically in the last few years. The first step in the pipeline is to find suitable annotated datasets to evaluate the tasks we are trying to solve. Unfortunately, most published datasets lack the metadata annotations that describe their attributes, and there is no public catalogue that indexes all the publicly available datasets related to specific regions or languages. This issue becomes more prominent for low-resource dialectal languages. In this paper we create *Masader*, the largest public catalogue for Arabic NLP datasets, which consists of 200 datasets annotated with 25 attributes. Furthermore, we develop a metadata annotation strategy that could be extended to other languages. We also highlight some issues with the current status of Arabic NLP datasets and suggest recommendations to address them.

Metadata

  • No. dataset number
  • Name name of the dataset
  • Subsets subsets of the dataset
  • Link direct link to the dataset or instructions on how to download it
  • License license of the dataset
  • Year year the dataset/paper was published
  • Language ar or multilingual
  • Dialect region, e.g. ar-LEV (Arabic (Levant)); country, e.g. ar-EGY (Arabic (Egypt)); or type, e.g. ar-MSA (Arabic (Modern Standard Arabic))
  • Domain social media, news articles, reviews, commentary, books, transcribed audio or other
  • Form text, audio or sign language
  • Collection style crawling, crawling and annotation (translation), crawling and annotation (other), machine translation, human translation, human curation or other
  • Description short statement describing the dataset
  • Volume the size of the dataset as a number
  • Unit unit of the volume: tokens, sentences, documents, MB, GB, TB, hours or other
  • Provider company or university providing the dataset
  • Related Datasets any datasets that are related in content to this dataset
  • Paper Title title of the paper
  • Paper Link direct link to the paper PDF
  • Script writing system: Arab, Latn, Arab-Latn or other
  • Tokenized whether the dataset is segmented using morphology: Yes or No
  • Host the host website for the data, e.g. GitHub
  • Access whether the data is free, upon-request or with-fee
  • Cost cost of the data if access is with-fee
  • Test split whether the data contains a test split: Yes or No
  • Tasks the tasks included in the dataset, separated by commas
  • Evaluation Set whether the data is included in the evaluation suite by BigScience
  • Venue Title the venue title, e.g. ACL
  • Citations the number of citations
  • Venue Type conference, workshop, journal or preprint
  • Venue Name full name of the venue, e.g. Association for Computational Linguistics
  • Authors list of the paper's authors, separated by commas
  • Affiliations list of the authors' affiliations, separated by commas
  • Abstract abstract of the paper
  • Added by name of the person who added the entry
  • Notes any extra notes on the dataset
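
To make the schema concrete, here is a hedged sketch of one catalogue entry as a JavaScript object; every value is invented for illustration, and the exact keys used in the sheet may differ:

// A hypothetical catalogue entry following the metadata fields above.
// All values here are invented for illustration only.
const exampleEntry = {
  No: 1,
  Name: 'Example Arabic Corpus',
  Subsets: [],
  Link: 'https://example.com/dataset',
  License: 'CC BY 4.0',
  Year: 2021,
  Language: 'ar',
  Dialect: 'ar-MSA',
  Domain: 'news articles',
  Form: 'text',
  'Collection Style': 'crawling and annotation (other)',
  Description: 'A short statement describing the dataset.',
  Volume: 1000000,
  Unit: 'tokens',
  Provider: 'Example University',
  'Test Split': 'Yes',
  Tasks: 'text classification, named entity recognition',
  'Added By': 'Jane Doe',
};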

Contribution

If you want to add a new dataset, feel free to update the sheet. Please follow the instructions there when adding an entry.

Citation

@misc{alyafeai2021masader,
      title={Masader: Metadata Sourcing for Arabic Text and Speech Data Resources}, 
      author={Zaid Alyafeai and Maraim Masoud and Mustafa Ghaleb and Maged S. Al-shaibani},
      year={2021},
      eprint={2110.06744},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Comments
  • Front-end improvements

    Problem

    The website's performance is slow, it is hard to browse, and some of the code is unreadable. For example, card.js contains a lot of unneeded code (I deleted some of it while reading and understanding the code).

    Solution

    I've redesigned the front end and made it easier to browse; check it out.

    Notes

    • I don't know what the point is of updating MaxRowLength every time someone enters a new data point in Google Sheets.
    • Performance is slow because the Google Sheets request takes more than 3 seconds to load.
    • There are many external libraries, such as Axios and jQuery, that slow the website down; since this is a client-side website (HTML, CSS, JS), the same functionality is available built-in (see the sketch below).
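
    As a concrete version of the last note, a minimal sketch of replacing Axios with the built-in fetch API; the URL below is a placeholder, not the site's actual Google Sheets endpoint:

    // Load the catalogue data with the built-in fetch API instead of Axios.
    // SHEET_URL is a placeholder; substitute the real Google Sheets endpoint.
    const SHEET_URL = 'https://example.com/sheet.json';

    async function loadDatasets() {
      const response = await fetch(SHEET_URL);
      if (!response.ok) throw new Error(`Request failed: ${response.status}`);
      return response.json();
    }

    loadDatasets().then((datasets) => console.log(datasets.length));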

    Thanks to @sudomaze for helping me with the review of the design.

    opened by MutlaqAldhbuiub 13
  • Add `pre-commit`

    Tools to include

    • [ ] black - code formatter
    • [ ] mypy - static type checker
    • [ ] pycln - remove unused imports
    • [ ] isort - sort imports

    Skeleton to start from

    # See https://pre-commit.com for more information
    # See https://pre-commit.com/hooks.html for more hooks
    repos:
      - repo: https://github.com/pre-commit/pre-commit-hooks
        rev: v4.0.1
        hooks:
          - id: check-ast
          - id: check-builtin-literals
          - id: check-case-conflict
          - id: check-docstring-first
          - id: check-merge-conflict
          - id: check-json
          - id: check-toml
          - id: check-yaml
          - id: end-of-file-fixer
          - id: mixed-line-ending
          - id: trailing-whitespace
          - id: check-vcs-permalinks
          - id: check-shebang-scripts-are-executable
      - repo: https://github.com/pre-commit/pygrep-hooks
        rev: v1.9.0
        hooks:
          - id: python-check-mock-methods
          - id: python-no-log-warn
          - id: python-use-type-annotations
      - repo: https://github.com/hadialqattan/pycln
        rev: v1.2.5
        hooks:
          - id: pycln
            args: [--all]
      - repo: https://github.com/psf/black
        rev: 22.3.0
        hooks:
          - id: black
            additional_dependencies: ['click==8.0.4']
      - repo: https://github.com/PyCQA/isort
        rev: 5.10.1
        hooks:
          - id: isort
            args: ["--profile", "black"]
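
    To activate these hooks in a contributor's local clone (standard pre-commit usage, not project-specific instructions):

    pip install pre-commit
    pre-commit install          # run the hooks on every git commit
    pre-commit run --all-files  # one-off run against the whole repo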
    
    bug 
    opened by sudomaze 12
  • An API endpoint to fetch data and preprocess them

    Following what @AliOsm has done in https://github.com/ARBML/masader-webservice, we will need to add the following to complete the required tasks for the paper submission (a client sketch follows the list):

    • [x] Get the entire datasets at once
    • [x] Get a single dataset by id
    • [x] Get explore page data in the right format
    • [x] Get graphs page data in the right format
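
    A hedged client sketch for the first two endpoints; the base URL and route names below are assumptions for illustration, not the webservice's documented API:

    // Hypothetical client for the masader-webservice API.
    // BASE_URL and the /datasets routes are assumptions, not documented routes.
    const BASE_URL = 'https://masader-webservice.example.com';

    async function getAllDatasets() {
      const res = await fetch(`${BASE_URL}/datasets`);
      return res.json(); // expected: an array of metadata records
    }

    async function getDataset(id) {
      const res = await fetch(`${BASE_URL}/datasets/${id}`);
      return res.json(); // expected: a single metadata record
    }
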
    enhancement 
    opened by sudomaze 8
  • feat(page): add filtering page

    Added a filtering page.

    You should consider the following:

    • [ ] Build the ability to share the link with the filters applied, without reloading the page each time a filter is updated
    • [ ] Remove a bunch of boilerplate
    • [ ] Fix the long pagination buttons (should be one row)
    • [ ] Refactor the code to make it better and shorter (to achieve happiness)
    opened by ghost 4
  • Reporting Functionality Protection

    We will need some way to prevent spam in the reporting functionality, either by using reCAPTCHA or by authenticating the user with a GitHub account.

    bug help wanted 
    opened by sudomaze 4
  • Improve the UI

    • Refactored the project structure and removed unused pages and scripts.
    • Redesigned the pages according to the org brand.
    • Added favicons.

    Issue: #6

    opened by MutlaqAldhbuiub 4
  • Improve `Dialect` section of `Stats` page

    Is your feature request related to a problem? Please describe. The current pie chart on the Stats page doesn't have labels, which makes it hard to track which slice belongs to which dialect. In addition, the pie chart shows all the dialects at once, making it hard to understand the separation of dialects by region.

    Describe the solution you'd like

    • Include labels for each slice of the pie chart.
    • Include additional pie charts per region (Gulf, Levant, North Africa, etc.); see the sketch below.

    Relevant #93
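
    A minimal sketch of a labelled pie chart, assuming the Stats page uses Chart.js (an assumption based on the charts.js script name); the dialect counts are invented:

    // Labelled pie chart for dialects; the counts are placeholder values.
    new Chart(document.getElementById('dialect-chart'), {
      type: 'pie',
      data: {
        labels: ['ar-MSA', 'ar-EGY', 'ar-LEV', 'ar-GLF'],
        datasets: [{ data: [90, 35, 20, 15] }],
      },
    });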

    enhancement 
    opened by sudomaze 3
  • Show datasets corresponding to a certain dialect

    The dialects graph in https://arbml.github.io/masader/charting.html only shows the number of resources. It would be helpful to also show the list of datasets that correspond to a chosen country; a filtering sketch follows below.
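
    A hedged sketch of the lookup this would need, assuming each record carries the Dialect field from the metadata schema above (the exact stored format may differ):

    // Return the datasets whose Dialect value matches the chosen country/region.
    function datasetsForDialect(datasets, dialect) {
      return datasets.filter((d) => d.Dialect === dialect);
    }

    // e.g. datasetsForDialect(allDatasets, 'ar-EGY')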

    enhancement 
    opened by zaidalyafeai 3
  • Moving front-end to `jekyll`

    For this issue, we will refactor the front-end code under this branch https://github.com/ARBML/masader/tree/dev/move_to_jekyll, so use it as a base for any of the tasks below:

    Core changes:

    • [x] Update js scripts to expect data from the API endpoint
      • [x] index.js
      • [x] card.js
      • [x] charts.js
      • [x] graphs.js
    • [ ] Ensure that the website is responsive on mobile screens

    Additions

    • [ ] Add a simple filtering functionality for #17
    enhancement 
    opened by sudomaze 3
  • [Plan] Updating Masader for Demo paper publication

    Per ARBML's Discord group meeting, we have agreed on the following tasks to improve Masader for the demo paper publication.

    Changes

    • Move the front-end to Jekyll to clean up the code and make it easy to contribute to
    • Build an API endpoint to fix the page-load slowdown caused by fetching data from the database in the front-end and preprocessing it there (#22)
    • Add additional functionality: better filtering, data reporting (#25)
    • Update the data entry (#24)

    Todos

    • [x] #19
    • [x] #29
    • [x] #24
    • [x] #45

    Pushed for later

    • Issue with DDoS for #25
    • ~#18~
    • ~#20~
    • Refactor the entire codebase to be under one framework and web app (Flask/FastAPI)
    opened by sudomaze 3
  • White spaces in search query leads to no results

    Describe the bug: Adding whitespace at the beginning or end of the search query (dataset name) prevents the search engine from matching it to datasets that already exist in the catalogue.

    To Reproduce

    1. Go to https://arbml.github.io/masader/search
    2. Type a dataset name with one or more whitespace characters at the beginning or end (example: https://arbml.github.io/masader/search?name=artest+)
    3. No results are returned

    Expected behavior: The search engine should handle this kind of typo, since whitespace is most probably not part of any dataset name; it should return the dataset even if the user added a space to its name in the search query (a trimming sketch follows below).
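
    A minimal fix sketch: normalize the query before matching. The function name is hypothetical, not the site's actual search code:

    // Trim outer whitespace and collapse inner runs before matching.
    function normalizeQuery(q) {
      return q.trim().replace(/\s+/g, ' ').toLowerCase();
    }

    // normalizeQuery('  artest ') === 'artest'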

    Desktop (please complete the following information):

    • OS: Windows 8
    • Browser: Chrome
    bug 
    opened by MaramHasanain 1
  • Landing page's header

    Rather than having simple text like this [image], have a more unique styling of the landing page's header like this (not black and white) [image], and on mobile (not black and white) [image].

    cc: @unifoliatus

    enhancement good first issue 
    opened by sudomaze 1