darija <-> english dictionary

DODa

Last update: Jan 1, 2023

Related tags

Deep Learning dataset

Overview

darija-dictionary

Advanced IT solutions that are well adapted to the Moroccan context inevitably require an understanding of the Moroccan dialect. Hence, darija (the Moroccan dialect) should be an active player in the field of Natural Language Processing (NLP).

However, step 0 of any serious engagement with darija in NLP is translating its vocabulary into the most widely used and best-documented language in this field, namely English.

This open-source project aims to be a reference in addressing this issue. We hope for contributions from the Moroccan IT community to build the largest dataset of darija-english vocabulary, which will serve as a foundation for any future application of NLP that benefits Moroccan people.

How to contribute

We've made a tutorial for you on DODa's website.

Guidelines / Recommendations

Mind how you write ح xD (shout-out to this guy 😆). In general, try to use:

darija	3	7	9	8	2 - 'a' - 'i'	5 - 'kh'
arabic	ع	ح	ق	ه	همزة	خ

Try to use capitalization to differentiate between the following letters:

t	T	s	S	d	D
ت	ط	س	ص	د	ض

Arabic characters with two-letter Latin equivalents:

Arabic alphabet	ش	غ	خ
Latin alphabet	ch	gh	kh

Double a character to mark emphasis (the shadda, "الشدة"):

darija	7mam	7mmam
english	pigeons	bathroom
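The conventions in the tables above can be collected into a small lookup. The sketch below is illustrative only: `DARIJA_TO_ARABIC` and `to_arabic_letters` are hypothetical names that are not part of DODa, the character ء is assumed for the hamza row, and only the symbols listed above are covered. Vowels and every other character pass through unchanged.

```python
# Hypothetical lookup built only from the tables above; not part of DODa.
DARIJA_TO_ARABIC = {
    # two-letter Latin equivalents
    "ch": "ش", "gh": "غ", "kh": "خ",
    # digits (assumption: "2" is rendered as the hamza character ء)
    "3": "ع", "7": "ح", "9": "ق", "8": "ه", "2": "ء", "5": "خ",
    # capitalization distinguishes plain from emphatic letters
    "t": "ت", "T": "ط", "s": "س", "S": "ص", "d": "د", "D": "ض",
}


def to_arabic_letters(word: str) -> str:
    """Replace the listed symbols with Arabic letters, leaving the rest as-is."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in DARIJA_TO_ARABIC:
            # Try the two-letter equivalents (ch, gh, kh) first.
            out.append(DARIJA_TO_ARABIC[pair])
            i += 2
        elif word[i] in DARIJA_TO_ARABIC:
            out.append(DARIJA_TO_ARABIC[word[i]])
            i += 1
        else:
            # Unlisted characters (vowels, other consonants) are kept.
            out.append(word[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for symbol in ["3", "7", "9", "kh", "t", "T"]:
        print(symbol, to_arabic_letters(symbol))
```

Matching two-letter sequences before single characters keeps "kh" from being read as "k" followed by "h". This is not a full transliterator, only a compact restatement of the tables.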

We usually don't add an "e" at the end of darija words: louz instead of louze.
We usually don't use "Z" or "th" for ظ ، ذ ، ث, because these letters are generally not used in darija (except in northern Morocco, but for the sake of simplicity we focus primarily on standard darija).
We do NOT use apostrophes: since we are working with CSV files, apostrophes will break words apart.
We use spaces as word delimiters, not _ or -: thank you instead of thank_you.
Respect the number of columns in every row you add; use empty quotation marks "" when you have no extra variations.
In every row, always start with the most used form (in your opinion, of course) of the word in question.
To allow future use of this dataset for training deep neural networks, try to reserve each row for similar variations of the same word. For instance, "sou9" and "marchi" both translate to "market", yet it is better to separate them into two rows:

"sou9","souk","souq","market"

"marchi","","","market"
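Several of the guidelines above (column count, no apostrophes, spaces instead of _ or -) can be checked mechanically before submitting a row. The following is a minimal sketch under stated assumptions: `check_row` is a hypothetical helper, not a DODa tool, and the third sample row is deliberately malformed for illustration.

```python
import csv
import io


def check_row(row, expected_columns):
    """Return a list of guideline problems found in one parsed CSV row."""
    problems = []
    if len(row) != expected_columns:
        problems.append(f"expected {expected_columns} columns, got {len(row)}")
    for cell in row:
        if "'" in cell:
            problems.append(f"apostrophe in {cell!r}")
        if "_" in cell or "-" in cell:
            problems.append(f"use spaces, not _ or -, in {cell!r}")
    return problems


# The first two rows come from the guidelines; the third is intentionally wrong.
sample = (
    '"sou9","souk","souq","market"\n'
    '"marchi","","","market"\n'
    '"thank_you","",""\n'
)

if __name__ == "__main__":
    for line_no, row in enumerate(csv.reader(io.StringIO(sample)), start=1):
        for problem in check_row(row, expected_columns=4):
            print(f"row {line_no}: {problem}")
```

The expected column count is passed in explicitly because it differs from file to file.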

verbs.csv: The darija translation is reserved for the past tense with the third-person pronoun "he", whereas the other pronouns and tenses are handled in separate files. The English translation presents the basic form (or root) of the English verb.

"ghnna","ghenna","ghanna","","","","sing"

masculine_feminine_plural.csv: The feminine-plural translation column, when a feminine plural exists, is meant for nouns. For adjectives, feminine-plural = feminine.

Citation

@misc{outchakoucht2021moroccan,
      title={Moroccan Dialect -Darija- Open Dataset},
      author={Aissam Outchakoucht and Hamza Es-Samaali},
      year={2021},
      eprint={2103.09687},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Comments

sentences2.csv is not working

I want to load this dataset into a dataframe and I get this error:

ParserError Traceback (most recent call last)
      1 import pandas as pd
      2
----> 3 print(pd.read_csv("dataset/sentences2.csv"))

~\anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    674     )
    675
--> 676     return _read(filepath_or_buffer, kwds)
    677
    678 parser_f.__name__ = name

~\anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    452
    453     try:
--> 454         data = parser.read(nrows)
    455     finally:
    456         parser.close()

~\anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   1131     def read(self, nrows=None):
   1132         nrows = _validate_integer("nrows", nrows)
-> 1133         ret = self._engine.read(nrows)
   1134
   1135         # May alter columns / col_dict

~\anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
   2035     def read(self, nrows=None):
   2036         try:
-> 2037             data = self._reader.read(nrows)
   2038         except StopIteration:
   2039             if self._first_chunk:

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 2 fields in line 1865, saw 4

And when I set error_bad_lines=False, this is what I get.
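One way to track down the row behind a tokenizing error like the one above is to count the fields on each line with Python's standard csv module. This is only a sketch: `find_bad_rows` is a hypothetical helper, and it assumes that the first row has the expected number of fields unless a count is passed in.

```python
import csv
import io


def find_bad_rows(lines, expected=None):
    """Return (line_number, field_count) for rows with an unexpected field count.

    `lines` is any iterable of text lines, for example an open file.
    """
    bad = []
    reader = csv.reader(lines)
    for row in reader:
        if not row:
            # Skip blank lines.
            continue
        if expected is None:
            # Assumption: the first non-blank row defines the expected width.
            expected = len(row)
        if len(row) != expected:
            bad.append((reader.line_num, len(row)))
    return bad


if __name__ == "__main__":
    # Small in-memory stand-in for a file with one malformed line.
    demo = io.StringIO("a,b\nc,d\ne,f,g,h\n")
    print(find_bad_rows(demo))

    # With a real file (path taken from the report above):
    # with open("dataset/sentences2.csv", newline="", encoding="utf-8") as f:
    #     print(find_bad_rows(f))
```

Once the offending lines are known they can be fixed in the CSV itself. Note also that in pandas 1.3 and later, `on_bad_lines="skip"` replaces the older `error_bad_lines=False` option of `read_csv`.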

opened by Strifee 3
Hosting the dataset on DataStack?
Hello!

I enjoy seeing community data projects like this. As a former student of Cantonese (a dialect of Chinese spoken in Hong Kong), I know what it's like when the language you're interested in has no dictionary. At one point, I tried to create one myself, but gave up after 500 words. It's too much effort to do it alone.

The issue today is that working together on these things is very technical. I'm sure your contributors here are familiar with GitHub and have no problem helping out, but it would be great if anyone with just a knowledge of English and Darija could help out, without technical skills.

For this reason I am building DataStack. It's a collaboration platform for table data that works similarly to GitHub but is much easier to use for data. No tools or technical knowledge needed.

To show you what it looks like, I've uploaded two files from your dataset there:

https://datastack.net/boukeversteegh/darija-open-dataset-demo/

If you're interested, please have a look. Does this solution suit your needs? What is still missing in your opinion?
opened by boukeversteegh 2
How to know if a word or a sentence is already added

I want to start contributing to DODa and have been exploring the repo. I am wondering how you identify whether a sentence or a word has already been added to the dataset.
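As a rough illustration of what such a check could look like, the sketch below scans CSV rows for an exact match in any column. `find_entries` is a hypothetical helper and not an existing DODa tool. The comparison is case-sensitive on purpose, since the guidelines use capitalization to distinguish letters such as t and T.

```python
import csv
import io


def find_entries(lines, text):
    """Return (line_number, row) pairs where some cell equals `text` exactly."""
    target = text.strip()
    matches = []
    reader = csv.reader(lines)
    for row in reader:
        if any(cell.strip() == target for cell in row):
            matches.append((reader.line_num, row))
    return matches


if __name__ == "__main__":
    # Rows taken from the guidelines, held in memory instead of a file.
    sample = '"sou9","souk","souq","market"\n"marchi","","","market"\n'
    print(find_entries(io.StringIO(sample), "souk"))
    print(find_entries(io.StringIO(sample), "7anout"))
```

An empty result means the spelling was not found, although a different spelling variation of the same word may still be present.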

opened by m0saan 2
Some additions to colors

8 stands for Ghayn (غين): there is no need to write H as a numeral because it sounds the same, and by tradition people tend to write GH as 8 to avoid typing two characters.

opened by zouhair-isk 1
First set of new words

Hello,

I have added new words to the dataset. That's just a first set. I will add more in the coming weeks. (I'll try to do some scraping to make it easier to get more Darija words.)

Thanks for the great work and initiative.

opened by anasselhoud 1

