Maha is a text processing library specially developed to deal with Arabic text.

Mohammad Al-Fetyani

Last update: Nov 27, 2022

Overview

An Arabic text processing library intended for use in NLP applications

Maha is a text processing library specially developed to deal with Arabic text. The beta version can be used to clean and parse text, files, and folders with or without streaming capability.

If you need help or want to discuss topics related to Maha, feel free to reach out on our Discord server. If you would like to submit a bug report or feature request, please open an issue.

Installation

Simply run the following to install Maha:

pip install mahad # pronounced maha d

For source installation, check the documentation.

Overview

Check out the overview section in the documentation to get started with Maha.

Documentation

Documentation is hosted at ReadTheDocs.

Contributing

Maha welcomes and encourages everyone to contribute. Contributions are always appreciated. Feel free to take a look at our contribution guidelines in the documentation.

License

Maha is BSD-licensed.

Comments

Time: Add the ability to parse Hijri dates
What does this pull request change?

Closes #27.

Status (please check what you already did):

[x] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

new feature highlight
opened by TRoboto 6
Added distance to dimension parsing
What does this pull request change?

Resolves #15.

Status (please check what you already did):

[x] added some tests for the functionality

[x] updated the documentation

[x] tox passes

parsing highlight
opened by TRoboto 5
Introduce :mod:`~.datasets` module and the first dataset, `names`, with over 40,000 unique names
What does this pull request change?

This PR introduces a new datasets module that offers an interface for all upcoming datasets. A new dataset, names, is released along with the module. It comprises 44,161 unique names with descriptions and name origin included for most names.

Link to updated docs: https://maha--40.org.readthedocs.build/en/40/overview.html#datasets

Status (please check what you already did):

[x] added some tests for the functionality

[x] updated the documentation

[x] tox passes

new feature highlight
opened by TRoboto 4
Add pyupgrade to pre-commit and upgrade to future-style type annotations
What does this pull request change?

Upgrades to new type annotations style.

Status (please check what you already did):

[ ] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

maintenance
opened by TRoboto 3
Deprecate and remove `datasets` module and host datasets on Hugging Face instead
What does this pull request change?

Removes datasets module.

Datasets are now hosted on Hugging Face.

Status (please check what you already did):

[ ] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

breaking changes deprecation
opened by TRoboto 3
Add the ability to parse names from text
What does this pull request change?

Adds #24. Depends on #40

Status (please check what you already did):

[x] added some tests for the functionality

[x] updated the documentation

[x] tox passes

new feature highlight
opened by TRoboto 3
Add a deprecation system
What does this pull request change?

Closes #23

Adds three deprecation decorators: one for functions, one for parameters, and one for default parameter values.
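As a rough illustration of what such decorators can look like, here is a minimal sketch built only on the standard-library `warnings` and `functools` modules. The names `deprecated_function`, `deprecated_parameter`, `old_clean`, and `new_clean` are hypothetical and are not taken from Maha's code base; the decorators actually added by this PR may differ.

```python
import functools
import warnings


def deprecated_function(since: str, alternative: str = ""):
    """Hypothetical decorator: warn whenever the wrapped function is called."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            message = f"{func.__name__} is deprecated since {since}."
            if alternative:
                message += f" Use {alternative} instead."
            # stacklevel=2 points the warning at the caller, not at this wrapper.
            warnings.warn(message, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)

        return wrapper

    return decorator


def deprecated_parameter(name: str, since: str):
    """Hypothetical decorator: warn only when `name` is passed as a keyword."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if name in kwargs:
                warnings.warn(
                    f"Parameter '{name}' of {func.__name__} is deprecated since {since}.",
                    DeprecationWarning,
                    stacklevel=2,
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Illustrative functions only; they do not exist in Maha.
@deprecated_function(since="0.3.0", alternative="new_clean")
def old_clean(text: str) -> str:
    return text.strip()


@deprecated_parameter("lowercase", since="0.3.0")
def new_clean(text: str, lowercase: bool = False) -> str:
    text = text.strip()
    return text.lower() if lowercase else text
```

Both decorators leave the wrapped function's behavior unchanged and only emit a `DeprecationWarning`, so existing callers keep working until the deprecated item is removed.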

Status (please check what you already did):

[x] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

development
opened by saedx1 3
Prepare for the next release of Maha (v0.3.0)
This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

Generated changelogs for release v0.3.0.

Bumped pypi version to v0.3.0.

Updated the citation information.
opened by github-actions[bot] 2
Ordinal: Add support to `بعد` in ordinal parsing
What does this pull request change?

Closes #48.

Status (please check what you already did):

[x] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

new feature
opened by TRoboto 2
Numeral: Add support for hierarchical parsing
What does this pull request change?

Closes #25

Status (please check what you already did):

[x] added some tests for the functionality

[ ] updated the documentation

[x] tox passes

new feature
opened by TRoboto 2
Prepare for the next release of Maha (v0.2.0)
This is an auto-generated PR to prepare for the next release of Maha. The following changes were automatically made:

Generated changelogs for release v0.2.0.

Bumped pypi version to v0.2.0.

Updated the citation information.
opened by github-actions[bot] 2
Update ci.yml
Check support for Python 3.10

What does this pull request change? It checks whether the library supports Python 3.10.

...

Status (please check what you already did):

[ ] added some tests for the functionality

[ ] updated the documentation

[ ] tox passes
opened by PAIN-BARHAM 1
[pre-commit.ci] pre-commit autoupdate
updates:

github.com/pre-commit/pre-commit-hooks: v4.3.0 → v4.4.0

github.com/psf/black: 22.6.0 → 22.12.0

github.com/pycqa/isort: 5.10.1 → 5.11.4

github.com/asottile/pyupgrade: v2.37.3 → v3.3.1
opened by pre-commit-ci[bot] 1
Add the option to ignore Harakat when removing or replacing
What problem are you trying to solve?

Currently, the cleaner functions do not consider two strings similar if they have different Harakat/diacritics, which is the correct behavior. However, it would be great if the user had the option to ignore Harakat when comparing strings.

Examples (if relevant)

Current:

>>> from maha.cleaners.functions import remove
>>> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة")
>>> output
يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى

Suggested:

>>> from maha.cleaners.functions import remove
>>> output = remove("يُدَرِّسُ اللُّغَةَ العَرَبِيَّةَ الفُصْحَى", custom_expressions=r"اللغة", ignore_harakat=True)
>>> output
يُدَرِّسُ العَرَبِيَّةَ الفُصْحَى
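One possible way to get this behavior, sketched here with the standard-library `re` module only, is to allow any run of Harakat after each letter of the search expression. This is an assumption about how the option could work, not Maha's implementation: the helper names are hypothetical, the expression is treated as literal text (not a regular expression), and only the diacritics U+064B to U+0652 are considered.

```python
import re

# Arabic diacritics (Harakat) from FATHATAN (U+064B) to SUKUN (U+0652).
HARAKAT_CLASS = "[\u064B-\u0652]"


def strip_harakat(text: str) -> str:
    """Remove all Harakat in the range above from `text`."""
    return re.sub(HARAKAT_CLASS, "", text)


def harakat_insensitive(expression: str) -> str:
    """Build a pattern matching `expression` with optional Harakat after each letter."""
    letters = strip_harakat(expression)
    return "".join(re.escape(ch) + HARAKAT_CLASS + "*" for ch in letters)


def remove_ignoring_harakat(text: str, expression: str) -> str:
    """Remove `expression` from `text` regardless of the Harakat it carries."""
    cleaned = re.sub(harakat_insensitive(expression), "", text)
    # Collapse the double space left behind by the removed word.
    return re.sub(r" {2,}", " ", cleaned).strip()
```

Because the pattern is not anchored to word boundaries, it also matches the expression inside longer words; a real implementation would need to decide whether that is desirable.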

Definition of Done

It must adhere to the coding style used in the defined cleaner functions.

The implementation should cover most use cases.

Adding tests

feature request
opened by xaleel 1
Wrong parsed name using name dimension
What happened?

The name parser extracts incorrect names such as بي and شكرا.

Example:
text: أريد البحث في سجل الإنفاق الخاص بي
[Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]

I expect only names that appear in the names dataset to be extracted.

Python version

3.8

What operating system are you using?

Linux

Code to reproduce the issue

>>> from maha.parsers.functions import parse_dimension
>>> text = "أريد البحث في سجل الإنفاق الخاص بي"
>>> extracted = parse_dimension(text, names=True)
>>> extracted
[Dimension(body=بي, value=بي, start=32, end=34, dimension_type=DimensionType.NAME)]

Relevant log output

No response
bug parsing
opened by PAIN-BARHAM 0
Add feature to parse duration period
What problem are you trying to solve?

Parse a duration from text that expresses it as the difference between two dates.

Examples (if relevant)

>>> from maha.parsers.functions import parse_dimension
>>> output = parse_dimension('عن ربع نمو سكان العالم القديم والتحضر بين 1700 و 1900 ميلادي', duration=True)[0].value
>>> output
DurationValue(values=[ValueUnit(value=200, unit=<DurationUnit.YEARS: 7>)], normalized_unit=<DurationUnit.SECONDS: 1>)
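To make the request concrete, the core arithmetic can be sketched with a plain regular expression. This is only an illustration under simplifying assumptions: it handles the single pattern "بين <year> و <year>", the name `years_between` is hypothetical, and it does not use or reflect Maha's duration dimension.

```python
import re
from typing import Optional

# Matches "بين <year> و <year>" (between <year> and <year>), years of 3-4 digits.
YEAR_RANGE = re.compile(r"بين\s+(\d{3,4})\s+و\s*(\d{3,4})")


def years_between(text: str) -> Optional[int]:
    """Return the number of years spanned by a 'between X and Y' phrase, if any."""
    match = YEAR_RANGE.search(text)
    if match is None:
        return None
    start, end = (int(group) for group in match.groups())
    return abs(end - start)
```

For the sentence in the example above, the two captured years are 1700 and 1900, so the function returns 200, which matches the `value=200` in the requested `DurationValue`.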

Definition of Done

It must adhere to the coding style used in the defined dimensions, in particular the duration dimension.

The implementation should cover most use cases.

Adding tests

feature request
opened by PAIN-BARHAM 1

Adding the parser functionality to Processors

What problem are you trying to solve?

Adding the parser functionality to Processors to parse different dimensions.

Examples (if relevant)

>>> from pathlib import Path
>>> import maha
>>> resource_path = Path(maha.__file__).parents[1] / "sample_data/tweets.txt"
>>> data = resource_path.read_text()
>>> print(data)

الساعة الآن 12:00 في اسبانيا 🇪🇸, انتهى بشكل رسمي عقد الأسطورة ليو ميسي مع برشلونة . .
طبعا بكونو حاطين المكيف ع٣ مئوية وخود تقلبات وبرد وحر وCNS وزعيق المراقب وألف نيلة وقر فتحت اشوف درجة الحرارة هتبقي كام يو الامتحان لقيتها ٤٢ والامتحان الساعه ١ فعايز انورماليز اننا ننزل بالفالنه الحمالات Hot fac
يسعدلي مساكم ❤🌹 شرح كلمة zwa هالمنشور رح تلاقو (zwar) سهل و لذيذ (aber) ناقصو شوية ملح وكزبر #منقو
مـعلش استحملوني ب الاصفر هالفتره 💛 #ريشـه هههههههه
لما حد يسالني بتختفي كتير لية =..
زيِّنوا ليلة الجمع بالصلاة على النَّبِيِّ ﷺ" ❤
#Windows11 is on the horizon. What feature are you looking forward to
Get vaccinate #savethesaviour
Today I am beginning project on 10 days duratio #30daysofcod #DEVCommunit

>>> from maha.processors import FileProcessor
>>> proc = FileProcessor(resource_path)
>>> parsed = proc.parse_dimension(time=True)
>>> parsed
[Dimension(body=الساعة الآن 12:00, value=TimeValue(years=0, months=0, days=0, hours=0, minutes=0, seconds=0, hour=12, minute=0, second=0, microsecond=0), start=0, end=17, dimension_type=DimensionType.TIME),
 Dimension(body=الساعه ١, value=TimeValue(hour=1, minute=0, second=0, microsecond=0), start=238, end=246, dimension_type=DimensionType.TIME),
 Dimension(body=ليلة, value=TimeValue(am_pm='PM'), start=491, end=495, dimension_type=DimensionType.TIME)]
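The `start` and `end` values above are offsets into the whole file, not into a single line. A minimal sketch of how a processor could stream a file line by line while still reporting file-level offsets is shown below. It is an assumption for illustration only: `iter_matches` and `CLOCK_TIME` are hypothetical names, and a simple clock-time regular expression stands in for Maha's time parser.

```python
import re
from pathlib import Path
from typing import Iterator, Pattern, Tuple

# Stand-in for a real parser: matches clock times such as 12:00 or 1:30.
CLOCK_TIME = re.compile(r"\b\d{1,2}:\d{2}\b")


def iter_matches(
    path: Path, pattern: Pattern[str] = CLOCK_TIME
) -> Iterator[Tuple[str, int, int]]:
    """Yield (body, start, end) for each match, with offsets relative to the file."""
    offset = 0  # number of characters consumed before the current line
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            for match in pattern.finditer(line):
                yield match.group(), offset + match.start(), offset + match.end()
            offset += len(line)
```

Reading one line at a time keeps memory use constant for large files; the running `offset` is what lets each match be located in the original file.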

Definition of Done

It must adhere to the coding style.
The implementation should cover most use cases.
Adding tests.

good first issue feature request parsing

opened by PAIN-BARHAM 0

Releases(v0.3.0)

v0.3.0(Apr 4, 2022)

Check out the changelog for this release.
v0.2.0(Nov 16, 2021)

Check out the changelog for this release.
v0.1.2(Sep 23, 2021)
Quick fix:

Added readme badges

Fixed missing regex dependency


Owner

Mohammad Al-Fetyani

Machine Learning Engineer

Develop open-source Python Arabic NLP libraries that the Arab world will easily use in all Natural Language Processing applications

2 Oct 22, 2022

easySpeech is an open-source Python wrapper for google speech to text API that doesn't require PyAudio(So you especially windows user don't have to deal with the errors while installing PyAudio) and also works with hugging face transformers

easySpeech easySpeech is an open source python wrapper for google speech to text api that doesn't require PyAaudio(So you specially windows user don't

14 May 24, 2022

In this repository, I have developed an end to end Automatic speech recognition project. I have developed the neural network model for automatic speech recognition with PyTorch and used MLflow to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.

End to End Automatic Speech Recognition In this repository, I have developed an end to end Automatic speech recognition project. I have developed the

22 Nov 13, 2022

Text to speech is a process to convert any text into voice. Text to speech project takes words on digital devices and convert them into audio. Here I have used Google-text-to-speech library popularly known as gTTS library to convert text file to .mp3 file. Hope you like my project!

Text to speech (using Python) Text to speech is a process to convert any text into voice. Text to speech project takes words on digital devices and co

19 Jun 30, 2022

Python library for processing Chinese text

SnowNLP: Simplified Chinese Text Processing SnowNLP是一个python写的类库，可以方便的处理中文文本内容，是受到了TextBlob的启发而写的，由于现在大部分的自然语言处理库基本都是针对英文的，于是写了一个方便处理中文的类库，并且和TextBlob

6k Jan 2, 2023

Nateve compiler developed with python.

Adam Adam is a Nateve Programming Language compiler developed using Python. Nateve Nateve is a new general domain programming language open source ins

7 Jan 15, 2022

Contains descriptions and code of the mini-projects developed in various programming languages

TexttoSpeechAndLanguageTranslator-project introduction A pleasant application where the client will be given buttons like play,reset and exit. The cli

1 Dec 22, 2021

Repository of the Code to Chatbots, developed in Python

Description In this repository you will find the Code to my Chatbots, developed in Python. I'll explain the structure of this Repository later. Requir

0 Oct 25, 2022

🗣️ NALP is a library that covers Natural Adversarial Language Processing.

NALP: Natural Adversarial Language Processing Welcome to NALP. Have you ever wanted to create natural text from raw sources? If yes, NALP is for you!

21 Aug 12, 2022

A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

2.9k Jan 2, 2023

A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

2.6k Feb 18, 2021

LightSeq: A High-Performance Inference Library for Sequence Processing and Generation

LightSeq is a high performance inference library for sequence processing and generation implemented in CUDA. It enables highly efficient computation of modern NLP models such as BERT, GPT2, Transformer, etc. It is therefore best useful for Machine Translation, Text Generation, Dialog， Language Modelling, and other related tasks using these models.

2.5k Jan 3, 2023

Natural Language Processing library built with AllenNLP 🌲🌱

Custom Natural Language Processing with big and small models

65 Sep 13, 2022

A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

2.9k Dec 31, 2022

A high-level Python library for Quantum Natural Language Processing

Python library for Serbian Natural language processing (NLP)

Textlesslib - Library for Textless Spoken Language Processing

Tools, wrappers, etc... for data science with a concentration on text processing

Multilingual text (NLP) processing toolkit