NLPretext
Working on an NLP project and tired of always looking for the same silly preprocessing functions on the web?
😫
Need to efficiently extract email addresses from a document? Hashtags from tweets? Remove accents from a French post?
😥
NLPretext got you covered!
NLPretext gathers in a single library all the text preprocessing functions you need to ease your NLP project.
- Default preprocessing pipeline
- Custom preprocessing pipeline
- Replacing phone numbers
- Removing hashtags
- Extracting emojis
- Data augmentation
Can't find what you're looking for? Feel free to open an issue.
Installation
This package has been tested on Python 3.6, 3.7 and 3.8.
We strongly advise you to do the remaining steps in a virtual environment.
To install this library, you just have to run the following command:
pip install nlpretext
This library uses spaCy as its tokenizer. The models currently supported are en_core_web_sm and fr_core_news_sm. If they are not installed, run the following commands:
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz
pip install https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-2.3.0/fr_core_news_sm-2.3.0.tar.gz
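Alternatively, assuming the spaCy version installed alongside NLPretext matches these model versions, the models can also be fetched with spaCy's own download command:
python -m spacy download en_core_web_sm
python -m spacy download fr_core_news_sm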
Preprocessing pipeline
Default pipeline
Need to preprocess your text data but have no clue about which functions to use and in which order? The default preprocessing pipeline has you covered:
from nlpretext import Preprocessor
text = "I just got the best dinner in my life @latourdargent !!! I recommend 😀 #food #paris \n"
preprocessor = Preprocessor()
text = preprocessor.run(text)
print(text)
# "I just got the best dinner in my life !!! I recommend"
Create your custom pipeline
Another possibility is to create your own custom pipeline if you know exactly which functions to apply to your data. Here's an example:
from nlpretext import Preprocessor
from nlpretext.basic.preprocess import (
    normalize_whitespace, remove_punct, remove_eol_characters,
    remove_stopwords, lower_text
)
from nlpretext.social.preprocess import remove_mentions, remove_hashtag, remove_emoji
text = "I just got the best dinner in my life @latourdargent !!! I recommend 😀 #food #paris \n"
preprocessor = Preprocessor()
preprocessor.pipe(lower_text)
preprocessor.pipe(remove_mentions)
preprocessor.pipe(remove_hashtag)
preprocessor.pipe(remove_emoji)
preprocessor.pipe(remove_eol_characters)
preprocessor.pipe(remove_stopwords, args={'lang': 'en'})
preprocessor.pipe(remove_punct)
preprocessor.pipe(normalize_whitespace)
text = preprocessor.run(text)
print(text)
# "dinner life recommend"
Take a look at all the available functions in the preprocess.py scripts located in the different folders: basic, social, token.
Individual Functions
Replacing emails
from nlpretext.basic.preprocess import replace_emails
example = "I have forwarded this email to [email protected]"
example = replace_emails(example, replace_with="*EMAIL*")
print(example)
# "I have forwarded this email to *EMAIL*"
Replacing phone numbers
from nlpretext.basic.preprocess import replace_phone_numbers
example = "My phone number is 0606060606"
example = replace_phone_numbers(example, country_to_detect=["FR"], replace_with="*PHONE*")
print(example)
# "My phone number is *PHONE*"
Removing Hashtags
from nlpretext.social.preprocess import remove_hashtag
example = "This restaurant was amazing #food #foodie #foodstagram #dinner"
example = remove_hashtag(example)
print(example)
# "This restaurant was amazing"
Extracting emojis
from nlpretext.social.preprocess import extract_emojis
example = "I take care of my skin 😀"
example = extract_emojis(example)
print(example)
# [':grinning_face:']
Data augmentation
The augmentation module helps you generate new texts from your given examples by modifying some words in the initial ones, while keeping the associated entities unchanged, if any, in the case of NER tasks. If you want words other than entities to remain unchanged, you can specify them within the stopwords argument. Modifications depend on the chosen method; the ones currently supported by the module are substitutions with synonyms, using either WordNet or BERT from the nlpaug library.
from nlpretext.augmentation.text_augmentation import augment_text
example = "I want to buy a small black handbag please."
entities = [{'entity': 'Color', 'word': 'black', 'startCharIndex': 22, 'endCharIndex': 27}]
example = augment_text(example, method="wordnet_synonym", entities=entities)
print(example)
# "I need to buy a small black pocketbook please."
Make HTML documentation
In order to build the HTML Sphinx documentation, run the following command at the nlpretext root path:
sphinx-apidoc -f nlpretext -o docs/
This will generate the .rst files. You can then build the documentation with:
cd docs && make html
You can now open the file index.html located in the build folder.
Project Organization
├── LICENSE
├── VERSION
├── CONTRIBUTING.md     <- Contribution guidelines
├── README.md           <- The top-level README for developers using this project
├── .github/workflows   <- Where the CI lives
├── datasets/external   <- Bash scripts to download external datasets
├── docs                <- Sphinx HTML documentation
├── nlpretext           <- Main package. This is where the code lives
│   ├── preprocessor.py <- Main preprocessing script
│   ├── augmentation    <- Text augmentation script
│   ├── basic           <- Basic text preprocessing
│   ├── social          <- Social text preprocessing
│   ├── token           <- Token text preprocessing
│   ├── _config         <- Where the configuration and constants live
│   └── _utils          <- Where the preprocessing utils scripts live
├── tests               <- Where the tests live
├── setup.py            <- Makes the project pip-installable (pip install -e .) so the package can be imported
├── requirements.txt    <- The requirements file for reproducing the analysis environment, e.g.
│                          generated with `pip freeze > requirements.txt`
└── pylintrc            <- The linting configuration file