🌴
dobbi
🦕
Takes care of all of this boring NLP stuff
Description
An open-source NLP library: fast text cleaning and preprocessing.
TL;DR
This library provides a quick and ready-to-use text preprocessing tools for text cleaning and normalization. You can simply remove hashtags, nicknames, emoji, url addresses, punctuation, whitespace and whatever.
Installation
To download dobbi, either fork this GitHub repo or simply use Pypi via pip:
$ pip install dobbi
Usage
Import the library:
import dobbi
Interaction
The library uses method chaining in order to simplify text processing:
dobbi.clean() \
.hashtag() \
.nickname() \
.url() \
.execute('Check here: https://some-url.com')
Supported methods and patterns
The process consists of three stages:
- Initialization methods: initialize a dobbi Work object
- Intermediate methods: chain patterns in the needed order
- Terminal methods: choose if you need a function or a result
Initialization functions:
dobbi.clean()
dobbi.collect()
dobbi.replace()
Intermediate methods (pattern processing choice):
regexp()
- custom regular expressionsurl()
- URLshtml()
- HTML and "<...>" type markupspunctuation()
- punctuationhashtag()
- hashtagsemoji()
- emojiemoticons()
- emoticonswhitespace()
- any type of whitespacesnickname()
- @-starting nicknames
Terminal methods:
execute(str)
- executes chosen methods on the provided string.function()
- returns a function which is a combination of the chosen methods.
Examples
1) Clean a random Twitter message
dobbi.clean() \
.hashtag() \
.nickname() \
.url() \
.execute('#fun #lol Why @Alex33 is so funny? Check here: https://some-url.com')
Result:
'Why is so funny? Check here:'
2) Replace nicknames and urls with tokens
dobbi.replace() \
.hashtag('') \
.nickname() \
.url('__CUSTOM_URL_TOKEN__') \
.execute('#fun #lol Why @Alex33 is so funny? Check here: https://some-url.com')
Result:
'Why TOKEN_NICKNAME is so funny? Check here: __CUSTOM_URL_TOKEN__'
3) Get the text cleanup function (one-liner)
Please, try to avoid the in-line method chaining, as it is less readable. Do as your heart tells you.
func = dobbi.clean().url().hashtag().punctuation().whitespace().html().function()
func('\t #fun #lol Why @Alex33 is so... funny?
\nCheck
\there: https://some-url.com'
)
Result:
'Why Alex33 is so funny Check here'
- Chain regexp methods
dobbi.clean() \
.regexp('#\w+') \
.regexp('@\w+') \
.regexp('https?://\S+') \
.execute('#fun #lol Why @Alex33 is so funny? Check here: https://some-url.com')
Result:
'Why is so funny? Check here:'
Additional
Please pay attention that the functions are applied in the order you've specified them. So, you're better to chain .punctuation()
as one of the last functions.
🤗
Call for collaboration If you enjoyed the project I would be grateful if you supported it :)
Below is the list of useful features I would be happy to share with you:
- Finding bugs
- Making code optimizations
- Writing tests
- Help with new features development