Extract keywords from sentences or replace keywords in sentences.

Overview

FlashText

This module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the FlashText algorithm.

Installation

$ pip install flashtext

API doc

Documentation can be found at FlashText Read the Docs.

Usage

Extract keywords
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> # keyword_processor.add_keyword(<unclean name>, <standardised name>)
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
>>> keywords_found
>>> # ['New York', 'Bay Area']
Replace keywords
>>> keyword_processor.add_keyword('New Delhi', 'NCR region')
>>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
>>> new_sentence
>>> # 'I love New York and NCR region.'
Case Sensitive example
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor(case_sensitive=True)
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
>>> keywords_found
>>> # ['Bay Area']
Span of keywords extracted
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.', span_info=True)
>>> keywords_found
>>> # [('New York', 7, 16), ('Bay Area', 21, 29)]
Get Extra information with keywords extracted
>>> from flashtext import KeywordProcessor
>>> kp = KeywordProcessor()
>>> kp.add_keyword('Taj Mahal', ('Monument', 'Taj Mahal'))
>>> kp.add_keyword('Delhi', ('Location', 'Delhi'))
>>> kp.extract_keywords('Taj Mahal is in Delhi.')
>>> # [('Monument', 'Taj Mahal'), ('Location', 'Delhi')]
>>> # NOTE: replace_keywords feature won't work with this.
No clean name for Keywords
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
>>> keywords_found
>>> # ['Big Apple', 'Bay Area']
Add Multiple Keywords simultaneously
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_dict = {
>>>     "java": ["java_2e", "java programing"],
>>>     "product management": ["PM", "product manager"]
>>> }
>>> # {'clean_name': ['list of unclean names']}
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
>>> # Or add keywords from a list:
>>> keyword_processor.add_keywords_from_list(["java", "python"])
>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
>>> # output ['product management', 'java']
To Remove keywords
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_dict = {
>>>     "java": ["java_2e", "java programing"],
>>>     "product management": ["PM", "product manager"]
>>> }
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
>>> print(keyword_processor.extract_keywords('I am a product manager for a java_2e platform'))
>>> # output ['product management', 'java']
>>> keyword_processor.remove_keyword('java_2e')
>>> # you can also remove keywords from a list or a dictionary
>>> keyword_processor.remove_keywords_from_dict({"product management": ["PM"]})
>>> keyword_processor.remove_keywords_from_list(["java programing"])
>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
>>> # output ['product management']
To check Number of terms in KeywordProcessor
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_dict = {
>>>     "java": ["java_2e", "java programing"],
>>>     "product management": ["PM", "product manager"]
>>> }
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
>>> print(len(keyword_processor))
>>> # output 4
To check if term is present in KeywordProcessor
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('j2ee', 'Java')
>>> 'j2ee' in keyword_processor
>>> # output: True
>>> keyword_processor.get_keyword('j2ee')
>>> # output: Java
>>> keyword_processor['colour'] = 'color'
>>> keyword_processor['colour']
>>> # output: color
Get all keywords in dictionary
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('j2ee', 'Java')
>>> keyword_processor.add_keyword('colour', 'color')
>>> keyword_processor.get_all_keywords()
>>> # output: {'colour': 'color', 'j2ee': 'Java'}

For word-boundary detection, any character outside the \w set [A-Za-z0-9_] is currently considered a word boundary.

To set or add characters as part of word characters
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple')
>>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
>>> # ['Big Apple']
>>> keyword_processor.add_non_word_boundary('/')
>>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
>>> # []
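
The whole set of word characters can also be replaced at once with set_non_word_boundaries (the same method used in one of the issues below). A minimal sketch; the exact default character set shown here is an assumption:

>>> import string
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> # treat '/' as a word character in addition to [A-Za-z0-9_]
>>> keyword_processor.set_non_word_boundaries(set(string.ascii_letters + string.digits + '_/'))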

Test

$ git clone https://github.com/vi3k6i5/flashtext
$ cd flashtext
$ pip install pytest
$ python setup.py test

Build Docs

$ git clone https://github.com/vi3k6i5/flashtext
$ cd flashtext/docs
$ pip install sphinx
$ make html
$ # open _build/html/index.html in browser to view it locally

Why not Regex?

It's a custom algorithm based on the Aho-Corasick algorithm and a trie dictionary.
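
As a rough illustration of the idea (a sketch, not flashtext's actual implementation, and ignoring word boundaries): all keywords are stored in one character trie, so the text is scanned once instead of once per pattern.

# Minimal trie-based keyword search sketch (illustrative only).
def build_trie(keywords):
    root = {}
    for kw in keywords:
        node = root
        for ch in kw:
            node = node.setdefault(ch, {})
        node['_end_'] = kw  # mark a complete keyword
    return root

def find_keywords(text, root):
    found, i = [], 0
    while i < len(text):
        node, j, last = root, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if '_end_' in node:
                last = (node['_end_'], j)  # remember the longest match so far
        if last:
            found.append(last[0])
            i = last[1]  # resume scanning after the match
        else:
            i += 1
    return found

print(find_keywords('I love Big Apple and Bay Area.', build_trie(['Big Apple', 'Bay Area'])))
# ['Big Apple', 'Bay Area']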

Benchmark

Time taken by FlashText to find terms in comparison to Regex.

https://thepracticaldev.s3.amazonaws.com/i/xruf50n6z1r37ti8rd89.png

Time taken by FlashText to replace terms in comparison to Regex.

https://thepracticaldev.s3.amazonaws.com/i/k44ghwp8o712dm58debj.png

Link to code for benchmarking the Find Feature and Replace Feature.
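
The shape of such a comparison, as a hedged sketch (the keyword list and corpus below are made up; this is not the linked benchmark code):

# Rough timing sketch: flashtext vs. a compiled regex union of the same keywords.
import random
import re
import string
import time

from flashtext import KeywordProcessor

words = [''.join(random.choices(string.ascii_lowercase, k=6)) for _ in range(10000)]
keywords = random.sample(words, 1000)
text = ' '.join(random.choices(words, k=50000))

kp = KeywordProcessor()
kp.add_keywords_from_list(keywords)
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b')

start = time.time(); kp.extract_keywords(text); print('flashtext:', time.time() - start)
start = time.time(); pattern.findall(text); print('regex:', time.time() - start)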

The idea for this library came from the following StackOverflow question.

Citation

The original paper published on FlashText algorithm.

@ARTICLE{2017arXiv171100046S,
   author = {{Singh}, V.},
    title = "{Replace or Retrieve Keywords In Documents at Scale}",
  journal = {ArXiv e-prints},
archivePrefix = "arXiv",
   eprint = {1711.00046},
 primaryClass = "cs.DS",
 keywords = {Computer Science - Data Structures and Algorithms},
     year = 2017,
    month = oct,
   adsurl = {http://adsabs.harvard.edu/abs/2017arXiv171100046S},
  adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

The article published on the freeCodeCamp Medium publication.

Contribute

License

The project is licensed under the MIT license.

Comments
  • Fuzzy matching

    This PR tries to introduce fuzzy matching in flashtext, as mentioned in these issues:


    Guidelines

    • rely as much as possible on the existing algorithm: only trigger fuzzy matching on a mismatch, so we keep the focus on performance
    • when adding new parameters, the function should keep exactly the same behaviour when the parameter is left at its default value, so we don't conflict with the existing tests
    • modify as little code as possible

    Features included

    1. KeywordProcessor.extract_keywords and KeywordProcessor.replace_keywords both take a new optional parameter, max_cost, which is the maximum Levenshtein distance accepted when performing fuzzy matching on a single keyword
    2. KeywordProcessor implements a levenshtein function, which tries to find a match for a given word within the provided max_cost and returns a node in the trie from which the search continues
    3. a new function has been included: get_next_word, which simply retrieves the next word in the sequence. Tests are included in test/test_kp_next_word.py

    Optimizations to keep focus on performance

    • We set the current_dict to the first node yielded by the levenshtein function, so we get back to static matching as soon as possible
    • We decrement the current cost (initialized to max_cost) every time we trigger fuzzy matching on a word, so if all of max_cost has already been "consumed" by other words in the current keyword, we do not trigger fuzzy matching again. E.g. when trying to extract the keyword "here you are" from "heere you are" with a max_cost of 1, the cost is fully consumed by the first word ("heere"), so no fuzzy matching will be performed on the remaining words ("you" and "are")
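
    A usage sketch of the proposed parameter, matching the example above (hypothetical until this PR is merged):

    # Hypothetical usage of the max_cost parameter proposed in this PR.
    from flashtext import KeywordProcessor

    kp = KeywordProcessor()
    kp.add_keyword('here you are')
    kp.extract_keywords('heere you are', max_cost=1)
    # expected: ['here you are'] -- the single edit is spent on "heere",
    # so "you" and "are" must match exactly (and here they do)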

    Limitations

    • Well, as this is a pure Python implementation, don't expect it to be blazingly fast: even though we cut the recursion inside the levenshtein inner function as soon as possible, fuzzy matching still requires far more operations than exact matching
    opened by remiadon 31
  • Please stop using random crap algorithms and use correct Aho–Corasick

    kp = KeywordProcessor()
    kp.add_keyword("a a")
    kp.add_keyword("a b")
    print(kp.extract_keywords("a a b", span_info = True)) # where is "a b"?
    print(kp.extract_keywords("a a b a a", span_info = True)) # where is "a b" and second "a a"?
    
    kp2 = KeywordProcessor()
    kp2.add_keyword("a b")
    kp2.add_keyword("b c d e")
    print(kp2.extract_keywords("a b c d e", span_info = True)) # where is _longest_ "b c d e"?
    

    You can use regular Aho–Corasick for your program if you include a word-boundary symbol in the keywords.

    Or you can use whole words as symbols - this will be much faster in Python.
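
    A minimal sketch of that word-level Aho–Corasick (whole words as symbols, reporting every overlapping match as (keyword, first word index, last word index)); an illustration of the comment's suggestion, not flashtext's code:

    from collections import deque

    def build_automaton(keywords):
        trie = [{'next': {}, 'fail': 0, 'out': []}]
        for kw in keywords:
            node = 0
            for w in kw.split():
                nxt = trie[node]['next'].get(w)
                if nxt is None:
                    trie.append({'next': {}, 'fail': 0, 'out': []})
                    nxt = len(trie) - 1
                    trie[node]['next'][w] = nxt
                node = nxt
            trie[node]['out'].append(kw)
        queue = deque(trie[0]['next'].values())
        while queue:  # BFS to compute failure links
            node = queue.popleft()
            for w, nxt in trie[node]['next'].items():
                queue.append(nxt)
                f = trie[node]['fail']
                while f and w not in trie[f]['next']:
                    f = trie[f]['fail']
                trie[nxt]['fail'] = trie[f]['next'].get(w, 0)
                trie[nxt]['out'] += trie[trie[nxt]['fail']]['out']
        return trie

    def search(text, trie):
        node, found = 0, []
        for i, w in enumerate(text.split()):
            while node and w not in trie[node]['next']:
                node = trie[node]['fail']
            node = trie[node]['next'].get(w, 0)
            for kw in trie[node]['out']:
                found.append((kw, i - len(kw.split()) + 1, i))
        return found

    print(search('a a b', build_automaton(['a a', 'a b'])))
    # [('a a', 0, 1), ('a b', 1, 2)] -- both overlapping matches are reported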

    question 
    opened by mayorovp 15
  • Replace with Nothing

    Hi, this is very useful. Is there an option for the case below? I want to replace one particular word with nothing/null/blank. Currently it works if I give a space.

    The following does not work (I just want to replace 'new delhi' with an empty string):

    keyword_processor.add_keyword('New Delhi', '')
    new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
    new_sentence

    The following works with a space:

    keyword_processor.add_keyword('New Delhi', ' ')
    new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
    new_sentence
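
    One possible workaround (my sketch, not a library feature): replace with a single space, then normalize the leftover whitespace:

    import re

    keyword_processor.add_keyword('New Delhi', ' ')
    new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
    # collapse double spaces and drop spaces left before punctuation
    cleaned = re.sub(r'\s+([.,;!?])', r'\1', re.sub(r'\s{2,}', ' ', new_sentence)).strip()
    # cleaned == 'I love Big Apple and.'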

    opened by nh111 7
  • keywords_case_sensitive test does not pass

    opened by sundy-li 7
  • Feature Request: Can we also get span of matches found?

    Most regex libraries also give the location of the matches found. Can this information also be provided by FlashText?

    For example:

    >>> from flashtext import KeywordProcessor
    >>> keyword_processor = KeywordProcessor()
    >>> # keyword_processor.add_keyword(<unclean name>, <standardised name>)
    >>> keyword_processor.add_keyword('Big Apple', 'New York')
    >>> keyword_processor.add_keyword('Bay Area')
    # Maybe something like
    >>> keywords_found = keyword_processor.extract_keywords('I love New York and Bay Area.', spanInfo=True)
    >>> keywords_found
    >>> # {'New York': (8,15), 'Bay Area': (21,28)}
    
    duplicate 
    opened by scarescrow 6
  • Trigger multiple entries by same keyword?

    I was trying to use a key word dict like this:

    from flashtext import KeywordProcessor
    keyword_processor = KeywordProcessor()
    keyword_dict = {
        "java": ["java_2e", "java programing"],
        "product management": ["PM", "java_2e", "product manager"]
    }
    keyword_processor.add_keywords_from_dict(keyword_dict)
    

    I thought the keyword "java_2e" would trigger both "java" and "product management".

    However, the output for the following code is:

    keyword_processor.extract_keywords('I am a programmer for a java_2e platform')
    

    Output:

    ['product management']
    

    Expected output:

    ['java', 'product management']
    

    The matcher seems to pick only one. I was wondering what the correct way is to trigger multiple entries with the same keyword.

    opened by easonnie 5
  • Suggestion: compile your trie to a regexp...

    You may get the best of both worlds (good algorithm, native-speed matcher) by actually compiling your trie to a regexp as https://github.com/ZhukovAlexander/triegex does...

    opened by pygy 5
  • not finding all occurrences of keywords

    Thanks for open-sourcing flashtext.

    I discovered a minor issue. I'm looking for genes in biological texts. I noticed that the gene ire1 is not recognized in the following passage:

    targets relative to targets of the IRE1/XBP1s and PERK arms of the UPR

    Prior to utilizing your script I used regexes and padded each search term with \b, hence I was able to pick up genes that occurred in the form geneA/geneB.

    Thanks again, Axel

    bug 
    opened by axelmuller 5
  • Fix issue with incomplete keyword at the end of the sentence

    While experimenting with Flashtext performance, I added an exact comparison between the replacement results produced by different methods. By doing that, I noticed that if the sentence ends with the beginning of some keyword, the last word of the sentence is lost.

    opened by killfactory 4
  • Include links to other projects?

    flashtext is great, thank you for building, documenting, and writing a post about it!

    I've incorporated it into our pipeline and saw a 60x speedup with respect to the regex matching we were doing. If you'd like to link back to our project (as proof yours is being used in the wild), feel free, or I can submit a PR for that. If not, thanks again for the quick drop-in library!

    opened by thoppe 4
  • [Feature suggestion] has_keywords() to check if there is one of the keywords in the text

    We can currently do if extract_keywords(...) (which is truthy if one or more instances are found), but it has to go through the entire text.

    To reduce the time spent in this use case, is it possible to just return True as soon as the first instance is found?

    Thank you.
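
    A chunked early-exit workaround built on the current API (a sketch; it assumes every keyword is shorter than the overlap):

    def has_keywords(kp, text, chunk_size=10000, overlap=100):
        # Scan fixed-size chunks and stop at the first hit; the overlap keeps
        # keywords straddling a chunk boundary from being missed.
        for start in range(0, len(text), chunk_size):
            if kp.extract_keywords(text[max(0, start - overlap):start + chunk_size]):
                return True
        return False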

    question 
    opened by bact 4
  • Get tags that come after a linebreak

    Hi guys,

    first of all, thanks a lot for your great library, it is a huge help on my current project. I have one question though: I need to find the indices of certain tags (':F50:') in a string that I get from an XML file. These tags come after a linebreak, which in XML is represented by '&#xD'. However, some of the tags are followed by a '/' whereas others are not. When I add ':F50:' to the list of keywords, the keyword processor is able to find the tags that are followed by the '/', but not the other ones. Only if I add ':F50' to the keyword list are the ones without a '/' found. My concern is that with ':F50' in the keyword list, the keyword processor finds more tags than I want. Is there an explanation for that behavior? If yes, can I somehow work around it? Would it make sense to replace the XML-formatted linebreak with a different value?

    Thanks a lot in advance for any help provided!

    opened by Moxelbert 0
  • Fails to replace adjacent keywords with empty non-word boundaries

    In this example:

    test_replacer = KeywordProcessor(case_sensitive=True)
    test_replacer.add_keyword("aa", "b")
    test_replacer.add_keyword("cc", "d")
    test_replacer.set_non_word_boundaries("")
    
    teststring = "aacc"
    replacedstring = test_replacer.replace_keywords(teststring)
    
    print("Teststring:\n" + teststring)
    print("Replacedstring:\n" + replacedstring)
    

    I get this output:

    Teststring:
    aacc
    Replacedstring:
    bcc
    

    I expect to get bd. Am I misunderstanding the intended behaviour, or is this a bug?

    opened by jamespicone 0
  • Add support to release Linux aarch64 wheels

    Problem

    On aarch64, 'pip install flashtext' builds the wheels from the source code and then installs them. This requires the user to have a development environment installed on their system, and building the wheels takes more time than downloading and extracting them from PyPI.

    Resolution

    On aarch64, 'pip install flashtext' should download the wheels from PyPI.

    @vi3k6i5 and team, please let me know your interest in releasing aarch64 wheels. I can help with this.

    opened by odidev 0
  • The two Chinese characters "成都" are not recognized

    from flashtext import KeywordProcessor

    # text = "@苍月轶 再次核实:骆然5月8日持24小时核酸从宜昌回蓉,到成都24小时内核酸一次,9号回泸定,24小时内又做一次核酸,均阴性,健康码绿码。宜昌不是 AB区域。"
    text = "成都到北京高铁3小时,郑州到成都2小时"
    print(text)

    kp = KeywordProcessor()
    kp.add_keyword("到成都", ("成都", "ab"))
    kp.add_keyword("宜昌", ("宜昌", "ab"))

    print(len(kp))
    print(kp)
    word_index = kp.extract_keywords(text, span_info=True)
    print(word_index)
    for item in word_index:
        print(text[item[1]:item[2]])

    print('finished')

    opened by GuoPL 1
  • Be case_sensitive w.r.t. whitespaces / blank spaces

    Hey, consider the following example:

    from flashtext import KeywordProcessor
    keyword_processor = KeywordProcessor()
    # keyword_processor.add_keyword(<unclean name>, <standardised name>)
    keyword_processor.add_keyword('Big Apple')#, 'New York')
    keyword_processor.add_keyword('Bay Area')
    
    text1_keywords = 'I love Big Apple and Bay Area.'
    text2_keywords = 'I love Big  Apple and Bay Area.'
    
    keywords_found_1 = keyword_processor.extract_keywords(text1_keywords)
    keywords_found_2 = keyword_processor.extract_keywords(text2_keywords)
    
    keywords_found_1," vs. ",keywords_found_2
    

    For everyday use it would be beneficial if the algorithm had an option not to distinguish between a single whitespace and multiple whitespaces. (A version where the maximal number of whitespaces to be treated as one can be configured would be even better.)

    For text extraction, we can easily preprocess the text and reduce multiple whitespaces to just one a priori (see the sketch below). For text replacement the task is much more complicated, since we might want to reduce the whitespaces only within this expression, not in the complete text.

    My question is hence: Is it possible to implement this whitespace sensitivity in the algorithm (even if regex patterns are not supported in general)?


    The same question arises for line breaks "\n".
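
    For the extraction case, the preprocessing mentioned above could be as simple as this sketch (it also covers line breaks):

    import re

    normalized = re.sub(r'\s+', ' ', text2_keywords)  # collapse any whitespace run, including '\n'
    keyword_processor.extract_keywords(normalized)
    # ['Big Apple', 'Bay Area']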

    opened by sambaPython24 5