Python implementation of TextRank for phrase extraction and summarization of text documents

Overview

PyTextRank

PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension, used to:

  • extract the top-ranked phrases from text documents
  • run low-cost extractive summarization of text documents
  • help infer links from unstructured text into structured data

Background

One of the goals for PyTextRank is to provide support (eventually) for entity linking, in contrast to the more commonplace usage of named entity recognition. These approaches can be used together in complementary ways to improve the results overall.

The introduction of graph algorithms -- notably, eigenvector centrality -- provides a more flexible and robust basis for integrating additional techniques that enhance the natural language work being performed. The entity linking aspects here are still a work-in-progress scheduled for a later release.

Internally PyTextRank constructs a lemma graph to represent links among the candidate phrases (e.g., unrecognized entities) and their supporting language. Generally speaking, any means of enriching that graph prior to phrase ranking will tend to improve results. Possible ways to enrich the lemma graph include coreference resolution and semantic relations, as well as leveraging knowledge graphs in the general case.
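
As a rough illustration of that ranking step (not the library's internal code), a lemma graph can be modeled with networkx -- which PyTextRank uses internally -- and scored with PageRank, the eigenvector-centrality measure mentioned above:

import networkx as nx

# toy lemma graph: nodes are (lemma, POS) pairs for candidate words,
# edges link lemmas that co-occur within a small window of the text
lemma_graph = nx.Graph()
lemma_graph.add_edges_from([
    (("linear", "ADJ"), ("constraint", "NOUN")),
    (("linear", "ADJ"), ("equation", "NOUN")),
    (("natural", "ADJ"), ("number", "NOUN")),
    (("constraint", "NOUN"), ("number", "NOUN")),
])

# PageRank approximates eigenvector centrality over the lemma graph;
# higher-ranked lemmas seed higher-ranked candidate phrases
for node, rank in sorted(nx.pagerank(lemma_graph).items(), key=lambda kv: kv[1], reverse=True):
    print(f"{rank:.4f}  {node}")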

For example, WordNet and DBpedia both provide means for inferring links among entities, and purpose-built knowledge graphs can be applied for specific use cases. These can help enrich a lemma graph even in cases where links are not explicit within the text. Consider a paragraph that mentions cats and kittens in different sentences: an implied semantic relation exists between the two nouns since the lemma kitten is a hyponym of the lemma cat -- such that an inferred link can be added between them.
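
As a sketch of that kind of inference -- using NLTK's WordNet corpus here purely for illustration, since spacy-wordnet is the integration listed in the TODOs -- one might test for a hypernym/hyponym relation before adding an inferred edge:

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def wordnet_links(lemma_a, lemma_b):
    """True if one noun lemma lies on the other's hypernym path in WordNet."""
    for syn_a in wn.synsets(lemma_a, pos=wn.NOUN):
        for syn_b in wn.synsets(lemma_b, pos=wn.NOUN):
            hypernyms_a = set(syn_a.closure(lambda s: s.hypernyms()))
            hypernyms_b = set(syn_b.closure(lambda s: s.hypernyms()))
            if syn_a in hypernyms_b or syn_b in hypernyms_a:
                return True
    return False

# "cat" is a hyponym of "feline", so an inferred edge between those lemma
# nodes is justified; note that WordNet routes "kitten" through "young mammal",
# so the cat/kitten case above may need a looser relation than strict hyponymy
print(wordnet_links("cat", "feline"))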

This has an additional benefit of linking parsed and annotated documents into more structured data, and can also be used to support knowledge graph construction.

The TextRank algorithm used here is based on research published in:
"TextRank: Bringing Order into Text"
Rada Mihalcea, Paul Tarau
Empirical Methods in Natural Language Processing (2004)

Several modifications in PyTextRank improve on the algorithm originally described in the paper:

  • fixed a bug: see Java impl, 2008
  • use lemmatization in place of stemming
  • include verbs in the graph (but not in the resulting phrases)
  • leverage preprocessing via noun chunking and named entity recognition
  • provide extractive summarization based on ranked phrases
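
Those preprocessing steps map onto standard spaCy document attributes; for example, the lemmas, noun chunks, and named entities that feed the lemma graph can be inspected directly:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Minimal generating sets of solutions are given for these Diophantine equations.")

# lemmas for the content words (graph nodes use lemmas, not stems)
print([(tok.lemma_, tok.pos_) for tok in doc if tok.pos_ in ("ADJ", "NOUN", "PROPN", "VERB")])

# noun chunks and named entities supply the candidate phrases
print([chunk.text for chunk in doc.noun_chunks])
print([ent.text for ent in doc.ents])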

This implementation was inspired by the Williams 2016 talk on text summarization. Note that while much better approaches exist for summarizing text, questions linger about some of the top contenders -- see: 1, 2. Arguably, having alternatives such as this one allows for cost trade-offs.

Installation

Prerequisites:

To install from PyPI:

pip install pytextrank
python -m spacy download en_core_web_sm

If you install directly from this Git repo, be sure to install the dependencies as well:

pip install -r requirements.txt
python -m spacy download en_core_web_sm

Usage

import spacy
import pytextrank

# example text
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank", last=True)

doc = nlp(text)

# examine the top-ranked phrases in the document
for p in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(p.rank, p.count, p.text))
    print(p.chunks)
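
For the extractive summarization use case, the TextRank object attached to each document exposes a summary() method; the parameter names below follow the 3.x documentation, so check them against your installed release:

# summarize the document by selecting sentences closest to the top-ranked phrases
tr = doc._.textrank

for sent in tr.summary(limit_phrases=15, limit_sentences=3):
    print(sent)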

For other example usage, see the PyTextRank wiki. If you need to troubleshoot any problems, please open an issue on the GitHub repository.

For related course materials and training, please check for calendar updates in the article "Natural Language Processing in Python".

Let us know if you find this package useful, tell us about use cases, describe what else you would like to see integrated, etc. For inquiries about consulting work in machine learning, natural language, knowledge graph, and other AI applications, contact Derwen, Inc.

Links

Testing

To run the unit tests:

coverage run -m unittest discover

To generate a coverage report and upload it to the codecov.io reporting site:

coverage report
bash <(curl -s https://codecov.io/bash) -t @.cc_token

Test coverage reports can be viewed at https://codecov.io/gh/DerwenAI/pytextrank

License and Copyright

Source code for PyTextRank plus its logo, documentation, and examples are covered by an MIT license, which is succinct and simplifies use in commercial applications.

All materials herein are Copyright © 2016-2021 Derwen, Inc.

Attribution

Please use the following BibTeX entry for citing PyTextRank if you use it in your research or software. Citations are helpful for the continued development and maintenance of this library.

@software{PyTextRank,
  author = {Paco Nathan},
  title = {{PyTextRank, a Python implementation of TextRank for phrase extraction and summarization of text documents}},
  year = 2016,
  publisher = {Derwen},
  url = {https://github.com/DerwenAI/pytextrank}
}

TODOs

  • kglab integration
  • generate MkDocs
  • MyPy and PyLint coverage
  • include more unit tests
  • show examples of spacy-wordnet to enrich the lemma graph
  • leverage neuralcoref to enrich the lemma graph
  • generate a phrase graph, with entity linking into Wikidata, etc.
  • fix Sphinx errors, generate docs

Kudos

Many thanks to our contributors: @louisguitton, @anna-droid-beep, @kavorite, @htmartin, @williamsmj, @mattkohl, @vanita5, @HarshGrandeur, @mnowotka, @kjam, @dvsrepo, @SaiThejeshwar, @laxatives, @dimmu, @JasonZhangzy1757, @jake-aft, @junchen1992, @Ankush-Chander, @shyamcody, @chikubee, encouragement from the wonderful folks at spaCy, plus general support from Derwen, Inc.


Comments
  • Example file throws KeyError: 1255

    Have not been able to get either the long form (from wiki) or short form (from github readme) files to work successfully.

    The file @ https://github.com/DerwenAI/pytextrank/blob/master/example.py throws a KeyError: 1255 when run. Output for this is below.

    I have been able to get the example from the github page working but only for very small strings. Anything larger than a few words throws a KeyError with varying number depending on the length of the string.

    Can't figure out the issue even using all input (txt files) from the example on the wiki page and changing the spacy version to various releases from 2.0.0 to present.


    KeyError                                  Traceback (most recent call last)
    <ipython-input> in <module>()
         31 text = f.read()
         32
    ---> 33 doc = nlp(text)
         34
         35 print("pipeline", nlp.pipe_names)

    /home/pete/.local/lib/python3.5/site-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
        433         if not hasattr(proc, "__call__"):
        434             raise ValueError(Errors.E003.format(component=type(proc), name=name))
    --> 435         doc = proc(doc, **component_cfg.get(name, {}))
        436         if doc is None:
        437             raise ValueError(Errors.E005.format(name=name))

    /usr/local/lib/python3.5/dist-packages/pytextrank/pytextrank.py in PipelineComponent(self, doc)
        530         """
        531         self.doc = doc
    --> 532         Doc.set_extension("phrases", force=True, default=self.calc_textrank())
        533         Doc.set_extension("textrank", force=True, default=self)

    /usr/local/lib/python3.5/dist-packages/pytextrank/pytextrank.py in calc_textrank(self)
        389
        390         for chunk in self.doc.noun_chunks:
    --> 391             self.collect_phrases(chunk)
        392
        393         for ent in self.doc.ents:

    /usr/local/lib/python3.5/dist-packages/pytextrank/pytextrank.py in collect_phrases(self, chunk)
        345         if key in self.seen_lemma:
        346             node_id = list(self.seen_lemma.keys()).index(key)
    --> 347             rank = self.ranks[node_id]
        348             phrase.sq_sum_rank += rank
        349             compound_key.add(key)

    KeyError: 1255

    bug 
    opened by oldskewlcool 17
  • A question on keyphrases that are subsets of others and overlapping `Spans`

    I think the current implementation returns keyphrases that are potential subsets of each other, that this is due to the use of noun_chunks and ents, and that this is not the desired output. Specifically, if a document has an entity that is a superset (as far as span start and end is concerned) of a noun chunk (or vice-versa), and both contain a key token, then both will be returned as keyphrases.

    While also/possibly linked to the issue of entity linkage (which I'd love to know more about!), this can simply be a matter of defining "entity" boundaries and a "duplication" issue, as the example below with "Seoul's Four Seasons hotel" and "Four Seasons" demonstrates -- I believe one keyphrase is enough there, and having both is confusing.

    Am I missing something? Is this the desired logic?

    Example:

    from spacy.util import filter_spans
    import pytextrank
    import en_core_web_sm
    
    nlp = en_core_web_sm.load()
    nlp.add_pipe("textrank", last=True);
    
    # from dat/lee.txt
    text = """
    After more than four hours of tight play and a rapid-fire endgame, Google's artificially intelligent Go-playing computer system has won a second contest against grandmaster Lee Sedol, taking a two-games-to-none lead in their historic best-of-five match in downtown Seoul.  The surprisingly skillful Google machine, known as AlphaGo, now needs only one more win to claim victory in the match. The Korean-born Lee Sedol will go down in defeat unless he takes each of the match's last three games. Though machines have beaten the best humans at chess, checkers, Othello, Scrabble, Jeopardy!, and so many other games considered tests of human intellect, they have never beaten the very best at Go. Game Three is set for Saturday afternoon inside Seoul's Four Seasons hotel.  The match is a way of judging the suddenly rapid progress of artificial intelligence. One of the machine-learning techniques at the heart of AlphaGo has already reinvented myriad online services inside Google and other big-name Internet companies, helping to identify images, recognize commands spoken into smartphones, improve search engine results, and more. Meanwhile, another AlphaGo technique is now driving experimental robotics at Google and places like the University of California at Berkeley. This week's match can show how far these technologies have come - and perhaps how far they will go.  Created in Asia over 2,500 year ago, Go is exponentially more complex than chess, and at least among humans, it requires an added degree of intuition. Lee Sedol is widely-regarded as the top Go player of the last decade, after winning more international titles than all but one other player. He is currently ranked number five in the world, and according to Demis Hassabis, who leads DeepMind, the Google AI lab that created AlphaGo, his team chose the Korean for this all-important match because they wanted an opponent who would be remembered as one of history's great players.  Although AlphaGo topped Lee Sedol in the match's first game on Wednesday afternoon, the outcome of Game Two was no easier to predict. In his 1996 match with IBM's Deep Blue supercomputer, world chess champion Gary Kasparov lost the first game but then came back to win the second game and, eventually, the match as a whole. It wasn't until the following year that Deep Blue topped Kasparov over the course of a six-game contest. The thing to realize is that, after playing AlphaGo for the first time on Wednesday, Lee Sedol could adjust his style of play - just as Kasparov did back in 1996. But AlphaGo could not. Because this Google creation relies so heavily on machine learning techniques, the DeepMind team needs a good four to six weeks to train a new incarnation of the system. And that means they can't really change things during this eight-day match.  "This is about teaching and learning," Hassabis told us just before Game Two. "One game is not enough data to learn from - for a machine - and training takes an awful lot of time."
    """
    
    doc = nlp(text)
    
    key_spans = []
    for phrase in doc._.phrases:
        for chunk in phrase.chunks:
            key_spans.append(chunk)
    
    print(len(key_spans))
    
    full_set = set([p.text for p in doc._.phrases])
    
    print(full_set)
    
    print(len(filter_spans(key_spans)))
    
    sub_set = set([pytextrank.util.default_scrubber(p) for p in filter_spans(key_spans)])
    
    print(sub_set)
    
    print(full_set - sub_set)
    
    print(sub_set - full_set)
    

    Possible solution?:

    all_spans = list(self.doc.noun_chunks) + list(self.doc.ents)
    filtered_spans = filter_spans(all_spans)
    filtered_phrases = self._collect_phrases(filtered_spans, self.ranks) # replacing all_phrases
    

    instead of

    nc_phrases: typing.Dict[Span, float] = self._collect_phrases(self.doc.noun_chunks, self.ranks)
    ent_phrases: typing.Dict[Span, float] = self._collect_phrases(self.doc.ents, self.ranks)
    all_phrases: typing.Dict[Span, float] = { **nc_phrases, **ent_phrases }
    

    see https://github.com/DerwenAI/pytextrank/blob/29339027b905844af0064ed9a0326e2578f21bf6/pytextrank/base.py#L362

    Note:

    • My understanding is that self._get_min_phrases is doing something else.
    • spacy.util.filter_spans simply looks for the (first) longest span, which might not be the best solution.
    enhancement 
    opened by DayalStrub 11
  • Errors importing from pytextrank

    Hi! I'm working on a project connected with NLP and was happy to find out that there is such a tool as PyTextRank. However, I've encountered an issue at the very beginning trying to just import package to run the example code given here. The error that I get is the following:

    ----> from pytextrank import json_iter, parse_doc, pretty_print
    ImportError: cannot import name 'json_iter'
    ----> from pytextrank import parse_doc
    ImportError: cannot import name 'parse_doc'
    

    I've tried running it in an IPython console and a Jupyter Notebook, both with the same result. I've installed PyTextRank with pip; the Python version that I have is 3.5.4, spaCy 2.1.8, networkx 2.4, graphviz 0.13.2.

    question 
    opened by Erin59 9
  • NotImplementedError: [E894] The 'noun_chunks' syntax iterator is not implemented for language 'ru'.

    It seems to me that nlp.add_pipe("textrank") requires "noun chunks", which raises a "NotImplementedError" for language models where "noun chunks" have not been implemented. I've got a "NotImplementedError" with the "ru_core_news_lg" and "ru_core_news_sm" spaCy models.

    The proposal is to make the use of "noun chunks" optional to prevent such errors.
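
    A rough user-side guard (a sketch only, which assumes spaCy exposes noun_chunks support through its per-language syntax_iterators registry) could check the model before adding the component:

    import spacy
    import pytextrank

    nlp = spacy.load("ru_core_news_sm")

    # noun_chunks is implemented via a per-language syntax iterator;
    # Russian models do not register one, so doc.noun_chunks raises E894
    if "noun_chunks" in nlp.Defaults.syntax_iterators:
        nlp.add_pipe("textrank")
    else:
        print("no noun_chunks iterator for this language; "
              "use a pytextrank release that tolerates its absence (3.2.3 and later)")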

    bug 
    opened by gremur 8
  • How to use this?

    Hi there, I've been looking at your code and example for a long time and I still have no idea how to use this.

    I have documents in string format, what JSON format should they have if I want to use the stages as in the examples?

    I find there's a crucial piece of information missing in the documentation, which is how to use the functionality of this package with a simple document in string format (or a list of strings representing sentences), since I don't know beforehand what JSON format I have to convert my text to in order to use the stage pipeline.

    Cheers

    question 
    opened by romanovzky 8
  • Error: Can't find factory for 'textrank' for language English....

    Hi there,

    Does any know how to fix below errors when running example code?

    Thanks.

    Traceback (most recent call last):
      File "test.py", line 14, in <module>
        nlp.add_pipe("textrank")
      File "/usr/local/lib64/python3.6/site-packages/spacy/language.py", line 773, in add_pipe
        validate=validate,
      File "/usr/local/lib64/python3.6/site-packages/spacy/language.py", line 639, in create_pipe
        raise ValueError(err)
    ValueError: [E002] Can't find factory for 'textrank' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).

    Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, parser, beam_parser, entity_linker, ner, beam_ner, entity_ruler, lemmatizer, tagger, morphologizer, senter, sentencizer, textcat, textcat_multilabel, en.lemmatizer

    question 
    opened by r76941156 7
  • Differences between 2.1.0 and 3.0.0

    Are the changes between the two versions of pytextrank documented anywhere?

    The queries seem to be giving different results, so I would like to understand if that is because of changes to spaCy or to the algorithm itself?

    Thank you for your help.

    question howto 
    opened by debraj135 7
  • Keyword extraction

    Hi there, I'm working on a project extracting keywords from a German text. Is there a tutorial on how to extract keywords using pytextrank?

    Best regards,

    question 
    opened by danielp3011 7
  • AttributeError: 'DiGraph' object has no attribute 'edge'

    Fixed by changing the code in pytextrank.py (line 307) from:

        try:
            graph.edge[pair[0]][pair[1]]["weight"] += 1.0
        except KeyError:
            graph.add_edge(pair[0], pair[1], weight=1.0)

    to:

        if "edge" in dir(graph):
            graph.edge[pair[0]][pair[1]]["weight"] += 1.0
        else:
            graph.add_edge(pair[0], pair[1], weight=1.0)

    opened by Vickoh 7
  • Add biasedtextrank module.

    Hey @ceteri, I have added a basic version of biased TextRank.

    It takes into account "focus" as well as "bias" to augment the ranking in favour of the focus. As per the paper, it should add bias to the graph based on a similarity calculation between the "focus" and the nodes, but this version just assigns "bias" to the focus terms while leaving other nodes unbiased.

    Let me know of your ideas so that we can improve upon this version.

    @louisguitton

    opened by Ankush-Chander 6
  • IndexError: list index out of range

    Hi,

    I'm getting the following error when trying to run pytextrank with my own data. Is there a way to fix this?

    app_1 | Traceback (most recent call last):
    app_1 |   File "index.py", line 26, in <module>
    app_1 |     for rl in pytextrank.normalize_key_phrases(path_stage1, ranks):
    app_1 |   File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 581, in normalize_key_phrases
    app_1 |     for rl in collect_entities(sent, ranks, stopwords, spacy_nlp):
    app_1 |   File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 485, in collect_entities
    app_1 |     w_ranks, w_ids = find_entity(sent, ranks, ent.text.split(" "), 0)
    app_1 |   File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 454, in find_entity
    app_1 |     return find_entity(sent, ranks, ent, i + 1)
    app_1 |   File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 454, in find_entity
    app_1 |     return find_entity(sent, ranks, ent, i + 1)
    app_1 |   File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 454, in find_entity
    app_1 |     return find_entity(sent, ranks, ent, i + 1)
    app_1 |   [Previous line repeated 137 more times]
    app_1 |   File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 451, in find_entity
    app_1 |     w = sent[i + j]
    app_1 | IndexError: list index out of range

    wontfix 
    opened by rabinneslo 6
  • [Snyk] Security upgrade setuptools from 39.0.1 to 65.5.1

    This PR was automatically created by Snyk using the credentials of a real user.


    Snyk has created this PR to fix one or more vulnerable packages in the `pip` dependencies of this project.

    Changes included in this PR

    • Changes to the following files to upgrade the vulnerable dependencies to a fixed version:
      • requirements-dev.txt
    ⚠️ Warning
    pymdown-extensions 8.0 requires Markdown, which is not installed.
    mkdocs-material 8.0.1 requires mkdocs, which is not installed.
    mkdocs-material 8.0.1 requires mkdocs-material-extensions, which is not installed.
    mkdocs-material 8.0.1 requires markdown, which is not installed.
    mkdocs-material 8.0.1 has requirement pymdown-extensions>=9.0, but you have pymdown-extensions 8.0.
    
    

    Vulnerabilities that will be fixed

    By pinning:

    Severity | Priority Score (*) | Issue | Upgrade | Breaking Change | Exploit Maturity
    ---------|--------------------|-------|---------|-----------------|-----------------
    low severity | 441/1000 (Why? Recently disclosed, has a fix available, CVSS 3.1) | Regular Expression Denial of Service (ReDoS) SNYK-PYTHON-SETUPTOOLS-3113904 | setuptools: 39.0.1 -> 65.5.1 | No | No Known Exploit

    (*) Note that the real score may have changed since the PR was raised.

    Some vulnerabilities couldn't be fully fixed and so Snyk will still find them when the project is tested again. This may be because the vulnerability existed within more than one direct dependency, but not all of the affected dependencies could be upgraded.

    Check the changes in this PR to ensure they won't cause issues with your project.


    Note: You are seeing this because you or someone else with access to this repository has authorized Snyk to open fix PRs.

    For more information: 🧐 View latest project report

    🛠 Adjust project settings

    📚 Read more about Snyk's upgrade and patch logic


    Learn how to fix vulnerabilities with free interactive lessons:

    🦉 Regular Expression Denial of Service (ReDoS)

    opened by ceteri 0
  • [Snyk] Security upgrade setuptools from 39.0.1 to 65.5.1

    Snyk has created this PR to fix one or more vulnerable packages in the `pip` dependencies of this project.

    Changes included in this PR

    • Changes to the following files to upgrade the vulnerable dependencies to a fixed version:
      • requirements-dev.txt
    ⚠️ Warning
    pymdown-extensions 8.0 requires Markdown, which is not installed.
    mkdocs-material 8.0.1 requires mkdocs, which is not installed.
    mkdocs-material 8.0.1 requires markdown, which is not installed.
    mkdocs-material 8.0.1 requires mkdocs-material-extensions, which is not installed.
    mkdocs-material 8.0.1 has requirement pymdown-extensions>=9.0, but you have pymdown-extensions 8.0.
    
    

    Vulnerabilities that will be fixed

    By pinning:

    Severity | Priority Score (*) | Issue | Upgrade | Breaking Change | Exploit Maturity
    ---------|--------------------|-------|---------|-----------------|-----------------
    low severity | 441/1000 (Why? Recently disclosed, has a fix available, CVSS 3.1) | Regular Expression Denial of Service (ReDoS) SNYK-PYTHON-SETUPTOOLS-3113904 | setuptools: 39.0.1 -> 65.5.1 | No | No Known Exploit

    (*) Note that the real score may have changed since the PR was raised.

    Some vulnerabilities couldn't be fully fixed and so Snyk will still find them when the project is tested again. This may be because the vulnerability existed within more than one direct dependency, but not all of the affected dependencies could be upgraded.

    Check the changes in this PR to ensure they won't cause issues with your project.


    Note: You are seeing this because you or someone else with access to this repository has authorized Snyk to open fix PRs.

    For more information: 🧐 View latest project report

    🛠 Adjust project settings

    📚 Read more about Snyk's upgrade and patch logic


    Learn how to fix vulnerabilities with free interactive lessons:

    🦉 Regular Expression Denial of Service (ReDoS)

    opened by snyk-bot 0
  • suggestion: allow "wildcard" POS for stopwords

    The current approach which specifies stopwords as lemma: [POS] presents two issues:

    1. There are some terms on which POS taggers will fail. For example, spaCy labels "AI" (artificial intelligence) as PROPN.
    2. If I create software to be used by people without linguistic knowledge, I cannot expect them to know about POS.

    As a work-around, it is necessary to specify all POS tags, which is rather inelegant.
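
    For reference, the lemma: [POS] entries are typically passed in through the component config; a sketch, assuming the 3.x stopwords constructor argument accepts a plain dict:

    import spacy
    import pytextrank

    nlp = spacy.load("en_core_web_sm")

    # suppress the lemma "ai" when tagged PROPN -- every POS of interest
    # currently has to be listed explicitly, which is the complaint above
    nlp.add_pipe("textrank", config={"stopwords": {"ai": ["PROPN"]}})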

    opened by arc12 0
  • "ValueError: [E002] Can't find factory for 'textrank' for language English (en)." - incompatibility with SpaCy 3.3.1?

    I'm trying to use this package for the first time and followed the README:

    !pip install pytextrank
    !python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textrank")
    

    This throws an error at the last line:

    ValueError: [E002] Can't find factory for 'textrank' for language English (en). This usually happens when spaCy calls 'nlp.create_pipe' with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator '@Language.component' (for function components) or '@Language.factory' (for class components).

    Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, ner, beam_ner, entity_ruler, tagger, morphologizer, senter, sentencizer, textcat, spancat, future_entity_ruler, span_ruler, textcat_multilabel, en.lemmatizer`

    Is this an incompatibility with spaCy version 3.3.1, or have I overlooked something crucial? Which spaCy version do you recommend? (I restarted the kernel after installing pytextrank)
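
    For what it's worth, the snippet as shown never imports pytextrank in the Python session; in the 3.x releases the "textrank" factory is registered when the package is imported, so a sketch of the working sequence would be:

    import spacy
    import pytextrank  # importing the package registers the "textrank" factory

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textrank")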

    question 
    opened by lisabecker-ml6 1
  • Is `biasedtextrank` implemented?

    https://github.com/DerwenAI/pytextrank/blob/9ab64507a26f946191504598f86021f511245cd7/pytextrank/base.py#L305

    self.focus_tokens is initialized to an empty set but I don't see where it is parameterized?

    e.g.

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("biasedtextrank")
    focus = "my example focus"
    doc = nlp(text)
    

    At what point can I inform the model of the focus?
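
    Based on the change_focus() method mentioned in the v3.1.1 release notes, one plausible entry point is sketched here; the method name comes from those notes, but its exact parameters are an assumption to verify against the installed version:

    import spacy
    import pytextrank

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("biasedtextrank")

    doc = nlp("AlphaGo beat Lee Sedol in the historic Go match in Seoul.")

    # assumption: change_focus() takes the focus text and re-runs the ranking,
    # refreshing doc._.phrases -- check the API docs for bias-weight parameters
    doc._.textrank.change_focus("AlphaGo")

    for phrase in doc._.phrases[:5]:
        print(phrase.rank, phrase.text)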

    question kg 
    opened by Ayenem 4
  • ZeroDivisionError: division by zero in _calc_discounted_normalised_rank

    Hi,

    I use this library together with spaCy for the extraction of the most important words. However, when using the Catalan model of spaCy, the algorithm gives the following error:

    `File "/code/app.py", line 20, in getNlpEntities

    entities = runTextRankEntities(hl, contents['contents'], algorithm, num)
    

    File "/code/nlp/textRankEntities.py", line 51, in runTextRankEntities

    doc = nlp(joined_content)
    

    File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1022, in call

    error_handler(name, proc, [doc], e)
    

    File "/usr/local/lib/python3.9/site-packages/spacy/util.py", line 1617, in raise_error

    raise e
    

    File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1017, in call

    doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
    

    File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 253, in call

    doc._.phrases = doc._.textrank.calc_textrank()
    

    File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 363, in calc_textrank

    nc_phrases = self._collect_phrases(self.doc.noun_chunks, self.ranks)
    

    File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 548, in _collect_phrases

    return {
    

    File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 549, in

    span: self._calc_discounted_normalised_rank(span, sum_rank)
    

    File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 592, in _calc_discounted_normalised_rank

    phrase_rank = math.sqrt(sum_rank / (len(span) + non_lemma))
    

    ZeroDivisionError: division by zero`
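
    The failing expression divides by len(span) + non_lemma, so the immediate symptom is a zero denominator for some Catalan noun chunks; a defensive guard along these lines (a sketch, not the project's actual fix) avoids the crash:

    import math

    def discounted_normalised_rank(span, sum_rank, non_lemma):
        # hypothetical standalone version of the failing helper: fall back to
        # a zero rank when the span contributes no countable tokens
        denom = len(span) + non_lemma
        if denom <= 0:
            return 0.0
        return math.sqrt(sum_rank / denom)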

    bug help wanted good first issue 
    opened by sumitkumarjethani 2
Releases(v3.2.4)
  • v3.2.4(Jul 27, 2022)

    2022-07-27

    • better support for "ru" and other languages without noun_chunks support in spaCy
    • updated example notebook to illustrate TopicRank algorithm
    • made the node bias setting case-independent for Biased Textrank algorithm; kudos @Ankush-Chander
    • updated summarization tests; kudos @tomaarsen
    • reworked some unit tests to be less brittle, less dependent on specific spaCy point releases

    What's Changed

    • updated docs and example to show TopicRank by @ceteri in https://github.com/DerwenAI/pytextrank/pull/211
    • working on #204 by @ceteri in https://github.com/DerwenAI/pytextrank/pull/212
    • Prevent exception on TopicRank when there are no noun_chunks by @tomaarsen in https://github.com/DerwenAI/pytextrank/pull/219
    • Biasedrank case fix by @Ankush-Chander in https://github.com/DerwenAI/pytextrank/pull/217
    • Docs update by @ceteri in https://github.com/DerwenAI/pytextrank/pull/221
    • rework some unit tests by @ceteri in https://github.com/DerwenAI/pytextrank/pull/222

    Full Changelog: https://github.com/DerwenAI/pytextrank/compare/v3.2.3...v3.2.4

  • v3.2.3(Mar 6, 2022)

    2022-03-06

    • handles missing noun_chunks in some language models (e.g., "ru") #204
    • add TopicRank algorithm; kudos @tomaarsen
    • improved test suite; fixed tests for newer spacy releases; kudos @tomaarsen

    What's Changed

    • [Snyk] Security upgrade mistune from 0.8.4 to 2.0.1 by @snyk-bot in https://github.com/DerwenAI/pytextrank/pull/201
    • Improved test suite; fixed tests by @tomaarsen in https://github.com/DerwenAI/pytextrank/pull/205
    • Updated Copyright year from 2021 to 2022 by @tomaarsen in https://github.com/DerwenAI/pytextrank/pull/206
    • update API reference docs by @ceteri in https://github.com/DerwenAI/pytextrank/pull/207
    • Inclusion of the TopicRank Keyphrase Extraction algorithm by @tomaarsen in https://github.com/DerwenAI/pytextrank/pull/208
    • Prep release by @ceteri in https://github.com/DerwenAI/pytextrank/pull/210

    New Contributors

    • @snyk-bot made their first contribution in https://github.com/DerwenAI/pytextrank/pull/201

    Full Changelog: https://github.com/DerwenAI/pytextrank/compare/v3.2.2...v3.2.3

  • v3.2.2(Oct 10, 2021)

    What's Changed

    • prep next release by @ceteri in https://github.com/DerwenAI/pytextrank/pull/189
    • warning about the deprecated code in archive by @ceteri in https://github.com/DerwenAI/pytextrank/pull/190
    • fixes chunk to be between sent_start and sent_end in BaseTextRank.calc_sent_dist by @clabornd in https://github.com/DerwenAI/pytextrank/pull/191
    • Update by @ceteri in https://github.com/DerwenAI/pytextrank/pull/198
    • add more scrubber examples and documentation by @dayalstrub-cma in https://github.com/DerwenAI/pytextrank/pull/197
    • kudos by @ceteri in https://github.com/DerwenAI/pytextrank/pull/199
    • prep PyPi release by @ceteri in https://github.com/DerwenAI/pytextrank/pull/200

    New Contributors

    • @clabornd made their first contribution in https://github.com/DerwenAI/pytextrank/pull/191
    • @dayalstrub-cma made their first contribution in https://github.com/DerwenAI/pytextrank/pull/197

    Full Changelog: https://github.com/DerwenAI/pytextrank/compare/v3.2.1...v3.2.2

  • v3.2.1(Jul 24, 2021)

  • v3.2.0(Jul 17, 2021)

    2021-07-17

    Various updates to support spaCy 3.1.x, which changes some interfaces.

    • NB: THE SCRUBBER UPDATE WILL BREAK PREVIOUS RELEASES
    • allow Span as scrubber argument, to align with spaCy 3.1.x; kudos @Ankush-Chander
    • add lgtm code reviews (slow, not integrating into GitHub PRs directly)
    • evaluating grayskull to generate a conda-forge recipe
    • add use of pipdeptree to analyze dependencies
    • use KG from biblio.ttl to generate bibliography
    • fixed overlooked comment from earlier code; kudos @debraj135
    • add visualisation using altair; kudos @louisguitton
    • add scrubber usage in sample notebook; kudos @Ankush-Chander
    • integrating use of MkRefs to generate semantic reference pages in docs
  • v3.1.1(Mar 25, 2021)

    2021-03-25

    • fix the span length calculation in explanation notebook; kudos @Ankush-Chander
    • add BiasedTextRank by @Ankush-Chander (many thanks!)
    • add conda environment.yml plus instructions
    • use bandit to check for security issues
    • use codespell to check for spelling errors
    • add pre-commit checks in general
    • update doc._.phrases in the call to change_focus() so the summarization will sync with the latest focus
  • v3.1.0(Mar 12, 2021)

    2021-03-12

    • rename master branch to main
    • add a factory class that assigns each doc its own Textrank object; kudos @Ankush-Chander
    • refactor the stopwords feature as a constructor argument
    • add get_unit_vector() method to expose the characteristic unit vector
    • add calc_sent_dist() method to expose the sentence distance measures (for summarization)
    • include a unit test for summarization
    • updated contributor instructions
    • pylint coverage for code checking
    • linking definitions and citations in source code apidocs to our online docs
    • updated links on PyPi
  • v3.0.1(Feb 27, 2021)

  • v3.0.0(Feb 14, 2021)

    2021-02-14

    • THIS WILL BREAK THINGS!!!
    • support for spaCy 3.0.x; kudos @Lord-V15
    • full integration of PositionRank
    • migrated all unit tests to pytest
    • removed use of logger for debugging, introducing icecream instead
  • v2.1.0(Jan 31, 2021)

    2021-01-31

    • add PositionRank by @louisguitton (many thanks!)
    • fixes chunk in explain_summ.ipynb by @anna-droid-beep
    • add option preserve_order in TextRank.summary by @kavorite
    • tested with spaCy 2.3.5
  • v2.0.3(Sep 15, 2020)

    2020-09-15

    • try-catch ZeroDivisionError in summary method -- kudos @shyamcody
    • tested with updated dependencies: spaCy 2.3.x and NetworkX 2.5
  • v2.0.2(Jun 28, 2020)

  • v2.0.1(Mar 2, 2020)

    2020-03-02

    • fix KeyError issue for pre Python 3.6
    • integrated codecov.io
    • added PyTextRank to the spaCy uniVerse
    • fixed README.md instructions to download en_core_web_sm
  • v2.0.0(Nov 5, 2019)

    • refactored library to run as a spaCy extension
    • supports multiple languages
    • significantly faster, with less memory required
    • better extraction of top-ranked phrases
    • changed license to MIT
    • uses lemma-based stopwords for more precise control
    • WIP toward integration with knowledge graph use cases
  • v1.2.1(Nov 1, 2019)

  • v1.2.0(Nov 1, 2019)

  • v1.1.1(Sep 15, 2017)

  • v1.1.0(Jun 7, 2017)

    Replaced TextBlob usage with spaCy for improved parsing results. Updated the other Python dependencies. Also added better handling for UTF-8.

  • v1.0.1(May 1, 2017)

  • v1.0.0(Mar 13, 2017)
