SpikeX - SpaCy Pipes for Knowledge Extraction

Overview

SpikeX is a collection of pipes ready to be plugged into a spaCy pipeline. It aims to help build knowledge extraction tools with almost zero effort.

What's new in SpikeX 0.5.0

WikiGraph has never been so lightning fast:

  • 🌕 Performance mooning, thanks to the adoption of a sparse adjacency matrix to handle the pages graph, instead of using igraph
  • 🚀 Memory optimization, with consumption cut by ~40% and compressed size cut by ~20%, thanks to new bidirectional dictionaries for managing data
  • 📖 New APIs for faster and easier usage and interaction
  • 🛠 Overall fixes, for a better graph and better page matching

Pipes

  • WikiPageX links Wikipedia pages to chunks in text
  • ClusterX picks noun chunks in a text and clusters them based on a revisited version of the Ball Mapper algorithm, Radial Ball Mapper
  • AbbrX detects abbreviations and acronyms, linking them to their long forms. It is based on scispacy's abbreviation detector, with improvements
  • LabelX takes labelings of pattern-matching expressions and catches them in a text, resolving overlaps, abbreviations and acronyms
  • PhraseX creates a Doc underscore extension based on a custom attribute name and phrase patterns. Examples are NounPhraseX and VerbPhraseX, which extract noun phrases and verb phrases, respectively
  • SentX detects sentences in a text, based on Splitta with refinements

Tools

  • WikiGraph with pages as leaves linked to categories as nodes
  • Matcher that inherits its interface from spaCy's, but is built on a RegEx-based engine that boosts its performance

Install SpikeX

Some requirements are inherited from spaCy:

  • spaCy version: 2.3+
  • Operating system: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
  • Python version: Python 3.6+ (only 64 bit)
  • Package managers: pip

Some dependencies use Cython, so it needs to be installed before SpikeX:

pip install cython

Remember that a virtual environment is always recommended, in order to avoid modifying system state.
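
For example, on Linux or macOS a fresh environment can be created and activated like this (the name .venv is just an example), and Cython and SpikeX then installed inside it:

python -m venv .venv
source .venv/bin/activate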

pip

At this point, installing SpikeX via pip is a one-line command:

pip install spikex

Usage

Prerequirements

SpikeX pipes work with spaCy, hence a model needs to be installed. Follow the official instructions here. The brand-new spaCy 3.0 is supported!
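
For instance, the small English model used in the examples below can be installed with:

python -m spacy download en_core_web_sm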

WikiGraph

A WikiGraph is built starting from some key components of Wikipedia: pages, categories and relations between them.

Auto

Creating a WikiGraph can take time, depending on how large its Wikipedia dump is. For this reason, we provide wikigraphs ready to be used:

Date        WikiGraph        Lang  Size (compressed)  Size (memory)
2021-04-01  enwiki_core      EN    1.1GB              5.9GB
2021-04-01  simplewiki_core  EN    19MB               120MB
2021-04-01  itwiki_core      IT    189MB              1.1GB
More coming...

SpikeX provides a command to shortcut downloading and installing a WikiGraph (Linux or macOS, Windows not supported yet):

spikex download-wikigraph simplewiki_core

Manual

A WikiGraph can be created from the command line, specifying which Wikipedia dump to take and where to save it:

spikex create-wikigraph \
  <YOUR-OUTPUT-PATH> \
  --wiki <WIKI-NAME, default: en> \
  --version <DUMP-VERSION, default: latest> \
  --dumps-path <DUMPS-BACKUP-PATH>

Then it needs to be packed and installed:

spikex package-wikigraph \
  <WIKIGRAPH-RAW-PATH> \
  <YOUR-OUTPUT-PATH>

Follow the instructions at the end of the packing process and install the distribution package in your virtual environment. Now you are ready to use your WikiGraph as you wish:

from spikex.wikigraph import load as wg_load

wg = wg_load("enwiki_core")
page = "Natural_language_processing"
categories = wg.get_categories(page, distance=1)
for category in categories:
    print(category)

>>> Category:Speech_recognition
>>> Category:Artificial_intelligence
>>> Category:Natural_language_processing
>>> Category:Computational_linguistics
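
The distance argument sets how many steps to walk up the category hierarchy. A quick sketch going one level further (the interpretation of distance and the resulting output are assumptions here and depend on the WikiGraph version):

# walk one extra level up the category tree (hypothetical broader view)
broader = wg.get_categories(page, distance=2)
for category in broader:
    print(category)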

Matcher

The Matcher is identical to spaCy's, but faster when it comes to handling many patterns at once (on the order of thousands), so follow the official usage instructions here.

A trivial example:

from spikex.matcher import Matcher
from spacy import load as spacy_load

nlp = spacy_load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("TEST", [[{"LOWER": "nlp"}]])
doc = nlp("I love NLP")
for _, s, e in matcher(doc):
  print(doc[s: e])

>>> NLP
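
The speed-up matters most when many patterns are added at once. A small sketch that continues the example above, batching several illustrative patterns under one label:

# add a batch of token patterns under a single label
keywords = ["nlp", "spacy", "spikex"]
matcher.add("KEYWORDS", [[{"LOWER": kw}] for kw in keywords])
doc = nlp("I use spaCy and SpikeX for NLP")
for _, s, e in matcher(doc):
  print(doc[s: e])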

WikiPageX

The WikiPageX pipe uses a WikiGraph in order to find chunks in a text that match Wikipedia page titles.

from spacy import load as spacy_load
from spikex.wikigraph import load as wg_load
from spikex.pipes import WikiPageX

nlp = spacy_load("en_core_web_sm")
doc = nlp("An apple a day keeps the doctor away")
wg = wg_load("simplewiki_core")
wpx = WikiPageX(wg)
doc = wpx(doc)
for span in doc._.wiki_spans:
  print(span._.wiki_pages)

>>> ['An']
>>> ['Apple', 'Apple_(disambiguation)', 'Apple_(company)', 'Apple_(tree)']
>>> ['A', 'A_(musical_note)', 'A_(New_York_City_Subway_service)', 'A_(disambiguation)', 'A_(Cyrillic)']
>>> ['Day']
>>> ['The_Doctor', 'The_Doctor_(Doctor_Who)', 'The_Doctor_(Star_Trek)', 'The_Doctor_(disambiguation)']
>>> ['The']
>>> ['Doctor_(Doctor_Who)', 'Doctor_(Star_Trek)', 'Doctor', 'Doctor_(title)', 'Doctor_(disambiguation)']

ClusterX

The ClusterX pipe takes noun chunks in a text and clusters them using a Radial Ball Mapper algorithm.

from spacy import load as spacy_load
from spikex.pipes import ClusterX

nlp = spacy_load("en_core_web_sm")
doc = nlp("Grab this juicy orange and watch a dog chasing a cat.")
clusterx = ClusterX(min_score=0.65)
doc = clusterx(doc)
for cluster in doc._.cluster_chunks:
  print(cluster)

>>> [this juicy orange]
>>> [a cat, a dog]

AbbrX

The AbbrX pipe finds abbreviations and acronyms in the text, linking short and long forms together:

from spacy import load as spacy_load
from spikex.pipes import AbbrX

nlp = spacy_load("en_core_web_sm")
doc = nlp("a little snippet with an abbreviation (abbr)")
abbrx = AbbrX(nlp.vocab)
doc = abbrx(doc)
for abbr in doc._.abbrs:
  print(abbr, "->", abbr._.long_form)

>>> abbr -> abbreviation

LabelX

The LabelX pipe matches and labels patterns in text, resolving overlaps, abbreviations and acronyms.

from spacy import load as spacy_load
from spikex.pipes import LabelX

nlp = spacy_load("en_core_web_sm")
doc = nlp("looking for a computer system engineer")
patterns = [
  [{"LOWER": "computer"}, {"LOWER": "system"}],
  [{"LOWER": "system"}, {"LOWER": "engineer"}],
]
labelx = LabelX(nlp.vocab, ("TEST", patterns), validate=True, only_longest=True)
doc = labelx(doc)
for labeling in doc._.labelings:
  print(labeling, f"[{labeling.label_}]")

>>> computer system engineer [TEST]
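
Several labelings can also be handled in one pass by giving LabelX an iterable of (label, patterns) pairs; a sketch with made-up labels, assuming the constructor accepts a list of such pairs:

# each entry pairs a label with its list of token patterns
job_patterns = [[{"LOWER": "system"}, {"LOWER": "engineer"}]]
role_patterns = [[{"LOWER": "computer"}]]
labelings = [("JOB", job_patterns), ("ROLE", role_patterns)]
labelx = LabelX(nlp.vocab, labelings, validate=True, only_longest=True)
doc = labelx(nlp("looking for a computer system engineer"))
for labeling in doc._.labelings:
  print(labeling, f"[{labeling.label_}]")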

PhraseX

The PhraseX pipe creates a custom Doc underscore extension and fills it with matches from phrase patterns.

from spacy import load as spacy_load
from spikex.pipes import PhraseX

nlp = spacy_load("en_core_web_sm")
doc = nlp("I have Melrose and McIntosh apples, or Williams pears")
patterns = [
  [{"LOWER": "mcintosh"}],
  [{"LOWER": "melrose"}],
]
phrasex = PhraseX(nlp.vocab, "apples", patterns)
doc = phrasex(doc)
for apple in doc._.apples:
  print(apple)

>>> Melrose
>>> McIntosh

SentX

The SentX pipe splits a text into sentences. It modifies tokens' is_sent_start attribute, so it must be added before the parser pipe in the spaCy pipeline:

from spacy import load as spacy_load
from spikex.pipes import SentX
from spikex.defaults import spacy_version

if spacy_version >= 3:
  from spacy.language import Language

  @Language.factory("sentx")
  def create_sentx(nlp, name):
    return SentX()

nlp = spacy_load("en_core_web_sm")
sentx_pipe = SentX() if spacy_version < 3 else "sentx"
nlp.add_pipe(sentx_pipe, before="parser")
doc = nlp("A little sentence. Followed by another one.")
for sent in doc.sents:
  print(sent)

>>> A little sentence.
>>> Followed by another one.

That's all folks

Feel free to contribute and have fun!

Comments
  • abbreviation difference from scispacy

    Hi! scispacy developer here. Could you share what changes you made to our abbreviation detector? I am curious what issues you encountered/fixed (obviously not bothered at all that you based yours off of ours).

    opened by dakinggg 2
  • fixes 'too many values to unpack' on tuple

    Hello!

    This PR is about fixing issue #6 that I submitted (basically as explained here). Please let me know if it helps (this chunk of documentation is now reproducible).

    opened by hp0404 1
  • Abbrv pipeline errors out

    • spikex version: spikex-0.4.0.dev2 from source / spacy 2.3.5
    • Python version: 3.6
    • Operating System: OSX

    Description

    Describe what you were trying to get done.

    • I was trying to test the abbrv pipeline

    Tell us what happened, what went wrong, and what you expected to happen.

    • Copied the example from README

    What I Did

    import spacy
    from spikex.pipes import AbbrX
    
    nlp = spacy.load("en_core_web_sm")
    
    abbrx = AbbrX(nlp)
    nlp.add_pipe(abbrx)
    doc = abbrx(nlp("a little snippet with abbreviations (abbrs)"))
    doc._.abbrs
    
    205         return (
        206             self.vocab.strings.add(key)
    --> 207             if key not in self.vocab.strings
        208             else self.vocab.strings[key]
        209         )
    
    AttributeError: 'English' object has no attribute 'strings'
    
    opened by trisongz 1
  • labelX pipeline errors out

    • spikex version: 0.5.2
    • Python version: 3.7.11
    • Operating System: google.cloud

    Description

    Describe what you were trying to get done.

    • I was trying to test the labelX pipeline

    Tell us what happened, what went wrong, and what you expected to happen.

    • Copied the example from README

    What I Did

    from spacy import load as spacy_load
    from spikex.pipes import LabelX
    
    nlp = spacy_load("en_core_web_sm")
    doc = nlp("looking for a computer system engineer")
    patterns = [
      [{"LOWER": "computer"}, {"LOWER": "system"}],
      [{"LOWER": "system"}, {"LOWER": "engineer"}],
    ]
    labelx = LabelX(vocab=nlp.vocab, labelings=("TEST", patterns), validate=True, only_longest=True)
    doc = labelx(doc)
    for labeling in doc._.labelings:
      print(labeling, f"[{labeling.label_}]")
    
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-4-da0684755206> in <module>()
          9   [{"LOWER": "system"}, {"LOWER": "engineer"}],
         10 ]
    ---> 11 labelx = LabelX(vocab=nlp.vocab, labelings=("TEST", patterns), validate=True, only_longest=True)
         12 doc = labelx(doc)
         13 for labeling in doc._.labelings:
    
    /usr/local/lib/python3.7/dist-packages/spikex/pipes/labels.py in __init__(self, vocab, labelings, validate, only_longest)
         32         if not labelings or labelings is None:
         33             return
    ---> 34         for label, patterns in labelings:
         35             self.add(label, patterns)
         36 
    
    ValueError: too many values to unpack (expected 2)
    
    opened by hp0404 0
  • Cannot download "enwiki_core"

    Very insightful packages and explanation, thanks a lot!

    Now I encountered a problem. SpikeX provides wikigraphs ready to be used, including enwiki_core, simplewiki_core, and itwiki_core. However, it seems that the website storing these packages has gone (404 not found), so we cannot download "enwiki_core". How can I access the wikigraph package (i.e., enwiki_core) at this moment? I would be very grateful if someone could help me download these packages.

    opened by Hannah123567 0
  • Incomplete list of categories

    • spikex version: 0.5.2
    • Python version: 3.9.7
    • Operating System: Windows 10

    Description

    I want to get all categories of a page, but most categories are missing

    What I Did

    from spikex.wikigraph import load as wg_load
    page = "Peking_2022"
    categories = wg.get_categories(page, distance=1)
    

    What I get: ['Category:Olympische_Winterspiele_2022']
    The output I expect: ['Austragung der Olympischen Winterspiele', 'Olympische Winterspiele 2022', 'Sport (Hebei)', 'Sportveranstaltung 2022', 'Sportveranstaltung in Peking', 'Wikipedia:Veraltet nach Jahr 2022', 'Zukünftige Sportveranstaltung']
    Proof: https://de.wikipedia.org/wiki/Olympische_Winterspiele_2022

    I created a categorylinks dictionary from categorylinks.sql.gz, so that the keys are the page_ids and under each key is the list of categories. I used your functions to get the page_id: page_id = self.get_pageid(self.redirect(page)) together with my categorylinks dictionary. With this method I get the expected output. Unless this behaviour is intended, I suspect there is a problem with the processing of categorylinks.sql.gz on your side.

    opened by Fetzii 1
  • Umlauts

    • spikex version: 0.5.2
    • Python version: 3.9.7
    • Operating System: Windows 10

    Description

    Getting categories for a page with umlauts from my dewiki_core (Cem Özdemir: https://de.wikipedia.org/wiki/Cem_%C3%96zdemir) crashes, which shouldn't happen. There is also an English wiki page for him (https://en.wikipedia.org/wiki/Cem_%C3%96zdemir).

    What I Did

    from spikex.wikigraph import load as wg_load
    wg = wg_load("dewiki_core")
    page = "Cem_Özdemir"
    categories = wg.get_categories(page, distance=1)
    TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
    
    opened by Fetzii 2
  • Creating dewiki_core

    • spikex version: 0.5.2
    • Python version: 3.9.7
    • Operating System: Windows 10

    Description

    I tried to create a German wikigraph and got a type error from compression_wrapper()

    What I Did

    spikex create-wikigraph de_wiki_graph --wiki de --dumps-path de_wiki_dumps

    Traceback (most recent call last):
      File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\runpy.py", line 197, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\Scripts\spikex.exe\__main__.py", line 7, in <module>
      File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\spikex\__main__.py", line 23, in main
        typer.run(commands[command])
      File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\typer\main.py", line 859, in run
        app()
      File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\typer\main.py", line 214, in __call__
        return get_command(self)(*args, **kwargs)
      File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\click\core.py", line 829, in __call__
        return self.main(*args, **kwargs)
      File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\click\core.py", line 782, in main
        rv = self.invoke(ctx)
      File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\click\core.py", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\click\core.py", line 610, in invoke
        return callback(*args, **kwargs)
      File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\typer\main.py", line 497, in wrapper
        return callback(**use_params)  # type: ignore
      File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\spikex\cli\create.py", line 62, in create_wikigraph
        wg = WikiGraph.build(**kwargs)
      File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\spikex\wikigraph\wikigraph.py", line 61, in build
        p, r, d, c, cl = _make_graph_components(**kwargs)
      File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\spikex\wikigraph\wikigraph.py", line 278, in _make_graph_components
        pprops = _get_pprops(**kwargs)
      File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\spikex\wikigraph\wikigraph.py", line 316, in _get_pprops
        for pageid, prop, value in iter_pprops_data:
      File "C:\Users\Friedrich.Schmidt\Anaconda3\envs\ml\lib\site-packages\spikex\wikigraph\dumptools.py", line 211, in _parse_wiki_sql_dump
        ) as pbar, compression_wrapper(compress_obj, "rb") as decompress_obj:
    TypeError: compression_wrapper() missing 1 required positional argument: 'compression'
    

    Additional Information

    https://github.com/RaRe-Technologies/smart_open/blob/develop/smart_open/compression.py -> Line 106: def compression_wrapper(file_obj, mode, compression):

    The current "compression_wrapper" function actually expects another argument called "compression", which is not passed at the moment: compression_wrapper(compress_obj, "rb")

    I fixed the error locally by adding the missing argument: compression_wrapper(compress_obj, "rb", 'infer_from_extension')

    opened by Fetzii 1
  • How to speed up the progress of adding patterns

    • spikex version: 0.5.0
    • Python version:
    • Operating System: linux

    Description

    Hey guys, I found your tool very powerful, thanks for sharing. I ran into a problem: the time cost is huge when adding 30 thousand patterns to initialize LabelX. This process is much slower than spaCy's, so I wonder if there is any solution you can propose?

    opened by Hunter-Leo 1
  • spikex download-wikigraph simplewiki_core

    Hello,

    I tested this command :

    spikex download-wikigraph simplewiki_core

    In a Jupyter Notebook, it returns :

    File "<ipython-input-7-d71a5d9ca149>", line 1
        spikex download-wikigraph simplewiki_core
               ^
    SyntaxError: invalid syntax

    In Anaconda prompt, it returns:

    (base) C:\WINDOWS\system32>spikex download-wikigraph simplewiki_core
    Traceback (most recent call last):
      File "c:\users\ludovic\anaconda3\lib\runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "c:\users\ludovic\anaconda3\lib\runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "C:\Users\Ludovic\anaconda3\Scripts\spikex.exe\__main__.py", line 7, in <module>
      File "c:\users\ludovic\anaconda3\lib\site-packages\spikex\__main__.py", line 23, in main
        typer.run(commands[command])
      File "c:\users\ludovic\anaconda3\lib\site-packages\typer\main.py", line 859, in run
        app()
      File "c:\users\ludovic\anaconda3\lib\site-packages\typer\main.py", line 214, in __call__
        return get_command(self)(*args, **kwargs)
      File "c:\users\ludovic\anaconda3\lib\site-packages\click\core.py", line 829, in __call__
        return self.main(*args, **kwargs)
      File "c:\users\ludovic\anaconda3\lib\site-packages\click\core.py", line 782, in main
        rv = self.invoke(ctx)
      File "c:\users\ludovic\anaconda3\lib\site-packages\click\core.py", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "c:\users\ludovic\anaconda3\lib\site-packages\click\core.py", line 610, in invoke
        return callback(*args, **kwargs)
      File "c:\users\ludovic\anaconda3\lib\site-packages\typer\main.py", line 497, in wrapper
        return callback(**use_params)  # type: ignore
      File "c:\users\ludovic\anaconda3\lib\site-packages\spikex\cli\download.py", line 46, in download_wikigraph
        _run_command(f"wget --quiet --show-progress -O {wg_tar} {wg_url}")
      File "c:\users\ludovic\anaconda3\lib\site-packages\spikex\cli\download.py", line 54, in _run_command
        return run(
      File "c:\users\ludovic\anaconda3\lib\subprocess.py", line 493, in run
        with Popen(*popenargs, **kwargs) as process:
      File "c:\users\ludovic\anaconda3\lib\subprocess.py", line 858, in __init__
        self._execute_child(args, executable, preexec_fn, close_fds,
      File "c:\users\ludovic\anaconda3\lib\subprocess.py", line 1311, in _execute_child
        hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
    FileNotFoundError: [WinError 2] Le fichier spécifié est introuvable

    (base) C:\WINDOWS\system32>

    What can I do?

    Thank you very much for your help !

    opened by lbocken 1
Owner
Erre Quadro Srl