💫 Industrial-strength Natural Language Processing (NLP) in Python

Overview

spaCy: Industrial-strength NLP

spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products.

spaCy comes with pretrained pipelines and currently supports tokenization and training for 60+ languages. It features state-of-the-art speed and neural network models for tagging, parsing, named entity recognition, text classification and more, multi-task learning with pretrained transformers like BERT, as well as a production-ready training system and easy model packaging, deployment and workflow management. spaCy is commercial open-source software, released under the MIT license.

💫 Version 3.0 out now! Check out the release notes here.


📖 Documentation

Documentation
⭐️ spaCy 101 New to spaCy? Here's everything you need to know!
📚 Usage Guides How to use spaCy and its features.
🚀 New in v3.0 New features, backwards incompatibilities and migration guide.
🪐 Project Templates End-to-end workflows you can clone, modify and run.
🎛 API Reference The detailed reference for spaCy's API.
📦 Models Download trained pipelines for spaCy.
🌌 Universe Plugins, extensions, demos and books from the spaCy ecosystem.
👩‍🏫 Online Course Learn spaCy in this free and interactive online course.
📺 Videos Our YouTube channel with video tutorials, talks and more.
🛠 Changelog Changes and version history.
💝 Contribute How to contribute to the spaCy project and code base.

💬 Where to ask questions

The spaCy project is maintained by @honnibal, @ines, @svlandeg and @adrianeboyd. Please understand that we won't be able to provide individual support via email. We also believe that help is much more valuable if it's shared publicly, so that more people can benefit from it.

Type Platforms
🚨 Bug Reports GitHub Issue Tracker
🎁 Feature Requests & Ideas GitHub Discussions
👩‍💻 Usage Questions GitHub Discussions · Stack Overflow
🗯 General Discussion GitHub Discussions

Features

  • Support for 60+ languages
  • Trained pipelines for different languages and tasks
  • Multi-task learning with pretrained transformers like BERT
  • Support for pretrained word vectors and embeddings
  • State-of-the-art speed
  • Production-ready training system
  • Linguistically-motivated tokenization
  • Components for named entity recognition, part-of-speech-tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more
  • Easily extensible with custom components and attributes (see the sketch after this list)
  • Support for custom models in PyTorch, TensorFlow and other frameworks
  • Built in visualizers for syntax and NER
  • Easy model packaging, deployment and workflow management
  • Robust, rigorously evaluated accuracy
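
As a sketch of the extensibility point above, here's a minimal custom pipeline component using the spaCy v3 decorator API (the component name "token_counter" is just an illustrative choice):

import spacy
from spacy.language import Language

@Language.component("token_counter")
def token_counter(doc):
    # A component receives a Doc, may modify it, and must return it
    print(f"Processed {len(doc)} tokens")
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("token_counter", last=True)  # append to the end of the pipeline
doc = nlp("This is a sentence.")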

📖 For more details, see the facts, figures and benchmarks.

Install spaCy

For detailed installation instructions, see the documentation.

  • Operating system: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
  • Python version: Python 3.6+ (64-bit only)
  • Package managers: pip · conda (via conda-forge)

pip

Using pip, spaCy releases are available as source packages and binary wheels. Before you install spaCy and its dependencies, make sure that your pip, setuptools and wheel are up to date.

pip install -U pip setuptools wheel
pip install spacy

To install additional data tables for lemmatization and normalization, you can run pip install spacy[lookups] or install spacy-lookups-data separately. The lookups package is needed to create blank models with lemmatization data, and to lemmatize in languages that don't yet come with pretrained models and aren't powered by third-party libraries.
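
As a minimal sketch, lookup-based lemmatization in a blank pipeline would look roughly like this (assumes the lookups data is installed):

import spacy

nlp = spacy.blank("en")  # blank pipeline with no trained components
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
nlp.initialize()  # loads the lookup tables from spacy-lookups-data
doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])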

When using pip it is generally recommended to install packages in a virtual environment to avoid modifying system state:

python -m venv .env
source .env/bin/activate
pip install -U pip setuptools wheel
pip install spacy

conda

You can also install spaCy from conda via the conda-forge channel. For the feedstock including the build recipe and configuration, check out this repository.

conda install -c conda-forge spacy

Updating spaCy

Some updates to spaCy may require downloading new statistical models. If you're running spaCy v2.0 or higher, you can use the validate command to check if your installed models are compatible and if not, print details on how to update them:

pip install -U spacy
python -m spacy validate

If you've trained your own models, keep in mind that your training and runtime inputs must match. After updating spaCy, we recommend retraining your models with the new version.

📖 For details on upgrading from spaCy 2.x to spaCy 3.x, see the migration guide.

📦 Download model packages

Trained pipelines for spaCy can be installed as Python packages. This means that they're a component of your application, just like any other module. Models can be installed using spaCy's download command, or manually by pointing pip to a path or URL.

Documentation
Available Pipelines Detailed pipeline descriptions, accuracy figures and benchmarks.
Models Documentation Detailed usage and installation instructions.
Training How to train your own pipelines on your data.
# Download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_sm

# pip install .tar.gz archive or .whl from path or URL
pip install /Users/you/en_core_web_sm-3.0.0.tar.gz
pip install /Users/you/en_core_web_sm-3.0.0-py3-none-any.whl
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz

Loading and using models

To load a model, use spacy.load() with the model name or a path to the model data directory.

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")

You can also import a model directly via its full name and then call its load() method with no arguments.

import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp("This is a sentence.")

📖 For more info and examples, check out the models documentation.

Compile from source

The other way to install spaCy is to clone its GitHub repository and build it from source. This is the common approach if you want to make changes to the code base. You'll need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, virtualenv and git installed. The compiler is the trickiest part; how to set it up depends on your system.

Platform
Ubuntu Install system-level dependencies via apt-get: sudo apt-get install build-essential python-dev git.
Mac Install a recent version of Xcode, including the so-called "Command Line Tools". macOS and OS X ship with Python and git preinstalled.
Windows Install a version of the Visual C++ Build Tools or Visual Studio Express that matches the version that was used to compile your Python interpreter.

For more details and instructions, see the documentation on compiling spaCy from source and the quickstart widget to get the right commands for your platform and Python version.

git clone https://github.com/explosion/spaCy
cd spaCy

python -m venv .env
source .env/bin/activate

# make sure you are using the latest pip
python -m pip install -U pip setuptools wheel

pip install -r requirements.txt
pip install --no-build-isolation --editable .

To install with extras:

pip install --no-build-isolation --editable .[lookups,cuda102]

🚦 Run tests

spaCy comes with an extensive test suite. In order to run the tests, you'll usually want to clone the repository and build spaCy from source. This will also install the required development dependencies and test utilities defined in the requirements.txt.

Alternatively, you can run pytest on the tests from within the installed spacy package. Don't forget to also install the test utilities via spaCy's requirements.txt:

pip install -r requirements.txt
python -m pytest --pyargs spacy
Comments
  • Japanese Model

    Japanese Model

    Feature description

    I'd like to add a Japanese model to spaCy. (Let me know if this should be discussed in #3056 instead - I thought it best to just tag it in for now.)

    The Ginza project exists, but currently it's a repackaging of spaCy rather than a model to use with normal spaCy, and I think some of the resources it uses may be tricky to integrate from a licensing perspective.

    My understanding is that the main parts of a model now are 1. the dependency model, 2. NER, and 3. word vectors. Notes on each of those:

    1. Dependencies. For dependency info we can use UD Japanese GSD. UD BCCWJ is bigger but the corpus has licensing issues. GSD is rather small but probably enough to be usable (8k sentences). I have trained it with spaCy and there were no conversion issues.

    2. NER. I don't know of a good dataset for this; Christopher Manning mentioned the same problem two years ago. I guess I could make one based on Wikipedia - I think some other spaCy models use data produced by Nothman et al's method, which skipped Japanese to avoid dealing with segmentation, so that might be one approach. (A reasonable question here is: what do people use for NER in Japanese? Most tokenizer dictionaries, including Unidic, have entity-like information and make it easy to add your own entries, so that's probably the most common approach.)

    3. Vectors. Using JA Wikipedia is no problem. I haven't worked with the Common Crawl before and I'm not sure I have the hardware for it, but if I could get some help on it, that's also an option.

    So, how does that sound? If there are no issues with that, I'll look into creating an NER dataset.

    enhancement models lang / ja 
    opened by polm 182
  • 💫 spaCy v2.0.0 alpha – details, feedback & questions (plus stickers!)

    💫 spaCy v2.0.0 alpha – details, feedback & questions (plus stickers!)

    We're very excited to finally publish the first alpha pre-release of spaCy v2.0. It's still an early release and (obviously) not intended for production use. You might come across a NotImplementedError – see the release notes for the implementation details that are still missing.

    This thread is intended for general discussion, feedback and all questions related to v2.0. If you come across more complex bugs, feel free to open a separate issue.

    Quickstart & overview

    The most important new features

    • New neural network models for English (15 MB) and multi-language NER (12 MB), plus GPU support via Chainer's CuPy.
    • Strings mapped to hash values instead of integer IDs. This means they will always match – even across models.
    • Improved saving and loading, consistent serialization API across objects, plus Pickle support (see the sketch after this list).
    • Built-in displaCy visualizers with Jupyter notebook support.
    • Improved language data with support for lazy loading and multi-language models. Alpha tokenization for Norwegian Bokmål, Japanese, Danish and Polish. Lookup-based lemmatization for English, German, French, Spanish, Italian, Hungarian, Portuguese and Swedish.
    • Revised API for Matcher and language processing pipelines.
    • Trainable document vectors and contextual similarity via convolutional neural networks.
    • Various bug fixes and almost completely re-written documentation.
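
    As a minimal sketch of the serialization point above (assumes a v2-style installed en_core_web_sm model):

    import pickle
    import spacy

    nlp = spacy.load('en_core_web_sm')
    bytes_data = nlp.to_bytes()  # consistent to_bytes()/from_bytes() across objects
    data = pickle.dumps(nlp)     # plain pickle works as well
    nlp2 = pickle.loads(data)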

    Installation

    spaCy v2.0.0-alpha is available on pip as spacy-nightly. If you want to test the new version, we recommend setting up a clean environment first. To install the new model, you'll have to download it with its full name, using the --direct flag.

    pip install spacy-nightly
    python -m spacy download en_core_web_sm-2.0.0-alpha --direct   # English
    python -m spacy download xx_ent_wiki_sm-2.0.0-alpha --direct   # Multi-language NER
    
    import spacy
    nlp = spacy.load('en_core_web_sm')
    
    import en_core_web_sm
    nlp = en_core_web_sm.load()
    

    Alpha models for German, French and Spanish are coming soon!

    Now on to the fun part – stickers!

    stickers

    We just got our first delivery of spaCy stickers and want to share them with you! There's only one small favour we'd like to ask. The part we're currently behind on is the tests – this includes our test suite as well as in-depth testing of the new features and usage examples. So here's the idea:

    • Find something that's currently not covered in the test suite and doesn't require the models, and write a test for it - for example, language-specific tokenization tests.
    • Alternatively, find examples from the docs that haven't been added to the tests yet and add them. Plus points if the examples don't actually work – this means you've either discovered a bug in spaCy, or a bug in the docs! 🎉

    Submit a PR with your test to the develop branch – if the test covers a bug and currently fails, mark it with @pytest.mark.xfail. For more info, see the test suite docs. Once your pull request is accepted, send us your address via email or private message on Gitter and we'll mail you stickers.

    If you can't find anything, don't have time or can't be bothered, that's fine too. Posting your feedback on spaCy v2.0 here counts as well. To be honest, we really just want to mail out stickers 😉

    help wanted 🌙 nightly meta 
    opened by ines 109
  • Build from source with MinGW

    Build from source with MinGW

    I am trying to build from source under MinGW. I noticed that Cython seems to have trouble with relative imports sometimes, but not all the time. I am not using virtualenv, as I have installed the dependencies into my system; I am not sure if that might have something to do with this. Anyway, this is what I am encountering:

    First I ran into this:

    Error compiling Cython file:
    ------------------------------------------------------------
    ...
    from ..vocab cimport Vocab
    ^
    ------------------------------------------------------------
    
    spacy/serialize/packer.pxd:1:0: 'vocab.pxd' not found
    
    Error compiling Cython file:
    ------------------------------------------------------------
    ...
    from ..vocab cimport Vocab
    ^
    ------------------------------------------------------------
    
    spacy/serialize/packer.pxd:1:0: 'vocab/Vocab.pxd' not found
    

    So I edited it to use absolute path for the module:

    --- a/spacy/serialize/packer.pxd
    +++ b/spacy/serialize/packer.pxd
    @@ -1,4 +1,4 @@
    -from ..vocab cimport Vocab
    +from spacy.vocab cimport Vocab
    

    and the compile then succeeded. I also had to do the same for spacy/syntax/transition_system.pxd and spacy/tokens/doc.pxd. I was able to compile the following DLLs:

    $ ls spacy/*.dll
    spacy/_ml-cpython-34m.dll      spacy/morphology-cpython-34m.dll
    spacy/_theano-cpython-34m.dll  spacy/orth-cpython-34m.dll
    spacy/attrs-cpython-34m.dll    spacy/parts_of_speech-cpython-34m.dll
    spacy/cfile-cpython-34m.dll    spacy/strings-cpython-34m.dll
    spacy/gold-cpython-34m.dll     spacy/tagger-cpython-34m.dll
    spacy/lexeme-cpython-34m.dll   spacy/tokenizer-cpython-34m.dll
    spacy/matcher-cpython-34m.dll  spacy/vocab-cpython-34m.dll
    

    Now I am having trouble with

    spacy/syntax/ner.cpp: In function 'int __pyx_f_5spacy_6syntax_3ner_13BiluoPushDown_preprocess_gold(__pyx_obj_5spacy_6syntax_3ner_BiluoPushDown*, __pyx_obj_5spacy_4gold_GoldParse*)':
    spacy/syntax/ner.cpp:3532:38: error: no match for 'operator=' (operand types are '__pyx_t_5spacy_24syntax_dot_transition_system_Transition' and '__pyx_t_5spacy_6syntax_17transition_system_Transition')
         (__pyx_v_gold->c.ner[__pyx_v_i]) = __pyx_t_4;
    

    which looks like it could be an issue related to imported names?

    I wonder if you have seen this kind of problem before. I am using up-to-date MSYS2/MinGW packages. My versions:

    $ python3 --version
    Python 3.4.3
    $ cython --version
    Cython version 0.23.beta1
    
    opened by htzh 78
  • 💫 Better, faster and more customisable matcher

    💫 Better, faster and more customisable matcher

    Related issues: #1567, #1711, #1819, #1939, #1945, #1951, #2042

    We're currently in the process of rewriting the match loop, fixing long-standing issues and making it easier to extend the Matcher and PhraseMatcher. The community contributions by @GregDubbin and @savkov have already made a big difference – we can't wait to get it all ready and shipped.

    This issue discusses some of the planned new features and additions to the match patterns API, including matching by custom extension attributes (Token._.), regular expressions, set membership and rich comparison for numeric values.

    New features

    Custom extension attributes

    spaCy v2.0 introduced custom extension attributes on the Doc, Span and Token. Custom attributes make it easier to attach arbitrary data to the built-in objects, and let users take advantage of spaCy's data structures and the Doc object as the "single source of truth". However, not being able to match on custom attributes was quite limiting (see #1499, #1825).

    The new patterns spec will allow an _ space on token patterns, which can map to a dictionary keyed by the attribute names:

    Token.set_extension('is_fruit', getter=lambda token: token.text in ('apple', 'banana'))
    
    pattern = [{'LEMMA': 'have'}, {'_': {'is_fruit': True}}]
    matcher.add('HAVING_FRUIT', None, pattern)
    

    Both regular attribute extensions (with a default value) and property extensions (with a getter) will be supported and can be combined for more exact matches.

    pattern = [{'_': {'is_fruit': True, 'fruit_color': 'red', 'fruit_rating': 5}}]
    

    Rich comparison for numeric values

    Token patterns already allow specifying a LENGTH (the token's character length). However, matching tokens of between five and ten characters previously required adding 6 copies of the exact same pattern, introducing unnecessary overhead. Numeric attributes can now also specify a dictionary with the predicate (e.g. '>' or '<=') mapped to the value. For example:

    pattern = [{'ENT_TYPE': 'ORG', 'LENGTH': 5}]          # exact length
    pattern = [{'ENT_TYPE': 'ORG', 'LENGTH': {'>=': 5}}]  # length with predicate
    

    The second pattern above will match a token with the entity type ORG that's 5 or more characters long. Combined with custom attributes, this allows very powerful queries combining both linguistic features and numeric data:

    # match a token based on custom numeric attributes
    pattern = [{'_': {'fruit_rating': {'>': 7}, 'fruit_weight': {'>=': 100, '<': 300}}}]
    
    # match a verb with ._.sentiment_score >= 0.5 and one token on each side
    pattern = [{}, {'POS': 'VERB', '_': {'sentiment_score': {'>=': 0.5}}}, {}]
    

    Defining predicates and values as a dictionary instead of a single string like '>=5' allows us to avoid string parsing, and lets spaCy handle custom attributes without requiring the user to specify their types upfront. (While we know the type of the built-in LENGTH attribute, spaCy has no way of knowing whether the value '<3' of a custom attribute should be interpreted as "less than 3", or the heart emoticon.)

    Set membership

    This is another feature that has been requested before and will now be much easier to implement. Similar to the predicate mapping for numeric values, token attributes can now also be defined as dictionaries. The keys IN or NOT_IN can be used to indicate set membership and non-membership.

    pattern = [{'LEMMA': {'IN': ['like', 'love']}}, 
               {'LOWER': {'IN': ['apples', 'bananas']}}]
    

    The above pattern will match a token with the lemma "like" or "love", followed by a token whose lowercase form is either "apples" or "bananas". For example, "loving apples" or "likes bananas". Lists can be used for all non-boolean values, including custom _ attributes:

    # verb or conjunction followed by custom is_fruit token
    pattern = [{'POS': {'IN': ['VERB', 'CONJ', 'CCONJ']}}, 
               {'_': {'is_fruit': True, 'fruit_color': {'NOT_IN': ['red', 'yellow']}}}]
    
    # set membership of numeric custom attributes
    pattern = [{'_': {'is_customer': True, 'customer_year': {'IN': [2018, 2017, 2016]}}}]
    
    # combination of predicates and non-membership
    pattern = [{'_': {'custom_count': {'<=': 100, 'NOT_IN': [12, 66, 79]}}}]
    

    Regular expressions

    Using regular expressions within token patterns is already possible via custom binary flags (see #1567). However, this has some inconvenient limitations – including the patterns not being JSON-serializable. If the solution is to add binary flags, spaCy might as well take care of that. The following example is based on the work by @savkov (see #1833):

    pattern = [{'ORTH': {'REGEX': '^([Uu](\\.?|nited) ?[Ss](\\.?|tates))'}},
               {'LOWER': 'president'}]
    

    'REGEX' as an operator (instead of a top-level property that only matches on the token's text) allows defining rules for any string value, including custom attributes:

    # match tokens with fine-grained POS tags starting with 'V'
    pattern = [{'TAG': {'REGEX': '^V'}}]
    
    # match custom attribute values with regular expressions
    pattern = [{'_': {'country': {'REGEX': '^([Uu](\\.?|nited) ?[Ss](\\.?|tates))'}}}]
    

    New operators

    TL;DR: The new patterns spec will allow two ways of defining properties – attribute values for exact matches and dictionaries using operators for more fine-grained matches.

    {
        PROPERTY: value,                  # exact match
        PROPERTY: {OPERATOR: value, ...}  # match with operators
    }
    

    The following operators can be used within dictionaries describing attribute values:

    | Operator | Value type | Description | Example |
    | --- | --- | --- | --- |
    | ==, >=, <=, >, < | int, float | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. | 'LENGTH': {'>': 10} |
    | IN | any | Attribute value is a member of a list. | 'LEMMA': {'IN': ['like', 'love']} |
    | NOT_IN | any | Attribute value is not a member of a list. | 'POS': {'NOT_IN': ['NOUN', 'PROPN']} |
    | REGEX | unicode | Attribute value matches a regular expression. | 'TAG': {'REGEX': '^V'} |
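
    Putting these together, here's a hedged sketch of such a pattern against the Matcher API as it later shipped (note the matcher.add(name, [pattern]) signature used in current spaCy versions; the pattern itself is illustrative):

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.blank('en')
    matcher = Matcher(nlp.vocab)

    # set membership on LOWER plus a rich comparison on LENGTH
    pattern = [{'LOWER': {'IN': ['like', 'love']}},
               {'LENGTH': {'>=': 5}}]
    matcher.add('LIKE_LONG_WORD', [pattern])

    doc = nlp('I like bananas')
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)  # "like bananas"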

    API improvements and bug fixes

    See @honnibal's comments in #1945 and the feature/better-faster-matcher branch for more details and implementation examples.

    Other fixes

    • [x] #1711: Remove hard-coded length limit of 10 on the PhraseMatcher.
    • [x] #1939: Fix pickling of PhraseMatcher.
    • [x] #1951: Matcher.pipe should yield matches instead of Doc objects.
    • [ ] #2042: Support deleting rules in PhraseMatcher.
    • [x] Accept "TEXT" as an alternative to "ORTH" (for consistency).
    enhancement feat / matcher perf / speed 
    opened by ines 64
  • v2 standard pipeline running 10x slower

    v2 standard pipeline running 10x slower

    Your Environment

    Info about spaCy

    • Python version: 2.7.13
    • Platform: Linux-4.10.0-38-generic-x86_64-with-debian-stretch-sid
    • spaCy version: 2.0.0
    • Models: en

    I just updated to v2.0. Not sure what changed, but the exact same pipeline of documents called in the standard nlp = spacy.load('en'); nlp(u"string") way is now 10x slower.

    usage perf / speed 
    opened by hodsonjames 56
  • PR for testing Thinc 703

    PR for testing Thinc 703

    Description

    dummy PR for testing purposes only - should not be merged

    PR to test the CI with the Thinc branch for https://github.com/explosion/thinc/pull/703

    🔮 thinc 
    opened by svlandeg 55
  • Use in Apache Spark / English() object cannot be pickled

    Use in Apache Spark / English() object cannot be pickled

    For spaCy to work out of the box with Apache Spark, the language models need to be pickled so that they can be initialised on the master node and then sent to the workers.

    This currently doesn't work with plain pickle, failing as follows:

    >>> from __future__ import unicode_literals, print_function
    >>> from spacy.en import English
    >>> import pickle
    >>> nlp = English()
    >>> nlpp = pickle.dumps(nlp)
    Traceback (most recent call last):
    [...]
    TypeError: can't pickle Vocab objects
    

    Apache Spark ships with a package called cloudpickle, which is meant to support a wider set of Python constructs, but serialisation with cloudpickle also fails, resulting in a segmentation fault:

    >>> from pyspark import cloudpickle
    >>> pickled_nlp = cloudpickle.dumps(nlp)
    >>> nlpp = pickle.loads(pickled_nlp)
    >>> nlpp('test text')
    Segmentation fault
    

    By default Apache Spark uses pickle, but can be told to use cloudpickle instead.

    Currently a feasible workaround is lazy loading of the language models on the worker nodes:

    global nlp
    def lazyloaded_nlp(s):
        global nlp
        try:
            return nlp(s)
        except NameError:
            # nlp isn't defined yet on this worker: load the model lazily on first use
            nlp = English()
            return nlp(s)
    

    The above works. Nevertheless, I wonder if it would be possible to make the English() object pickleable? If it's not too difficult on your end, having the language models pickleable would provide a better out-of-the-box experience for Apache Spark users.

    enhancement 🌙 nightly 
    opened by aeneaswiener 53
  • Segmentation fault training NER with large number of training examples  #1757 #1335

    Segmentation fault training NER with large number of training examples #1757 #1335

    Re-opening this as a new issue specifically related to NER ~~batch size~~ training with many examples. Relates to #1757 #1335 (which appear to be closed).

    Training NER on 500+ examples throws segmentation fault error:

    Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

    Anybody found a solution/workaround for this? Thanks!

    Info about spaCy:

    • Python version: 3.6.3
    • spaCy version: 2.0.5
    • Models: en, en_core_sm
    • Platform: macOS

    bug training feat / ner 
    opened by nikeqiang 43
  • Timeout Downloading Models

    Timeout Downloading Models

    How to reproduce the behaviour

    My GitHub Action tries to download models as follows:

    python -m spacy download en_core_web_lg
    

    But it sometimes gives timeout errors:

    ERROR: Could not install packages due to an OSError: HTTPSConnectionPool(host='objects.githubusercontent.com', port=443): Max retries exceeded with url: /github-production-release-asset-2e65be/84940268/ee782580-63d4-11eb-9a2f-4a14ddffedbb?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20211103%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20211103T074829Z&X-Amz-Expires=300&X-Amz-Signature=4a4170665e395bcd6d5c55886d9fdc8d982870ee5954f34ef0d681b9ded628a2&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=84940268&response-content-disposition=attachment%3B%20filename%3Den_core_web_lg-3.0.0-py3-none-any.whl&response-content-type=application%2Foctet-stream (Caused by ReadTimeoutError("HTTPSConnectionPool(host='objects.githubusercontent.com', port=443): Read timed out. (read timeout=15)"))
    

    Not sure if this is related to this https://github.com/explosion/spaCy/issues/5260

    Your Environment

    • Operating System: Github Runners (Ubuntu, Windows, and Mac)
    • Python Version Used: 3.7
    • spaCy Version Used: 3.0.0
    • Environment Information: pip
    install models third-party 
    opened by lalitpagaria 42
  • 💫 Participating in CoNLL 2018 Universal Dependencies evaluation (Team spaCy?)

    💫 Participating in CoNLL 2018 Universal Dependencies evaluation (Team spaCy?)

    Update 06/06/2018. Best way to run the CoNLL experiments is:

    git clone https://github.com/explosion/spaCy -b develop
    cd spaCy
    make
    ./dist/spacy.pex ud-train --help
    

    The Conference on Computational Natural Language Learning (CoNLL) 2017 shared task is a great standard for evaluating parsing algorithms. Unlike previous parsing evaluations, CoNLL 2017 is end-to-end: from raw text to dependencies, across many languages. While we missed the 2017 evaluation, I'd like to participate in 2018.

    To participate in CoNLL 2018, we would need to:

    • Adapt tokenizers to match UD tokenization more closely.

    • Add pipeline component for statistical lemmatization, to improve lemmatizer coverage across languages.

    • Add pipeline component to predict morphological tags.

    • Support joint segmentation and tagging or parsing, for languages like Chinese.

    All of these are great goals, regardless of the competition! However, it's a lot of work, especially the tokenization, which really needs speakers of the various languages.

    Even if we don't get everything done in time to participate in the official evaluation, it will be a great step for spaCy to publish accuracy figures using the official evaluation software and methodology. This will allow direct comparison against other systems, and make quality control across languages much easier.

    What would be really awesome is if we got a few people working on this together, so we could participate as "Team spaCy". Ideally we'd have people taking ownership of some of the main languages, e.g. French, Spanish, German, Chinese, Japanese etc. It's much easier to work on a specific language that you're well familiar with. The official evaluation will consider all languages equally, but I'm okay with having low accuracy on like, Ancient Greek or Dothraki.

    The official testing period will run April 30 to June 26. However, we can get started right away by working with the CoNLL 2017 data.

    To get started, I've made a quick script to run an experiment, which I've been testing on the English data. You can run it by building the feature/better-gold branch, and running the examples/training/conllu.py script like so:

    python examples/training/conllu.py en ~/data/ud-treebanks-conll2017/UD_English/en-ud-train.conllu ~/data/ud-treebanks-conll2017/UD_English/en-ud-train.txt  ~/data/ud-treebanks-conll2017/UD_English/en-ud-dev.conllu ~/data/ud-treebanks-conll2017/UD_English/en-ud-dev.txt /tmp/dev.conllu
    

    This will write you an output file /tmp/dev.conllu after each training epoch, which you can pass into the official CoNLL 2017 evaluation scorer. Scores currently suck, as there are various things to tweak and fix --- but at least the evaluation runs.

    enhancement help wanted 
    opened by honnibal 41
  • 💫 Entity Linking in spaCy

    💫 Entity Linking in spaCy

    Feature description

    With @honnibal & @ines we have been discussing adding an Entity Linking module to spaCy. This module would run on top of NER results and disambiguate & link tagged mentions to a knowledge base. We are thinking of implementing this in a few different phases:

    1. Implement an efficient encoding of a knowledge base + all APIs / interfaces, to integrate with the current processing pipeline (see the sketch after this list). We would take the following components of EL into account:
      • Candidate generation
      • Encoding document context
      • Encoding local context
      • Type prediction
      • Coreference resolution / ensuring global consistency
    2. Implement a model that links English texts to English Wikipedia entries
    3. Implement a cross-lingual model that links non-English texts to English Wikipedia entries
    4. Fine-tune WP linking models to be able to ship them as such
    5. Implement support in Prodigy to perform custom EL annotations for your specific project
    6. Test / implement the models on a different domain & non-wikipedia knowledge base
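
    For a rough idea of what the knowledge base API could look like, here's a hedged sketch based on the KnowledgeBase interface that later shipped in spaCy v2.2+ (the entity ID, frequency and vectors below are purely illustrative):

    import spacy
    from spacy.kb import KnowledgeBase

    nlp = spacy.blank('en')
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)

    # register an entity with a frequency and a (pretrained) entity vector
    kb.add_entity(entity='Q42', freq=12, entity_vector=[1.0, 2.0, 3.0])

    # map a surface form ("alias") to candidate entities with prior probabilities
    kb.add_alias(alias='Douglas Adams', entities=['Q42'], probabilities=[0.9])

    print(kb.get_size_entities(), kb.get_size_aliases())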

    Notes

    As some prior research, we compiled some notes on this project & its requirements: https://drive.google.com/file/d/1UYnPgx3XjhUx48uNNQ3kZZoDzxoinSBF. This contains more details on the EL components and implementation phases.

    Feedback requested

    We will start implementing the APIs soon, but we would love to hear your ideas, suggestions, requests with respect to this new functionality first!

    enhancement feat / ner 
    opened by svlandeg 40
  • Fix inconsistency in displaCy docs about page option

    Fix inconsistency in displaCy docs about page option

    Description

    The page option, which wraps the output SVG in HTML, is true by default for serve but not for render. The render docs were wrong though, so this updates them.

    Types of change

    Minor docs fix

    Checklist

    • [x] I confirm that I have the right to submit this contribution under the project's MIT license.
    • [ ] I ran the tests, and all new and existing tests passed.
    • [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
    docs feat / visualizers 
    opened by polm 0
  • nlp.rehearse with textcat and tok2vec listener

    nlp.rehearse with textcat and tok2vec listener

    Description

    Related to #12044

    When using nlp.rehearse on a textcat pipeline with a tok2vec listener, it throws ValueError: [E953] Mismatched IDs received by the Tok2Vec listener. This is not the case when using an inline tok2vec listener. (The same goes for when using textcat_multilabel)

    This PR aims to fix the issue, however it is still WIP and currently only contains the failing unit tests.

    Types of change

    bug fix

    Checklist

    • [x] I confirm that I have the right to submit this contribution under the project's MIT license.
    • [ ] I ran the tests, and all new and existing tests passed.
    • [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
    bug feat / textcat feat / tok2vec 
    opened by thomashacker 0
  • Mismatched IDs error when using nlp.rehearse on textcat

    Mismatched IDs error when using nlp.rehearse on textcat

    Discussed in https://github.com/explosion/spaCy/discussions/10861

    Using nlp.rehearse on a textcat pipeline with a tok2vec listener results in ValueError: [E953] Mismatched IDs. This is not the case when using tok2vec directly within the textcat component. The same goes for textcat_multilabel.

    Originally posted by nashcaps2255, May 27, 2022: I have a textcat multilabel model which I am trying to update with nlp.rehearse to alleviate the catastrophic forgetting problem.

    import spacy
    from spacy.training import Example

    nlp = spacy.load('my_model')

    examples = []
    for line in file_:
        text, label = line.split("|")
        doc = nlp(text)
        gold_dict = {"cats": {label: float(1)}}
        # build an Example from the predicted doc and the gold annotations
        example = Example.from_dict(doc, gold_dict)
        examples.append(example)

    optimizer = nlp.resume_training()
    nlp.rehearse(examples, sgd=optimizer)
    

    Results in......

    ValueError: [E953] Mismatched IDs received by the Tok2Vec listener: 179568814531392983158587824 vs. 2172509679243279887229
    
    bug training feat / textcat 
    opened by thomashacker 0
  • Delete unused imports for StringStore

    Delete unused imports for StringStore

    Description

    This PR removes unused imports for StringStore from lexeme and tokenizer.

    Types of change

    Checklist

    • [x] I confirm that I have the right to submit this contribution under the project's MIT license.
    • [x] I ran the tests, and all new and existing tests passed.
    • [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
    opened by tetsuok 0
  • Memory leak when processing a large number of documents with Spacy transformers

    Memory leak when processing a large number of documents with Spacy transformers

    I have a spaCy distilbert transformer model trained for NER. When I use this model for predictions on a large corpus of documents, the RAM usage spikes up very quickly, and then keeps increasing over time until I run out of memory and my process gets killed. I am running this on a CPU AWS machine (m5.12xlarge). I see the same behavior when using the en_core_web_trf model.

    The following code can be used to reproduce error with en_core_web_trf model

    
    import os
    import pickle
    import spacy

    print(f"CPU Count: {os.cpu_count()}")

    model = spacy.load("en_core_web_trf")

    ## Docs are English text documents with average character length of 2479, std dev 3487, max 69000
    docs = pickle.load(open("memory_analysis/data/docs.p", "rb"))
    print(len(docs))

    for i, body in enumerate(docs):
        if i == 10000:
            break
        ## spaCy prediction
        list(model.pipe([body], disable=["tok2vec", "parser", "attribute_ruler", "lemmatizer"]))
        if i % 400 == 0:
            print(f"Doc number: {i}")

    Environment:

    spacy-transformers==1.1.8
    spacy==3.4.3
    torch==1.12.1
    

    Additional info: I notice that the model vocab length and the cached string store grow with the processed documents as well, although I'm unsure if this is causing the memory leak. I tried periodically reloading the model, but that does not help either.

    Using Memray for memory usage analysis:

    python3 -m memray run -o memory_usage_trf_max.bin  memory_analysis.py
    python3 -m memray flamegraph memory_usage_trf_max_len.bin    
    
    opened by saketsharmabmb 0
  • Fix required maximum version of typing-extensions

    Fix required maximum version of typing-extensions

    Description

    This PR fixes the required maximum version of typing-extensions.

    Currently it is bounded to <4.2.0: typing_extensions>=3.7.4.1,<4.2.0; python_version < "3.8"

    This PR sets the upper bound to all compatible versions, until the next major release <5.0.0.

    Required:

    • [ ] https://github.com/explosion/confection/pull/20
    • [ ] https://github.com/explosion/thinc/pull/833

    See:

    • https://github.com/explosion/spaCy/issues/12034

    See issue in pydantic:

    • https://github.com/pydantic/pydantic/issues/4885

    See fixing PR in pydantic (typing-extensions>=4.2.0), which will be incompatible with your requirement typing_extensions>=3.7.4,<4.2.0; python_version < "3.8":

    • https://github.com/pydantic/pydantic/pull/4886

    Types of change

    Checklist

    • [ ] I confirm that I have the right to submit this contribution under the project's MIT license.
    • [ ] I ran the tests, and all new and existing tests passed.
    • [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
    install third-party 
    opened by albertvillanova 1
Releases(v3.0.9)
  • v3.0.9(Dec 16, 2022)

    This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.

    🔴 Bug fixes

    • #11331, #11701: Clean up warnings in spaCy and its test suite.
    • #11845: Don't raise an error in displaCy for unset spans keys.
    • #11864: Add smart_open requirement and update deprecated options.
    • #11899: Fix spacy init config --gpu for environments without spacy-transformers.
    • #11933: Update for compatibility with NumPy v1.24+ integer conversions.
    • #11935: Restore missing error messages for beam search.

    👥 Contributors

    @adrianeboyd, @honnibal, @ines, @polm, @svlandeg

  • v2.3.9(Dec 16, 2022)

    This release addresses future compatibility with NumPy v1.24+.

    🔴 Bug fixes

    • #11940: Update for compatibility with NumPy v1.24+ integer conversions.

    👥 Contributors

    @adrianeboyd, @honnibal, @ines, @svlandeg

  • v3.4.4(Dec 14, 2022)

    This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.

    🔴 Bug fixes

    • #11845: Don't raise an error in displaCy for unset spans keys.
    • #11860: Fix spancat for docs with zero suggestions.
    • #11864: Add smart_open requirement and update deprecated options.
    • #11899: Fix spacy init config --gpu for environments without spacy-transformers.
    • #11933: Update for compatibility with NumPy v1.24+ integer conversions.
    • #11934: Add strings when initializing from labels in EditTreeLemmatizer.
    • #11935: Restore missing error messages for beam search.

    👥 Contributors

    @adrianeboyd, @danieldk, @honnibal, @ines, @polm, @svlandeg

  • v3.3.2(Dec 16, 2022)

    This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.

    🔴 Bug fixes

    • #10911, #11194: Improve speed in precomputable_biaffine by avoiding concatenation.
    • #11276, #11331, #11701: Clean up warnings in spaCy and its test suite.
    • #11845: Don't raise an error in displaCy for unset spans keys.
    • #11860: Fix spancat for docs with zero suggestions.
    • #11864: Add smart_open requirement and update deprecated options.
    • #11899: Fix spacy init config --gpu for environments without spacy-transformers.
    • #11933: Update for compatibility with NumPy v1.24+ integer conversions.
    • #11934: Add strings when initializing from labels in EditTreeLemmatizer.
    • #11935: Restore missing error messages for beam search.

    👥 Contributors

    @adrianeboyd, @danieldk, @honnibal, @ines, @polm, @svlandeg

  • v3.2.5(Dec 16, 2022)

    This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.

    🔴 Bug fixes

    • #10573: Remove Click pin following Typer updates.
    • #11331, #11701: Clean up warnings in spaCy and its test suite.
    • #11845: Don't raise an error in displaCy for unset spans keys.
    • #11860: Fix spancat for docs with zero suggestions.
    • #11864: Add smart_open requirement and update deprecated options.
    • #11899: Fix spacy init config --gpu for environments without spacy-transformers.
    • #11933: Update for compatibility with NumPy v1.24+ integer conversions.
    • #11935: Restore missing error messages for beam search.

    👥 Contributors

    @adrianeboyd, @honnibal, @ines, @polm, @svlandeg

  • v3.1.7(Dec 16, 2022)

    This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.

    🔴 Bug fixes

    • #10573: Remove Click pin following Typer updates.
    • #11331, #11701: Clean up warnings in spaCy and its test suite.
    • #11845: Don't raise an error in displaCy for unset spans keys.
    • #11860: Fix spancat for docs with zero suggestions.
    • #11864: Add smart_open requirement and update deprecated options.
    • #11899: Fix spacy init config --gpu for environments without spacy-transformers.
    • #11933: Update for compatibility with NumPy v1.24+ integer conversions.
    • #11935: Restore missing error messages for beam search.

    👥 Contributors

    @adrianeboyd, @honnibal, @ines, @polm, @svlandeg

  • v3.4.3(Nov 10, 2022)

    ✨ New features and improvements

    • Extend Typer support to v0.7.x (#11720).

    🔴 Bug fixes

    • #11640: Handle docs with no entities in EntityLinker.
    • #11688: Restore custom doc extension values in Doc.to_json() for attributes set by getters.
    • #11706: Remove incorrect warning for pipeline_package.load().
    • #11735: Improve spacy project requirements checks for unsupported specifiers and requirements lines.
    • #11745: Revert modifications to spacy.load(disable=) that could enable currently disabled components.

    👥 Contributors

    @aaronzipp, @adrianeboyd, @honnibal, @ines, @polm, @rmitsch, @ryndaniels, @svlandeg, @thomashacker

  • v3.4.2(Oct 20, 2022)

    ✨ New features and improvements

    • NEW: Luganda language support (#10847).
    • NEW: Latin language support (#11349).
    • NEW: spacy.ConsoleLogger.v2 optionally saves training logs to JSONL (#11214).
    • NEW: New operators for the DependencyMatcher to include matching parents or children to the left or the right of the node (#10371).
    • Prebuilt Python 3.11 wheels are now available for all spaCy dependencies distributed by @explosion.
    • Support pydantic v1.10 and mypy 0.980+, drop mypy support for Python 3.6 (#11546, #11635).
    • Support CuPy v11 and add extras for cuda11x and cuda-autodetect (using cupy-wheel) (#11279).
    • Support custom attributes for tokens and spans in Doc.to_json() and Doc.from_json() (#11125).
    • Make the enable and disable options for spacy.load() more consistent (#11459).
    • Allow a single string argument for disable/enable/exclude for spacy.load() (#11406; see the sketch after this list).
    • New --url flag for spacy info to print the direct download URL for a pipeline (#11175).
    • Add a check for missing requirements in the spacy project CLI (#11226).
    • Add a Levenshtein distance function (#11418).
    • Improvements to the spacy debug data CLI for spancat data (#11504).
    • Allow overriding spacy_version in spacy package metadata (#11552).
    • Improve the error message when using the wrong command for spacy project assets (#11458).
    • Ensure parent directories are created when storing the results of the spacy pretrain command (#11210).
    • Extend support to newer versions of natto-py for the ko extra (#11222).
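
    As a quick illustration of the enable/disable/exclude changes above, a minimal sketch (assumes en_core_web_sm is installed):

    import spacy

    # lists work as before
    nlp = spacy.load("en_core_web_sm", disable=["parser", "lemmatizer"])

    # a single string is now accepted as well
    nlp = spacy.load("en_core_web_sm", exclude="ner")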

    📦 Trained pipelines updates

    This release includes updated English pipelines for spaCy v3.4 with improved NER performance. The updates in en_core_web_* v3.4.1 address issues related to training from data with partial named entity annotation, which led to lower NER recall in English pipeline versions v3.0.0–v3.4.0. In particular, entities that appear in the sections of the OntoNotes training data without NER annotation were not predicted consistently by the earlier pipeline versions, such as names and places that are frequent in the Biblical sections, e.g., "David" and "Egypt" (see #7493).

    Use spacy download to update your English pipelines to the newest version. If you'd prefer to keep using an earlier version, you can specify the version directly with e.g. spacy download -d en_core_web_sm-3.4.0. You can check that you are using the new version (v3.4.1) with spacy validate:

    NAME                     SPACY            VERSION
    en_core_web_md           >=3.4.0,<3.5.0   3.4.1     ✔
    

    🔴 Bug fixes

    • #11275: Fix Dutch noun chunks to skip overlapping spans.
    • #11276: Fix regex invalid escape sequences.
    • #11312: Better handling of unexpected types in SetPredicate.
    • #11460: Fix config validation failures caused by NVTX pipeline wrappers.
    • #11506: Avoid unwanted side effects in Doc.__init__.
    • #11540: Preserve missing entity annotation in augmenters.
    • #11592: Fix issues with DVC commands.
    • #11631: Fix initialization for pymorphy2_lookup lemmatizer mode for Russian and Ukrainian.

    ⚠️ Backwards incompatibilities

    • If you're using a custom component that does not return a Doc type, an error will now be raised (#11424).
    • If you're using a dot in a factory name, an error is raised as this is not supported (#11336).

    👥 Contributors

    @adrianeboyd, @bdura, @danieldk, @diyclassics, @DSLituiev, @GabrielePicco, @honnibal, @ines, @JulesBelveze, @kadarakos, @ljvmiranda921, @ninjalu, @pmbaumgartner, @polm, @radandreicristian, @richardpaulhudson, @rmitsch, @shadeMe, @stefawolf, @svlandeg, @thomashacker, @tobiusaolo, @tzussman , @yasufumy

  • v2.3.8(Oct 19, 2022)

  • v3.4.1(Jul 26, 2022)

    🔴 Bug fixes

    • Fix issue #11137: Fix compatibility with CuPy v9.x.

    👥 Contributors

    @adrianeboyd, @danieldk, @honnibal, @ines, @lll-lll-lll-lll, @Lucaterre, @MaartenGr, @mr-bjerre, @polm, @radenkovic

  • v3.4.0(Jul 12, 2022)

    ✨ New features and improvements

    • Support for mypy 0.950+ and pydantic v1.9 (#10786).
    • Prebuilt linux aarch64 wheels are now available for all spaCy dependencies distributed by @explosion.
    • Min/max {n,m} operator for Matcher patterns (#10981).
    • Language updates:
      • Improve tokenization for Cyrillic combining diacritics (#10837).
      • Improve English tokenizer exceptions for contractions with this/that/these/those (#10873).
    • Improved speed of vector lookups (#10992).
    • For the parser, use C saxpy/sgemm provided by the Ops implementation in order to use Accelerate through thinc-apple-ops (#10773).
    • Improved speed of Example.get_aligned_parse and Example.get_aligned (#10952).
    • Improved speed of StringStore lookups (#10938).
    • Updated spacy project clone to try both main and master branches by default (#10843).
    • Added confidence threshold for named entity linker (#11016).
    • Improved handling of Typer optional default values for init_config_cli (#10788).
    • Added cycle detection in parser projectivization methods (#10877).
    • Added counts for NER labels in debug data (#10960).
    • Support for adding NVTX ranges to TrainablePipe components (#10965).
    • Support env variable SPACY_NUM_BUILD_JOBS to specify the number of build jobs to run in parallel with pip (#11073).

    📦 Trained pipelines updates

    We have added new pipelines for Croatian that use the trainable lemmatizer and floret vectors.

    | Package | UPOS | Parser LAS | NER F |
    | --- | ---: | ---: | ---: |
    | hr_core_news_sm | 96.6 | 77.5 | 76.1 |
    | hr_core_news_md | 97.3 | 80.1 | 81.8 |
    | hr_core_news_lg | 97.5 | 80.4 | 83.0 |

    🙏 Special thanks to @gtoffoli for help with the new pipelines!

    The English pipelines have new word vectors:

    | Package | Model Version | TAG | Parser LAS | NER F |
    | --- | --- | ---: | ---: | ---: |
    | en_core_web_md | v3.3.0 | 97.3 | 90.1 | 84.6 |
    | en_core_web_md | v3.4.0 | 97.2 | 90.3 | 85.5 |
    | en_core_web_lg | v3.3.0 | 97.4 | 90.1 | 85.3 |
    | en_core_web_lg | v3.4.0 | 97.3 | 90.2 | 85.6 |

    All CNN pipelines have been extended to add whitespace augmentation.

    🔴 Bug fixes

    • Fix issue #10960: Support hyphens in NER labels.
    • Fix issue #10994: Fix horizontal spacing for spans in displaCy.
    • Fix issue #11013: Check for any token with a vector in Doc.has_vector, distinguish 0-vectors and missing vectors in similarity warnings.
    • Fix issue #11056: Don't use get_array_module in textcat.
    • Fix issue #11092: Fix vertical alignment for spans in displaCy.

    🚀 Notes about upgrading from v3.3

    • Doc.has_vector now matches Token.has_vector and Span.has_vector: it returns True if at least one token in the doc has a vector rather than checking only whether the vocab contains vectors.

    📖 Documentation and examples

    • spaCy universe additions:
      • Aim-spacy: An Aim-based spaCy experiment tracker.
      • Asent: Fast, flexible and transparent sentiment analysis.
      • spaCy fishing: Named entity disambiguation and linking on Wikidata in spaCy with Entity-Fishing.
      • spacy-report: Generates interactive reports for spaCy models.

    👥 Contributors

    @adrianeboyd, @danieldk, @ericholscher, @gorarakelyan, @honnibal, @ines, @jademlc, @kadarakos, @KennethEnevoldsen, @koaning, @Lucaterre, @maxTarlov, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @sadovnychyi, @shadeMe, @shen-qin, @single-fingal, @svlandeg, @victorialslocum, @Zackere

  • v3.3.1(Jun 7, 2022)

    🔴 Bug fixes

    • Fix issue #9575: Fix Entity Linker with tokenization mismatches between gold and predicted Doc objects.
    • Fix issue #10685: Fix serialization of SpanGroup objects that share the same name within one SpanGroups container.
    • Fix issue #10718: Remove debug print statements in walk_head_nodes to avoid acquiring the GIL.
    • Fix issue #10741: Make the StringStore.__getitem__ return type dependent on its parameter type.
    • Fix issue #10734: Support removal of overlapping terms in PhraseMatcher.
    • Fix issue #10772: Override SpanGroups.setdefault to also support Iterable[SpanGroup] as the default.
    • Fix issue #10817: Ensure that the term ROOT is in the glossary.
    • Fix issue #10830: Better errors for Doc.has_annotation and Matcher.
    • Fix issue #10864: Avoid pickling Doc inputs passed to Language.pipe().
    • Fix issue #10898: Fix schemas import in Doc.

    ⚠️ Backward incompatibilities

    • Before this release, a validation bug allowed the configuration of a pipeline component to override the name of the pipeline itself through the name attribute. For example, the following pipeline component:

      [components.transformer]
      factory = "transformer"
      name = "custom_transformer_name"
      

      would be registered erroneously as custom_transformer_name. Such overrides are now ignored and a warning is emitted (#10779). From spaCy v3.3.1 onwards, this component will be registered as transformer.

    👥 Contributors

    @adrianeboyd, @danieldk, @freddyheppell, @honnibal, @ines, @kadarakos, @ldorigo, @ljvmiranda921, @maxTarlov, @pmbaumgartner, @polm, @pypae, @richardpaulhudson, @rmitsch, @shadeMe, @single-fingal, @svlandeg

  • v3.3.0(Apr 29, 2022)

    ✨ New features and improvements

    📦 Trained pipelines

    v3.3 introduces trained pipelines for Finnish, Korean and Swedish which feature the trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.

    | Package | Language | UPOS | Parser LAS | NER F |
    | --- | --- | ---: | ---: | ---: |
    | fi_core_news_sm | Finnish | 92.5 | 71.9 | 75.9 |
    | fi_core_news_md | Finnish | 95.9 | 78.6 | 80.6 |
    | fi_core_news_lg | Finnish | 96.2 | 79.4 | 82.4 |
    | ko_core_news_sm | Korean | 86.1 | 65.6 | 71.3 |
    | ko_core_news_md | Korean | 94.7 | 80.9 | 83.1 |
    | ko_core_news_lg | Korean | 94.7 | 81.3 | 85.3 |
    | sv_core_news_sm | Swedish | 95.0 | 75.9 | 74.7 |
    | sv_core_news_md | Swedish | 96.3 | 78.5 | 79.3 |
    | sv_core_news_lg | Swedish | 96.3 | 79.1 | 81.1 |

    🙏 Special thanks to @aajanki, @thiippal (Finnish) and Elena Fano (Swedish) for their help with the new pipelines!

    The new trainable lemmatizer is used for Danish, Dutch, Finnish, German, Greek, Italian, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian and Swedish.

    | Model | v3.2 Lemma Acc | v3.3 Lemma Acc |
    | --- | ---: | ---: |
    | da_core_news_md | 84.9 | 94.8 |
    | de_core_news_md | 73.4 | 97.7 |
    | el_core_news_md | 56.5 | 88.9 |
    | fi_core_news_md | - | 86.2 |
    | it_core_news_md | 86.6 | 97.2 |
    | ko_core_news_md | - | 90.0 |
    | lt_core_news_md | 71.1 | 84.8 |
    | nb_core_news_md | 76.7 | 97.1 |
    | nl_core_news_md | 81.5 | 94.0 |
    | pl_core_news_md | 87.1 | 93.7 |
    | pt_core_news_md | 76.7 | 96.9 |
    | ro_core_news_md | 81.8 | 95.5 |
    | sv_core_news_md | - | 95.5 |

    🔴 Bug fixes

    • Fix issue #5447: Avoid overlapping arcs when using displaCy in manual mode.
    • Fix issue #9443: Fix Scorer.score_cats for missing labels.
    • Fix issue #9669: Fix entity linker batching.
    • Fix issue #9903: Handle _ value for UPOS in CoNLL-U converter.
    • Fix issue #9904: Fix textcat loss scaling.
    • Fix issue #9956: Compare all Span attributes consistently.
    • Fix issue #10073: Add "spans" to the output of doc.to_json.
    • Fix issue #10086: Add tokenizer option to allow Matcher handling for all special cases.
    • Fix issue #10189: Allow Example to align whitespace annotation.
    • Fix issue #10302: Fix check for NER annotation in MISC in CoNLL-U converter.
    • Fix issue #10324: Fix Tok2Vec for empty batches.
    • Fix issue #10347: Update basic functionality for rehearse.
    • Fix issue #10394: Fix Vectors.n_keys for floret vectors.
    • Fix issue #10400: Use meta in util.load_model_from_config.
    • Fix issue #10451: Fix Example.get_matching_ents.
    • Fix issue #10460: Fix initial special cases for Tokenizer.explain.
    • Fix issue #10521: Stream large assets on download in spaCy projects.
    • Fix issue #10536: Handle unknown tags in KoreanTokenizer tag map.
    • Fix issue #10551: Add automatic vector deduplication for init vectors.

    🚀 Notes about upgrading from v3.2

    • To see the speed improvements for the Tagger architecture, edit your configs to switch from spacy.Tagger.v1 to spacy.Tagger.v2 and then run init fill-config (see the config sketch after this list).
    • Span comparisons involving ordering (<, <=, >, >=) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order (#9956).
    • Annotation on whitespace tokens is handled in the same way as annotation on non-whitespace tokens during training in order to allow custom whitespace annotation (#10189).
    • Doc.from_docs now includes Doc.tensor by default and supports excluding fields with an exclude argument in the same format as Doc.to_bytes. The supported exclude fields are spans, tensor and user_data (see the sketch below).
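
    A minimal sketch of the new exclude argument on Doc.from_docs (the example texts are placeholders):

    import spacy
    from spacy.tokens import Doc

    nlp = spacy.blank("en")
    docs = [nlp("First document."), nlp("Second document.")]

    # Doc.tensor is now included by default; exclude uses the same field
    # names as Doc.to_bytes ("spans", "tensor", "user_data").
    merged = Doc.from_docs(docs, exclude=["tensor", "user_data"])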

    📖 Documentation and examples

    👥 Contributors

    @aajanki, @adrianeboyd, @apjanco, @bdura, @BramVanroy, @danieldk, @danmysak, @davidberenstein1957, @DuyguA, @fonfonx, @gremur, @HaakonME, @harmbuisman, @honnibal, @ines, @internaut, @jfainberg, @jnphilipp, @jsnfly, @kadarakos, @koaning, @ljvmiranda921, @martinjack, @mgrojo, @nrodnova, @ofirnk, @orglce, @pepemedigu, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @ryndaniels, @SamEdwardes, @Schero1994, @shadeMe, @single-fingal, @svlandeg, @thebugcreator, @thomashacker, @umaxfun, @y961996

  • v3.1.6(Mar 30, 2022)

    🔴 Bug fixes

    • Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.

    👥 Contributors

    @adrianeboyd, @honnibal, @ines

  • v3.2.4(Mar 29, 2022)

    🔴 Bug fixes

    • Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.

    👥 Contributors

    @adrianeboyd, @honnibal, @ines

  • v3.2.3(Mar 1, 2022)

  • v3.1.5(Mar 1, 2022)

    🔴 Bug fixes

    • Fix issue #9593: Use metaclass to subclass errors for easier pickling.
    • Fix issue #9654: Fix spancat for empty docs and zero suggestions.
    • Fix issue #9979: Fix type of Lexeme.rank.
    • Fix issue #10324: Fix Tok2Vec for empty batches.

    👥 Contributors

    @adrianeboyd, @BramVanroy, @brucewlee, @danieldk, @honnibal, @ines, @ljvmiranda921, @polm, @svlandeg, @vgautam, @xxyzz

  • v3.0.8(Mar 1, 2022)

  • v3.2.2(Feb 11, 2022)

    ✨ New features and improvements

    • Improved parser and ner speeds on long documents (see technical details in #10019).
    • Support for spancat components in debug data.
    • Support for ENT_IOB as a Matcher token pattern key (see the sketch after this list).
    • Extended and improved types for many classes.
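
    A minimal sketch of ENT_IOB as a Matcher key; the pipeline name and pattern label are illustrative, and a pipeline that sets entity annotation is assumed:

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")  # any pipeline with an ner component
    matcher = Matcher(nlp.vocab)

    # Match tokens that begin an entity ("B" in the IOB scheme).
    matcher.add("ENT_START", [[{"ENT_IOB": "B"}]])

    doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
    print([doc[start:end].text for _, start, end in matcher(doc)])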

    🔴 Bug fixes

    • Fix issue #9735: Make floret murmurhash endian-neutral.
    • Fix issue #9738: Support string IOB values for ENT_IOB.
    • Fix issue #9746: Updates to avoid "dictionary size changed during iteration" runtime errors.
    • Fix issue #9960: Warn about entities that cross sentence boundaries in debug data.
    • Fix issue #9979: Fix type for Lexeme.rank.
    • Fix issue #10026: Check for 0-size assets in spacy project.
    • Fix issue #10051: Consistently return scalars from similarity methods.
    • Fix issue #10052: Fix spaces in Doc.from_docs() for empty docs.
    • Fix issue #10079: Fix label detection in debug data for components with custom names.
    • Fix issue #10109: Add types to Underscore and DependencyMatcher and improve types in Language, Matcher and PhraseMatcher.
    • Fix issue #10130: Fix Tokenizer.explain when infixes appear as prefixes.
    • Fix issue #10143: Use simple suggester in spancat initialization.
    • Fix issue #10164: Support IS_SENT_END in Doc.has_annotation.
    • Fix issue #10192: Detect invalid package names in spacy package.
    • Fix issue #10223: Support mixed case in package names.
    • Fix issue #10234: Fix type in PhraseMatcher.

    📖 Documentation and examples

    • Various documentation updates.
    • New spaCy version tags in spaCy universe.
    • New Dockerfile for repeatable website builds and easier local development.
    • New additions to spaCy universe:
      • Augmenty: a text augmentation library
      • Healthsea: an end-to-end spaCy pipeline for exploring health supplement effects
      • spacy-wrap: wrap fine-tuned transformers in spaCy pipelines
      • spacypdfreader: easy PDF to text to spaCy text extraction
      • textnets: text analysis with networks

    👥 Contributors

    @adrianeboyd, @antonpibm, @ColleterVi, @danieldk, @DuyguA, @ezorita, @HaakonME, @honnibal, @ines, @jboynyc, @KennethEnevoldsen, @ljvmiranda921, @mrshu, @pmbaumgartner, @polm, @ramonziai, @richardpaulhudson, @ryndaniels, @svlandeg, @thiippal, @thomashacker, @yoavxyoav

  • v3.2.1(Dec 7, 2021)

    ✨ New features and improvements

    • NEW: doc_cleaner component for removing doc.tensor, doc._.trf_data or other Doc attributes at the end of the pipeline to reduce the size of output docs (see the sketch after this list).
    • NEW: ENT_ID and ENT_KB_ID to Matcher pattern attributes.
    • Support kb_id for entities in displaCy from Doc input.
    • Add Span.sents property for spans that cross sentence boundaries.
    • Add EntityRuler.remove to remove patterns by id.
    • Make the Tagger neg_prefix configurable.
    • Use Language.pipe in Language.evaluate for more efficient processing.
    • Test suite updates: move regression tests into core test modules with pytest markers for issue numbers, extend tests for languages with alpha support.
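
    A minimal sketch of the doc_cleaner component; the attrs config maps attribute names to replacement values, and the pipeline name is illustrative:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # any installed pipeline

    # Null out doc.tensor at the end of the pipeline to shrink the
    # serialized size of the output docs.
    nlp.add_pipe("doc_cleaner", config={"attrs": {"tensor": None}}, last=True)

    doc = nlp("The tensor is removed once processing is finished.")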

    🔴 Bug fixes

    • Fix issue #9638: Make JsonlCorpus path optional again.
    • Fix issue #9654: Fix spancat for empty docs and zero suggestions.
    • Fix issue #9658: Improve error message for incorrect .jsonl paths in EntityRuler.
    • Fix issue #9674: Fix language-specific factory handling in package CLI.
    • Fix issue #9694: Convert labels to strings for README in package CLI.
    • Fix issue #9697: Exclude strings from source vector checks.
    • Fix issue #9701: Allow Scorer.score_spans to handle predicted docs with missing annotation.
    • Fix issue #9722: Initialize parser from reference parse rather than aligned example.
    • Fix issue #9764: Set annotations more efficiently in tagger and morphologizer.

    📖 Documentation and examples

    👥 Contributors

    @adrianeboyd, @danieldk, @DuyguA, @honnibal, @ines, @ljvmiranda921, @narayanacharya6, @nrodnova, @Pantalaymon, @polm, @richardpaulhudson, @svlandeg, @thiippal, @Vishnunkumar

  • v3.2.0(Nov 5, 2021)

    ✨ New features and improvements

    • NEW: Registered scoring functions for each component in the config.
    • NEW: nlp() and nlp.pipe() accept Doc input, which simplifies setting custom tokenization or extensions before processing (sketched after this list).
    • NEW: Support for floret vectors, which combine fastText subwords with Bloom embeddings for compact, full-coverage vectors.
    • New overwrite config settings for entity_linker, morphologizer, tagger, sentencizer and senter.
    • New extend config setting for the morphologizer that controls whether existing feature types are preserved.
    • Support for a wider range of language codes in spacy.blank() including IETF language tags, for example fra for French and zh-Hans for Chinese.
    • New package spacy-loggers for additional loggers.
    • New Irish lemmatizer.
    • New Portuguese noun chunks and updated Spanish noun chunks.
    • Language updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
    • Japanese reading and inflection from sudachipy are annotated as Token.morph features.
    • Additional morph_micro_p/r/f scores for morphological features from Scorer.score_morph_per_feat().
    • LIKE_URL attribute includes the tokenizer URL pattern.
    • --n-save-epoch option for spacy pretrain.
    • Trained pipelines:
      • New transformer pipeline for Japanese ja_core_news_trf, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
      • Updates for Catalan data, tokenizer and lemmatizer, thanks to @cayorodriguez, Carme Armentano and @TeMU-BSC!
      • Transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
      • Universal Dependencies corpora updated to v2.8.
      • Trailing space added as a tok2vec feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
      • English attribute ruler patterns updated to improve Token.pos and Token.morph.
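
    Two of the additions above in a minimal sketch: passing a pre-created Doc through the pipeline and creating a blank pipeline from an IETF language tag:

    import spacy

    nlp = spacy.blank("en")

    # nlp() and nlp.pipe() accept Doc input, so custom tokenization or
    # extensions can be set on the Doc before processing.
    doc = nlp.make_doc("Berlin is a city in Germany.")
    doc = nlp(doc)

    # spacy.blank() accepts IETF language tags, e.g. fra for French.
    nlp_fr = spacy.blank("fra")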

    For more details, see the New in v3.2 usage guide.

    🔴 Bug fixes

    • Fix issue #8972: Fix pickling for Japanese, Korean and Vietnamese tokenizers.
    • Fix issue #9032: Retain alignment between doc and context for Language.pipe(as_tuples=True) for multiprocessing with custom error handlers.
    • Fix issue #9136: Ignore prefixes when applying suffix patterns in Tokenizer.
    • Fix issue #9584: Use metaclass to subclass errors to allow better pickling.

    ⚠️ Backwards incompatibilities

    • In the Tokenizer, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of °[cfk]. is now ° c . instead of ° c. for most languages.
    • The tokenizer classes ChineseTokenizer, JapaneseTokenizer, KoreanTokenizer, ThaiTokenizer and VietnameseTokenizer require Vocab rather than Language in __init__.
    • In DocBin, user data is now always serialized according to the store_user_data option, see #9190.

    📖 Documentation and examples

    👥 Contributors

    @adrianeboyd, @Avi197, @baxtree, @BramVanroy, @cayorodriguez, @DuyguA, @fgaim, @honnibal, @ines, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker

  • v3.1.4(Oct 29, 2021)

    ✨ New features and improvements

    • NEW: Binary wheels for Python 3.10.
    • NEW: Improve performance on Apple M1 with AppleOps: pip install spacy[apple].
    • GPU profiling with spacy.models_with_nvtx_range.v1.
    • Full mypy integration in the CI and many type fixes across the code base.
    • Added custom Protocol classes in ty.py to define behavior of pipeline components.
    • Support for entity linking visualization in displacy.
    • Allow overriding vars in spacy project assets.
    • Standalone train function to run training from Python scripts, just like the spacy train CLI (see the sketch after this list).
    • Support for spacy-transformers>=1.1.0 with improved IO.
    • Support for thinc>=8.0.11 with improved gradient clipping.
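
    A minimal sketch of the standalone train function; all paths are placeholders:

    from spacy.cli.train import train

    # Runs the same training loop as the spacy train CLI from Python.
    train(
        "./config.cfg",
        output_path="./output",
        overrides={"paths.train": "./train.spacy", "paths.dev": "./dev.spacy"},
    )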

    🔴 Bug fixes

    • Fix issue #5507: Improve UX for multiprocessing on GPU.
    • Fix issue #9137: Fix serialization for KnowledgeBase.set_entities.
    • Fix issue #9244: Fix vectors for 0-length spans.
    • Fix issue #9247: Improve UX for the DocBin constructor.
    • Fix issue #9254: Allow unicode in a spacy project title.
    • Fix issue #9263: Make added patterns consistent in the DependencyMatcher.
    • Fix issue #9305: Restore tokenization timing during evaluation.
    • Fix issue #9335: Sync vocab in vectors and sourced components.
    • Fix issue #9387: Ensure lemmas are consistent for Catalan, Dutch, French, Russian and Ukrainian.
    • Fix issue #9404: Create consistent default textcat and textcat_multilabel configurations.
    • Fix issue #9437: Improve UX around Doc object creation.
    • Fix issue #9465: Fix minor issues with convert CLI.
    • Fix issue #9500: Include .pyi files in the distributed package.

    📖 Documentation and examples

    • Various updates to the documentation.
    • New additions to the spaCy universe:
      • deplacy: CUI-based dependency visualizer
      • ipymarkup: Visualizations for NER and syntax trees
      • PhruzzMatcher: Find fuzzy matches
      • spacy-huggingface-hub: Push spaCy pipelines to the Hugging Face Hub
      • spaCyOpenTapioca: Entity Linking on Wikidata
      • spacy-clausie: Clause-based information extraction system
      • "Applied Natural Language Processing in the Enterprise": Book by Ankur A. Patel
      • "Introduction to spaCy 3": Free course by Dr. W.J.B. Mattingly

    👥 Contributors

    @adrianeboyd, @connorbrinton, @danieldk, @DuyguA, @honnibal, @ines, @Jette16, @ljvmiranda921, @mjvallone, @philipvollet, @polm, @rspeer, @ryndaniels, @shigapov, @svlandeg, @thomashacker

  • v3.1.3(Sep 20, 2021)

    ✨ New features and improvements

    • The new WandbLogger.v3 supports optional run_name and entity parameters.
    • Improved UX when providing invalid pos values for a Doc or Token.

    🔴 Bug fixes

    • Fix issue #9001: Pass alignments to Matcher callbacks.
    • Fix issue #9009: Include component factories in third-party dependencies resolver.
    • Fix issue #9012: Correct type of config in create_pipe.
    • Fix issue #9014: Allow typer 0.4 to provide support for both Click 7 and Click 8.
    • Fix issue #9033: Fix verbs list for French tokenizer exceptions.
    • Fix issue #9059: Pass overrides to subcommands in spacy project workflows.
    • Fix issue #9074: Improve UX around repo and path arguments in spacy project.
    • Fix issue #9084: Fix inference of epoch_resume in spacy pretrain.
    • Fix issue #9163: Handle spacy-legacy in spacy package dependency detection.
    • Fix issue #9211: Include only runtime-relevant dependencies in spacy package.

    📖 Documentation and examples

    • Various updates to the documentation.
    • A few additions and updates to the spaCy universe.
    • Extended the developer documentation with information about the listener pattern, the StringStore and the Vocab.

    👥 Contributors

    @adrianeboyd, @davidefiocco, @davidstrouk, @filipematos95, @honnibal, @ines, @j-frei, @Joozty, @kwhumphreys, @mjhajharia, @mylibrar, @polm, @rspeer, @shigapov, @svlandeg, @thomashacker

  • v3.1.2(Aug 20, 2021)

    ✨ New features and improvements

    • NEW: Provide scores for the SpanCategorizer predictions.
    • NEW: Broader compatibility with type checkers thanks to .pyi stub files.
    • NEW: Auto-detect package dependencies in spacy package.
    • New INTERSECTS operator for the Matcher (see the sketch after this list).
    • More debugging info for spacy project push and pull commands.
    • Allow passing in a precomputed array for speeding up multiple Span.as_doc calls.
    • The default da transformer is now the same as the one from the trained pipelines (Maltehb/danish-bert-botxo).
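
    A minimal sketch of the INTERSECTS operator, which matches when a token's list-valued attribute shares at least one value with the given list; a pipeline that sets Token.morph is assumed for matches to actually occur:

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.blank("en")
    matcher = Matcher(nlp.vocab)

    # Match tokens whose morphological features intersect these values.
    pattern = [{"MORPH": {"INTERSECTS": ["Number=Sing", "Gender=Neut"]}}]
    matcher.add("SING_OR_NEUT", [pattern])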

    🔴 Bug fixes

    • Fix issue #8767: Fix offsets of empty and out-of-bounds spans.
    • Fix issue #8774: Ensure debug data runs correctly with a custom tokenizer.
    • Fix issue #8784: Fix incorrect ISSUBSET and ISSUPERSET in schema and docs.
    • Fix issue #8796: Respect the no_skip value for spacy project run.
    • Fix issue #8810: Make ConsoleLogger flush after each logging line.
    • Fix issue #8819: Pass exclude when serializing the vocab.
    • Fix issue #8830: Avoid adding sourced vectors hashes if not necessary.
    • Fix issue #8970: Fix allow_overlap default for span categorizer scoring.
    • Fix issue #8982: Add glossary entry for _SP.
    • Fix issue #9007: Fix span categorizer training on nested entities.

    📖 Documentation and examples

    👥 Contributors

    @adrianeboyd, @bbieniek, @DuyguA, @ezorita, @HLasse, @honnibal, @ines, @kabirkhan, @kevinlu1248, @ldorigo, @Ledenel, @nsorros, @polm, @svlandeg, @swfarnsworth, @themrmax, @thomashacker

  • v3.0.7(Jul 23, 2021)

    ✨ New features and improvements

    • Alpha tokenization support for Azerbaijani.
    • Updates for French stop words.

    🔴 Bug fixes

    • Fix issue #7629: Fix scoring normalization.
    • Fix issue #7886: Fix unknown tokens percentage in debug data.
    • Fix issue #7907: Update load_lookups return type and docstring.
    • Fix issue #7930: Make EntityLinker robust for nO=None.
    • Fix issue #7925: Skip vector ngram backoff if minn is not set.
    • Fix issue #7973: Fix debug model for transformers.
    • Fix issue #7988: Preserve existing ENT_KB_ID in ner annotation.
    • Fix issue #7992: Fix span offsets for Matcher(as_spans) on spans.
    • Fix issue #8004: Handle errors while multiprocessing.
    • Fix issue #8009: Fix Doc.from_docs() for all empty docs.
    • Fix issue #8012: Fix ensemble textcat with listener.
    • Fix issue #8054: Add ENT_ID and NORM to DocBin strings.
    • Fix issue #8055: Handle partial entities in Span.as_doc.
    • Fix issue #8062: Make all Span attrs writable.
    • Fix issue #8066: Update debug data for textcat.
    • Fix issue #8069: Custom warning if DocBin is too large.
    • Fix issue #8113: Support to/from_bytes for KnowledgeBase and EntityLinker.
    • Fix issue #8116: Fix offsets in Span.get_lca_matrix.
    • Fix issue #8132: Remove unsupported attrs from attrs.IDS.
    • Fix issue #8158: Ensure tolerance is passed on in spacy.batch_by_words.v1.
    • Fix issue #8169: Fix EntityRuler bug where ent_ids returns None for phrases.
    • Fix issue #8208: Address missing config overrides post load of models.
    • Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
    • Fix issue #8216: Don't add duplicate patterns in EntityRuler.
    • Fix issue #8244: Use context manager when reading model file.
    • Fix issue #8245: Fix other open calls without context managers.
    • Fix issue #8265: Address mypy errors.
    • Fix issue #8299: Restrict pymorphy2 requirement to pymorphy2 mode in Russian and Ukrainian lemmatizers.
    • Fix issue #8335: Raise error if deps not provided with heads in Doc.
    • Fix issue #8368: Preserve whitespace in Span.lemma_.
    • Fix issue #8396: Make JsonlReader path optional.
    • Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
    • Fix issue #8423: Update validate CLI to fix compat and ignore warnings.
    • Fix issue #8426: Fix setting empty entities in Example.from_dict.
    • Fix issue #8487: Fix span offsets and keys in Doc.from_docs.
    • Fix issue #8584: Raise an error for textcat with <2 labels.
    • Fix issue #8551: Fix duplicate spacy package CLI opts.

    👥 Contributors

    @adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @fhopp, @frascuchon, @graue70, @ines, @jenojp, @jhroy, @jklaise, @juliensalinas, @meghanabhange, @michael-k, @narayanacharya6, @polm, @sevdimali, @svlandeg, @ZeeD

  • v3.1.1(Jul 20, 2021)

    ✨ New features and improvements

    • Alpha tokenization support for Ancient Greek.
    • Implementation of a noun_chunk iterator for Dutch.
    • Support for black & flake8 as pre-commit hooks.
    • New spacy.ngram_range_suggester.v1 for suggesting a range of n-gram sizes for the spancat component (see the sketch below).
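
    A minimal sketch of retrieving the new suggester from the function registry; in practice it is usually referenced from the spancat section of the training config instead:

    from spacy import registry

    # Suggest all spans of 1 up to 3 tokens as spancat candidates.
    suggester = registry.misc.get("spacy.ngram_range_suggester.v1")(
        min_size=1, max_size=3
    )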

    🔴 Bug fixes

    • Fix issue #8638: Fix Azerbaijani initialization.
    • Fix issue #8639: Use 0-vector for OOV lexemes.
    • Fix issue #8640: Update lexeme ranks for loaded vectors.
    • Fix issue #8651: Fix ru and uk multiprocessing (with spawn).
    • Fix issue #8663: Preserve existing meta information with spacy package.
    • Fix issue #8718: Ensure that replace_pipe takes disabled components into account.

    👥 Contributors

    @adrianeboyd, @honnibal, @ines, @jmyerston, @julien-talkair, @KennethEnevoldsen, @mariosasko, @mylibrar, @polm, @rynoV, @svlandeg, @thomashacker, @yohasebe

  • v3.1.0(Jul 7, 2021)

    ✨ New features and improvements

    For more details, see the New in v3.1 usage guide.

    📦 New trained pipelines

    | Package          | Language | UPOS | Parser LAS | NER F |
    | ---------------- | -------- | ---: | ---------: | ----: |
    | ca_core_news_sm  | Catalan  | 98.2 |       87.4 |  79.8 |
    | ca_core_news_md  | Catalan  | 98.3 |       88.2 |  84.0 |
    | ca_core_news_lg  | Catalan  | 98.5 |       88.4 |  84.2 |
    | ca_core_news_trf | Catalan  | 98.9 |       93.0 |  91.2 |
    | da_core_news_trf | Danish   | 98.0 |       85.0 |  82.9 |

    ⚠️ Upgrading from v3.0

    • Due to the use of configs with extensive versioning, v3.0 pipelines should be compatible with v3.1; however, you may see slight differences in performance. Test your v3.0 pipeline with v3.1 against your test suite, and if the performance is identical, extend the spacy_version in your model package meta to ">=3.0.0,<3.2.0". If you run into degraded performance, retrain your pipeline with v3.1.
    • Use spacy init fill-config to update a v3.0 config for v3.1.
    • When sourcing a pipeline component that requires static vectors, it is now required to include the source model's vectors in [initialize.vectors].
    • Logger warnings have been converted to Python warnings. Use warnings.filterwarnings or the new helper method spacy.errors.filter_warning(action, error_msg='') to manage warnings (see the sketch below).
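
    For example, a specific warning can now be silenced with the standard library; the warning code below is illustrative:

    import warnings

    # Logger warnings are now Python warnings, so standard filters apply.
    warnings.filterwarnings("ignore", message=r"\[W036\]")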

    For more information, see Notes on upgrading from v3.0.

    🔴 Bug fixes

    • Fix issue #7036: Use a context manager when reading model.
    • Fix issue #7629: Fix scoring normalization.
    • Fix issue #7799: Ensure spacy ray command works.
    • Fix issue #7807: Show warning if entity ruler runs without patterns.
    • Fix issue #7886: Fix unknown tokens percentage in debug data.
    • Fix issue #7930: Make EntityLinker robust for nO=None.
    • Fix issue #7925: Skip vector ngram backoff if minn is not set.
    • Fix issue #7973: Fix debug model for transformers.
    • Fix issue #7988: Preserve existing ENT_KB_ID in ner annotation.
    • Fix issue #8004: Handle errors while multiprocessing.
    • Fix issue #8009: Fix Doc.from_docs() for all empty docs.
    • Fix issue #8012: Fix ensemble textcat with listener.
    • Fix issue #8054: Add ENT_ID and NORM to DocBin strings.
    • Fix issue #8055: Handle partial entities in Span.as_doc.
    • Fix issue #8062: Make all Span attrs writable.
    • Fix issue #8066: Update debug data for textcat.
    • Fix issue #8069: Custom warning if DocBin is too large.
    • Fix issue #8099: Update Vietnamese tokenizer.
    • Fix issue #8113: Support to/from_bytes for KnowledgeBase and EntityLinker.
    • Fix issue #8116: Fix offsets in Span.get_lca_matrix.
    • Fix issue #8132: Remove unsupported attrs from attrs.IDS.
    • Fix issue #8158: Ensure tolerance is passed on in spacy.batch_by_words.v1.
    • Fix issue #8169: Fix EntityRuler bug where ent_ids returns None for phrases.
    • Fix issue #8208: Address missing config overrides post load of models.
    • Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
    • Fix issue #8216: Don't add duplicate patterns in EntityRuler.
    • Fix issue #8265: Address mypy errors.
    • Fix issue #8335: Raise error if deps not provided with heads in Doc.
    • Fix issue #8368: Preserve whitespace in Span.lemma_.
    • Fix issue #8388: Don't clobber vectors when loading components from source models.
    • Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
    • Fix issue #8426: Fix setting empty entities in Example.from_dict.
    • Fix issue #8441: Add correct types for Language.pipe return values.
    • Fix issue #8487: Fix span offsets and keys in Doc.from_docs.
    • Fix issue #8559: Fix vectors check for sourced components.
    • Fix issue #8584: Raise an error for textcat with <2 labels.

    👥 Contributors

    @aajanki, @adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @explosion-bot, @fhopp, @frascuchon, @graue70, @gtoffoli, @honnibal, @ines, @jacopofar, @jenojp, @jhroy, @jklaise, @juliensalinas, @kevinlu1248, @ldorigo, @mathcass, @meghanabhange, @michael-k, @narayanacharya6, @NirantK, @nsorros, @polm, @sevdimali, @svlandeg, @themrmax, @xadrianzetx, @yohasebe, @ZeeD

  • v2.3.7(Jun 4, 2021)

  • v2.3.6(May 18, 2021)

    ✨ New features and improvements

    • Add base support for Amharic.
    • Add noun chunk iterator for Danish.
    • Updates to French, Portuguese and Romanian stop words.

    🔴 Bug fixes

    • Fix issue #6705: Fix deserialization of null token_match and url_match for the tokenizer.
    • Fix issue #6712: Prevent overlapping noun chunks for Spanish.
    • Fix issue #6745: Fix minibatch iterator when size iterator is finished.
    • Fix issue #6759: Skip 0-length matches in the Matcher.
    • Fix issue #6771: Support IS_SENT_START in the PhraseMatcher.
    • Fix issue #6772: Fix Span.text for empty spans.
    • Fix issue #6820: Improve Doc.char_span alignment_mode handling.
    • Fix issue #6857: Remove --no-cache-dir when downloading models.
    • Fix issue #8115: Fix offsets in Span.get_lca_matrix.

    👥 Contributors

    Thanks to @alexcombessie, @AMArostegui, @bryant1410, @Cristianasp, @garethsparks, @jenojp, @jganseman, @jumasheff, @lorenanda, @ophelielacroix, @thomasbird, @timgates42, @tupui and @yosiasz for the pull requests and contributions.

  • v3.0.6(Apr 23, 2021)

    ✨ New features and improvements

    • New assemble CLI command for assembling a pipeline from a config without training.
    • Add support for match alignments in the Matcher to align matched tokens with matcher patterns (see the sketch after this list).
    • Add support for training from streamed corpora.
    • Add support for W&B data and model checkpoint logging and versioning in spacy.WandbLogger.v2.
    • Extend Scorer.score_spans to support overlapping and unlabeled spans.
    • Update debug data for new v3 components.
    • Improve language data for Italian.
    • Various improvements to error handling and UX.
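
    A minimal sketch of match alignments, as mentioned above:

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.blank("en")
    matcher = Matcher(nlp.vocab)
    matcher.add("HELLO_WORLD", [[{"LOWER": "hello"}, {"LOWER": "world"}]])

    doc = nlp("hello world")
    # with_alignments=True additionally returns, for each match, the
    # pattern token index that each matched token was aligned to.
    for match_id, start, end, alignments in matcher(doc, with_alignments=True):
        print(doc[start:end].text, alignments)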

    🔴 Bug fixes

    • Fix issue #7408: Add vocab kwarg to spacy.load.
    • Fix issue #7419: Exclude user hooks in displacy conversion.
    • Fix issue #7421: Update --code usage in CLI commands.
    • Fix issue #7424: Preserve sent starts on retokenization without parse.
    • Fix issue #7440: Fix pymorphy2 lookup lemmatizer.
    • Fix issue #7471: Improve warnings related to listening components.
    • Fix issue #7488: Fix upstream check in pretraining.
    • Fix issue #7489: Support callbacks entry points.
    • Fix issue #7497: Merge doc.spans in Doc.from_docs().
    • Fix issue #7528: Preserve user data for DependencyMatcher on spans.
    • Fix issue #7557: Fix __add__ method for PRFScore.
    • Fix issue #7574: Fix conversion of custom extension data in Span.as_doc and Doc.from_docs.
    • Fix issue #7620: Fix replace_listeners in configs.
    • Fix issue #7626: Fix vectors data on GPU.
    • Fix issue #7630: Update NEL for entities crossing sentence boundaries.
    • Fix issue #7631: Fix parser sourcing in NER converter.
    • Fix issue #7642: Fix handling of hyphen string value in config files.
    • Fix issue #7655: Fix sent starts when converting from v2 JSON training format.
    • Fix issue #7674: Fix handling of unknown tokens in StaticVectors.
    • Fix issue #7690: Fix pickling of Lemmatizer.
    • Fix issue #7749: Update Tokenizer.explain for special cases in v3.
    • Fix issue #7755: Fix config parsing of ints/strings.
    • Fix issue #7836: Fix tokenizer cache flushing.
    • Fix issue #7847: Fix handling of boolean values in Example.from_dict for sent starts.

    📖 Documentation and examples

    • Add documentation for legacy functions and architectures.
    • Add documentation for pretrained pipeline design.
    • Add more details about pipe and multiprocessing.
    • Fix various typos and inconsistencies.

    👥 Contributors

    Thanks to @alvaroabascar, @armsp, @AyushExel, @BramVanroy, @broaddeep, @bryant1410, @bsweileh, @dpalmasan, @Findus23, @graue70, @jaidevd, @koaning, @langdonholmes, @m0canu1, @meghanabhange, @paoloq, @plison, @richardpaulhudson, @SamEdwardes, @Stannislav for the pull requests and contributions!
