HuSpaCy: Industrial-strength Hungarian NLP
HuSpaCy is a spaCy model and a library providing industrial-strength Hungarian language processing facilities. The released pipeline consists of a tokenizer, sentence splitter, lemmatizer, tagger (which also predicts morphological features), dependency parser and a named entity recognition module. Word and phrase embeddings are also available through spaCy's API. All models have high throughput, decent memory usage and close to state-of-the-art accuracy. A live demo is available here, and model releases are published to Hugging Face Hub.
This repository contains material to build HuSpaCy's models from the ground up.
Installation
To get started, you first need to download the model. The easiest way to do this is to install the huspacy package from PyPI:
pip install huspacy
This utility package exposes convenience methods for downloading and using the latest model:
import huspacy
# Download the latest model
huspacy.download()
# Download the specified model
huspacy.download(version="v0.4.2")
# Load the previously downloaded model (hu_core_news_lg)
nlp = huspacy.load()
Alternatively, one can install the latest model from Hugging Face Hub directly:
pip install https://huggingface.co/huspacy/hu_core_news_lg/resolve/main/hu_core_news_lg-any-py3-none-any.whl
To speed up inference using GPUs, CUDA support can be installed as described in https://spacy.io/usage.
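Once GPU support is installed, spaCy still has to be asked to use the GPU before the pipeline is loaded. The following is a minimal sketch, assuming the `hu_core_news_lg` package is already installed; `load_pipeline` is a hypothetical helper name used for illustration and is not part of HuSpaCy.

```python
def load_pipeline(model_name="hu_core_news_lg", use_gpu=True):
    """Load a spaCy pipeline, optionally trying the GPU first.

    Hypothetical helper for illustration; not part of HuSpaCy.
    """
    import spacy  # imported lazily so the helper can be defined without spaCy

    if use_gpu:
        # prefer_gpu() returns True if a GPU was activated and falls back
        # to the CPU otherwise. It must be called before loading the model.
        on_gpu = spacy.prefer_gpu()
        print("Running on GPU" if on_gpu else "Running on CPU")
    return spacy.load(model_name)
```

Calling `nlp = load_pipeline()` then gives the same pipeline object as `spacy.load("hu_core_news_lg")`, running on the GPU when one is available.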
Usage
HuSpaCy is fully compatible with spaCy's API; newcomers can easily get started with the spaCy 101 guide.
Although HuSpaCy models can be loaded with spacy.load(), the tool provides convenience methods to easily access downloaded models.
# Load the model using huspacy
import huspacy
nlp = huspacy.load()
# Load the model using spacy.load()
import spacy
nlp = spacy.load("hu_core_news_lg")
# Load the model directly as a module
import hu_core_news_lg
nlp = hu_core_news_lg.load()
# Either way you get the same model and can start processing texts.
doc = nlp("Csiribiri csiribiri zabszalma - négy csillag közt alszom ma.")
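The resulting `doc` exposes the annotations through spaCy's standard `Doc` and `Token` attributes. The sketch below shows one way to collect them; the helper names (`token_rows`, `entity_rows`, `sentence_texts`) are hypothetical and only wrap standard spaCy attributes.

```python
def token_rows(doc):
    """Return (text, lemma, POS tag, morphological features, dependency
    label, head text) for every token of a processed spaCy Doc."""
    return [
        (tok.text, tok.lemma_, tok.pos_, str(tok.morph), tok.dep_, tok.head.text)
        for tok in doc
    ]


def entity_rows(doc):
    """Return (text, label) for every named entity found in the Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def sentence_texts(doc):
    """Return the text of every sentence found by the sentence splitter."""
    return [sent.text for sent in doc.sents]
```

With the `doc` created above, `for row in token_rows(doc): print(*row, sep="\t")` prints one token per line, while `entity_rows(doc)` and `sentence_texts(doc)` return the recognized entities and sentences.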
Available Models
Currently, we provide a single large model which achieves a good balance between accuracy and processing speed. A demo of this model is available at Hugging Face Spaces. This default model (hu_core_news_lg) provides tokenization, sentence splitting, part-of-speech tagging (UD labels with detailed morphosyntactic features), lemmatization, dependency parsing and named entity recognition, and ships with pretrained word vectors.
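To check which of these components and vectors a loaded pipeline actually contains, spaCy's generic `Language` API can be queried. This is a rough sketch: `pipeline_summary` and `text_similarity` are hypothetical helper names, and the component names reported depend on the installed model version.

```python
def pipeline_summary(nlp):
    """Summarize a loaded spaCy pipeline: component names and the shape
    of its word vector table (rows, dimensions)."""
    n_rows, n_dims = nlp.vocab.vectors.shape
    return {
        "components": list(nlp.pipe_names),
        "vector_rows": n_rows,
        "vector_dims": n_dims,
    }


def text_similarity(nlp, first, second):
    """Compare two texts with Doc.similarity(), which for pipelines with
    word vectors is based on the documents' vector representations."""
    return nlp(first).similarity(nlp(second))
```

For example, `pipeline_summary(nlp)` on the pipeline loaded above returns a dictionary with the component names and vector table size, and `text_similarity(nlp, "alma", "körte")` returns a similarity score for the two words.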
Model changes are recorded in the changelog.
Development
Installing requirements
poetry install
will install all the dependencies. For better performance, you might need to reinstall spaCy with GPU support, e.g.
poetry add spacy[cuda92]
will add support for CUDA 9.2.
Repository structure
├── .github -- GitHub configuration files
├── data -- Data files
│ ├── external -- External models required to train models (e.g. word vectors)
│ ├── processed -- Processed data ready to feed spaCy
│ └── raw -- Raw data, mostly corpora as they are obtained from the web
├── hu_core_news_lg -- spaCy 3.x project files for building a model for news texts
│ ├── configs -- spaCy pipeline configuration files
│ ├── project.lock -- Auto-generated project script
│ ├── project.yml -- spaCy 3 project file describing the steps needed to build the model
│ └── README.md -- Instructions on building a model from scratch
├── huspacy -- subproject for the PyPI distributable package
├── tools -- Source package for tools
│ └── cli -- Command line scripts (Python)
├── models -- Trained models and their metadata
├── resources -- Resource files
├── scripts -- Bash scripts
├── tests -- Test files
├── CHANGELOG.md -- Keeps the changelog
├── LICENSE -- License file
├── poetry.lock -- Locked Poetry dependencies
├── poetry.toml -- Poetry configurations
├── pyproject.toml -- Python project configuration, including dependencies managed with Poetry
└── README.md -- This file
Citing
If you use the models or this library in your research, please cite this paper.
Additionally, please indicate the version of the model you used so that your research can be reproduced.
@misc{HuSpaCy:2021,
title = {{HuSpaCy: an industrial-strength Hungarian natural language processing toolkit}},
booktitle = {{XVIII. Magyar Sz{\'a}m{\'\i}t{\'o}g{\'e}pes Nyelv{\'e}szeti Konferencia}},
author = {Orosz, Gy{\"o}rgy and Sz{\' a}nt{\' o}, Zsolt and Berkecz, P{\' e}ter and Szab{\' o}, Gerg{\H o} and Farkas, Rich{\' a}rd},
location = {{Szeged}},
year = {in press 2021},
}
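The model version to report alongside the citation can be read from the loaded pipeline's metadata. A minimal sketch, assuming the standard `lang`, `name` and `version` keys of spaCy's `nlp.meta` dictionary; `model_identifier` is a hypothetical helper name.

```python
def model_identifier(nlp):
    """Build a '<lang>_<name>-<version>' string from a pipeline's metadata,
    suitable for quoting in a paper or an experiment log."""
    meta = nlp.meta
    return f"{meta['lang']}_{meta['name']}-{meta['version']}"
```

Calling `model_identifier(nlp)` on a loaded HuSpaCy pipeline yields the package name followed by the installed model version.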
License
This library is released under the Apache 2.0 License.
The trained models have their own license (CC BY-SA 4.0) as described on the models page.
Contact
For feature requests and bug reports, please use the GitHub Issue Tracker. Otherwise, please use the Discussion Forums.
Authors
HuSpaCy is implemented by the SzegedAI team, coordinated by Orosz György, in the Hungarian AI National Laboratory, MILAB program.