HuSpaCy: industrial-strength Hungarian natural language processing

Overview



HuSpaCy is a spaCy model and a library providing industrial-strength Hungarian language processing facilities. The released pipeline consists of a tokenizer, sentence splitter, lemmatizer, tagger (which also predicts morphological features), dependency parser and a named entity recognition module. Word and phrase embeddings are also available through spaCy's API. All models have high throughput, decent memory usage and close to state-of-the-art accuracy. A live demo is available here, and model releases are published to the Hugging Face Hub.

This repository contains material to build HuSpaCy's models from the ground up.

Installation

To get started, you first need to download a model. The easiest way to do so is to install the huspacy package from PyPI:

pip install huspacy

This utility package exposes convenience methods for downloading and using the latest model:

import huspacy

# Download the latest model
huspacy.download()

# Download a specific model version
huspacy.download(version="v0.4.2")

# Load the previously downloaded model (hu_core_news_lg)
nlp = huspacy.load()

Alternatively, one can install the latest model from Hugging Face Hub directly:

pip install https://huggingface.co/huspacy/hu_core_news_lg/resolve/main/hu_core_news_lg-any-py3-none-any.whl

To speed up inference using GPUs, CUDA support can be installed as described in https://spacy.io/usage.
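
Once a CUDA-enabled spaCy build is installed, GPU allocation can be requested before loading the model. A minimal sketch using spaCy's standard GPU helper:

import spacy
import huspacy

# Allocate on the GPU if one is available, otherwise fall back to the CPU
spacy.prefer_gpu()

nlp = huspacy.load()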

Usage

HuSpaCy is fully compatible with spaCy's API, so newcomers can easily get started with the spaCy 101 guide.

Although HuSpaCy models can be loaded with spacy.load(), the tool also provides convenience methods for easily accessing downloaded models.

# Load the model using huspacy
import huspacy
nlp = huspacy.load()

# Load the model using spacy.load()
import spacy
nlp = spacy.load("hu_core_news_lg")

# Load the model directly as a module
import hu_core_news_lg
nlp = hu_core_news_lg.load()

# Either way you get the same model and can start processing texts.
doc = nlp("Csiribiri csiribiri zabszalma - négy csillag közt alszom ma.")

Available Models

Currently, we provide a single large model, which achieves a good balance between accuracy and processing speed. A demo of this model is available at Hugging Face Spaces. This default model (hu_core_news_lg) provides tokenization, sentence splitting, part-of-speech tagging (UD labels with detailed morphosyntactic features), lemmatization, dependency parsing and named entity recognition, and ships with pretrained word vectors.
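
Since the model ships with word vectors, token and document similarity are available through spaCy's vector API. A minimal sketch:

import huspacy

nlp = huspacy.load()
doc1 = nlp("kutya")
doc2 = nlp("macska")

# Pretrained word vectors back both token.vector and similarity scores
print(doc1[0].vector.shape)
print(doc1.similarity(doc2))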

Models' changes are recorded in the changelog.

Development

Installing requirements

  • poetry install will install all the dependencies
  • For better performance you might need to reinstall spacy with GPU support, e.g. poetry add spacy[cuda92] will add support for CUDA 9.2

Repository structure

├── .github            -- Github configuration files
├── data               -- Data files
│   ├── external       -- External models required to train models (e.g. word vectors)
│   ├── processed      -- Processed data ready to feed spacy
│   └── raw            -- Raw data, mostly corpora as they are obtained from the web
├── hu_core_news_lg    -- Spacy 3.x project files for building a model for news texts
│   ├── configs        -- Spacy pipeline configuration files
│   ├── project.lock   -- Auto-generated project script
│   ├── project.yml    -- Spacy3 Project file describing steps needed to build the model
│   └── README.md      -- Instructions on building a model from scratch
├── huspacy            -- subproject for the PyPI distributable package
├── tools              -- Source package for tools
│   └── cli            -- Command line scripts (Python)
├── models             -- Trained models and their metadata
├── resources          -- Resource files
├── scripts            -- Bash scripts
├── tests              -- Test files 
├── CHANGELOG.md       -- Keeps the changelog
├── LICENSE            -- License file
├── poetry.lock        -- Locked poetry dependencies files
├── poetry.toml        -- Poetry configurations
├── pyproject.toml     -- Python project configuration, including dependencies managed with Poetry
└── README.md          -- This file

Citing

If you use the models or this library in your research, please cite this paper.
Additionally, please indicate the version of the model you used so that your research can be reproduced.

@misc{HuSpaCy:2021,
  title = {{HuSpaCy: an industrial-strength Hungarian natural language processing toolkit}},
  booktitle = {{XVIII. Magyar Sz{\'a}m{\'\i}t{\'o}g{\'e}pes Nyelv{\'e}szeti Konferencia}},
  author = {Orosz, Gy{\"o}rgy and Sz{\'a}nt{\'o}, Zsolt and Berkecz, P{\'e}ter and Szab{\'o}, Gerg{\H o} and Farkas, Rich{\'a}rd},
  location = {{Szeged}},
  year = {in press 2021},
}

License

This library is released under the Apache 2.0 License.

The trained models have their own license (CC BY-SA 4.0) as described on the models page.

Contact

For feature requests and bug reports, please use the GitHub Issue Tracker. Otherwise, please use the Discussion Forums.

Authors

HuSpaCy is developed by the SzegedAI team, coordinated by Orosz György, within the MILAB program of the Hungarian AI National Laboratory.

Comments
  • Transformer v1

    Transformer v1

    Changes

    • Added configs for training a transformer model.
    • Tagger changes relative to the "lg" tagger model: the tagger now also trains the lemmatizer, more specifically the edit tree lemmatizer.
    • Parser changes relative to the "lg" parser model: the parser no longer trains other components (as it did in the "lg" model), only the sentencizer. The parser component itself also changed: dependencies are now learned with the biaffine parser.
    • NER changes relative to the "lg" NER model: the NER component barely changed; only the factory variable (in the components.ner config) switched from plain "ner" to "beam_ner" (see the sketch after this list).
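
    To check which factory a component was built from, the resolved config of a downloaded pipeline can be inspected at runtime. A minimal sketch, assuming the hu_core_news_trf package is installed:

    import spacy

    nlp = spacy.load("hu_core_news_trf")

    # The resolved config records each component's factory,
    # e.g. "beam_ner" for the NER component described above
    print(nlp.config["components"]["ner"]["factory"])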

    spacy evaluate scores

    | Components | Dev | Test |
    |------------|-----|------|
    | TOK | 100.00 | 100.00 |
    | TAG | 97.78 | 97.36 |
    | POS | 97.79 | 97.39 |
    | MORPH | 93.29 | 93.63 |
    | LEMMA | 97.67 | 97.66 |
    | UAS | 91.68 | 91.01 |
    | LAS | 87.29 | 87.20 |
    | NER P | 91.97 | 91.37 |
    | NER R | 92.25 | 91.42 |
    | NER F | 92.11 | 91.40 |
    | SENT P | 97.53 | 98.23 |
    | SENT R | 98.41 | 98.89 |
    | SENT F | 97.97 | 98.56 |
    | SPEED | 3184 | 3262 |
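
    These numbers come from spaCy's built-in evaluation; such a run would look roughly like the following, where the model and data paths are placeholders:

    python -m spacy evaluate ./models/hu_core_news_trf ./data/processed/dev.spacy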

    Remaining tasks

    • Add Zsolti's eval script to the pipeline
    • Add Zsolti's and Peti's multiple root removal script to the pipeline

    opened by SzaboGergo01 3
  • zipfile.BadZipFile: Bad CRC-32 for file

    zipfile.BadZipFile: Bad CRC-32 for file

    Hi,

    I tried to install this tool with python3 and I have given the following error. Can you help me to solve this issue please?

    Best regards, László

    $ pip3 install https://github.com/oroszgy/spacy-hungarian-models/releases/download/hu_core_ud_lg-0.2.0/hu_core_ud_lg-0.2.0-py3-none-any.whl
    Downloading https://github.com/oroszgy/spacy-hungarian-models/releases/download/hu_core_ud_lg-0.2.0/hu_core_ud_lg-0.2.0-py3-none-any.whl (1362.0MB)
    Exception:
    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/pip/basecommand.py", line 215, in main
        status = self.run(options, args)
      File "/usr/lib/python3/dist-packages/pip/commands/install.py", line 353, in run
        wb.build(autobuilding=True)
      File "/usr/lib/python3/dist-packages/pip/wheel.py", line 749, in build
        self.requirement_set.prepare_files(self.finder)
      File "/usr/lib/python3/dist-packages/pip/req/req_set.py", line 380, in prepare_files
        ignore_dependencies=self.ignore_dependencies))
      File "/usr/lib/python3/dist-packages/pip/req/req_set.py", line 620, in _prepare_file
        session=self.session, hashes=hashes)
      File "/usr/lib/python3/dist-packages/pip/download.py", line 821, in unpack_url
        hashes=hashes
      File "/usr/lib/python3/dist-packages/pip/download.py", line 663, in unpack_http_url
        unpack_file(from_path, location, content_type, link)
      File "/usr/lib/python3/dist-packages/pip/utils/__init__.py", line 617, in unpack_file
        flatten=not filename.endswith('.whl')
      File "/usr/lib/python3/dist-packages/pip/utils/__init__.py", line 506, in unzip_file
        data = zip.read(name)
      File "/usr/lib/python3.6/zipfile.py", line 1338, in read
        return fp.read()
      File "/usr/lib/python3.6/zipfile.py", line 858, in read
        buf += self._read1(self.MAX_N)
      File "/usr/lib/python3.6/zipfile.py", line 962, in _read1
        self._update_crc(data)
      File "/usr/lib/python3.6/zipfile.py", line 890, in _update_crc
        raise BadZipFile("Bad CRC-32 for file %r" % self.name)
    zipfile.BadZipFile: Bad CRC-32 for file 'hu_core_ud_lg/hu_core_ud_lg-0.2.0/tagger/model'

    bug 
    opened by laklaja 3
  • Does this model support fine-grained UD features?

    Does this model support fine-grained UD features?

    In spaCy, one can access the fine-grained UD tags via the tag_ attribute (see the documentation). In this model, it only repeats the value of pos_.

    Is there any chance to get the CoNLL-U style FEATS from the tagged data for Hungarian?
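
    For reference, the attributes in question can be compared side by side; in spaCy 3 pipelines the CoNLL-U style FEATS are exposed via token.morph. A minimal sketch with the current model:

    import huspacy

    nlp = huspacy.load()

    for token in nlp("A kutya evett egy csontot."):
        # pos_ is the coarse UD tag, tag_ the fine-grained tag,
        # morph the CoNLL-U style morphological features
        print(token.text, token.pos_, token.tag_, token.morph)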

    enhancement help wanted 
    opened by dlazesz 3
  • Spacy lemmatizer does not work with numbers as expected

    Spacy lemmatizer does not work with numbers as expected

    Dear György, I noticed that spaCy does not always lemmatize numbers (written out in words) correctly. Here is an example:

    import spacy
    import hu_core_ud_lg
    import pandas as pd
    
    nlp = hu_core_ud_lg.load() # loading takes 2-3 minutes
    
    a = "nyolcvanöt"
    b = "nyolcvanhat"
    c = "nyolcvanhét" 
    d = [a, b, c] 
      
    df = pd.DataFrame(d, columns = ['datum']) 
    
    output_lemma = []
    
    for i in df.datum:
        mondat = ""
        doc = nlp(i)
        newtext = [(tok.lemma_, tok.is_title) for tok in doc]
        mondat = ' '.join([tok[0].title() if tok[1] == 1 else tok[0] for tok in newtext])
        output_lemma.append(mondat)
    
    output_lemma 
    ['nyolcvan', 'nyolcvanh', 'nyolcvanhét']
    

    I'm new to GitHub, but I would be very happy to help with developing the package. Could you please tell me whether this would be a project of realistic difficulty for a beginner, or should I rather look for a simpler task first? Thank you very much in advance for your reply!

    enhancement 
    opened by gaborstats 2
  • Error during download UD_Hungarian-Szeged

    Error during download UD_Hungarian-Szeged

    Error during make install. Have the permissions of this dependency changed?

    mkdir -p ./data/raw/UD_Hungarian-Szeged
    git clone git@github.com:UniversalDependencies/UD_Hungarian-Szeged.git ./data/raw/UD_Hungarian-Szeged

    Cloning into './data/raw/UD_Hungarian-Szeged'...
    Host key verification failed.
    fatal: Could not read from remote repository.

    Please make sure you have the correct access rights and the repository exists.
    make: *** [data/raw/UD_Hungarian-Szeged] Error 128

    opened by laklaja 2
  • BadZipFile error

    BadZipFile error

    I get an error when trying to install through pip.

    zipfile.BadZipFile: Bad CRC-32 for file 'hu_core_ud_lg/hu_core_ud_lg-0.2.0/tagger/model'

    any ideas?

    opened by begdaniel 2
  • Incompatibility

    Incompatibility

    Unfortunately, the code (huspacy) and the large model (hu-core-news-lg) require different spaCy versions, with no overlap between them: huspacy requires an older spaCy than the model does. I could not find a way to resolve the incompatibility and would appreciate help with this.

    Thank you! Attila

    bug 
    opened by VamperAta 1
  • Lookup lemmatizer

    Lookup lemmatizer

    Added the lookup lemmatizer and its usage in the hu_core_news_lg model; it returns a lemma based on the token and its POS tag. I also added the lemma smoother to the hu_core_news_lg model, which now reaches an accuracy of 97.36%.
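
    For intuition, a lookup lemmatizer of this kind is essentially a (word, POS) -> lemma table consulted before falling back to the learned lemmatizer. A hypothetical sketch, not the actual component added in this PR:

    # Hypothetical (word, POS) -> lemma table
    LOOKUP = {("evett", "VERB"): "eszik", ("almát", "NOUN"): "alma"}

    def lookup_lemma(word: str, pos: str, fallback: str) -> str:
        # Use the table entry when present, otherwise keep the fallback lemma
        return LOOKUP.get((word, pos), fallback)

    print(lookup_lemma("evett", "VERB", "evett"))  # -> "eszik"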

    enhancement lemmatizer 
    opened by qeterme 1
  • Fix CLI tool paths for hu_core_news_lg

    Fix CLI tool paths for hu_core_news_lg

    Changes proposed in this pull request:

    While training hu_core_news_lg, some CLI tools are referenced as executables in the PATH, causing errors during training. This PR replaces them with the appropriate tool from "tools/cli".

    After submitting

    • [x] All GitHub Actions jobs for my pull request have passed.
    opened by dvarnai 1
  • Bump pyyaml from 5.2 to 5.4

    Bump pyyaml from 5.2 to 5.4

    Bumps pyyaml from 5.2 to 5.4.

    Changelog

    Sourced from pyyaml's changelog.

    5.4 (2021-01-19)

    5.3.1 (2020-03-18)

    • yaml/pyyaml#386 -- Prevents arbitrary code execution during python/object/new constructor

    5.3 (2020-01-06)

    Commits
    • 58d0cb7 5.4 release
    • a60f7a1 Fix compatibility with Jython
    • ee98abd Run CI on PR base branch changes
    • ddf2033 constructor.timezone: _copy & deepcopy
    • fc914d5 Avoid repeatedly appending to yaml_implicit_resolvers
    • a001f27 Fix for CVE-2020-14343
    • fe15062 Add 3.9 to appveyor file for completeness sake
    • 1e1c7fb Add a newline character to end of pyproject.toml
    • 0b6b7d6 Start sentences and phrases for capital letters
    • c976915 Shell code improvements
    • Additional commits viewable in compare view


    dependencies 
    opened by dependabot[bot] 1
  • Bump pygments from 2.5.2 to 2.7.4

    Bump pygments from 2.5.2 to 2.7.4

    Bumps pygments from 2.5.2 to 2.7.4.

    Release notes

    Sourced from pygments's releases.

    2.7.4

    • Updated lexers:

      • Apache configurations: Improve handling of malformed tags (#1656)

      • CSS: Add support for variables (#1633, #1666)

      • Crystal (#1650, #1670)

      • Coq (#1648)

      • Fortran: Add missing keywords (#1635, #1665)

      • Ini (#1624)

      • JavaScript and variants (#1647 -- missing regex flags, #1651)

      • Markdown (#1623, #1617)

      • Shell

        • Lex trailing whitespace as part of the prompt (#1645)
        • Add missing in keyword (#1652)
      • SQL - Fix keywords (#1668)

      • Typescript: Fix incorrect punctuation handling (#1510, #1511)

    • Fix infinite loop in SML lexer (#1625)

    • Fix backtracking string regexes in JavaScript/TypeScript, Modula2 and many other lexers (#1637)

    • Limit recursion with nesting Ruby heredocs (#1638)

    • Fix a few inefficient regexes for guessing lexers

    • Fix the raw token lexer handling of Unicode (#1616)

    • Revert a private API change in the HTML formatter (#1655) -- please note that private APIs remain subject to change!

    • Fix several exponential/cubic-complexity regexes found by Ben Caller/Doyensec (#1675)

    • Fix incorrect MATLAB example (#1582)

    Thanks to Google's OSS-Fuzz project for finding many of these bugs.

    2.7.3

    ... (truncated)


    Commits
    • 4d555d0 Bump version to 2.7.4.
    • fc3b05d Update CHANGES.
    • ad21935 Revert "Added dracula theme style (#1636)"
    • e411506 Prepare for 2.7.4 release.
    • 275e34d doc: remove Perl 6 ref
    • 2e7e8c4 Fix several exponential/cubic complexity regexes found by Ben Caller/Doyensec
    • eb39c43 xquery: fix pop from empty stack
    • 2738778 fix coding style in test_analyzer_lexer
    • 02e0f09 Added 'ERROR STOP' to fortran.py keywords. (#1665)
    • c83fe48 support added for css variables (#1633)
    • Additional commits viewable in compare view


    dependencies 
    opened by dependabot[bot] 1
  • token.children is broken in hu_core_news_trf

    token.children is broken in hu_core_news_trf

    Describe the bug token.children always returns an empty generator in the hu_core_news_trf model.

    To reproduce The code below illustrates the problem (in a Google Colab environment):

    import spacy
    from spacy import displacy

    nlp = spacy.load("hu_core_news_trf")
    doc = nlp('Peti evett egy almát.')
    displacy.render(doc, style="dep", jupyter=True)

    for token in doc:
        print(token.text, token.head, [child for child in token.children])
    

    Based on the displacy output, the model parses the sentence correctly, and the printout confirms that token.head is correct (digging into displacy's code, it turns out that it also uses token.head). Yet reading out the elements of token.children still yields an empty list.

    Peti evett []
    evett evett []
    egy almát []
    almát evett []
    . evett []
    

    Expected behavior token.children should return the children of the given token.

    Additional context Running the code above with hu_core_news_lg produces the correct output.

    Peti evett []
    evett evett [Peti, almát, .]
    egy almát []
    almát evett [egy]
    . evett []
    

    I originally noticed the bug while using the DependencyMatcher, and from there I managed to trace it back to this source.

    bug parser 
    opened by boapps 2
  • Tokenization bug with !.

    Tokenization bug with !.

    Describe the bug When tokenizing text such as [token for token in nlp("A kutya evett egy csontot!.")], the expression !. is treated as a single token and is also merged with the preceding word's token. The problem also occurs with multiple exclamation marks, e.g. !!. or !!!!!!., but not with multiple periods, e.g. !.. !!.. !!... <--- these work properly. It also does not occur when the sequence is not directly preceded by a word (for example, with a space between them, as in csontot !.). If such sequences are chained, e.g. !.!.!.!.!, the entire chain becomes one token; for example, kutya!.!.!.!. is tokenized simply as kutya!.!.!.!.

    Expected behavior The exclamation mark and the periods should be separate tokens, like this: kutya!. <--- kutya ! . Note that question marks, for example, do behave like this; this bug only happens with exclamation marks (as far as I noticed).
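
    A minimal sketch reproducing the report (assuming the hu_core_news_lg model; the exact behavior depends on the model version):

    import huspacy

    nlp = huspacy.load()

    # "!." is expected to split into separate "!" and "." tokens
    print([token.text for token in nlp("A kutya evett egy csontot!.")])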

    bug tokenizer 
    opened by speter00 1