# PyMUSAS
PyMUSAS, the Python Multilingual Ucrel Semantic Analysis System, is currently a rule-based, token-level semantic tagger that can be added to any spaCy pipeline. The tagger is flexible enough to support any semantic tagset; however, the tagset we have concentrated on, and give examples for throughout the documentation, is the UCREL Semantic Analysis System (USAS).
## Documentation

- 📚 Usage Guides - What the package is, tutorials, how-to guides, and explanations.
- 🔎 API Reference - The docstrings of the library, with minimum working examples.
## Install PyMUSAS

PyMUSAS can be installed on all operating systems and supports Python version >= 3.7. To install, run:

```bash
pip install pymusas
```
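Since the package requires Python >= 3.7, you can check your interpreter version before installing. This is a standard-library check, nothing PyMUSAS-specific:

```python
import sys

# PyMUSAS supports Python >= 3.7; fail early on older interpreters.
assert sys.version_info >= (3, 7), 'PyMUSAS requires Python >= 3.7'
print('.'.join(str(part) for part in sys.version_info[:3]))
```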
## Quick example

Here is a quick example of what PyMUSAS can do using the USASRuleBasedTagger, from now on called the USAS tagger. For a full tutorial, which explains all of the steps in this example, see the Using PyMUSAS tutorial in the documentation.
This example will semantically tag, at the token level, some Portuguese text. First we need to download a spaCy Portuguese model (any version will do, but we chose the small version):

```bash
python -m spacy download pt_core_news_sm
```
Then we load the Portuguese spaCy model, add the USAS tagger, and apply the pipeline to the Portuguese text:
```python
import spacy

from pymusas.file_utils import download_url_file
from pymusas.lexicon_collection import LexiconCollection
from pymusas.spacy_api.taggers import rule_based
from pymusas.pos_mapper import UPOS_TO_USAS_CORE

# We exclude ['parser', 'ner'] as these components are typically not needed
# for the USAS tagger
nlp = spacy.load('pt_core_news_sm', exclude=['parser', 'ner'])
# Adds the tagger to the pipeline and returns the tagger
usas_tagger = nlp.add_pipe('usas_tagger')

# The rule-based tagger requires a lexicon
portuguese_usas_lexicon_url = 'https://raw.githubusercontent.com/UCREL/Multilingual-USAS/master/Portuguese/semantic_lexicon_pt.tsv'
portuguese_usas_lexicon_file = download_url_file(portuguese_usas_lexicon_url)
# Includes the POS information
portuguese_lexicon_lookup = LexiconCollection.from_tsv(portuguese_usas_lexicon_file)
# Excludes the POS information
portuguese_lemma_lexicon_lookup = LexiconCollection.from_tsv(portuguese_usas_lexicon_file,
                                                             include_pos=False)
# Add the lexicon information to the USAS tagger within the pipeline
usas_tagger.lexicon_lookup = portuguese_lexicon_lookup
usas_tagger.lemma_lexicon_lookup = portuguese_lemma_lexicon_lookup
# Maps from the POS model tagset to the lexicon POS tagset
usas_tagger.pos_mapper = UPOS_TO_USAS_CORE

text = "O Parque Nacional da Peneda-Gerês é uma área protegida de Portugal, com autonomia administrativa, financeira e capacidade jurídica, criada no ano de 1971, no meio ambiente da Peneda-Gerês."

output_doc = nlp(text)

print('Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.usas_tags}')
```
This will output the following, whereby the USAS tags are a list of the most likely semantic tags; the first tag in the list is the most likely. For more information on the USAS tagset, see the USAS website.
```
Text            Lemma           POS     USAS Tags
O               O               DET     ['Z5']
Parque          Parque          PROPN   ['M2']
Nacional        Nacional        PROPN   ['M7/S2mf']
da              da              ADP     ['Z5']
Peneda-Gerês    Peneda-Gerês    PROPN   ['Z99']
é               ser             AUX     ['A3+', 'Z5']
uma             umar            DET     ['Z99']
área            área            NOUN    ['H2/S5+c', 'X2.2', 'M7', 'A4.1', 'N3.6']
protegida       protegido       ADJ     ['O4.5/A2.1', 'S1.2.5+']
de              de              ADP     ['Z5']
Portugal        Portugal        PROPN   ['Z2', 'Z3c']
,               ,               PUNCT   ['PUNCT']
com             com             ADP     ['Z5']
autonomia       autonomia       NOUN    ['A1.7-', 'G1.1/S7.1+', 'X6+/S5-', 'S5-']
administrativa  administrativo  ADJ     ['S7.1+']
,               ,               PUNCT   ['PUNCT']
financeira      financeiro      ADJ     ['I1', 'I1/G1.1']
e               e               CCONJ   ['Z5']
capacidade      capacidade      NOUN    ['N3.2', 'N3.4', 'N5.1+', 'X9.1+', 'I3.1', 'X9.1']
jurídica        jurídico        ADJ     ['G2.1']
,               ,               PUNCT   ['PUNCT']
criada          criar           VERB    ['I3.1/B4/S2.1f', 'S2.1f%', 'S7.1-/S2mf']
no              o               ADP     ['Z5']
ano             ano             NOUN    ['T1.3', 'P1c']
de              de              ADP     ['Z5']
1971            1971            NUM     ['N1']
,               ,               PUNCT   ['PUNCT']
no              o               ADP     ['Z5']
meio            mear            ADJ     ['M6', 'N5', 'N4', 'T1.2', 'N2', 'X4.2', 'I1.1', 'M3/H3', 'N3.3', 'A4.1', 'A1.1.1', 'T1.3']
ambiente        ambientar       NOUN    ['W5', 'W3', 'E1', 'Y2', 'O4.1']
da              da              ADP     ['Z5']
Peneda-Gerês    Peneda-Gerês    PROPN   ['Z99']
.               .               PUNCT   ['PUNCT']
```
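Since the first tag in each list is the most likely, the per-token output can be post-processed to keep only that tag. A minimal sketch using plain tuples that mirror a few rows of the output above, so it runs without the spaCy pipeline:

```python
# Each row mirrors a line of the tagger output above:
# (text, lemma, POS, USAS tags), most likely tag first.
rows = [
    ('Parque', 'Parque', 'PROPN', ['M2']),
    ('área', 'área', 'NOUN', ['H2/S5+c', 'X2.2', 'M7', 'A4.1', 'N3.6']),
    ('criada', 'criar', 'VERB', ['I3.1/B4/S2.1f', 'S2.1f%', 'S7.1-/S2mf']),
]

def most_likely_tag(usas_tags):
    """Return the first, i.e. most likely, semantic tag in the list."""
    return usas_tags[0]

for text, _, _, tags in rows:
    print(f'{text}\t{most_likely_tag(tags)}')
```

In the real pipeline the same function would be applied to `token._.usas_tags` for each token in the `Doc`.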
## Development

When developing on the project, you will want to install the Python package locally in editable format with all the extra requirements. This can be done like so:

```bash
pip install -e .[tests]
```
For a zsh shell, which is the default shell on newer Macs, you will need to escape the brackets with `\`:

```bash
pip install -e .\[tests\]
```
### Running linters and tests

This code base uses isort, flake8, and mypy to ensure that the format of the code is consistent and contains type hints. The flake8 settings can be found in `./setup.cfg` and the mypy settings within `./pyproject.toml`. To run these linters:

```bash
isort pymusas tests scripts
flake8
mypy
```
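For orientation, a flake8 section in a `setup.cfg` has the following general shape (an illustrative sketch only, with made-up values; see the project's actual `./setup.cfg` for its real settings):

```ini
[flake8]
max-line-length = 100
exclude = .git,__pycache__,build
```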
To run the tests with code coverage (NOTE: these are the code coverage tests that the Continuous Integration (CI) reports at the top of this README; the doc tests are not part of this report):

```bash
coverage run     # Runs the tests (uses pytest)
coverage report  # Produces a report on the test coverage
```
To run the doc tests, which ensure that the examples within the documentation run as expected:

```bash
coverage run -m pytest --doctest-modules pymusas/  # Runs the doc tests
coverage report                                    # Produces a report on the doc tests coverage
```
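Doc tests execute the `>>>` examples embedded in docstrings and check that the printed result matches. As a generic illustration (not a function from PyMUSAS), `--doctest-modules` would run the example in a docstring like this:

```python
def join_tags(usas_tags):
    """Join a list of USAS tags into a single tab-free string.

    >>> join_tags(['A3+', 'Z5'])
    'A3+ Z5'
    """
    return ' '.join(usas_tags)

print(join_tags(['A3+', 'Z5']))
```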
### Creating a build and checking it before release

If you would like to build this project and check it with twine before release, there is a make command that can do this. This command will install `build`, `twine`, and the latest version of `pip`:

```bash
make check-twine
```
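Based on the description above, the target roughly corresponds to steps like the following (a sketch, not the project's actual Makefile):

```makefile
check-twine:
	pip install --upgrade pip build twine
	python -m build
	twine check dist/*
```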
## Team
PyMUSAS is an open-source project that has been created and funded by the University Centre for Computer Corpus Research on Language (UCREL) at Lancaster University. For more information on who has contributed to this code base see the contributions page.