# PyMUSAS
PyMUSAS, the Python Multilingual Ucrel Semantic Analysis System, is currently a rule-based, token-level semantic tagger that can be added to any spaCy pipeline. The tagger is flexible enough to support any semantic tagset; however, the tagset we have concentrated on, and give examples for throughout the documentation, is the UCREL Semantic Analysis System (USAS).
## Documentation

- 📚 Usage Guides - What the package is, tutorials, how-to guides, and explanations.
- 🔎 API Reference - The docstrings of the library, with minimum working examples.
## Install PyMUSAS

PyMUSAS can be installed on all operating systems and supports Python version >= 3.7. To install, run:

```bash
pip install pymusas
```
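Since the package requires Python >= 3.7, you can check your interpreter version before installing. This is a standard-library check, nothing PyMUSAS-specific:

```python
import sys

# PyMUSAS supports Python >= 3.7; fail early on older interpreters.
assert sys.version_info >= (3, 7), 'PyMUSAS requires Python >= 3.7'
print('.'.join(str(part) for part in sys.version_info[:3]))
```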
## Quick example

Here is a quick example of what PyMUSAS can do using the USASRuleBasedTagger, from now on called the USAS tagger. For a full tutorial, which explains all of the steps in this example, see the Using PyMUSAS tutorial in the documentation.
This example will semantically tag, at the token level, some Portuguese text. First we need to download a spaCy Portuguese model (any version will do, but we chose the small version):

```bash
python -m spacy download pt_core_news_sm
```
Then we load the Portuguese spaCy model, add the USAS tagger, and apply the pipeline to the Portuguese text:
```python
import spacy

from pymusas.file_utils import download_url_file
from pymusas.lexicon_collection import LexiconCollection
from pymusas.spacy_api.taggers import rule_based
from pymusas.pos_mapper import UPOS_TO_USAS_CORE

# We exclude ['parser', 'ner'] as these components are typically not needed
# for the USAS tagger
nlp = spacy.load('pt_core_news_sm', exclude=['parser', 'ner'])
# Adds the tagger to the pipeline and returns the tagger
usas_tagger = nlp.add_pipe('usas_tagger')

# The rule-based tagger requires a lexicon
portuguese_usas_lexicon_url = 'https://raw.githubusercontent.com/UCREL/Multilingual-USAS/master/Portuguese/semantic_lexicon_pt.tsv'
portuguese_usas_lexicon_file = download_url_file(portuguese_usas_lexicon_url)
# Includes the POS information
portuguese_lexicon_lookup = LexiconCollection.from_tsv(portuguese_usas_lexicon_file)
# Excludes the POS information
portuguese_lemma_lexicon_lookup = LexiconCollection.from_tsv(portuguese_usas_lexicon_file,
                                                             include_pos=False)
# Add the lexicon information to the USAS tagger within the pipeline
usas_tagger.lexicon_lookup = portuguese_lexicon_lookup
usas_tagger.lemma_lexicon_lookup = portuguese_lemma_lexicon_lookup
# Maps from the POS model tagset to the lexicon POS tagset
usas_tagger.pos_mapper = UPOS_TO_USAS_CORE

text = "O Parque Nacional da Peneda-Gerês é uma área protegida de Portugal, com autonomia administrativa, financeira e capacidade jurídica, criada no ano de 1971, no meio ambiente da Peneda-Gerês."

output_doc = nlp(text)

print('Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.usas_tags}')
```
This will output the following, whereby the USAS tags are a list of the most likely semantic tags; the first tag in the list is the most likely. For more information on the USAS tagset, see the USAS website.
```
Text            Lemma           POS     USAS Tags
O               O               DET     ['Z5']
Parque          Parque          PROPN   ['M2']
Nacional        Nacional        PROPN   ['M7/S2mf']
da              da              ADP     ['Z5']
Peneda-Gerês    Peneda-Gerês    PROPN   ['Z99']
é               ser             AUX     ['A3+', 'Z5']
uma             umar            DET     ['Z99']
área            área            NOUN    ['H2/S5+c', 'X2.2', 'M7', 'A4.1', 'N3.6']
protegida       protegido       ADJ     ['O4.5/A2.1', 'S1.2.5+']
de              de              ADP     ['Z5']
Portugal        Portugal        PROPN   ['Z2', 'Z3c']
,               ,               PUNCT   ['PUNCT']
com             com             ADP     ['Z5']
autonomia       autonomia       NOUN    ['A1.7-', 'G1.1/S7.1+', 'X6+/S5-', 'S5-']
administrativa  administrativo  ADJ     ['S7.1+']
,               ,               PUNCT   ['PUNCT']
financeira      financeiro      ADJ     ['I1', 'I1/G1.1']
e               e               CCONJ   ['Z5']
capacidade      capacidade      NOUN    ['N3.2', 'N3.4', 'N5.1+', 'X9.1+', 'I3.1', 'X9.1']
jurídica        jurídico        ADJ     ['G2.1']
,               ,               PUNCT   ['PUNCT']
criada          criar           VERB    ['I3.1/B4/S2.1f', 'S2.1f%', 'S7.1-/S2mf']
no              o               ADP     ['Z5']
ano             ano             NOUN    ['T1.3', 'P1c']
de              de              ADP     ['Z5']
1971            1971            NUM     ['N1']
,               ,               PUNCT   ['PUNCT']
no              o               ADP     ['Z5']
meio            mear            ADJ     ['M6', 'N5', 'N4', 'T1.2', 'N2', 'X4.2', 'I1.1', 'M3/H3', 'N3.3', 'A4.1', 'A1.1.1', 'T1.3']
ambiente        ambientar       NOUN    ['W5', 'W3', 'E1', 'Y2', 'O4.1']
da              da              ADP     ['Z5']
Peneda-Gerês    Peneda-Gerês    PROPN   ['Z99']
.               .               PUNCT   ['PUNCT']
```
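Since the first tag in each list is the most likely, the per-token output can be post-processed to keep only that tag. A minimal sketch using plain tuples that mirror a few rows of the output above, so it runs without the spaCy pipeline:

```python
# Each row mirrors a line of the tagger output above:
# (text, lemma, POS, USAS tags), most likely tag first.
rows = [
    ('Parque', 'Parque', 'PROPN', ['M2']),
    ('área', 'área', 'NOUN', ['H2/S5+c', 'X2.2', 'M7', 'A4.1', 'N3.6']),
    ('criada', 'criar', 'VERB', ['I3.1/B4/S2.1f', 'S2.1f%', 'S7.1-/S2mf']),
]

def most_likely_tag(usas_tags):
    """Return the first, i.e. most likely, semantic tag in the list."""
    return usas_tags[0]

for text, _, _, tags in rows:
    print(f'{text}\t{most_likely_tag(tags)}')
```

In the real pipeline the same function would be applied to `token._.usas_tags` for each token in the `Doc`.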
## Development

When developing on the project, you will want to install the Python package locally in editable format with all the extra requirements. This can be done like so:

```bash
pip install -e .[tests]
```
For a zsh shell, which is the default shell on newer Macs, you will need to escape the brackets with `\`:

```bash
pip install -e .\[tests\]
```
### Running linters and tests

This code base uses isort, flake8, and mypy to ensure that the format of the code is consistent and contains type hints. The flake8 settings can be found in `./setup.cfg` and the mypy settings within `./pyproject.toml`. To run these linters:

```bash
isort pymusas tests scripts
flake8
mypy
```
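For orientation, a flake8 section in a `setup.cfg` has the following general shape (an illustrative sketch only, with made-up values; see the project's actual `./setup.cfg` for its real settings):

```ini
[flake8]
max-line-length = 100
exclude = .git,__pycache__,build
```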
To run the tests with code coverage (NOTE: these are the code coverage tests that the Continuous Integration (CI) reports at the top of this README; the doc tests are not part of this report):

```bash
coverage run     # Runs the tests (uses pytest)
coverage report  # Produces a report on the test coverage
```
To run the doc tests, which ensure that the examples within the documentation run as expected:

```bash
coverage run -m pytest --doctest-modules pymusas/  # Runs the doc tests
coverage report                                    # Produces a report on the doc tests coverage
```
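Doc tests execute the `>>>` examples embedded in docstrings and check that the printed result matches. As a generic illustration (not a function from PyMUSAS), `--doctest-modules` would run the example in a docstring like this:

```python
def join_tags(usas_tags):
    """Join a list of USAS tags into a single tab-free string.

    >>> join_tags(['A3+', 'Z5'])
    'A3+ Z5'
    """
    return ' '.join(usas_tags)

print(join_tags(['A3+', 'Z5']))
```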
### Creating a build and checking it before release

If you would like to build this project and check it with twine before release, there is a make command that can do this. This command will install `build`, `twine`, and the latest version of `pip`:

```bash
make check-twine
```
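Based on the description above, the target roughly corresponds to steps like the following (a sketch, not the project's actual Makefile):

```makefile
check-twine:
	pip install --upgrade pip build twine
	python -m build
	twine check dist/*
```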
## Team
PyMUSAS is an open-source project that has been created and funded by the University Centre for Computer Corpus Research on Language (UCREL) at Lancaster University. For more information on who has contributed to this code base see the contributions page.