Tool to add main subject to items on Wikidata using a WMFs CirrusSearch for named entity recognition or a manually supplied list of QIDs

Overview

ItemSubjector

Tool made to add main subject statements to items based on the title using a home-brewed CirrusSearch-based Named Entity Recognition algorithm. bild The tool running in PAWS adding manually found main subject QIDs

Features

This tool has the following features

  • adding a list of main subjects to items (for now only scholarly articles are supported)
  • automatically extracting n-grams from labels of 10.000 articles (this is not that powerful because users know better than scikit what subjects are meaningful to have on our scientific articles)

Thanks

During the development of this tool the author got a help multiple times from Jan Ainali and Jon Søby with figuring out how to query the API using the CirrusSearch extensions and to remove more specific main subjects from the query results.

A special thanks also to Magnus Sälgö for his valuable input and ideas, e.g. to search for aliases also.

Installation

Clone the repo and run

pip install -r requirements.txt

to install all requirements.

PAWS

The tool runs in PAWS with no known issues.

  • log in to PAWS
  • open a terminal
  • make sure you clone somewhere not public cd /tmp
  • run git clone https://github.com/dpriskorn/ItemSubjector.git
  • run the pip-command above
  • copy config cp config.example.py config.py
  • edit nano config.py and add your credentials

Setup

Like my other tools, copy config.example.py -> config.py and enter the botusername (e.g. So9q@itemsubjector) and password

Use

It has 2 modes:

  1. automatic finding n-grams and trying to detect items that match (default if no arguments are given on the command line)
  2. add main subject items to scholarly articles

Both modes conclude by adding the validated or supplied QID to all scientific articles where the n-gram or search string appears (with spaces around it) in the label of the target item (e.g. scientific article).

Adding QIDs manually

Always provide the most precise subjects first

Run the script with the -l or -list argument followed by one or more QIDs:

  • python itemsubjector.py -l Q108528107

Here is a more advanced example: The first is metastatic breast cancer which is a subclass of the second breast cancer

  • python itemsubjector.py -l Q108528107 Q128581

In this case the tool is smart enough (thanks to Jan Ainali) to first add metastatic breast cancer to items and then exclude those items when adding the more general subject afterwards.

This way we avoid redundancy since we want the most specific subjects on the items and not all the general ones above it in the classification system.

Please investigate before adding broad subjects and try to nail down specific subjects and add them first. If you are unsure, please ask on-wiki or in the Wikicite Telegram group

License

GPLv3+

Comments
  • PAWS v0.3.3: no_alias_for_scholarly_items error

    PAWS v0.3.3: no_alias_for_scholarly_items error

    I get this error in PAWS on v0.3.3 when doing single subject:

    Picking a random main subject
    Working on naturgas
    Do you want to continue? [Y/Enter/n]: 
    Traceback (most recent call last):
      File "/home/paws/.itemsubjector/itemsubjector.py", line 8, in <module>
        itemsubjector.run()
      File "/home/paws/.itemsubjector/src/__init__.py", line 79, in run
        main_subjects.get_validated_main_subjects_as_jobs()
      File "/home/paws/.itemsubjector/src/models/main_subjects.py", line 108, in get_validated_main_subjects_as_jobs
        job = main_subject_item.fetch_items_and_get_job_if_confirmed()
      File "/home/paws/.itemsubjector/src/models/wikimedia/wikidata/item/main_subject.py", line 240, in fetch_items_and_get_job_if_confirmed
        return self.__fetch_and_parse__()
      File "/home/paws/.itemsubjector/src/models/wikimedia/wikidata/item/main_subject.py", line 250, in __fetch_and_parse__
        self.__prepare_before_fetching_items__()
      File "/home/paws/.itemsubjector/src/models/wikimedia/wikidata/item/main_subject.py", line 188, in __prepare_before_fetching_items__
        self.__extract_search_strings__()
      File "/home/paws/.itemsubjector/src/models/wikimedia/wikidata/item/main_subject.py", line 141, in __extract_search_strings__
        elif self.id in config.no_alias_for_scholarly_items:
    AttributeError: module 'config' has no attribute 'no_alias_for_scholarly_items'
    

    My command was poetry run python itemsubjector.py -a Q40858

    opened by Ainali 10
  • ModuleNotFoundError: No module named 'config.items'

    ModuleNotFoundError: No module named 'config.items'

    On checking the latest version and 0.3-alpha2, I am getting the following error:

    Traceback (most recent call last):
      File "itemsubjector.py", line 3, in <module>
        import src
      File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/tmp/ItemSubjector-0.3-alpha2/src/__init__.py", line 11, in <module>
        from src.helpers.console import (
      File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/tmp/ItemSubjector-0.3-alpha2/src/helpers/console.py", line 11, in <module>
        from src.models.batch_job import BatchJob
      File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/tmp/ItemSubjector-0.3-alpha2/src/models/batch_job.py", line 3, in <module>
        from src.models.items import Items
      File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/tmp/ItemSubjector-0.3-alpha2/src/models/items/__init__.py", line 10, in <module>
        from src.models.wikimedia.wikidata.sparql_item import SparqlItem
      File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/tmp/ItemSubjector-0.3-alpha2/src/models/wikimedia/wikidata/sparql_item.py", line 4, in <module>
        import config.items
    ModuleNotFoundError: No module named 'config.items'
    
    opened by johnsamuelwrites 7
  • Error in PAWS on v0.3.2 when doing single subject

    Error in PAWS on v0.3.2 when doing single subject

    After installing in PAWS I ran this command:

    poetry run python itemsubjector.py -a Q40858

    I then selected 2 to work on Riksdagen documents. This caused this screen:

    Working on naturgas, see http://www.wikidata.org/entity/Q40858
    Got a total of 78 items
    Please keep an eye on the lag of the WDQS cluster here and avoid working if it is over a few minutes.
    https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&viewPanel=8&from=now-30m&to=now&refresh=1d You can see if any lagging servers are pooled here
    https://config-master.wikimedia.org/pybal/eqiad/wdqs
    If any enabled servers are lagging more than 5-10 minutes you can search phabricator for open tickets to see if the team is on it.
    If you don't find any feel free to create a new ticket like this:
    https://phabricator.wikimedia.org/T291621
    Running 1 job(s) with a total of 1 items non-interactively now. You can take a coffee break and lean back :)
    Traceback (most recent call last):
      File "/home/paws/.itemsubjector/itemsubjector.py", line 8, in <module>
        itemsubjector.run()
      File "/home/paws/.itemsubjector/src/__init__.py", line 164, in run
        handle_job_preparation_or_run_directly_if_any_jobs(
      File "/home/paws/.itemsubjector/src/helpers/jobs.py", line 154, in handle_job_preparation_or_run_directly_if_any_jobs
        batchjobs.run_jobs()
      File "/home/paws/.itemsubjector/src/models/batch_jobs.py", line 45, in run_jobs
        job.suggestion.add_to_items(
      File "/home/paws/.itemsubjector/src/models/suggestion.py", line 111, in add_to_items
        f"to {clean_rich_formatting(target_item.label)}"
      File "/home/paws/.itemsubjector/src/helpers/cleaning.py", line 24, in clean_rich_formatting
        return label.replace("[/", "['/")
    AttributeError: 'NoneType' object has no attribute 'replace'
    
    opened by Ainali 4
  • Labels with apostrophe(') do not work

    Labels with apostrophe(') do not work

    Labels with apostrophe(') currently do not work. I think that an escape character needs to be added, before sending the query string to WDQS.

    Take for example:

    Alzheimer's disease (Q11081)

    returns the following error

    Fetching items with labels that have one of the search strings by running a total of 11 queries on WDQS...INFO:backoff:Backing off execute_sparql_query(...) for 1.0s (requests.exceptions.HTTPError: 400 Client Error: Bad Request for url

    bug 
    opened by johnsamuelwrites 3
  • Error on pip install

    Error on pip install

    In v0.2 I am trying pip install -r requirements.txt in PAWS and get this error message:

    Collecting wikibaseintegrator
      Cloning git://github.com/LeMyst/WikibaseIntegrator (to revision v0.12.0.dev5) to /tmp/pip-install-h0jhod33/wikibaseintegrator_2f94ad8cb5b244b3816e997a960745eb
      Running command git clone --filter=blob:none --quiet git://github.com/LeMyst/WikibaseIntegrator /tmp/pip-install-h0jhod33/wikibaseintegrator_2f94ad8cb5b244b3816e997a960745eb
      fatal: unable to connect to github.com:
      github.com[0: 140.82.113.4]: errno=Connection timed out
    
      error: subprocess-exited-with-error
      
      × git clone --filter=blob:none --quiet git://github.com/LeMyst/WikibaseIntegrator /tmp/pip-install-h0jhod33/wikibaseintegrator_2f94ad8cb5b244b3816e997a960745eb did not run successfully.
      │ exit code: 128
      ╰─> See above for output.
      
      note: This error originates from a subprocess, and is likely not a problem with pip.
    error: subprocess-exited-with-error
    
    × git clone --filter=blob:none --quiet git://github.com/LeMyst/WikibaseIntegrator /tmp/pip-install-h0jhod33/wikibaseintegrator_2f94ad8cb5b244b3816e997a960745eb did not run successfully.
    │ exit code: 128
    ╰─> See above for output.
    
    note: This error originates from a subprocess, and is likely not a problem with pip.
    

    What should I do?

    opened by Ainali 2
  • OAuth authentification error

    OAuth authentification error

    Thanks for correcting the previous errors in 0.3-alpha3.

    I checked out the latest commit in the main branch and 0.3-alpha4. And now, I now face the OAuth error.

    1. I checked with username/password
    2. I checked with botname/password
      File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/my_venv/lib/python3.7/site-packages/oauthlib/oauth2/rfc6749/parameters.py", line 432, in validate_token_parameters
        raise_from_error(params.get('error'), params)
      File "/mnt/nfs/labstore-secondary-tools-project/itemsubjector-jsamwrites/itemsubjector/my_venv/lib/python3.7/site-packages/oauthlib/oauth2/rfc6749/errors.py", line 402, in raise_from_error
        raise cls(**kwargs)
    oauthlib.oauth2.rfc6749.errors.InvalidClientIdError: (invalid_request) The request is missing a required parameter, includes an invalid parameter value, includes a parameter more than once, or is otherwise malformed.
    

    Any idea on this error.

    I checked with other scripts of mine. There are no issues.

    opened by johnsamuelwrites 2
  • Unable to remove articles belonging to specific subjects from the list of articles related to generic subjects

    Unable to remove articles belonging to specific subjects from the list of articles related to generic subjects

    I want to add the following main subjects to the articles

    1. scoping review protocol (Q108684373): very specific
    2. scoping review (Q101116078): generic

    and I run the following command (from specific topics to generic topics)

    $ python itemsubjector.py -na -l Q108684373 Q101116078
    

    Even though the addition of 'Q108684373' is complete, I see articles with the text 'scoping review protocol' in the list for 'scoping review'.

    This issue may be related to Issue 14

    bug 
    opened by johnsamuelwrites 2
  • Ask when limit reached

    Ask when limit reached

    This pull request implements asking in the end when --limit is used. Hopefully that improves the UX.

    Also pydantics BaseModel is now used for validation.

    The QS export support was dropped to reduce complexity. Also adding main subjects already present on articles was dropped.

    opened by dpriskorn 1
  • Implement --exclude List[QID] for excluding matches

    Implement --exclude List[QID] for excluding matches

    When working on HCV https://www.wikidata.org/wiki/Q154869 I see that there is another item with "HCV 229E" and I want to make sure that they don't get included in my batch. bild

    enhancement 
    opened by dpriskorn 1
  • Match against a list of existing main subjects on scholarly articles

    Match against a list of existing main subjects on scholarly articles

    This is useful, because there are already many thousand different main subjects and many of them are not matched properly with all relevant articles yet.

    enhancement 
    opened by dpriskorn 1
  • Enable searching for alias also

    Enable searching for alias also

    As a user, I want to choose whether to search for items matching the label and or one of the aliases so I get as many hits as possible.

    Pseudo code: also fetch the aliases from WDQS ask user for which ones to include (or all) https://console-menu.readthedocs.io/en/latest/consolemenu/MultiSelectMenu.html add them to a new attribute in class Labels: search_strings fetch based on that (with one query if possible) use https://pmitzias.com/SPARQLBurger/docs.html to generate the SPARQL query using UNION

    opened by dpriskorn 1
  • Bump certifi from 2022.9.24 to 2022.12.7

    Bump certifi from 2022.9.24 to 2022.12.7

    Bumps certifi from 2022.9.24 to 2022.12.7.

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Add a Web UI

    Add a Web UI

    Similar to QuickStatements batches, ItemsSubjector could have a flask frontend that runs in Toolforge and execute the users batches.

    This requires oauth and flask. Lucas made a good toolforge flask template to get started.

    opened by dpriskorn 1
Releases(v0.3.4)
  • v0.3.4(Oct 18, 2022)

    What's Changed

    • Delete Suggestion by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/77
    • Fix limit_to_items_without_p921 by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/78

    Full Changelog: https://github.com/dpriskorn/ItemSubjector/compare/v0.3.3...v0.3.4

    Source code(tar.gz)
    Source code(zip)
  • v0.3.3(Oct 6, 2022)

    What's Changed

    • disable --limit-to-items-without-p921 for now
    • simplify configuration into one file
    • disable scientific journal task for now
    • Add thesis query to the scholarly items task by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/72
    • Fix thesis query by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/73
    • New MainSubjects class by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/74
    • Rewrite classes for better readability and debugging by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/71

    Full Changelog: https://github.com/dpriskorn/ItemSubjector/compare/v0.3.2...v0.3.3

    Source code(tar.gz)
    Source code(zip)
  • v0.3.2(Oct 2, 2022)

    What's Changed

    • Add requirements.txt and document release workflow by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/67
    • Update documentation by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/68
    • Prepare 0.3.2 by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/69

    Full Changelog: https://github.com/dpriskorn/ItemSubjector/compare/v0.3.1...v0.3.2

    Source code(tar.gz)
    Source code(zip)
  • v0.3.1(Oct 1, 2022)

    What's Changed

    • Updated the README to instruct the user to use poetry to setup the environment from now on.
    • Prepare for 0.3.1 by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/65

    Full Changelog: https://github.com/dpriskorn/ItemSubjector/compare/v0.3.0...v0.3.1

    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(Oct 1, 2022)

    What's Changed

    • Implement allow list for aliases shorter than 5 characters by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/32
    • Enable fetching main subjects from sparql for all tasks by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/33
    • Improve handling of results by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/38
    • Support exporting to dataframe by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/39
    • Ask when limit reached by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/42
    • Implement blocklist and no_alias lists by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/44
    • Refactor configuration files by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/45
    • Show number of queries also by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/46
    • Lookup aliases and discard those that appear in a label by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/47
    • Use set in suggestion by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/51
    • Add ItemSubjector class by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/52
    • fix list_of_allowed_aliases by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/54
    • Fix git requirement invocation and update WBI by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/56
    • Refactor login and update to new WBI syntax by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/57
    • Manage with poetry by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/63

    Full Changelog: https://github.com/dpriskorn/ItemSubjector/compare/v0.2...v0.3.0

    Source code(tar.gz)
    Source code(zip)
  • 0.3-alpha4(Apr 11, 2022)

    What's Changed

    • fix list_of_allowed_aliases by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/54
    • Fix git requirement invocation and update WBI by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/56
    • Refactor login and update to new WBI syntax by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/57

    Full Changelog: https://github.com/dpriskorn/ItemSubjector/compare/0.3-alpha3...0.3-alpha4

    Source code(tar.gz)
    Source code(zip)
  • 0.3-alpha3(Mar 31, 2022)

    What's Changed

    • Show number of queries also by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/46
    • Lookup aliases and discard those that appear in a label by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/47 (Big thanks to Arthur Smith in the Wikicite Telegram channel for the suggestion)
    • Use set in suggestion to avoid duplicate search expressions by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/51
    • Add ItemSubjector class by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/52

    Full Changelog: https://github.com/dpriskorn/ItemSubjector/compare/0.3-alpha2...0.3-alpha3

    Source code(tar.gz)
    Source code(zip)
  • 0.3-alpha2(Feb 25, 2022)

    What's Changed

    • Refactor configuration files by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/45

    Full Changelog: https://github.com/dpriskorn/ItemSubjector/compare/v0.2...0.3-alpha2

    Source code(tar.gz)
    Source code(zip)
  • 0.3-alpha1(Feb 25, 2022)

  • 0.3-alpha0(Feb 24, 2022)

    What's Changed

    • Implement allow list for aliases shorter than 5 characters by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/32
    • Enable fetching main subjects from sparql for all tasks by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/33
    • Improve handling of results by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/38
    • Support exporting to dataframe by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/39
    • Ask when limit reached by @dpriskorn in https://github.com/dpriskorn/ItemSubjector/pull/42

    Full Changelog: https://github.com/dpriskorn/ItemSubjector/compare/v0.2...0.3-alpha0

    Source code(tar.gz)
    Source code(zip)
  • v0.2(Oct 17, 2021)

    Changes since v0.2-beta2

    • Clean ™ and ® from the search strings.
    • New flag --no-confirmation.
    • Bugfix with label being None but not skipped.
    Source code(tar.gz)
    Source code(zip)
  • v0.2-beta2(Oct 6, 2021)

    Changes since beta1

    • Implement skipping of aliases shorter than 5 chars to avoid false positives.
    • New Kubernetes_HOWTO.md file that details how to set it up
    • File hashing is now applied so that a running job will not delete the new prepared list of jobs (if the hash is different). This ensures that the user after starting a batch can start immediately with preparing a new batch :)
    • Improve the UI messages in some places to be more clear.
    • and more...
    Source code(tar.gz)
    Source code(zip)
  • v0.2-beta1(Oct 3, 2021)

  • v0.2-beta0(Oct 3, 2021)

    Changes since alpha3

    • Improved handling of job list
    • Now automatically skips items with no label in the working language of the task, but prints a link, so the item can be improved by the user.
    • Print more statistics to the user, so they can judge when to run the job list
    • Update Wikibase Integrator
    Source code(tar.gz)
    Source code(zip)
  • v0.2-alpha3(Oct 1, 2021)

  • v0.2-alpha2(Oct 1, 2021)

    Changes since alpha1

    • Adding main subjects based on a SPARQL query is now supported! Use it like this --sparql QUERY
    • New script to fetch a random sample of 100,000 main subjects from the currently 14M statements on scholarly articles to use when matching using -m.
    • Include preprints in the scholarly article task.
    • Add new task: Thesis'.
    • Increase the size of samples shown with the size of the batch so that for every 20 items a random will be picked and shown to the user for validation. Also warn for big batches > 4000 items.
    • Cleanup the README and explain that the script can no longer handle broad/narrow in the same run. You have to run it multiple times if you want to make sure you are not adding both a narrow and broad subject to items.
    • Add more confirmations and better UI messages.
    • and more...
    Source code(tar.gz)
    Source code(zip)
  • v0.2-alpha1(Sep 25, 2021)

    Changes since v0.2-alpha0

    User facing:

    • new flag: --limit-to-items-without-p921 which is useful sometimes
    • more information when validating batches
    • print statistics of the prepared jobs
    • handle apostrophes and backslashes correctly when matching
    • add pickle path in config.example.py that works on WCS Kubernetes when the tempdir is not available in the pod running the job
    • add WCS Kubernetes beta cluster helper script

    Code changes:

    • clean up dead code
    • move best practice into Tasks
    • refactor more functionality out of main.py to improve code readability
    • ...and more
    Source code(tar.gz)
    Source code(zip)
  • v0.2-alpha0(Sep 24, 2021)

    Changes since v0.1

    • removed all NER and n-gram code
    • added batch mode
    • added existing main subject matcher
    • updated readme.md
    • added support for Python 3.7
    • added better output logging when running a job, so the user sees both the current job number vs. the total number of jobs and current number of item being processed vs. the total number.
    • added display of the total runtime when a batch finishes.
    • ...and more.
    Source code(tar.gz)
    Source code(zip)
  • v0.1(Sep 24, 2021)

Owner
Dennis Priskorn
I love designing and writing software using mainly Python but also know some C#, Rust and PHP. I'm studying computer science at my local university
Dennis Priskorn
A spaCy wrapper of OpenTapioca for named entity linking on Wikidata

spaCyOpenTapioca A spaCy wrapper of OpenTapioca for named entity linking on Wikidata. Table of contents Installation How to use Local OpenTapioca Vizu

Universitätsbibliothek Mannheim 80 Jan 3, 2023
A text augmentation tool for named entity recognition.

neraug This python library helps you with augmenting text data for named entity recognition. Augmentation Example Reference from An Analysis of Simple

Hiroki Nakayama 48 Oct 11, 2022
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.6k Dec 27, 2022
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.5k Feb 11, 2021
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.5k Feb 17, 2021
Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks

NERDA Not only is NERDA a mesmerizing muppet-like character. NERDA is also a python package, that offers a slick easy-to-use interface for fine-tuning

Ekstra Bladet 141 Dec 30, 2022
CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

CrossNER is a fully-labeled collected of named entity recognition (NER) data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specialized entity categories for different domains.

Zihan Liu 89 Nov 10, 2022
PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT for each task independently.

VinAI Research 109 Dec 2, 2022
Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

Hiroki Nakayama 1.5k Dec 5, 2022
Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

Hiroki Nakayama 1.4k Feb 17, 2021
Pytorch-Named-Entity-Recognition-with-BERT

BERT NER Use google BERT to do CoNLL-2003 NER ! Train model using Python and Inference using C++ ALBERT-TF2.0 BERT-NER-TENSORFLOW-2.0 BERT-SQuAD Requi

Kamal Raj 1.1k Dec 25, 2022
Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Pytorch-NLU,一个中文文本分类、序列标注工具包,支持中文长文本、短文本的多类、多标签分类任务,支持中文命名实体识别、词性标注、分词等序列标注任务。 Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

null 186 Dec 24, 2022
Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

Implemented shortest-circuit disambiguation, maximum probability disambiguation, HMM-based lexical annotation and BiLSTM+CRF-based named entity recognition

null 0 Feb 13, 2022
Use Google's BERT for named entity recognition (CoNLL-2003 as the dataset).

For better performance, you can try NLPGNN, see NLPGNN for more details. BERT-NER Version 2 Use Google's BERT for named entity recognition (CoNLL-2003

Kaiyinzhou 1.2k Dec 26, 2022
Named Entity Recognition API used by TEI Publisher

TEI Publisher Named Entity Recognition API This repository contains the API used by TEI Publisher's web-annotation editor to detect entities in the in

e-editiones.org 14 Nov 15, 2022
Nested Named Entity Recognition

Nested Named Entity Recognition Training Dataset: CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark url: https://tianchi.aliyun.

null 8 Dec 25, 2022
RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

RoNER RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2. It is meant to be an easy to use, hi

Stefan Dumitrescu 9 Nov 7, 2022
Spacy-ginza-ner-webapi - Named Entity Recognition API with spaCy and GiNZA

Named Entity Recognition API with spaCy and GiNZA I wrote a blog post about this

Yuki Okuda 3 Feb 27, 2022
Code for "Parallel Instance Query Network for Named Entity Recognition", accepted at ACL 2022.

README Code for Two-stage Identifier: "Parallel Instance Query Network for Named Entity Recognition", accepted at ACL 2022. For details of the model a

Yongliang Shen 45 Nov 29, 2022