BookNLP, a natural language processing pipeline for books

Overview

BookNLP

BookNLP is a natural language processing pipeline that scales to books and other long documents (in English), including:

  • Part-of-speech tagging
  • Dependency parsing
  • Entity recognition
  • Character name clustering (e.g., "Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer" -> TOM_SAWYER) and coreference resolution
  • Quotation speaker identification
  • Supersense tagging (e.g., "animal", "artifact", "body", "cognition", etc.)
  • Event tagging
  • Referential gender inference (TOM_SAWYER -> he/him/his)

BookNLP ships with two models, which share an architecture but use different underlying BERT sizes. The larger and more accurate big model is suited to GPUs and multi-core computers; the faster small model is more appropriate for personal computers. The table below compares the two in both speed and accuracy on the tasks that BookNLP performs.

Measure | Small | Big
Entity tagging (F1) | 88.2 | 90.0
Supersense tagging (F1) | 73.2 | 76.2
Event tagging (F1) | 70.6 | 74.1
Coreference resolution (Avg. F1) | 76.4 | 79.0
Speaker attribution (B3) | 86.4 | 89.9
CPU time, 2019 MacBook Pro (mins.)* | 3.6 | 15.4
CPU time, 10-core server (mins.)* | 2.4 | 5.2
GPU time, Titan RTX (mins.)* | 2.1 | 2.2

*Timings measure how long BookNLP takes to process a sample book, The Secret Garden (99K tokens). To explore running BookNLP in Google Colab on a GPU, see this notebook.

Installation

conda create --name booknlp python=3.7
conda activate booknlp
  • If using a GPU, install pytorch for your system and CUDA version by following installation instructions on https://pytorch.org.

  • Install booknlp and download the spaCy model.

pip install booknlp
python -m spacy download en_core_web_sm

Usage

from booknlp.booknlp import BookNLP

model_params={
    "pipeline":"entity,quote,supersense,event,coref",
    "model":"big"
}
	
booknlp=BookNLP("en", model_params)

# Input file to process
input_file="input_dir/bartleby_the_scrivener.txt"

# Output directory to store resulting files in
output_directory="output_dir/bartleby/"

# Files within this directory will be named ${book_id}.entities, ${book_id}.tokens, etc.
book_id="bartleby"

booknlp.process(input_file, output_directory, book_id)

This runs the full BookNLP pipeline. To cut down on computation time, you can run only some components of the pipeline by listing them in the "pipeline" parameter (e.g., to run only entity tagging and event tagging, change model_params above to include "pipeline":"entity,event").
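A minimal sketch of building the parameter dictionary for a partial run. The helper name make_model_params is hypothetical and not part of BookNLP; the component names are the five that appear in the "pipeline" string of the usage example. Whether some components depend on others is not checked here.

```python
# Component names taken from the "pipeline" string in the usage example.
KNOWN_COMPONENTS = ("entity", "quote", "supersense", "event", "coref")


def make_model_params(components, model="small"):
    """Build a model_params dict that runs only the requested components.

    Hypothetical helper for illustration; it only assembles the dictionary.
    """
    unknown = [c for c in components if c not in KNOWN_COMPONENTS]
    if unknown:
        raise ValueError(f"unknown pipeline component(s): {unknown}")
    if model not in ("small", "big"):
        raise ValueError("model must be 'small' or 'big'")
    # Emit components in a fixed order, whatever order the caller used.
    ordered = [c for c in KNOWN_COMPONENTS if c in components]
    return {"pipeline": ",".join(ordered), "model": model}


model_params = make_model_params(["event", "entity"])
print(model_params)
```

The resulting dictionary can be passed as the second argument to BookNLP("en", model_params) in the usage example.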

This process creates the directory output_dir/bartleby and generates the following files:

  • bartleby/bartleby.tokens -- This encodes core word-level information. Each row corresponds to one token and includes the following information:

    • paragraph ID
    • sentence ID
    • token ID within sentence
    • token ID within document
    • word
    • lemma
    • byte onset within original document
    • byte offset within original document
    • POS tag
    • dependency relation
    • token ID within document of syntactic head
    • event
  • bartleby/bartleby.entities -- This represents the typed entities within the document (e.g., people and places), along with their coreference.

    • coreference ID (unique entity ID)
    • start token ID within document
    • end token ID within document
    • NOM (nominal), PROP (proper), or PRON (pronoun)
    • PER (person), LOC (location), FAC (facility), GPE (geo-political entity), VEH (vehicle), ORG (organization)
    • text of entity
  • bartleby/bartleby.supersense -- This stores information from supersense tagging.

    • start token ID within document
    • end token ID within document
    • supersense category (verb.cognition, verb.communication, noun.artifact, etc.)
  • bartleby/bartleby.quotes -- This stores information about the quotations in the document, along with the speaker. In a sentence like "'Yes', she said", where she -> ELIZABETH_BENNETT, "she" is the attributed mention of the quotation 'Yes', and is coreferent with the unique entity ELIZABETH_BENNETT.

    • start token ID within document of quotation
    • end token ID within document of quotation
    • start token ID within document of attributed mention
    • end token ID within document of attributed mention
    • attributed mention text
    • coreference ID (unique entity ID) of attributed mention
    • quotation text
  • bartleby/bartleby.book

JSON file providing information about all characters mentioned more than once in the book, including their proper/common/pronominal references, referential gender, actions for which they are the agent and patient, objects they possess, and modifiers.

  • bartleby/bartleby.book.html

HTML file containing a.) the full text of the book along with annotations for entities, coreference, and speaker attribution and b.) a list of the named characters and major entity categories (FAC, GPE, LOC, etc.).
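The word-level file can be loaded with the standard csv module. The sketch below assumes the file is tab-separated with one header row; the header names and the "O" value in the event column are assumptions made for illustration (only the column order comes from the list above), so check them against your own output. The sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a .tokens file. ASSUMPTIONS: tab-separated, one
# header row, these column names, and "O" meaning "no event".
SAMPLE_TOKENS = (
    "paragraph_ID\tsentence_ID\ttoken_ID_within_sentence\t"
    "token_ID_within_document\tword\tlemma\tbyte_onset\tbyte_offset\t"
    "POS_tag\tdependency_relation\tsyntactic_head_ID\tevent\n"
    "0\t0\t0\t0\tCall\tcall\t0\t4\tVERB\tROOT\t0\tO\n"
    "0\t0\t1\t1\tme\tI\t5\t7\tPRON\tdobj\t0\tO\n"
    "0\t0\t2\t2\tIshmael\tIshmael\t8\t15\tPROPN\toprd\t0\tO\n"
    "0\t0\t3\t3\t.\t.\t15\t16\tPUNCT\tpunct\t0\tO\n"
)


def read_rows(handle):
    """Read tab-separated rows into dictionaries keyed by header name."""
    # QUOTE_NONE: quotation marks in a novel are ordinary tokens, so they
    # must not be interpreted as CSV quoting characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_rows(io.StringIO(SAMPLE_TOKENS))
words = [row["word"] for row in rows]
pos_counts = Counter(row["POS_tag"] for row in rows)
print(words)
print(pos_counts.most_common())
```

For a real run, replace the in-memory sample with open("output_dir/bartleby/bartleby.tokens", encoding="utf-8", newline="").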

Annotations

Entity annotations

The entity annotation layer covers six of the ACE 2005 categories in text:

  • People (PER): Tom Sawyer, her daughter
  • Facilities (FAC): the house, the kitchen
  • Geo-political entities (GPE): London, the village
  • Locations (LOC): the forest, the river
  • Vehicles (VEH): the ship, the car
  • Organizations (ORG): the army, the Church

The targets of annotation here include named entities (e.g., Tom Sawyer), common entities (the boy), and pronouns (he). These entities can be nested, as in the following:

[Figure: example of nested entity annotations]

For more, see: David Bamman, Sejal Popat and Sheng Shen, "An Annotated Dataset of Literary Entities," NAACL 2019.
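Nesting can be made concrete with token spans. The sketch below is illustrative only: the Mention type and nested_pairs function are hypothetical names, the example mentions are invented, and end token IDs are assumed to be inclusive.

```python
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    start: int  # first token ID of the span
    end: int    # last token ID of the span (ASSUMED inclusive)
    cat: str    # entity category, e.g. PER or FAC
    text: str


def nested_pairs(mentions: List[Mention]) -> List[Tuple[Mention, Mention]]:
    """Return (outer, inner) pairs where inner lies strictly within outer."""
    pairs = []
    for outer in mentions:
        for inner in mentions:
            same_span = (inner.start, inner.end) == (outer.start, outer.end)
            contained = outer.start <= inner.start and inner.end <= outer.end
            if contained and not same_span:
                pairs.append((outer, inner))
    return pairs


# Hypothetical two-token phrase "her daughter": the pronoun is a person
# mention nested inside the larger person mention.
mentions = [
    Mention(0, 1, "PER", "her daughter"),
    Mention(0, 0, "PER", "her"),
]
for outer, inner in nested_pairs(mentions):
    print(f"{inner.text!r} is nested inside {outer.text!r}")
```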

The entity tagging model within BookNLP is trained on an annotated dataset of 968K tokens, including the public domain materials in LitBank and a new dataset of ~500 contemporary books, including bestsellers, Pulitzer Prize winners, works by Black authors, global Anglophone books, and genre fiction (article forthcoming).

Event annotations

The event layer identifies events with asserted realis (depicted as actually taking place, with specific participants at a specific time) -- as opposed to events with other epistemic modalities (hypotheticals, future events, extradiegetic summaries by the narrator).

Text | Events | Source
My father’s eyes had closed upon the light of this world six months, when mine opened on it. | {closed, opened} | Dickens, David Copperfield
Call me Ishmael. | {} | Melville, Moby Dick
His sister was a tall, strong girl, and she walked rapidly and resolutely, as if she knew exactly where she was going and what she was going to do next. | {walked} | Cather, O Pioneers

For more, see: Matt Sims, Jong Ho Park and David Bamman, "Literary Event Detection," ACL 2019.

The event tagging model is trained on event annotations within LitBank. The small model above makes use of a distillation process, by training on the predictions made by the big model for a collection of contemporary texts.

Supersense tagging

Supersense tagging provides coarse semantic information for a sentence by tagging spans with 41 lexical semantic categories drawn from WordNet, spanning both nouns (including plant, animal, food, feeling, and artifact) and verbs (including cognition, communication, and motion).

Example | Source
The [station wagons]artifact [arrived]motion at [noon]time, a long shining [line]group that [coursed]motion through the [west campus]location. | Delillo, White Noise

The BookNLP tagger is trained on SemCor.


Character name clustering and coreference

The coreference layer covers the six ACE entity categories outlined above (people, facilities, locations, geo-political entities, organizations and vehicles) and is trained on LitBank and PreCo.

Example | Source
One may as well begin with [Helen]x's letters to [[her]x sister]y | Forster, Howard's End

Accurate coreference at the scale of a book-length document is still an open research problem, and attempting full coreference -- where any named entity (Elizabeth), common entity (her sister, his daughter) and pronoun (she) can corefer -- tends to erroneously conflate multiple distinct entities into one. By default, BookNLP addresses this by first carrying out character name clustering (grouping "Tom", "Tom Sawyer" and "Mr. Sawyer" into a single entity), and then allowing pronouns to corefer with either named entities (Tom) or common entities (the boy), but disallowing common entities from coreferring with named entities. To turn off this mode and carry out full coreference, add pronominalCorefOnly=False to the model_params parameters dictionary above (but be sure to inspect the output!).

For more on the coreference criteria used in this work, see David Bamman, Olivia Lewke and Anya Mansoor (2020), "An Annotated Dataset of Coreference in English Literature", LREC.
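As a sketch, the full-coreference switch described above is one more entry in the parameter dictionary. The helper name coref_params is hypothetical; the key pronominalCorefOnly and its value come from the text, and the other entries repeat the usage example.

```python
def coref_params(full_coref=False, model="small"):
    """Build model_params, optionally enabling full coreference.

    Hypothetical helper for illustration; it only assembles the dictionary.
    """
    params = {
        "pipeline": "entity,quote,supersense,event,coref",
        "model": model,
    }
    if full_coref:
        # Turns off the default mode in which common entities are not
        # allowed to corefer with named entities.
        params["pronominalCorefOnly"] = False
    return params


default_params = coref_params()
full_params = coref_params(full_coref=True)
print(full_params)
```

Leaving the key out keeps the default behaviour, so the two dictionaries differ only in that one entry.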

Referential gender inference

BookNLP infers the referential gender of characters by associating them with the pronouns (he/him/his, she/her, they/them, xe/xem/xyr/xir, etc.) used to refer to them in the context of the story. This method encodes several assumptions:

  • BookNLP describes the referential gender of characters, and not their gender identity. Characters are described by the pronouns used to refer to them (e.g., he/him, she/her) rather than labels like "M/F".

  • Prior information on the alignment of names with referential gender (e.g., from government records or larger background datasets) can be used to inform this process if desired (e.g., "Tom" is often associated with he/him in pre-1923 English texts). Name information, however, should not be uniquely determinative, but rather should be sensitive to the context in which it is used (e.g., "Tom" in the book "Tom and Some Other Girls", where Tom is aligned with she/her). By default, BookNLP uses prior information on the alignment of proper names and honorifics with pronouns drawn from ~15K works from Project Gutenberg; this prior information can be ignored by setting referential_gender_hyperparameterFile:None in the model_params dictionary. Alternative priors can be used by passing the pathname to a prior file (in the same format as english/data/gutenberg_prop_gender_terms.txt) to this parameter.

  • Users should be free to define the referential gender categories used here. The default set of categories is {he, him, his}, {she, her}, {they, them, their}, {xe, xem, xyr, xir}, and {ze, zem, zir, hir}. To specify a different set of categories, update the model_params setting to define them: referential_gender_cats: [ ["he", "him", "his"], ["she", "her"], ["they", "them", "their"], ["xe", "xem", "xyr", "xir"], ["ze", "zem", "zir", "hir"] ]
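The idea of describing a character by the pronouns used for them can be illustrated with a simple majority vote. This is a deliberately simplified sketch and not BookNLP's actual inference, which also draws on the name priors described above; the function name referential_gender is hypothetical. The category list and the two parameter names are taken from the text.

```python
from collections import Counter

# Default categories as listed in the text above.
DEFAULT_CATS = [
    ["he", "him", "his"],
    ["she", "her"],
    ["they", "them", "their"],
    ["xe", "xem", "xyr", "xir"],
    ["ze", "zem", "zir", "hir"],
]


def referential_gender(pronouns, cats=DEFAULT_CATS):
    """Return the pronoun category used most often, or None if none match.

    Simplified majority vote for illustration; no prior information is used.
    """
    lookup = {p: tuple(cat) for cat in cats for p in cat}
    counts = Counter(
        lookup[p.lower()] for p in pronouns if p.lower() in lookup
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Passing the categories and disabling the default prior via model_params.
model_params = {
    "pipeline": "entity,quote,supersense,event,coref",
    "model": "small",
    "referential_gender_cats": DEFAULT_CATS,
    "referential_gender_hyperparameterFile": None,
}

print(referential_gender(["She", "her", "he"]))
```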

Speaker attribution

The speaker attribution model identifies all instances of direct speech in the text and attributes each one to its speaker.

Quote | Speaker | Source
— Come up , Kinch ! Come up , you fearful jesuit ! | Buck_Mulligan-0 | Joyce, Ulysses
‘ Oh dear ! Oh dear ! I shall be late ! ’ | The_White_Rabbit-4 | Carroll, Alice in Wonderland
“ Do n't put your feet up there , Huckleberry ; ” | Miss_Watson-26 | Twain, Huckleberry Finn

This model is trained on speaker attribution data in LitBank. For more on the quotation annotations, see this paper.
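Once quotations are attributed, a common next step is counting how often each character speaks. The sketch below reads rows in the column order documented for the .quotes file and tallies quotations per coreference ID. It assumes the file is tab-separated with one header row; the header names and sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical .quotes excerpt. ASSUMPTIONS: tab-separated, one header row.
# Column order follows the list given for the .quotes file.
SAMPLE_QUOTES = (
    'quote_start\tquote_end\tmention_start\tmention_end\t'
    'mention_phrase\tchar_id\tquote\n'
    '10\t14\t16\t16\tshe\t3\t" Yes , "\n'
    '40\t47\t49\t50\tMr. Darcy\t7\t" I can not stay . "\n'
    '60\t63\t65\t65\tshe\t3\t" Why not ? "\n'
)

CHAR_ID_COLUMN = 5  # sixth documented column: coreference ID of the speaker


def quotes_per_speaker(handle):
    """Count quotations per speaker coreference ID."""
    # QUOTE_NONE keeps the quotation marks inside the text field literal.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader)  # skip the assumed header row
    return Counter(row[CHAR_ID_COLUMN] for row in reader)


counts = quotes_per_speaker(io.StringIO(SAMPLE_QUOTES))
print(counts.most_common())
```

The IDs can then be matched against the coreference IDs in the .entities file to recover character names.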

Part-of-speech tagging and dependency parsing

BookNLP uses spaCy for part-of-speech tagging and dependency parsing.

Acknowledgments

BookNLP is supported by the National Endowment for the Humanities (HAA-271654-20) and the National Science Foundation (IIS-1942591).
Comments
  • Character count error?

    Hi. I cannot match the character count with a simple word-search count in the text file. What alerted me was dracula.txt, in which I get 'id': 230, 'count': 37, 'max_proper_mention': 'Mina', whereas when I do a simple word count (in both TextEdit and MS Word) for 'Mina' I get 260. Why the anomaly?

    opened by sm9449 1
  • HuggingFace Validation Error

    Hello, I've been using BookNLP on Windows on my laptop. Today, when I tried to install it on another computer of mine, I got this error:

    using device cpu
    {'pipeline': 'entity,quote,supersense,event,coref', 'model': 'small'}
    Traceback (most recent call last):
      File "c:\Users\Dan\Dizertatie\test.py", line 8, in <module>
        booknlp=BookNLP("en", model_params)
      File "C:\Users\Dan\AppData\Local\Programs\Python\Python310\lib\site-packages\booknlp\booknlp.py", line 14, in __init__
        self.booknlp=EnglishBookNLP(model_params)
      File "C:\Users\Dan\AppData\Local\Programs\Python\Python310\lib\site-packages\booknlp\english\english_booknlp.py", line 148, in __init__
        self.entityTagger=LitBankEntityTagger(self.entityPath, tagsetPath)
      File "C:\Users\Dan\AppData\Local\Programs\Python\Python310\lib\site-packages\booknlp\english\entity_tagger.py", line 19, in __init__
        self.model = Tagger(freeze_bert=False, base_model=base_model, tagset_flat={"EVENT":1, "O":1}, supersense_tagset=self.supersense_tagset, tagset=self.tagset, device=device)
      File "C:\Users\Dan\AppData\Local\Programs\Python\Python310\lib\site-packages\booknlp\english\tagger.py", line 58, in __init__
        self.tokenizer = BertTokenizer.from_pretrained(modelName, do_lower_case=False, do_basic_tokenize=False)
      File "C:\Users\Dan\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\tokenization_utils_base.py", line 1736, in from_pretrained
        resolved_vocab_files[file_id] = cached_file(
      File "C:\Users\Dan\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\utils\hub.py", line 409, in cached_file
        resolved_file = hf_hub_download(
      File "C:\Users\Dan\AppData\Local\Programs\Python\Python310\lib\site-packages\huggingface_hub\utils\_validators.py", line 114, in _inner_fn
        validate_repo_id(arg_value)
      File "C:\Users\Dan\AppData\Local\Programs\Python\Python310\lib\site-packages\huggingface_hub\utils\_validators.py", line 172, in validate_repo_id
        raise HFValidationError(
    huggingface_hub.utils._validators.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: 'C:\Users\Dan\booknlps\entities_google/bert_uncased_L-4_H-256_A-4'.

    Any ideas?

    opened by danoprica 1
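    One possible explanation, offered as a sketch rather than a confirmed diagnosis: the stack trace in the offline-access report further down this page shows entity_tagger.py deriving the base model name with model_file.split("/")[-1] followed by re.sub(".model", "", ...). On a Windows path there are no forward slashes to split on, and the unescaped "." in the pattern matches any character, so "_model" inside "booknlp_models" is removed as well. The code below reproduces only those two visible lines; the model file name is hypothetical, and derive_name_portable is a suggested alternative, not BookNLP code.

```python
import ntpath
import re

# Hypothetical Windows model path; the file name is an assumption.
model_file = (
    r"C:\Users\Dan\booknlp_models"
    r"\entities_google_bert_uncased_L-4_H-256_A-4.model"
)


def derive_name_as_in_trace(path):
    """Reproduce the two substitutions visible in the stack trace."""
    name = re.sub("google_bert", "google/bert", path.split("/")[-1])
    # "." is unescaped, so this also strips "_model" from "booknlp_models".
    return re.sub(".model", "", name)


def derive_name_portable(path):
    """Suggested alternative: split on either separator, anchor the suffix."""
    name = ntpath.basename(path)  # handles both "\\" and "/"
    name = re.sub(r"\.model$", "", name)
    return name.replace("google_bert", "google/bert")


broken = derive_name_as_in_trace(model_file)
fixed = derive_name_portable(model_file)
print(broken)
print(fixed)
```

    The first result keeps the whole directory path and turns "booknlp_models" into "booknlps", which matches the string in the error message above. Any further normalization that BookNLP applies to the name afterwards is outside this sketch.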
  • Choosing spaCy model

    Thank you for this fantastic library!

    Which spaCy model is used when running booknlp.process()? Is this in any way controlled by the "model" parameter ("small", "big"), or does it simply use the model that is currently initialized? For example, could I get it to use en_core_web_trf by running the following before I use booknlp.process():

    import spacy
    nlp = spacy.load('en_core_web_trf')
    
    opened by omstuhler 0
  • BookNLP crashes without internet access even when models are already downloaded

    I've been using BookNLP for the last couple weeks and love it; thanks for such a great package.

    I realized working in the (wifi-less) subway today that even though I have the models downloaded, BookNLP crashes without internet access. That's unfortunate since there are of course many real-life situations in which internet access is impossible.

    Here's the error (with internet turned off):

    ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
    

    Here's the full stack trace:

    File ~/github/lltk/lltk/model/booknlp.py:436, in get_booknlp(language, pipeline, model, cache, quiet, **kwargs)
        434 if not key in booknlpd:
        435     from booknlp.booknlp import BookNLP
    --> 436     booknlpd[key]=BookNLP(
        437         language=language,
        438         model_params=dict(pipeline=pipeline,model=model)
        439     )
        440 return booknlpd[key]

    File ~/miniforge3/envs/booknlp/lib/python3.10/site-packages/booknlp/booknlp.py:14, in BookNLP.__init__(self, language, model_params)
         11 def __init__(self, language, model_params):
         13     if language == "en":
    ---> 14         self.booknlp=EnglishBookNLP(model_params)

    File ~/miniforge3/envs/booknlp/lib/python3.10/site-packages/booknlp/english/english_booknlp.py:148, in EnglishBookNLP.__init__(self, model_params)
        145 self.quoteTagger=QuoteTagger()
        147 if self.doEntities:
    --> 148     self.entityTagger=LitBankEntityTagger(self.entityPath, tagsetPath)
        149     aliasPath = pkg_resources.resource_filename(__name__, "data/aliases.txt")
        150     self.name_resolver=NameCoref(aliasPath)

    File ~/miniforge3/envs/booknlp/lib/python3.10/site-packages/booknlp/english/entity_tagger.py:19, in LitBankEntityTagger.__init__(self, model_file, model_tagset)
         16 base_model=re.sub("google_bert", "google/bert", model_file.split("/")[-1])
         17 base_model=re.sub(".model", "", base_model)
    ---> 19 self.model = Tagger(freeze_bert=False, base_model=base_model, tagset_flat={"EVENT":1, "O":1}, supersense_tagset=self.supersense_tagset, tagset=self.tagset, device=device)
         21 self.model.to(device)
         22 self.model.load_state_dict(torch.load(model_file, map_location=device))

    File ~/miniforge3/envs/booknlp/lib/python3.10/site-packages/booknlp/english/tagger.py:58, in Tagger.__init__(self, freeze_bert, base_model, tagset, supersense_tagset, tagset_flat, hidden_dim, flat_hidden_dim, device)
         54 self.rev_supersense_tagset[len(supersense_tagset)+1]="O"
         56 self.num_labels_flat=len(tagset_flat)
    ---> 58 self.tokenizer = BertTokenizer.from_pretrained(modelName, do_lower_case=False, do_basic_tokenize=False)
         59 self.bert = BertModel.from_pretrained(modelName)
         61 self.tokenizer.add_tokens(["[CAP]"], special_tokens=True)

    File ~/miniforge3/envs/booknlp/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1724, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
       1722 else:
       1723     try:
    -> 1724         resolved_vocab_files[file_id] = cached_path(
       1725             file_path,
       1726             cache_dir=cache_dir,
       1727             force_download=force_download,
       1728             proxies=proxies,
       1729             resume_download=resume_download,
       1730             local_files_only=local_files_only,
       1731             use_auth_token=use_auth_token,
       1732             user_agent=user_agent,
       1733         )
       1735     except FileNotFoundError as error:
       1736         if local_files_only:

    File ~/miniforge3/envs/booknlp/lib/python3.10/site-packages/transformers/file_utils.py:1921, in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only)
       1917     local_files_only = True
       1919 if is_remote_url(url_or_filename):
       1920     # URL, so get it from the cache (downloading if necessary)
    -> 1921     output_path = get_from_cache(
       1922         url_or_filename,
       1923         cache_dir=cache_dir,
       1924         force_download=force_download,
       1925         proxies=proxies,
       1926         resume_download=resume_download,
       1927         user_agent=user_agent,
       1928         use_auth_token=use_auth_token,
       1929         local_files_only=local_files_only,
       1930     )
       1931 elif os.path.exists(url_or_filename):
       1932     # File, and it exists.
       1933     output_path = url_or_filename

    File ~/miniforge3/envs/booknlp/lib/python3.10/site-packages/transformers/file_utils.py:2177, in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, use_auth_token, local_files_only)
       2171                 raise FileNotFoundError(
       2172                     "Cannot find the requested files in the cached path and outgoing traffic has been"
       2173                     " disabled. To enable model look-ups and downloads online, set 'local_files_only'"
       2174                     " to False."
       2175                 )
       2176             else:
    -> 2177                 raise ValueError(
       2178                     "Connection error, and we cannot find the requested files in the cached path."
       2179                     " Please try again or make sure your Internet connection is on."
       2180                 )
       2182 # From now on, etag is not None.
       2183 if os.path.exists(cache_path) and not force_download:

    ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.
    

    I turn wifi on and everything works normally.

    opened by quadrismegistus 2
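    A possible workaround, sketched under assumptions: the transformers and huggingface_hub libraries document the environment variables TRANSFORMERS_OFFLINE and HF_HUB_OFFLINE, which tell them to use only locally cached files. Whether this avoids the crash with the library versions shown in the trace has not been verified here, and the BERT files must already be in the local cache from an earlier online run. The function name load_booknlp is hypothetical.

```python
import os

# Set these before transformers (and therefore booknlp) is imported, so the
# libraries see them when they initialise.
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_HUB_OFFLINE"] = "1"


def load_booknlp(model="small"):
    """Create a BookNLP instance with offline mode requested.

    Hypothetical helper; the import is deferred so that the environment
    variables above are already in place when the libraries load.
    """
    from booknlp.booknlp import BookNLP

    model_params = {
        "pipeline": "entity,quote,supersense,event,coref",
        "model": model,
    }
    return BookNLP("en", model_params)
```

    Calling load_booknlp() once with a connection populates the cache; later calls can then be tried without one.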
  • Download speeds very slow on initial startup

    Hi, the download seems to take 4 hours for the bert .model files from the server end. Is there a way to wget or curl them into a directory? Also, if one terminates the program, the files are still partially written in and cause an unzipping error in pytorch. Is there a plan to mitigate this in the future with tempfile downloads?

    minimal example:

    import booknlp
    from booknlp.booknlp import BookNLP
    import spacy
    spacy.load('en_core_web_sm')
    model_params = {
        "pipeline": "entity,quote,supersense,event,coref",
        "model": "big"
    }
    
    booknlp = BookNLP("en", model_params)
    
    # Input file to process
    input_file = "input_dir/bartleby.txt"
    
    # Output directory to store resulting files in
    output_directory = "output_dir/bartleby/"
    
    # Files within this directory will be named ${book_id}.entities, ${book_id}.tokens, etc.
    book_id = "bartleby"
    
    booknlp.process(input_file, output_directory, book_id)
    

    https://i.imgur.com/FZIqNsC.png

    opened by Ori-Pixel 0
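    The temp-file idea raised above can be sketched in a few lines: write the incoming bytes to a temporary file in the destination directory and rename it into place only after the last chunk arrives, so an interrupted download never leaves a truncated .model file behind. This is a general pattern, not BookNLP's current behaviour; atomic_write is a hypothetical helper, and the chunks would come from whatever HTTP client performs the download.

```python
import os
import tempfile


def atomic_write(dest_path, chunks):
    """Write an iterable of byte chunks to dest_path all-or-nothing."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Same directory as the destination, so the final rename stays on one
    # filesystem and replaces the target in a single step.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Interrupted or failed: remove the partial file and re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Demonstration with in-memory chunks instead of a network stream.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.model")
atomic_write(demo_path, [b"first-", b"second"])
with open(demo_path, "rb") as handle:
    print(handle.read())
```

    Catching BaseException means that pressing Ctrl-C during the transfer also cleans up the partial file before the interrupt propagates.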
  • Problem running run_nlpbook.py

    After running: booknlp=BookNLP("en", model_params)

    I get the following:

    (It seems to refer to my model location as booknlps, but the directory that is created is booknlp_models, and tacking a local directory path onto the huggingface URL also seems like an issue. I'm glad to help and try things here, though my experience with big Python code bases is limited.)

    404 Client Error: Repository Not Found for url: https://huggingface.co/C:%5CUsers%5Cdenis%5Cbooknlps%5Centities_google/bert_uncased_L-6_H-768_A-12/resolve/main/tokenizer_config.json

    RepositoryNotFoundError Traceback (most recent call last) c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\transformers\file_utils.py in get_file_from_repo(path_or_repo, filename, cache_dir, force_download, resume_download, proxies, use_auth_token, revision, local_files_only) 2241 local_files_only=local_files_only, -> 2242 use_auth_token=use_auth_token, 2243 )

    c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\transformers\file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only) 1853 use_auth_token=use_auth_token, -> 1854 local_files_only=local_files_only, 1855 )

    c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\transformers\file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, use_auth_token, local_files_only) 2049 r = requests.head(url, headers=headers, allow_redirects=False, proxies=proxies, timeout=etag_timeout) -> 2050 _raise_for_status(r) 2051 etag = r.headers.get("X-Linked-Etag") or r.headers.get("ETag")

    c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\transformers\file_utils.py in _raise_for_status(request) 1970 if error_code == "RepoNotFound": -> 1971 raise RepositoryNotFoundError(f"404 Client Error: Repository Not Found for url: {request.url}") 1972 elif error_code == "EntryNotFound":

    RepositoryNotFoundError: 404 Client Error: Repository Not Found for url: https://huggingface.co/C:%5CUsers%5Cdenis%5Cbooknlps%5Centities_google/bert_uncased_L-6_H-768_A-12/resolve/main/tokenizer_config.json

    During handling of the above exception, another exception occurred:

    OSError Traceback (most recent call last) ~\AppData\Local\Temp\ipykernel_2352\2094341818.py in <module> 4 } 5 ----> 6 booknlp=BookNLP("en", model_params)

    c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\booknlp\booknlp.py in __init__(self, language, model_params) 12 13 if language == "en": ---> 14 self.booknlp=EnglishBookNLP(model_params) 15 16 def process(self, inputFile, outputFolder, idd):

    c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\booknlp\english\english_booknlp.py in __init__(self, model_params) 146 147 if self.doEntities: --> 148 self.entityTagger=LitBankEntityTagger(self.entityPath, tagsetPath) 149 aliasPath = pkg_resources.resource_filename(__name__, "data/aliases.txt") 150 self.name_resolver=NameCoref(aliasPath)

    c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\booknlp\english\entity_tagger.py in __init__(self, model_file, model_tagset) 17 base_model=re.sub(".model", "", base_model) 18 ---> 19 self.model = Tagger(freeze_bert=False, base_model=base_model, tagset_flat={"EVENT":1, "O":1}, supersense_tagset=self.supersense_tagset, tagset=self.tagset, device=device) 20 21 self.model.to(device)

    c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\booknlp\english\tagger.py in __init__(self, freeze_bert, base_model, tagset, supersense_tagset, tagset_flat, hidden_dim, flat_hidden_dim, device) 56 self.num_labels_flat=len(tagset_flat) 57 ---> 58 self.tokenizer = BertTokenizer.from_pretrained(modelName, do_lower_case=False, do_basic_tokenize=False) 59 self.bert = BertModel.from_pretrained(modelName) 60

    c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\transformers\tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs) 1662 use_auth_token=use_auth_token, 1663 revision=revision, -> 1664 local_files_only=local_files_only, 1665 ) 1666 if resolved_config_file is not None:

    c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\transformers\file_utils.py in get_file_from_repo(path_or_repo, filename, cache_dir, force_download, resume_download, proxies, use_auth_token, revision, local_files_only) 2246 logger.error(err) 2247 raise EnvironmentError( -> 2248 f"{path_or_repo} is not a local folder and is not a valid model identifier " 2249 "listed on 'https://huggingface.co/models'\nIf this is a private repository, make sure to " 2250 "pass a token having permission to this repo with use_auth_token or log in with "

    OSError: C:\Users\denis\booknlps\entities_google/bert_uncased_L-6_H-768_A-12 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models' If this is a private repository, make sure to pass a token having permission to this repo with use_auth_token or log in with huggingface-cli login and pass use_auth_token=True

    opened by denisfitz57 4
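The "booknlps" directory name in the error is consistent with line 17 of entity_tagger.py as shown in the traceback: `re.sub(".model", "", base_model)` uses an unescaped dot, so it also deletes `_model` from `booknlp_models`. The sketch below reproduces that effect. Everything other than that one `re.sub` call is an assumption made for illustration: the model file name, the `split("/")` step, the `google_bert` substitution, and the `entities_` prefix handling are guesses chosen to match the error message, not confirmed BookNLP code. Because the resulting string is not an existing local folder, `from_pretrained` falls back to treating it as a Hugging Face repository id, which would explain the 404 URL.

```python
import ntpath
import re

# Hypothetical Windows path; the file name is an assumption chosen to
# match the string in the error message above.
model_file = r"C:\Users\denis\booknlp_models\entities_google_bert_uncased_L-6_H-768_A-12.model"


def derive_as_in_traceback(path):
    # Assumed step: splitting on "/" only, which is a no-op for a
    # backslash-separated Windows path, so the full path survives.
    name = path.split("/")[-1]
    # Assumed step: turn the file name into a "google/bert_..." model id.
    name = re.sub("google_bert", "google/bert", name)
    # Line 17 from the traceback: the unescaped "." matches any character,
    # so "_model" inside "booknlp_models" is removed along with ".model".
    return re.sub(".model", "", name)


def derive_safer(path):
    # ntpath.basename splits on both "\" and "/" on any platform.
    name = ntpath.basename(path)
    # Escape the dot and anchor the pattern to the end of the name.
    name = re.sub(r"\.model$", "", name)
    # Assumption: the task prefix is not part of the Hugging Face model id.
    name = re.sub(r"^entities_", "", name)
    return name.replace("google_bert", "google/bert")


print(derive_as_in_traceback(model_file))
print(derive_safer(model_file))
```

Under these assumptions, the first function yields the malformed string from the OSError, while the second yields a plain `google/bert_...` style identifier. This is only a diagnostic sketch, not a patch to the library.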