Example code for "Real-World Natural Language Processing"

Overview

Real-World Natural Language Processing

This repository contains example code for the book "Real-World Natural Language Processing."

AllenNLP (2.5.0 or above) is required to run the example code in this repository.

Examples included in this repository (among those referenced below): sentiment analysis (examples/sentiment), language modeling, named entity recognition, and machine translation (examples/mt).

Comments
  • Update some examples for AllenNLP 1.0.0

    Contains a few fixes to update notebooks to AllenNLP 1.0.0.

    Mostly, it updates some of the data loaders to import properly from the new allennlp-models package, and fixes up their instantiation and use.


    I thought it might be helpful to update some of these to use AllenNLP >= 1.0.0. For most of them, I followed the example set in the updated sentiment notebook. I ran into a bit of trouble with the LM notebook, where the output data point seemed to be "wrapped" twice, nested as {"tokens": {"tokens": <vector>...
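    For reference, a minimal sketch of that nesting (`batch` here stands for the model's text-field input; this assumes the default single-ID token indexer registered under the name 'tokens'). The outer key is the indexer name and the inner keys are indexer-specific:

        # In AllenNLP >= 1.0, a TextField tensorizes to Dict[str, Dict[str, torch.Tensor]]
        token_ids = batch["tokens"]["tokens"]  # outer key: indexer name; inner key: indexer output

        # or, without hard-coding the indexer name:
        from allennlp.nn import util
        token_ids = util.get_token_ids_from_text_field_tensors(batch)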

    I figured I'd make a PR to help aid the process of transforming these. I'm also happy to close this if it's redundant.

    opened by mathcass 2
  • Positive label for F1 measure is not configured correctly

    I reviewed the code in examples/sentiment/sst_classifier.py and found a bug.

        self.f1_measure = F1Measure(4)
    

    I think this code is intended to measure precision/recall/F1 for the label '4', which is the most positive sentiment. However, the integer 4 here is treated as an index into the array representation. It must be converted using the label mapping stored in the vocab.
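    A minimal sketch of the suggested fix (assuming the labels live in the default 'labels' namespace of the vocabulary):

        # look up the array index of the label '4' instead of hard-coding it
        positive_index = vocab.get_token_index('4', namespace='labels')
        self.f1_measure = F1Measure(positive_index)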

    opened by ywatanabex 1
  • typos and errata (last updated 2021/05/18)

    • chapter 1: should be "text generation"

    original text:

    Finally, a third class of text classification is unconditional text generation, where natural language text is generated stochastically from a model. You can train models so that they can generate some random academic papers, Linux source code, or even some poems and play scripts. For example, Andrej Karpathy trained an RNN model from all works of Shakespeare and succeeded in generating pieces of text that look exactly like his work (http://realworldnlpbook.com/ch1.html#karpathy15):

    • 4.2.3: typo "swtich" in the pseudocode
    def update_gru(state, word):
        new_state = update_hidden(state, word)

        switch = get_switch(state, word)

        state = swtich * new_state + (1 - switch) * state

        return state
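
    The corrected line would presumably read:

        state = switch * new_state + (1 - switch) * state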
    
    • chapter 5: micro and macro should be switched.

    original text:

    If these metrics are computed while ignoring entity types, it’s called a micro average. For example, the micro-averaged precision is the total number of true positives of all types divided by the total number of retrieved named entities regardless of the type. On the other hand, if these metrics are computed per entity type and then get averaged, it’s called a macro average. For example, if the precision for PER and GPE is 80% and 90%, respectively, its macro average is 85%. What AllenNLP computes in the following is the micro average.
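
    For reference, however the two terms end up assigned, the two averaging schemes themselves work like this (made-up counts):

        tp = {'PER': 8, 'GPE': 45}          # true positives per entity type
        retrieved = {'PER': 10, 'GPE': 50}  # retrieved entities per type

        # pooled over all types, ignoring the entity type
        pooled_p = sum(tp.values()) / sum(retrieved.values())         # 53/60 ≈ 0.883

        # computed per type, then averaged
        per_type_p = sum(tp[t] / retrieved[t] for t in tp) / len(tp)  # (0.8 + 0.9)/2 = 0.85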

    • 5.6.1

    The language detection model in a previous chapter already used an RNN with characters as input.

    original text:

    In the first half of this section, we are going to build an English language model and train it using a generic English corpus. Before we start, we note that the RNN language model we build in this chapter operates on characters, not on words or tokens. All the RNN models we’ve seen so far operate on words, which means the input to the RNN was always sequences of words. On the other hand, the RNN we are going to use in this section takes sequences of characters as the input.

    opened by xiaoouwang 0
  • A lot of code is broken in AllenNLP 2.0

    I'm reading the book now and have noticed a lot of bugs related to AllenNLP 2.0. Would the author consider upgrading the code to AllenNLP 2.0 so that it complies more with the title "real-world NLP"?

    It's a pity, because this is, I think, the only book that uses AllenNLP to tackle a range of general NLP tasks, and I like it very much.

    Some examples:

    In sst_classifier.ipynb, one can note that:

    vocab = Vocabulary.from_instances(train_dataset + dev_dataset,
                                      min_count={'tokens': 3})
    

    gives

    unsupported operand type(s) for +: 'generator' and 'generator'

    (easily fixable using list(reader.read('train.txt')))

    The following two lines

    train_dataset.index_with(vocab)
    dev_dataset.index_with(vocab)
    

    give

    'generator' object has no attribute 'index_with'
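
    For what it's worth, a sketch of the AllenNLP 2.x idiom for both of the above (file names are guesses based on the notebook's train.txt; batch size is arbitrary): materialize the generators into lists, then index the data loaders, not the raw instances, with the vocabulary:

        from allennlp.data import Vocabulary
        from allennlp.data.data_loaders import SimpleDataLoader

        train_instances = list(reader.read('train.txt'))
        dev_instances = list(reader.read('dev.txt'))

        vocab = Vocabulary.from_instances(train_instances + dev_instances,
                                          min_count={'tokens': 3})

        # in 2.x it is the data loaders that get indexed with the vocab
        train_loader = SimpleDataLoader(train_instances, batch_size=32)
        dev_loader = SimpleDataLoader(dev_instances, batch_size=32)
        train_loader.index_with(vocab)
        dev_loader.index_with(vocab)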

    and also, though not specific to AllenNLP 2.0,

    predictor = SentenceClassifierPredictor(model, dataset_reader=reader)

    gives

    AttributeError: 'StanfordSentimentTreeBankDatasetReader' object has no attribute '_tokenizer'

    opened by xiaoouwang 0
  • Error in 2.8.1

    Hi,

    While trying

    predictor = SentenceClassifierPredictor(model, dataset_reader=reader)
    

    in Sec 2.8.1, I get the error

    AttributeError: 'StanfordSentimentTreeBankDatasetReader' object has no attribute '_tokenizer'
    

    I see you have made some changes in this commit.
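
    A hypothetical workaround (not necessarily what that commit does) is to give the predictor its own tokenizer instead of reaching into the reader's private `_tokenizer`, which StanfordSentimentTreeBankDatasetReader does not define:

        from allennlp.common.util import JsonDict
        from allennlp.data import Instance
        from allennlp.data.tokenizers import SpacyTokenizer
        from allennlp.predictors import Predictor

        class SentenceClassifierPredictor(Predictor):
            def predict(self, sentence: str) -> JsonDict:
                return self.predict_json({'sentence': sentence})

            def _json_to_instance(self, json_dict: JsonDict) -> Instance:
                # tokenize here rather than relying on reader internals
                tokens = SpacyTokenizer().tokenize(json_dict['sentence'])
                return self._dataset_reader.text_to_instance([str(t) for t in tokens])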

    opened by harshit-py 1
  • Module Not Found Error for Machine Translation

    I'm trying to run examples/mt/mt.py with allennlp==1.0.0 (a direct clone of the repo, no code changes) and got this error:

        ModuleNotFoundError: No module named 'allennlp.data.dataset_readers.seq2seq'
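    If I remember correctly, that reader was moved into the separate allennlp-models package in AllenNLP 1.0 (the first comment above mentions the same migration). After `pip install allennlp-models`, something like the following should work, though the exact import path may vary by version:

        from allennlp_models.generation import Seq2SeqDatasetReader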

    opened by asvskartheek 1
  • Help: examples/mt/mt.py

    I'm trying to reproduce examples/mt/mt.py, but I get a CPU/CUDA error:

    File "/opt/conda/lib/python3.6/site-packages/allennlp/models/encoder_decoders/simple_seq2seq.py", line 212, in forward state = self._encode(source_tokens) File "/opt/conda/lib/python3.6/site-packages/allennlp/models/encoder_decoders/simple_seq2seq.py", line 268, in _encode embedded_input = self._source_embedder(source_tokens) File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__ result = self.forward(*input, **kwargs) File "/opt/conda/lib/python3.6/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 123, in forward token_vectors = embedder(*tensors) File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__ result = self.forward(*input, **kwargs) File "/opt/conda/lib/python3.6/site-packages/allennlp/modules/token_embedders/embedding.py", line 143, in forward sparse=self.sparse) File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 1506, in embedding return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) RuntimeError: Expected object of backend CPU but got backend CUDA for argument #3 'index'

    I'm running this in a Kaggle environment.
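
    A sketch of the usual fix for this mismatch in that AllenNLP version (`optimizer`, `iterator`, and `train_dataset` are placeholders from a typical training setup, not the repository's exact code): move the model to the GPU and pass the same device to the trainer, so the model and the input tensors end up on the same device:

        import torch

        cuda_device = 0 if torch.cuda.is_available() else -1
        if cuda_device >= 0:
            model = model.cuda(cuda_device)

        trainer = Trainer(model=model,
                          optimizer=optimizer,
                          iterator=iterator,
                          train_dataset=train_dataset,
                          cuda_device=cuda_device)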

    opened by remotejob 1
Owner
Masato Hagiwara
Senior AI Researcher @earthspecies working on decoding non-human communication with AI/ML. Ex @duolingo. Author of Real-World Natural Language Processing