Example code for "Real-World Natural Language Processing"

Overview

Real-World Natural Language Processing

This repository contains example code for the book "Real-World Natural Language Processing."

AllenNLP (2.5.0 or above) is required to run the example code in this repository.

Examples included in this repository (among those referenced below): sentiment analysis (examples/sentiment), language modeling, named entity recognition, and machine translation (examples/mt).

Comments
  • Update some examples for AllenNLP 1.0.0

    Contains a few fixes to update notebooks to AllenNLP 1.0.0.

    Mostly, it updates some of the data loaders to import properly from the new allennlp-models package, and fixes up their instantiation and use.


    I thought it might be helpful to update some of these to use AllenNLP >= 1.0.0. For most of them, I followed the example set in the updated sentiment notebook. I ran into a bit of trouble with the LM notebook, where the output data point seemed to be "wrapped" twice, nested as {"tokens": {"tokens": <vector>...
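    For reference, a minimal sketch of that nesting (`batch` here stands for the model's text-field input; this assumes the default single-ID token indexer registered under the name 'tokens'). The outer key is the indexer name and the inner keys are indexer-specific:

        # In AllenNLP >= 1.0, a TextField tensorizes to Dict[str, Dict[str, torch.Tensor]]
        token_ids = batch["tokens"]["tokens"]  # outer key: indexer name; inner key: indexer output

        # or, without hard-coding the indexer name:
        from allennlp.nn import util
        token_ids = util.get_token_ids_from_text_field_tensors(batch)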

    I figured I'd make a PR to help aid the process of transforming these. I'm also happy to close this if it's redundant.

    opened by mathcass 2
  • Positive label for F1 measure is not configured correctly

    I reviewed the code in examples/sentiment/sst_classifier.py and found a bug.

        self.f1_measure = F1Measure(4)
    

    I think this code is intended to measure precision/recall/F1 for the label '4', which is the most positive sentiment. However, the integer 4 here is treated as an index into the array representation. It must be converted using the label mapping stored in the vocab.
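    A minimal sketch of the suggested fix (assuming the labels live in the default 'labels' namespace of the vocabulary):

        # look up the array index of the label '4' instead of hard-coding it
        positive_index = vocab.get_token_index('4', namespace='labels')
        self.f1_measure = F1Measure(positive_index)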

    opened by ywatanabex 1
  • typos and errata (last updated 2021/05/18)

    • chapter 1: should be "text generation"

    original text:

    Finally, a third class of text classification is unconditional text generation, where natural language text is generated stochastically from a model. You can train models so that they can generate some random academic papers, Linux source code, or even some poems and play scripts. For example, Andrej Karpathy trained an RNN model from all works of Shakespeare and succeeded in generating pieces of text that look exactly like his work (http://realworldnlpbook.com/ch1.html#karpathy15):

    • 4.2.3: typo "swtich" in the pseudocode
    def update_gru(state, word):
        new_state = update_hidden(state, word)

        switch = get_switch(state, word)

        state = swtich * new_state + (1 - switch) * state

        return state
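
    The corrected line would presumably read:

        state = switch * new_state + (1 - switch) * state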
    
    • chapter 5: micro and macro should be switched.

    original text:

    If these metrics are computed while ignoring entity types, it’s called a micro average. For example, the micro-averaged precision is the total number of true positives of all types divided by the total number of retrieved named entities regardless of the type. On the other hand, if these metrics are computed per entity type and then get averaged, it’s called a macro average. For example, if the precision for PER and GPE is 80% and 90%, respectively, its macro average is 85%. What AllenNLP computes in the following is the micro average.
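
    For reference, however the two terms end up assigned, the two averaging schemes themselves work like this (made-up counts):

        tp = {'PER': 8, 'GPE': 45}          # true positives per entity type
        retrieved = {'PER': 10, 'GPE': 50}  # retrieved entities per type

        # pooled over all types, ignoring the entity type
        pooled_p = sum(tp.values()) / sum(retrieved.values())         # 53/60 ≈ 0.883

        # computed per type, then averaged
        per_type_p = sum(tp[t] / retrieved[t] for t in tp) / len(tp)  # (0.8 + 0.9)/2 = 0.85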

    • 5.6.1

    The language detection model in a previous chapter already used an RNN with characters as input.

    original text:

    In the first half of this section, we are going to build an English language model and train it using a generic English corpus. Before we start, we note that the RNN language model we build in this chapter operates on characters, not on words or tokens. All the RNN models we’ve seen so far operate on words, which means the input to the RNN was always sequences of words. On the other hand, the RNN we are going to use in this section takes sequences of characters as the input.

    opened by xiaoouwang 0
  • A lot of code is broken in AllenNLP 2.0

    I'm reading the book now and have noticed a lot of bugs related to AllenNLP 2.0. Would the author consider upgrading the code to AllenNLP 2.0 so that it complies more with the title "real-world NLP"?

    It's a pity, because this is, I think, the only book that uses AllenNLP to tackle a range of general NLP tasks, and I like it very much.

    Some examples:

    In sst_classifier.ipynb, one can note that:

    vocab = Vocabulary.from_instances(train_dataset + dev_dataset,
                                      min_count={'tokens': 3})
    

    gives

    unsupported operand type(s) for +: 'generator' and 'generator'

    (easily fixable using list(reader.read('train.txt')))

    The following two lines

    train_dataset.index_with(vocab)
    dev_dataset.index_with(vocab)
    

    give

    'generator' object has no attribute 'index_with'
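
    For what it's worth, a sketch of the AllenNLP 2.x idiom for both of the above (file names are guesses based on the notebook's train.txt; batch size is arbitrary): materialize the generators into lists, then index the data loaders, not the raw instances, with the vocabulary:

        from allennlp.data import Vocabulary
        from allennlp.data.data_loaders import SimpleDataLoader

        train_instances = list(reader.read('train.txt'))
        dev_instances = list(reader.read('dev.txt'))

        vocab = Vocabulary.from_instances(train_instances + dev_instances,
                                          min_count={'tokens': 3})

        # in 2.x it is the data loaders that get indexed with the vocab
        train_loader = SimpleDataLoader(train_instances, batch_size=32)
        dev_loader = SimpleDataLoader(dev_instances, batch_size=32)
        train_loader.index_with(vocab)
        dev_loader.index_with(vocab)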

    and also, though not specific to AllenNLP 2.0,

    predictor = SentenceClassifierPredictor(model, dataset_reader=reader)

    gives

    AttributeError: 'StanfordSentimentTreeBankDatasetReader' object has no attribute '_tokenizer'

    opened by xiaoouwang 0
  • Error in 2.8.1

    Hi,

    While trying

    predictor = SentenceClassifierPredictor(model, dataset_reader=reader)
    

    in Sec 2.8.1, I get the error

    AttributeError: 'StanfordSentimentTreeBankDatasetReader' object has no attribute '_tokenizer'
    

    I see you have made some changes in this commit.
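
    A hypothetical workaround (not necessarily what that commit does) is to give the predictor its own tokenizer instead of reaching into the reader's private `_tokenizer`, which StanfordSentimentTreeBankDatasetReader does not define:

        from allennlp.common.util import JsonDict
        from allennlp.data import Instance
        from allennlp.data.tokenizers import SpacyTokenizer
        from allennlp.predictors import Predictor

        class SentenceClassifierPredictor(Predictor):
            def predict(self, sentence: str) -> JsonDict:
                return self.predict_json({'sentence': sentence})

            def _json_to_instance(self, json_dict: JsonDict) -> Instance:
                # tokenize here rather than relying on reader internals
                tokens = SpacyTokenizer().tokenize(json_dict['sentence'])
                return self._dataset_reader.text_to_instance([str(t) for t in tokens])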

    opened by harshit-py 1
  • Module Not Found Error for Machine Translation

    I'm trying to run examples/mt/mt.py with allennlp==1.0.0 (a direct clone of the repo, no code changes) and got this error:

        ModuleNotFoundError: No module named 'allennlp.data.dataset_readers.seq2seq'
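    If I remember correctly, that reader was moved into the separate allennlp-models package in AllenNLP 1.0 (the first comment above mentions the same migration). After `pip install allennlp-models`, something like the following should work, though the exact import path may vary by version:

        from allennlp_models.generation import Seq2SeqDatasetReader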

    opened by asvskartheek 1
  • Help: examples/mt/mt.py

    I'm trying to reproduce examples/mt/mt.py, but I get a CPU/CUDA error:

    File "/opt/conda/lib/python3.6/site-packages/allennlp/models/encoder_decoders/simple_seq2seq.py", line 212, in forward state = self._encode(source_tokens) File "/opt/conda/lib/python3.6/site-packages/allennlp/models/encoder_decoders/simple_seq2seq.py", line 268, in _encode embedded_input = self._source_embedder(source_tokens) File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__ result = self.forward(*input, **kwargs) File "/opt/conda/lib/python3.6/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 123, in forward token_vectors = embedder(*tensors) File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__ result = self.forward(*input, **kwargs) File "/opt/conda/lib/python3.6/site-packages/allennlp/modules/token_embedders/embedding.py", line 143, in forward sparse=self.sparse) File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 1506, in embedding return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) RuntimeError: Expected object of backend CPU but got backend CUDA for argument #3 'index'

    I'm running this in a Kaggle environment.
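
    A sketch of the usual fix for this mismatch in that AllenNLP version (`optimizer`, `iterator`, and `train_dataset` are placeholders from a typical training setup, not the repository's exact code): move the model to the GPU and pass the same device to the trainer, so the model and the input tensors end up on the same device:

        import torch

        cuda_device = 0 if torch.cuda.is_available() else -1
        if cuda_device >= 0:
            model = model.cuda(cuda_device)

        trainer = Trainer(model=model,
                          optimizer=optimizer,
                          iterator=iterator,
                          train_dataset=train_dataset,
                          cuda_device=cuda_device)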

    opened by remotejob 1
Owner
Masato Hagiwara
Senior AI Researcher @earthspecies working on decoding non-human communication with AI/ML. Ex @duolingo. Author of Real-World Natural Language Processing