Code and data accompanying Natural Language Processing with PyTorch

Overview

Natural Language Processing with PyTorch

Build Intelligent Language Applications Using Deep Learning
By Delip Rao and Brian McMahan

Welcome. This is a companion repository for the book Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning.

Comments
  • Manual Data Download

    I am getting the following error when I click on the manual download link:

    404. That’s an error.

    The requested URL was not found on this server. That’s all we know.

    opened by sunginmkone 4
  • cuda defaults to false in ch4.cnn instead of true

    https://github.com/joosthub/PyTorchNLPBook/blob/db9fc8fe48a2416b36b21dde0dfce787c6ade2b2/chapters/chapter_4/4_4_cnn_surnames/4_4_Classifying_Surnames_with_a_CNN.ipynb#L552

    The code checks whether CUDA is unavailable in order to switch the flag to False, but the flag already starts as False, so it can never end up True; a sketch of the intended pattern follows.
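
    A sketch of the intended pattern (assuming the argparse-style Namespace args used throughout the book's notebooks):

        from argparse import Namespace
        import torch

        args = Namespace(cuda=True)            # the flag should default to True
        # downgrade to CPU only when no GPU is actually available
        if not torch.cuda.is_available():
            args.cuda = False
        args.device = torch.device("cuda" if args.cuda else "cpu")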

    opened by seanv507 3
  • get_num_batches uses integer division in Chapter3:ReviewDataSet

    Isn't this wrong if data_size is not a multiple of batch_size? Shouldn't it be:

        def get_num_batches(self, batch_size):
            """Given a batch size, return the number of batches in the dataset
            
            Args:
                batch_size (int)
            Returns:
                number of batches in the dataset
            """
            return int(np.ceil(len(self)/batch_size))
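
    A quick illustration of the difference (arbitrary numbers; np is numpy):

        import numpy as np

        data_size, batch_size = 10, 3
        print(data_size // batch_size)               # 3: silently drops the final partial batch
        print(int(np.ceil(data_size / batch_size)))  # 4: counts the partial batch too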
    opened by seanv507 1
  • Removes loops in creating data subset

    Uses pandas groupby and filtering functionality to create the subset DataFrame in place of manual construction through loops; a sketch of the idea follows below.

    Also improves execution efficiency, reducing wall time from over 1 minute to about 120 ms.
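
    A sketch of the idea (hypothetical names: review_df stands in for the notebook's raw reviews DataFrame with its rating column, and proportion for args.proportion_subset_of_train):

        import pandas as pd

        def make_review_subset(review_df: pd.DataFrame, proportion: float) -> pd.DataFrame:
            # take the leading `proportion` of rows from each rating class,
            # replacing the notebook's explicit defaultdict-and-loop construction
            return (review_df
                    .groupby('rating', group_keys=False)
                    .apply(lambda group: group.head(int(len(group) * proportion))))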

    opened by NikhilPr95 1
  • dropout error

    In fact, F.dropout(xx, p=0.5) applies dropout in both the model's train mode and eval mode. You should write F.dropout(xx, p=0.5, training=self.training); otherwise, executing a cell more than once produces different results (see the sketch below). I heard about this book in the CS224 class; the professor said it's a great book and that he would be sad if we didn't read it. Even now there seems to be no one maintaining this repo, but I want to say: thank you, great authors!
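
    A minimal sketch of the fix (an illustrative module, not the book's exact classifier):

        import torch.nn as nn
        import torch.nn.functional as F

        class TwoLayerNet(nn.Module):
            def __init__(self, input_dim, hidden_dim, output_dim):
                super().__init__()
                self.fc1 = nn.Linear(input_dim, hidden_dim)
                self.fc2 = nn.Linear(hidden_dim, output_dim)

            def forward(self, x_in):
                hidden = F.relu(self.fc1(x_in))
                # Wrong: F.dropout(hidden, p=0.5) stays active even after model.eval()
                # Right: gating on self.training lets eval() disable dropout
                hidden = F.dropout(hidden, p=0.5, training=self.training)
                return self.fc2(hidden)

    Using an nn.Dropout module instead would respect train()/eval() automatically.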

    opened by iamownt 0
  • error report for chapter3 code

    An error occurred when I ran the code below from Chapter 3; could anyone please help me figure it out?

        classifier = classifier.to(args.device)

        loss_func = nn.BCEWithLogitsLoss()
        optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
                                                         mode='min', factor=0.5,
                                                         patience=1)

        train_state = make_train_state(args)

        epoch_bar = tqdm.notebook.tqdm(desc='training routine',
                                       total=args.num_epochs,
                                       position=0)

        dataset.set_split('train')
        train_bar = tqdm.notebook.tqdm(desc='split=train',
                                       total=dataset.get_num_batches(args.batch_size),
                                       position=1,
                                       leave=True)

        dataset.set_split('val')
        val_bar = tqdm.notebook.tqdm(desc='split=val',
                                     total=dataset.get_num_batches(args.batch_size),
                                     position=1,
                                     leave=True)

        try:
            for epoch_index in range(args.num_epochs):
                train_state['epoch_index'] = epoch_index

                # Iterate over training dataset

                # setup: batch generator, set loss and acc to 0, set train mode on
                dataset.set_split('train')
                batch_generator = generate_batches(dataset,
                                                   batch_size=args.batch_size,
                                                   device=args.device)
                running_loss = 0.0
                running_acc = 0.0
                classifier.train()

                for batch_index, batch_dict in enumerate(batch_generator):
                    # the training routine is these 5 steps:

                    # --------------------------------------
                    # step 1. zero the gradients
                    optimizer.zero_grad()

                    # step 2. compute the output
                    y_pred = classifier(x_in=batch_dict['x_data'].float())

                    # step 3. compute the loss
                    loss = loss_func(y_pred, batch_dict['y_target'].float())
                    loss_t = loss.item()
                    running_loss += (loss_t - running_loss) / (batch_index + 1)

                    # step 4. use loss to produce gradients
                    loss.backward()

                    # step 5. use optimizer to take gradient step
                    optimizer.step()
                    # -----------------------------------------
                    # compute the accuracy
                    acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
                    running_acc += (acc_t - running_acc) / (batch_index + 1)

                    # update bar
                    train_bar.set_postfix(loss=running_loss,
                                          acc=running_acc,
                                          epoch=epoch_index)
                    train_bar.update()

                train_state['train_loss'].append(running_loss)
                train_state['train_acc'].append(running_acc)

                # Iterate over val dataset

                # setup: batch generator, set loss and acc to 0; set eval mode on
                dataset.set_split('val')
                batch_generator = generate_batches(dataset,
                                                   batch_size=args.batch_size,
                                                   device=args.device)
                running_loss = 0.
                running_acc = 0.
                classifier.eval()

                for batch_index, batch_dict in enumerate(batch_generator):

                    # compute the output
                    y_pred = classifier(x_in=batch_dict['x_data'].float())

                    # compute the loss
                    loss = loss_func(y_pred, batch_dict['y_target'].float())
                    loss_t = loss.item()
                    running_loss += (loss_t - running_loss) / (batch_index + 1)

                    # compute the accuracy
                    acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
                    running_acc += (acc_t - running_acc) / (batch_index + 1)

                    val_bar.set_postfix(loss=running_loss,
                                        acc=running_acc,
                                        epoch=epoch_index)
                    val_bar.update()

                train_state['val_loss'].append(running_loss)
                train_state['val_acc'].append(running_acc)

                train_state = update_train_state(args=args,
                                                 model=classifier,
                                                 train_state=train_state)

                scheduler.step(train_state['val_loss'][-1])

                # reset the per-split bars for the next epoch
                train_bar.n = 0
                val_bar.n = 0
                epoch_bar.update()

                if train_state['stop_early']:
                    break

        except KeyboardInterrupt:
            print("Exiting loop")

    ValueError: num_samples should be a positive integer value, but got num_samples=0
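
    PyTorch's DataLoader raises this ValueError when it is handed an empty dataset, so a quick sanity check (a sketch, assuming the book's set_split API) is to confirm every split actually contains rows:

        for split in ('train', 'val', 'test'):
            dataset.set_split(split)
            print(split, len(dataset))  # a 0 here reproduces num_samples=0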

    opened by leopardbruce 0
  • running_loss definition

    I can't understand the line running_loss += (loss_t - running_loss) / (batch_index + 1). I think loss_t is the loss of the current batch, but loss_t can be less than running_loss (so loss_t - running_loss is negative?). Can anyone explain what this code means?
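
    For what it's worth, that line is the standard incremental-mean update: after batch n, running_loss equals the average of the first n batch losses, so a negative loss_t - running_loss simply pulls the mean down. A small demonstration:

        # the incremental update keeps `running` equal to the running average
        losses = [0.9, 0.7, 0.8, 0.4]

        running = 0.0
        for i, loss_t in enumerate(losses):
            running += (loss_t - running) / (i + 1)
            assert abs(running - sum(losses[:i + 1]) / (i + 1)) < 1e-12

        print(running)  # 0.7, the mean of all four losses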

    opened by kd97 0
  • YELP raw_train.csv file no longer available on Google Drive, please provide alternate source

    raw_train.csv

    https://drive.google.com/open?id=1xeUnqkhuzGGzZKThzPeXe2Vf6Uu_g_xM gives a 404 error

    Please provide an updated link to the exact dataset used in the book, or to an entirely new set of Yelp CSV-formatted datasets (train, test, and reviews_with_splits_lite).

    opened by richlysakowski 0
  • Can't download train data for chapter 3

    download.py doesn't work, and when I follow the link to the train set from the data README (https://github.com/delip/PyTorchNLPBook/blob/master/data/README.md), I get:

        404. That’s an error.

        The requested URL was not found on this server. That’s all we know.

    What should I do to download your examples?

    opened by EgorDS15 1
  • 5_1_Pretrained_Embeddings.ipynb notebook

    For the glove.6B.100d.txt file from Kaggle [link], here is the appropriate from_embeddings_file method:

    @classmethod
    def from_embeddings_file(cls, embedding_file):
        """Instantiate from a pre-trained vector file.

        Vector file should be of the format:
            word0 x0_0 x0_1 x0_2 x0_3 ... x0_N
            word1 x1_0 x1_1 x1_2 x1_3 ... x1_N

        Args:
            embedding_file (str): location of the file
        Returns:
            instance of PreTrainedEmbeddings
        """
        word_to_index = {}
        word_vectors = []

        # utf8 matters: GloVe files contain non-ASCII tokens
        with open(embedding_file, encoding="utf8") as fp:
            for line in fp:              # stream instead of readlines()
                parts = line.split(" ")  # word followed by its vector entries
                word = parts[0]
                vec = np.array([float(x) for x in parts[1:]])

                word_to_index[word] = len(word_to_index)
                word_vectors.append(vec)

        return cls(word_to_index, word_vectors)
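
    A usage sketch (assuming the notebook's PreTrainedEmbeddings class wraps this method; the filename is the Kaggle download mentioned above):

        embeddings = PreTrainedEmbeddings.from_embeddings_file('glove.6B.100d.txt')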
    
    opened by mdzalfirdausi 0
  • No early stopping implemented in Chapter 3 (yelp)

    The early_stopping_best_val value never gets updated, so the current loss is ALWAYS smaller than it and early stopping never happens. This line of code is missing from the if-else statement in the update_train_state function (see the sketch below):

        train_state['early_stopping_best_val'] = loss_t
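
    A sketch of the corrected comparison inside update_train_state (abridged; names follow the book's train_state dictionary):

        loss_tm1, loss_t = train_state['val_loss'][-2:]

        if loss_t >= train_state['early_stopping_best_val']:
            # loss failed to improve: count toward the early-stopping budget
            train_state['early_stopping_step'] += 1
        else:
            # loss improved: save the model and record the new best value
            torch.save(model.state_dict(), train_state['model_filename'])
            train_state['early_stopping_best_val'] = loss_t  # the missing line
            train_state['early_stopping_step'] = 0

        train_state['stop_early'] = \
            train_state['early_stopping_step'] >= args.early_stopping_criteria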

    opened by lisanka93 0
  • Typo in 3_5_Classifying_Yelp_Review_Sentiment.py

    Hi, thanks for providing the code. I found a small typo in 3_5_Classifying_Yelp_Review_Sentiment.py: in the ReviewVectorizer class, the comment for vectorize should say "one-hot" rather than "one-hit".

    opened by lishichengyan 0
Owner
Joostware
A design of MIDI language for music generation task, specifically for Natural Language Processing (NLP) models.

MIDI Language Introduction Reference Paper: Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions: code This

Robert Bogan Kang 3 May 25, 2022
Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT-Implementation In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages. We are interest

Tanuj Sur 4 Jul 1, 2022
🤗Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.

State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0 🤗 Transformers provides thousands of pretrained models to perform tasks o

Hugging Face 77.3k Jan 3, 2023
Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/

Texar-PyTorch is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar

ASYML 726 Dec 30, 2022
A number of methods in order to perform Natural Language Processing on live data derived from Twitter

A number of methods in order to perform Natural Language Processing on live data derived from Twitter

null 1 Nov 24, 2021
This repository contains all the source code that is needed for the project : An Efficient Pipeline For Bloom’s Taxonomy Using Natural Language Processing and Deep Learning

Pipeline For NLP with Bloom's Taxonomy Using Improved Question Classification and Question Generation using Deep Learning This repository contains all

Rohan Mathur 9 Jul 17, 2021
Basic Utilities for PyTorch Natural Language Processing (NLP)

Basic Utilities for PyTorch Natural Language Processing (NLP) PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. tor

Michael Petrochuk 2.1k Jan 1, 2023
One Stop Anomaly Shop: Anomaly detection using two-phase approach: (a) pre-labeling using statistics, Natural Language Processing and static rules; (b) anomaly scoring using supervised and unsupervised machine learning.

One Stop Anomaly Shop (OSAS) Quick start guide Step 1: Get/build the docker image Option 1: Use precompiled image (might not reflect latest changes):

Adobe, Inc. 148 Dec 26, 2022
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Dec 30, 2022
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

A Deep Learning NLP/NLU library by Intel® AI Lab Overview | Models | Installation | Examples | Documentation | Tutorials | Contributing NLP Architect

Intel Labs 2.9k Jan 2, 2023
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

ASYML 2.3k Jan 7, 2023
DELTA is a deep learning based natural language and speech processing platform.

DELTA - A DEep learning Language Technology plAtform What is DELTA? DELTA is a deep learning based end-to-end natural language and speech processing p

DELTA 1.5k Dec 26, 2022