Code and data accompanying Natural Language Processing with PyTorch

Overview

Natural Language Processing with PyTorch

Build Intelligent Language Applications Using Deep Learning
By Delip Rao and Brian McMahan

Welcome. This is a companion repository for the book Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning.

Comments
  • Manual Data Download

    I am getting the following error when I click on the manual download link:

    404. That’s an error.

    The requested URL was not found on this server. That’s all we know.

    opened by sunginmkone 4
  • cuda defaults to false in ch4.cnn instead of true

    https://github.com/joosthub/PyTorchNLPBook/blob/db9fc8fe48a2416b36b21dde0dfce787c6ade2b2/chapters/chapter_4/4_4_cnn_surnames/4_4_Classifying_Surnames_with_a_CNN.ipynb#L552

    The code checks whether CUDA is unavailable in order to switch the flag to False, but the flag already starts as False, so it can never end up True; a sketch of the intended pattern follows.
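
    A sketch of the intended pattern (assuming the argparse-style Namespace args used throughout the book's notebooks):

        from argparse import Namespace
        import torch

        args = Namespace(cuda=True)            # the flag should default to True
        # downgrade to CPU only when no GPU is actually available
        if not torch.cuda.is_available():
            args.cuda = False
        args.device = torch.device("cuda" if args.cuda else "cpu")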

    opened by seanv507 3
  • get_num_batches uses integer division in Chapter3:ReviewDataSet

    Isn't this wrong if data_size is not a multiple of batch_size? Shouldn't it be:

        def get_num_batches(self, batch_size):
            """Given a batch size, return the number of batches in the dataset
            
            Args:
                batch_size (int)
            Returns:
                number of batches in the dataset
            """
            return int(np.ceil(len(self)/batch_size))
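
    A quick illustration of the difference (arbitrary numbers; np is numpy):

        import numpy as np

        data_size, batch_size = 10, 3
        print(data_size // batch_size)               # 3: silently drops the final partial batch
        print(int(np.ceil(data_size / batch_size)))  # 4: counts the partial batch too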
    opened by seanv507 1
  • Removes loops in creating data subset

    Uses pandas groupby and filtering functionality to create the subset DataFrame in place of manual construction through loops; a sketch of the idea follows below.

    Also improves execution efficiency, reducing wall time from over 1 minute to about 120 ms.
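
    A sketch of the idea (hypothetical names: review_df stands in for the notebook's raw reviews DataFrame with its rating column, and proportion for args.proportion_subset_of_train):

        import pandas as pd

        def make_review_subset(review_df: pd.DataFrame, proportion: float) -> pd.DataFrame:
            # take the leading `proportion` of rows from each rating class,
            # replacing the notebook's explicit defaultdict-and-loop construction
            return (review_df
                    .groupby('rating', group_keys=False)
                    .apply(lambda group: group.head(int(len(group) * proportion))))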

    opened by NikhilPr95 1
  • dropout error

    In fact, F.dropout(xx, p=0.5) applies dropout in both the model's train mode and eval mode. You should write F.dropout(xx, p=0.5, training=self.training); otherwise, executing a cell more than once produces different results (see the sketch below). I heard about this book in the CS224 class; the professor said it's a great book and that he would be sad if we didn't read it. Even now there seems to be no one maintaining this repo, but I want to say: thank you, great authors!
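
    A minimal sketch of the fix (an illustrative module, not the book's exact classifier):

        import torch.nn as nn
        import torch.nn.functional as F

        class TwoLayerNet(nn.Module):
            def __init__(self, input_dim, hidden_dim, output_dim):
                super().__init__()
                self.fc1 = nn.Linear(input_dim, hidden_dim)
                self.fc2 = nn.Linear(hidden_dim, output_dim)

            def forward(self, x_in):
                hidden = F.relu(self.fc1(x_in))
                # Wrong: F.dropout(hidden, p=0.5) stays active even after model.eval()
                # Right: gating on self.training lets eval() disable dropout
                hidden = F.dropout(hidden, p=0.5, training=self.training)
                return self.fc2(hidden)

    Using an nn.Dropout module instead would respect train()/eval() automatically.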

    opened by iamownt 0
  • error report for chapter3 code

    An error occurred when I ran the code below from Chapter 3; could anyone please help me figure it out?

        classifier = classifier.to(args.device)

        loss_func = nn.BCEWithLogitsLoss()
        optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
                                                         mode='min', factor=0.5,
                                                         patience=1)

        train_state = make_train_state(args)

        epoch_bar = tqdm.notebook.tqdm(desc='training routine',
                                       total=args.num_epochs,
                                       position=0)

        dataset.set_split('train')
        train_bar = tqdm.notebook.tqdm(desc='split=train',
                                       total=dataset.get_num_batches(args.batch_size),
                                       position=1,
                                       leave=True)

        dataset.set_split('val')
        val_bar = tqdm.notebook.tqdm(desc='split=val',
                                     total=dataset.get_num_batches(args.batch_size),
                                     position=1,
                                     leave=True)

        try:
            for epoch_index in range(args.num_epochs):
                train_state['epoch_index'] = epoch_index

                # Iterate over training dataset

                # setup: batch generator, set loss and acc to 0, set train mode on
                dataset.set_split('train')
                batch_generator = generate_batches(dataset,
                                                   batch_size=args.batch_size,
                                                   device=args.device)
                running_loss = 0.0
                running_acc = 0.0
                classifier.train()

                for batch_index, batch_dict in enumerate(batch_generator):
                    # the training routine is these 5 steps:

                    # --------------------------------------
                    # step 1. zero the gradients
                    optimizer.zero_grad()

                    # step 2. compute the output
                    y_pred = classifier(x_in=batch_dict['x_data'].float())

                    # step 3. compute the loss
                    loss = loss_func(y_pred, batch_dict['y_target'].float())
                    loss_t = loss.item()
                    running_loss += (loss_t - running_loss) / (batch_index + 1)

                    # step 4. use loss to produce gradients
                    loss.backward()

                    # step 5. use optimizer to take gradient step
                    optimizer.step()
                    # -----------------------------------------
                    # compute the accuracy
                    acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
                    running_acc += (acc_t - running_acc) / (batch_index + 1)

                    # update bar
                    train_bar.set_postfix(loss=running_loss,
                                          acc=running_acc,
                                          epoch=epoch_index)
                    train_bar.update()

                train_state['train_loss'].append(running_loss)
                train_state['train_acc'].append(running_acc)

                # Iterate over val dataset

                # setup: batch generator, set loss and acc to 0; set eval mode on
                dataset.set_split('val')
                batch_generator = generate_batches(dataset,
                                                   batch_size=args.batch_size,
                                                   device=args.device)
                running_loss = 0.
                running_acc = 0.
                classifier.eval()

                for batch_index, batch_dict in enumerate(batch_generator):

                    # compute the output
                    y_pred = classifier(x_in=batch_dict['x_data'].float())

                    # compute the loss
                    loss = loss_func(y_pred, batch_dict['y_target'].float())
                    loss_t = loss.item()
                    running_loss += (loss_t - running_loss) / (batch_index + 1)

                    # compute the accuracy
                    acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
                    running_acc += (acc_t - running_acc) / (batch_index + 1)

                    val_bar.set_postfix(loss=running_loss,
                                        acc=running_acc,
                                        epoch=epoch_index)
                    val_bar.update()

                train_state['val_loss'].append(running_loss)
                train_state['val_acc'].append(running_acc)

                train_state = update_train_state(args=args,
                                                 model=classifier,
                                                 train_state=train_state)

                scheduler.step(train_state['val_loss'][-1])

                # reset the per-split bars for the next epoch
                train_bar.n = 0
                val_bar.n = 0
                epoch_bar.update()

                if train_state['stop_early']:
                    break

        except KeyboardInterrupt:
            print("Exiting loop")

    ValueError: num_samples should be a positive integer value, but got num_samples=0
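
    PyTorch's DataLoader raises this ValueError when it is handed an empty dataset, so a quick sanity check (a sketch, assuming the book's set_split API) is to confirm every split actually contains rows:

        for split in ('train', 'val', 'test'):
            dataset.set_split(split)
            print(split, len(dataset))  # a 0 here reproduces num_samples=0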

    opened by leopardbruce 0
  • running_loss definition

    I can't understand the line running_loss += (loss_t - running_loss) / (batch_index + 1). I think loss_t is the loss of the current batch, but loss_t can be less than running_loss (so loss_t - running_loss is negative?). Can anyone explain what this code means?
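
    For what it's worth, that line is the standard incremental-mean update: after batch n, running_loss equals the average of the first n batch losses, so a negative loss_t - running_loss simply pulls the mean down. A small demonstration:

        # the incremental update keeps `running` equal to the running average
        losses = [0.9, 0.7, 0.8, 0.4]

        running = 0.0
        for i, loss_t in enumerate(losses):
            running += (loss_t - running) / (i + 1)
            assert abs(running - sum(losses[:i + 1]) / (i + 1)) < 1e-12

        print(running)  # 0.7, the mean of all four losses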

    opened by kd97 0
  • YELP raw_train.csv file no longer available on Google Drive, please provide alternate source

    raw_train.csv

    https://drive.google.com/open?id=1xeUnqkhuzGGzZKThzPeXe2Vf6Uu_g_xM gives a 404 error

    Please provide an updated link to the exact dataset used in the book, or to an entirely new set of Yelp CSV-formatted datasets (train, test, and reviews_with_splits_lite).

    opened by richlysakowski 0
  • Can't download train data for chapter 3

    download.py doesn't work, and when I follow the link to the train set from the data README (https://github.com/delip/PyTorchNLPBook/blob/master/data/README.md), I get:

        404. That’s an error.

        The requested URL was not found on this server. That’s all we know.

    What should I do to download your examples?

    opened by EgorDS15 1
  • 5_1_Pretrained_Embeddings.ipynb notebook

    For the glove.6B.100d.txt file from Kaggle [link], here is the appropriate from_embeddings_file method:

    @classmethod
    def from_embeddings_file(cls, embedding_file):
        """Instantiate from a pre-trained vector file.

        Vector file should be of the format:
            word0 x0_0 x0_1 x0_2 x0_3 ... x0_N
            word1 x1_0 x1_1 x1_2 x1_3 ... x1_N

        Args:
            embedding_file (str): location of the file
        Returns:
            instance of PreTrainedEmbeddings
        """
        word_to_index = {}
        word_vectors = []

        # utf8 matters: GloVe files contain non-ASCII tokens
        with open(embedding_file, encoding="utf8") as fp:
            for line in fp:              # stream instead of readlines()
                parts = line.split(" ")  # word followed by its vector entries
                word = parts[0]
                vec = np.array([float(x) for x in parts[1:]])

                word_to_index[word] = len(word_to_index)
                word_vectors.append(vec)

        return cls(word_to_index, word_vectors)
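
    A usage sketch (assuming the notebook's PreTrainedEmbeddings class wraps this method; the filename is the Kaggle download mentioned above):

        embeddings = PreTrainedEmbeddings.from_embeddings_file('glove.6B.100d.txt')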
    
    opened by mdzalfirdausi 0
  • No early stopping implemented in Chapter 3 (yelp)

    The early_stopping_best_val value never gets updated, so the current loss is ALWAYS smaller than it and early stopping never happens. This line of code is missing from the if-else statement in the update_train_state function (see the sketch below):

        train_state['early_stopping_best_val'] = loss_t
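
    A sketch of the corrected comparison inside update_train_state (abridged; names follow the book's train_state dictionary):

        loss_tm1, loss_t = train_state['val_loss'][-2:]

        if loss_t >= train_state['early_stopping_best_val']:
            # loss failed to improve: count toward the early-stopping budget
            train_state['early_stopping_step'] += 1
        else:
            # loss improved: save the model and record the new best value
            torch.save(model.state_dict(), train_state['model_filename'])
            train_state['early_stopping_best_val'] = loss_t  # the missing line
            train_state['early_stopping_step'] = 0

        train_state['stop_early'] = \
            train_state['early_stopping_step'] >= args.early_stopping_criteria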

    opened by lisanka93 0
  • Typo in 3_5_Classifying_Yelp_Review_Sentiment.py

    Hi, thanks for providing the code. I found a small typo in 3_5_Classifying_Yelp_Review_Sentiment.py: in the ReviewVectorizer class, the comment for vectorize should say "one-hot" rather than "one-hit".

    opened by lishichengyan 0
Owner
Joostware
A design of MIDI language for music generation task, specifically for Natural Language Processing (NLP) models.

MIDI Language Introduction Reference Paper: Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions: code This

Robert Bogan Kang 3 May 25, 2022
Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT-Implementation In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages. We are interest

Tanuj Sur 4 Jul 1, 2022
🤗Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.

State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0 🤗 Transformers provides thousands of pretrained models to perform tasks o

Hugging Face 77.3k Jan 3, 2023
Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/

Texar-PyTorch is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar

ASYML 726 Dec 30, 2022
A number of methods in order to perform Natural Language Processing on live data derived from Twitter

A number of methods in order to perform Natural Language Processing on live data derived from Twitter

null 1 Nov 24, 2021
This repository contains all the source code that is needed for the project : An Efficient Pipeline For Bloom’s Taxonomy Using Natural Language Processing and Deep Learning

Pipeline For NLP with Bloom's Taxonomy Using Improved Question Classification and Question Generation using Deep Learning This repository contains all

Rohan Mathur 9 Jul 17, 2021
Basic Utilities for PyTorch Natural Language Processing (NLP)

Basic Utilities for PyTorch Natural Language Processing (NLP) PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. tor

Michael Petrochuk 2.1k Jan 1, 2023
One Stop Anomaly Shop: Anomaly detection using two-phase approach: (a) pre-labeling using statistics, Natural Language Processing and static rules; (b) anomaly scoring using supervised and unsupervised machine learning.

One Stop Anomaly Shop (OSAS) Quick start guide Step 1: Get/build the docker image Option 1: Use precompiled image (might not reflect latest changes):

Adobe, Inc. 148 Dec 26, 2022
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Dec 30, 2022
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

A Deep Learning NLP/NLU library by Intel® AI Lab Overview | Models | Installation | Examples | Documentation | Tutorials | Contributing NLP Architect

Intel Labs 2.9k Jan 2, 2023
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

ASYML 2.3k Jan 7, 2023
DELTA is a deep learning based natural language and speech processing platform.

DELTA - A DEep learning Language Technology plAtform What is DELTA? DELTA is a deep learning based end-to-end natural language and speech processing p

DELTA 1.5k Dec 26, 2022