GenSen
Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning
Sandeep Subramanian, Adam Trischler, Yoshua Bengio & Christopher Pal
ICLR 2018
About
GenSen is a technique to learn general purpose, fixed-length representations of sentences via multi-task training. These representations are useful for transfer and low-resource learning. For details please refer to our ICLR paper.
Code
We provide a PyTorch implementation of our paper along with pre-trained models as well as code to evaluate these models on a variety of transfer learning benchmarks.
Requirements
- Python 2.7 (Python 3 compatibility coming soon)
- PyTorch 0.2 or 0.3
- nltk
- h5py
- numpy
- scikit-learn
Usage
Setting up Models & pre-trained word vectors
You can download our pre-trained models and set up the pre-trained word vectors used for vocabulary expansion by running:
cd data/models
bash download_models.sh
cd ../embedding
bash glove2h5.sh
Using a pre-trained model to extract sentence representations
You can use our pre-trained models to extract the last hidden state or all hidden states of our multi-task GRU. Additionally, you can concatenate the output of multiple models to replicate the numbers in our paper.
from gensen import GenSen, GenSenSingle
gensen_1 = GenSenSingle(
model_folder='./data/models',
filename_prefix='nli_large_bothskip',
pretrained_emb='./data/embedding/glove.840B.300d.h5'
)
reps_h, reps_h_t = gensen_1.get_representation(
sentences, pool='last', return_numpy=True, tokenize=True
)
print reps_h.shape, reps_h_t.shape
- The input to get_representation is sentences, which should be a list of strings. If your strings are not pre-tokenized, set tokenize=True to run the NLTK tokenizer before computing representations.
- reps_h (batch_size x seq_len x 2048) contains the hidden states for all words in all sentences (padded to the max sentence length).
- reps_h_t (batch_size x 2048) contains only the last hidden state for all sentences in the minibatch.
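For example, a minimal sketch of how the shapes line up (the sentences below are arbitrary placeholders):
sentences = [
    'hello world .',
    'the quick brown fox jumped over the lazy dog .'
]
reps_h, reps_h_t = gensen_1.get_representation(
    sentences, pool='last', return_numpy=True, tokenize=True
)
# reps_h.shape -> (2, max_seq_len, 2048), padded to the longest sentence
# reps_h_t.shape -> (2, 2048), one fixed-length vector per sentence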
GenSenSingle will return the output of a single model, nli_large_bothskip (+STN +Fr +De +NLI +L +STP). You can concatenate the output of multiple models by creating a GenSen instance with multiple GenSenSingle instances, as follows:
gensen_2 = GenSenSingle(
model_folder='./data/models',
filename_prefix='nli_large_bothskip_parse',
pretrained_emb='./data/embedding/glove.840B.300d.h5'
)
gensen = GenSen(gensen_1, gensen_2)
reps_h, reps_h_t = gensen.get_representation(
sentences, pool='last', return_numpy=True, tokenize=True
)
- reps_h (batch_size x seq_len x 4096) contains the hidden states for all words in all sentences (padded to the max sentence length).
- reps_h_t (batch_size x 4096) contains only the last hidden state for all sentences in the minibatch.
The model produces a fixed-length vector for each sentence as well as the hidden states corresponding to each word in every sentence (padded to the max sentence length). You can return a numpy array instead of a torch.FloatTensor by setting return_numpy=True.
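For instance, to keep the outputs as torch tensors rather than numpy arrays, a minimal sketch reusing the concatenated gensen model from above:
reps_h, reps_h_t = gensen.get_representation(
    sentences, pool='last', return_numpy=False, tokenize=True
)
# reps_h and reps_h_t are now torch.FloatTensor objects with the same shapes as above.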
Vocabulary Expansion
If you have a specific domain for which you want to compute representations, you can call vocab_expansion on instances of the GenSenSingle or GenSen class simply via gensen.vocab_expansion(vocab), where vocab is a list of unique words in the new domain. This learns a linear mapping from the provided pre-trained embeddings (which have a significantly larger vocabulary) to the space of GenSen's word vectors. For an example of how this is used in practice, please refer to gensen_senteval.py.
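As a minimal sketch (the in-domain sentences below are hypothetical placeholders; gensen_1 is the model loaded above):
# Collect the unique words of the new domain.
new_domain_sentences = [
    'the spectrometer returned anomalous readings .',
    'recalibrate the interferometer before each run .'
]
vocab = set()
for sentence in new_domain_sentences:
    vocab.update(sentence.split())

# Learn a linear mapping from the pre-trained embedding space to GenSen's word vectors.
gensen_1.vocab_expansion(list(vocab))

# Representations can now be computed over the expanded vocabulary.
reps_h, reps_h_t = gensen_1.get_representation(
    new_domain_sentences, pool='last', return_numpy=True, tokenize=True
)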
Training a model from scratch
To train a model from scratch, simply run train.py with an appropriate JSON config file. An example config is provided in example_config.json. To continue training, just relaunch the same script with load_dir=auto in the config file.
To download some of the data required to train a GenSen model, run:
bash get_data.sh
Note that this script can take a while to complete since it downloads, tokenizes and lowercases a fairly large En-Fr corpus. If you already have these parallel corpora processed, you can replace the paths to these files in the provided example_config.json.
Some of the data used in our work is no longer publicly available (BookCorpus - see http://yknzhu.wixsite.com/mbweb) or has an LDC license associated with it (Penn Treebank). As a result, the provided example_config.json will only train on multilingual NMT and NLI, since those datasets are publicly available. To use models trained on all tasks, please use our available pre-trained models.
Additional sequence-to-sequence transduction tasks can be added trivially to the multi-task framework by editing the JSON config file with more tasks.
python train.py --config example_config.json
To use the default settings in example_config.json you will need a GPU with at least 16GB of memory (such as a P100). To train on smaller GPUs, you may need to reduce the batch size.
Note that if "load_dir" is set to auto, the script will resume from the last saved model in "save_dir".
Creating a GenSen model from a trained multi-task model
Once you have a trained model, you can throw away all of the decoders and retain only the encoder used to compute sentence representations.
You can do this by running
python create_gensen.py -t <path_to_trained_model> -s <path_to_save_encoder> -n <name_of_encoder>
Once you have done this, you can load the model just like any of the pre-trained models by specifying model_folder as the <path_to_save_encoder> and filename_prefix as the <name_of_encoder> used in the above command.
your_gensen = GenSenSingle(
model_folder='<path_to_save_encoder>',
filename_prefix='<name_of_encoder>',
pretrained_emb='./data/embedding/glove.840B.300d.h5'
)
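The resulting model can then be used exactly like the pre-trained models, for example (a sketch, with sentences a list of strings as before):
reps_h, reps_h_t = your_gensen.get_representation(
    sentences, pool='last', return_numpy=True, tokenize=True
)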
Transfer Learning Evaluations
We used the SentEval toolkit to run most of our transfer learning experiments. To replicate these numbers, clone their repository and follow the setup instructions. Once complete, copy gensen_senteval.py and gensen.py into their examples folder and run the following commands to reproduce different rows in Table 2 of our paper. Note: please set the path to the pre-trained GloVe embeddings (glove.840B.300d.h5) and the model folder as appropriate.
(+STN +Fr +De +NLI +L +STP) python gensen_senteval.py --prefix_1 nli_large --prefix_2 nli_large_bothskip
(+STN +Fr +De +NLI +2L +STP) python gensen_senteval.py --prefix_1 nli_large_bothskip --prefix_2 nli_large_bothskip_2layer
(+STN +Fr +De +NLI +L +STP +Par) python gensen_senteval.py --prefix_1 nli_large_bothskip_parse --prefix_2 nli_large_bothskip
Reference
@article{subramanian2018learning,
  title={Learning general purpose distributed sentence representations via large scale multi-task learning},
  author={Subramanian, Sandeep and Trischler, Adam and Bengio, Yoshua and Pal, Christopher J},
  journal={arXiv preprint arXiv:1804.00079},
  year={2018}
}