Optimus: the first large-scale pre-trained VAE language model

Last update: Dec 19, 2022

Overview

Optimus: the first pre-trained Big VAE language model

This repository contains source code necessary to reproduce the results presented in the EMNLP 2020 paper Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space.


Figure: The network architecture of Optimus: encoder for representation learning and decoder for generation.
Figure: Sentences are organized and manipulated in a pre-trained compact and smooth latent space.

For more on this project, see the Microsoft Research Blog post.

News

May 21, 2020: Releasing a demo for latent space manipulation, including sentence interpolation and analogy. Check out the website.

May 20, 2020: The latent space manipulation code is cleaned and released. See instructions at optimius_for_snli.md.

May 13, 2020: The fine-tuning code for language modeling is released. See instructions at optimus_finetune_language_models.md.

There are four steps to use this codebase to reproduce the results in the paper.

Dependencies
Prepare datasets
Model training
    1. Pre-training on sentences in Wikipedia
    2. Language Modeling
    3. Guided Language Generation
    4. Low-resource Language Understanding
Collect and plot results

Dependencies

Pull the Docker image from Docker Hub: chunyl/pytorch-transformers:v2. Please see the instructions at doc/env.md.

The project is organized as follows, with essential files and folders shown. The output folder stores the model checkpoints.

├── Optimus
   └── code
       ├── examples
           ├── big_ae
               ├── modules
                   ├── vae.py
                   └── ...
               ├── run_lm_vae_pretraining_phdist_beta.py
               ├── run_lm_vae_training.py
               └── ...
           ├── pytorch_transformers
               ├── modeling_bert.py
               ├── modeling_gpt2.py
               └── ...
       ├── scripts
           ├── scripts_docker
           ├── scripts_local
           ├── scripts_philly
   └── data
       └── datasets
           ├── wikipedia_json_64_filtered
               └── ...
           ├── snli_data
           └── ...
   └── output
       ├── pretrain
       ├── LM
       └── ...

Prepare Datasets

Please download or prepare the data by following the instructions at data/download_datasets.md.

Model Training

1. Pre-training on sentences in Wikipedia

We pre-trained our models on Philly (a Microsoft internal compute cluster), so the code is specialized for multi-node, multi-GPU compute on this platform. The main pre-training Python script is run_lm_vae_pretraining_phdist_beta.py. You may need to adjust the distributed training scripts for your own cluster.

2. Language Modeling

For a fair comparison with existing VAE language models, we consider a model with latent dimension 32. The pre-trained model is fine-tuned on four commonly used datasets for one epoch. Please see the details at doc/optimus_finetune_language_models.md.

3. Guided Language Generation

Latent Space Manipulation. To ensure good performance, we consider a model with latent dimension 768. The pre-trained model is fine-tuned on the SNLI dataset, where sentences show related patterns. Please see the details at doc/optimius_for_snli.md.
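The two manipulations named in the news items, sentence interpolation and analogy, come down to vector arithmetic on latent codes. The following is a minimal sketch, assuming latent codes are plain lists of floats and that analogy uses the common offset form z_b - z_a + z_c. The function names interpolate and analogy are hypothetical and are not part of this codebase; the procedure actually used is described in doc/optimius_for_snli.md.

```python
from typing import List, Sequence

Vector = List[float]


def interpolate(z1: Sequence[float], z2: Sequence[float], num_steps: int) -> List[Vector]:
    """Return num_steps + 1 points on the straight line from z1 to z2."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if len(z1) != len(z2):
        raise ValueError("latent vectors must have the same dimension")
    path = []
    for step in range(num_steps + 1):
        tau = step / num_steps  # tau runs from 0.0 (z1) to 1.0 (z2)
        path.append([(1.0 - tau) * a + tau * b for a, b in zip(z1, z2)])
    return path


def analogy(z_a: Sequence[float], z_b: Sequence[float], z_c: Sequence[float]) -> Vector:
    """Apply the offset (z_b - z_a) to z_c: 'a is to b as c is to ?'."""
    if not (len(z_a) == len(z_b) == len(z_c)):
        raise ValueError("latent vectors must have the same dimension")
    return [b - a + c for a, b, c in zip(z_a, z_b, z_c)]
```

Each returned vector would then be passed to the decoder to produce a sentence; that decoding step is outside this sketch.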

4. Low-resource Language Understanding

Collect and Plot Results

Once the networks are trained and the results are saved, we extract the key results using a Python script. The results can be plotted using the included IPython notebook plots/main_plots.ipynb. Start the IPython Notebook server:

$ cd plots
$ ipython notebook

Select the main_plots.ipynb notebook and execute the included code. Note that we have copied our extracted results into the notebook, so without modification the script will output the figures in the paper. If you have run your own training and wish to plot the results, you will have to organize your results in the same format.

Questions?

Please drop me (Chunyuan) a line if you have any questions.

@inproceedings{li2020_Optimus,
  title={Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space},
  author={Li, Chunyuan and Gao, Xiang and Li, Yuan and Li, Xiujun and Peng, Baolin and Zhang, Yizhe and Gao, Jianfeng},
  booktitle={EMNLP},
  year={2020}
}

Comments

Curious about the Computing Resources for Pre-training Optimus

The paper states:

First, our pre-trained language VAE is still under-trained due to limited compute resource, as the training reconstruction loss can still decrease. One may further train the models with higher latent dimension and longer time to fully release the power of pre-trained latent spaces.

So how long did it take to pre-train Optimus in terms of days or weeks with its encoder and decoder initialized with weights of BERT and GPT-2 respectively?

opened by fakeProgrammer0 5

Suggestion for some added functions

Your program works very well! I rewrote the interpolation function to make it easier for me to use in different ways. Perhaps others would also find this useful.

import torch

# Assumes sample_sequence_conditional from run_latent_generation.py is in scope.

def latent_code_from_text(text, encoder_tokenizer, model_vae, args):
    # Wrap the token ids with BERT's [CLS] (101) and [SEP] (102) ids.
    tokenized1 = encoder_tokenizer.encode(text)
    tokenized1 = [101] + tokenized1 + [102]
    x0 = torch.tensor([tokenized1], dtype=torch.long).to(args.device)
    with torch.no_grad():
        pooled_hidden_fea = model_vae.encoder(x0, attention_mask=(x0 > 0).float())[1]
        mean, logvar = model_vae.encoder.linear(pooled_hidden_fea).chunk(2, -1)
        # Use the posterior mean as the latent code (no sampling).
        latent_z = mean.squeeze(1)
    coded_length = len(tokenized1)
    return latent_z, coded_length

def text_from_latent_code(latent_z, model_vae, sentence_length, args, decoder_tokenizer):
    past = latent_z
    context_tokens = decoder_tokenizer.encode('<BOS>')
    # Lengths are integers, so build the tensor as long rather than float.
    length = torch.tensor([[sentence_length]], dtype=torch.long)
    out = sample_sequence_conditional(
        model=model_vae.decoder,
        context=context_tokens,
        past=past,
        length=length,  # Chunyuan: Fix length; or use <EOS> to complete a sentence
        temperature=args.temperature,
        top_k=args.top_k,
        top_p=args.top_p,
        device=args.device,
        decoder_tokenizer=decoder_tokenizer
    )
    text_x1 = decoder_tokenizer.decode(out[0, :].tolist(), clean_up_tokenization_spaces=True)
    # Drop the first and last whitespace-separated tokens (the <BOS>/<EOS> markers).
    text_x1 = text_x1.split()[1:-1]
    text_x1 = ' '.join(text_x1)
    return text_x1

...

# and then in the main function
latent_z1, coded_length1 = latent_code_from_text("a brown dog likes to eat his food very slowly .", tokenizer_encoder, model_vae, args)
latent_z2, coded_length2 = latent_code_from_text("a yellow cat likes to chase a long string .", tokenizer_encoder, model_vae, args)

result = text_from_latent_code((latent_z1 + latent_z2) / 2, model_vae, coded_length1, args, tokenizer_decoder)
print(result)

opened by summerstay 3

GPT2ForLatentConnector

Hello, great work there !

Everything is fine with the Docker env, and the code works amazingly. However, I would like to run the code in my own conda environment, and it seems that the version of pytorch-transformers (1.2.0) in your requirements file doesn't have GPT2ForLatentConnector and BertForLatentConnector, and I couldn't find them in other versions.

Could you give me the actual version that is used in your code? Or even the source code for these classes?

Thanks a lot ! Romain

opened by Bila12 2

About Pre-training on the Wikipedia dataset

Hi Chunyuan,

Thank you for sharing the source code.

I am wondering whether there is any reason why we need to fine-tune Optimus on the Wikipedia dataset first and then fine-tune it on another four datasets (e.g., ptb, snli) for performance evaluation.

Pre-training Optimus directly on the four datasets is also possible, though the results will definitely be worse than the reported ones. The effectiveness of Optimus should be independent of whether it was pre-trained on the Wikipedia dataset, right? Otherwise, if we want to further develop a new model, we have to fine-tune it on the Wikipedia dataset first.

Looking forward to hearing from you.

Best,

Dong

opened by dongqian0206 2

Question: why this choice of BERT and GPT2?

Hi, thank you for this work and for releasing the code as well! 🎉 I was wondering if there was any reason you chose to use BERT as an encoder and GPT2 as a decoder, instead of other pretrained language models? In particular, why not consider models that already have an encoder/decoder architecture, such as T5 or BART? Thanks

opened by alxthm 2

Number of pretraining epochs

The settings in 'train_vae_wikipedia.yaml' and 'train_vae_wikipedia_distributed.yaml' seem to differ a lot (20 in the former, 1 in the latter). How many pretraining epochs did you run, and which script should I refer to?

opened by bo-son 2

interpolation scheme

congratulations! indeed, controlled text generation works!

quick experiments are very promising

experiment 1: purpose is to generate sentences where the age of the boy is continuously increasing and spelled out in words

src/target: 1 - > 100
seed sentence: the boy is twelve years old.

0: 0.000000 the boy is twelve years old.
13: 0.206349 the boy is twenty years old.
24: 0.380952 the boy is forty years old.
59: 0.936508 the boy is fifty years old.

(showing only uniq samples)

experiment 2: controlling both increasing age and gender

src/target 1: 1 - > 100
src/target 2: man - > woman
seed sentence: the boy is twelve years old.

0: 0.000000 the boy is twelve years old.
40: 0.317460 the girl is twelve years old.
49: 0.388889 the girl is twenty years old.

(showing only uniq samples)

experiment 3: interpolation

0: 0.00 the sisters are hugging while holding up goodbye to get snacks before going home.
1: 0.10 the sisters are hugging while holding up snacks next to goodbye for their dad.
2: 0.20 the sisters are hugging while holding up goodbye to shopping bags in a .
3: 0.30 the sisters are hugging while holding up a sign in front of york airlines.
4: 0.40 the girl wearing beanies stands next to a truck while celebrating together.
5: 0.50 a girl in blue shirts stands posing next to a refrigerator while holding up important .
6: 0.60 a boy in a blue shirt standing amidst all construction logos is hugging while laying down a
7: 0.70 a man in a blue shirt standing next to packaging constructions with their thumbs in a row.
8: 0.80 a man in a blue outfit standing in front of a building styled like garage vaults with
9: 0.90 a man in a blue shirt standing in front of a construction base with styled decorations
10: 1.00 a man in a blue shirt standing in front of a design center with structure painted `` funhouse ''

i like that it goes from "sisters" to "man" through "girl" and "boy"; this is also smooth in some sense :)

just amazing !!! not every run gives good results but it is definitely a step forward! just a question of time to get it working right.

and here is the issue/question

I noticed you use a linear interpolation scheme, but as Ferenc Huszár pointed out here https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/ it makes sense to move the interpolating trajectory along the surface of a sphere.
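The spherical alternative raised in this question can be sketched as follows. This is a minimal sketch of the standard spherical linear interpolation (slerp) formula, assuming latent codes are plain lists of floats. The function name slerp and the fallback threshold are choices made for this illustration; none of it is part of the Optimus code.

```python
import math
from typing import List, Sequence


def slerp(z1: Sequence[float], z2: Sequence[float], tau: float) -> List[float]:
    """Spherical linear interpolation between z1 (tau=0) and z2 (tau=1)."""
    if len(z1) != len(z2):
        raise ValueError("latent vectors must have the same dimension")
    norm1 = math.sqrt(sum(a * a for a in z1))
    norm2 = math.sqrt(sum(b * b for b in z2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("slerp is undefined for zero vectors")
    # Angle between the two vectors, clamped against rounding error.
    cos_omega = sum(a * b for a, b in zip(z1, z2)) / (norm1 * norm2)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        # Vectors are (anti-)parallel: fall back to linear interpolation.
        return [(1.0 - tau) * a + tau * b for a, b in zip(z1, z2)]
    w1 = math.sin((1.0 - tau) * omega) / sin_omega
    w2 = math.sin(tau * omega) / sin_omega
    return [w1 * a + w2 * b for a, b in zip(z1, z2)]
```

For two unit vectors the result stays on the unit sphere, whereas the linear midpoint of the same vectors has a smaller norm. Whether this improves the decoded sentences is not something the sketch can show.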

opened by vseledkin 1

How about the reconstruction BLEU of AE and VAE?

Dear Researcher:

I trained an AE on the Flickr30k dataset and found that the reconstruction BLEU score is about 35 on the validation set. I think the reconstruction ability of the AE is better than that of the VAE. Did you test the reconstruction ability of both models? Do you have any results or examples of reconstruction?

Thanks, Chawdoe

opened by ChawDoe 0

Seems like checkpoints for {beta=0, beta=0.5} latent size=32 are the same checkpoints

For the following two checkpoints listed in optimus_finetune_language_models.md:

beta=0, latent size = 32 https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32_v2/checkpoint-508523.zip

beta=0.5, latent size = 32 https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.5_d1.0_ro0.5_ra0.25_32_v2/checkpoint-508523.zip

Their sums of all parameters are the same. So I think they are the same checkpoints. Could anyone please double-check this?

Btw, thanks for publishing your work on github.
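A parameter sum can coincide by chance, so a byte-level comparison of the two downloaded archives is a stronger first check. The following is a minimal sketch, assuming both zip files are already on local disk. The helper names file_sha256 and same_file_contents are hypothetical.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a: str, path_b: str) -> bool:
    """True if both files have identical bytes (by SHA-256)."""
    return file_sha256(path_a) == file_sha256(path_b)
```

Matching digests would indicate the same archive was uploaded twice. Differing digests would not by themselves prove the weights differ, since archive metadata can vary; in that case the tensors would need to be compared one by one.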

opened by yiminzme 0

How to run the Label-Conditional Text Generation experiment on YELP dataset

Dears

Thanks for sharing your amazing work!

I am trying to run the Label-Conditional Text Generation experiment, but unfortunately I could not find the entry point for training: no code seems to call the class "Ctrl_Gen". It would be appreciated if you could point me to where I can reproduce your results for the Label-Conditional Text Generation experiment.

Thanks in advance!

opened by eslambakr 0

Question about mutual information

Hello, thank you very much for making the code available. I'm confused about the mutual information math, more specifically about the line

E_{q(z|x)}log(q(z|x)) = -0.5*nz*log(2*\pi) - 0.5*(1+logvar).sum(-1)
neg_entropy += (-0.5 * nz * math.log(2 * math.pi)- 0.5 * (1 + logvar).sum(-1)).sum().item()

When I derive it, it gives me

neg_entropy += (-0.5 * nz * math.log(2 * math.pi)- 0.5 * (logvar).sum(-1)).sum().item()

So I think I must have made a mistake somewhere? Thank you very much
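The two expressions can be compared with a short derivation. This is a sketch, assuming q(z|x) is a diagonal Gaussian with mean \mu and variances \sigma_i^2, with logvar = \log\sigma^2 and n_z latent dimensions; the symbols \mu_i and \sigma_i are introduced here for illustration.

```latex
\begin{aligned}
\log q(z \mid x)
  &= -\tfrac{n_z}{2}\log(2\pi)
     - \tfrac{1}{2}\sum_{i=1}^{n_z}\log\sigma_i^2
     - \tfrac{1}{2}\sum_{i=1}^{n_z}\frac{(z_i-\mu_i)^2}{\sigma_i^2}, \\
\mathbb{E}_{q(z \mid x)}\!\left[\frac{(z_i-\mu_i)^2}{\sigma_i^2}\right]
  &= 1 \quad \text{for each } i, \\
\mathbb{E}_{q(z \mid x)}\bigl[\log q(z \mid x)\bigr]
  &= -\tfrac{n_z}{2}\log(2\pi)
     - \tfrac{1}{2}\sum_{i=1}^{n_z}\bigl(1 + \log\sigma_i^2\bigr).
\end{aligned}
```

Under these assumptions the extra 1 comes from the expectation of the quadratic term. Dropping that term yields the version without the 1.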
opened by smolPixel 0

issue about reproducing results on SNLI dataset

Hi! I'm trying to reproduce the reported result on SNLI. I followed the doc 'optimus_for_snli.md' and successfully downloaded the checkpoints, but when I run your examples, it turns out that in the file run_latent_generation.py, the sample_sequence_conditional function receives 'input_ids' and 'past' with mismatched shapes. I can fix this with past = torch.mean(past, dim=0).unsqueeze(0), but is that right? Thanks for reading.

opened by 20000607-lxc 1

Chinese Pretrained Model

Hi!
Thanks for releasing the code and checkpoints. Have you released a model pre-trained on a Chinese dataset?
I look forward to your reply!

opened by ywb2018 0

