Optimus: the first large-scale pre-trained VAE language model

Overview

Optimus: the first pre-trained Big VAE language model

This repository contains source code necessary to reproduce the results presented in the EMNLP 2020 paper Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space.

The network architecture of Optimus: an encoder for representation learning and a decoder for generation. Sentences are organized and manipulated in a pre-trained, compact and smooth latent space.

For more on this project, see the Microsoft Research Blog post.

News

May 21, 2020: Releasing a demo for latent space manipulation, including sentence interpolation and analogy. Check out the website.

May 20, 2020: The latent space manipulation code has been cleaned and released. See the instructions at doc/optimius_for_snli.md.

May 13, 2020: The fine-tuning code for language modeling is released. See the instructions at doc/optimus_finetune_language_models.md.

Contents

There are four steps to use this codebase to reproduce the results in the paper.

  1. Dependencies
  2. Prepare datasets
  3. Model training
    1. Pre-training on sentences in Wikipedia
    2. Language Modeling
    3. Guided Language Generation
    4. Low-resource Language Understanding
  4. Collect and plot results

Dependencies

Pull the Docker image chunyl/pytorch-transformers:v2 from Docker Hub. Please see the instructions at doc/env.md.

The project is organized in the following structure, with the essential files and folders shown. The output folder saves the model checkpoints.

├── Optimus
    ├── code
        ├── examples
            ├── big_ae
                ├── modules
                    ├── vae.py
                    └── ...
                ├── run_lm_vae_pretraining_phdist_beta.py
                ├── run_lm_vae_training.py
                └── ...
        ├── pytorch_transformers
            ├── modeling_bert.py
            ├── modeling_gpt2.py
            └── ...
        └── scripts
            ├── scripts_docker
            ├── scripts_local
            └── scripts_philly
    ├── data
        └── datasets
            ├── wikipedia_json_64_filtered
                └── ...
            ├── snli_data
            └── ...
    └── output
        ├── pretrain
        ├── LM
        └── ...

Prepare Datasets

Please download or prepare the data by following the instructions at data/download_datasets.md.

Model Training

1. Pre-training on sentences in Wikipedia

We pre-trained our models on Philly (a Microsoft internal compute cluster); the code is specialized for multi-node, multi-GPU training on that platform. The main pre-training script is run_lm_vae_pretraining_phdist_beta.py. You may need to adjust the distributed training scripts for your own environment.
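For orientation, here is a conceptual sketch of the objective the pre-training script optimizes. It is a hedged approximation rather than the repository's exact code: the model_vae.encoder / model_vae.decoder interfaces follow the snippets quoted in the comments section further down this page, and everything else (argument names, loss indexing) should be treated as an assumption.

import torch

def vae_training_step(model_vae, bert_ids, gpt2_ids, beta=1.0):
    # Encode with BERT and read off the Gaussian posterior q(z|x).
    pooled = model_vae.encoder(bert_ids, attention_mask=(bert_ids > 0).float())[1]
    mean, logvar = model_vae.encoder.linear(pooled).chunk(2, -1)
    # Reparameterize: z = mean + std * eps.
    z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
    # Reconstruct with the GPT-2 decoder conditioned on z; the reconstruction loss
    # is assumed here to be the first element of the decoder output.
    rec_loss = model_vae.decoder(gpt2_ids, past=z, labels=gpt2_ids)[0]
    # KL(q(z|x) || N(0, I)), weighted by beta as suggested by the "beta" in the script name.
    kl = 0.5 * (mean.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()
    return rec_loss + beta * kl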

2. Language Modeling

For a fair comparison with existing VAE language models, we consider a model with latent dimension 32. The pre-trained model is fine-tuned on four commonly used datasets for one epoch. Please see the details at doc/optimus_finetune_language_models.md.

3. Guided Language Generation

Latent Space Manipulation. To ensure good performance, we consider a model with latent dimension 768. The pre-trained model is fine-tuned on the SNLI dataset, where sentences show related patterns. Please see the details at doc/optimius_for_snli.md.
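For a quick feel of what the fine-tuned model supports, below is a minimal interpolation sketch. It is an illustration only: it assumes a loaded model_vae, tokenizer_encoder, tokenizer_decoder and args, and reuses the latent_code_from_text / text_from_latent_code helper functions contributed in the comments section further down this page.

import torch

# Linearly interpolate between the latent codes of two sentences and decode
# each intermediate point back to text.
z1, n1 = latent_code_from_text("a brown dog likes to eat his food very slowly .", tokenizer_encoder, model_vae, args)
z2, n2 = latent_code_from_text("a yellow cat likes to chase a long string .", tokenizer_encoder, model_vae, args)

for t in torch.linspace(0.0, 1.0, steps=11):
    z_t = (1.0 - float(t)) * z1 + float(t) * z2
    print(round(float(t), 1), text_from_latent_code(z_t, model_vae, n1, args, tokenizer_decoder))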

4. Low-resource Language Understanding

Collect and Plot Results

Once the networks are trained and the results are saved, we extract the key results using a Python script. The results can be plotted using the included IPython notebook plots/main_plots.ipynb. Start the IPython notebook server:

$ cd plots
$ ipython notebook

Select the main_plots.ipynb notebook and execute the included code. Note that our extracted results are already copied into the notebook, so running it without modification will reproduce the figures in the paper. If you've run your own training and wish to plot your results, you'll need to organize them in the same format.

Questions?

Please drop me (Chunyuan) a line if you have any questions. If you find this work useful, please cite our paper:

@inproceedings{li2020_Optimus,
  title={Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space},
  author={Li, Chunyuan and Gao, Xiang and Li, Yuan and Li, Xiujun and Peng, Baolin and Zhang, Yizhe and Gao, Jianfeng},
  booktitle={EMNLP},
  year={2020}
}
Comments
  • Curious about the Computing Resources for Pre-training Optimus

    In the paper, it says:

    First, our pre-trained language VAE is still under-trained due to limited compute resource, as the training reconstruction loss can still decrease. One may further train the models with higher latent dimension and longer time to fully release the power of pre-trained latent spaces.

    So how long did it take to pre-train Optimus in terms of days or weeks with its encoder and decoder initialized with weights of BERT and GPT-2 respectively?

    opened by fakeProgrammer0 5
  • Suggestion for some added functions

    Your program works very well! I rewrote the interpolation function to make it easier for me to use in different ways. Perhaps others would also find this useful.

    import torch
    # Note: `sample_sequence_conditional` comes from the repo's run_latent_generation.py.

    def latent_code_from_text(text, encoder_tokenizer, model_vae, args):
        # Wrap the tokenized text with BERT's [CLS] (101) and [SEP] (102) ids.
        tokenized1 = encoder_tokenizer.encode(text)
        tokenized1 = [101] + tokenized1 + [102]
        coded1 = torch.Tensor([tokenized1]).long()
        with torch.no_grad():
            x0 = coded1.to(args.device)
            # Pooled BERT feature -> (mean, logvar); use the mean as the latent code.
            pooled_hidden_fea = model_vae.encoder(x0, attention_mask=(x0 > 0).float())[1]
            mean, logvar = model_vae.encoder.linear(pooled_hidden_fea).chunk(2, -1)
            latent_z = mean.squeeze(1)
            coded_length = len(tokenized1)
            return latent_z, coded_length

    def text_from_latent_code(latent_z, model_vae, sentence_length, args, decoder_tokenizer):
        past = latent_z
        context_tokens = decoder_tokenizer.encode('<BOS>')
        length = torch.Tensor([[sentence_length]])
        out = sample_sequence_conditional(
            model=model_vae.decoder,
            context=context_tokens,
            past=past,
            length=length,  # Chunyuan: Fix length; or use <EOS> to complete a sentence
            temperature=args.temperature,
            top_k=args.top_k,
            top_p=args.top_p,
            device=args.device,
            decoder_tokenizer=decoder_tokenizer
        )
        # Decode the generated ids and strip the <BOS>/<EOS> tokens.
        text_x1 = decoder_tokenizer.decode(out[0, :].tolist(), clean_up_tokenization_spaces=True)
        text_x1 = ' '.join(text_x1.split()[1:-1])
        return text_x1

    ...

    # and then in the main function
    latent_z1, coded_length1 = latent_code_from_text("a brown dog likes to eat his food very slowly .", tokenizer_encoder, model_vae, args)
    latent_z2, coded_length2 = latent_code_from_text("a yellow cat likes to chase a long string .", tokenizer_encoder, model_vae, args)

    result = text_from_latent_code((latent_z1 + latent_z2) / 2, model_vae, coded_length1, args, tokenizer_decoder)
    print(result)
    
    opened by summerstay 3
  • GPT2ForLatentConnector

    Hello, great work there !

    Everything is fine with the Docker env, and the code works amazingly. However, I would like to run the code in my own conda environment, and it seems that the version of pytorch-transformers (1.2.0) in your requirements file doesn't have GPT2ForLatentConnector and BertForLatentConnector, and I couldn't find them in other versions.

    Could you give me the actual version that is used in your code? Or even the source code for these classes.

    Thanks a lot! Romain

    opened by Bila12 2
  • About Pre-training on the Wikipedia dataset

    Hi Chunyuan,

    Thank you for sharing the source code.

    I am wondering whether there is any reason why we need to fine-tune Optimus on the Wikipedia dataset first and then fine-tune it on the other four datasets (i.e., PTB, SNLI, etc.) for performance evaluation.

    Pre-training Optimus directly on the four datasets is also possible, though the results will definitely be worse than the reported ones. The effectiveness of Optimus should be independent of whether it was pre-trained on the Wikipedia dataset, right? Otherwise, if we want to further develop a new model, we would have to fine-tune it on the Wikipedia dataset first.

    Looking forward to hearing from you.

    Best,

    Dong

    opened by dongqian0206 2
  • Question: why this choice of BERT and GPT2?

    Hi, thank you for this work and for releasing the code as well! 🎉 I was wondering if there was any reason you chose to use BERT as the encoder and GPT2 as the decoder, instead of other pre-trained language models? In particular, why not consider models that already have an encoder-decoder architecture, such as T5 or BART? Thanks

    opened by alxthm 2
  • Number of pretraining epochs

    The settings in 'train_vae_wikipedia.yaml' and 'train_vae_wikipedia_distributed.yaml' seem to differ a lot (20 epochs in the former, 1 in the latter). How many pre-training epochs did you run, and which script should I refer to?

    opened by bo-son 2
  • interpolation scheme

    congratulations! indeed, controlled text generation works!

    quick experiments are very promising

    experiment 1: the purpose is to generate sentences where the age of the boy continuously increases and is spelled out in words

    src/target: 1 -> 100
    seed sentence: the boy is twelve years old.

    0:  0.000000  the boy is twelve years old.
    13: 0.206349  the boy is twenty years old.
    24: 0.380952  the boy is forty years old.
    59: 0.936508  the boy is fifty years old.

    (showing only unique samples)

    experiment 2: controlling both increasing age and gender

    src/target 1: 1 - > 100 src/target 2: man - > woman seed sentence: the boy is twelve years old.

    0: 0.000000 the boy is twelve years old. 40: 0.317460 the girl is twelve years old. 49: 0.388889 the girl is twenty years old.

    (showing only uniq samples)

    experiment 3: interpolation

    0:  0.00  the sisters are hugging while holding up goodbye to get snacks before going home.
    1:  0.10  the sisters are hugging while holding up snacks next to goodbye for their dad.
    2:  0.20  the sisters are hugging while holding up goodbye to shopping bags in a .
    3:  0.30  the sisters are hugging while holding up a sign in front of york airlines.
    4:  0.40  the girl wearing beanies stands next to a truck while celebrating together.
    5:  0.50  a girl in blue shirts stands posing next to a refrigerator while holding up important .
    6:  0.60  a boy in a blue shirt standing amidst all construction logos is hugging while laying down a
    7:  0.70  a man in a blue shirt standing next to packaging constructions with their thumbs in a row.
    8:  0.80  a man in a blue outfit standing in front of a building styled like garage vaults with
    9:  0.90  a man in a blue shirt standing in front of a construction base with styled decorations
    10: 1.00  a man in a blue shirt standing in front of a design center with structure painted `` funhouse ''

    I like that it goes from "sisters" to "man" through "girl" and "boy"; this is also smooth in some sense :)

    Just amazing!!! Not every run gives good results, but it is definitely a step forward! It is just a question of time to get it working right.

    And here is the issue/question:

    I noticed you use a linear interpolation scheme, but as pointed out by Ferenc Huszár here https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/, it makes sense to evolve the interpolating trajectory along the surface of a sphere.
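    For reference, a minimal sketch of the suggested spherical interpolation (slerp), written against plain torch tensors rather than the repo's API; z1 and z2 stand for two latent codes such as those returned by the latent_code_from_text helper in the comment above:

    import torch

    def slerp(z1, z2, t):
        # Interpolate along the great circle between z1 and z2 instead of the straight chord.
        z1_n = z1 / z1.norm()
        z2_n = z2 / z2.norm()
        omega = torch.acos((z1_n * z2_n).sum().clamp(-1.0, 1.0))  # angle between the two codes
        so = torch.sin(omega)
        if so.item() < 1e-8:
            # Nearly parallel codes: fall back to linear interpolation.
            return (1.0 - t) * z1 + t * z2
        return (torch.sin((1.0 - t) * omega) / so) * z1 + (torch.sin(t * omega) / so) * z2

    # e.g. decode slerp(latent_z1, latent_z2, 0.5) instead of (latent_z1 + latent_z2) / 2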

    opened by vseledkin 1
  • How about the reconstruction BLEU of AE and VAE?

    Dear Researcher:

    I trained an AE on the Flickr30k dataset and found that the reconstruction BLEU score is about 35 on the validation set. I think the reconstruction ability of the AE is better than that of the VAE. I wonder whether you tested the reconstruction ability of both models? Do you have any results or examples of reconstruction?

    Thanks, Chawdoe

    opened by ChawDoe 0
  • Seems like checkpoints for {beta=0, beta=0.5} latent size=32 are the same checkpoints

    For the following two checkpoints listed in optimus_finetune_language_models.md:

    beta=0, latent size = 32 https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32_v2/checkpoint-508523.zip

    beta=0.5, latent size = 32 https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.5_d1.0_ro0.5_ra0.25_32_v2/checkpoint-508523.zip

    Their sums of all parameters are the same. So I think they are the same checkpoints. Could anyone please double-check this?
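    For what it's worth, here is a quick elementwise check (a sketch only; it assumes each extracted checkpoint contains a state dict saved as pytorch_model.bin, and the paths below are placeholders):

    import torch

    # Load both state dicts on CPU and compare every tensor exactly,
    # which is stricter than comparing parameter sums.
    sd_a = torch.load("checkpoint_beta0.0/pytorch_model.bin", map_location="cpu")
    sd_b = torch.load("checkpoint_beta0.5/pytorch_model.bin", map_location="cpu")

    identical = sd_a.keys() == sd_b.keys() and all(
        torch.equal(sd_a[k], sd_b[k]) for k in sd_a
    )
    print("checkpoints identical:", identical)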

    Btw, thanks for publishing your work on github.

    opened by yiminzme 0
  • How to run the Label-Conditional Text Generation experiment on YELP dataset

    Dears

    Thanks for sharing your amazing work!

    I am trying to run the Label-Conditional Text Generation experiment, but unfortunately I could not find the entry point for training, since no code calls the class "Ctrl_Gen". It would be appreciated if you could guide me on where I can reproduce your results for the Label-Conditional Text Generation experiment.

    Thanks in advance!

    opened by eslambakr 0
  • Question about mutual information

    Hello, thank you very much for making the code available. I'm confused about the mutual information math, more specifically about the line

         E_{q(z|x)}log(q(z|x)) = -0.5*nz*log(2*\pi) - 0.5*(1+logvar).sum(-1)
        neg_entropy += (-0.5 * nz * math.log(2 * math.pi)- 0.5 * (1 + logvar).sum(-1)).sum().item()
    

    When I derive it, it gives me neg_entropy += (-0.5 * nz * math.log(2 * math.pi)- 0.5 * (logvar).sum(-1)).sum().item()

    So I think I must have made a mistake somewhere? Thank you very much
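    For reference, the standard entropy computation for a diagonal Gaussian q(z|x) = N(mu, diag(exp(logvar))) with nz dimensions works out as follows (written in LaTeX; the last step uses E[(z_i - mu_i)^2] = sigma_i^2):

    \begin{aligned}
    \mathbb{E}_{q(z|x)}\!\left[\log q(z|x)\right]
      &= -\frac{n_z}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n_z}\log\sigma_i^2
         - \frac{1}{2}\sum_{i=1}^{n_z}\frac{\mathbb{E}\!\left[(z_i-\mu_i)^2\right]}{\sigma_i^2} \\
      &= -\frac{n_z}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n_z}\left(1 + \log\sigma_i^2\right)
    \end{aligned}

    which matches the (1 + logvar) term in the repository code; the derivation quoted above drops the constant -n_z/2 contributed by the quadratic term.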

    opened by smolPixel 0
  • issue about reproducing results on SNLI dataset

    Hi! I'm trying to reproduce the reported results on SNLI. I followed the doc 'optimus_for_snli.md' and successfully downloaded the checkpoints, but when I run your examples, it turns out that in run_latent_generation.py the sample_sequence_conditional function receives 'input_ids' and 'past' with mismatched shapes. I can fix this with past = torch.mean(past, dim=0).unsqueeze(0), but is that right? Thanks for reading.

    opened by 20000607-lxc 1
  • Chinese Pretrained Model

    Hi! Thanks for releasing the code and checkpoints, but I want to know whether you have released a model pre-trained on a Chinese dataset? Looking forward to your reply!
    
    opened by ywb2018 0