"Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback"

Overview

bandit-nmt

THIS REPO DEMONSTRATES HOW TO INTEGRATE A POLICY GRADIENT METHOD INTO NMT. FOR A STATE-OF-THE-ART NMT CODEBASE, VISIT simple-nmt.

This is the code repo for our EMNLP 2017 paper "Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback", which implements the A2C algorithm on top of a neural encoder-decoder model and benchmarks the combination under simulated noisy rewards.
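
For intuition only, below is a minimal sketch of a bandit-style A2C update assuming a single scalar per-sentence reward (e.g. simulated per-sentence BLEU): the actor's sampled token log-probabilities are weighted by the reward minus the critic's baseline, and the critic is regressed toward the reward. All names here (toy_a2c_step, log_probs, values) are illustrative and not the repo's actual API.

import torch

def toy_a2c_step(log_probs, values, reward):
    # log_probs: (T,) log-probabilities of the sampled target tokens (actor)
    # values:    (T,) critic estimates of the sentence-level reward at each step
    # reward:    scalar feedback for the whole sampled translation
    reward = torch.as_tensor(reward, dtype=values.dtype)
    advantage = reward - values.detach()           # baseline-subtracted reward
    actor_loss = -(advantage * log_probs).sum()    # policy-gradient surrogate
    critic_loss = ((values - reward) ** 2).sum()   # regress values toward the reward
    return actor_loss, critic_loss

# Toy usage:
log_probs = torch.randn(5, requires_grad=True)
values = torch.rand(5, requires_grad=True)
actor_loss, critic_loss = toy_a2c_step(log_probs, values, reward=0.42)
(actor_loss + critic_loss).backward()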

Requirements:

  • Python 3.6
  • PyTorch 0.2

NOTE: as of Sep 16 2017, the code became about 2x slower after upgrading to PyTorch 0.2. This is a known issue that the PyTorch team is fixing.

IMPORTANT: Set home directory (otherwise scripts will not run correctly):

> export BANDIT_HOME=$PWD
> export DATA=$BANDIT_HOME/data
> export SCRIPT=$BANDIT_HOME/scripts

Data extraction

Download pre-processing scripts

> cd $DATA/scripts
> bash download_scripts.sh

For German-English

> cd $DATA/en-de
> bash extract_data_de_en.sh

NOTE: train_2014 and train_2015 overlap substantially. Please be cautious when using them for other projects.

Data should be ready in $DATA/en-de/prep

TODO: Chinese-English needs segmentation

Data pre-processing

> cd $SCRIPT
> bash make_data.sh de en

Pretraining

Pretrain both actor and critic

> cd $SCRIPT
> bash pretrain.sh en-de $YOUR_LOG_DIR

See scripts/pretrain.sh for more details.

Pretrain actor only

> cd $BANDIT_HOME
> python train.py -data $YOUR_DATA -save_dir $YOUR_SAVE_DIR -end_epoch 10
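
Actor pretraining here is standard teacher-forced cross-entropy training of the encoder-decoder. As a rough, self-contained sketch only (names and shapes are illustrative, not the repo's code):

import torch
import torch.nn.functional as F

def toy_xent_loss(logits, targets, pad_idx=0):
    # logits:  (batch, T, vocab) decoder scores under teacher forcing
    # targets: (batch, T) gold target token ids
    # Padding positions are ignored so sentence length does not skew the loss.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_idx,
    )

# Toy usage:
logits = torch.randn(2, 7, 100)
targets = torch.randint(0, 100, (2, 7))
loss = toy_xent_loss(logits, targets)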

Reinforcement training

> cd $BANDIT_HOME

From scratch

> python train.py -data $YOUR_DATA -save_dir $YOUR_SAVE_DIR -start_reinforce 10 -end_epoch 100 -critic_pretrain_epochs 5

From a pretrained model

> python train.py -data $YOUR_DATA -load_from $YOUR_MODEL -save_dir $YOUR_SAVE_DIR -start_reinforce -1 -end_epoch 100 -critic_pretrain_epochs 5
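
The flags above control when each training phase runs. Purely as an illustration of how -start_reinforce and -critic_pretrain_epochs could interact (this is not the repo's exact logic; see train.py for the real behavior):

def toy_phase(epoch, start_reinforce, critic_pretrain_epochs):
    # start_reinforce < 0 means reinforcement training begins immediately
    # (e.g. when resuming from a pretrained model with -start_reinforce -1).
    rl_start = 0 if start_reinforce < 0 else start_reinforce
    if epoch < rl_start:
        return "cross_entropy"      # supervised pretraining of the actor
    if epoch < rl_start + critic_pretrain_epochs:
        return "critic_pretrain"    # fit the critic before policy-gradient updates
    return "a2c"                    # joint actor-critic reinforcement training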

Perturbed rewards

For example, to use a thumbs-up/thumbs-down reward:

> cd $BANDIT_HOME
> python train.py -data $YOUR_DATA -load_from $YOUR_MODEL -save_dir $YOUR_SAVE_DIR -start_reinforce -1 -end_epoch 100 -critic_pretrain_epochs 5 -pert_func bin -pert_param 1

See lib/metric/PertFunction.py for more types of function.
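
As an illustration of the thumbs-up/thumbs-down case only (the actual implementations live in lib/metric/PertFunction.py; bin_reward and the 0.5 threshold below are assumptions, not the repo's code):

def bin_reward(bleu, threshold=0.5):
    # Collapse a per-sentence BLEU score in [0, 1] to binary simulated feedback:
    # 1.0 (thumbs up) if the score clears the threshold, else 0.0 (thumbs down).
    return 1.0 if bleu >= threshold else 0.0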

Evaluation

> cd $BANDIT_HOME

On heldout sets (heldout BLEU):

> python train.py -data $YOUR_DATA -load_from $YOUR_MODEL -eval -save_dir .

On bandit set (per-sentence BLEU):

> python train.py -data $YOUR_DATA -load_from $YOUR_MODEL -eval_sample -save_dir .
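
Per-sentence BLEU usually needs smoothing, since a single sentence often has no higher-order n-gram matches. A minimal sketch using NLTK (NLTK is an assumption here, not necessarily what the repo uses):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def per_sentence_bleu(hypothesis, reference):
    # Smoothed sentence-level BLEU between one hypothesis and one reference string.
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference.split()], hypothesis.split(),
                         smoothing_function=smooth)

# Example:
# per_sentence_bleu("the cat sat on the mat", "the cat is on the mat")
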
Comments
  • How to estimate the V value?

    @khanhptnk Dear author, I am new to reinforcement learning and I am confused about the estimation of the V value in your implementation. As you know, let V(\hat{y}_{<t}, x) denote the current value estimate; then V(\hat{y}_{<t}, x) should approximate the expected future reward, i.e., r_{t+1} + r_{t+2} + ... + r_T. In your implementation, V(\hat{y}_{<t}, x) is estimated by minimizing the MSE between itself and the true value R(\hat{y}, x), i.e., \min [V(\hat{y}_{<t}, x) - R(\hat{y}, x)]^2, where I guess that R(\hat{y}, x) is the final reward rather than the expected future reward r_{t+1} + r_{t+2} + ... + r_T. So my question is: why can the V value be estimated in that way (it seems that MIXER also estimates the V value like this)? Apparently, V(\hat{y}_{<t}, x) should approximate the expected future reward instead of the final reward R(\hat{y}, x), is that right? I would be grateful if you could provide some explanations/suggestions.

    opened by ganji15 5
  • error while running extract_data_de_en.sh

    x en-de/train.tags.en-de.de
    x en-de/train.tags.en-de.en
    Preprocessing...

    pre-processing train data...
    Tokenizer Version 1.1
    Language: de
    Number of threads: 8
    WARNING: No known abbreviations for language 'de', attempting fall-back to English version...
    ERROR: No abbreviations files found in /Users//bandit-nmt/data/scripts/../share/nonbreaking_prefixes

    Tokenizer Version 1.1
    Language: en
    Number of threads: 8
    WARNING: No known abbreviations for language 'en', attempting fall-back to English version...
    ERROR: No abbreviations files found in /Users//bandit-nmt/data/scripts/../share/nonbreaking_prefixes

    clean-corpus.perl: processing prep/tmp/train_2014.en-de.tok.de & .en to prep/tmp/train_2014.en-de.clean, cutoff 1-50, ratio 9

    Input sentences: 0  Output sentences: 0
    Tokenizer Version 1.1
    Language: de
    Number of threads: 8
    WARNING: No known abbreviations for language 'de', attempting fall-back to English version...
    ERROR: No abbreviations files found in /Users//bandit-nmt/data/scripts/../share/nonbreaking_prefixes

    opened by loveJasmine 2
  • How to split the dataset?

    Dear author,

    Thank you for sharing this high-quality code. In your code you split the training data into train_src for building the vocabulary, train_xe_src for supervised learning, and train_pg_src for policy gradient training. Can I treat the three datasets as the same? If not, how should I split my dataset?

    Any help from you is highly appreciated.

    help wanted 
    opened by wanyao1992 1
  • predictions for test data don't match the order in the source test data

    @khanhptnk, if there are no sentences longer than 50 tokens, then why do the output predictions for the test data not match the order of sentences in the source test data? I cannot find any reordering of sentences during preprocessing other than discarding sentences longer than 50. Any help?

    opened by vikrant97 0
Owner
Khanh Nguyen
PhD student in Machine Learning at the University of Maryland, College Park