Code for paper "Vocabulary Learning via Optimal Transport for Neural Machine Translation"

Overview

**The codebase and data are still being uploaded.**

VOLT(-py) is a vocabulary learning codebase that allows researchers and developers to automatically generate a vocabulary with suitable granularity for machine translation.

What's New:

  • July 2021: Support En-De translation, TED bilingual translation, and multilingual translation.
  • July 2021: Support subword-nmt tokenization.
  • July 2021: Support sentencepiece tokenization.

What's On-going:

  • Add translation training/evaluation codes.
  • Support classification tasks.
  • Support pip usage.

Features:

  • Efficient: CPU learning on one machine.
  • Simple: The core code is no more than 200 lines.
  • Easy-to-use: Supports the widely used tokenization toolkits subword-nmt and sentencepiece.
  • Flexible: Users can customize their own tokenization rules.

Requirements and Installation

The required environments:

  • python 3
  • tqdm
  • mosesdecoder
  • subword-nmt

To use VOLT and develop locally:

git clone https://github.com/Jingjing-NLP/VOLT/
cd VOLT
# clone the tokenization dependencies into the VOLT directory
git clone https://github.com/moses-smt/mosesdecoder
git clone https://github.com/rsennrich/subword-nmt
pip3 install sentencepiece
pip3 install tqdm 

Usage

  • The first step is to get vocabulary candidates and tokenized texts. The subword vocabulary candidates can be generated with subword-nmt or sentencepiece. Here are two examples:

    
    #Assume source_data is the file storing data in the source language
    #Assume target_data is the file storing data in the target language
    BPEROOT=subword-nmt
    size=30000 # the size of BPE
    cat source_data > training_data
    cat target_data >> training_data
    
    #subword-nmt style:
    mkdir bpeoutput
    BPE_CODE=code # the path to save vocabulary
    python3 $BPEROOT/learn_bpe.py -s $size  < training_data > $BPE_CODE
    python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < source_data > bpeoutput/source.file
    python3 $BPEROOT/apply_bpe.py -c $BPE_CODE < target_data > bpeoutput/target.file
    
    #sentencepiece style:
    mkdir spmout
    python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$size --character_coverage=1.0 --model_type=bpe
    #After this step, you will see spm.vocab and spm.model
    python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmout/source_data --output_format piece
    python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmout/target_data --output_format piece
    
  • The second step is to run the VOLT script, which accepts the following parameters (a rough sketch of the underlying optimal-transport step follows after these usage steps):

    • --source_file: the file storing data in the source language.
    • --target_file: the file storing data in the target language.
    • --token_candidate_file: the file storing token candidates.
    • --max_number: the maximum size of the vocabulary generated by VOLT.
    • --interval: the search granularity in VOLT.
    • --loop_in_ot: the maximum number of Sinkhorn iterations in the optimal-transport solution.
    • --tokenizer: the toolkit used to get the vocabulary candidates. Only subword-nmt and sentencepiece are supported.
    • --size_file: the file to store the vocabulary size generated by VOLT.
    • --threshold: the threshold deciding which tokens are added into the final vocabulary from the optimal matrix. A lower threshold means that fewer token candidates are dropped.
    #subword-nmt style
    python3 ../ot_run.py --source_file bpeoutput/source.file --target_file bpeoutput/target.file \
              --token_candidate_file $BPE_CODE \
              --vocab_file bpeoutput/vocab --max_number 10000 --interval 1000  --loop_in_ot 500 --tokenizer subword-nmt --size_file bpeoutput/size 
    #sentencepiece style
    python3 ../ot_run.py --source_file spmout/source_data --target_file spmout/target_data \
              --token_candidate_file spm.vocab \
              --vocab_file spmout/vocab --max_number 10000 --interval 1000  --loop_in_ot 500 --tokenizer sentencepiece --size_file spmout/size 
    
  • The third step is to use the generated vocabulary to tokenize your texts:

      #for subword-nmt toolkit
      python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < source_data > bpeoutput/source.file
      python3 $BPEROOT/apply_bpe.py -c bpeoutput/vocab < target_data > bpeoutput/target.file
    
      #for the sentencepiece toolkit, we only keep the optimal size and retrain sentencepiece with it
      best_size=$(cat spmoutput/size)
      python3 spm/spm_train.py --input=training_data --model_prefix=spm --vocab_size=$best_size --character_coverage=1.0 --model_type=bpe
    
      #After this step, you will see spm.vocab and spm.model
      python3 spm/spm_encoder.py --model spm.model --inputs source_data --outputs spmout/source_data --output_format piece
      python3 spm/spm_encoder.py --model spm.model --inputs target_data --outputs spmout/target_data --output_format piece
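
A rough illustration of where --loop_in_ot and --threshold enter: VOLT solves an entropy-regularized optimal-transport problem with the Sinkhorn algorithm and then selects tokens from the resulting transport matrix. The sketch below is only a conceptual illustration under simplified assumptions; the marginals, the cost matrix, and the selection rule here are made up for the example and are not the code in ot_run.py:

    # A rough, illustrative sketch (NOT ot_run.py): where --loop_in_ot and
    # --threshold conceptually enter the optimal-transport step.
    import numpy as np
    import ot  # the repository bundles a copy of the POT library

    tokens = ["lo", "low", "er"]                    # hypothetical token candidates
    chars = ["l", "o", "w", "e", "r"]               # character set
    token_freq = np.array([8.0, 5.0, 6.0])          # hypothetical frequencies
    char_freq = np.array([13.0, 13.0, 5.0, 6.0, 6.0])

    a = token_freq / token_freq.sum()               # marginal over token candidates
    b = char_freq / char_freq.sum()                 # marginal over characters
    cost = np.random.rand(len(tokens), len(chars))  # placeholder transport cost

    loop_in_ot = 500                                # --loop_in_ot: max Sinkhorn iterations
    threshold = 1e-4                                # --threshold: selection cutoff

    # Entropy-regularized OT solved with the Sinkhorn algorithm.
    transport = ot.sinkhorn(a, b, cost, reg=1.0, numItermax=loop_in_ot)

    # Keep tokens that receive enough transported character mass.
    vocab = [t for t, row in zip(tokens, transport) if row.max() > threshold]
    print(vocab)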
    

Examples

We provide several examples under the "examples/" directory.

Datasets

The WMT-14 En-De translation data can be downloaded via the provided run scripts.

For TED, you can download the data via the TED link.

Citation

Please cite as:

@inproceedings{volt,
  title = {Vocabulary Learning via Optimal Transport for Neural Machine Translation},
  author = {Jingjing Xu and Hao Zhou and Chun Gan and Zaixiang Zheng and Lei Li},
  booktitle = {Proceedings of ACL 2021},
  year = {2021},
}
Comments
  • Code throws an error

    Traceback (most recent call last):
    File "ot_run.py", line 174, in
      optimal_size = run_ot(oldtokens, chars, int(max_number),int(interval))
    File "ot_run.py", line 126, in run_ot
      Gs,_ = ot.sinkhorn(a,b,d_matrix,1.0,method='sinkhorn',numItermax=400)
    ValueError: too many values to unpack (expected 2)

    opened by q178 6
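
A likely cause, judging from the public POT API rather than the copy bundled with VOLT (so treat this as an assumption, not a confirmed fix): upstream POT's ot.sinkhorn returns only the transport matrix by default and returns a (matrix, log) pair only when log=True, so unpacking two values fails if an installed POT shadows the bundled copy. Both call patterns below unpack cleanly with upstream POT:

    # Illustrative only: two call patterns that work with upstream POT.
    import numpy as np
    import ot

    a = np.array([0.5, 0.3, 0.2])      # hypothetical row marginal
    b = np.array([0.4, 0.4, 0.2])      # hypothetical column marginal
    d_matrix = np.random.rand(3, 3)    # hypothetical cost matrix

    # Default: only the transport matrix is returned.
    Gs = ot.sinkhorn(a, b, d_matrix, 1.0, method='sinkhorn', numItermax=400)

    # With log=True: a (matrix, log-dict) pair is returned and can be unpacked.
    Gs, log = ot.sinkhorn(a, b, d_matrix, 1.0, method='sinkhorn', numItermax=400, log=True)
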
  • run_ot and run_ot_write

    Hi there,

    I am trying to reconcile your code with the description in Algorithm 1 of your paper.

    In the paper:

    entropy, vocab = get_vocab(optimal matrix)
    vocabularies.append(entropy, vocab)
    Output v∗ from vocabularies satisfying Eq. 3

    https://github.com/Jingjing-NLP/VOLT/blob/c9f2e692b2a25ba6be19d067020d0eeaa288ce55/ot_run.py#L141

    However, in the code for run_ot, the transport matrix or a vocabulary set for each timestep t is not stored, only the (vocab_size, entropy) pairs are.

    Then run_ot_write() takes this optimal vocab size and recalculates the transport matrix. I don't see how this differs from the matrix already computed in the for loop of run_ot; surely the same matrix is output? I also don't understand how run_ot_write() does the same thing as "Output v∗ from vocabularies satisfying Eq. 3" in Algorithm 1, since no vocabularies are taken into consideration.

    Would be very grateful if you could help clarify the above, as I am keen to implement your work :)

    opened by kirefu 4
  • Are there any recommended hyperparameters for a larger dataset?

    I tried VOLT on a larger dataset (around 100 million) and it reduced the BPE vocabulary from 32000 to 6000. But the new BPE codes file causes a severe BLEU degradation, roughly 35 -> 19. Are there any parameters I need to change for a larger dataset?

    opened by SefaZeng 3
  • Scripts for MUV-search

    Hello, I am trying to use the MUV-search method mentioned in your paper to find an optimal vocabulary size. As shown in Table 4, the optimal vocabulary size is 9.7k. However, when trying to reproduce the results, I find that the MUV values decrease monotonically with the vocab size, so the maximum MUV value always belongs to the smallest vocab size.

    Below are the vocab size, entropy value, and MUV values calculated on the WMT14 En-De dataset:

    | BPE vocab size | Entropy | MUV |
    |---|---|---|
    | 4117 | 5.62 | - |
    | 5043 | 4.32 | 1.4038877e-3 |
    | 5976 | 3.62 | 7.5026795e-4 |
    | 6885 | 3.20 | 4.6204620e-4 |
    | 7783 | 2.94 | 2.8953229e-4 |
    | 8645 | 2.76 | 2.0881671e-4 |
    | 9504 | 2.64 | 1.3969732e-4 |
    | 10348 | 2.54 | 1.1848341e-4 |
    | 11173 | 2.46 | 9.6969697e-5 |
    | 11977 | 2.39 | 8.7064677e-5 |
    | 19909 | 2.081 | 3.8956127e-5 |

    Can you please provide the source code for MUV search, or just point out where the error might be?

    opened by Saltychtao 2
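
For reference, the MUV values in the table above are consistent with a simple finite difference, i.e. the entropy drop divided by the vocabulary-size increase between consecutive rows. The snippet below only re-derives the reported table entries under that assumption; it is not the repository's MUV-search code:

    # Re-derive the first MUV entries of the table above, assuming
    # MUV = (entropy decrease) / (vocab-size increase) between rows.
    sizes = [4117, 5043, 5976, 6885, 7783]
    entropies = [5.62, 4.32, 3.62, 3.20, 2.94]

    for (v0, h0), (v1, h1) in zip(zip(sizes, entropies), zip(sizes[1:], entropies[1:])):
        muv = (h0 - h1) / (v1 - v0)
        print(f"vocab {v1}: MUV = {muv:.7e}")
    # vocab 5043: MUV = 1.4038877e-03  (matches the table)
    # vocab 5976: MUV = 7.5026795e-04
    # vocab 6885: MUV = 4.6204620e-04
    # vocab 7783: MUV = 2.8953229e-04
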
  • Question about punctuation

    Hi, I first generated a vocabulary with sentencepiece. It contains both single punctuation marks (e.g. 。 / , / ! / ?) and runs of consecutive punctuation (e.g. ??? / 。。。 / !!!). In the new vocabulary generated by VOLT, I noticed that the single punctuation marks are gone and only the consecutive ones remain. Moreover, the punctuation produced by VOLT is neither Chinese nor English punctuation.

    How do you handle punctuation in this case?

    opened by andongBlue 1
  • Missing `ted/onemanydata` data?

    The run_ted_bilingual_onetomany.sh script uses data from ted/onemanydata, but I cannot find it in the link you provided at the end of README.md: https://drive.google.com/drive/folders/1FNH7cXFYWWnUdH2LyUFFRYmaWYJJveKy

    Also, I'm curious to know what the purpose of the ted/onemanydata experiment is.

    opened by tiendung 1
  • > When testing zh-en, did you also cat the zh and en training files like this?

    > When testing zh-en, did you also cat the zh and en training files like this? cat source_file > training_data; cat target_file >> training_data

    Yes, but I did it separately: my model uses one vocabulary for Chinese and one for English, so each time I only cat the training corpus of a single language.

    Originally posted by @MarsPain in https://github.com/Jingjing-NLP/VOLT/issues/9#issuecomment-889580205

    opened by hnlp1993 0
  • Paper issue about the relaxed constraints

    Hello there,

    In Section 4.3 of the paper:

    "It is inevitable to handle illegal transport case due to relaxed constraints. We remove tokens with distributed chars less than 0.001 token frequencies."

    What does "We remove tokens with distributed chars less than 0.001 token frequencies." mean?

    opened by hscspring 0
  • RuntimeWarning: overflow encountered in true_divide

    Hi,

    When I run ~/VOLT/ot_run.py, I got the following warning:

    ~/VOLT/POT/ot/bregman.py:368: RuntimeWarning: overflow encountered in true_divide
      v = np.divide(b, KtransposeU)

    Does this affect the final result?

    opened by andmek 0
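
If the overflow turns out to matter, upstream POT offers log-domain stabilized Sinkhorn solvers that are much less prone to it; whether the POT copy bundled in this repository accepts the same method names is an assumption. An illustrative call:

    # Illustrative only: a numerically stabilized Sinkhorn variant from
    # upstream POT ('sinkhorn_stabilized' works in the log domain).
    import numpy as np
    import ot

    a = np.array([0.5, 0.3, 0.2])      # hypothetical marginals
    b = np.array([0.4, 0.4, 0.2])
    d_matrix = np.random.rand(3, 3)    # hypothetical cost matrix

    Gs = ot.sinkhorn(a, b, d_matrix, 1.0, method='sinkhorn_stabilized', numItermax=400)
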
  • Reproducibility Study pt II

    Dear Jingjing,

    We find ourselves asking what vocabulary size we should start from: do we use 1.4k for de-en, or some other size?

    Thank you Kyra & Joëlle

    opened by KyraGolden 0
  • Reproducibility Study

    Dear Jingjing, we are trying to reproduce the experiments from your paper "Vocabulary Learning via Optimal Transport for Neural Machine Translation" for a university seminar at the University of Zurich. We are looking for the TED multilingual dataset, which we could not find in the GitHub repo. We would also be very grateful if you could make these datasets available:

    • TED EN-X data​
    • WMT-14 EN-De

    Many thanks in advance.

    With kind regards,

    Kyra & Joëlle

    opened by KyraGolden 2
  • "../ot_run.py", line 107, in write_vocab

    "../ot_run.py", line 107, in write_vocab left, right = token.split(" ")

    嗨,这里的token.split(" ")似乎应该改成token.split("\t")

    opened by Timaos123 2
  • fixed scripts for sentencepiece

    1. Fixed mistakes in get_tokens.py when reading candidate tokens from vocab files generated by sentencepiece.
    2. When using the sentencepiece tokenizer, the vocabulary vocab_file generated by run_ot.py is not used in later steps and is therefore redundant; run_ot.py was changed to generate only size_file for the sentencepiece tokenizer.
    opened by xinyiz1019 0
  • Error when generating Sentencepiece vocab

    Running ot_run.py with the sentencepiece tokenizer raises a ValueError while trying to unpack the tokenizer split.

    The reason for the error is that the sentencepiece vocab is separated by a tab character. Changing the split from a space to a tab in line 107 of ot_run.py fixes the issue and generates the vocab for sentencepiece.

    But since the subword-nmt vocab is separated by a space, this fix would cause an issue when generating a subword-nmt vocab.

    opened by Aiden-Frost 0
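
Since the subword-nmt vocab is space-separated while the sentencepiece vocab is tab-separated, one illustrative way to support both formats is to split on whichever separator is present. The helper below is hypothetical and is not the repository's fix:

    # Hypothetical helper: parse a vocab line whose two fields may be
    # separated by a space (subword-nmt) or a tab (sentencepiece).
    def parse_vocab_line(line: str):
        sep = "\t" if "\t" in line else " "
        left, right = line.rstrip("\n").split(sep, 1)
        return left, right

    print(parse_vocab_line("hel@@ 42"))       # subword-nmt style   -> ('hel@@', '42')
    print(parse_vocab_line("▁hello\t-3.1"))   # sentencepiece style -> ('▁hello', '-3.1')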