Code for ACL 2022 main conference paper "STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation".

Overview

STEMM: Self-learning with Speech-Text Manifold Mixup for Speech Translation

This is a PyTorch implementation of the ACL 2022 main conference paper STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation.

Training a Model on MuST-C

Let's first take a look at training an En-De model as an example.

Environment Configuration

  1. Clone this repository:
git clone git@github.com:ictnlp/STEMM.git
cd STEMM/
  2. Install the Montreal Forced Aligner (MFA) following the official guidance. Please also download the pretrained models and dictionary for MFA.

  3. Make sure PyTorch is installed, then install fairseq and the other packages as follows:

pip install --editable ./
python3 setup.py install --user
python3 setup.py build_ext --inplace
pip install inflect sentencepiece soundfile textgrid pandas
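
Before moving on, it can help to confirm that the packages named above are importable. The following is a minimal sketch and not part of the repository: the `missing_packages` helper is a hypothetical name, and the list assumes that each package's import name matches its pip name (with `torch` and `fairseq` added for the earlier install steps).

```python
import importlib.util

# Import names assumed to match the pip package names listed above.
REQUIRED = ["torch", "fairseq", "inflect", "sentencepiece",
            "soundfile", "textgrid", "pandas"]


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

An empty result only shows that the packages can be located; it does not check versions or that the fairseq extensions were built.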

Data Preparation

  1. First make a directory to store the dataset:
TGT_LANG=de
MUSTC_ROOT=data/mustc/
mkdir -p $MUSTC_ROOT
  2. Download the MuST-C v1.0 archive MUSTC_v1.0_en-de.tar.gz to the $MUSTC_ROOT path, and uncompress it:
cd $MUSTC_ROOT
tar -xzvf MUSTC_v1.0_en-de.tar.gz
  3. Return to the repository root and run preprocess.sh, which performs forced alignment and organizes the raw data and alignment information into .tsv files:
sh preprocess.sh $TGT_LANG
  4. Finally, the directory $MUSTC_ROOT should look like this:
.
├── en-de
│   ├── config_raw.yaml
│   ├── data
│   ├── dev_raw_seg_plus.tsv
│   ├── docs
│   ├── segment
│   ├── spm_unigram10000_raw.model
│   ├── spm_unigram10000_raw.txt
│   ├── spm_unigram10000_raw.vocab
│   ├── train_raw_seg_plus.tsv
│   ├── tst-COMMON_raw_seg_plus.tsv
│   └── tst-HE_raw_seg_plus.tsv
└── MUSTC_v1.0_en-de.tar.gz
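
The listing above can be checked programmatically after preprocessing. This is a minimal sketch under stated assumptions: `missing_outputs` is a hypothetical helper, the file names are copied from the En-De listing, and it assumes other target languages follow the same `en-<lang>` layout.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED_FILES = [
    "config_raw.yaml",
    "spm_unigram10000_raw.model",
    "spm_unigram10000_raw.txt",
    "spm_unigram10000_raw.vocab",
    "train_raw_seg_plus.tsv",
    "dev_raw_seg_plus.tsv",
    "tst-COMMON_raw_seg_plus.tsv",
    "tst-HE_raw_seg_plus.tsv",
]


def missing_outputs(mustc_root, tgt_lang):
    """List expected preprocessing outputs absent from <mustc_root>/en-<tgt_lang>."""
    lang_dir = Path(mustc_root) / f"en-{tgt_lang}"
    return [name for name in EXPECTED_FILES if not (lang_dir / name).is_file()]


if __name__ == "__main__":
    # Prints an empty list when every expected file is present.
    print(missing_outputs("data/mustc", "de"))
```

The check only looks at file presence; it does not validate the contents of the .tsv files or the sentencepiece model.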

Pretrain the MT Module

[OPTIONAL] Use External MT Corpus

If you want to use an external MT corpus, first pretrain an MT model on that corpus with the following steps:

  1. Apply BPE to the external corpus with the sentencepiece model learned on MuST-C. As mentioned in our paper, we use WMT for En-De, En-Fr, En-Ru, En-Es, and En-Ro, and OPUS100 for En-Pt, En-It, and En-Nl as external corpora. Download them and put them in the data/ext_en${TGT_LANG}/ directory. Run the following command, replacing $input_file with the path of the raw text and $output_file with the output path. Apply BPE to both the source- and target-language texts of every subset (train/valid/test).
python3 data/scripts/apply_spm.py --input-file $input_file --output-file $output_file --model data/mustc/en-${TGT_LANG}/spm_unigram10000_raw.model
  2. Use the fairseq-preprocess command to convert the BPE texts into the fairseq format. Make sure to use the sentencepiece dictionary learned on MuST-C.
spm_dict=data/mustc/en-${TGT_LANG}/spm_unigram10000_raw.txt
fairseq-preprocess --source-lang en --target-lang $TGT_LANG --trainpref data/ext_en${TGT_LANG}/train --validpref data/ext_en${TGT_LANG}/valid --testpref data/ext_en${TGT_LANG}/test --destdir data/ext_en${TGT_LANG}/binary --joined-dictionary --srcdict $spm_dict --tgtdict $spm_dict --workers=20 --nwordssrc 10000 --nwordstgt 10000
  3. Train the model using the following command:
sh pretrain_mt_ext.sh $TGT_LANG
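
Since BPE has to be applied to every split in both languages, the six apply_spm.py invocations can be generated in a loop. The sketch below only builds and prints the commands. The `spm_commands` helper and the `<split>.raw.<lang>` input naming are assumptions made for illustration; the output names are chosen to match the prefixes passed to fairseq-preprocess above.

```python
import shlex

SPLITS = ["train", "valid", "test"]


def spm_commands(tgt_lang, src_lang="en"):
    """Build one apply_spm.py command per (split, language) pair.

    The `<split>.raw.<lang>` input naming is an assumption for illustration;
    adjust it to wherever the raw external corpus actually lives.
    """
    ext_dir = f"data/ext_en{tgt_lang}"
    model = f"data/mustc/en-{tgt_lang}/spm_unigram10000_raw.model"
    commands = []
    for split in SPLITS:
        for lang in (src_lang, tgt_lang):
            commands.append([
                "python3", "data/scripts/apply_spm.py",
                "--input-file", f"{ext_dir}/{split}.raw.{lang}",
                "--output-file", f"{ext_dir}/{split}.{lang}",
                "--model", model,
            ])
    return commands


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in spm_commands("de"):
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or each list can be passed to `subprocess.run(cmd, check=True)` once the input paths are confirmed.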

Pretrain the MT module on MuST-C

  1. Run the following script to pretrain the MT module. The argument --load-pretrained-mt-encoder-decoder-from gives the path of the MT model pretrained on the external corpus in the previous step.
sh pretrain_mt.sh $TGT_LANG
  2. To ensure consistent performance, we have released our checkpoints of the pretrained MT modules. You can download them and use them directly to initialize the MT module in our model for the following experiments.
Direction Link
En-De https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ende_mt.pt
En-Fr https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enfr_mt.pt
En-Es https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enes_mt.pt
En-Ro https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enro_mt.pt
En-Ru https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enru_mt.pt
En-Nl https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ennl_mt.pt
En-It https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enit_mt.pt
En-Pt https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enpt_mt.pt
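
The links in the table above follow a single pattern, so the URL for a given direction can be built from the target language code. The sketch below is an illustration only: the helper names and the default `checkpoints` destination directory are assumptions, and whether the links are still reachable is not guaranteed.

```python
import os
from urllib.request import urlretrieve

# Base path shared by every link in the table above.
BASE_URL = "https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm"
TARGET_LANGS = ["de", "fr", "es", "ro", "ru", "nl", "it", "pt"]


def mt_checkpoint_url(tgt_lang):
    """Return the download URL of the pretrained MT checkpoint for En-<tgt_lang>."""
    if tgt_lang not in TARGET_LANGS:
        raise ValueError(f"no released checkpoint for target language {tgt_lang!r}")
    return f"{BASE_URL}/mustc_en{tgt_lang}_mt.pt"


def download_mt_checkpoint(tgt_lang, dest_dir="checkpoints"):
    """Download the checkpoint into `dest_dir` (an assumed location) and return its path."""
    os.makedirs(dest_dir, exist_ok=True)
    url = mt_checkpoint_url(tgt_lang)
    dest = os.path.join(dest_dir, os.path.basename(url))
    urlretrieve(url, dest)
    return dest


if __name__ == "__main__":
    # Only print the URL here; call download_mt_checkpoint("de") to fetch the file.
    print(mt_checkpoint_url("de"))
```

Check the pretraining and training scripts for the path they expect before choosing where to save the file.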

Training

  1. Download the pretrained wav2vec2.0 model from the official link, and put it in the checkpoints/ directory.
  2. Run the training script:
sh train.sh $TGT_LANG

Evaluate

  1. Run the following script to average the last 10 checkpoints and evaluate on the tst-COMMON set:
sh test.sh mustc_en${TGT_LANG}_stmm_self_learning $TGT_LANG
  2. We have also released our checkpoints, listed below. You can download and evaluate them directly.
Direction Link
En-De https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ende_stmm_self_learning.pt
En-Fr https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enfr_stmm_self_learning.pt
En-Es https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enes_stmm_self_learning.pt
En-Ro https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enro_stmm_self_learning.pt
En-Ru https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enru_stmm_self_learning.pt
En-Nl https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_ennl_stmm_self_learning.pt
En-It https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enit_stmm_self_learning.pt
En-Pt https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/acl2022/stmm/mustc_enpt_stmm_self_learning.pt
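
The evaluation step averages the last 10 checkpoints before decoding. How test.sh does this is not shown here; the toy sketch below only illustrates the usual idea of a parameter-wise mean, with plain Python lists standing in for PyTorch tensors. The `average_checkpoints` function is a hypothetical illustration, not the repository's code.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameters across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats; this stands in for a PyTorch state dict in this toy sketch.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


if __name__ == "__main__":
    last_two = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
    print(average_checkpoints(last_two))  # {'w': [2.0, 3.0]}
```

Averaging the final checkpoints tends to smooth out fluctuations between individual training steps, which is why it is commonly applied before reporting test scores.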

Citation

If this repository is useful for you, please cite as:

@inproceedings{fang-etal-2022-STEMM,
	title = {STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation},
	author = {Fang, Qingkai and Ye, Rong and Li, Lei and Feng, Yang and Wang, Mingxuan},
	booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
	year = {2022},
}

Contact

If you have any questions, feel free to contact me at [email protected].

Comments
  • About fine-tuning

    In Section 2.2 of the paper, you say "We combine those pretrained modules and finetune the whole model for ST". Did you freeze the wav2vec 2.0 model during training? If not, I wonder whether that is because of the mixup training strategy, so as to bridge the modality gap.

    question 
    opened by zhouyan19 2
  • Audio feature extraction during preprocessing?

    I compared your preprocessing procedure (preprocess.sh) with the original fairseq example and found that, as far as I can tell, you removed the audio feature extraction step. So when you train the fairseq ST baseline, are you using raw audio inputs rather than the features used in the original code? Or are you using wav2vec 2.0 to compute audio embeddings?

    question 
    opened by zhouyan19 2
  • Statistical significance of translation results

    In the paper, you say "We use sacreBLEU to compute case-sensitive detokenized BLEU scores and the statistical significance of translation results with paired bootstrap resampling for a fair comparison." Could you release the code used to calculate the statistical significance? Sorry for disturbing again.

    question 
    opened by zhhao1 1
  • External parallel MT data download

    Can you provide the download and processing scripts for the additional machine translation data mentioned in the paper? The current repository only gives the BPE step. The official websites list many data sources, and it is not clear which ones to download. Thank you very much.

    question 
    opened by zhhao1 1
Owner
ICTNLP
Natural Language Processing Group, Institute of Computing Technology, Chinese Academy of Sciences