Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts

Javad Pourmostafa

Last update: Jan 7, 2023

DataSelection-NMT

Quick update: The paper was accepted on Dec 6, 2021! I will link the repository to the paper as soon as it is published.

Our Pre-trained models on Hugging Face

Systems	Link	Systems	Link
Top1	Download	Top1	Download
Top2+Top1	Download	Top2	Download
Top3+Top2+...	Download	Top3	Download
Top4+Top3+...	Download	Top4	Download
Top5+Top4+...	Download	Top5	Download
Top6+Top5+...	Download	Top6	Download

How to use

Note: we ported the best checkpoints of our trained models to Hugging Face (HF). Because the models were trained with OpenNMT-py, they cannot be used directly for inference on HF. To work around this, we use CTranslate2, an inference engine for Transformer models.

Follow the steps below to translate your sentences:

1. Install the Python package:

pip install --upgrade pip
pip install ctranslate2

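2. Download models from our HF repository. You can do this manually or with the following Python script:

import requests

url = "Download Link"      # replace with the model's download URL
model_path = "Model Path"  # replace with the local path to save to

r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # stop on an HTTP error instead of saving an error page
with open(model_path, "wb") as f:
    f.write(r.content)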
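Model checkpoints can be large, and the script above holds the whole response in memory before writing it. As a minimal sketch of an alternative, assuming only the Python standard library, the download can be streamed to disk in chunks. The function name `download_file` and the `chunk_size` default are illustrative choices, not part of this repository.

```python
import urllib.request


def download_file(url: str, dest_path: str, chunk_size: int = 1 << 20) -> int:
    """Stream `url` to `dest_path` in chunks and return the bytes written.

    Hypothetical helper for illustration; urlopen raises an exception
    on HTTP error responses, so a failed download does not pass silently.
    """
    written = 0
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # an empty read means the stream is exhausted
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

Call it as `download_file(url, model_path)` with the same two values used in the script above.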

3. Convert the downloaded model:

ct2-opennmt-py-converter --model_path model_path --output_dir output_directory

Then translate the tokenized inputs with the converted model:

Note: the inputs should be tokenized with SentencePiece. You can also use the tokenized versions of the IWSLT test sets.

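import ctranslate2

translator = ctranslate2.Translator("output_directory/")

# Translate a batch of tokenized sentences held in memory.
results = translator.translate_batch([["▁H", "ello", "▁world", "!"]])

# Or translate a whole tokenized file; batch_type is "examples" or "tokens".
input_file = "input_file"    # placeholder: path to the tokenized source file
output_file = "output_file"  # placeholder: path the translations are written to
translator.translate_file(input_file, output_file, batch_type="examples")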

To customize the CTranslate2 functions, read this API document.
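As a rough sketch of how the batch call can be driven from a tokenized file, the helpers below read one space-separated tokenized sentence per line and keep the first hypothesis of each result. This assumes a recent CTranslate2 release in which `translate_batch` returns result objects with a `hypotheses` attribute. The names `read_tokenized` and `best_hypotheses` are hypothetical and do not come from this repository.

```python
from typing import List


def read_tokenized(path: str) -> List[List[str]]:
    """Read a file with one space-separated tokenized sentence per line."""
    with open(path, encoding="utf-8") as f:
        # Keep every line so outputs stay aligned with input line numbers.
        return [line.split() for line in f]


def best_hypotheses(translator, batch: List[List[str]]) -> List[str]:
    """Translate `batch` and return the top hypothesis of each sentence.

    `translator` is expected to be a ctranslate2.Translator (assumption:
    its results expose `.hypotheses`, a list of token lists).
    """
    results = translator.translate_batch(batch)
    return [" ".join(result.hypotheses[0]) for result in results]
```

With the translator created above, `best_hypotheses(translator, read_tokenized(input_file))` would return one tokenized output string per input line, still to be detokenized as described in the next step.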

4. Detokenize the outputs:

Note: detokenize the output with the same SentencePiece model that was used in step 3.

tools/detokenize.perl -no-escape -l fr \
< output_file \
> output_file.detok

5. Remove the @@ tokens:

cat output_file.detok | sed -E 's/(@@)|(@@ )|(@@ ?$)//g' \
> output_file.detok.postprocessed

Use grep to check that the @@ tokens were removed; no output means none remain:

grep @@ output_file.detok.postprocessed
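For readers who prefer to do this step in Python, the following is a minimal sketch of the same cleanup and check. Python's `re` module tries alternatives in order rather than picking the longest match, so the sed pattern is collapsed here to the single pattern `@@ ?`, which removes the marker together with one following space when present. The function names are illustrative, not part of this repository.

```python
import re
from typing import Iterable

# "@@" optionally followed by one space: covers markers between subwords
# as well as a marker left at the end of a line.
BPE_MARKER = re.compile(r"@@ ?")


def remove_bpe_markers(line: str) -> str:
    """Join subword units by deleting the @@ continuation markers."""
    return BPE_MARKER.sub("", line)


def count_lines_with_markers(lines: Iterable[str]) -> int:
    """Count lines that still contain @@ (the equivalent of the grep check)."""
    return sum(1 for line in lines if "@@" in line)
```

Applying `remove_bpe_markers` to each line of the detokenized file and then confirming that `count_lines_with_markers` returns 0 mirrors the sed and grep commands above.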

Authors

Javad Pourmostafa - Email, Website
Dimitar Shterionov - Email, Website
Pieter Spronck - Email, Website

Releases(1.1)

1.1(Oct 25, 2021)

Owner

Javad Pourmostafa
