Improving Machine Translation Systems via Isotopic Replacement

Zeyu Sun

Last update: Nov 30, 2022

Overview

CAT (Improving Machine Translation Systems via Isotopic Replacement)

Machine translation plays an essential role in people’s daily international communication, yet machine translation systems are far from perfect. To expose their errors, researchers have proposed several approaches to testing machine translation. A promising trend among these approaches is word replacement, where exactly one word in the original sentence is replaced with another word to form a sentence pair. However, precisely controlling the impact of a word replacement remains an open issue in these approaches.

To address this issue, we propose CAT, a novel word-replacement-based approach whose basic idea is to identify word replacements with controlled impact (referred to as isotopic replacements). To this end, we use a neural language model to encode the sentence context, and design a neural-network-based algorithm to evaluate the context-aware semantic similarity between two words. Furthermore, like TransRepair, a state-of-the-art word-replacement-based approach, CAT also provides automatic fixing of revealed bugs without model retraining.
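The idea of keeping only replacements with controlled impact can be illustrated with a minimal sketch. This is not the code in this repository: the encoder interface, the cosine-similarity score, the 0.8 threshold, and the toy vectors are all assumptions made for illustration. In practice the encoder would be a neural language model that returns a context-dependent vector for the word at a given position.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical encoder interface: (tokens, position) -> vector for the token
# at that position. A real encoder would take the whole sentence into account.
Encoder = Callable[[List[str], int], Vector]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def isotopic_replacements(
    tokens: List[str],
    position: int,
    candidates: List[str],
    encode: Encoder,
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> List[Tuple[str, float]]:
    """Replace one word and keep the mutants whose replaced word stays
    close to the original word, as judged by the encoder."""
    original_vec = encode(tokens, position)
    kept = []
    for word in candidates:
        if word == tokens[position]:
            continue  # an unchanged sentence is not a useful pair
        mutant = tokens[:position] + [word] + tokens[position + 1:]
        score = cosine(original_vec, encode(mutant, position))
        if score >= threshold:
            kept.append((" ".join(mutant), score))
    # Most similar replacements first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy stand-in for a language model: fixed vectors that ignore the context.
TOY_VECTORS = {"good": (1.0, 0.1), "great": (0.9, 0.2), "bad": (-1.0, 0.1)}


def toy_encode(tokens: List[str], position: int) -> Vector:
    return TOY_VECTORS.get(tokens[position], (0.0, 0.0))


if __name__ == "__main__":
    sentence = "the movie was good".split()
    for mutant, score in isotopic_replacements(
        sentence, 3, ["great", "bad", "good"], toy_encode
    ):
        print(f"{score:.3f}  {mutant}")
```

With the toy vectors, only the replacement "great" passes the threshold, while "bad" is rejected because its vector points away from that of "good". Each kept mutant forms a sentence pair with the original sentence.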

Our evaluation on Google Translate and Transformer indicates that CAT achieves significant improvements over TransRepair. In particular, 1) CAT detects seven more types of bugs than TransRepair; 2) CAT detects 129% more translation bugs than TransRepair; 3) CAT repairs twice more bugs than TransRepair, many of which may bring serious consequences if left unfixed; and 4) CAT is more efficient than TransRepair in input generation (0.01s vs. 0.41s) and comparably efficient in bug repair (1.92s vs. 1.34s).

The main file tree of CAT

.
├── Labeled data
│   ├── RQ1 Test Input Generation
│   ├── RQ2 Bug Detection
│   ├── RQ3 Bug Repair
│   └── Extended Analysis
├── TS
├── MutantGen-Test.py
├── MutantGen-Repair.py
├── Repair.py
├── Testing.py
├── NewThres
│   ├── TestGenerator-NMT
│   └── TestGenerator-NMTRep
└── NMT_zh_en0-8Mu
    ├── padTrans
    └── repair-new

The manual assessment results are in the Labeled data folder.

For Testing:

python3 Testing.py

Afterwards, the results are in the NMT_zh_en0-8Mu/padTrans folder.

For Repair:

python3 Repair.py

Afterwards, the results are in the TS/quickstart0/repair-NEW folder.

Data

The LookUpTable.txt used in NMT_zh_en_0-8Mu/padTrans and NMT_zh_en_0-8Mu/repair-new is available at https://drive.google.com/file/d/1fjGpryzGohla0ZA4u7KDgRJeAHegy0A1/view?usp=sharing

Dependencies

NLTK 3.2.1
PyTorch 1.6.1
Python 3.7
Ubuntu 16.04
Transformers 3.3.0
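
Before running Testing.py or Repair.py, it may help to compare the local environment with the versions listed above. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, the import names nltk, torch, and transformers are assumed to correspond to the listed packages, and versions are compared by prefix so that local build suffixes are tolerated.

```python
import importlib
import sys

# Versions copied from the dependency list above.
EXPECTED = {"nltk": "3.2.1", "torch": "1.6.1", "transformers": "3.3.0"}


def installed_version(name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def check_versions(expected, get_version):
    """Map each mismatching package to (wanted, found); found is None if missing.

    A prefix match is used, so a build such as "1.6.1+cu101" counts as "1.6.1".
    """
    mismatches = {}
    for name, wanted in expected.items():
        found = get_version(name)
        if found is None or not found.startswith(wanted):
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "(the list above names 3.7)")
    for name, (wanted, found) in check_versions(EXPECTED, installed_version).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

The script prints one line per package whose version differs from the list, and nothing beyond the Python version if everything matches.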
