A Chinese to English Neural Model Translation Project

Zhenbang Feng

Last update: Nov 26, 2022

Related tags

Text Data & NLP nlp natural-language-processing machine-translation pytorch lstm gru seq2seq nmt nmt-model chinese-english-translator machine-translation-models

Overview

ZH-EN NMT Chinese to English Neural Machine Translation

This project is inspired by Stanford's CS224N NMT Project

Dataset used in this project: News Commentary v14

Intro

This project is more of a learning project to make myself familiar with Pytorch, machine translation, and NLP model training.

To investigate how would various setups of the recurrent layer affect the final performance, I compared Training Efficiency and Effectiveness of different types of RNN layer for encoder by changing one feature each time while controlling all other parameters:

RNN types
- GRU
- LSTM
Activation Functions on Output Layer
- Tanh
- ReLU
- LeakyReLU
Number of layers
- single layer
- double layer

Code Files

_/
├─ utils.py # utilities
├─ vocab.py # generate vocab
├─ model_embeddings.py # embedding layer
├─ nmt_model.py # nmt model definition
├─ run.py # training and testing

Good Translation Examples

source: 相反,这意味着合作的基础应当是共同的长期战略利益,而不是共同的价值观。
- target: Instead, it means that cooperation must be anchored not in shared values, but in shared long-term strategic interests.
- translation: On the contrary, that means cooperation should be a common long-term strategic interests, rather than shared values.
source: 但这个问题其实很简单: 谁来承受这些用以降低预算赤字的紧缩措施的冲击。
- target: But the issue is actually simple: Who will bear the brunt of measures to reduce the budget deficit?
- translation: But the question is simple: Who is to bear the impact of austerity measures to reduce budget deficits?
source: 上述合作对打击恐怖主义、贩卖人口和移民可能发挥至关重要的作用。
- target: Such cooperation is essential to combat terrorism, human trafficking, and migration.
- translation: Such cooperation is essential to fighting terrorism, trafficking, and migration.
source: 与此同时, 政治危机妨碍着政府追求艰难的改革。
- target: At the same time, political crisis is impeding the government’s pursuit of difficult reforms.
- translation: Meanwhile, political crises hamper the government’s pursuit of difficult reforms.

Preprocessing

Preprocessing Colab notebook

using jieba to separate Chinese words by spaces

Generate Vocab From Training Data

Input: training data of Chinese and English
Output: a vocab file containing mapping from (sub)words to ids of Chinese and English -- a limited size of vocab is selected using SentencePiece (essentially Byte Pair Encoding of character n-grams) to cover around 99.95% of training data

Model Definition

a Seq2Seq model with attention

This image is from the book DIVE INTO DEEP LEARNING
- Encoder
  - A Recurrent Layer
- Decoder
  - LSTMCell (hidden_size=512)
- Attention
  - Multiplicative Attention

Training And Testing Results

Training Colab notebook

Hyperparameters:
- Embedding Size & Hidden Size: 512
- Dropout Rate: 0.25
- Starting Learning Rate: 5e-4
- Batch Size: 32
- Beam Size for Beam Search: 10
NOTE: The BLEU score calculated here is based on the Test Set, so it could only be used to compare the relative effectiveness of the models using this data

For Experiment

Dataset: the dataset is split into training set(~260000), validation set(~20000), and testing set(~20000) randomly (they are the same for each experiment group)
Max Number of Iterations: 50000
NOTE: I've tried Vanilla-RNN(nn.RNN) in various ways, but the BLEU score turns out to be extremely low for it (absence of residual connections might be the issue)
- I decided to not include it for comparison until the issue is resolved

	Training Time(sec)	BLEU Score on Test Set
A. Bidirectional 1-Layer GRU with Tanh	5158.99	14.26
B. Bidirectional 1-Layer LSTM with Tanh	5150.31	16.20
C. Bidirectional 2-Layer LSTM with Tanh	6197.58	16.38
D. Bidirectional 1-Layer LSTM with ReLU	5275.12	14.01
E. Bidirectional 1-Layer LSTM with LeakyReLU(slope=0.1)	5292.58	14.87

Current Best Version

Bidirectional 2-Layer LSTM with Tanh, 1024 embed_size & hidden_size, trained 11517.19 sec (44000 iterations), BLEU score 17.95

	Traning Time	BLEU Score on Test Set	Training Perplexities	Validation Perplexities
Best Model	11517.19	17.95

Analysis

LSTM tends to have better performance than GRU (it has an extra set of parameters)
Tanh tends to be better since less information is lost
Making the LSTM deeper (more layers) could improve the performance, but it cost more time to train
Surprisingly, the training time for A, B, and D are roughly the same
- the issue may be the dataset is not large enough, or the cloud service I used to train models does not perform consistently

Bad Examples & Case Analysis

source: 全球目击组织(Global Witness)的报告记录, 光是2015年就有16个国家的185人被杀。
- target: A Global Witness report documented 185 killings across 16 countries in 2015 alone.
- translation: According to the Global eye, the World Health Organization reported that 185 people were killed in 2015.
- problems:
  - Information Loss: 16 countries
  - Unknown Proper Noun: Global Witness
source: 大自然给了足以满足每个人需要的东西, 但无法满足每个人的贪婪。
- target: Nature provides enough for everyone’s needs, but not for everyone’s greed.
- translation: Nature provides enough to satisfy everyone.
- problems:
  - Huge Information Loss
source: 我衷心希望全球经济危机和巴拉克·奥巴马当选总统能对新冷战的荒唐理念进行正确的评估。
- target: It is my hope that the global economic crisis and Barack Obama’s presidency will put the farcical idea of a new Cold War into proper perspective.
- translation: I do hope that the global economic crisis and President Barack Obama will be corrected for a new Cold War.
- problems:
  - Action Sender And Receiver Exchanged
  - Failed To Translate Complex Sentence
source: 人们纷纷猜测欧元区将崩溃。
- target: Speculation about a possible breakup was widespread.
- translation: The eurozone would collapse.
- problems:
  - Significant Information Loss

Means to Improve the NMT model

Dataset
- The dataset is fairly small, and our model is not being trained thorough all data
- Being a native Chinese speaker, I could not understand what some of the source sentences are saying
- The target sentences are not informational comprehensive; they themselves need context to be understood (e.g. the target sentence in the last "Bad Examples")
- Even for human, some of the source sentence was too hard to translate
Model Architecture
- CNN & Transformer
- character based model
- Make the model even larger & deeper (... I need GPUs)
Tricks that might help
- Add a proper noun dictionary to translate unknown proper nouns word-by-word (phrase-by-phrase)
- Initialize (sub)word embedding with pretrained embedding

How To Run

Download the dataset you desire, and change all "./zh_en_data" in run.sh to the path where your data is stored
To run locally on a CPU (mostly for sanity check, CPU is not able to train the model)
- set up the environment using conda/miniconda conda env create --file local env.yml
To run on a GPU
- set up the environment and running process following the Colab notebook

Contact

If you have any questions or you have trouble running the code, feel free to contact me via email

You might also like...

A Multi-modal Model Chinese Spell Checker Released on ACL2021.

ReaLiSe ReaLiSe is a multi-modal Chinese spell checking model. This the office code for the paper Read, Listen, and See: Leveraging Multimodal Informa

106 Dec 29, 2022

Chinese NER with albert/electra or other bert descendable model (keras)

Chinese NLP (albert/electra with Keras) Named Entity Recognization Project Structure ./ ├── NER │ ├── __init__.py │ ├── log

2 Nov 20, 2022

Chinese NewsTitle Generation Project by GPT2.带有超级详细注释的中文GPT2新闻标题生成项目。

GPT2-NewsTitle 带有超详细注释的GPT2新闻标题生成项目 UpDate 01.02.2021 从网上收集数据，将清华新闻数据、搜狗新闻数据等新闻数据集，以及开源的一些摘要数据进行整理清洗，构建一个较完善的中文摘要数据集。数据集清洗时，仅进行了简单地规则清洗。

785 Dec 29, 2022

T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets

T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets (product titles, images, comments, etc.).

55 Nov 22, 2022

In this repository, I have developed an end to end Automatic speech recognition project. I have developed the neural network model for automatic speech recognition with PyTorch and used MLflow to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.

End to End Automatic Speech Recognition In this repository, I have developed an end to end Automatic speech recognition project. I have developed the

22 Nov 13, 2022

A python wrapper around the ZPar parser for English.

NOTE This project is no longer under active development since there are now really nice pure Python parsers such as Stanza and Spacy. The repository w

49 Sep 12, 2022

Coreference resolution for English, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, msg systems ag 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 German 1.2.3 Polish 1

169 Dec 21, 2022

Using context-free grammar formalism to parse English sentences to determine their structure to help computer to better understand the meaning of the sentence.

Sentance Parser Executing the Program Make sure Python 3.6+ is installed. Install requirements $ pip install requirements.txt Run the program:

12 Sep 28, 2022

मराठी भाषा वाचविण्याचा एक प्रयास. इंग्रजी ते मराठीचा शब्दकोश. An attempt to preserve the Marathi language. A lightweight and ad free English to Marathi thesaurus.

For English, scroll down मराठी शब्द मराठी भाषा वाचवण्यासाठी मी हा ओपन सोर्स प्रोजेक्ट सुरू केला आहे. माझ्या मते, आपली भाषा हळूहळू आणि कोणाचाही लक्षात

20 Oct 11, 2022