Lingtrain Aligner: an ML-powered library for accurate text alignment.

Sergei Averkiev

Last update: Dec 14, 2022


Overview

Lingtrain Aligner

An ML-powered library for accurate alignment of texts in different languages.

Purpose

The main purpose of this alignment tool is to build parallel corpora from two or more raw texts in different languages. The texts should contain the same information (i.e., one text should be a translation of the other). For example, they can be Drei Kameraden by Remarque in German and Three Comrades, its English translation.

Process

There are plenty of obstacles during the alignment process:

- The translator could translate several sentences as one.
- The translator could translate one sentence as many.
- There are some service marks in the text:
  - Page numbers
  - Chapters and other section headings
  - Author and title information
  - Notes

Service marks can be handled manually (the tool helps to detect them), but translation conflicts, where sentences were merged or split, need more careful handling.
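As a rough illustration of what detecting service marks can look like, the sketch below flags lines that look like page numbers or chapter headings. The patterns and the `flag_service_lines` helper are hypothetical and are not taken from Lingtrain Aligner itself; real texts will likely need language-specific rules.

```python
import re

# Hypothetical heuristics for illustration only, not the library's own rules.
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")
HEADING = re.compile(r"^\s*(chapter|kapitel|part)\s+[\divxlc]+\.?\s*$", re.IGNORECASE)


def flag_service_lines(lines):
    """Return (index, kind) pairs for lines that look like service marks."""
    flagged = []
    for i, line in enumerate(lines):
        if PAGE_NUMBER.match(line):
            flagged.append((i, "page_number"))
        elif HEADING.match(line):
            flagged.append((i, "heading"))
    return flagged


sample = [
    "Chapter 1",
    "It was a quiet morning.",
    "12",
    "Kapitel II",
]
flags = flag_service_lines(sample)
print(flags)
```

Flagged lines can then be reviewed by hand before alignment, which matches the manual handling described above.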

Lingtrain Aligner does almost all of the alignment work for you. It matches sentence pairs automatically using multilingual machine learning models, then finds alignment conflicts and resolves them. The output is a parallel corpus, either as two separate plain-text files or as a single merged corpus in the widely used TMX format.
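To give a feel for the merged output, here is a minimal sketch that writes already aligned sentence pairs into a TMX-style document using only the Python standard library. The `build_tmx` function and the header values are assumptions made for this example; the library's own exporter may structure the file differently.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, lang_from, lang_to):
    """Build a small TMX document from (source, target) sentence pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",  # hypothetical tool name
        "creationtoolversion": "0.0",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": lang_from,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, dst in pairs:
        # One translation unit per aligned pair, one variant per language.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((lang_from, src), (lang_to, dst)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


pairs = [("Guten Morgen.", "Good morning."), ("Wie geht es dir?", "How are you?")]
tmx_text = build_tmx(pairs, "de", "en")
print(tmx_text)
```

Each `tu` element holds one aligned pair, so a translation memory tool can read the corpus without knowing how the alignment was produced.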

Supported languages and models

The automated alignment process relies on sentence embedding models. Embeddings are multidimensional vectors that can be used to calculate the distance between sentences. The list of supported languages depends on the selected backend model.
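The idea of comparing sentences through their embeddings can be sketched with plain Python. The vectors below are toy 3-dimensional stand-ins for real model output, and `best_matches` is a hypothetical helper that simply picks the most similar target for each source sentence; the actual aligner also has to resolve conflicts, which this sketch does not attempt.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_matches(src_vectors, dst_vectors):
    """For each source vector, return the index of the most similar target vector."""
    return [
        max(range(len(dst_vectors)), key=lambda j: cosine_similarity(src, dst_vectors[j]))
        for src in src_vectors
    ]


# Toy "embeddings": the two target sentences appear in swapped order.
src = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.2]]
dst = [[0.1, 0.9, 0.3], [0.9, 0.1, 0.0]]
matches = best_matches(src, dst)
print(matches)
```

Cosine distance is one minus this similarity, so the closest sentence is the one with the highest similarity score.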

distiluse-base-multilingual-cased-v2
- more reliable and fast
- moderate weights size — 500MB
- supports 50+ languages
- full list of supported languages can be found in this paper
LaBSE (Language-agnostic BERT Sentence Embedding)
- can be used for rare languages
- pretty heavy weights — 1.8GB
- supports 100+ languages
- full list of supported languages can be found here

Profit

A parallel corpus can itself be used as a resource for machine translation models or for linguistic research.
My personal goal for this project is to help people build parallel translated books for foreign language learning.

Comments

File Already Exists

I run docker pull lingtrain/aligner:v4, upload a text file and...

After this warning nothing happens. It also pops up for any text file.

opened by puffofsmoke 1
Fix XML creation:
- prevent parent tag duplication for (langs, author, title)
- add tags for tmx export
- use 'direction' for splitting paragraphs
- do not use bs4 (generates incorrect xml), change to lxml
opened by BorisNA 0
An error when I use "splitter.split_by_sentences_wrapper", please help check the error

When I use "splitted_from = splitter.split_by_sentences_wrapper(text1_prepared, lang_from)", it returns a list.

But I see that there is a conflict when inserting into sqlite. The specific error:

File "ling_test.py", line 36, in <module>
    aligner.fill_db(db_path, splitted_from, splitted_to)
File "lingtrain_aligner/aligner.py", line 498, in fill_db
    db.executemany("insert into languages(key, val) values(?,?)", [("from", lang_from), ("to", lang_to)])
sqlite3.InterfaceError: Error binding parameter 1 - probably unsupported type.
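One plausible reading of this traceback, offered as an assumption rather than a confirmed diagnosis, is that a non-string value (such as a list of sentences) ended up where a language code was expected. The standalone sketch below reproduces that kind of binding failure with the standard `sqlite3` module; `store_languages` is a hypothetical helper that mirrors the insert statement from the traceback and does not call lingtrain_aligner.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table languages(key text, val text)")


def store_languages(db, lang_from, lang_to):
    """Hypothetical helper using the same insert statement as in the traceback."""
    db.executemany(
        "insert into languages(key, val) values(?,?)",
        [("from", lang_from), ("to", lang_to)],
    )


# A list cannot be bound to a "?" placeholder, so this call fails.
try:
    store_languages(db, ["Satz eins.", "Satz zwei."], ["Sentence one."])
    failed = False
except sqlite3.Error as exc:
    # The exact exception class differs between Python versions.
    failed = True
    print(type(exc).__name__, exc)

db.rollback()

# Plain language codes bind without problems.
store_languages(db, "de", "en")
rows = db.execute("select key, val from languages order by key").fetchall()
print(rows)
```

If this reading is right, checking the arguments passed to `fill_db` against the signature of the installed version would be the first thing to try.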

opened by Amen-bang 5
Add text splitting into small parts
The current version ignores the H1-H5 headers added by the user. But when a book is translated, the text of chapter 1 is still the text of chapter 1 in the other language. You can use this fact to split a big text into small parts.

The next idea is to try to split a big text into small blocks automatically: select a few sentences from the original text (for example, 10 sentences) and, in a loop, try to find the translated block in the translated text.

You can use the following pseudocode:

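    left_array = original_sentences[100:110]
    scores = {}
    for i in range(50, 150):
        right_array_candidate = translated_sentences[i:i + 10]
        # higher similarity means a better match, so the maximum is taken below
        scores[i] = sum(cosine_similarity(left_array, right_array_candidate))
    right_start = get_index_with_max_value(scores)
    right_array = translated_sentences[right_start:right_start + 10]
    left_text_split_index = left_array[0]
    right_text_split_index = right_array[0]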
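The proposal can be tried out on toy data without any model. The sketch below is one possible reading of the idea, not code from the project: `find_matching_block` and its parameters are hypothetical names, the "embeddings" are made-up 2-dimensional vectors, and similarity (rather than distance) is summed so that the largest score marks the best match.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def block_score(left, right):
    """Sum of similarities between sentences at the same offset in both blocks."""
    return sum(cosine_similarity(l, r) for l, r in zip(left, right))


def find_matching_block(original, translated, left_start, size, search_from, search_to):
    """Slide a window over `translated` and return the start of the best-scoring block."""
    left = original[left_start:left_start + size]
    best_start, best_score = None, float("-inf")
    # Only consider windows that fit completely inside the translated text.
    last = min(search_to, len(translated) - size + 1)
    for i in range(search_from, last):
        score = block_score(left, translated[i:i + size])
        if score > best_score:
            best_start, best_score = i, score
    return best_start


# Toy data: the "translation" is the original shifted by three extra sentences.
original = [[math.cos(k * 0.7), math.sin(k * 0.7)] for k in range(8)]
translated = [[0.0, 1.0]] * 3 + original

right_start = find_matching_block(original, translated, left_start=2, size=3,
                                  search_from=0, search_to=8)
print(right_start)
```

The pair (`left_start`, `right_start`) then gives a split point in each text, so the two halves can be aligned separately.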
opened by AigizK 0

Releases(0.1.0)

0.1.0(Apr 21, 2021)

The initial release. Already works. Does not have requirements yet.



