A sentence aligner for comparable corpora

Machinalis

Last update: Aug 24, 2022

About

Yalign is a tool for extracting parallel sentences from comparable corpora.

Statistical Machine Translation relies on parallel corpora (e.g. Europarl) for training translation models. However, these corpora are limited and take time to create. Yalign is designed to automate this process by finding sentences that are close translation matches in comparable corpora. This opens up avenues for harvesting parallel corpora from sources like translated documents and the web.

Installation

Yalign requires that you install scikit-learn.

After that, you can install Yalign from PyPI via pip:

sudo pip install yalign

Usage

First, download and unpack the English-to-Spanish model.

wget https://raw.githubusercontent.com/machinalis/yalign/develop/data/models/0.1/en-es.tar.gz
tar -xvzf en-es.tar.gz

Now we can use the yalign-align script together with the English-to-Spanish model to align two web pages.

yalign-align en-es http://en.wikipedia.org/wiki/Antiparticle http://es.wikipedia.org/wiki/Antipart%C3%ADcula

Yalign is not limited to any one language pair. By creating your own models you can align any two languages. For more details on how to use Yalign and on its implementation, please read the docs.

The Yalign Team:

Yalign is a Machinalis project. You can view our other open source contributions here.

Andrew Vine

Gonzalo García Berrotarán

Rafael Carrascosa

Elías Andrawos

Laura Alonso Alemany

Comments

Problem with yalign-train

Hello,

I am writing to ask for help. It is very important for me to obtain additional data for my MT systems. Your tool seems great, but it does not work for me.

I installed it following http://yalign.readthedocs.org/en/latest/installation.html#installing-from-pypi.

With the files from the tutorial everything works, but not with other data.

I took a dictionary from the phrase table of my MT system and used the OpenSubtitles 2012 corpora from the OPUS project that you recommend in the tutorial.

yalign-train runs for a while (2-3 minutes) and then the process is Killed. I have no idea why. Is there any way to display what causes the error?

Really hope you can help me out.

opened by kamwolk 5

Set WordPairScore to prefer maximum scoring pairs (rather than random).

Previously, when several candidate translations clashed, WordPairScore took the last value encountered. This did not seem correct.

For example (using the dictionary.csv from the tutorial):

He abstained from any further comments.
Se abstuvo de hacer mas comentarios.

The words 'abstained' and 'any' can both map to 'se', but the score for 'abstained' is 0.0138 while the score for 'any' is 0.0015. The current code returns the smaller value because 'any' appears later in the sentence.

This commit fixes the issue by updating the values to keep the maximum score registered within the sentence.
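
The difference between the two behaviours can be illustrated with a minimal sketch. This is not Yalign's actual WordPairScore code: the function names last_wins and max_wins and the dictionary layout (a mapping from (source word, target word) pairs to scores) are assumptions made purely for illustration. Only the two scores quoted above come from the report.

```python
def last_wins(src_words, tgt_words, scores):
    """Hypothetical 'old' behaviour: a later match overwrites an earlier one."""
    best = {}
    for src in src_words:
        for tgt in tgt_words:
            if (src, tgt) in scores:
                best[tgt] = scores[(src, tgt)]
    return best


def max_wins(src_words, tgt_words, scores):
    """Hypothetical 'fixed' behaviour: keep the highest score per target word."""
    best = {}
    for src in src_words:
        for tgt in tgt_words:
            score = scores.get((src, tgt))
            if score is not None and score > best.get(tgt, float("-inf")):
                best[tgt] = score
    return best


# Toy dictionary holding only the two clashing entries from the example.
scores = {("abstained", "se"): 0.0138, ("any", "se"): 0.0015}
src = "he abstained from any further comments".split()
tgt = "se abstuvo de hacer mas comentarios".split()

print(last_wins(src, tgt, scores))  # {'se': 0.0015}
print(max_wins(src, tgt, scores))   # {'se': 0.0138}
```

Because 'any' comes after 'abstained' in the source sentence, the overwrite-on-match version ends up with the weaker score for 'se', while the maximum-keeping version retains 0.0138 regardless of word order.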

opened by DrDub 3

ModuleNotFoundError

Hi, how are you? I get the following error from yalign: ModuleNotFoundError: No module named 'yalignmodel'. I already have the yalign module installed, so I should be able to import yalignmodel without problems, and I really don't understand what the problem is. Could you help me? I am using Python 2.7.18.

opened by anthobio23 1

Running on Windows 10

Hi there,

Yalign seems to be a great alignment tool for wikitext!

I'm wondering whether it can run on Windows.

I've set up Python 2 in the PATH and installed all the required packages via PowerShell.

What should I do next?

Many thanks for your kind help!

opened by LukeTu 0

Site machinalis.com is dead

wget: unable to resolve host address ‘yalign.machinalis.com’

The site machinalis.com seems to be dead. Is there an alternative location where the models can be downloaded from?

Thanks.

opened by msoutopico 0

ResolutionError on Scripts while running in Python 2.7 (using bash shell)

Hi,

I managed to install the package in a Python 2.7 conda environment. When I run the help or any other command, it gives me this ResolutionError, as below. I'm not sure if I'm missing anything.

Appreciate any help!

Mohammed Ayub

opened by mohammedayub44 0

Please provide a phrase table demo

Hi, I find this alignment tool very useful, and I want to train a model of my own, but I do not have a phrase table. Could you provide a phrase table demo? Many thanks!

opened by keyboardWitch 8

yalign-align problem

Hi, I tried to align English and Spanish plain text files with the provided en-es model, but I get an error after running "yalign-align en-es en.txt es.txt":

/yalign/wordpairscore.py", line 51, in call
AttributeError: 'tuple' object has no attribute 'lower'

I don't know what the problem is. Please help me. Many thanks!

opened by keyboardWitch 1

Owner

Machinalis

GitHub

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.

pySBD: Python Sentence Boundary Disambiguation (SBD) pySBD - python Sentence Boundary Disambiguation (SBD) - is a rule-based sentence boundary detecti

277 Feb 18, 2021

REST API for sentence tokenization and embedding using Multilingual Universal Sentence Encoder.

What is MUSE? MUSE stands for Multilingual Universal Sentence Encoder - multilingual extension (16 languages) of Universal Sentence Encoder (USE). MUS

47 Sep 5, 2022

Automated Phrase Mining from Massive Text Corpora in Python.

28 Apr 15, 2021

Code for Discovering Topics in Long-tailed Corpora with Causal Intervention.

Code for Discovering Topics in Long-tailed Corpora with Causal Intervention ACL2021 Findings Usage 0. Prepare environment Requirements: python==3.6 te

8 Dec 16, 2022

Lingtrain Aligner — ML powered library for the accurate texts alignment.

Lingtrain Aligner ML powered library for the accurate texts alignment in different languages. Purpose Main purpose of this alignment tool is to build

76 Dec 14, 2022

This python module is an easy-to-use port of the text normalization used in the paper "Not low-resource anymore: Aligner ensembling, batch filtering, and new datasets for Bengali-English machine translation". It is intended to be used for normalizing / cleaning Bengali and English text.

normalizer This python module is an easy-to-use port of the text normalization used in the paper "Not low-resource anymore: Aligner ensembling, batch

23 Nov 30, 2022

Sentence Embeddings with BERT & XLNet

Sentence Transformers: Multilingual Sentence Embeddings using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch This framework provides an easy method t

9.1k Jan 2, 2023

Extract Keywords from sentence or Replace keywords in sentences.

FlashText This module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the FlashText algorithm. Install

5.3k Jan 1, 2023

Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)

Bunkai Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts. Quick Start $ pip install bunkai $ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎ

160 Dec 23, 2022

SimCSE: Simple Contrastive Learning of Sentence Embeddings

SimCSE: Simple Contrastive Learning of Sentence Embeddings This repository contains the code and pre-trained models for our paper SimCSE: Simple Contr

2.5k Jan 7, 2023

Language-Agnostic SEntence Representations

LASER Language-Agnostic SEntence Representations LASER is a library to calculate and use multilingual sentence embeddings. NEWS 2019/11/08 CCMatrix is

3.2k Jan 4, 2023

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

GenSen Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning Sandeep Subramanian, Adam Trischler, Yoshua B

309 Oct 19, 2022

InferSent sentence embeddings

InferSent InferSent is a sentence embeddings method that provides semantic representations for English sentences. It is trained on natural language in

2.2k Dec 27, 2022

source code for paper: WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

WhiteningBERT Source code and data for paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach. Preparation git clone https://github.com

49 Dec 17, 2022

Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer

ConSERT Code for our ACL 2021 paper - ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer Requirements torch==1.6.0

478 Dec 25, 2022

Using context-free grammar formalism to parse English sentences to determine their structure to help computer to better understand the meaning of the sentence.

Sentance Parser Executing the Program Make sure Python 3.6+ is installed. Install requirements $ pip install requirements.txt Run the program:

12 Sep 28, 2022

Improving Elasticsearch for semantic search using Sentence Embeddings

Improving Elasticsearch for semantic search using Sentence Embeddings. In this post I will use the pretrained model SimCS

18 Nov 25, 2022