A sentence aligner for comparable corpora

Overview

About

Yalign is a tool for extracting parallel sentences from comparable corpora.

Statistical Machine Translation relies on parallel corpora (e.g. Europarl) for training translation models. However, these corpora are limited and take time to create. Yalign is designed to automate this process by finding sentences that are close translation matches from comparable corpora. This opens up avenues for harvesting parallel corpora from sources like translated documents and the web.

Installation

Yalign requires that you install scikit-learn.

After that, you can install Yalign from PyPI via pip:

sudo pip install yalign

Usage

First, download and unpack the English-to-Spanish model.

wget https://raw.githubusercontent.com/machinalis/yalign/develop/data/models/0.1/en-es.tar.gz
tar -xvzf en-es.tar.gz
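
Where wget or tar are not available, the same download-and-unpack step can be sketched in Python with only the standard library. This is a minimal sketch, not part of Yalign: the helper names `download_model` and `unpack_model` are hypothetical, and the URL is simply the one used above.

```python
import tarfile
import urllib.request
from pathlib import Path

# Same URL as in the wget command above.
MODEL_URL = (
    "https://raw.githubusercontent.com/machinalis/yalign/"
    "develop/data/models/0.1/en-es.tar.gz"
)


def download_model(url=MODEL_URL, archive="en-es.tar.gz"):
    """Download the model archive to a local file (hypothetical helper)."""
    urllib.request.urlretrieve(url, archive)
    return Path(archive)


def unpack_model(archive, dest="."):
    """Extract a .tar.gz archive into dest and return its member names."""
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names
```

Calling `unpack_model(download_model())` would mirror the two shell commands, assuming the URL is still reachable.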

Now we can use the yalign-align script along with the English-to-Spanish model to align two web pages.

yalign-align en-es http://en.wikipedia.org/wiki/Antiparticle http://es.wikipedia.org/wiki/Antipart%C3%ADcula
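
The command above can also be driven from a Python script. The sketch below only wraps the command line shown here; it assumes that yalign-align is on the PATH and writes its result to standard output, and the helper names (`wikipedia_url`, `build_align_command`, `align`) are hypothetical. It also shows where the `%C3%AD` in the Spanish URL comes from: it is the percent-encoded form of "í".

```python
import subprocess
from urllib.parse import quote


def wikipedia_url(lang, title):
    """Build a Wikipedia article URL, percent-encoding non-ASCII characters."""
    return "http://{}.wikipedia.org/wiki/{}".format(lang, quote(title))


def build_align_command(model_dir, source, target):
    """Return the argument list for the yalign-align call shown above."""
    return ["yalign-align", model_dir, source, target]


def align(model_dir, source, target):
    """Run yalign-align and return whatever it prints (assumed: stdout)."""
    result = subprocess.run(
        build_align_command(model_dir, source, target),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For the example above, `align("en-es", wikipedia_url("en", "Antiparticle"), wikipedia_url("es", "Antipartícula"))` would issue the same command.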

Yalign is not limited to any one language pair. By creating your own models you can align any two languages. For more details on how to use Yalign and on its implementation, please read the docs.

The Yalign Team:

Yalign is a Machinalis project. You can view our other open source contributions here.

Andrew Vine
Gonzalo García Berrotarán
Rafael Carrascosa
Elías Andrawos
Laura Alonso Alemany
Comments
  • Problem with yalign-train

    Hello,

    I am writing to ask for help. It is very important for me to gain some additional data for my MT systems. Your tool seems to be great, but it does not work for me.

    I installed it following http://yalign.readthedocs.org/en/latest/installation.html#installing-from-pypi.

    Everything works with the files from the tutorial, but not with other files.

    I took the dictionary from the phrase table of my MT system and used the OpenSubtitles 2012 corpora from the OPUS project that you recommended in the tutorial.

    yalign-train runs for some time (2-3 minutes) and then it is Killed. I have no idea why. Is there any way to display what causes the error?

    Really hope you can help me out.

    opened by kamwolk 5
  • Set WordPairScore to prefer maximum scoring pairs (rather than random).

    Previously, WordPairScore took the last value of the potential translation in the event of clashes. This didn't seem correct.

    For example (using the dictionary.csv from the tutorial):

    He abstained from any further comments. Se abstuvo de hacer mas comentarios.

    The words 'abstained' and 'any' can both map to 'se', but the score for 'abstained' is 0.0138 while the score for 'any' is 0.0015. The current code returns the smaller value because 'any' appears later in the sentence.

    This commit fixes this issue, by updating the values to keep the maximum score registered within the sentence.

    opened by DrDub 3
  • ModuleNotFoundError

    Hi, how are you? I get the following error from yalign: ModuleNotFoundError: No module named 'yalignmodel'. I already have the yalign module installed, so I should be able to import yalignmodel without problems, and I really don't understand what the problem is. Could you help me? I am using Python 2.7.18.

    opened by anthobio23 1
  • Running on Windows 10

    Hi there,

    Yalign seems to be a great alignment tool for wikitext!

    I'm wondering whether it can run on Windows.

    I've set up python2 in the path and installed all the required packages via PowerShell.

    What should I do next?

    Many thanks for your kind help!

    opened by LukeTu 0
  • Site machinalis.com is dead

    wget: unable to resolve host address ‘yalign.machinalis.com’

    The site machinalis.com seems to be dead. Is there an alternative location where the models can be downloaded from?

    Thanks.

    opened by msoutopico 0
  • ResolutionError on Scripts while running in Python 2.7 (using bash shell)

    Hi,

    I managed to install the package in a Python 2.7 conda environment. When I run the help or any other command, it gives me this ResourceError as below. I'm not sure if I'm missing anything.

    image

    Appreciate any help !

    Mohammed Ayub

    opened by mohammedayub44 0
  • please provide a phrase table demo

    Hi, I found this alignment tool very useful and I want to train a model of my own, but I do not have a phrase table. Could you provide a phrase table demo? Many thanks!

    opened by keyboardWitch 8
  • yalign-align problem

    Hi, I tried to align English and Spanish plain text files with the provided en-es model, but there is a problem after running "yalign-align en-es en.txt es.txt":

    /yalign/wordpairscore.py", line 51, in __call__
    AttributeError: 'tuple' object has no attribute 'lower'

    I don't know what the problem is. Please help me, many thanks!

    opened by keyboardWitch 1
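
The comment above titled "Set WordPairScore to prefer maximum scoring pairs" contrasts two ways of resolving clashes when several source words map to the same target word. The toy sketch below illustrates the difference with the scores quoted there. It is not Yalign's actual code: the function names and the (source, target, score) tuple layout are assumptions made for illustration.

```python
def last_wins(pairs):
    """Clash handling as described for the old code: a later pair overwrites."""
    best = {}
    for _source, target, score in pairs:
        best[target] = score
    return best


def max_wins(pairs):
    """Clash handling as proposed: keep the highest score seen per target."""
    best = {}
    for _source, target, score in pairs:
        if score > best.get(target, float("-inf")):
            best[target] = score
    return best


# Scores quoted in the comment: both English words can map to 'se'.
pairs = [("abstained", "se", 0.0138), ("any", "se", 0.0015)]
```

With these inputs, `last_wins(pairs)` keeps the score of 'any' because it comes later in the sentence, while `max_wins(pairs)` keeps the score of 'abstained'.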
Owner
Machinalis