Lingtrain Alignment Studio is an ML-based app for aligning texts in different languages.

Overview

Lingtrain Alignment Studio

Intro

Lingtrain Alignment Studio is an ML-based app for accurately aligning texts in different languages.

  • Extracts parallel corpora from two texts.
  • Builds a formatted parallel book from them with sentence highlighting.

Models

The automated alignment process relies on sentence embedding models. Embeddings are multidimensional vectors that are used to calculate the distance between sentences. You can also plug in your own model using the interface described in the models directory. The list of supported languages depends on the selected backend model.

  • distiluse-base-multilingual-cased-v2
    • more reliable and fast
    • moderate weights size — 500MB
    • supports 50+ languages
    • full list of supported languages can be found in this paper
  • LaBSE (Language-agnostic BERT Sentence Embedding)
    • can be used for rare languages
    • pretty heavy weights — 1.8GB
    • supports 100+ languages
    • full list of supported languages can be found here
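
To illustrate how embeddings can be used to measure the distance between sentences, here is a minimal sketch in plain Python. The vectors are made-up three-dimensional stand-ins for real model output, and the greedy nearest-neighbour matching in best_matches is a simplification chosen for illustration, not necessarily the alignment logic the app itself uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_matches(src_vectors, tgt_vectors):
    """For each source vector, return the index of the most similar target vector."""
    matches = []
    for src in src_vectors:
        scores = [cosine_similarity(src, tgt) for tgt in tgt_vectors]
        matches.append(max(range(len(scores)), key=scores.__getitem__))
    return matches

# Toy "embeddings" (hypothetical values); real models produce hundreds of dimensions.
source = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
target = [[0.1, 0.9, 0.3], [0.9, 0.2, 0.1]]

print(best_matches(source, target))  # prints [1, 0]
```

Here the first source sentence is closest to the second target sentence and vice versa, which is the kind of signal an aligner can use to pair sentences across two texts.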

Running on local machine

You can run the application on your computer using Docker.

  1. Make sure that Docker is installed by running the docker version command in your console.

  2. Images configured to run locally are available on Docker Hub.

  3. Run the following commands in your console:

    • docker pull lingtrain/aligner:v6
    • docker run -v C:\app\data:/app/data -v C:\app\img:/app/static/img -p 80:80 lingtrain/aligner:v6
    • Use lingtrain/aligner:v6-labse for the LaBSE version (109 languages).
  4. The app will be available in your browser at the localhost address.

  5. If you need to run the container on another port (e.g. localhost:8081):

    • Change the API_URL parameter in config.js
    • Rebuild the docker container
    • Start it with the changed -p parameter (e.g. -p 8081:80)

Running in development mode

Clone this repo to your machine.

Backend

Flask/uwsgi backend REST API service. It's pretty simple and contains all the alignment logic.

cd be
python main.py

Frontend

A single-page application built with Vue, Vuex, and Vuetify. It provides a UI for managing the alignment process through the backend and a tool for translators to edit the documents being processed.

cd fe

Setup

npm install

Compile and run with hot-reloads for development

npm run serve

Feedback

You can create an issue or send me a message on Telegram: @averkij

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license. See LICENSE.

Comments
  • File already Exists

    I run docker pull lingtrain/aligner:v4, upload a text file, and a warning appears (screenshot).

    After this warning nothing happens. It also pops up for every text file.

    opened by puffofsmoke 4
  • Change API_URL usage from absolute URL to root path setting.

    Allows the service to be run without specifying a concrete host name, port, and protocol (http, https) in API_URL.

    URL-related TODO:

    1. Make it possible to set API_URL through environment variables.
    2. Use a prefix for the URL: for example, with API_URL = /studio/ the app would be reachable at http(s)://example.com/studio/. Change @app.route in be/main.py from using an absolute URL to using a prefix.
    opened by lnikonl 3
  • Document how to run on Linux

    Thanks for creating this project. It seems amazing.

    OK. So I read an article by you here: https://habr.com/en/post/590549/.

    In it, you write:

    The app is packed into the docker container. It's a simple technology to deploy your stuff anywhere from the server to your local machine. It's available across all the operating systems.

    If I read between the lines, I guess this means the app could work under Linux as well, but your README.md file only gives this command/path, which is for Windows:

    docker run -v C:\app\data:/app/data -v C:\app\img:/app/static/img -p 80:80 lingtrain/aligner:v4

    So what would the correct command be here for Linux?

    opened by monmima 2
  • FR: Add horizontal alignment of rows in the Marks view of the Load section

    Problem: for long texts with a complex structure, the vertical lists of section marks drift apart, which makes it very hard to check them, match them up, and find errors in the markup of the input texts.

    Proposal: show them as a horizontally aligned table so that marks with the same ordinal number sit opposite each other. That would make it quick to see what has been lost and where.

    Example of how it looks now (screenshot).

    opened by BorisNA 1
  • Exception (500 code) when trying to preview a book

    Previewing a book after the successful alignment generates an exception.

    Steps to reproduce:

    1. Upload provided de and ru files
    2. Make the automatic alignment (should work without manual editing)
    3. Open the book creation section, select "Paragraph structure" from "from" (i.e. German)
    4. Try to preview the book - exception

    Exporting a book works OK though, as does previewing from "to" (i.e. Russian).

    Tested with both 0.6.0 and 0.6.1 aligners

    exception.txt test_de.txt test_ru.txt

    opened by BorisNA 1
  • FR: Add encoding tag to HTML output

    The current HTML book export does not contain an encoding tag. While it usually works OK in modern desktop browsers, the text fails to open correctly in mobile Chrome.

    I propose to add <meta charset="utf-8"> to the <head> section.

    opened by BorisNA 1
Owner
Sergei Averkiev
Software Engineer. Eager to learn languages and machine learning approaches. Live in Moscow.