Lingtrain Alignment Studio is an ML-based app for aligning texts in different languages.

Sergei Averkiev

Last update: Jan 3, 2023

Related tags

Machine Learning a-studio

Overview

Lingtrain Alignment Studio

Intro

Lingtrain Alignment Studio is an ML-based app for accurate alignment of texts in different languages.

Extracts a parallel corpus from two texts.
Builds a formatted parallel book from it with sentence highlighting.

Models

The automated alignment process relies on sentence embedding models. Embeddings are multidimensional vectors that are used to calculate the distance between sentences. You can also plug in your own model using the interface described in the models directory. The list of supported languages depends on the selected backend model.

distiluse-base-multilingual-cased-v2
- more reliable and fast
- moderate weight size (500 MB)
- supports 50+ languages
- full list of supported languages can be found in this paper
LaBSE (Language-agnostic BERT Sentence Embedding)
- can be used for rare languages
- fairly heavy weights (1.8 GB)
- supports 100+ languages
- full list of supported languages can be found here
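
As a rough illustration of how embedding distance can drive alignment, here is a minimal sketch in plain Python. It is not the application's actual code: the vectors are made-up 3-dimensional toy values (real models produce vectors with hundreds of dimensions), cosine similarity is assumed as the measure, and the helper names cosine_similarity and best_matches are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; values near 1.0 mean similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_matches(src_vectors, tgt_vectors):
    """For each source vector, return the index of the most similar target vector."""
    matches = []
    for u in src_vectors:
        scores = [cosine_similarity(u, v) for v in tgt_vectors]
        matches.append(scores.index(max(scores)))
    return matches


# Toy "sentence embeddings", invented for illustration only.
source = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.6]]
target = [[0.1, 0.7, 0.7], [1.0, 0.2, 0.1]]

print(best_matches(source, target))  # prints [1, 0]
```

In this toy data the first source sentence is closest to the second target sentence and vice versa, so the sketch pairs them crosswise. A real aligner would also have to respect sentence order and handle one-to-many matches, which this sketch ignores.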

Running on local machine

You can run the application on your computer using Docker.

Make sure that Docker is installed by running the docker version command in your console.
Images configured to run locally are available on Docker Hub.
Run the following commands in your console:
- docker pull lingtrain/aligner:v6
- docker run -v C:\app\data:/app/data -v C:\app\img:/app/static/img -p 80:80 lingtrain/aligner:v6
- Use lingtrain/aligner:v6-labse for LaBSE version (109 languages).
The app will be available in your browser at the localhost address.
If you need to run the container on another port (e.g. localhost:8081):
- Change the API_URL parameter in config.js
- Rebuild the Docker image
- Start it with the changed -p parameter (e.g. -p 8081:80)
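
The run command above uses Windows host paths. The sketch below builds the same argument list for any host directories, so the Linux form can be seen next to it. It is only an illustration: the function name docker_run_command and the /home/user/... paths are hypothetical, while the container paths, the -v and -p flags, and the image name are taken from the command above.

```python
def docker_run_command(data_dir, img_dir, host_port=80, tag="v6"):
    """Build the `docker run` argument list from the README command.

    data_dir and img_dir are host directories chosen by the user.
    The container-side paths and container port 80 come from the README.
    """
    return [
        "docker", "run",
        "-v", f"{data_dir}:/app/data",
        "-v", f"{img_dir}:/app/static/img",
        "-p", f"{host_port}:80",
        f"lingtrain/aligner:{tag}",
    ]


# Windows host paths, as in the README.
windows_cmd = docker_run_command(r"C:\app\data", r"C:\app\img")

# Linux host paths (hypothetical example directories).
linux_cmd = docker_run_command("/home/user/app/data", "/home/user/app/img")

print(" ".join(windows_cmd))
print(" ".join(linux_cmd))
```

Note that passing a different host_port only changes the -p mapping. As the steps above say, API_URL in config.js also has to be changed and the image rebuilt for the app to work on another port.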

Running in development mode

Clone this repo to your machine.

Backend

Flask/uwsgi backend REST API service. It's pretty simple and contains all the alignment logic.

cd /be
python main.py

Frontend

SPA built with Vue, Vuex, and Vuetify. It provides a UI for managing the alignment process through the backend, and a tool for translators to edit the documents being processed.

cd /fe

Setup

npm install

Compile and run with hot-reloads for development

npm run serve

Feedback

You can create an issue or send me a message on Telegram: @averkij

License

This work is licensed under an Attribution-NonCommercial-NoDerivatives 4.0 International license. See LICENSE.

Comments

File already Exists

I run docker pull lingtrain/aligner:v4, upload a text file, and...

After this warning appears, nothing happens. It pops up for any text file.

opened by puffofsmoke 4
Change API_URL usage from absolute URL to root path setting.
Allows running the service without specifying a concrete host name, port, and protocol (http, https) in API_URL.

URL-related TODO:

Ability to set API_URL through environment variables

Use of a URL prefix: for example, with API_URL = /studio/ the service would be reachable at http(s)://example.com/studio/. Change @app.route in be/main.py from using an absolute URL to using the prefix.
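
One way the two TODO items could fit together is sketched below: API_URL is read from the environment and normalised to a root-relative prefix, which is then joined with endpoint paths. This is an assumption-laden illustration, not code from be/main.py; the helper names api_prefix and route and the items endpoint are hypothetical.

```python
import os


def api_prefix(default="/"):
    """Read API_URL from the environment and normalise it to '/prefix/' form."""
    raw = os.environ.get("API_URL", default).strip()
    # Keep exactly one leading and one trailing slash around the prefix.
    core = raw.strip("/")
    return f"/{core}/" if core else "/"


def route(path, prefix=None):
    """Join the prefix with an endpoint path (the endpoint name is hypothetical)."""
    prefix = api_prefix() if prefix is None else prefix
    return prefix + path.lstrip("/")


os.environ["API_URL"] = "/studio/"
print(route("/items"))  # prints /studio/items
```

With no API_URL set, the prefix falls back to "/", so the same route definitions would keep working at the site root.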
opened by lnikonl 3
Document how to run on Linux

Thanks for creating this project. It seems amazing.

OK. So I read an article by you here: https://habr.com/en/post/590549/.

In it, you write:

The app is packed into the docker container. It's a simple technology to deploy your stuff anywhere from the server to your local machine. It's available across all the operating systems.

If I read between the lines, I guess this means the app could work under Linux as well, but your README.md file only gives this command/path, which is for Windows:

docker run -v C:\app\data:/app/data -v C:\app\img:/app/static/img -p 80:80 lingtrain/aligner:v4

So what would the correct command be here for Linux?

opened by monmima 2
FR: Add horizontal row alignment to the marks view (Marks) in the Load section

Problem: for long, structurally complex texts, the vertical lists of section marks drift apart, and it becomes very hard to check them, match them up, and find errors in the markup of the input texts.

Proposal: present them as a horizontally aligned table, so that marks with the same ordinal number sit opposite each other. Then it will be possible to see quickly where something has been lost.

Example of how it looks now

opened by BorisNA 1
Exception (500 code) when trying to preview a book
Previewing a book after the successful alignment generates an exception.

Steps to reproduce:

Upload provided de and ru files

Make the automatic alignment (should work without manual editing)

Open the book creation section, select "Paragraph structure" from "from" (i.e. German)

Try to preview the book - exception

Exporting a book works fine though, as does previewing from "to" (i.e. Russian)

Tested with both 0.6.0 and 0.6.1 aligners

exception.txt test_de.txt test_ru.txt
opened by BorisNA 1
FR: Add encoding tag to HTML output

The current HTML book export does not contain an encoding tag. While it usually works fine in modern desktop browsers, the text fails to display correctly in mobile Chrome.

I propose to add <meta charset="utf-8"> to the <head> section.
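
A minimal sketch of the proposed change, written as a post-processing step on the exported HTML string. It assumes the export contains a <head> tag; the function name ensure_charset is hypothetical and this is not the exporter's actual code.

```python
import re

META = '<meta charset="utf-8">'


def ensure_charset(html):
    """Insert <meta charset="utf-8"> after <head> unless a charset is already declared."""
    if re.search(r"<meta[^>]+charset", html, flags=re.IGNORECASE):
        return html
    # Match only the first opening <head> tag (not <header>) and append the meta tag.
    return re.sub(
        r"(<head(?:\s[^>]*)?>)",
        lambda m: m.group(1) + META,
        html,
        count=1,
        flags=re.IGNORECASE,
    )


page = "<html><head><title>Book</title></head><body></body></html>"
print(ensure_charset(page))
```

Calling the function on a page that already declares a charset returns it unchanged, so it is safe to apply more than once.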

opened by BorisNA 1

Owner

Sergei Averkiev

Software Engineer. Eager to learn languages and machine learning approaches. Live in Moscow.

GitHub

Interactive Web App with Streamlit and Scikit-learn that applies different Classification algorithms to popular datasets

Interactive Web App with Streamlit and Scikit-learn that applies different Classification algorithms to popular datasets Datasets Used: Iris dataset,

2 Nov 18, 2021

Temporal Alignment Prediction for Supervised Representation Learning and Few-Shot Sequence Classification

Temporal Alignment Prediction for Supervised Representation Learning and Few-Shot Sequence Classification Introduction. This package includes the pyth

5 Dec 6, 2022

Implementation of different ML Algorithms from scratch, written in Python 3.x

393 Nov 29, 2022

Implementations of Machine Learning models, Regularizers, Optimizers and different Cost functions.

Linear Models Implementations of LinearRegression, LassoRegression and RidgeRegression with appropriate Regularizers and Optimizers. Linear Regression

1 Nov 22, 2021

Breast-Cancer-Classification - Using SKLearn breast cancer dataset which contains 569 examples and 32 features classifying has been made with 6 different algorithms

1 Jan 31, 2022

machine learning model deployment project of Iris classification model in a minimal UI using flask web framework and deployed it in Azure cloud using Azure app service

This is a machine learning model deployment project of Iris classification model in a minimal UI using flask web framework and deployed it in Azure cloud using Azure app service. We initially made this project as a requirement for an internship at Indian Servers. We are now making it open to contribution.

73 Dec 1, 2022

Iris species predictor app is used to classify iris species created using python's scikit-learn, fastapi, numpy and joblib packages.

Iris Species Predictor Iris species predictor app is used to classify iris species using their sepal length, sepal width, petal length and petal width

5 Apr 5, 2022

Traingenerator 🧙 A web app to generate template code for machine learning ✨

Traingenerator 🧙 A web app to generate template code for machine learning ✨ Traingenerator is now live!

1.2k Jan 7, 2023

Penguins species predictor app is used to classify penguins species created using python's scikit-learn, fastapi, numpy and joblib packages.

Penguins Classification App Penguins species predictor app is used to classify penguins species using their island, sex, bill length (mm), bill depth

3 Apr 5, 2022

MLflow App Using React, Hooks, RabbitMQ, FastAPI Server, Celery, Microservices

Katana ML Skipper This is a simple and flexible ML workflow engine. It helps to orchestrate events across a set of microservices and create executable

8 Nov 17, 2022

Flask app to predict daily radiation from the time series of Solcast from Islamabad, Pakistan

Solar-radiation-ISB-MLOps - Flask app to predict daily radiation from the time series of Solcast from Islamabad, Pakistan.

1 Dec 31, 2021

A scikit-learn based module for multi-label et. al. classification

scikit-multilearn scikit-multilearn is a Python module capable of performing multi-label learning tasks. It is built on-top of various scientific Pyth

802 Jan 1, 2023

Python-based implementations of algorithms for learning on imbalanced data.

ND DIAL: Imbalanced Algorithms Minimalist Python-based implementations of algorithms for imbalanced learning. Includes deep and representational learn

220 Dec 13, 2022

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Light Gradient Boosting Machine LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed a

14.5k Jan 7, 2023

A Python toolkit for rule-based/unsupervised anomaly detection in time series

Anomaly Detection Toolkit (ADTK) Anomaly Detection Toolkit (ADTK) is a Python package for unsupervised / rule-based time series anomaly detection. As

888 Dec 30, 2022

jaxfg - Factor graph-based nonlinear optimization library for JAX.

Factor graphs + nonlinear optimization in JAX

134 Dec 21, 2022

LibTraffic is a unified, flexible and comprehensive traffic prediction library based on PyTorch

LibTraffic is a unified, flexible and comprehensive traffic prediction library, which provides researchers with a credibly experimental tool and a convenient development framework. Our library is implemented based on PyTorch, and includes all the necessary steps or components related to traffic prediction into a systematic pipeline.

432 Jan 5, 2023

WAGMA-SGD is a decentralized asynchronous SGD for distributed deep learning training based on model averaging.

WAGMA-SGD is a decentralized asynchronous SGD based on wait-avoiding group model averaging. The synchronization is relaxed by making the collectives externally-triggerable, namely, a collective can be initiated without requiring that all the processes enter it. It partially reduces the data within non-overlapping groups of process, improving the parallel scalability.

6 Jun 18, 2022

A Python-based application demonstrating various search algorithms, namely Depth-First Search (DFS), Breadth-First Search (BFS), and A* Search (Manhattan Distance Heuristic)

A Python-based application demonstrating various search algorithms, namely Depth-First Search (DFS), Breadth-First Search (BFS), and the A* Search (using the Manhattan Distance Heuristic)

17 Aug 14, 2022
