Argos Train

Overview

Trains an OpenNMT-py (PyTorch) model and a SentencePiece tokenizer. Designed for use with Argos Translate and LibreTranslate.

Pretrained Argos Translate packages (.argosmodel files) are also available for download.

Training example

$ su argosopentech
$ source ~/argos-train-init

...


$ argos-train
From code (ISO 639): en
To code (ISO 639): es
From name: English
To name: Spanish
Package version: 1.0
Argos version: 1.0

...

Package saved to /home/argosopentech/argos-train/run/en_es.argosmodel

Data

Uses parallel data from the OPUS project in Moses format, listed in the data index (data-index.json).
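
Each entry in the data index points to a downloadable .argosdata package. A minimal sketch of what an entry might contain is shown below; only the "links" and "reference" fields are mentioned elsewhere on this page, and the remaining field names and all of the values are illustrative assumptions rather than the actual schema.

{
    "name": "Example corpus",
    "from_code": "en",
    "to_code": "es",
    "reference": "Citation for the source corpus",
    "links": ["https://example.com/data-example-en_es.argosdata"]
}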

Environment

An NVIDIA GPU with CUDA is required; training is tested on Vast.ai.
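
A quick way to confirm the machine can see a CUDA device before starting a run (a generic PyTorch check, not part of argos-train itself):

python3 -c "import torch; print(torch.cuda.is_available())"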

Docker

A Docker image is available at argosopentech/argostrain:

docker run -it argosopentech/argostrain /bin/bash
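
The container can only use a GPU that is passed through from the host. With the NVIDIA Container Toolkit installed, adding Docker's standard --gpus flag should work (this is generic Docker usage, not something specific to this image):

docker run --gpus all -it argosopentech/argostrain /bin/bash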

Run training

argos-train

Troubleshooting

  • If you're running out of GPU memory, reduce batch_size and valid_batch_size in config.yml (see the excerpt below).
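
An illustrative excerpt of the relevant lines (batch_size and valid_batch_size are standard OpenNMT-py options; the values shown here are examples to lower further, not the shipped defaults):

# config.yml (excerpt)
batch_size: 4096
valid_batch_size: 2048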

License

Licensed under either the MIT or CC0 License (same as Argos Translate).

Comments
  • Argos Train Beta

    opened by argosopentech 9
  • Minimum requirement for training

    I don't have a supercomputer, but I wouldn't mind running training in the background for days, weeks, or months; my electricity is cheap too.

    Can I train on a "non server grade" computer? What is the average amount of RAM/CPU required? Can I train on an AMD graphics card (Vulkan)? Once trained, can I share/send the model to Argos?

    Thank you

    opened by Extarys 6
  • How to improve training?

    I commented in the language request discussion that Brazilian Portuguese is missing, so I'm trying to train an en→pt_BR model. I will also train an es→pt_BR model to avoid pivoting through English. I'm having some issues:

    My hardware is quite limited: I have 4 GB of GPU RAM, so, following the documentation, I reduced both batch_size and valid_batch_size and I could train a model, but this model was quite bad (example: "Roses are red" was translated to "The queen is red"). I tried increasing train_steps but it's still not great (example: "Roses are red" was translated to "Roses é vermelho", i.e. "Roses (given name) is red"). It seems the issue was with the dataset I was using (subtitles), so I added more data... then the script crashed every time: my system only has 8 GB of RAM.

    Is it possible to improve training under those hardware constraints? Or is the only way to buy more RAM?

    opened by qgustavor 4
  • Retrain on custom input

    I would love to create a model suited to technical content such as Windows Event Logs and computer logs in general.

    What's the best way to go about this?

    Additionally, as an example using a user account as the translation subject: if the text NT AUTHORITY\СИСТЕМА is given, the uk>en translation truncates it to NT AUTHORITY as the output (it should be NT AUTHORITY\SYSTEM).

    opened by geekscrapy 3
  • Found no NVIDIA driver on your system.

    While training, I get the following error:

    Traceback (most recent call last):
      File "/home/argosopentech/env/bin/onmt_train", line 33, in <module>
        sys.exit(load_entry_point('OpenNMT-py', 'console_scripts', 'onmt_train')())
      File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 172, in main
        train(opt)
      File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 157, in train
        train_process(opt, device_id=0)
      File "/home/argosopentech/OpenNMT-py/onmt/train_single.py", line 64, in main
        configure_process(opt, device_id)
      File "/home/argosopentech/OpenNMT-py/onmt/train_single.py", line 19, in configure_process
        torch.cuda.set_device(device_id)
      File "/home/argosopentech/env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 326, in set_device
        torch._C._cuda_setDevice(device)
      File "/home/argosopentech/env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
        torch._C._cuda_init()
    RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
    Traceback (most recent call last):
      File "/home/argosopentech/env/bin/argos-train", line 7, in <module>
        exec(compile(f.read(), __file__, 'exec'))
      File "/home/argosopentech/argos-train/bin/argos-train", line 20, in <module>
        train.train(from_code, to_code, from_name, to_name, version, package_version, argos_version, data_exists, epochs_count)
      File "/home/argosopentech/argos-train/argostrain/train.py", line 173, in train
        str(opennmt_checkpoints[-2].f),
    IndexError: list index out of range
    

    So basically it cannot find my graphics card. I run it on Windows in Docker via

    docker run -it argosopentech/argostrain /bin/bash
    

    Do I have to enable something before starting the training?

    Thanks and best regards.

    opened by dextreem 1
  • Auto-generated data-index.json

    Rather than manually retrieving the data and adding it to data-index.json, I think it would be simple to create a script that automatically generates data-index.json using the tables from https://opus.nlpl.eu/ and automatically downloads the data: a kind of scraper.

    opened by dingedi 1
  • prepare_data.py is appending multiple (nine) copies of source and target lines to train and val files

    The issue is that lines 38 and 39 in prepare_data.py are being called once for every naughty_string type, so source.append and target.append are being called nine times for every valid source/target line (once for each type of naughty_string).

    Lines 38 and 39 should not be in the loop on line 35; they should be moved out one indentation level (see the sketch after this issue).

    38 source.append(source_line)
    39 target.append(target_line)

    Best Regards

    opened by ahirsbrunner 1
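
    A minimal sketch of the pattern described in this issue (a reconstruction with illustrative names and data, not the actual prepare_data.py source):

    source_lines = ["Roses are red", "<p>An HTML line</p>"]
    target_lines = ["Las rosas son rojas", "<p>Una línea HTML</p>"]
    naughty_strings = ["<", ">", "&", "{", "}", "[", "]", "|", "#"]  # nine illustrative entries

    source, target = [], []
    for source_line, target_line in zip(source_lines, target_lines):
        is_valid = True
        for naughty_string in naughty_strings:  # "the loop on line 35"
            if naughty_string in source_line or naughty_string in target_line:
                is_valid = False
                break
            # BUG ("lines 38 and 39"): appending here runs once per naughty
            # string, so a clean line pair is added nine times.
            # source.append(source_line)
            # target.append(target_line)
        if is_valid:
            # FIX: append once per line pair, outside the inner loop.
            source.append(source_line)
            target.append(target_line)
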
  • Filter emojis to avoid having them in our training

    This will add emojis as 'unwanted characters' alongside HTML tags and entities. More info: https://github.com/argosopentech/argos-translate/issues/51

    Alternatively, we could strip unwanted symbols from the lines instead of filtering/removing the lines, but for now this should do it.

    opened by mmokhi 1
  • General questions

    @PJ-Finlay Hi there, currently I am interested in machine translation, with an eye on the Arabic, Urdu, and English languages. My goal for now is to learn how to train a new model from scratch, so the following questions arise:

    • How does onmt compare with other machine translation models such as mT5, CRISS, or XLM-RoBERTa?
    • How many parameters does the onmt model have, and what is its performance when translating 100 sentences on a CPU-only device (RAM usage, CPU usage, time required, etc.)? Waiting for your reply.
    opened by seekingdeep 1
  • question about contribution

    Hey, I'm not at all a programmer, nor do I really understand the tech speak, but I heard about this project and I was wondering if there is a way for me to help? I am fluent in English and Hebrew (which I saw you don't have yet); I don't even know whether knowing the language is important.

    opened by menachem-dev 1
  • Question

    I am using default settings to test around, and I've noticed that the Docker instance is stuck on corpus_1's transforms: TransformPipe(SentencePieceTransform(share_vocab=True, src_subword_model=run/sentencepiece.model, tgt_subword_model=run/sentencepiece.model, src_subword_alpha=0.0, tgt_subword_alpha=0.0, src_subword_vocab=, tgt_subword_vocab=, src_vocab_threshold=0, tgt_vocab_threshold=0, src_subword_nbest=1, tgt_subword_nbest=1), FilterTooLongTransform(src_seq_length=150, tgt_seq_length=150)). One core is at 100% usage (I let it run for an hour) with no disk/RAM activity.

    Is that normal or am I doing something wrong?

    System:

    • OS: Ubuntu 22.04
    • CPU: Ryzen 7 2700X (OC to 4.5 GHz all-core)
    • RAM: 32 GB 3200 MHz
    • Storage: NVMe, ~2 TB
    • GPUs: 2x GTX 1060 6 GB

    // EDIT: after restarting the host system it worked within 20 minutes

    opened by Allesanddro 0
  • References not correctly added to the model README

    There's functionality to automatically add the values from the datasets' "reference" field to the .argosmodel package READMEs. There's a bug: it's been broken recently and I haven't been able to figure out why.

    bug help wanted good first issue 
    opened by PJ-Finlay 0
  • Support data from a local file

    Currently Argos Train loads data listed in data-index.json from a remote HTTP server. I'd like to add support for loading data packages from a file on the local server, like this:

    "links": ["file:///home/argosopentech/Desktop/data-opensubtitles-en_de.argosdata"]
    
    enhancement help wanted good first issue 
    opened by PJ-Finlay 0
  • OpenNMT-py v3 support

    https://forum.opennmt.net/t/opennmt-py-v3-0-is-out/5077

    The vanilla transformer uses sinusoidal positional encoding (position_encoding = true). We recommend using "maximum relative positions" encoding instead (max_relative_positions=20, position_encoding=false), which again has a small overhead.

    We kept the "fusedadam" (old legacy code), which provides the best performance in speed (compared to PyTorch AMP Adam fp16, Apex level O1/O2). We tested the new Adam(fused=true) released with PyTorch 1.13, but it is way slower.

    Always use the highest batch size possible (up to your GPU RAM capacity) and use an update interval according to the "true batch size" you want. For instance, if your GPU can accept 8192 tokens and you use accum_count=12, you will have a true batch size of 98304 tokens.

    Adjust the bucket size to your CPU RAM. Most of the time a bucket between 200K and 500K examples will be suitable. The higher your bucket size, the less padding you will have, since examples are sorted within this bucket and batches are yielded from it.

    enhancement help wanted 
    opened by argosopentech 2
  • Data priority, incremental training?

    Hi there!

    1. I would like to use the data currently provided in data-index.json, but at the same time, I would like to use my custom data. Can I tell the script to generate a model that treats my custom data as more relevant / higher priority?

    2. Let's say I have one large dataset I am using all the time, and then I have multiple smaller datasets, each of which I would like to train a separate model for. Is something like an incremental build possible, so I could reuse some previous output and just "append" my custom data to save training time and resources?

    Thanks!

    enhancement help wanted good first issue 
    opened by JanCizmar 1
  • Improved error message

    We should add better error handling when there's no data available for a language.

    https://forum.opennmt.net/t/traceback-assertionerror-while-training-in-vast-ai/5037/4

    enhancement good first issue 
    opened by PJ-Finlay 0
  •  Traceback AssertionError while training in Vast.ai

    I always get this error message after uploading an .argosdata package to Vast.ai and running the argos-train command.

    Traceback (most recent call last):
      File "/home/argosopentech/env/bin/argos-train", line 7, in <module>
        exec(compile(f.read(), __file__, 'exec'))
      File "/home/argosopentech/argos-train/bin/argos-train", line 18, in <module>
        train.train(from_code, to_code, from_name, to_name, version, package_version, argos_version, data_exists)
      File "/home/argosopentech/argos-train/argostrain/train.py", line 78, in train
        source, target = dataset.data()
      File "/home/argosopentech/argos-train/argostrain/dataset.py", line 247, in data
        self.local_dataset = LocalDataset(filepath)
      File "/home/argosopentech/argos-train/argostrain/dataset.py", line 152, in __init__
        assert len(dir_names) > 0
    AssertionError

    opened by pocakka 0