Argos Train

Overview

Trains an OpenNMT-py (PyTorch) model and a SentencePiece tokenizer. Designed for use with Argos Translate and LibreTranslate.

Pretrained Argos Translate packages (.argosmodel files) are also available for download.

Training example

$ su argosopentech
$ source ~/argos-train-init

...


$ argos-train
From code (ISO 639): en
To code (ISO 639): es
From name: English
To name: Spanish
Package version: 1.0
Argos version: 1.0

...

Package saved to /home/argosopentech/argos-train/run/en_es.argosmodel

Data

Uses parallel data from the OPUS project in Moses format, listed in the data index (data-index.json).
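
Each entry in the data index points to a downloadable .argosdata package. A minimal sketch of what an entry might contain is shown below; only the "links" and "reference" fields are mentioned elsewhere on this page, and the remaining field names and all of the values are illustrative assumptions rather than the actual schema.

{
    "name": "Example corpus",
    "from_code": "en",
    "to_code": "es",
    "reference": "Citation for the source corpus",
    "links": ["https://example.com/data-example-en_es.argosdata"]
}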

Environment

An NVIDIA GPU with CUDA is required; training is tested on Vast.ai.
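
A quick way to confirm the machine can see a CUDA device before starting a run (a generic PyTorch check, not part of argos-train itself):

python3 -c "import torch; print(torch.cuda.is_available())"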

Docker

A Docker image is available at argosopentech/argostrain:

docker run -it argosopentech/argostrain /bin/bash
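
The container can only use a GPU that is passed through from the host. With the NVIDIA Container Toolkit installed, adding Docker's standard --gpus flag should work (this is generic Docker usage, not something specific to this image):

docker run --gpus all -it argosopentech/argostrain /bin/bash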

Run training

argos-train

Troubleshooting

  • If you're running out of GPU memory, reduce batch_size and valid_batch_size in config.yml (see the excerpt below).
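
An illustrative excerpt of the relevant lines (batch_size and valid_batch_size are standard OpenNMT-py options; the values shown here are examples to lower further, not the shipped defaults):

# config.yml (excerpt)
batch_size: 4096
valid_batch_size: 2048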

License

Licensed under either the MIT or CC0 License (same as Argos Translate).

Comments
  • Argos Train Beta

    opened by argosopentech 9
  • Minimum requirement for training

    I don't have a supercomputer, but I wouldn't mind running training in the background for days, weeks, or months; my electricity is cheap too.

    Can I train on a "non server grade" computer? What is the average amount of RAM/CPU required? Can I train on an AMD graphics card (Vulkan)? Once trained, can I share/send the model to Argos?

    Thank you

    opened by Extarys 6
  • How to improve training?

    I commented in the language request discussion that Brazilian Portuguese is missing, so I'm trying to train an en→pt_BR model. I will also train an es→pt_BR model to avoid pivoting through English. I'm having some issues:

    My hardware is quite limited: I have 4 GB of GPU RAM, so, following the documentation, I reduced both batch_size and valid_batch_size and I could train a model, but this model was quite bad (example: "Roses are red" was translated to "The queen is red"). I tried increasing train_steps but it's still not great (example: "Roses are red" was translated to "Roses é vermelho", i.e. "Roses (given name) is red"). It seems the issue was with the dataset I was using (subtitles), so I added more data... then the script crashed every time: my system only has 8 GB of RAM.

    Is it possible to improve training under those hardware constraints? Or is the only way to buy more RAM?

    opened by qgustavor 4
  • Retrain on custom input

    I would love to create a model suited to technical content such as Windows Event Logs and computer logs in general.

    What's the best way to go about this?

    Additionally, as an example using a user account as the translation subject: if the text NT AUTHORITY\СИСТЕМА is given, the uk>en translation truncates it to NT AUTHORITY as the output (it should be NT AUTHORITY\SYSTEM).

    opened by geekscrapy 3
  • Found no NVIDIA driver on your system.

    While training, I get the following error:

    Traceback (most recent call last):
      File "/home/argosopentech/env/bin/onmt_train", line 33, in <module>
        sys.exit(load_entry_point('OpenNMT-py', 'console_scripts', 'onmt_train')())
      File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 172, in main
        train(opt)
      File "/home/argosopentech/OpenNMT-py/onmt/bin/train.py", line 157, in train
        train_process(opt, device_id=0)
      File "/home/argosopentech/OpenNMT-py/onmt/train_single.py", line 64, in main
        configure_process(opt, device_id)
      File "/home/argosopentech/OpenNMT-py/onmt/train_single.py", line 19, in configure_process
        torch.cuda.set_device(device_id)
      File "/home/argosopentech/env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 326, in set_device
        torch._C._cuda_setDevice(device)
      File "/home/argosopentech/env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
        torch._C._cuda_init()
    RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
    Traceback (most recent call last):
      File "/home/argosopentech/env/bin/argos-train", line 7, in <module>
        exec(compile(f.read(), __file__, 'exec'))
      File "/home/argosopentech/argos-train/bin/argos-train", line 20, in <module>
        train.train(from_code, to_code, from_name, to_name, version, package_version, argos_version, data_exists, epochs_count)
      File "/home/argosopentech/argos-train/argostrain/train.py", line 173, in train
        str(opennmt_checkpoints[-2].f),
    IndexError: list index out of range
    

    So basically it cannot find my graphics card. I run it on Windows in Docker via

    docker run -it argosopentech/argostrain /bin/bash
    

    Do I have to enable something before starting the training?

    Thanks and best regards.

    opened by dextreem 1
  • Auto-generated data-index.json

    Rather than manually retrieving the data and adding it to data-index.json, I think it would be simple to create a script that automatically generates data-index.json using the tables from https://opus.nlpl.eu/ and automatically downloads the data: a kind of scraper.

    opened by dingedi 1
  • prepare_data.py is appending multiple (nine) copies of source and target lines to train and val files

    The issue is that lines 38 and 39 in prepare_data.py are being called once for every naughty_string type, so source.append and target.append are being called nine times for every valid source/target line (once for each type of naughty_string).

    Lines 38 and 39 should not be in the loop on line 35; they should be moved out one indentation level (see the sketch after this issue).

    38 source.append(source_line)
    39 target.append(target_line)

    Best Regards

    opened by ahirsbrunner 1
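
    A minimal sketch of the pattern described in this issue (a reconstruction with illustrative names and data, not the actual prepare_data.py source):

    source_lines = ["Roses are red", "<p>An HTML line</p>"]
    target_lines = ["Las rosas son rojas", "<p>Una línea HTML</p>"]
    naughty_strings = ["<", ">", "&", "{", "}", "[", "]", "|", "#"]  # nine illustrative entries

    source, target = [], []
    for source_line, target_line in zip(source_lines, target_lines):
        is_valid = True
        for naughty_string in naughty_strings:  # "the loop on line 35"
            if naughty_string in source_line or naughty_string in target_line:
                is_valid = False
                break
            # BUG ("lines 38 and 39"): appending here runs once per naughty
            # string, so a clean line pair is added nine times.
            # source.append(source_line)
            # target.append(target_line)
        if is_valid:
            # FIX: append once per line pair, outside the inner loop.
            source.append(source_line)
            target.append(target_line)
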
  • Filter emojis to avoid having them in our training

    This will add emojis as 'unwanted characters' alongside HTML tags and entities. More info: https://github.com/argosopentech/argos-translate/issues/51

    Alternatively, we could strip unwanted symbols from the lines instead of filtering/removing the lines, but for now this should do it.

    opened by mmokhi 1
  • General questions

    @PJ-Finlay Hi there, currently I am interested in machine translation, with an eye on the Arabic, Urdu, and English languages. My goal for now is to learn how to train a new model from scratch, so the following questions arise:

    • How does onmt compare with other machine translation models such as mT5, CRISS, or XLM-RoBERTa?
    • How many parameters does the onmt model have, and what is its performance when translating 100 sentences on a CPU-only device (RAM usage, CPU usage, time required, etc.)? Waiting for your reply.
    opened by seekingdeep 1
  • question about contribution

    Hey, I'm not at all a programmer, nor do I really understand the tech speak, but I heard about this project and I was wondering if there is a way for me to help? I am fluent in English and Hebrew (which I saw you don't have yet); I don't even know whether knowing the language is important.

    opened by menachem-dev 1
  • Question

    I am using default settings to test around, and I've noticed that the Docker instance is stuck on corpus_1's transforms: TransformPipe(SentencePieceTransform(share_vocab=True, src_subword_model=run/sentencepiece.model, tgt_subword_model=run/sentencepiece.model, src_subword_alpha=0.0, tgt_subword_alpha=0.0, src_subword_vocab=, tgt_subword_vocab=, src_vocab_threshold=0, tgt_vocab_threshold=0, src_subword_nbest=1, tgt_subword_nbest=1), FilterTooLongTransform(src_seq_length=150, tgt_seq_length=150)). One core is at 100% usage (I let it run for an hour) with no disk/RAM activity.

    Is that normal or am I doing something wrong?

    System:

    • OS: Ubuntu 22.04
    • CPU: Ryzen 7 2700X (OC to 4.5 GHz all-core)
    • RAM: 32 GB 3200 MHz
    • Storage: NVMe, ~2 TB
    • GPUs: 2x GTX 1060 6 GB

    // EDIT: after restarting the host system it worked within 20 minutes

    opened by Allesanddro 0
  • References not correctly added to the model README

    There's functionality to automatically add the values from the datasets' "reference" field to the .argosmodel package READMEs. There's a bug: it's been broken recently and I haven't been able to figure out why.

    bug help wanted good first issue 
    opened by PJ-Finlay 0
  • Support data from a local file

    Currently Argos Train loads data listed in data-index.json from a remote HTTP server. I'd like to add support for loading data packages from a file on the local server, like this:

    "links": ["file:///home/argosopentech/Desktop/data-opensubtitles-en_de.argosdata"]
    
    enhancement help wanted good first issue 
    opened by PJ-Finlay 0
  • OpenNMT-py v3 support

    https://forum.opennmt.net/t/opennmt-py-v3-0-is-out/5077

    The vanilla transformer uses sinusoidal positional encoding (position_encoding = true). We recommend using "maximum relative positions" encoding instead (max_relative_positions=20, position_encoding=false), which again has a small overhead.

    We kept the "fusedadam" (old legacy code), which provides the best performance in speed (compared to PyTorch AMP Adam fp16, Apex level O1/O2). We tested the new Adam(fused=true) released with PyTorch 1.13, but it is way slower.

    Always use the highest batch size possible (up to your GPU RAM capacity) and use an update interval according to the "true batch size" you want. For instance, if your GPU can accept 8192 tokens and you use accum_count=12, you will have a true batch size of 98304 tokens.

    Adjust the bucket size to your CPU RAM. Most of the time a bucket between 200K and 500K examples will be suitable. The higher your bucket size, the less padding you will have, since examples are sorted within this bucket and batches are yielded from it.

    enhancement help wanted 
    opened by argosopentech 2
  • Data priority, incremental training?

    Hi there!

    1. I would like to use the data currently provided in data-index.json, but at the same time, I would like to use my custom data. Can I tell the script to generate a model that treats my custom data as more relevant / higher priority?

    2. Let's say I have one large dataset I am using all the time, and then I have multiple smaller datasets, each of which I would like to train a separate model for. Is something like an incremental build possible, so I could reuse some previous output and just "append" my custom data to save training time and resources?

    Thanks!

    enhancement help wanted good first issue 
    opened by JanCizmar 1
  • Improved error message

    We should add better error handling when there's no data available for a language.

    https://forum.opennmt.net/t/traceback-assertionerror-while-training-in-vast-ai/5037/4

    enhancement good first issue 
    opened by PJ-Finlay 0
  •  Traceback AssertionError while training in Vast.ai

    I always get this error message after uploading an .argosdata package to Vast.ai and running the argos-train command.

    Traceback (most recent call last):
      File "/home/argosopentech/env/bin/argos-train", line 7, in <module>
        exec(compile(f.read(), __file__, 'exec'))
      File "/home/argosopentech/argos-train/bin/argos-train", line 18, in <module>
        train.train(from_code, to_code, from_name, to_name, version, package_version, argos_version, data_exists)
      File "/home/argosopentech/argos-train/argostrain/train.py", line 78, in train
        source, target = dataset.data()
      File "/home/argosopentech/argos-train/argostrain/dataset.py", line 247, in data
        self.local_dataset = LocalDataset(filepath)
      File "/home/argosopentech/argos-train/argostrain/dataset.py", line 152, in __init__
        assert len(dir_names) > 0
    AssertionError

    opened by pocakka 0