Neural Machine Translation (NMT) tutorial with OpenNMT-py

Overview

OpenNMT-py Tutorial

A Neural Machine Translation (NMT) tutorial with OpenNMT-py, covering data preprocessing, model training, evaluation, and deployment.

Fundamentals

Advanced Topics

  • Running TensorBoard with OpenNMT (tutorial)
  • Low-Resource Neural Machine Translation (tutorial)
  • Domain Adaptation with Mixed Fine-tuning (tutorial)
  • Overview of Domain Adaptation Techniques (tutorial)
  • Multilingual Machine Translation (tutorial)

Comments
  • AssertionError when training the model

    Hi, I'm trying to train a model on my data. The dataset is pretty small, fewer than 6,000 sentences. I used the first tutorial for preprocessing, and everything worked just fine. Now when I try to train the model, I get an error:

    [2022-11-19 10:42:07,824 WARNING] Corpus corpus_1's weight should be given. We default it to 1 for you.
    [2022-11-19 10:42:07,825 INFO] Parsed 2 corpora from -data.
    [2022-11-19 10:42:07,826 INFO] Get special vocabs from Transforms: {'src': set(), 'tgt': set()}.
    [2022-11-19 10:42:07,887 INFO] Building model...
    Traceback (most recent call last):
      File "/usr/local/bin/onmt_train", line 8, in <module>
        sys.exit(main())
      File "/usr/local/lib/python3.7/dist-packages/onmt/bin/train.py", line 65, in main
        train(opt)
      File "/usr/local/lib/python3.7/dist-packages/onmt/bin/train.py", line 50, in train
        train_process(opt, device_id=0)
      File "/usr/local/lib/python3.7/dist-packages/onmt/train_single.py", line 136, in main
        model = build_model(model_opt, opt, vocabs, checkpoint)
      File "/usr/local/lib/python3.7/dist-packages/onmt/model_builder.py", line 327, in build_model
        model = build_base_model(model_opt, vocabs, use_gpu(opt), checkpoint)
      File "/usr/local/lib/python3.7/dist-packages/onmt/model_builder.py", line 242, in build_base_model
        model = build_task_specific_model(model_opt, vocabs)
      File "/usr/local/lib/python3.7/dist-packages/onmt/model_builder.py", line 158, in build_task_specific_model
        encoder, src_emb = build_encoder_with_embeddings(model_opt, vocabs)
      File "/usr/local/lib/python3.7/dist-packages/onmt/model_builder.py", line 131, in build_encoder_with_embeddings
        encoder = build_encoder(model_opt, src_emb)
      File "/usr/local/lib/python3.7/dist-packages/onmt/model_builder.py", line 73, in build_encoder
        return str2enc[enc_type].from_opt(opt, embeddings)
      File "/usr/local/lib/python3.7/dist-packages/onmt/encoders/transformer.py", line 120, in from_opt
        add_qkvbias=opt.add_qkvbias
      File "/usr/local/lib/python3.7/dist-packages/onmt/encoders/transformer.py", line 103, in __init__
        for i in range(num_layers)])
      File "/usr/local/lib/python3.7/dist-packages/onmt/encoders/transformer.py", line 103, in <listcomp>
        for i in range(num_layers)])
      File "/usr/local/lib/python3.7/dist-packages/onmt/encoders/transformer.py", line 38, in __init__
        attn_type="self", add_qkvbias=add_qkvbias)
      File "/usr/local/lib/python3.7/dist-packages/onmt/modules/multi_headed_attn.py", line 118, in __init__
        assert model_dim % head_count == 0
    AssertionError
    

    I found this issue with the same error. The solution is supposed to lie in the hyperparameters, but I checked them and can't find the problem. Could you give me a hint on how to solve this? Thank you!

    opened by sete-nay 2
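
The failing line in the traceback above is `assert model_dim % head_count == 0`, so the model dimension must be evenly divisible by the number of attention heads. The following is a minimal sketch of that check, not code from OpenNMT-py itself. The function names are hypothetical, and the example values are made up for illustration. Which config options feed `model_dim` and `head_count` (likely the hidden size and `heads` settings) is an assumption to verify against your own config.

```python
from typing import List


def check_attention_dims(model_dim: int, head_count: int) -> bool:
    """Mirror the assertion from the traceback: model_dim % head_count == 0."""
    return model_dim % head_count == 0


def valid_head_counts(model_dim: int, max_heads: int = 16) -> List[int]:
    """List the head counts up to max_heads that divide model_dim evenly."""
    return [h for h in range(1, max_heads + 1) if model_dim % h == 0]


if __name__ == "__main__":
    # Hypothetical values; substitute the ones from your own training config.
    for model_dim, heads in [(512, 8), (500, 8)]:
        if check_attention_dims(model_dim, heads):
            print(f"model_dim={model_dim}, heads={heads}: ok")
        else:
            options = valid_head_counts(model_dim)
            print(f"model_dim={model_dim}, heads={heads}: not divisible; "
                  f"head counts that would pass: {options}")
```

Running a check like this on the values in the training config before launching `onmt_train` can show whether the two settings are compatible.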
  • Weighted corpora loaded so far❓

    I tested this code on my own dataset of 4k parallel sentences. But when I train the model, it always shows "Weighted corpora loaded so far". How can I solve this problem? Thank you.

    opened by pariskang 1
  • RuntimeError: DataLoader worker (pid n) is killed by signal: Killed

    On Google Colab (free version), the training stops after some time with an error like:

    RuntimeError: DataLoader worker (pid 629) is killed by signal: Killed.
    

    As verified by running dmesg -T, this is a RAM out-of-memory error.

    Memory cgroup out of memory: Killed process 629 (onmt_train) total-vm:14119556kB, anon-rss:6538204kB, file-rss:80652kB, shmem-rss:16kB, UID:0 pgtables:13432kB oom_score_adj:0
    
    opened by ymoslem 1
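
The kernel log line above reports memory figures in kB, which are hard to read at a glance. As a rough sketch, the snippet below pulls the `name:<n>kB` fields out of that line and converts them to GiB. The helper names are hypothetical, and the regular expression assumes the field layout shown in this particular log line.

```python
import re
from typing import Dict

# The log line quoted in the issue above.
OOM_LINE = (
    "Memory cgroup out of memory: Killed process 629 (onmt_train) "
    "total-vm:14119556kB, anon-rss:6538204kB, file-rss:80652kB, "
    "shmem-rss:16kB, UID:0 pgtables:13432kB oom_score_adj:0"
)


def parse_oom_fields(line: str) -> Dict[str, int]:
    """Return every 'name:<digits>kB' field in the line as {name: kB}."""
    return {name: int(kb) for name, kb in re.findall(r"([\w-]+):(\d+)kB", line)}


def kb_to_gib(kb: int) -> float:
    """Convert kB (1024 bytes) to GiB."""
    return kb / (1024 ** 2)


if __name__ == "__main__":
    for name, kb in parse_oom_fields(OOM_LINE).items():
        print(f"{name}: {kb_to_gib(kb):.2f} GiB")
```

For the line quoted here, `anon-rss` works out to roughly 6.2 GiB of resident memory held by the `onmt_train` process when it was killed.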
  • ValueError: invalid literal for int() with base 10: '-2.34575'

    I've followed all the instructions with a corpus size of around 300,000 (vocab 25,000) and keep running into this issue (I have tried multiple times, with the same problem). I've completed all the preprocessing, model training, etc. successfully, but the library errors on a specific entry in the source.vocab (below): [image]

    Do you have any idea how I can resolve my issue?

    opened by ArtanisTheOne 1
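
The error message above says that something tried to read `'-2.34575'` as an integer. One way to locate such entries is to scan the vocabulary file for lines whose second column is not an integer. The sketch below assumes a tab-separated `token<TAB>value` layout, which may not match your file. The function name and sample lines are hypothetical, and the snippet only reports offending lines without diagnosing why they are there.

```python
from typing import Iterable, List, Tuple


def find_non_integer_counts(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs whose second tab-separated field is not an int."""
    bad = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # no second column to check
        try:
            int(parts[1])
        except ValueError:
            bad.append((lineno, line))
    return bad


if __name__ == "__main__":
    # Made-up sample lines; to scan a real file, pass an open file object instead:
    #     with open("source.vocab", encoding="utf-8") as f:
    #         print(find_non_integer_counts(f))
    sample = ["the\t1520\n", "de\t-2.34575\n", "cat\t87\n"]
    for lineno, line in find_non_integer_counts(sample):
        print(f"line {lineno}: {line!r}")
```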
Owner
Yasmin Moslem
Machine Translation Researcher