This is a GraphQL API built with Ariadne (Python) that serves a GraphQL endpoint on port 3002 to perform language translation and identification using PyTorch deep learning models.

crispengari

Last update: Dec 30, 2021

Related tags

GraphQL python graphql machine-learning deep-learning machine-translation pytorch artificial-intelligence ariadne torchtext

Overview

Language Translation and Identification

This is a machine/deep learning API, served as a GraphQL API using Ariadne, that performs the following tasks.

1. Language Identification

Identifies the language a text is written in using a simple text classification model. The model can identify 7 different languages:

english (en)
french (fr)
german (de)
spanish (es)
italian (it)
portuguese (pt)
swedish (sw)

2. Language Translation

Language translation is bidirectional between English and another language, for example `english-to-french`. The translation API can translate between the following language pairs:

eng-de (english to german)
de-eng (german to english)
eng-af (english to afrikaans)
af-eng (afrikaans to english)
fr-eng (french to english)
eng-fr (english to french)
es-eng (spanish to english)
eng-es (english to spanish)
it-eng (italian to english)
eng-it (english to italian)
pt-eng (portuguese to english)
eng-pt (english to portuguese)
sw-eng (swedish to english)
eng-sw (english to swedish)
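
Since every pair listed above has English on one side, the supported directions can be kept in a small lookup so that an unsupported request is rejected before any model is touched. The following is only a sketch: `SUPPORTED_PAIRS` and `model_name` are hypothetical names that do not come from the project, and the language codes are copied from the list above.

```python
# Hypothetical helper: these names are illustrative, not the project's API.
NON_ENGLISH = ("de", "af", "fr", "es", "it", "pt", "sw")

# Every listed model translates either from English or to English.
SUPPORTED_PAIRS = {f"eng-{code}" for code in NON_ENGLISH} | {
    f"{code}-eng" for code in NON_ENGLISH
}


def model_name(from_: str, to: str) -> str:
    """Return the pair name (for example 'it-eng') or raise if unsupported."""
    name = f"{from_}-{to}"
    if name not in SUPPORTED_PAIRS:
        raise ValueError(f"unsupported translation pair: {name}")
    return name
```

For example, `model_name("it", "eng")` gives `"it-eng"`, while `model_name("fr", "de")` raises a `ValueError` because neither side is English.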

Starting the server

To start the server, first install all the packages that we used and make sure you have the .pt files for both the translation and identification models. To install the packages, run the following command:

pip install -r requirements.txt

Note that to save the .pt files for the models you have to train the models first. The notebooks for doing so can be found in the repository links given at the end of this README file.
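
Because the server depends on model files that are produced by training, it can help to check for them before starting. The sketch below is a generic pre-flight check. `find_missing_files` is a hypothetical helper, and the relative paths you pass to it are your own assumption about where the files live, since this README does not list them.

```python
from pathlib import Path


def find_missing_files(root, required):
    """Return the required relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).is_file()]
```

Calling `find_missing_files(".", [...])` with the paths you expect returns an empty list when everything is in place, and otherwise the files that still need to be generated.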

Models Metrics Summary

Language Translation models

model name	model description	BLEU metric	test PPL	challenges
eng-de	translate sentences from english to german.	36.64	8.807	the model was trained for only a short time due to Google Colab session limits.
de-eng	translate sentences from german to english.	46.20	7.783	the model was trained for only a short time due to Google Colab session limits.
eng-af	translate sentences from english to afrikaans.	0.00	23.635	the dataset that I used had few examples.
es-eng	translate sentences from spanish to english.	44.12	8.097	the model was trained for only a short time due to Google Colab session limits.
eng-es	translate sentences from english to spanish.	33.74	12.877	the model was trained for only a short time due to Google Colab session limits.
eng-fr	translate sentences from english to french.	52.45	8.803	the model was trained for only a short time due to Google Colab session limits.
fr-eng	translate sentences from french to english.	40.17	8.803	the model was trained for only a short time due to Google Colab session limits.
eng-it	translate sentences from english to italian.	48.90	6.288	the model was trained for only a short time due to Google Colab session limits.
it-eng	translate sentences from italian to english.	72.67	2.530	the model was trained for only a short time due to Google Colab session limits.
eng-pt	translate sentences from english to portuguese.	45.92	7.721	the model was trained for only a short time due to Google Colab session limits.
pt-eng	translate sentences from portuguese to english.	58.23	4.371	the model was trained for only a short time due to Google Colab session limits.
eng-sw	translate sentences from english to swedish.	26.19	11.406	the model was trained for only a short time due to Google Colab session limits.
sw-eng	translate sentences from swedish to english.	37.13	10.160	the model was trained for only a short time due to Google Colab session limits.
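
To read the test PPL column: perplexity is conventionally the exponential of the mean per-token cross-entropy loss (in nats). The README does not state how its values were computed, so the sketch below assumes that convention. The function names are illustrative.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity as the exponential of the mean per-token cross-entropy (nats)."""
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(ppl: float) -> float:
    """Inverse relation: the mean cross-entropy implied by a perplexity value."""
    return math.log(ppl)
```

Under this assumption, the it-eng test PPL of 2.530 corresponds to a mean test loss of roughly 0.93, and lower perplexity means the model spreads less probability over wrong tokens.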

Language Identification models

For language identification I used a model based on the fastText paper, for quick training on a Google Colab GPU.

model name	model description	test accuracy	validation accuracy	train accuracy	test loss	validation loss	train loss
best-lang-ident-model	identifies which language a sentence belongs to.	99.22%	99.00%	100%	0.036	0.036	0.000
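
The fastText approach to text classification adds word n-gram features to the bag of words before averaging embeddings and applying a linear classifier. This README does not say which of those pieces the identification model uses, so the following is only a sketch of the bigram step, with `add_bigrams` as a hypothetical helper name.

```python
def add_bigrams(tokens):
    """Append each distinct adjacent token pair (joined by a space) to the tokens."""
    seen = set()
    bigrams = []
    for left, right in zip(tokens, tokens[1:]):
        pair = f"{left} {right}"
        if pair not in seen:  # keep each bigram once, in first-seen order
            seen.add(pair)
            bigrams.append(pair)
    return list(tokens) + bigrams
```

The extra pair features give an otherwise order-blind model some local word-order information at little training cost.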

Language Translation Model (graphql api)

The GraphQL server runs on http://127.0.0.1:3002/graphql. If you send the following GraphQL mutation:

mutation Translator($input: TranslationInputType!) {
  translate(input: $input) {
    from_
    meta {
      name
      language
      author
      package
      description
      project
    }
    translation
    sent
  }
}

With the following query variables:

{
  "input": {
    "to": "eng",
    "from_": "it",
    "text": "ciao , come stai ?"
  }
}

You will get the following response:

{
  "data": {
    "translate": {
      "from_": "it",
      "meta": {
        "author": "@crispengari",
        "description": "language identification and translation graphql api.",
        "language": "python",
        "name": "ml backend",
        "package": "pytorch",
        "project": "noteme"
      },
      "sent": "ciao , come stai ?",
      "translation": "hello , how are you ? ."
    }
  }
}
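
The same mutation can be sent from a Python client. The sketch below uses only the standard library and assumes the server accepts the usual GraphQL-over-HTTP JSON body (`query` plus `variables`) via POST. `build_payload` and `post_graphql` are illustrative names, and the selection is trimmed to three of the fields shown above.

```python
import json
import urllib.request

GRAPHQL_URL = "http://127.0.0.1:3002/graphql"  # endpoint stated in this README

TRANSLATE_MUTATION = """
mutation Translator($input: TranslationInputType!) {
  translate(input: $input) {
    from_
    translation
    sent
  }
}
"""


def build_payload(text, from_, to):
    """Build the JSON body for the translate mutation."""
    return {
        "query": TRANSLATE_MUTATION,
        "variables": {"input": {"to": to, "from_": from_, "text": text}},
    }


def post_graphql(payload, url=GRAPHQL_URL):
    """POST a GraphQL payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `post_graphql(build_payload("ciao , come stai ?", "it", "eng"))` should return a dictionary shaped like the response above, restricted to the selected fields.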

Language Identification Model (graphql api)

To identify the language that the text is written in, we run the following mutation on http://127.0.0.1:3002/graphql

mutation Identify($input: IdentificationInputType!) {
  identify(input: $input) {
    probability
    label
    lang
    prediction {
      code
      id
      name
    }
    predictions {
      prediction {
        code
        id
        name
      }
      probability
    }
  }
}

With the following query variables:

{
  "input": {
    "text": "how are you?"
  }
}

You will get the following response:

{
  "data": {
    "identify": {
      "label": 0,
      "lang": "eng",
      "prediction": {
        "code": "eng",
        "id": 0,
        "name": "english"
      },
      "predictions": [
        {
          "prediction": {
            "code": "eng",
            "id": 0,
            "name": "english"
          },
          "probability": 1
        },
        {
          "prediction": {
            "code": "swe",
            "id": 1,
            "name": "swedish"
          },
          "probability": 0
        },
        {
          "prediction": {
            "code": "fra",
            "id": 2,
            "name": "french"
          },
          "probability": 0
        },
        {
          "prediction": {
            "code": "deu",
            "id": 3,
            "name": "germany"
          },
          "probability": 0
        },
        {
          "prediction": {
            "code": "ita",
            "id": 4,
            "name": "italian"
          },
          "probability": 0
        },
        {
          "prediction": {
            "code": "por",
            "id": 5,
            "name": "portuguese"
          },
          "probability": 0
        },
        {
          "prediction": {
            "code": "afr",
            "id": 6,
            "name": "afrikaans"
          },
          "probability": 0
        }
      ],
      "probability": 1
    }
  }
}
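
One plausible way to arrive at a response of this shape is to apply a softmax to the classifier's raw scores and attach a label entry to each probability. The sketch below assumes exactly that. The label table is copied from the sample response, while `build_identify_response` and the rounding to two decimals are assumptions, not the project's code.

```python
import math

# Label table copied from the sample response above: (id, code, name).
LANGUAGES = [
    (0, "eng", "english"),
    (1, "swe", "swedish"),
    (2, "fra", "french"),
    (3, "deu", "germany"),
    (4, "ita", "italian"),
    (5, "por", "portuguese"),
    (6, "afr", "afrikaans"),
]


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_identify_response(logits, ndigits=2):
    """Turn raw class scores into a dictionary shaped like the identify payload."""
    if len(logits) != len(LANGUAGES):
        raise ValueError("expected one score per language")
    probs = softmax(logits)
    predictions = [
        {
            "prediction": {"code": code, "id": idx, "name": name},
            "probability": round(prob, ndigits),
        }
        for (idx, code, name), prob in zip(LANGUAGES, probs)
    ]
    label = max(range(len(probs)), key=probs.__getitem__)  # index of top class
    best = predictions[label]
    return {
        "label": label,
        "lang": best["prediction"]["code"],
        "prediction": best["prediction"],
        "predictions": predictions,
        "probability": best["probability"],
    }
```

Rounding would explain why a confident prediction shows up as exactly 1 with every other language at 0.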

Why graphql?

With GraphQL, the client selects only the fields it is interested in. This also lets us use a single endpoint, for example http://127.0.0.1:3002/graphql, for all the identification and translation models.
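
To illustrate field selection, a client can assemble the translate mutation with only the fields it needs. This is a sketch with a hypothetical helper, `build_translate_query`. The field names must be ones the schema exposes, such as those shown in the translation example earlier.

```python
def build_translate_query(fields):
    """Return a translate mutation that selects only the given top-level fields."""
    if not fields:
        raise ValueError("select at least one field")
    selection = "\n    ".join(fields)
    return (
        "mutation Translator($input: TranslationInputType!) {\n"
        "  translate(input: $input) {\n"
        f"    {selection}\n"
        "  }\n"
        "}"
    )
```

A client that only wants the translated text can call `build_translate_query(["translation"])` and receive a response without the `meta` block.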

Why language translation?

This project was built to translate simple and complex sentences for 7 different languages. The idea came from the project likeme, where we perform some processing on a user's caption using PyTorch deep learning models. The following steps were considered to preprocess the caption:

identify the language the caption is written in
translate the given caption to a certain language

Notebooks

Translation models

All the notebooks for the translation models are found here

Identification model

The notebook for language identification model is found here

Comments

Problem running the project

Hi Crispen,

I'm getting an error about a missing JSON file for vocab (machine-translator/translation/models/eng-deu/static/src_vocab.json). Any chance you can see what I'm doing wrong? Am I missing a file?

Here's what I have:

~/Doc/p/machine-translator (main) $ python main.py
 ✅ LOADING TOKENIZERS

 ✅ LOADING TOKENIZERS DONE!

 ✅ LOADING TRANSLATION MODELS

Traceback (most recent call last):
  File "/Users/manolo/Documents/python/machine-translator/main.py", line 22, in <module>
    from resolvers.mutations import mutation
  File "/Users/manolo/Documents/python/machine-translator/resolvers/mutations/__init__.py", line 2, in <module>
    from translation import getFunctionParams, translate_sentence, EOS_TOKEN, UNK_TOKEN, device, meta
  File "/Users/manolo/Documents/python/machine-translator/translation/__init__.py", line 88, in <module>
    DE_DE_DICT, DE_EN_DICT = createDictMappings('eng-deu')
  File "/Users/manolo/Documents/python/machine-translator/translation/__init__.py", line 59, in createDictMappings
    with open(src_json_path, 'r') as src, open(trg_json_path, 'r') as trg:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/manolo/Documents/python/machine-translator/translation/models/eng-deu/static/src_vocab.json'

opened by paulterinho