Language Translation and Identification
This is a machine/deep learning API, served as a GraphQL API using Ariadne, that performs the following tasks.
1. Language Identification
Identifies the language a text is written in, using a simple text classification model. The model can identify 7 different languages:
- english (en)
- french (fr)
- german (de)
- spanish (es)
- italian (it)
- portuguese (pt)
- swedish (sw)
2. Language Translation
Language translation offers bidirectional translation between English and another language, for example `english-to-french`. The translation API supports the following language pairs:
- eng-de (english to german)
- de-eng (german to english)
- eng-af (english to afrikaans)
- af-eng (afrikaans to english)
- fr-eng (french to english)
- eng-fr (english to french)
- es-eng (spanish to english)
- eng-es (english to spanish)
- it-eng (italian to english)
- eng-it (english to italian)
- pt-eng (portuguese to english)
- eng-pt (english to portuguese)
- sw-eng (swedish to english)
- eng-sw (english to swedish)
Starting the server
To start the server, first install all the required packages and make sure you have the .pt files for both the translation and identification models. To install the packages, run the following command:
pip install -r requirements.txt
Note that to save the .pt files you have to train the models first. The notebooks for doing so can be found in the repositories linked at the end of this README file.
Models Metrics Summary
- Language Translation models
model name | model description | BLEU metric | test PPL | challenges |
---|---|---|---|---|
eng-de | translate sentences from english to german. | 36.64 | 8.807 | the model trains for a short period of time due to google colab session limitations. |
de-eng | translate sentences from german to english. | 46.20 | 7.783 | the model trains for a short period of time due to google colab session limitations. |
eng-af | translate sentences from english to afrikaans. | 0.00 | 23.635 | the dataset that I used had few examples. |
es-eng | translate sentences from spanish to english. | 44.12 | 8.097 | the model trains for a short period of time due to google colab session limitations. |
eng-es | translate sentences from english to spanish. | 33.74 | 12.877 | the model trains for a short period of time due to google colab session limitations. |
eng-fr | translate sentences from english to french. | 52.45 | 8.803 | the model trains for a short period of time due to google colab session limitations. |
fr-eng | translate sentences from french to english. | 40.17 | 8.803 | the model trains for a short period of time due to google colab session limitations. |
eng-it | translate sentences from english to italian. | 48.90 | 6.288 | the model trains for a short period of time due to google colab session limitations. |
it-eng | translate sentences from italian to english. | 72.67 | 2.530 | the model trains for a short period of time due to google colab session limitations. |
eng-pt | translate sentences from english to portuguese. | 45.92 | 7.721 | the model trains for a short period of time due to google colab session limitations. |
pt-eng | translate sentences from portuguese to english. | 58.23 | 4.371 | the model trains for a short period of time due to google colab session limitations. |
eng-sw | translate sentences from english to swedish. | 26.19 | 11.406 | the model trains for a short period of time due to google colab session limitations. |
sw-eng | translate sentences from swedish to english. | 37.13 | 10.160 | the model trains for a short period of time due to google colab session limitations. |
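To help read the test PPL column above: perplexity is commonly computed as the exponential of the average per-token cross-entropy loss. The sketch below assumes that convention and a natural-log loss, which is typical for PyTorch training loops but is not confirmed by this README. The function names are hypothetical.

```python
import math


def perplexity_from_loss(loss):
    """Perplexity for an average cross-entropy loss measured in nats (assumed)."""
    return math.exp(loss)


def loss_from_perplexity(ppl):
    """Inverse of perplexity_from_loss: the loss implied by a reported PPL."""
    return math.log(ppl)


# Under this assumption, the eng-de test PPL of 8.807 corresponds to a
# test loss of roughly 2.18.
eng_de_loss = loss_from_perplexity(8.807)
print(f"{eng_de_loss:.2f}")
```

A lower PPL therefore means a lower test loss, which is why it-eng (2.530) and pt-eng (4.371) stand out in the table.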
- Language Identification models
For language identification, I used a model based on the fastText paper, which trains quickly on a Google Colab GPU.
model name | model description | test accuracy | validation accuracy | train accuracy | test loss | validation loss | train loss |
---|---|---|---|---|---|---|---|
best-lang-ident-model | identifies which language a sentence belongs to. | 99.22% | 99.00% | 100% | 0.036 | 0.036 | 0.000 |
Language Translation Model (graphql api)
The GraphQL server runs on http://127.0.0.1:3002/graphql.
If you send the following GraphQL mutation:
mutation Translator($input: TranslationInputType!) {
  translate(input: $input) {
    from_
    meta {
      name
      language
      author
      package
      description
      project
    }
    translation
    sent
  }
}
With the following query variables:
{
  "input": {
    "to": "eng",
    "from_": "it",
    "text": "ciao , come stai ?"
  }
}
You will get the following response:
{
  "data": {
    "translate": {
      "from_": "it",
      "meta": {
        "author": "@crispengari",
        "description": "language identification and translation graphql api.",
        "language": "python",
        "name": "ml backend",
        "package": "pytorch",
        "project": "noteme"
      },
      "sent": "ciao , come stai ?",
      "translation": "hello , how are you ? ."
    }
  }
}
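The request above can also be sent from a script. The following is a minimal client sketch using only the Python standard library. The helper names `build_translate_payload` and `post_graphql` are hypothetical, and the sketch assumes the server accepts standard GraphQL-over-HTTP JSON POST requests at the URL given in this README.

```python
import json
import urllib.request

# Endpoint taken from this README; adjust if the server runs elsewhere.
GRAPHQL_URL = "http://127.0.0.1:3002/graphql"

# A shortened version of the mutation above, selecting only three fields.
TRANSLATE_MUTATION = """
mutation Translator($input: TranslationInputType!) {
  translate(input: $input) {
    from_
    translation
    sent
  }
}
"""


def build_translate_payload(text, from_, to):
    """Build the JSON body for the translate mutation."""
    return {
        "query": TRANSLATE_MUTATION,
        "variables": {"input": {"to": to, "from_": from_, "text": text}},
    }


def post_graphql(payload, url=GRAPHQL_URL, timeout=10):
    """POST a GraphQL payload as JSON and return the decoded response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_translate_payload("ciao , come stai ?", from_="it", to="eng")

# With the server running, the translation could then be read like this:
# result = post_graphql(payload)
# print(result["data"]["translate"]["translation"])
```

Building the payload separately from sending it keeps the request easy to inspect or log before it reaches the server.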
Language Identification Model (graphql api)
To identify the language that the text is written in, we run the following mutation on http://127.0.0.1:3002/graphql
mutation Identify($input: IdentificationInputType!) {
  identify(input: $input) {
    probability
    label
    lang
    prediction {
      code
      id
      name
    }
    predictions {
      prediction {
        code
        id
        name
      }
      probability
    }
  }
}
With the following query variables:
{
  "input": {
    "text": "how are you?"
  }
}
To get the following response:
{
  "data": {
    "identify": {
      "label": 0,
      "lang": "eng",
      "prediction": {
        "code": "eng",
        "id": 0,
        "name": "english"
      },
      "predictions": [
        {
          "prediction": {
            "code": "eng",
            "id": 0,
            "name": "english"
          },
          "probability": 1
        },
        {
          "prediction": {
            "code": "swe",
            "id": 1,
            "name": "swedish"
          },
          "probability": 0
        },
        {
          "prediction": {
            "code": "fra",
            "id": 2,
            "name": "french"
          },
          "probability": 0
        },
        {
          "prediction": {
            "code": "deu",
            "id": 3,
            "name": "germany"
          },
          "probability": 0
        },
        {
          "prediction": {
            "code": "ita",
            "id": 4,
            "name": "italian"
          },
          "probability": 0
        },
        {
          "prediction": {
            "code": "por",
            "id": 5,
            "name": "portuguese"
          },
          "probability": 0
        },
        {
          "prediction": {
            "code": "afr",
            "id": 6,
            "name": "afrikaans"
          },
          "probability": 0
        }
      ],
      "probability": 1
    }
  }
}
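On the client side, a response of this shape can be reduced to the single most likely language. The sketch below assumes the response has already been decoded into a Python dictionary. `top_prediction` is a hypothetical helper, and `sample` is a shortened copy of the response above.

```python
def top_prediction(response):
    """Return (name, code, probability) of the most probable language."""
    predictions = response["data"]["identify"]["predictions"]
    best = max(predictions, key=lambda item: item["probability"])
    return best["prediction"]["name"], best["prediction"]["code"], best["probability"]


# Shortened copy of the response shown above (two of the seven entries).
sample = {
    "data": {
        "identify": {
            "label": 0,
            "lang": "eng",
            "predictions": [
                {"prediction": {"code": "eng", "id": 0, "name": "english"}, "probability": 1},
                {"prediction": {"code": "swe", "id": 1, "name": "swedish"}, "probability": 0},
            ],
            "probability": 1,
        }
    }
}

print(top_prediction(sample))  # ('english', 'eng', 1)
```

The top-level `lang` and `probability` fields already carry the same information, so this helper is mainly useful when the full `predictions` list is needed anyway, for example to show runner-up languages.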
Why graphql?
With GraphQL, the client selects only the fields it is interested in. This also gives us the advantage of a single endpoint, for example http://127.0.0.1:3002/graphql, for all the identification and translation models.
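As an illustration of field selection, the sketch below builds an `identify` request that asks only for the fields the caller names. `build_identify_payload` is a hypothetical helper that handles scalar fields only (no nested selections such as `prediction { ... }`). The field names come from the mutation shown earlier in this README.

```python
IDENTIFY_TEMPLATE = """
mutation Identify($input: IdentificationInputType!) {
  identify(input: $input) {
    %s
  }
}
"""


def build_identify_payload(text, fields=("lang", "probability")):
    """Build an identify request that selects only the given scalar fields."""
    selection = "\n    ".join(fields)
    return {
        "query": IDENTIFY_TEMPLATE % selection,
        "variables": {"input": {"text": text}},
    }


small = build_identify_payload("how are you?")
print(small["query"])
```

By GraphQL's selection semantics, a server answering this request should return only `lang` and `probability` under `identify`, without the `predictions` list. The same URL serves both this request and the translation mutation.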
Why language translation?
This project was built to translate simple and complex sentences for 7 different languages. The idea came from the project likeme, where we process a user's caption using PyTorch deep learning models. The following steps are used to preprocess the caption:
- identify the language the caption is written in
- translate the given caption to a certain language.
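The two steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the project's actual code: `process_caption` is a hypothetical function, the `identify` and `translate` arguments stand in for calls to the two mutations, and the mapping between identification codes (such as `ita`) and translation codes (such as `it`) is inferred from the examples in this README rather than confirmed by the source.

```python
# Assumed mapping from the codes in the identification response to the
# codes used in the translation pair names. Inferred, not confirmed.
IDENT_TO_TRANSLATION_CODE = {
    "eng": "eng",
    "swe": "sw",
    "fra": "fr",
    "deu": "de",
    "ita": "it",
    "por": "pt",
    "afr": "af",
}


def process_caption(caption, identify, translate, target="eng"):
    """Identify the caption's language, then translate it to `target`."""
    lang = identify(caption)
    source = IDENT_TO_TRANSLATION_CODE.get(lang)
    if source is None:
        raise ValueError(f"unsupported language code: {lang!r}")
    if source == target:
        return caption  # already in the target language
    # Every listed translation pair has English on one side.
    if "eng" not in (source, target):
        raise ValueError(f"no direct model for {source}-{target}")
    return translate(caption, from_=source, to=target)


# Stub callables so the sketch runs without the server.
def fake_identify(text):
    return "ita"


def fake_translate(text, from_, to):
    return f"[{from_}->{to}] {text}"


print(process_caption("ciao , come stai ?", fake_identify, fake_translate))
```

Passing the two steps in as callables keeps the pipeline independent of how the models are reached, whether through the GraphQL endpoint or by loading the .pt files directly.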
Notebooks
- Translation models
- All the notebooks for the translation models are found here
- Identification model
- The notebook for language identification model is found here