Project Insight
NLP as a Service
Introduction
Project Insight is designed to provide NLP as a service, with a code base for both the front-end GUI (Streamlit) and the backend server (FastAPI), covering the usage of transformer models on various downstream NLP tasks.
The downstream NLP tasks covered:
- News Classification
- Entity Recognition
- Sentiment Analysis
- Summarization
- Information Extraction (To Do)
Users can select different models from the drop-down menu to run inference.
Users can also call the backend FastAPI server directly to run inference from the command line.
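As a rough sketch of command-line inference against the backend, the snippet below builds a JSON POST request with Python's standard library. The host and `/api/v1` prefix come from the documentation links in this README; the `/predict` route and the `text` and `model` payload fields are assumptions made for illustration only, so check each service's docs page for the real route and schema.

```python
import json
import urllib.request

# Host and prefix taken from the documentation links in this README.
BASE_URL = "http://localhost:8080/api/v1"


def build_inference_request(task, text, model):
    """Build (but do not send) a JSON POST request for one NLP service.

    The "/predict" route and the payload field names are hypothetical.
    """
    url = f"{BASE_URL}/{task}/predict"
    payload = json.dumps({"text": text, "model": model}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_inference_request("sentiment", "I love this project", "distilbert")
print(request.get_method(), request.full_url)

# Sending the request requires the backend to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Building the request separately from sending it makes it easy to inspect the URL and body before the backend is up.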
Features of the solution
- Python Code Base: Built using FastAPI and Streamlit, so the complete code base is in Python.
- Expandable: The backend is designed so that it can be expanded with more transformer-based models, which then become available in the front-end app automatically.
- Micro-Services: The backend is designed with a microservices architecture, with a Dockerfile for each service and Nginx as a reverse proxy to each independently running service.
  - This makes it easy to update, maintain, start, and stop individual NLP services.
Installation
- Clone the Repo.
- Run Docker Compose to spin up the FastAPI-based backend service.
- Run the Streamlit app with the streamlit run command.
Setup and Documentation
- Download the models
  - Download the models from here
  - Save them in the specific model folders inside the src_fastapi folder.
- Running the backend service.
  - Go to the src_fastapi folder
  - Run the Docker Compose command

      $ cd src_fastapi
      src_fastapi:~$ sudo docker-compose up -d
- Running the frontend app.
  - Go to the src_streamlit folder
  - Run the app with the streamlit run command

      $ cd src_streamlit
      src_streamlit:~$ streamlit run NLPfily.py
- Access to FastAPI Documentation: Since this is a microservice-based design, every NLP task has its own separate documentation
  - News Classification: http://localhost:8080/api/v1/classification/docs
  - Sentiment Analysis: http://localhost:8080/api/v1/sentiment/docs
  - NER: http://localhost:8080/api/v1/ner/docs
  - Summarization: http://localhost:8080/api/v1/summary/docs
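Once the backend containers are up, a quick way to confirm that each service is reachable is to request its docs page. The following is a minimal sketch using only the standard library; the function name `docs_status` and the injectable `open_url` parameter are illustrative choices, not part of the project.

```python
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8080/api/v1"

# Path segments as they appear in the documentation links above.
SERVICES = {
    "News Classification": "classification",
    "Sentiment Analysis": "sentiment",
    "NER": "ner",
    "Summarization": "summary",
}


def docs_status(open_url=urllib.request.urlopen, timeout=5):
    """Return {task name: bool}, True when the task's docs page answers with HTTP 200."""
    status = {}
    for name, path in SERVICES.items():
        url = f"{BASE_URL}/{path}/docs"
        try:
            with open_url(url, timeout=timeout) as response:
                status[name] = response.status == 200
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or an HTTP error all count as "not available".
            status[name] = False
    return status


# Example usage (requires the backend to be running):
# for name, ok in docs_status().items():
#     print(f"{name}: {'up' if ok else 'down'}")
```

Passing the opener as a parameter keeps the check testable without a running server.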
Project Details
Demonstration
Directory Details
- Front End: Front end code is in the src_streamlit folder, along with the Dockerfile and requirements.txt
- Back End: Back End code is in the src_fastapi folder.
  - This folder contains a directory for each task: Classification, ner, summary, etc.
  - Each NLP task has been implemented as a microservice, with its own FastAPI server, requirements, and Dockerfile, so that each can be independently maintained and managed.
  - Each NLP task has its own folder, and within that folder each trained model has its own folder. For example:
      - sentiment
          > app
              > api
                  > distilbert
                      - model.bin
                      - network.py
                      - tokeniser files
                  > roberta
                      - model.bin
                      - network.py
                      - tokeniser files
  - For each new model under each service, a new folder will have to be added.
  - Each model folder will need the following files:
    - Model bin file.
    - Tokenizer files
    - network.py: Defines the class of the model if a customised model is used.
  - config.json: This file contains the details of the models in the backend and the datasets they are trained on.
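The folder requirements above can be checked with a small script before starting a service. This is a minimal sketch, not part of the project: the helper name `missing_model_files` is hypothetical, the model file is assumed to be recognisable by its `.bin` suffix, and because tokenizer file names vary between tokenizers, the check only confirms that some file other than the `.bin` file and `network.py` is present.

```python
from pathlib import Path


def missing_model_files(model_dir):
    """Return a list of problems found in one model folder (empty if none)."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        return [f"{model_dir} is not a directory"]

    problems = []
    # Assumption: the model weights are stored in a file ending in ".bin".
    if not any(model_dir.glob("*.bin")):
        problems.append("no model .bin file")

    # Tokenizer file names are not fixed, so accept any other file as a tokenizer file.
    others = [
        p for p in model_dir.iterdir()
        if p.suffix != ".bin" and p.name != "network.py"
    ]
    if not others:
        problems.append("no tokenizer files")
    return problems


# Example usage, with the path taken from the directory layout shown above:
# print(missing_model_files("sentiment/app/api/distilbert"))
```

`network.py` is deliberately not required, since it is only needed when a customised model class is used.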
How to Add a new Model
- Fine-tune a transformer model for the specific task. You can leverage the transformers-tutorials
- Save the model files and tokenizer files, and also create a network.py script if using a customized training network.
- Create a directory within the NLP task with directory_name as the model name and save all the files in this directory.
- Update the config.json with the model details and dataset details.
- Update the <service>pro.py with the correct imports and conditions where the model is imported. For example, for a new Bert model in the Classification task, do the following:
  - Create a new directory in classification/app/api/. Directory name: bert.
  - Update config.json with the following:

      "classification": {
          "model-1": {
              "name": "DistilBERT",
              "info": "This model is trained on News Aggregator Dataset from UC Irvin Machine Learning Repository. The news headlines are classified into 4 categories: **Business**, **Science and Technology**, **Entertainment**, **Health**. [New Dataset](https://archive.ics.uci.edu/ml/datasets/News+Aggregator)"
          },
          "model-2": {
              "name": "BERT",
              "info": "Model Info"
          }
      }

  - Update classificationpro.py with the following snippets:

    Only if customized class used

      from classification.bert import BertClass

    Section where the model is selected

      if model == "bert":
          self.model = BertClass()
          self.tokenizer = BertTokenizerFast.from_pretrained(self.path)
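To sanity-check a config.json update, the registered models for a task can be listed with a few lines of Python. This is a minimal sketch: the config is inlined with shortened `info` strings rather than read from the real file, and the helper names `model_names` and `model_key` are hypothetical. The lowercase mapping from display name to folder name mirrors the `BERT` / `bert` example above but is an assumption, not a rule confirmed by the project.

```python
import json

# Inlined stand-in for config.json; "info" strings are shortened for the example.
CONFIG_TEXT = """
{
  "classification": {
    "model-1": {"name": "DistilBERT", "info": "Trained on the News Aggregator Dataset"},
    "model-2": {"name": "BERT", "info": "Model Info"}
  }
}
"""


def model_names(config, task):
    """Return the display names of all models registered for a task."""
    return [entry["name"] for entry in config[task].values()]


def model_key(name):
    """Assumed convention: the folder name and selection key are the lowercased display name."""
    return name.lower()


config = json.loads(CONFIG_TEXT)
for name in model_names(config, "classification"):
    print(name, "->", model_key(name))
```

To run it against the real file, replace `json.loads(CONFIG_TEXT)` with `json.load` on an open handle to config.json.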
License
This project is licensed under the GPL-3.0 License; see the LICENSE.md file for details.