A machine learning model that analyzes text for user sentiment and determines whether it is a positive, neutral, or negative review.

Roberto Sanchez

Last update: Aug 4, 2021

Sentiment Analysis on Yelp's Dataset

Author: Roberto Sanchez, Talent Path: D1 Group

Docker Deployment:

A deployment of this application is hosted on AWS.

Running it locally:

docker pull rsanchez2892/sentiment_analysis_app

Overview

The scope of this capstone is centered around the data processing, exploratory data analysis, and training of a model to predict sentiment on user reviews.

Business Goals

Create a model that can generate sentiment for reviews or comments found on external or internal websites, giving insight into how people feel about certain topics.

This could give the company insights that are not easily available on sites where ratings are required, and it could be used internally to determine sentiment on blogs or comments.

Business Applications

By utilizing this model, the business can use it for the following purposes:

External:

Monitoring Brand and Reputation online
Product Research

Internal:

Customer Support
Customer Feedback
Employee Satisfaction

The current method for achieving this relies on outside resources, which come at a cost and increase the risk of leaking sensitive data to the public. This product bypasses those outside resources and gives the company the ability to do the analysis in house.

Model Deployment

Link: Review Analyzer

After running multiple models and comparing accuracy, I found that the LinearSVC model is a viable candidate to be used in production for analyzing reviews of services or food.

Classification Report / Confusion Matrix:

Technology Stack

I have been using these technologies for this project:

Jupyter Notebook - Version 6.3.0
- Used for most of the data processing, EDA, and model training.
Python - Version 3.8.8
- The main language used for this project.
Scikit-learn - Version 0.24
- Utilizing metrics reports and certain models.
Postgres - Version 13
- Main database application used to store this data.
Flask - Version 1.1.2
- Main backend technology to host a usable version of this project to the public.
GitHub
- Version control and online documentation
Heroku
- Online cloud platform to host this application for public use

Data Processing

This capstone uses the Yelp dataset found on Kaggle, which comprises multiple files:

Business Data
Check-in Data
Review Data
Tips Data
User Data

Stage 1 - Read in From JSON files into Postgres

Overview

Read in JSON files
Made general observations on the features found in each file
Modified feature names to meet Postgres naming conventions
Normalized the data to prepare for import into Postgres
Saved copies of each table as CSV files as a backup in case the database goes down
Exported data into Postgres

As stated above, Kaggle provided several JSON files with a large amount of data that needed to be stored somewhere that allows easy access and quick queries on the fly. As the files were read into Jupyter Notebook, I made general observations about the feature names, the number of features in each file, and their types to see what data I was dealing with. The business data contained an unusual number of attributes, which had to be broken up into separate data frames to be normalized for Postgres.
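The renaming and normalization steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook code from this project: the helper names `to_snake_case` and `split_business_record`, the sample record, and the assumption that the attributes are nested under an `attributes` key are all hypothetical.

```python
import json
import re


def to_snake_case(name: str) -> str:
    """Convert a feature name such as 'BusinessAcceptsCreditCards' to snake_case."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name)           # non-alphanumerics -> underscore
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)  # split camelCase boundaries
    return name.strip("_").lower()


def split_business_record(record: dict):
    """Split one business record into a core row and a separate attributes row."""
    attributes = record.get("attributes") or {}
    core = {to_snake_case(k): v for k, v in record.items() if k != "attributes"}
    attrs = {to_snake_case(k): v for k, v in attributes.items()}
    attrs["business_id"] = record["business_id"]  # key linking the two tables
    return core, attrs


# Hypothetical record, shaped like one line of a JSON-lines file.
line = (
    '{"business_id": "abc123", "name": "Granny Shack", '
    '"attributes": {"WiFi": "free", "BusinessAcceptsCreditCards": "True"}}'
)
core_row, attribute_row = split_business_record(json.loads(line))
```

Each resulting row could then be collected into its own table (for example, one data frame per table) before being written to CSV and exported to Postgres.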

Stage 2 - Pre-Processing Data

Overview

Read in data from Postgres
Identified null values
Removed sparse features
Saved data frame as a pickle to be used in model training

In this stage I performed elementary data analysis: I checked for null values and looked at the distribution of the ratings and review lengths.
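As a rough sketch of the null-value check and sparse-feature removal, assuming rows are plain dictionaries rather than the data frames used in the notebook. The function names and the 50% threshold are hypothetical choices for illustration; the cutoff used in this project is not stated.

```python
def null_fraction(rows, feature):
    """Fraction of rows in which `feature` is missing or None."""
    missing = sum(1 for row in rows if row.get(feature) is None)
    return missing / len(rows)


def drop_sparse_features(rows, max_null_fraction=0.5):
    """Keep only the features whose null fraction is at or below the threshold."""
    features = {key for row in rows for key in row}
    keep = sorted(f for f in features if null_fraction(rows, f) <= max_null_fraction)
    return [{f: row.get(f) for f in keep} for row in rows]


# Made-up rows: 'wifi' is null in two of three rows, so it is dropped.
sample_rows = [
    {"review_stars": 5, "text": "Great place", "wifi": None},
    {"review_stars": 1, "text": "Awful service", "wifi": None},
    {"review_stars": 3, "text": "It was okay", "wifi": "free"},
]
dense_rows = drop_sparse_features(sample_rows)
```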

Stage 3 - Cleaning Up Data

Overview

Replaced contractions with expanded versions
Lemmatized text
Removed special characters, dates, emails, and URLs
Removed stop words
Removed non-English text
Normalized text
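Several of the cleaning steps above can be sketched with regular expressions. This is a simplified illustration, not the project's code: the contraction map and stop-word set are tiny hypothetical stand-ins (the project mentions NLTK stop words elsewhere), and lemmatization and non-English filtering are left out because they need external libraries.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"i've": "i have", "don't": "do not", "wasn't": "was not", "it's": "it is"}
STOP_WORDS = {"i", "a", "an", "the", "is", "it", "was", "at", "and", "have", "to"}


def clean_review(text: str) -> str:
    """Lowercase, expand contractions, strip URLs/emails/special characters, drop stop words."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"\S+@\S+", " ", text)                # emails
    text = re.sub(r"[^a-z\s]", " ", text)               # digits, punctuation, special characters
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)


cleaned = clean_review(
    "I've had it! Visit https://example.com or mail me@example.com, don't wait."
)
```

Note that "not" is kept out of the stop-word set here so that negations survive cleaning; whether the project did the same is not stated.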

Exploratory Data Analysis

Analyzing Null Values in Dataset

Below is a visualization of the data provided by Kaggle showing which features have "NaN" values. It is clear that the review ratings (review_stars) and reviews (text) are fully populated. Some of the business attributes are sparse but have enough values to be useful for other things. Note that several other features were dropped during Data Processing because they did not provide any insight within the scope of this project.

Looking Closer at the Ratings (review_stars)

The ratings shown here come from a sample of 2 million rows out of the original 8 million in the dataset. The distribution is left-skewed: most of the reviews are 4 or 5 stars.

I simplified the ratings to better categorize the sentiment of each review by grouping 1- and 2-star reviews as 'negative', 3-star reviews as 'neutral', and 4- and 5-star reviews as 'positive'.
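The grouping described above amounts to a small mapping function. The name `stars_to_sentiment` is hypothetical; only the thresholds come from the text.

```python
def stars_to_sentiment(stars: int) -> str:
    """Map a 1-5 star rating to a sentiment label."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


labels = [stars_to_sentiment(s) for s in [1, 2, 3, 4, 5]]
```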

Looking Closer at the Reviews (text)

To analyze the text, I calculated the length of each review in the sample and plotted the distribution of the number of characters per review. The median review was approximately 606 characters, with a range of 0 to 5000 characters.

A closer inspection of the range 0 - 2000 shows that most of the reviews fall within it.

To produce a viable word cloud, I processed all of the text in the sample, removing special characters and NLTK stop words, to build a single string for the word cloud. Below is a visualization of the key words found in the positive reviews.

As expected, words like "perfect", "great", "good", "great place", and "highly recommend" came out on top.

In the negative word cloud, words like "bad", "customer service", "never", "horrible", and "awful" appear most often.
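The length statistics can be reproduced along these lines. This is a small sketch on made-up reviews; the function name `length_summary` is hypothetical, and the 2000-character cutoff mirrors the range discussed above.

```python
import statistics


def length_summary(reviews, cutoff=2000):
    """Median, min, and max character length, plus the share of reviews within the cutoff."""
    lengths = [len(review) for review in reviews]
    return {
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "share_within_cutoff": sum(1 for n in lengths if n <= cutoff) / len(lengths),
    }


summary = length_summary(["", "good", "great food"])
```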

Model Training

Model Selection

Four models were chosen to be trained on this data: Logistic Regression, Multinomial NB, Random Forest Classifier, and SVM (later replaced by LinearSVC). Each model was wrapped in a pipeline with TfidfVectorizer.

Model Training

Run a StratifiedKFold with a 5 fold split and analyze the average scores and classification reports
- Get an average accuracy of the model for comparison
Create a single model to generate a confusion matrix
Test out model on a handful of examples
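The steps above might look roughly like this with scikit-learn. It is a minimal sketch on a tiny made-up corpus, not the notebook code: the default TfidfVectorizer and LinearSVC settings, the shuffling, and the `random_state` are assumptions, since the project's actual parameters are not given.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up reviews: five per class, so a 5-fold stratified split is possible.
texts = [
    "worst meal ever", "rude staff and cold food", "awful service", "horrible place", "never coming back",
    "it was okay", "nothing special", "average food", "fine but forgettable", "decent, not great",
    "great place", "highly recommend", "perfect dinner", "amazing food", "friendly staff and good food",
]
class_names = ["negative", "neutral", "positive"]
labels = [name for name in class_names for _ in range(5)]

# Pipeline: TF-IDF features feeding a linear support vector classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC())

# 5-fold stratified cross-validation, averaged into one accuracy figure.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()

# Out-of-fold predictions give a confusion matrix (rows: true, columns: predicted).
matrix = confusion_matrix(labels, cross_val_predict(model, texts, labels, cv=cv), labels=class_names)

# Fit a single model on all data and try it on a new example.
model.fit(texts, labels)
prediction = model.predict(["The staff was rude and the food was awful."])[0]
```

With a corpus this small the scores are not meaningful; the point is only the shape of the workflow.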

Below are the average metrics after running 5-fold cross-validation on LinearSVC.

Testing Model

After the model was trained, I fed it some reviews I found online to test out whether or not the model can properly detect the right sentiment. The following reviews are ordered as "Negative", "Neutral", and "Positive":

# Sample reviews, ordered as negative, neutral, positive
new_test_data = [
    "This was the worst place I've ever eaten at. The staff was rude and did not take my order until after i pulled out my wallet.",
    "The food was alright, nothing special about this place. I would recommend going elsewhere.",
    "I had a pleasent time with kimberly at the granny shack. The food was amazing and very family friendly.",
]
# Assumes `model` is the fitted scikit-learn pipeline, whose method is predict()
res = model.predict(new_test_data)

Below are the results of the prediction. Notice that the neutral review has been labeled as negative. This makes sense, since the model has poor recall for neutral reviews, as shown in the classification report.

End Notes

There are some improvements to be made, such as the following:

Balancing the data
- The confusion matrices for the candidate model and the other models show that predictions skew positive rather than negative or neutral.
- Although scores in the neutral category are poor, the negative and positive predictions matter most for the business applications.
Hyperparameter tuning improvement
- Logistic Regression and Multinomial NB trained within a reasonable time frame while returning reasonable scores. Random Forest Classifier and SVM took a significant amount of time to complete just one iteration, so StratifiedKFold was not used for these two models. Replacing SVM with LinearSVC improved performance dramatically, and LinearSVC outperformed Logistic Regression, which was the original candidate model.
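One possible way to approach the balancing improvement is random undersampling, sketched below with the standard library. This is an illustration of one option, not something this project implemented; the function name `undersample` and the seed are hypothetical.

```python
import random
from collections import Counter, defaultdict


def undersample(texts, labels, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    smallest = min(len(items) for items in by_label.values())
    rng = random.Random(seed)
    balanced = [
        (text, label)
        for label, items in by_label.items()
        for text in rng.sample(items, smallest)
    ]
    rng.shuffle(balanced)  # avoid leaving the rows grouped by class
    return balanced


# Made-up imbalanced data: six positive, two neutral, three negative reviews.
sample_texts = [f"review {i}" for i in range(11)]
sample_labels = ["positive"] * 6 + ["neutral"] * 2 + ["negative"] * 3
balanced_counts = Counter(label for _, label in undersample(sample_texts, sample_labels))
```

Undersampling discards data from the larger classes, so class weighting or oversampling may be worth comparing against it.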
