Twitter COVID-19 Sentiment Analysis

Members: Christopher Bach | Khalid Hamid Fallous | Jay Hirpara | Jing Tang | Graham Thomas | David Wetherhold

Project Overview

This project seeks to identify any correlation between ∆ daily inoculation rates and ∆ twitter sentiment surrounding COVID-19. We chose the pandemic as our topic because of it's societal relevance and implications as an ongoing event.

Data Sources: Twitter | CDC | Kaggle

Analysis Methods

Integrated Database

Extract CSV datasets from data sources (referenced above), transforming and cleaning them with Python, and loading the datasets using Amazon Web Services and PostgreSQL (server/database). This allows us to establish connection with our model, and store static data for use during the project.

Constructed as an Amazon RDS instance:
- Connection Parameter: (covidsentiment.cqciwtn1qpki.us-east-2.rds.amazonaws.com)
- Accessed with a password upon request

Further transformations:

Machine Learning Model

Next, implementing a natural language processing algorithm allows us to gather our sentiment analysis

Machine Learning Libraries: nltk, sklearn
Description of preliminary data preprocessing

Load historical twitter covid vaccine data from kaggle.
Clean tweets with clean_tweet function(regex), tokenize and get ready for text classification. Also, clean up function for removing hashtags, URL's, mentions, and retweets.
Apply Textblob.sentiment.polarity and Textblob.sentiment.subjectivity, ready for sentiment analysis.
Apply analyze_sentiment function on tweet texts to label texts with sentiment range from -1 (negative) to 1(positve).
Plot top 10 words from postivie and negative-resulted words.

Description of preliminary feature engineering and preliminary feature selection, including their decision-making process

Import CountVectorizerfrom sklearn.feature_extraction.text. CountVectorizer is a tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text. The value of each cell is nothing but the count of the word in that particular text sample.
Fit sentiment texts features with vectorizer, and target sentiment column.

Description of how data was split into training and testing sets Splitting into training and testing set so as to evaluate the classifier. The aim is to get an industry standard sample split of 80% train and 20% test.
Explanation of model choice, including limitations and benefits

Naive Bayes classifier is a collection of many algorithms where all the algorithms share one common principle, and that is each feature being classified is not related to any other feature. The algorithm is based on the Bayes theorem and predicts the tag of a text such as a piece of email or newspaper article. It calculates the probability of each tag for a given sample and then gives the tag with the highest probability as output. The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification).Multinomial Naive Bayes algorithm is a probabilistic learning method that is mostly used in Natural Language Processing (NLP).
Multinomial Naive Bayes classification algorithm tends to be a baseline solution for sentiment analysis task. The basic idea of Naive Bayes technique is to find the probabilities of classes assigned to texts by using the joint probabilities of words and classes.
Naive Bayes algorithm is only used for textual data classification and cannot be used to predict numeric values. The result of naive bayes model provide statistical sense by predicting how often that certain words with the sentimental labels appear, which does not necessarily indicate the factual attitudes/sentiments towards covid vaccine, and it does not work with regression because it is not numerical data. One of the benefits of Naive Bayes is that if its assumption of the independence of features holds true, it can perform better than other models and requires much less training data.

Changes of model choice from segment 2 to segment 3

Vader Analysis: VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It uses a list of lexical features (e.g. word) which are labeled as positive or negative according to their semantic orientation to calculate the text sentiment. VADER not only tells about the Positivity and Negativity score but also tells us about how positive or negative a sentiment is. VADER Sentiment Analyzer:
Solution to limitations: We discovered the most common words appeared in our twitter dataset are associated with covid vaccines because we retrieved the data with covid vaccine as search terms. Textblob Polarity is float which lies in the range of [-1,1] where 1 means positive statement and -1 means a negative statement. Subjective sentences generally refer to personal opinion, emotion or judgment whereas objective refers to factual information. Subjectivity is also a float which lies in the range of [0,1]. We are trying to process text classification with another function to get more accurate sentiment labels on the tweet texts.

Changes from segment 3 to segment 4

Added sentiment "NLTK" which is a votes based combined algorithm encompassing multiple natural language processing techniques.

Regression Results

2 Factor Regression

Initial regressions were positive, with an r^2 value of .29

However, the p value for Textblob was very high, so we removed it:

1 Factor Regression

with one factor removed, the r^2 was still .29, but the p value was 0.000, indicating excellent results.

However, these correlations were against cumulative administration rates. We disaggregated the cumulation and re-ran the regression with 2 factors:

2 Factor Regression - Marginal

and the R^2 dropped to close to zero. p-values are corresondingly high.

Dashboard COVID-19 DASHBOARD

A blueprint for the dashboard is created and includes all of the following:
Storyboard on Google Slide(s)
Description of the tool(s) that will be used to create final dashboard
Description of interactive element(s)

Presentation

Selected topic
Why we selected our topic
Description of our source of data
Questions we hope to answer with the data
Description of the data exploration phase of the project
Description of the analysis phase of the project
Limitations and solutions

Challenges and Limitations

Problems

Facebook, Instagram and TikTok were all considered initially, but did not have the necessary data readily available.
Some members ran into issues with gaining Academic Twitter accounts to be able to access the Twitter API.
After gaining access to tweets our original goal of using the location of tweets was not possible due to most tweets not having geotag data
The Twitter API was very limited to the amount of data we could pull. Alternative dataset will be needed.
Group ran into a machine learning natural language paradox, where we noticed an issue within our sentiment analysis. When analyzing tweets for Covid-19 Vaccination sentiment (pro/anti-vaccine) when running into a tweet such as “I hate anti-vaxxers”, this would return a negative sentiment when this person is actually pro-vaccine.
Using academic accounts only allows access back to 7 days of tweets. We could not get twitter's full archive search without having a twitter scholar account.

Solutions

The group decided to use Twitter since it's API was available after submitting applications.
Members had to submit extra information to the Twitter developers platform to qualify for academic research accounts
Due to lack of geodata, the team decided to switch to using twitter sentiment over time, rather than region
The group decided to use a Kaggle Dataset, which provided us with tweets from December 21, 2020.

Unsupervised Language Modeling at scale for robust sentiment classification

** DEPRECATED ** This repo has been deprecated. Please visit Megatron-LM for our up to date Large-scale unsupervised pretraining and finetuning code.

1k Nov 17, 2022

A machine learning model for analyzing text for user sentiment and determine whether its a positive, neutral, or negative review.

Sentiment Analysis on Yelp's Dataset Author: Roberto Sanchez, Talent Path: D1 Group Docker Deployment: Deployment of this application can be found her

0 Aug 4, 2021

The ability of computer software to identify words and phrases in spoken language and convert them to human-readable text

speech-recognition-py Speech recognition is the ability of computer software to identify words and phrases in spoken language and convert them to huma

1 Apr 3, 2022

Sentiment Analysis Project using Count Vectorizer and TF-IDF Vectorizer

Sentiment Analysis Project This project contains two sentiment analysis programs for Hotel Reviews using a Hotel Reviews dataset from Datafiniti. The

0 Mar 28, 2022

An example project using OpenPrompt under pytorch-lightning for prompt-based SST2 sentiment analysis model

pl_prompt_sst An example project using OpenPrompt under the framework of pytorch-lightning for a training prompt-based text classification model on SS

5 Oct 21, 2022

One Stop Anomaly Shop: Anomaly detection using two-phase approach: (a) pre-labeling using statistics, Natural Language Processing and static rules; (b) anomaly scoring using supervised and unsupervised machine learning.

One Stop Anomaly Shop (OSAS) Quick start guide Step 1: Get/build the docker image Option 1: Use precompiled image (might not reflect latest changes):

148 Dec 26, 2022

Jarvis is a simple Chatbot with a GUI capable of chatting and retrieving information and daily news from the internet for it's user.

J.A.R.V.I.S Kindly consider starring this repository if you like the program :-) What/Who is J.A.R.V.I.S? J.A.R.V.I.S is an chatbot written that is bu

50 Dec 31, 2022

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

2.3k Jan 7, 2023

2.1k Feb 17, 2021

This project uses unsupervised machine learning to identify correlations between daily inoculation rates in the USA and twitter sentiment in regards to COVID-19.

Related tags

Overview

Twitter COVID-19 Sentiment Analysis

Project Overview

Analysis Methods

Challenges and Limitations

You might also like...

Unsupervised Language Modeling at scale for robust sentiment classification

A machine learning model for analyzing text for user sentiment and determine whether its a positive, neutral, or negative review.

The ability of computer software to identify words and phrases in spoken language and convert them to human-readable text

Sentiment Analysis Project using Count Vectorizer and TF-IDF Vectorizer

An example project using OpenPrompt under pytorch-lightning for prompt-based SST2 sentiment analysis model

One Stop Anomaly Shop: Anomaly detection using two-phase approach: (a) pre-labeling using statistics, Natural Language Processing and static rules; (b) anomaly scoring using supervised and unsupervised machine learning.

Jarvis is a simple Chatbot with a GUI capable of chatting and retrieving information and daily news from the internet for it's user.

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Owner

VADER Sentiment Analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.

VADER Sentiment Analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.

Twitter bot that uses NLP models to summarize news articles referenced in a user's twitter timeline

☀️ Measuring the accuracy of BBC weather forecasts in Honolulu, USA

Original implementation of the pooling method introduced in "Speaker embeddings by modeling channel-wise correlations"

Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors"

Implementaion of our ACL 2022 paper Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

Twitter Sentiment Analysis using #tag, words and username

Machine Learning Course Project, IMDB movie review sentiment analysis by lstm, cnn, and transformer

Twitter-NLP-Analysis - Twitter Natural Language Processing Analysis