Sentiment analysis on streaming Twitter data using Spark Structured Streaming & Python

Overview

This project is a good starting point for those who have little or no experience with Apache Spark Streaming. We use Twitter data because Twitter provides a developer API that is easy to access. We present an end-to-end architecture that streams data from Twitter, cleans it, and applies a simple sentiment analysis model to detect the polarity and subjectivity of each tweet.

Input data: Live tweets containing a given keyword
Main model: Preprocess the tweets and apply sentiment analysis to them
Output: A Parquet file containing every tweet with its sentiment analysis scores (polarity and subjectivity)

We use Python 3.7.6 and Spark 2.4.7. Be careful about which versions you combine, because each Spark release supports only certain Python versions (Spark 2.4.x, for example, does not support Python 3.8 or later).
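The version pairing above can be checked before anything else runs. The following is a minimal sketch, not part of the original project: the helper names (major_minor, matches_tested_versions, current_python_version) are hypothetical, and the check only compares against the Python 3.7 / Spark 2.4 combination named in this README rather than encoding a full compatibility matrix.

```python
import sys

# Versions taken from this README: Python 3.7.6 and Spark 2.4.7.
TESTED_PYTHON = (3, 7)
TESTED_SPARK = (2, 4)


def major_minor(version):
    """Turn a version string such as '2.4.7' into the tuple (2, 4)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def matches_tested_versions(python_version, spark_version):
    """Return True when both versions match the major.minor pair used here."""
    return (
        major_minor(python_version) == TESTED_PYTHON
        and major_minor(spark_version) == TESTED_SPARK
    )


def current_python_version():
    """Version string of the running interpreter, e.g. '3.7.6'."""
    return "{}.{}.{}".format(*sys.version_info[:3])
```

With pyspark installed, the installed Spark version is available as pyspark.__version__, so the check could be called as matches_tested_versions(current_python_version(), pyspark.__version__).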

Main Libraries

tweepy: interacts with the Twitter Streaming API to create a live data streaming pipeline
pyspark: preprocesses the Twitter data (Spark's Python API)
textblob: applies sentiment analysis to the tweet text

Instructions

First, run Part 1 (twitter_connection.py) and leave it running.
Then, run Part 2 (sentiment_analysis.py) from a different IDE.

Part 1: Send tweets from the Twitter API

In this part, we use our developer credentials to authenticate and connect to the Twitter API. We also create a TCP socket between Twitter's API and Spark; the socket waits for Spark Structured Streaming to connect and then sends it the Twitter data. We use Python's Tweepy library to connect to the Twitter API and fetch the tweets.
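The socket half of this step can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in twitter_connection.py: the function names open_server and send_lines are hypothetical, the default port 5555 is arbitrary, and a plain iterable of strings stands in for the tweets that Tweepy would deliver after authentication.

```python
import socket


def open_server(host="127.0.0.1", port=5555):
    """Create a listening TCP socket that one client (e.g. Spark) can connect to."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def send_lines(server, lines):
    """Block until a client connects, then send each item as one text line."""
    conn, _addr = server.accept()  # waits here until the reader connects
    with conn:
        for line in lines:
            # Spark's socket source treats each newline-terminated line as a
            # record, so collapse any whitespace (including newlines) inside
            # a tweet into single spaces before sending it.
            payload = " ".join(line.split()) + "\n"
            conn.sendall(payload.encode("utf-8"))
```

Calling send_lines(open_server(), tweets) blocks until the Spark job from Part 2 connects, which is why Part 1 has to be started first and left running.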

Part 2: Tweet preprocessing and sentiment analysis

In this part, we receive the data from the TCP socket and preprocess it with pyspark, the Python API for Spark. Then, we apply sentiment analysis using textblob, a Python library for processing textual data. Finally, we save each tweet and its sentiment analysis scores in Parquet, a columnar data storage format.
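The steps in this part can be sketched as follows. This is a hedged outline rather than the contents of sentiment_analysis.py: the function names, the specific cleaning rules (dropping URLs, the RT marker, @mentions and '#' signs), and the output column names are all assumptions made for illustration. The pyspark and textblob imports are placed inside the functions so that the cleaning helper can be tried without either library installed.

```python
import re


def clean_tweet(text):
    """Remove URLs, the RT marker, @mentions and '#' signs; collapse whitespace."""
    text = re.sub(r"http\S+", "", text)   # links
    text = re.sub(r"\bRT\b", "", text)    # retweet marker
    text = re.sub(r"@\w+:?", "", text)    # mentions, with a trailing colon if any
    text = text.replace("#", "")          # keep the hashtag word, drop the sign
    return re.sub(r"\s+", " ", text).strip()


def sentiment_scores(text):
    """Return (polarity, subjectivity) for the text using TextBlob."""
    from textblob import TextBlob  # imported lazily; needs textblob installed

    sentiment = TextBlob(text).sentiment
    return float(sentiment.polarity), float(sentiment.subjectivity)


def start_sentiment_query(spark, host, port, output_dir, checkpoint_dir):
    """Read lines from the TCP socket, score them, and append them to Parquet."""
    from pyspark.sql import functions as F  # imported lazily; needs pyspark
    from pyspark.sql.types import FloatType, StringType

    clean_udf = F.udf(clean_tweet, StringType())
    polarity_udf = F.udf(lambda t: sentiment_scores(t)[0], FloatType())
    subjectivity_udf = F.udf(lambda t: sentiment_scores(t)[1], FloatType())

    # The socket source yields one row per received line in a column named "value".
    lines = (
        spark.readStream.format("socket")
        .option("host", host)
        .option("port", port)
        .load()
    )
    scored = (
        lines.withColumn("text", clean_udf(F.col("value")))
        .withColumn("polarity", polarity_udf(F.col("text")))
        .withColumn("subjectivity", subjectivity_udf(F.col("text")))
    )
    # The Parquet file sink works in append mode and needs a checkpoint location.
    return (
        scored.writeStream.format("parquet")
        .outputMode("append")
        .option("path", output_dir)
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

Given a SparkSession named spark, the query could be started with start_sentiment_query(spark, "127.0.0.1", 5555, "parquet_out", "checkpoint") and kept alive by calling awaitTermination() on the returned query; the host, port, and directory names here are placeholders.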

Owner

Himanshu Kumar singh