Using Streaming Twitter Data with Kafka and Spark
Reading streams of Twitter data, publishing them to a Kafka topic, and processing messages using the Kafka Streams API and Spark Streaming
Make sure that your VPN is switched on so that you can use Twitter; in some countries Twitter is blocked.
Moreover, you should have your own consumer_key, consumer_secret, and access_token with its secret inside the config.py
file
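For reference, config.py could look like the following (placeholder values; the exact variable names are an assumption, so match whatever kafka_producer.py imports):

```python
# config.py -- Twitter API credentials (placeholders; never commit real keys)
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"
```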
- Create an environment using conda with Python 3.8:
conda create -n python38 python=3.8
conda activate python38
- Check the requirements inside
requirements.txt
and install them using conda:
conda install -c conda-forge tweepy==4.4.0
conda install -c conda-forge kafka-python==2.0.2
- Kafka should be installed on your machine; check the documentation for installation. If you use brew on a Mac you can run
brew install kafka
- Start ZooKeeper (port 2181):
zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties
- In another terminal window start the broker (port 9092):
kafka-server-start /usr/local/etc/kafka/server.properties
- In a terminal window, list the topics you have:
kafka-topics --list --bootstrap-server localhost:9092
- Create the Kafka topic "tweeter" with 1 partition and no replication, because we use a local machine:
kafka-topics --create --topic tweeter --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
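If you prefer Python over the CLI, the same topic can be created with kafka-python's KafkaAdminClient. A rough sketch (the helper names here are illustrative, not part of this repo):

```python
def topic_spec(name, partitions=1, replication=1):
    """Single-node defaults matching the CLI command above."""
    return {"name": name, "num_partitions": partitions,
            "replication_factor": replication}

def create_topic(bootstrap="localhost:9092"):
    # Imported here so the helper above can be used without a broker running.
    from kafka.admin import KafkaAdminClient, NewTopic
    spec = topic_spec("tweeter")
    admin = KafkaAdminClient(bootstrap_servers=bootstrap)
    admin.create_topics([NewTopic(spec["name"], spec["num_partitions"],
                                  spec["replication_factor"])])
    admin.close()

# create_topic()  # uncomment with a broker running on localhost:9092
```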
- Now list the topics you have again:
kafka-topics --list --bootstrap-server localhost:9092
- Let's see what we have inside the "tweeter" topic:
kafka-console-consumer --bootstrap-server localhost:9092 --topic tweeter --from-beginning
(for now, absolutely nothing), but when we start streaming, data will be generated.
- Now run
python kafka_producer.py
to start streaming Twitter data and pushing messages to the topic.
- And now check that the data is inside the topic with:
kafka-console-consumer --bootstrap-server localhost:9092 --topic tweeter --from-beginning
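As a rough illustration of what a script like kafka_producer.py typically does (the class and helper names below are assumptions, not the repo's actual code), it might look like:

```python
import json

def to_message(raw_data):
    """Keep only the fields we care about and encode them for Kafka."""
    tweet = json.loads(raw_data)
    payload = {"id": tweet.get("id_str"), "text": tweet.get("text")}
    return json.dumps(payload).encode("utf-8")

def run():
    # Third-party imports kept local so to_message() works without
    # a broker or Twitter credentials.
    import tweepy
    from kafka import KafkaProducer
    import config  # your consumer_key, consumer_secret, access_token, access_token_secret

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    class TweetStream(tweepy.Stream):
        def on_data(self, raw_data):
            # Publish every incoming tweet to the "tweeter" topic.
            producer.send("tweeter", to_message(raw_data))
            return True

    stream = TweetStream(config.consumer_key, config.consumer_secret,
                         config.access_token, config.access_token_secret)
    stream.filter(track=["kafka"])  # hypothetical filter keyword

# run()  # uncomment to start streaming (needs a running broker and valid credentials)
```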
- Congrats! You have done it!
So what's next?
You can use the generated data with Kafka Streams and Spark Streaming, and practice more!
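As a starting point for the Spark side, a minimal Structured Streaming reader for the "tweeter" topic might look like this (a sketch, not part of this repo; it assumes pyspark plus the spark-sql-kafka connector package):

```python
# Options for spark.readStream.format("kafka"); these keys are Spark's
# documented Kafka source options.
KAFKA_OPTIONS = {
    "kafka.bootstrap.servers": "localhost:9092",
    "subscribe": "tweeter",
    "startingOffsets": "earliest",
}

def stream_tweets():
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("tweeter-stream").getOrCreate()
    df = spark.readStream.format("kafka").options(**KAFKA_OPTIONS).load()
    # Kafka values arrive as bytes; cast to string to see the tweet JSON.
    query = (df.selectExpr("CAST(value AS STRING)")
               .writeStream.format("console").start())
    query.awaitTermination()

# stream_tweets()  # needs pyspark and the spark-sql-kafka connector (via --packages)
```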