Using Streaming Twitter Data with Kafka and Spark
Reading streams of Twitter data, publishing them to a Kafka topic, and processing messages using the Kafka Streams API and Spark Streaming
Make sure that your VPN is switched on so that you can use Twitter; in some countries Twitter is blocked.
Moreover, you should have your own consumer_key, consumer_secret, and access_token with its secret inside the config.py
file
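For reference, config.py could look like the following (placeholder values; the exact variable names are an assumption, so match whatever kafka_producer.py imports):

```python
# config.py -- Twitter API credentials (placeholders; never commit real keys)
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"
```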
- Create an environment using conda with Python 3.8:
conda create -n python38 python=3.8
conda activate python38
- Check the requirements inside
requirements.txt
and install them using conda:
conda install -c conda-forge tweepy==4.4.0
conda install -c conda-forge kafka-python==2.0.2
- Kafka should be installed on your machine; check the documentation for installation. If you use brew on a Mac you can run
brew install kafka
- Start ZooKeeper (port 2181):
zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties
- In another terminal window start the broker (port 9092):
kafka-server-start /usr/local/etc/kafka/server.properties
- In a terminal window, list the topics you have:
kafka-topics --list --bootstrap-server localhost:9092
- Create the Kafka topic "tweeter" with 1 partition and no replication, because we use a local machine:
kafka-topics --create --topic tweeter --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
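If you prefer Python over the CLI, the same topic can be created with kafka-python's KafkaAdminClient. A rough sketch (the helper names here are illustrative, not part of this repo):

```python
def topic_spec(name, partitions=1, replication=1):
    """Single-node defaults matching the CLI command above."""
    return {"name": name, "num_partitions": partitions,
            "replication_factor": replication}

def create_topic(bootstrap="localhost:9092"):
    # Imported here so the helper above can be used without a broker running.
    from kafka.admin import KafkaAdminClient, NewTopic
    spec = topic_spec("tweeter")
    admin = KafkaAdminClient(bootstrap_servers=bootstrap)
    admin.create_topics([NewTopic(spec["name"], spec["num_partitions"],
                                  spec["replication_factor"])])
    admin.close()

# create_topic()  # uncomment with a broker running on localhost:9092
```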
- Now list the topics you have again:
kafka-topics --list --bootstrap-server localhost:9092
- Let's see what we have inside the "tweeter" topic:
kafka-console-consumer --bootstrap-server localhost:9092 --topic tweeter --from-beginning
(for now, absolutely nothing), but when we start streaming, data will be generated.
- Now run
python kafka_producer.py
to start streaming Twitter data and pushing messages to the topic.
- And now check that the data is inside the topic with:
kafka-console-consumer --bootstrap-server localhost:9092 --topic tweeter --from-beginning
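As a rough illustration of what a script like kafka_producer.py typically does (the class and helper names below are assumptions, not the repo's actual code), it might look like:

```python
import json

def to_message(raw_data):
    """Keep only the fields we care about and encode them for Kafka."""
    tweet = json.loads(raw_data)
    payload = {"id": tweet.get("id_str"), "text": tweet.get("text")}
    return json.dumps(payload).encode("utf-8")

def run():
    # Third-party imports kept local so to_message() works without
    # a broker or Twitter credentials.
    import tweepy
    from kafka import KafkaProducer
    import config  # your consumer_key, consumer_secret, access_token, access_token_secret

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    class TweetStream(tweepy.Stream):
        def on_data(self, raw_data):
            # Publish every incoming tweet to the "tweeter" topic.
            producer.send("tweeter", to_message(raw_data))
            return True

    stream = TweetStream(config.consumer_key, config.consumer_secret,
                         config.access_token, config.access_token_secret)
    stream.filter(track=["kafka"])  # hypothetical filter keyword

# run()  # uncomment to start streaming (needs a running broker and valid credentials)
```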
- Congrats! You have done it!
So what's next?
You can use the generated data with Kafka Streams and Spark Streaming, and practice more!
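As a starting point for the Spark side, a minimal Structured Streaming reader for the "tweeter" topic might look like this (a sketch, not part of this repo; it assumes pyspark plus the spark-sql-kafka connector package):

```python
# Options for spark.readStream.format("kafka"); these keys are Spark's
# documented Kafka source options.
KAFKA_OPTIONS = {
    "kafka.bootstrap.servers": "localhost:9092",
    "subscribe": "tweeter",
    "startingOffsets": "earliest",
}

def stream_tweets():
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("tweeter-stream").getOrCreate()
    df = spark.readStream.format("kafka").options(**KAFKA_OPTIONS).load()
    # Kafka values arrive as bytes; cast to string to see the tweet JSON.
    query = (df.selectExpr("CAST(value AS STRING)")
               .writeStream.format("console").start())
    query.awaitTermination()

# stream_tweets()  # needs pyspark and the spark-sql-kafka connector (via --packages)
```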