Covid-datapipeline-using-pyspark-and-mysql
A COVID data pipeline built with PySpark and MySQL that fetches data from an API, applies processing, and stores the summarized results in a MySQL database.
Tools used: PySpark, MySQL
Procedure
- Fetch the latest data from the API using Python's requests and pandas modules.
- Apply data processing and filtering to generate summarized information.
- Store the summarized information in a MySQL database.
PySpark is used to build the pipeline described above; a minimal sketch is shown below.
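The snippet that follows is only a rough sketch of such a pipeline. The API URL, column names, database/table names, and config keys are assumptions made for illustration; the actual values live in the repo's code and in covid-config.json.

```python
# Minimal pipeline sketch -- API URL, column names, database/table names,
# and config keys below are assumptions, not the repo's actual values.
import json

import pandas as pd
import requests
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("covid-data-pipeline")
    # Pull the MySQL JDBC connector onto the classpath (version is an assumption).
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.33")
    .getOrCreate()
)

# 1. Fetch the latest data from the API (placeholder URL).
response = requests.get("https://covid-api.example.com/latest", timeout=30)
pdf = pd.DataFrame(response.json())

# 2. Process and filter with PySpark to build summarized information
#    (grouping and aggregation columns are illustrative).
df = spark.createDataFrame(pdf)
summary = (
    df.groupBy("country")
      .agg(F.sum("confirmed").alias("total_confirmed"),
           F.sum("deaths").alias("total_deaths"))
)

# 3. Store the summary in MySQL over JDBC, using credentials from covid-config.json.
with open("covid-config.json") as f:
    cfg = json.load(f)

(summary.write
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/covid_db")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "covid_summary")
    .option("user", cfg["username"])
    .option("password", cfg["password"])
    .mode("append")
    .save())
```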
Important
Before moving on to the execution part, please read the notes below.
- Use the correct connector and driver name when connecting to the MySQL database; if you use a different database, the procedure may differ.
- Change the login credentials (username & password) in covid-config.json.
- Make sure the referenced database and table have already been created (see the sketch after this list).
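For reference, here is roughly what these notes translate to in code. The key names in covid-config.json, the database/table names, and the table schema are assumptions; check the repo's code for the exact values.

```python
# Illustrative only -- config keys, database name, and table schema are assumptions.
import json

# Expected shape of covid-config.json (key names assumed):
#   {"username": "root", "password": "your-password"}
with open("covid-config.json") as f:
    cfg = json.load(f)

# MySQL-specific JDBC settings; a different database would need a different
# connector jar, JDBC URL prefix, and driver class name.
MYSQL_DRIVER = "com.mysql.cj.jdbc.Driver"
MYSQL_URL = "jdbc:mysql://localhost:3306/covid_db"

# The database and table written to by the pipeline must already exist, e.g.:
#   CREATE DATABASE covid_db;
#   CREATE TABLE covid_summary (
#       country         VARCHAR(100),
#       total_confirmed BIGINT,
#       total_deaths    BIGINT
#   );
```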
How to use
- Clone the Covid-datapipeline-using-pyspark-and-mysql repo.
- Start the MySQL server.
- Execute the following command:
python main.py
Results:
Command-line output:
Database status after execution: