# Data Warehouse on AWS Redshift

## ETL Pipeline in AWS Redshift and S3
## Project Summary
In this project, I have built an ETL pipeline for a large data set from a fictitious music streaming service named Sparkify. The ETL process flows from AWS S3 into staging tables in AWS Redshift. I then query the staged data into analytics tables, which will help Sparkify's analytics team get quicker insights about its customer base.
## File Descriptions
- `create_tables.py`: creates the fact and dimension tables for the star schema in Redshift.
- `sql_queries.py`: defines the SQL statements, which are then imported into the other files.
- `etl.py`: loads data from S3 into staging tables on Redshift, and then processes that data into analytics tables on Redshift.
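
To make the division of labor concrete, here is a minimal sketch of how `etl.py` might wire these pieces together. The query-list names (`copy_table_queries`, `insert_table_queries`) and the order of keys in the `[CLUSTER]` config section are assumptions for illustration; the actual names live in `sql_queries.py` and `dwh.cfg`.

```python
# etl.py (illustrative sketch; actual query-list names may differ)
import configparser
import psycopg2

from sql_queries import copy_table_queries, insert_table_queries  # hypothetical names


def load_staging_tables(cur, conn):
    """COPY the raw S3 data into the Redshift staging tables."""
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()


def insert_tables(cur, conn):
    """Transform staged rows into the star-schema analytics tables."""
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    # Read cluster credentials from dwh.cfg (which is never committed to git).
    config = configparser.ConfigParser()
    config.read("dwh.cfg")

    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(*config["CLUSTER"].values())
    )
    cur = conn.cursor()

    load_staging_tables(cur, conn)
    insert_tables(cur, conn)

    conn.close()


if __name__ == "__main__":
    main()
```

`create_tables.py` presumably follows the same pattern, executing the drop/create query lists from `sql_queries.py` instead.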
## Design Decisions
### Star Schema

A star schema is used, with a fact table at the center and dimension tables at its periphery.
**Fact table:**

- `songplays`: every occurrence of a song being played is stored here.

**Dimension tables:**

- `users`: the users of the Sparkify music streaming app
- `songs`: the songs in Sparkify's music catalog
- `artists`: the artists who record the catalog's songs
- `time`: the timestamps of records in `songplays`, broken down into specific date and time units (year, day, hour, etc.)
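
For concreteness, the statements below sketch what the fact table and one dimension table could look like as they might appear in `sql_queries.py`. The exact column names, types, and distribution/sort keys are illustrative assumptions, not the project's definitive schema.

```python
# sql_queries.py (illustrative excerpt; columns and dist/sort keys are assumptions)

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id BIGINT IDENTITY(0, 1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL SORTKEY,
    user_id     INTEGER   NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR DISTKEY,
    artist_id   VARCHAR,
    session_id  INTEGER,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INTEGER PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
) DISTSTYLE ALL;
"""
```

One reason to consider `DISTSTYLE ALL` on a small dimension table is that replicating it to every node avoids shuffling data when it is joined against the large fact table.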
## Run Instructions
- Clone this repository, which will place the 3 `.py` files and the `.cfg` file into the same directory.
- Duplicate the `dwh_template.cfg` file to create a new file named `dwh.cfg`. Because this will contain private login credentials, be sure it is added to the `.gitignore` file.
- Fill in the `[CLUSTER]` and `[IAM_ROLE]` attributes from AWS, according to the IAM role and Redshift cluster already created (see the configuration sketch after this list). Please consult AWS's well-documented instructions as necessary.
- Run `python create_tables.py` to set up the tables in the Redshift data warehouse cluster.
- Run `python etl.py`. This will copy the two large data sets from S3 into staging tables. After that, it will populate the fact and dimension tables (see the `COPY` sketch after this list).
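
The section and key names below are a guess at what `dwh_template.cfg` defines (the true keys are whatever the template contains); this small optional check simply illustrates how the `[CLUSTER]` and `[IAM_ROLE]` sections are consumed via `configparser`.

```python
# check_config.py (optional helper; section/key names in the comment are assumptions)
import configparser

config = configparser.ConfigParser()
config.read("dwh.cfg")

# A filled-in dwh.cfg is expected to look roughly like:
#
#   [CLUSTER]
#   HOST=<redshift-cluster-endpoint>
#   DB_NAME=<database-name>
#   DB_USER=<database-user>
#   DB_PASSWORD=<database-password>
#   DB_PORT=5439
#
#   [IAM_ROLE]
#   ARN=<iam-role-arn-with-s3-read-access>

for section in ("CLUSTER", "IAM_ROLE"):
    missing = [key for key, value in config[section].items() if not value]
    if missing:
        print(f"[{section}] still needs values for: {', '.join(missing)}")
    else:
        print(f"[{section}] looks complete.")
```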
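
Finally, a sketch of the kind of statements `etl.py` issues in its two phases: a Redshift `COPY` from S3 into a staging table using the IAM role, followed by an `INSERT ... SELECT` into an analytics table. The bucket paths, table names, and columns here are placeholders, not the project's actual values.

```python
# Illustrative statements only; bucket paths, table names, and columns are placeholders.

staging_events_copy = """
COPY staging_events
FROM 's3://<bucket>/<log-data-prefix>'
IAM_ROLE '<iam-role-arn>'
FORMAT AS JSON 's3://<bucket>/<jsonpath-file>'
REGION '<aws-region>';
"""

user_table_insert = """
INSERT INTO users (user_id, first_name, last_name, gender, level)
SELECT DISTINCT userId, firstName, lastName, gender, level
FROM staging_events
WHERE userId IS NOT NULL;
"""
```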