Pyspark Spotify ETL

Last update: Jun 9, 2022

Description

This is my first Data Engineering project. It extracts the user's recently played tracks from Spotify's API, transforms the data, and loads it into PostgreSQL through a SQLAlchemy engine. The data is shown as a Spark DataFrame before loading, and the whole ETL job is scheduled with crontab. The access token never has to be replaced by hand: the script starts with an HTTP POST to Spotify's token API, which returns a fresh one on every run.
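The token refresh at the start of the script can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it uses only the standard library, and the function names and placeholder credentials are hypothetical. The endpoint and form fields follow Spotify's refresh-token flow.

```python
import base64
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://accounts.spotify.com/api/token"


def build_refresh_request(client_id, client_secret, refresh_token):
    """Build (but do not send) the POST that trades a refresh token for an access token."""
    # Client credentials go in a Basic auth header, base64-encoded as "id:secret".
    credentials = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    # The body is form-encoded, not JSON.
    body = urllib.parse.urlencode(
        {"grant_type": "refresh_token", "refresh_token": refresh_token}
    ).encode()
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def fetch_access_token(request):
    """Send the request and return the short-lived access token (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]


# Placeholder values; real ones come from your Spotify developer app.
request = build_refresh_request("my-id", "my-secret", "my-refresh-token")
```

Calling fetch_access_token(request) at the top of each run would give the rest of the job a valid token without any manual step.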

The goal is to help those who, like me, want to become Data Engineers create their first project.

Essentials

Extra libraries that must be imported: sys, json, datetime.
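As one illustration of where json and datetime might come in, here is a minimal sketch of the transform step. The field names (items, track, artists, played_at) follow Spotify's recently-played response; the function names, output column names and sample payload are made up for this example and are not taken from the project.

```python
import json
from datetime import datetime, timedelta, timezone


def yesterday_unix_ms(now):
    """Unix timestamp in milliseconds for 24 hours before `now`.

    The recently-played endpoint accepts such a value in its `after` parameter.
    """
    return int((now - timedelta(days=1)).timestamp() * 1000)


def flatten_recently_played(payload):
    """Turn the nested API response into one flat dict per played track."""
    rows = []
    for item in payload.get("items", []):
        track = item["track"]
        rows.append(
            {
                "song_name": track["name"],
                "artist_name": track["artists"][0]["name"],  # first listed artist only
                "played_at": item["played_at"],
                "played_date": item["played_at"][:10],  # YYYY-MM-DD prefix
            }
        )
    return rows


# Made-up payload with the same shape as the API response.
sample = json.loads(
    """
    {"items": [
        {"track": {"name": "Song A", "artists": [{"name": "Artist X"}]},
         "played_at": "2022-06-08T21:15:30.123Z"}
    ]}
    """
)
rows = flatten_recently_played(sample)
```

A list of flat dicts like rows is the kind of input that can be handed to spark.createDataFrame to display the data before loading.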

ETL Execution

Install all the necessary libraries from the Pipfile.
Read the "Token_request_instructions" to get your own refresh token. If you'd rather skip that, you can get a temporary token from https://developer.spotify.com/console/get-recently-played/, but it has to be replaced every hour.
Add your PostgreSQL credentials to the engine variable. If you're using another RDBMS, see https://docs.sqlalchemy.org/en/14/core/engines.html.
Create the SQL database/table (optional).
Create a bash file. This file is where you'll write down the paths to Spark, Python and your script. Without it, you'll get a "ModuleNotFoundError" for each module you import inside your script. (Think of this as the ETL's own ~/.bash_profile.)
Create a new crontab, or use the existing one, if you want the job to run at midnight every day (schedule "0 0 * * *").
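For the credentials step, the connection string passed to the engine could be assembled along these lines. This is a sketch under assumptions: the helper name and the example user, password and database are hypothetical, and the password is URL-encoded because characters such as "@" or "/" would otherwise break the URL.

```python
from urllib.parse import quote_plus


def postgres_url(user, password, host, port, database):
    """Build a SQLAlchemy-style PostgreSQL URL with user and password safely escaped."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Example values only; replace them with your own credentials.
url = postgres_url("etl_user", "p@ss/word", "localhost", 5432, "spotify")

# With SQLAlchemy installed, the URL would then be used as:
#   from sqlalchemy import create_engine
#   engine = create_engine(url)
```

Keeping the URL construction in one small function makes it easy to switch to another RDBMS by changing only the scheme prefix.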

Extras

To verify that your scheduled job is working, you can change the crontab schedule to "* * * * *", which runs it every minute.
See https://developer.spotify.com/documentation/general/guides/scopes/ for other Spotify scopes in case you don't want to use "recently played tracks".
Thank you, Karolina Sowinska, for your DE Beginners guide.
