A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Ankur Chavda

Last update: Dec 30, 2022

Related tags

Miscellaneous python airflow kafka spark gcp data-engineering dbt

Overview

Streamify

A data pipeline with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Description

Objective

The project will stream events generated from a fake music streaming service (like Spotify) and create a data pipeline that consumes the real-time data. The data coming in would be similar to an event of a user listening to a song, navigating on the website, authenticating. The data would be processed in real-time and stored to the data lake periodically (every two minutes). The hourly batch job will then consume this data, apply transformations, and create the desired tables for our dashboard to generate analytics. We will try to analyze metrics like popular songs, active users, user demographics etc.

Dataset

Eventsim is a program that generates event data to replicate page requests for a fake music web site. The results look like real use data, but are totally fake. The docker image is borrowed from viirya's fork of it, as the original project has gone without maintenance for a few years now.

Eventsim uses song data from Million Songs Dataset to generate events. I have used a subset of 10000 songs.

Tools & Technologies

Cloud - Google Cloud Platform
Infrastructure as Code software - Terraform
Containerization - Docker, Docker Compose
Stream Processing - Kafka, Spark Streaming
Orchestration - Airflow
Transformation - dbt
Data Lake - Google Cloud Storage
Data Warehouse - BigQuery
Data Visualization - Data Studio
Language - Python

Architecture

Final Result

Setup

WARNING: You will be charged for all the infra setup. You can avail 300$ in credit by creating a new account on GCP.

Pre-requisites

If you already have a Google Cloud account and a working terraform setup, you can skip the pre-requisite steps.

Google Cloud Platform.
- GCP Account and Access Setup
- gcloud alternate installation method - Windows
Terraform
- Setup Terraform

Get Going!

A video walkthrough of how I run my project - YouTube Video

Procure infra on GCP with Terraform - Setup
(Extra) SSH into your VMs, Forward Ports - Setup
Setup Kafka Compute Instance and start sending messages from Eventsim - Setup
Setup Spark Cluster for stream processing - Setup
Setup Airflow on Compute Instance to trigger the hourly data pipeline - Setup

Debug

If you run into issues, see if you find something in this debug guide.

How can I make this better?!

A lot can still be done :).

Choose managed Infra
- Cloud Composer for Airflow
- Confluent Cloud for Kafka
Create your own VPC network
Build dimensions and facts incrementally instead of full refresh
Write data quality tests
Create dimensional models for additional business processes
Include CI/CD
Add more visualizations

Special Mentions

I'd like to thank the DataTalks.Club for offering this Data Engineering course for completely free. All the things I learnt there, enabled me to come up with this project. If you want to upskill on Data Engineering technologies, please check out the course. :)

Comments

Error creating Dataproc cluster

Hi, in the last step: Terraform apply I have this error:

Error: Error creating Dataproc cluster: googleapi: Error 400: User not authorized to act as service account '[email protected]'. To act as a service account, user must have one of [Owner, Editor, Service Account Actor] roles. See https://cloud.google.com/iam/docs/understanding-service-accounts for additional details., badRequest

I tried many things to solve but I don't know what is the problem, please help.

opened by rafical 1
The number of topics in kafka control center did not increase

After creating docker container for the kafka instance and running the script to send message to it with the eventsim, the number of topics in the control center (9021) would not increase. I remained at 40.

Notice the warning at the top. What could that mean please?

opened by topefolorunso 1

Error while provisioning spark cluster with terraform

I used e2-standard-2 for the kafka and airflow instances I then used e2-medium for both master and worker in the spark cluster Apparently, the kafka and airflow instances were successful.

google_dataproc_cluster.mulitnode_spark_cluster: Creating...
╷
│ Error: Error creating Dataproc cluster: googleapi: Error 400: Multiple validation errors:
│  - Insufficient 'CPUS' quota. Requested 6.0, available 4.0.
│  - Insufficient 'IN_USE_ADDRESSES' quota. Requested 3.0, available 2.0.
│  - This request exceeds CPU quota. Some things to try: request fewer workers (a minimum of 2 is required), 
use smaller master and/or worker machine types (such as n1-standard-2)., badRequest
│
│   with google_dataproc_cluster.mulitnode_spark_cluster,
│   on main.tf line 93, in resource "google_dataproc_cluster" "mulitnode_spark_cluster":
│   93: resource "google_dataproc_cluster" "mulitnode_spark_cluster" {
│

opened by topefolorunso 3

I have an error when I launch the terraform apply

│ Error: Invalid value for network: project: required field is not set │ │ with google_compute_firewall.rules, │ on main.tf line 19, in resource "google_compute_firewall" "port_rules": │ 19: resource "google_compute_firewall" "port_rules" { │ ╵ ╷ │ Error: project: required field is not set │ │ with google_compute_instance.kafka_vm_instance, │ on main.tf line 35, in resource "google_compute_instance" "kafka_vm_instance": │ 35: resource "google_compute_instance" "kafka_vm_instance" { │ ╵ ╷ │ Error: project: required field is not set │ │ with google_compute_instance.airflow_vm_instance, │ on main.tf line 56, in resource "google_compute_instance" "airflow_vm_instance": │ 56: resource "google_compute_instance" "airflow_vm_instance" { │ ╵ ╷ │ Error: project: required field is not set │ │ with google_storage_bucket.bucket, │ on main.tf line 75, in resource "google_storage_bucket" "bucket": │ 75: resource "google_storage_bucket" "bucket" { │ ╵ ╷ │ Error: project: required field is not set │ │ with google_dataproc_cluster.mulitnode_spark_cluster, │ on main.tf line 93, in resource "google_dataproc_cluster" "mulitnode_spark_cluster": │ 93: resource "google_dataproc_cluster" "mulitnode_spark_cluster" { │ ╵ ╷ │ Error: project: required field is not set │ │ with google_bigquery_dataset.stg_dataset, │ on main.tf line 139, in resource "google_bigquery_dataset" "stg_dataset": │ 139: resource "google_bigquery_dataset" "stg_dataset" { │ ╵ ╷ │ Error: project: required field is not set │ │ with google_bigquery_dataset.prod_dataset, │ on main.tf line 146, in resource "google_bigquery_dataset" "prod_dataset": │ 146: resource "google_bigquery_dataset" "prod_dataset" { │

opened by mbiombani 6
Command for spark-submit
I have a question regarding the command for spark-submit:

spark-submit \ --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \ stream_all_events.py

What is the meaning of the packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2?
opened by stephenllh 27

Owner

Ankur Chavda

Data Engineer-ish

GitHub

This is a practice on Airflow, which is building virtual env, installing Airflow and constructing data pipeline (DAGs)

airflow-test This is a practice on Airflow, which is Builing virtualbox env and setting Airflow on that env Installing Airflow using python virtual en

1 Nov 1, 2021

An Airflow operator to call the main function from the dbt-core Python package

airflow-dbt-python An Airflow operator to call the main function from the dbt-core Python package Motivation Airflow running in a managed environment

93 Jan 8, 2023

Yet another Airflow plugin using CLI command as RESTful api, supports Airflow v2.X.

中文版文档 Airflow Extended API Plugin Airflow Extended API, which export airflow CLI command as REST-ful API to extend the ability of airflow official API

106 Nov 9, 2022

Project repository of Apache Airflow, deployed on Docker in Amazon EC2 via GitLab.

Airflow on Docker in EC2 + GitLab's CI/CD Personal project for simple data pipeline using Airflow. Airflow will be installed inside Docker container,

13 Nov 29, 2022

dbt (data build tool) adapter for Oracle Autonomous Database

dbt-oracle version 1.0.0 dbt (data build tool) adapter for the Oracle database. dbt "adapters" are responsible for adapting dbt's functionality to a g

22 Nov 15, 2022

🤖🤖 Jarvis is an virtual assistant which can some tasks easy for you like surfing on web opening an app and much more... 🤖🤖

Jarvis ?? ?? Jarvis is an virtual assistant which can some tasks easy for you like surfing on web opening an app and much more... ?? ?? Developer : su

1 Nov 8, 2021

A minimal configuration for a dockerized kafka project.

Docker Kafka Quickstart A minimal configuration for a dockerized kafka project. Usage: Run this command to build kafka and zookeeper containers, and c

5 Jan 12, 2022

Make dbt docs and Apache Superset talk to one another

dbt-superset-lineage Make dbt docs and Apache Superset talk to one another Why do I need something like this? Odds are rather high that you use dbt to

81 Jan 6, 2023

:fishing_pole_and_fish: List of `pre-commit` hooks to ensure the quality of your `dbt` projects.

pre-commit-dbt List of pre-commit hooks to ensure the quality of your dbt projects. BETA NOTICE: This tool is still BETA and may have some bugs, so pl

262 Nov 25, 2022

dbt adapter for Firebolt

dbt-firebolt dbt adapter for Firebolt dbt-firebolt supports dbt 0.21 and newer Installation First, download the JDBC driver and place it wherever you'

23 Dec 14, 2022

Model synchronization from dbt to Metabase.

dbt-metabase Model synchronization from dbt to Metabase. If dbt is your source of truth for database schemas and you use Metabase as your analytics to

270 Jan 8, 2023

GCP Scripts and API Client Toolss

GCP Scripts and API Client Toolss Script Authentication The scripts and CLI assume GCP Application Default Credentials are set. Credentials can be set

3 Feb 21, 2022

A docker container (Docker Desktop) for a simple python Web app few unit tested

Short web app using Flask, tested with unittest on making massive requests, responses of the website, containerized

1 Dec 13, 2021

Airflow Operator for running Soda SQL scans

7 Oct 18, 2022

A tutorial presents several practical examples of how to build DAGs in Apache Airflow

Apache Airflow - Python Brasil 2021 Este tutorial apresenta vários exemplos práticos de como construir DAGs no Apache Airflow. Background Apache Airfl

14 Jun 3, 2022

Repositório para estudo do airflow

airflow-101 Repositório para estudo do airflow Docker criado baseado no tutorial Exemplo de API da pokeapi Para executar clone o repo execute as confi

1 Nov 23, 2021

A reproduction repo for a Scheduling bug in AirFlow 2.2.3

1 Feb 9, 2022

Here is my Senior Design Project that I implemented to graduate from Computer Engineering.

Here is my Senior Design Project that I implemented to graduate from Computer Engineering. It is a chatbot made in RASA and helps the user to plan their vacation in the Turkish language. In order to plan the user's vacation, it provides reservations by asking various questions for hotel, flight, or event.

25 May 31, 2022

YunoHost is an operating system aiming to simplify as much as possible the administration of a server.

YunoHost is an operating system aiming to simplify as much as possible the administration of a server. This repository corresponds to the core code, written mostly in Python and Bash.

1.5k Jan 9, 2023

A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Related tags

Overview

Streamify

Description

Objective

Dataset

Tools & Technologies

Architecture

Final Result

Setup

Pre-requisites

Get Going!

Debug

How can I make this better?!

Special Mentions

Comments

Error creating Dataproc cluster

The number of topics in kafka control center did not increase

Error while provisioning spark cluster with terraform

I have an error when I launch the terraform apply

Command for spark-submit

Owner

Ankur Chavda

This is a practice on Airflow, which is building virtual env, installing Airflow and constructing data pipeline (DAGs)

An Airflow operator to call the main function from the dbt-core Python package

Yet another Airflow plugin using CLI command as RESTful api, supports Airflow v2.X.

Project repository of Apache Airflow, deployed on Docker in Amazon EC2 via GitLab.

dbt (data build tool) adapter for Oracle Autonomous Database

🤖🤖 Jarvis is an virtual assistant which can some tasks easy for you like surfing on web opening an app and much more... 🤖🤖

A minimal configuration for a dockerized kafka project.

Make dbt docs and Apache Superset talk to one another

:fishing_pole_and_fish: List of `pre-commit` hooks to ensure the quality of your `dbt` projects.

dbt adapter for Firebolt

Model synchronization from dbt to Metabase.

GCP Scripts and API Client Toolss

A docker container (Docker Desktop) for a simple python Web app few unit tested

Airflow Operator for running Soda SQL scans

A tutorial presents several practical examples of how to build DAGs in Apache Airflow

Repositório para estudo do airflow

A reproduction repo for a Scheduling bug in AirFlow 2.2.3

Here is my Senior Design Project that I implemented to graduate from Computer Engineering.

YunoHost is an operating system aiming to simplify as much as possible the administration of a server.