Project Scope
The scope of this project is to build a data warehouse on Google Cloud Platform that helps answer common business questions as well as power dashboards. To do that, a conceptual data model and a data pipeline will be defined.
Architecture
Data is uploaded to a Google Cloud Storage bucket. GCS acts as the data lake where all raw files are stored. Data is then loaded into staging tables on BigQuery. The ETL process takes data from those staging tables and creates data mart tables. An Airflow instance can be deployed on Google Compute Engine or locally to orchestrate the pipeline. A sketch of the GCS-to-staging load step follows the technology list below.
Here are the justifications for the technologies used:
- Google Cloud Storage: acts as the data lake; highly scalable object storage for raw files.
- Google BigQuery: acts as the database engine for the data warehouse, data marts, and ETL processes. BigQuery is a serverless solution that can easily and effectively process petabyte-scale datasets.
- Apache Airflow: orchestrates the workflow by issuing commands to load data into BigQuery and SQL queries for the ETL process. Airflow does not process any data itself, which allows the architecture to scale.
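As a rough illustration of the load step described in the architecture above, here is a minimal sketch that loads the raw immigration files from GCS into a BigQuery staging table with the BigQuery Python client. The project, bucket, dataset, and table names are the placeholders used later in this README; adjust them to your own setup, and note that the staging dataset must already exist.

# Minimal sketch: load raw parquet files from the GCS data lake into a BigQuery staging table.
# Assumes google-cloud-bigquery is installed and credentials are configured.
from google.cloud import bigquery

client = bigquery.Client(project="cloud-data-lake")  # placeholder project id

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # rebuild the staging table on each run
)

load_job = client.load_table_from_uri(
    "gs://cloud-data-lake-gcp/immigration_data/*.parquet",
    "cloud-data-lake.IMMIGRATION_DWH_STAGING.immigration_data",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish

table = client.get_table("cloud-data-lake.IMMIGRATION_DWH_STAGING.immigration_data")
print(f"Loaded {table.num_rows} rows into staging")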
Data Model
The database is designed following the star-schema principle, with 1 fact table and 7 dimension tables. A sample query joining the fact and dimension tables follows the list.
- F_IMMIGRATION_DATA: contains immigration information such as arrival date, departure date, visa type, gender, country of origin, etc.
- D_TIME: contains dimensions for the date column
- D_PORT: contains port_id and port_name
- D_AIRPORT: contains airports within a state
- D_STATE: contains state_id and state_name
- D_COUNTRY: contains country_id and country_name
- D_WEATHER: contains average weather for a state
- D_CITY_DEMO: contains demographic information for a city
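As an example of how the star schema answers a business question, the sketch below counts arrivals by state by joining the fact table to D_STATE. The column names (state_id, state_name) and the join key are assumptions based on the table descriptions above; adjust them to the actual schema.

# Sketch of an analytical query against the star schema (column names are assumed).
from google.cloud import bigquery

client = bigquery.Client(project="cloud-data-lake")  # placeholder project id

QUERY = """
    SELECT d.state_name, COUNT(*) AS arrivals
    FROM `cloud-data-lake.IMMIGRATION_DWH.F_IMMIGRATION_DATA` f
    JOIN `cloud-data-lake.IMMIGRATION_DWH.D_STATE` d
      ON f.state_id = d.state_id      -- hypothetical join key
    GROUP BY d.state_name
    ORDER BY arrivals DESC
    LIMIT 10
"""

for row in client.query(QUERY).result():
    print(row.state_name, row.arrivals)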
Data pipeline
This project uses Airflow for orchestration.
A DummyOperator start_pipeline kicks off the pipeline, followed by 4 load operations. Those operations load data from the GCS bucket to BigQuery tables. The immigration_data is loaded as parquet files while the others are csv formatted. Row-count checks are run after each load to BigQuery.
Next, the pipeline loads 3 master data objects from the I94 data dictionary. Then the F_IMMIGRATION_DATA table is created and checked to make sure that there are no duplicates. The other dimension tables are also created and the pipeline finishes.
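The authoritative implementation is the project's DAG file (see the instructions below); purely as an illustration, here is a minimal sketch of what one load task and the duplicate check could look like with the Airflow Google provider package. The dag id, the cicid record id used in the check, and the task names are assumptions, not the project's real code.

# Illustrative sketch of a load task plus a data-quality check.
# Assumes apache-airflow-providers-google is installed; names are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryCheckOperator

with DAG(
    dag_id="cloud_data_lake_sketch",   # illustrative, not the project's real dag id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:

    load_immigration_data = GCSToBigQueryOperator(
        task_id="load_immigration_data",
        bucket="cloud-data-lake-gcp",
        source_objects=["immigration_data/*.parquet"],
        source_format="PARQUET",
        destination_project_dataset_table="cloud-data-lake.IMMIGRATION_DWH_STAGING.immigration_data",
        write_disposition="WRITE_TRUNCATE",
    )

    # Fails the pipeline if the fact table contains duplicate records
    # (cicid is assumed to be the unique record id).
    check_no_duplicates = BigQueryCheckOperator(
        task_id="check_no_duplicates",
        use_legacy_sql=False,
        sql="""
            SELECT COUNT(*) = COUNT(DISTINCT cicid)
            FROM `cloud-data-lake.IMMIGRATION_DWH.F_IMMIGRATION_DATA`
        """,
    )

    load_immigration_data >> check_no_duplicates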
Scenarios
Data increases by 100x
The current infrastructure can easily support a 100x increase in data size. GCS and BigQuery can handle petabyte-scale data. Airflow is not a bottleneck since it only issues commands to other services.
Pipelines are run daily by 7 AM. How do we update the dashboard? Would it still work?
Schedule the DAG to run daily at 7 AM. Set up DAG retries and email/Slack notifications on failures.
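A minimal sketch of the DAG arguments that would implement this, using standard Airflow features; the cron expression, retry counts, and the notification address are placeholders, and Slack alerts would additionally need a failure callback or the Slack provider.

# Sketch: schedule the pipeline daily at 7 AM with retries and email on failure.
from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    "owner": "airflow",
    "retries": 3,                           # retry failed tasks a few times
    "retry_delay": timedelta(minutes=5),
    "email": ["data-team@example.com"],     # placeholder address
    "email_on_failure": True,
}

with DAG(
    dag_id="cloud_data_lake_pipeline",      # illustrative dag id
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 7 * * *",          # every day at 7 AM
    catchup=False,
) as dag:
    ...  # load, transform, and check tasks go here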
Make it available to 100+ people
BigQuery auto-scales, so it can easily handle 100+ people accessing it. If more people or services need access to the database, we can add steps to write to a horizontally scalable NoSQL store such as Datastore, Bigtable, or Cassandra.
Project Instructions
GCP setup
Follow these steps:
- Create a project on GCP
- Enable billing by adding a credit card (you have free credits worth $300)
- Navigate to IAM and create a service account
- Grant the service account the Project Owner role. This is convenient for this project but is not recommended for a production system. Download a key for the account and keep it somewhere safe (see the credentials sketch after this list).
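The standard way to use the key is to point the GOOGLE_APPLICATION_CREDENTIALS environment variable at it; alternatively, it can be passed to the client libraries explicitly. A minimal sketch, where the key file name and project id are placeholders:

# Sketch: authenticate a BigQuery client with the downloaded service account key.
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file("service-account-key.json")
client = bigquery.Client(project="cloud-data-lake", credentials=credentials)

print([dataset.dataset_id for dataset in client.list_datasets()])  # quick sanity check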
Create a bucket in your project and upload the data with the following structure:
gs://cloud-data-lake-gcp/airports/:
gs://cloud-data-lake-gcp/airports/airport-codes_csv.csv
gs://cloud-data-lake-gcp/airports/airport_codes.json
gs://cloud-data-lake-gcp/cities/:
gs://cloud-data-lake-gcp/cities/us-cities-demographics.csv
gs://cloud-data-lake-gcp/cities/us_cities_demo.json
gs://cloud-data-lake-gcp/immigration_data/:
gs://cloud-data-lake-gcp/immigration_data/part-00000-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00001-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00002-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00003-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00004-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00005-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00006-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00007-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00008-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00009-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00010-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00011-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00012-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00013-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/master_data/:
gs://cloud-data-lake-gcp/master_data/I94ADDR.csv
gs://cloud-data-lake-gcp/master_data/I94CIT_I94RES.csv
gs://cloud-data-lake-gcp/master_data/I94PORT.csv
gs://cloud-data-lake-gcp/weather/:
gs://cloud-data-lake-gcp/weather/GlobalLandTemperaturesByCity.csv
gs://cloud-data-lake-gcp/weather/temperature_by_city.json
You can copy the data to your own bucket by running the following:
gsutil cp -r gs://cloud-data-lake-gcp/ gs://{your_bucket_name}
Local setup
Clone the project, then create the environment and install the required packages as follows:
Install Docker if it's not already installed. You can find the resources to do that here.
Install the Astronomer CLI following the instructions here.
Run the following command to bring up the Airflow instance:
astro d start
You can look at the logs by running make logs if you need to debug something. You can access and manage the pipeline by entering the following address in a browser:
localhost:8080/admin/
If everything is set up correctly, you will see the Airflow UI home screen.
Navigate to Admin -> Connections and paste in the credentials for the following two connections: bigquery_default and google_cloud_default.
Navigate to the main DAG at dags/cloud-data-lake-pipeline.py and change the following parameters to match your own setup:
project_id = 'cloud-data-lake'
staging_dataset = 'IMMIGRATION_DWH_STAGING'
dwh_dataset = 'IMMIGRATION_DWH'
gs_bucket = 'cloud-data-lake-gcp'
You can then trigger the DAG and the pipeline will run.