Data Engineering Zoomcamp

Free Data Engineering course!

Contents:

  • Overview
  • Syllabus
  • Taking the course

Taking the course

Self-paced mode

All the course materials are freely available, so you can take the course at your own pace:

  • Follow the suggested syllabus (see below) week by week.
  • You don't need to fill in the registration form; just start watching the videos and join Slack.
  • Check the FAQ if you have problems.
  • If you can't find a solution to your problem in the FAQ, ask for help in Slack.

2022 Cohort

Asking for help in Slack

The best way to get support is to use DataTalks.Club's Slack. Join the #course-data-engineering channel.

To keep discussions in Slack organized, please follow the channel guidelines.

Syllabus

Week 1: Introduction & Prerequisites

  • Course overview
  • Introduction to GCP
  • Docker and docker-compose
  • Running Postgres locally with Docker (see the sketch below)
  • Setting up infrastructure on GCP with Terraform
  • Preparing the environment for the course
  • Homework

More details
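
For the "Running Postgres locally with Docker" part, here is a minimal sketch of loading a small file into that database with pandas and SQLAlchemy. The connection string (user root, password root, database ny_taxi on localhost:5432) and the file name are assumptions for illustration; use whatever your own container and dataset use.

    # Minimal sketch: load a small taxi file into a Postgres instance running in Docker.
    # Connection details and the file name below are assumptions, not the course's exact code.
    import pandas as pd
    from sqlalchemy import create_engine

    # assumed container settings: user=root, password=root, db=ny_taxi, port 5432
    engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

    df = pd.read_csv("yellow_tripdata_sample.csv")  # hypothetical local file
    df.to_sql("yellow_taxi_data", con=engine, if_exists="replace", index=False)

    # quick sanity check with plain SQL
    print(pd.read_sql("SELECT COUNT(*) FROM yellow_taxi_data", con=engine))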

Week 2: Data ingestion

  • Data Lake
  • Workflow orchestration (see the DAG sketch below)
  • Setting up Airflow locally
  • Ingesting data to GCP with Airflow
  • Ingesting data to local Postgres with Airflow
  • Moving data from AWS to GCP (Transfer service)
  • Homework

More details
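
A complete ingestion DAG appears in the comments section at the bottom of this page; before that, a minimal Airflow sketch may help illustrate what "workflow orchestration" means here: tasks plus explicit dependencies between them. The DAG id, task ids and the bash command are made up for illustration.

    # Minimal sketch of an Airflow DAG with two dependent tasks, only to show how
    # a workflow is declared and ordered. Names and commands are illustrative.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator


    def report_done():
        print("ingestion finished")


    with DAG(
        dag_id="example_ingestion_dag",     # hypothetical DAG id
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        download_task = BashOperator(
            task_id="download_task",
            bash_command="echo 'downloading dataset...'",
        )

        report_task = PythonOperator(
            task_id="report_task",
            python_callable=report_done,
        )

        # report_task runs only after download_task succeeds
        download_task >> report_task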

Week 3: Data Warehouse

  • Data Warehouse
  • BigQuery
  • Partitioning and clustering (see the sketch below)
  • BigQuery best practices
  • Internals of BigQuery
  • Integrating BigQuery with Airflow
  • BigQuery Machine Learning

More details
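
To make partitioning and clustering concrete, here is a minimal sketch that builds a partitioned, clustered table from the external table loaded in week 2, using the BigQuery Python client to run a DDL query. "my-project" is a placeholder; the dataset and table names follow the DAG shown in the comments section, and the column names follow the NYC yellow taxi schema, so adjust them if yours differ.

    # Minimal sketch: create a partitioned and clustered BigQuery table via DDL.
    # Project id is a placeholder; dataset/table/column names are assumptions.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up your default GCP credentials/project

    ddl = """
    CREATE TABLE IF NOT EXISTS `my-project.trips_data_all.yellow_trips_partitioned`
    PARTITION BY DATE(tpep_pickup_datetime)
    CLUSTER BY VendorID AS
    SELECT * FROM `my-project.trips_data_all.external_table`;
    """

    client.query(ddl).result()  # blocks until the job finishes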

Week 4: Analytics engineering

  • Basics of analytics engineering
  • dbt (data build tool)
  • BigQuery and dbt
  • Postgres and dbt
  • dbt models
  • Testing and documenting
  • Deployment to the cloud and locally
  • Visualising the data with Google Data Studio and Metabase

More details

Week 5: Batch processing

  • Batch processing
  • What is Spark
  • Spark DataFrames
  • Spark SQL
  • Internals: GroupBy and joins (see the sketch below)

More details
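
Here is a minimal PySpark sketch of the DataFrame operations listed above: reading parquet, a groupBy aggregation, and a join. The file paths are placeholders, and the column names follow the NYC taxi datasets used in the course; adjust both to your data.

    # Minimal sketch: read parquet, aggregate with groupBy, join with a lookup table.
    # File paths are placeholders; column names follow the NYC taxi schema.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("batch-example").getOrCreate()

    trips = spark.read.parquet("data/yellow_tripdata_2021-01.parquet")          # placeholder path
    zones = spark.read.option("header", True).csv("data/taxi_zone_lookup.csv")  # placeholder path

    # revenue per pickup zone
    revenue = (
        trips.groupBy("PULocationID")
             .agg(F.sum("total_amount").alias("revenue"))
    )

    # join the aggregate back to the zone names and show a few rows
    (revenue
        .join(zones, revenue.PULocationID == zones.LocationID, "left")
        .select("Zone", "revenue")
        .show(5))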

Week 6: Streaming

  • Introduction to Kafka (see the sketch below)
  • Schemas (Avro)
  • Kafka Streams
  • Kafka Connect and KSQL

More details
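
As a small taste of the streaming week, here is a minimal sketch of producing and consuming JSON messages with the kafka-python client, assuming a broker is reachable on localhost:9092. The topic name and message fields are made up for illustration; the course's Avro schemas, Kafka Streams and Kafka Connect material is not shown here.

    # Minimal sketch: produce and consume JSON messages with kafka-python.
    # Broker address, topic name and message fields are assumptions.
    import json

    from kafka import KafkaConsumer, KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("rides", {"vendor_id": 1, "total_amount": 12.5})
    producer.flush()

    consumer = KafkaConsumer(
        "rides",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        consumer_timeout_ms=5000,   # stop iterating when no new messages arrive
    )
    for message in consumer:
        print(message.value)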

Week 7, 8 & 9: Project

Putting everything we've learned into practice.

  • Weeks 7 and 8: working on your own project
  • Week 9: reviewing your peers' projects

More details

Overview

Architecture diagram

Technologies

  • Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
    • Google Cloud Storage (GCS): Data Lake
    • BigQuery: Data Warehouse
  • Terraform: Infrastructure-as-Code (IaC)
  • Docker: Containerization
  • SQL: Data Analysis & Exploration
  • Airflow: Pipeline Orchestration
  • dbt: Data Transformation
  • Spark: Distributed Processing
  • Kafka: Streaming

Prerequisites

To get the most out of this course, you should feel comfortable with coding and the command line, and know the basics of SQL. Prior experience with Python will be helpful, but you can pick up Python relatively quickly if you have experience with other programming languages.

Prior experience with data engineering is not required.

Instructors

Tools

For this course you'll need to have the following software installed on your computer:

  • Docker and Docker-Compose
  • Python 3 (e.g. via Anaconda)
  • Google Cloud SDK
  • Terraform

See Week 1 for more details about installing these tools.

FAQ

  • Q: I registered, but haven't received a confirmation email. Is that normal? A: Yes, it's normal. The email isn't automated, but you will receive one eventually.
  • Q: At what time of day will the office hours happen? A: Office hours will happen on Mondays at 17:00 CET. Everything will be recorded, so you can watch it whenever it's convenient for you.
  • Q: Will there be a certificate? A: Yes, if you complete the project
  • Q: I'm 100% not sure I'll be able to attend. Can I still sign up? A: Yes, please do! You'll receive all the updates and then you can watch the course at your own pace.
  • Q: Do you plan to run a ML engineering course as well? A: Glad you asked. We do :)
  • Q: I'm stuck! I've got a technical question! A: Ask on Slack! And check out the student FAQ; many common issues have been answered already. If your issue is solved, please add how you solved it to the document. Thanks!

Our friends

Big thanks to other communities for helping us spread the word about the course:

Check them out - they are cool!

Comments
  • networking for docker-compose.yaml is not available

    In week 1, we are supposed to create a network to connect pgAdmin to the Postgres database.

    Alexey shows how to do it from the CLI, which is less recommended if we want to keep the network bridge available in the future.

    I did some research online but wasn't able to find a solution. Is it possible to update docker-compose.yaml with the above specification, please?

    opened by giuliosmall 6
  • URL in Readme on week_1_basics_n_setup/2_docker_sql needs to be changed to .parquet

    In README.md, under Data ingestion -> Running locally, the URL is: URL="https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv"

    It should be URL="https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.parquet"

    opened by kyleaddis 6
  • Setup dbt locally with BQ on Docker

    Raising a PR to add a quick guide to set up dbt on Docker.

    Added three files (docker-compose.yml, Dockerfile and docker-setup.md) for ease of access, and changed the README.md to link to the setup markdown.

    If this is too much clutter then I will just link my repo guide in the README.md and remove all other files from here. Let me know :D

    opened by ankurchavda 5
  • Data in week 1 is not available (yellow_tripdata_2021-01.csv)

    I have tried requesting the data from https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv, but it seems it is no longer available. Please update the S3 data link. Thanks.

    opened by toandaominh1997 4
  • update read_csv to read_parquet, add iteration function, change while loop to for loop

    A few changes to upload-data.py for week 1:

    • cleared the cell outputs (helps with future diffs)
    • changed read_csv to read_parquet to reflect the file format change on the TLC Trip Record Data website
    • created an iterate_df function to emulate the previous read_csv iteration, since there is no equivalent option for read_parquet (a rough sketch of this pattern appears after the comments list)
    • changed the while loop to a for loop so the StopIteration error is handled automatically
    opened by joeeoj 3
  • Week 2 - Task stuck at up_for_retry

    Hi,

    I tried to run data_ingestion_gcs_dag from the webserver, but the first task got stuck at the up_for_retry stage. When I checked the log, this is the only thing I found:

    [image attachment]

    However, I can still run this DAG via the Airflow CLI. I'm not sure what the problem is here: I built the image on two different machines, one worked fine, and one got stuck like this.

    opened by hoanghapham 3
  • Ingesting Data to GCP with Airflow - parquet update

    Hi guys,

    Really enjoying the course! Thanks for putting this together.

    I couldn't get data_ingestion_gcs_DAG.py to work, as the NYC taxi data has been updated to parquet format on the website. Here's my workaround; I left the comments in so you can easily see what I changed:

    • Changed the dataset_file variable to yellow_tripdata_2021-01.parquet
    • Commented out the format_to_parquet function
    • Commented out the format_to_parquet_task in the DAG declaration
    • Removed format_to_parquet_task from the workflow line at the end of the code

    Cheers, Martin

    import os
    import logging
    
    from airflow import DAG
    from airflow.utils.dates import days_ago
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator
    
    from google.cloud import storage
    from airflow.providers.google.cloud.operators.bigquery import BigQueryCreateExternalTableOperator
    import pyarrow.csv as pv
    import pyarrow.parquet as pq
    
    PROJECT_ID = os.environ.get("GCP_PROJECT_ID")
    BUCKET = os.environ.get("GCP_GCS_BUCKET")
    
    dataset_file = "yellow_tripdata_2021-01.parquet"
    # dataset_file = "yellow_tripdata_2021-01.csv"
    dataset_url = f"https://s3.amazonaws.com/nyc-tlc/trip+data/{dataset_file}"
    path_to_local_home = os.environ.get("AIRFLOW_HOME", "/opt/airflow/")
    # parquet_file = dataset_file.replace('.csv', '.parquet')
    parquet_file = dataset_file
    BIGQUERY_DATASET = os.environ.get("BQ_DATASET", 'trips_data_all')
    
    
    # this is not needed anymore since the file is already in parquet format
    # def format_to_parquet(src_file):
    #     if not src_file.endswith('.csv'):
    #         logging.error("Can only accept source files in CSV format, for the moment")
    #         return
    #     table = pv.read_csv(src_file)
    #     pq.write_table(table, src_file.replace('.csv', '.parquet'))
    
    
    # NOTE: takes 20 mins, at an upload speed of 800kbps. Faster if your internet has a better upload speed
    def upload_to_gcs(bucket, object_name, local_file):
        """
        Ref: https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python
        :param bucket: GCS bucket name
        :param object_name: target path & file-name
        :param local_file: source path & file-name
        :return:
        """
        # WORKAROUND to prevent timeout for files > 6 MB on 800 kbps upload speed.
        # (Ref: https://github.com/googleapis/python-storage/issues/74)
        storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB
        storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024  # 5 MB
        # End of Workaround
    
        client = storage.Client()
        bucket = client.bucket(bucket)
    
        blob = bucket.blob(object_name)
        blob.upload_from_filename(local_file)
    
    
    default_args = {
        "owner": "airflow",
        "start_date": days_ago(1),
        "depends_on_past": False,
        "retries": 1,
    }
    
    # NOTE: DAG declaration - using a Context Manager (an implicit way)
    with DAG(
        dag_id="data_ingestion_gcs_dag",
        schedule_interval="@daily",
        default_args=default_args,
        catchup=False,
        max_active_runs=1,
        tags=['dtc-de'],
    ) as dag:
    
        download_dataset_task = BashOperator(
            task_id="download_dataset_task",
            bash_command=f"curl -sSL {dataset_url} > {path_to_local_home}/{dataset_file}"
        )
    
        # format_to_parquet_task = PythonOperator(
        #     task_id="format_to_parquet_task",
        #     python_callable=format_to_parquet,
        #     op_kwargs={
        #         "src_file": f"{path_to_local_home}/{dataset_file}",
        #     },
        # )
    
        # TODO: Homework - research and try XCOM to communicate output values between 2 tasks/operators
        local_to_gcs_task = PythonOperator(
            task_id="local_to_gcs_task",
            python_callable=upload_to_gcs,
            op_kwargs={
                "bucket": BUCKET,
                "object_name": f"raw/{parquet_file}",
                "local_file": f"{path_to_local_home}/{parquet_file}",
            },
        )
    
        bigquery_external_table_task = BigQueryCreateExternalTableOperator(
            task_id="bigquery_external_table_task",
            table_resource={
                "tableReference": {
                    "projectId": PROJECT_ID,
                    "datasetId": BIGQUERY_DATASET,
                    "tableId": "external_table",
                },
                "externalDataConfiguration": {
                    "sourceFormat": "PARQUET",
                    "sourceUris": [f"gs://{BUCKET}/raw/{parquet_file}"],
                },
            },
        )
    
        # download_dataset_task >> format_to_parquet_task >> local_to_gcs_task >> bigquery_external_table_task
        download_dataset_task >> local_to_gcs_task >> bigquery_external_table_task
    
    opened by MartyC-137 3
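
As a side note on the upload-data.py comment above ("update read_csv to read_parquet..."): since pandas' read_parquet has no chunked mode, one way to emulate the old chunked read_csv ingestion is to iterate over the parquet file in batches with pyarrow. This is only a rough sketch of that pattern, not the code from the PR; the connection string and table name are assumptions.

    # Rough sketch (not the PR's code): stream a parquet file into Postgres in
    # batches with pyarrow, emulating the old chunked read_csv ingestion.
    import pyarrow.parquet as pq
    from sqlalchemy import create_engine

    # assumed connection settings; adjust to your local Postgres container
    engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

    parquet_file = pq.ParquetFile("yellow_tripdata_2021-01.parquet")

    for batch in parquet_file.iter_batches(batch_size=100_000):
        df = batch.to_pandas()
        df.to_sql("yellow_taxi_data", con=engine, if_exists="append", index=False)
        print(f"inserted {len(df)} rows")
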
Owner

DataTalksClub: The place to talk about data