Data Engineering Zoomcamp

Free Data Engineering course!

Contents:

  • Overview
  • Syllabus
  • Taking the course

Taking the course

Self-paced mode

All the course materials are freely available, so you can take the course at your own pace:

  • Follow the suggested syllabus (see below) week by week.
  • You don't need to fill in the registration form; just start watching the videos and join Slack.
  • Check the FAQ if you have problems.
  • If you can't find a solution to your problem in the FAQ, ask for help in Slack.

2022 Cohort

Asking for help in Slack

The best way to get support is to use DataTalks.Club's Slack. Join the #course-data-engineering channel.

To keep discussions in Slack organized, please follow the channel guidelines.

Syllabus

Week 1: Introduction & Prerequisites

  • Course overview
  • Introduction to GCP
  • Docker and docker-compose
  • Running Postgres locally with Docker (see the sketch below)
  • Setting up infrastructure on GCP with Terraform
  • Preparing the environment for the course
  • Homework

More details
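
For the "Running Postgres locally with Docker" part, here is a minimal sketch of loading a small file into that database with pandas and SQLAlchemy. The connection string (user root, password root, database ny_taxi on localhost:5432) and the file name are assumptions for illustration; use whatever your own container and dataset use.

    # Minimal sketch: load a small taxi file into a Postgres instance running in Docker.
    # Connection details and the file name below are assumptions, not the course's exact code.
    import pandas as pd
    from sqlalchemy import create_engine

    # assumed container settings: user=root, password=root, db=ny_taxi, port 5432
    engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

    df = pd.read_csv("yellow_tripdata_sample.csv")  # hypothetical local file
    df.to_sql("yellow_taxi_data", con=engine, if_exists="replace", index=False)

    # quick sanity check with plain SQL
    print(pd.read_sql("SELECT COUNT(*) FROM yellow_taxi_data", con=engine))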

Week 2: Data ingestion

  • Data Lake
  • Workflow orchestration (see the DAG sketch below)
  • Setting up Airflow locally
  • Ingesting data to GCP with Airflow
  • Ingesting data to local Postgres with Airflow
  • Moving data from AWS to GCP (Transfer service)
  • Homework

More details
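
A complete ingestion DAG appears in the comments section at the bottom of this page; before that, a minimal Airflow sketch may help illustrate what "workflow orchestration" means here: tasks plus explicit dependencies between them. The DAG id, task ids and the bash command are made up for illustration.

    # Minimal sketch of an Airflow DAG with two dependent tasks, only to show how
    # a workflow is declared and ordered. Names and commands are illustrative.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator


    def report_done():
        print("ingestion finished")


    with DAG(
        dag_id="example_ingestion_dag",     # hypothetical DAG id
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        download_task = BashOperator(
            task_id="download_task",
            bash_command="echo 'downloading dataset...'",
        )

        report_task = PythonOperator(
            task_id="report_task",
            python_callable=report_done,
        )

        # report_task runs only after download_task succeeds
        download_task >> report_task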

Week 3: Data Warehouse

  • Data Warehouse
  • BigQuery
  • Partitioning and clustering (see the sketch below)
  • BigQuery best practices
  • Internals of BigQuery
  • Integrating BigQuery with Airflow
  • BigQuery Machine Learning

More details
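
To make partitioning and clustering concrete, here is a minimal sketch that builds a partitioned, clustered table from the external table loaded in week 2, using the BigQuery Python client to run a DDL query. "my-project" is a placeholder; the dataset and table names follow the DAG shown in the comments section, and the column names follow the NYC yellow taxi schema, so adjust them if yours differ.

    # Minimal sketch: create a partitioned and clustered BigQuery table via DDL.
    # Project id is a placeholder; dataset/table/column names are assumptions.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up your default GCP credentials/project

    ddl = """
    CREATE TABLE IF NOT EXISTS `my-project.trips_data_all.yellow_trips_partitioned`
    PARTITION BY DATE(tpep_pickup_datetime)
    CLUSTER BY VendorID AS
    SELECT * FROM `my-project.trips_data_all.external_table`;
    """

    client.query(ddl).result()  # blocks until the job finishes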

Week 4: Analytics engineering

  • Basics of analytics engineering
  • dbt (data build tool)
  • BigQuery and dbt
  • Postgres and dbt
  • dbt models
  • Testing and documenting
  • Deployment to the cloud and locally
  • Visualising the data with Google Data Studio and Metabase

More details

Week 5: Batch processing

  • Batch processing
  • What is Spark
  • Spark DataFrames
  • Spark SQL
  • Internals: GroupBy and joins (see the sketch below)

More details
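
Here is a minimal PySpark sketch of the DataFrame operations listed above: reading parquet, a groupBy aggregation, and a join. The file paths are placeholders, and the column names follow the NYC taxi datasets used in the course; adjust both to your data.

    # Minimal sketch: read parquet, aggregate with groupBy, join with a lookup table.
    # File paths are placeholders; column names follow the NYC taxi schema.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("batch-example").getOrCreate()

    trips = spark.read.parquet("data/yellow_tripdata_2021-01.parquet")          # placeholder path
    zones = spark.read.option("header", True).csv("data/taxi_zone_lookup.csv")  # placeholder path

    # revenue per pickup zone
    revenue = (
        trips.groupBy("PULocationID")
             .agg(F.sum("total_amount").alias("revenue"))
    )

    # join the aggregate back to the zone names and show a few rows
    (revenue
        .join(zones, revenue.PULocationID == zones.LocationID, "left")
        .select("Zone", "revenue")
        .show(5))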

Week 6: Streaming

  • Introduction to Kafka (see the sketch below)
  • Schemas (Avro)
  • Kafka Streams
  • Kafka Connect and KSQL

More details
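
As a small taste of the streaming week, here is a minimal sketch of producing and consuming JSON messages with the kafka-python client, assuming a broker is reachable on localhost:9092. The topic name and message fields are made up for illustration; the course's Avro schemas, Kafka Streams and Kafka Connect material is not shown here.

    # Minimal sketch: produce and consume JSON messages with kafka-python.
    # Broker address, topic name and message fields are assumptions.
    import json

    from kafka import KafkaConsumer, KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("rides", {"vendor_id": 1, "total_amount": 12.5})
    producer.flush()

    consumer = KafkaConsumer(
        "rides",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        consumer_timeout_ms=5000,   # stop iterating when no new messages arrive
    )
    for message in consumer:
        print(message.value)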

Week 7, 8 & 9: Project

Putting everything we've learned into practice.

  • Weeks 7 and 8: working on your own project
  • Week 9: reviewing your peers' projects

More details

Overview

Architecture diagram

Technologies

  • Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
    • Google Cloud Storage (GCS): Data Lake
    • BigQuery: Data Warehouse
  • Terraform: Infrastructure-as-Code (IaC)
  • Docker: Containerization
  • SQL: Data Analysis & Exploration
  • Airflow: Pipeline Orchestration
  • dbt: Data Transformation
  • Spark: Distributed Processing
  • Kafka: Streaming

Prerequisites

To get the most out of this course, you should feel comfortable with coding and the command line, and know the basics of SQL. Prior experience with Python will be helpful, but you can pick up Python relatively quickly if you have experience with other programming languages.

Prior experience with data engineering is not required.

Instructors

Tools

For this course you'll need to have the following software installed on your computer:

  • Docker and Docker-Compose
  • Python 3 (e.g. via Anaconda)
  • Google Cloud SDK
  • Terraform

See Week 1 for more details about installing these tools.

FAQ

  • Q: I registered, but haven't received a confirmation email. Is that normal? A: Yes, it's normal. The email isn't automated, but you will receive one eventually.
  • Q: At what time of day will the office hours happen? A: Office hours will happen on Mondays at 17:00 CET. Everything will be recorded, so you can watch it whenever it's convenient for you.
  • Q: Will there be a certificate? A: Yes, if you complete the project
  • Q: I'm 100% not sure I'll be able to attend. Can I still sign up? A: Yes, please do! You'll receive all the updates and then you can watch the course at your own pace.
  • Q: Do you plan to run a ML engineering course as well? A: Glad you asked. We do :)
  • Q: I'm stuck! I've got a technical question! A: Ask on Slack! And check out the student FAQ; many common issues have been answered already. If your issue is solved, please add how you solved it to the document. Thanks!

Our friends

Big thanks to other communities for helping us spread the word about the course:

Check them out - they are cool!

Comments
  • networking for docker-compose.yaml is not available

    In week 1, we are supposed to create a network to connect pgAdmin to the Postgres database.

    Alexey shows how to do it from the CLI, which is less recommended if we want to keep the network bridge available in the future.

    I did some research online but wasn't able to find a solution. Is it possible to update docker-compose.yaml with the above specification, please?

    opened by giuliosmall 6
  • URL in Readme on week_1_basics_n_setup/2_docker_sql needs to be changed to .parquet

    In README.md, under Data ingestion -> Running locally, the URL is: URL="https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv"

    It should be URL="https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.parquet"

    opened by kyleaddis 6
  • Setup dbt locally with BQ on Docker

    Raising a PR to add a quick guide to set up dbt on Docker.

    Added three files (docker-compose.yml, Dockerfile and docker-setup.md) for ease of access, and changed the README.md to link to the setup markdown.

    If this is too much clutter then I will just link my repo guide in the README.md and remove all other files from here. Let me know :D

    opened by ankurchavda 5
  • Data in week 1 is not available (yellow_tripdata_2021-01.csv)

    I have tried requesting the data from https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv, but it seems it is no longer available. Please update the S3 data link. Thanks.

    opened by toandaominh1997 4
  • update read_csv to read_parquet, add iteration function, change while loop to for loop

    A few changes to upload-data.py for week 1:

    • cleared the cell outputs (helps with future diffs)
    • changed read_csv to read_parquet to reflect the file format change on the TLC Trip Record Data website
    • created an iterate_df function to emulate the previous read_csv iteration, since there is no equivalent option for read_parquet (a rough sketch of this pattern appears after the comments list)
    • changed the while loop to a for loop so the StopIteration error is handled automatically
    opened by joeeoj 3
  • Week 2 - Task stuck at up_for_retry

    Hi,

    I tried to run data_ingestion_gcs_dag from the webserver, but the first task got stuck at the up_for_retry stage. When I checked the log, this is the only thing I found:

    [image attachment]

    However, I can still run this DAG via the Airflow CLI. I'm not sure what the problem is here: I built the image on two different machines, one worked fine, and one got stuck like this.

    opened by hoanghapham 3
  • Ingesting Data to GCP with Airflow - parquet update

    Hi guys,

    Really enjoying the course! Thanks for putting this together.

    I couldn't get data_ingestion_gcs_DAG.py to work, as the NYC taxi data has been updated to parquet format on the website. Here's my workaround; I left the comments in so you can easily see what I changed:

    • Changed the dataset_file variable to yellow_tripdata_2021-01.parquet
    • Commented out the format_to_parquet function
    • Commented out the format_to_parquet_task in the DAG declaration
    • Removed format_to_parquet_task from the workflow line at the end of the code

    Cheers, Martin

    import os
    import logging
    
    from airflow import DAG
    from airflow.utils.dates import days_ago
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator
    
    from google.cloud import storage
    from airflow.providers.google.cloud.operators.bigquery import BigQueryCreateExternalTableOperator
    import pyarrow.csv as pv
    import pyarrow.parquet as pq
    
    PROJECT_ID = os.environ.get("GCP_PROJECT_ID")
    BUCKET = os.environ.get("GCP_GCS_BUCKET")
    
    dataset_file = "yellow_tripdata_2021-01.parquet"
    # dataset_file = "yellow_tripdata_2021-01.csv"
    dataset_url = f"https://s3.amazonaws.com/nyc-tlc/trip+data/{dataset_file}"
    path_to_local_home = os.environ.get("AIRFLOW_HOME", "/opt/airflow/")
    # parquet_file = dataset_file.replace('.csv', '.parquet')
    parquet_file = dataset_file
    BIGQUERY_DATASET = os.environ.get("BQ_DATASET", 'trips_data_all')
    
    
    # this is not needed anymore since the file is already in parquet format
    # def format_to_parquet(src_file):
    #     if not src_file.endswith('.csv'):
    #         logging.error("Can only accept source files in CSV format, for the moment")
    #         return
    #     table = pv.read_csv(src_file)
    #     pq.write_table(table, src_file.replace('.csv', '.parquet'))
    
    
    # NOTE: takes 20 mins, at an upload speed of 800kbps. Faster if your internet has a better upload speed
    def upload_to_gcs(bucket, object_name, local_file):
        """
        Ref: https://cloud.google.com/storage/docs/uploading-objects#storage-upload-object-python
        :param bucket: GCS bucket name
        :param object_name: target path & file-name
        :param local_file: source path & file-name
        :return:
        """
        # WORKAROUND to prevent timeout for files > 6 MB on 800 kbps upload speed.
        # (Ref: https://github.com/googleapis/python-storage/issues/74)
        storage.blob._MAX_MULTIPART_SIZE = 5 * 1024 * 1024  # 5 MB
        storage.blob._DEFAULT_CHUNKSIZE = 5 * 1024 * 1024  # 5 MB
        # End of Workaround
    
        client = storage.Client()
        bucket = client.bucket(bucket)
    
        blob = bucket.blob(object_name)
        blob.upload_from_filename(local_file)
    
    
    default_args = {
        "owner": "airflow",
        "start_date": days_ago(1),
        "depends_on_past": False,
        "retries": 1,
    }
    
    # NOTE: DAG declaration - using a Context Manager (an implicit way)
    with DAG(
        dag_id="data_ingestion_gcs_dag",
        schedule_interval="@daily",
        default_args=default_args,
        catchup=False,
        max_active_runs=1,
        tags=['dtc-de'],
    ) as dag:
    
        download_dataset_task = BashOperator(
            task_id="download_dataset_task",
            bash_command=f"curl -sSL {dataset_url} > {path_to_local_home}/{dataset_file}"
        )
    
        # format_to_parquet_task = PythonOperator(
        #     task_id="format_to_parquet_task",
        #     python_callable=format_to_parquet,
        #     op_kwargs={
        #         "src_file": f"{path_to_local_home}/{dataset_file}",
        #     },
        # )
    
        # TODO: Homework - research and try XCOM to communicate output values between 2 tasks/operators
        local_to_gcs_task = PythonOperator(
            task_id="local_to_gcs_task",
            python_callable=upload_to_gcs,
            op_kwargs={
                "bucket": BUCKET,
                "object_name": f"raw/{parquet_file}",
                "local_file": f"{path_to_local_home}/{parquet_file}",
            },
        )
    
        bigquery_external_table_task = BigQueryCreateExternalTableOperator(
            task_id="bigquery_external_table_task",
            table_resource={
                "tableReference": {
                    "projectId": PROJECT_ID,
                    "datasetId": BIGQUERY_DATASET,
                    "tableId": "external_table",
                },
                "externalDataConfiguration": {
                    "sourceFormat": "PARQUET",
                    "sourceUris": [f"gs://{BUCKET}/raw/{parquet_file}"],
                },
            },
        )
    
        # download_dataset_task >> format_to_parquet_task >> local_to_gcs_task >> bigquery_external_table_task
        download_dataset_task >> local_to_gcs_task >> bigquery_external_table_task
    
    opened by MartyC-137 3
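
As a side note on the upload-data.py comment above ("update read_csv to read_parquet..."): since pandas' read_parquet has no chunked mode, one way to emulate the old chunked read_csv ingestion is to iterate over the parquet file in batches with pyarrow. This is only a rough sketch of that pattern, not the code from the PR; the connection string and table name are assumptions.

    # Rough sketch (not the PR's code): stream a parquet file into Postgres in
    # batches with pyarrow, emulating the old chunked read_csv ingestion.
    import pyarrow.parquet as pq
    from sqlalchemy import create_engine

    # assumed connection settings; adjust to your local Postgres container
    engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

    parquet_file = pq.ParquetFile("yellow_tripdata_2021-01.parquet")

    for batch in parquet_file.iter_batches(batch_size=100_000):
        df = batch.to_pandas()
        df.to_sql("yellow_taxi_data", con=engine, if_exists="append", index=False)
        print(f"inserted {len(df)} rows")
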
Owner

DataTalksClub: The place to talk about data