Data Engineering ZoomCamp

Aaron

Last update: Jan 6, 2023

Overview

I'm partaking in a Data Engineering Bootcamp / Zoomcamp and will be tracking my progress here. I can't promise these notes will be neat and tidy, but I hope they can help anyone who is working through this bootcamp.

I'll aim to document any problems or errors I come across during my journey, and describe concepts that I found tricky.

Each week I'll work through a series of videos and follow this up with homework exercises.

The Task

The goal is to develop a data pipeline following the architecture below. We will be looking at New York City Taxi data!

Tools

We'll use a range of tools:

Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
Google Cloud Storage (GCS): Data Lake
BigQuery: Data Warehouse
Terraform: Infrastructure-as-Code (IaC)
Docker: Containerization
SQL: Data Analysis & Exploration
Airflow: Pipeline Orchestration
dbt: Data Transformation
Spark: Distributed Processing
Kafka: Streaming

Progress

Week 1

PostgreSQL | Terraform | Docker | Google Cloud Platform

This week was a lot of setup, and a lot of work! I was introduced to Docker, a platform for building and running containers. I created containers for PostgreSQL and pgAdmin, before finally building my own image which, when run, creates and populates tables within my PostgreSQL database.
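To make the "create and populate tables" step concrete, here is a minimal sketch of a chunked CSV ingestion script of the kind such an image might run. It is not the script from the course: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL so that it runs without a database server, and the table name, column names and sample rows are all made up for illustration.

```python
import csv
import io
import sqlite3
from itertools import islice

# Hypothetical sample rows standing in for the NYC taxi CSV; the real file
# has many more columns and rows.
SAMPLE_CSV = """pickup_datetime,passenger_count,trip_distance
2021-01-01 00:30:10,1,2.10
2021-01-01 00:51:20,1,0.20
2021-01-01 00:43:30,3,14.70
"""


def ingest(conn, csv_file, table="taxi_trips", chunk_size=2):
    """Create the table if needed, then insert the CSV rows in chunks."""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "pickup_datetime TEXT, passenger_count INTEGER, trip_distance REAL)"
    )
    reader = csv.DictReader(csv_file)
    total = 0
    while True:
        # Read at most chunk_size rows so a large file never sits fully in memory.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        conn.executemany(
            f"INSERT INTO {table} VALUES "
            "(:pickup_datetime, :passenger_count, :trip_distance)",
            chunk,
        )
        conn.commit()
        total += len(chunk)
    return total


# sqlite3 in-memory database used here as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
inserted = ingest(conn, io.StringIO(SAMPLE_CSV))
row_count = conn.execute("SELECT COUNT(*) FROM taxi_trips").fetchone()[0]
print(inserted, row_count)
```

Against a real PostgreSQL container the overall shape would stay the same, but the connection would come from a PostgreSQL driver and the placeholder syntax would differ.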

Next up I learned a bit about Google Cloud Platform (GCP), which is a suite of Google cloud computing resources. Here I set up a service account (more or less a user account for services running in GCP), and even set up a Virtual Machine and connected to it using SSH right from my terminal.

I was also introduced to Terraform, an infrastructure-as-code tool. I used this to provision BigQuery and Google Cloud Storage resources on GCP from a simple script.
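The idea behind infrastructure-as-code that I found helpful: you declare the resources you want, and the tool compares that declaration with what already exists to work out what to change. The toy sketch below illustrates that comparison in Python. It is not Terraform syntax and not how Terraform is implemented, and the resource names and settings are hypothetical.

```python
def plan(desired, current):
    """Return the actions needed to move `current` towards `desired`."""
    actions = []
    for name, config in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != config:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("destroy", name))
    return actions


# Hypothetical resource names and settings, for illustration only.
desired = {
    "gcs_bucket.data_lake": {"location": "EU"},
    "bigquery_dataset.trips": {"location": "EU"},
}
current = {
    "gcs_bucket.data_lake": {"location": "US"},
}

print(plan(desired, current))
```

Running the same plan again once the changes are applied would return an empty list, which is the property that makes a declarative script safe to re-run.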

I enjoyed this week, although it was heavy going, with a lot of late nights trying to understand new concepts and fix unexpected bugs. Although I'm by no means an expert in any of these tools, I do feel more confident in understanding and utilising them.

Week 2

This week I'm learning about Airflow!

Week 3: Pending...
Week 4: Pending...
Week 5: Pending...
Week 6: Pending...
