Containerized Demo of Apache Spark MLlib on a Data Lakehouse (2022)

Last update: Mar 21, 2022

Related tags

Data Analysis docker binder apache-spark jupyter-notebook delta-lake data-lakehouse

Overview

Spark-DeltaLake-Demo

Reliable, Scalable Machine Learning (2022)

This project was completed in an attempt to become better acquainted with the latest big data tools. Further details can be found on my blog here.

The world is producing an exponentially increasing amount of digital data, and the tools we use to derive insights from data are evolving just as rapidly.

In recent years, a new architecture called the Data Lakehouse has begun to gain prominence as an enterprise solution to storing and processing big data. This trend piqued my interest and led to my exploration of some of the key underlying technologies fueling the revolution.

Of particular focus are two open-source technologies: Delta Lake and Apache Spark. Delta Lake provides a metadata layer to data lakes, bringing ACID transaction guarantees and time travel to a heretofore messy approach to data science at scale. Apache Spark offers a distributed processing engine for a diverse set of workloads (e.g., SQL queries, machine learning, stream processing), which can be programmed in Python, R, Scala, etc.

It is my belief that these technologies―among several others further detailed on my blog―will play a major role in how businesses leverage the power of data going forward. As such, this research prepares me well to confront many emerging data engineering and data science challenges.

The demonstration linked below is deployed using the Binder service, which processes a Jupyter notebook in the cloud, based on a custom Docker image described by the supporting files in this repository.

Live Link:

Contained in this repository:

Jupyter notebook demonstrating Apache Spark and Delta Lake
Files to construct a custom Docker image deployed using Binder
- Dockerfile
- docker-compose.yml
- requirements.txt

You might also like...

Active Learning demo using two small datasets

ActiveLearningDemo How to run step one put the dataset folder and use command below to split the dataset to the required structure run utils.py For ea

3 Nov 10, 2021

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

3.7k Jan 3, 2023

Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.

Data lineage made simple, reliable, and automated. Effortlessly track the flow of data, understand dependencies and analyze impact. Features Visualiza

898 Jan 9, 2023

ICLR 2022 Paper submission trend analysis

Visualize ICLR 2022 OpenReview Data

75 Dec 6, 2022

[CVPR2022] This repository contains code for the paper "Nested Collaborative Learning for Long-Tailed Visual Recognition", published at CVPR 2022

Nested Collaborative Learning for Long-Tailed Visual Recognition This repository is the official PyTorch implementation of the paper in CVPR 2022: Nes

65 Dec 9, 2022

🧪 Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

🧪📈 🐍. The purpose of the panel-chemistry project is to make it really easy for you to do DATA ANALYSIS and build powerful DATA AND VIZ APPLICATIONS within the domain of Chemistry using using Python and HoloViz Panel.

97 Dec 8, 2022

fds is a tool for Data Scientists made by DAGsHub to version control data and code at once.

Fast Data Science, AKA fds, is a CLI for Data Scientists to version control data and code at once, by conveniently wrapping git and dvc

359 Dec 22, 2022

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set.

791 Jan 4, 2023

A data parser for the internal syncing data format used by Fog of World.

A data parser for the internal syncing data format used by Fog of World. The parser is not designed to be a well-coded library with good performance, it is more like a demo for showing the data structure.

40 Dec 12, 2022

Containerized Demo of Apache Spark MLlib on a Data Lakehouse (2022)

Related tags

Overview

Spark-DeltaLake-Demo

Reliable, Scalable Machine Learning (2022)

Live Link:

Contained in this repository:

You might also like...

Active Learning demo using two small datasets

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.

ICLR 2022 Paper submission trend analysis

[CVPR2022] This repository contains code for the paper "Nested Collaborative Learning for Long-Tailed Visual Recognition", published at CVPR 2022

🧪 Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

fds is a tool for Data Scientists made by DAGsHub to version control data and code at once.

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

A data parser for the internal syncing data format used by Fog of World.

Owner

BigDL - Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems

This mini project showcase how to build and debug Apache Spark application using Python

Sentiment analysis on streaming twitter data using Spark Structured Streaming & Python

Reading streams of Twitter data, save them to Kafka, then process with Kafka Stream API and Spark Streaming

Pyspark project that able to do joins on the spark data frames.

A real-time financial data streaming pipeline and visualization platform using Apache Kafka, Cassandra, and Bokeh.

Stream-Kafka-ELK-Stack - Weather data streaming using Apache Kafka and Elastic Stack.

The Spark Challenge Student Check-In/Out Tracking Script

Monitor the stability of a pandas or spark dataframe ⚙︎

Pandas and Spark DataFrame comparison for humans