This mini project showcase how to build and debug Apache Spark application using Python

Denny Imanuel

Last update: Dec 29, 2021

Related tags

Data Analysis SparkPython

Overview

Spark Python

by Denny Imanuel

This mini project showcase how to build and debug Apache Spark application using Python programming language. There are also options to run Spark application on Spark container

Spark on Localhost

Requirement

PyCharm IDE - You need to install PyCharm IDE
Java JDK - You need to install Java JDK and set JAVA_HOME env
Python - You need to install Python and set PYTHONPATH env
Spark Hadoop - You need to install Spark Hadoop and set HADOOP_HOME and SPARK_HOME env

For more info: https://dotnet.microsoft.com/en-us/learn/data/spark-tutorial/install-spark

Run Config

To run Spark app run Spark Submit command or create a new 'Run Config' under Shell Script as follows:

\SparkPython\venv\Scripts\python.exe" spark-submit --class SparkPython SparkPython.py">

set PYSPARK_PYTHON "
    
     \SparkPython\venv\Scripts\python.exe"
spark-submit --class SparkPython SparkPython.py

Build Config

To build Spark app run Spark Submit command or create a new 'Build Config' under Python Debug Server as follows:

venv\Scripts\activate
pip install pydevd-pycharm~=

Debug Config

To debug Spark app create 'Debug Config' using standard Python configuration file and then insert following code. In order to debug run above 'Build Config' first, set breakpoint, and then run this 'Debug Config':

import pydevd_pycharm
pydevd_pycharm.settrace('localhost', port=8888, stdoutToServer=True, stderrToServer=True)

Spark on Docker

Requirement

Rider IDE / Visual Studio - You need to install Rider IDE or Visual Studio
Docker Desktop - You need to install Docker Desktop to run Docker
Spark Image - Make sure you pull same version of Spark image as your local Spark:

docker pull bitnami/spark:3.1.2

Spark Clusters

Docker Compose below will run Spark cluster in master and worker node. First comment the debug line(6,7) and then pack the venv folder into venv.tar.gz and then submit both SparkPython.py file and venv.tar.gz to Spark cluster.

docker-compose up
spark-submit --master spark://localhost:7070 --class SparkPython SparkPython.py --archives venv.tar.gz

Output Result

If the Spark application is successfully build it should print out result table as follows:

You might also like...

Stream-Kafka-ELK-Stack - Weather data streaming using Apache Kafka and Elastic Stack.

Streaming Data Pipeline - Kafka + ELK Stack Streaming weather data using Apache Kafka and Elastic Stack. Data source: https://openweathermap.org/api O

2 Jan 20, 2022

🧪 Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

🧪📈 🐍. The purpose of the panel-chemistry project is to make it really easy for you to do DATA ANALYSIS and build powerful DATA AND VIZ APPLICATIONS within the domain of Chemistry using using Python and HoloViz Panel.

97 Dec 8, 2022

Educational project on how to build an ETL (Extract, Transform, Load) data pipeline, orchestrated with Airflow.

ETL Pipeline with Airflow, Spark, s3, MongoDB and Amazon Redshift

214 Jan 2, 2023

In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift.

ETL Pipeline for AWS Project Description In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift. The data is loaded from S3 t

1 Nov 1, 2021

simple way to build the declarative and destributed data pipelines with python

unipipeline simple way to build the declarative and distributed data pipelines. Why you should use it Declarative strict config Scaffolding Fully type

0 Jan 26, 2022

Tablexplore is an application for data analysis and plotting built in Python using the PySide2/Qt toolkit.

81 Dec 26, 2022

Weather Image Recognition - Python weather application using series of data

1 Feb 4, 2022

Recommendations from Cramer: On the show Mad-Money (CNBC) Jim Cramer picks stocks which he recommends to buy. We will use this data to build a portfolio

Backtesting the "Cramer Effect" & Recommendations from Cramer Recommendations from Cramer: On the show Mad-Money (CNBC) Jim Cramer picks stocks which

12 Aug 30, 2022

Streamz helps you build pipelines to manage continuous streams of data

Streamz helps you build pipelines to manage continuous streams of data. It is simple to use in simple cases, but also supports complex pipelines that involve branching, joining, flow control, feedback, back pressure, and so on.

1.1k Dec 28, 2022

This mini project showcase how to build and debug Apache Spark application using Python

Related tags

Overview

Spark Python

Spark on Localhost

Requirement

Run Config

Build Config

Debug Config

Spark on Docker

Requirement

Spark Clusters

Output Result

You might also like...

Stream-Kafka-ELK-Stack - Weather data streaming using Apache Kafka and Elastic Stack.

🧪 Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

Educational project on how to build an ETL (Extract, Transform, Load) data pipeline, orchestrated with Airflow.

In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift.

simple way to build the declarative and destributed data pipelines with python

Tablexplore is an application for data analysis and plotting built in Python using the PySide2/Qt toolkit.

Weather Image Recognition - Python weather application using series of data

Recommendations from Cramer: On the show Mad-Money (CNBC) Jim Cramer picks stocks which he recommends to buy. We will use this data to build a portfolio

Streamz helps you build pipelines to manage continuous streams of data

Owner

Denny Imanuel

Building house price data pipelines with Apache Beam and Spark on GCP

A data analysis using python and pandas to showcase trends in school performance.

BigDL - Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems

Pyspark project that able to do joins on the spark data frames.

Sentiment analysis on streaming twitter data using Spark Structured Streaming & Python

Reading streams of Twitter data, save them to Kafka, then process with Kafka Stream API and Spark Streaming

Pandas and Spark DataFrame comparison for humans

A real-time financial data streaming pipeline and visualization platform using Apache Kafka, Cassandra, and Bokeh.

The Spark Challenge Student Check-In/Out Tracking Script

Monitor the stability of a pandas or spark dataframe ⚙︎