Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Jose A Dianes

Last update: Jan 2, 2023

Related tags

Machine Learning python data-science machine-learning big-data spark notebook ipython bigdata ipython-notebook pyspark mllib data-analysis

Overview

Spark Python Notebooks

This is a collection of IPython notebook/Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, by using the Python language.

If Python is not your language and R is, you may want to have a look at our R on Apache Spark (SparkR) notebooks instead. Additionally, if you are interested in an introduction to some basic Data Science Engineering, you might find this series of tutorials interesting. There we explain different concepts and applications using Python and R.

Instructions

A good way of using these notebooks is by first cloning the repo, and then starting your own IPython notebook/Jupyter in pySpark mode. For example, if we have a standalone Spark installation running on localhost with a maximum of 6 GB per node assigned to IPython:

MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="6G" IPYTHON_OPTS="notebook --pylab inline" ~/spark-1.5.0-bin-hadoop2.6/bin/pyspark

Notice that the path to the pyspark command will depend on your specific installation. As a requirement, you need to have Spark installed on the same machine where you are going to start the IPython notebook server.

For more Spark options see here. In general, an option described in the form spark.executor.memory is passed as SPARK_EXECUTOR_MEMORY when calling IPython/pySpark.
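The naming rule can be sketched in a few lines of Python. This only illustrates the convention as described here (upper-case the property name and replace dots with underscores). The helper name spark_property_to_env is hypothetical, and not every Spark property necessarily has an environment-variable counterpart.

```python
def spark_property_to_env(property_name):
    """Convert a dotted Spark property name to the upper-case,
    underscore-separated form (hypothetical helper, for illustration)."""
    return property_name.replace(".", "_").upper()


print(spark_property_to_env("spark.executor.memory"))  # SPARK_EXECUTOR_MEMORY
```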

Datasets

We will be using datasets from the KDD Cup 1999. The results of this competition can be found here.

References

The reference book for these and other Spark related topics is:

Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia.

Notebooks

The following notebooks can be examined individually, although there is a more or less linear 'story' when followed in sequence. They all use the same dataset to solve a related set of tasks.

RDD creation

About reading files and parallelize.

RDDs basics

A look at map, filter, and collect.

Sampling RDDs

RDD sampling methods explained.

RDD set operations

Brief introduction to some of the RDD pseudo-set operations.

Data aggregations on RDDs

RDD actions reduce, fold, and aggregate.
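As a rough mental model of how these three actions differ, the sketch below emulates them in plain Python over a list of "partitions". It is not PySpark code: the rdd_* function names and the two-partition layout are assumptions made for illustration, and only the argument order (zero value, then operators) mirrors the RDD methods.

```python
from functools import reduce

# Two lists stand in for the partitions of an RDD (illustrative only).
partitions = [[1, 2, 3], [4, 5]]


def rdd_reduce(parts, op):
    """Reduce each non-empty partition, then combine the partial results."""
    partials = [reduce(op, p) for p in parts if p]
    return reduce(op, partials)


def rdd_fold(parts, zero, op):
    """Like reduce, but every partition and the final merge start from zero."""
    partials = [reduce(op, p, zero) for p in parts]
    return reduce(op, partials, zero)


def rdd_aggregate(parts, zero, seq_op, comb_op):
    """seq_op folds elements into an accumulator within each partition;
    comb_op merges the per-partition accumulators."""
    partials = [reduce(seq_op, p, zero) for p in parts]
    return reduce(comb_op, partials, zero)


total = rdd_reduce(partitions, lambda a, b: a + b)
folded = rdd_fold(partitions, 0, lambda a, b: a + b)

# The accumulator is a (sum, count) pair, so its type differs from the elements.
sum_count = rdd_aggregate(
    partitions,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
mean = sum_count[0] / float(sum_count[1])
```

The point of aggregate is visible in the last call: because the accumulator can have a different type from the elements, a sum and a count can be carried together and turned into a mean in a single pass.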

Working with key/value pair RDDs

How to deal with key/value pairs in order to aggregate and explore data.

MLlib: Basic Statistics and Exploratory Data Analysis

A notebook introducing Local Vector types, basic statistics in MLlib for Exploratory Data Analysis and model selection.

MLlib: Logistic Regression

Labeled points and Logistic Regression classification of network attacks in MLlib. Application of model selection techniques using correlation matrix and Hypothesis Testing.

MLlib: Decision Trees

Use of tree-based methods and how they help with model explanation and feature selection.

Spark SQL: structured processing for Data Analysis

In this notebook a schema is inferred for our network interactions dataset. Based on that, we use Spark's SQL DataFrame abstraction to perform a more structured exploratory data analysis.

Applications

Beyond the basics. Close to real-world applications using Spark and other technologies.

Olssen: On-line Spectral Search ENgine for proteomics

Same tech stack this time with an AngularJS client app.

An on-line movie recommendation web service

This tutorial can be used independently to build a movie recommender model based on the MovieLens dataset. Most of the code in the first part, about how to use ALS with the public MovieLens dataset, comes from my solution to one of the exercises proposed in the CS100.1x Introduction to Big Data with Apache Spark course by Anthony D. Joseph on edX, which has also been publicly available since 2014 at Spark Summit.

There I've added minor modifications to use a larger dataset, as well as code showing how to store and reload the model for later use. On top of that we build a Flask web service so the recommender can be used to provide movie recommendations on-line.

KDD Cup 1999

My attempt at using Spark with this classic dataset and Knowledge Discovery competition.

Contributing

Contributions are welcome! For bug reports or requests please submit an issue.

Contact

Feel free to contact me to discuss any issues, questions, or comments.

Twitter: @ja_dianes
GitHub: jadianes
LinkedIn: jadianes
Website: jadianes.me

License

This repository contains a variety of content: some developed by Jose A. Dianes, and some from third parties. The third-party content is distributed under the license provided by those parties.

The content developed by Jose A. Dianes is distributed under the following license:

Copyright 2016 Jose A Dianes

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Comments

Website isn't working
Thanks for the tutorials! The domain of the website is probably expired and the .github.io link is routing to that domain too.

Possible solutions:

Renew the domain subscription

Cancel the alias or record that's causing the GitHub page to go to the custom domain
opened by ammarasmro 1
spark context

I had an issue with the command line

$ MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="1G" IPYTHON_OPTS="notebook --pylab inline" /home/philippe/Downloads/spark-master/bin/pyspark

The error was

Connection refused: /127.0.0.1:7077

and it was resolved with

$ MASTER=local[4] SPARK_EXECUTOR_MEMORY="1G" IPYTHON_OPTS="notebook --pylab inline" /home/philippe/Downloads/spark-master/bin/pyspark

Maybe you could say a word in the README about it.

Otherwise, great notebooks and great help. Thank you!
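One way to avoid the "Connection refused" failure is to check whether anything accepts connections on the master port before launching, and to fall back to local mode otherwise. The following is a minimal sketch under that assumption. The helper name choose_master is hypothetical, and an open port only shows that something is listening, not that it is a Spark master.

```python
import socket


def choose_master(host="127.0.0.1", port=7077, fallback="local[4]", timeout=1.0):
    """Return a spark:// URL if something accepts TCP connections on host:port,
    otherwise return the local-mode fallback (hypothetical helper)."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Nothing is listening (or the host is unreachable): use local mode.
        return fallback
    conn.close()
    return "spark://{}:{}".format(host, port)


print(choose_master())
```

The returned string could then be used as the value of MASTER when starting pyspark.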

opened by PChiberre 1
Add a Gitter chat badge to README.md

jadianes/spark-py-notebooks now has a Chat Room on Gitter

@jadianes has just created a chat room. You can visit it here: https://gitter.im/jadianes/spark-py-notebooks.

This pull-request adds this badge to your README.md:

If my aim is a little off, please let me know.

Happy chatting.

opened by gitter-badger 0
[bug] About nb10-sql-dataframes.ipynb (DF.map→RDD.map)
@jadianes Hello, I'm Hiroyuki. Nice tutorial, thank you!

In[7]

tcp_interactions_out = tcp_interactions.map(lambda p: "Duration: {}, Dest. bytes: {}".format(p.duration, p.dst_bytes))
for ti_out in tcp_interactions_out.collect():
    print ti_out

But map is only available on RDDs, so I think we need to convert tcp_interactions (a DataFrame) to an RDD first.

Here is a sample:

tcp_interactions_out = tcp_interactions.rdd.map(lambda p: "Duration: {}, Dest. bytes: {}".format(p.duration, p.dst_bytes))
for ti_out in tcp_interactions_out.collect():
    print ti_out

What do you think about it?

If there is a mistake in my code or in my wording, sorry (because I'm not good at writing English). Please forgive me if I make you feel bad.
opened by Hiroyuki93 0
Question: deploying a PySpark MLlib model on Docker, but the performance is below expectations

Env: Spark standalone on Docker

Case: a trained PySpark model (random forest) deployed on Docker

Question: When I use gunicorn to start the service (model loading and prediction) and expose it as an API with the Python Flask framework, calls to the API seem pretty slow.

Could I get your help or any suggestions on spark model deployment? Thanks!

opened by robotsp 0
Integrate with k8s

Amazing resource - thank you. I've cross-posted an issue at https://github.com/SnappyDataInc/spark-on-k8s/issues/24, but in summary: How would I get these jupyter notebooks running on spark-on-k8s?

Thanks again

opened by jtlz2 0
urllib module in nb1-rdd-creation

I think that for Python 3.x users the urllib module has been split into several modules, so from urllib.request import urlretrieve would make more sense. Possibly update the notebook if you think it is needed.
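For readers on either Python version, one common way to handle the split is a guarded import. This is a minimal sketch. The commented-out URL and file name are placeholders, not taken from the notebook.

```python
try:
    # Python 3: urlretrieve lives in urllib.request
    from urllib.request import urlretrieve
except ImportError:
    # Python 2: urlretrieve is a top-level function of urllib
    from urllib import urlretrieve

# Hypothetical usage; the URL and file name are placeholders:
# urlretrieve("http://example.org/data.gz", "data.gz")
```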

opened by kmr0877 0
Apparent Memory Issues

juyptererror.txt commandprompt.txt commandprompterror.txt

Hi - I am a student attempting to learn how to use PySpark/Jupyter to build classification models for large data. I installed PySpark v2.2.1 and Jupyter as per the tutorial on the Medium website by Michael Galarnyk. It seemed to install OK and I was able to run your first notebook. However, in the second notebook, nb2-rdd-basics, I had problems with the "collect" code:

from time import time
t0 = time()
head_rows = csv_data.take(100000)
tt = time() - t0
print "Parse completed in {} seconds".format(round(tt,3))

Thinking it was a memory issue, I then launched Jupyter with the command

pyspark --master local[4] --driver-memory 32g --executor-memory 32g

I have attached the Jupyter error and the command prompt data before and after the error. Please help - how do I increase memory in the kernel?

opened by johnbutler123 0

Owner

Jose A Dianes

Principal Data Scientist at Mosaic Therapeutics.

GitHub http://jadianes.github.io/spark-py-notebooks
