Tutorial on scikit-learn and IPython for parallel machine learning

Olivier Grisel

Last update: Dec 26, 2022

Overview

Parallel Machine Learning with scikit-learn and IPython

A video recording of this tutorial was made at PyCon in 2013. The tutorial material has since been partly rearranged and extended. Look at the titles of the notebooks to follow along with the presentation.

Browse the static notebooks on nbviewer.ipython.org.

Scope of this tutorial:

Learn common machine learning concepts and how they match the scikit-learn Estimator API.
Learn about scalable feature extraction for text classification and clustering.
Learn how to perform cross-validation and hyperparameter grid search in parallel with IPython.
Learn to analyze the kinds of common errors predictive models are subject to and how to refine your modeling to take this analysis into account.
Learn to optimize memory allocation on your computing nodes with numpy memory mapping features.
Learn how to run a cheap IPython cluster for interactive predictive modeling on the Amazon EC2 spot instances using StarCluster.
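
As a small illustration of the Estimator API mentioned in the first item above, here is a minimal sketch of the fit / predict / score pattern. The dataset (iris), the model (LogisticRegression) and its parameters are choices made for this example only, not taken from the tutorial notebooks. It assumes a recent scikit-learn where train_test_split lives in sklearn.model_selection; older releases such as 0.15 exposed it in sklearn.cross_validation instead.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small dataset bundled with scikit-learn (no download needed).
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples to evaluate the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Every estimator follows the same pattern: construct, fit, then predict.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

predictions = model.predict(X_test)     # predicted class labels
accuracy = model.score(X_test, y_test)  # mean accuracy on the held-out data
print("held-out accuracy: %.3f" % accuracy)
```

Swapping in another estimator should only require changing the line that constructs the model, since fit, predict and score keep the same signatures.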

Target audience

This tutorial targets developers with some experience with scikit-learn and machine learning concepts in general.

It is recommended to first go through one of the tutorials hosted at scikit-learn.org if you are new to scikit-learn.

You might also want to have a look at the SciPy Lecture Notes first if you are new to the NumPy / SciPy / matplotlib ecosystem.

Setup

Install NumPy, SciPy, matplotlib, IPython, psutil, and scikit-learn in their latest stable version (e.g. IPython 2.2.0 and scikit-learn 0.15.2 at the time of writing).

You can find up-to-date installation instructions on scikit-learn.org and ipython.org.

To check your installation, launch the ipython interactive shell in a console and type the following import statements to check each library:

>>> import numpy
>>> import scipy
>>> import matplotlib
>>> import psutil
>>> import sklearn

If you don't get any message, everything is fine. If you get an error message, please ask for help on the mailing list of the matching project and don't forget to mention the version of the library you are trying to install along with the type of platform and version (e.g. Windows 8.1, Ubuntu 14.04, OSX 10.9...).
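
When asking for help, it can be convenient to collect the installed versions and the platform in one go. The following is a minimal sketch of how that could be done. The helper name report_versions and the LIBRARIES list are hypothetical names introduced for this example and are not part of the tutorial.

```python
import importlib
import platform

# Import names of the libraries listed in the Setup section.
LIBRARIES = ["numpy", "scipy", "matplotlib", "IPython", "psutil", "sklearn"]


def report_versions(names=LIBRARIES):
    """Return a dict mapping each import name to its version string."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
        else:
            # Most packages expose __version__; fall back if one does not.
            versions[name] = str(getattr(module, "__version__", "unknown"))
    return versions


print("platform:", platform.platform())
for name, version in report_versions().items():
    print(name, version)
```

The printed lines can be pasted as-is into a mailing list message.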

You can exit the ipython shell by typing exit.

Fetching the data

It is recommended to fetch the datasets before diving into the tutorial material itself. To do so, run the fetch_data.py script in this folder:

python fetch_data.py

Using the IPython notebook to follow the tutorial

The tutorial material and exercises are hosted in a set of IPython executable notebook files.

To run them interactively do:

$ cd notebooks
$ ipython notebook

This should automatically open a new browser window listing all the notebooks of the folder.

You can then execute the cells in order by hitting the "Shift-Enter" keys. The output is displayed directly under each cell and the cursor moves on to the next cell. Go to the "Help" menu for links to the notebook tutorial.

Credits

Some of this material is adapted from the scipy 2013 tutorial:

http://github.com/jakevdp/sklearn_scipy2013

Original authors:

Gael Varoquaux @GaelVaroquaux | http://gael-varoquaux.info
Jake VanderPlas @jakevdp | http://jakevdp.github.com
Olivier Grisel @ogrisel | http://ogrisel.com

Comments

Text Feature Extraction Notebook - Cell 2

When running the line X_train = vectorizer.fit_transform(twenty_train_small.data)

I get a ValueError thrown. Is this a general issue in this notebook, or is it just me? I have attached the stack trace as well.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-3cf347c25e00> in <module>()
     17 # Turn the text documents into vectors of word frequencies
     18 vectorizer = TfidfVectorizer(min_df=2)
---> 19 X_train = vectorizer.fit_transform(twenty_train_small.data)
     20 y_train = twenty_train_small.target
     21 

/Users/adamwalz/.virtualenvs/scikit_learn/lib/python2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
   1236         # X is already a transformed view of raw_documents so
   1237         # we set copy to False
-> 1238         return self._tfidf.transform(X, copy=False)
   1239 
   1240     def transform(self, raw_documents, copy=True):

/Users/adamwalz/.virtualenvs/scikit_learn/lib/python2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/feature_extraction/text.pyc in transform(self, X, copy)
   1008 
   1009         if self.norm:
-> 1010             X = normalize(X, norm=self.norm, copy=False)
   1011 
   1012         return X

/Users/adamwalz/.virtualenvs/scikit_learn/lib/python2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/preprocessing/data.pyc in normalize(X, norm, axis, copy)
    540             inplace_csr_row_normalize_l1(X)
    541         elif norm == 'l2':
--> 542             inplace_csr_row_normalize_l2(X)
    543     else:
    544         if norm == 'l1':

/Users/adamwalz/.virtualenvs/scikit_learn/lib/python2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/utils/sparsefuncs.so in sklearn.utils.sparsefuncs.inplace_csr_row_normalize_l2 (sklearn/utils/sparsefuncs.c:2714)()

ValueError: Buffer dtype mismatch, expected 'int' but got 'long'

opened by adamwalz 6
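
One plausible cause of an error like the one above, offered here as an assumption rather than a confirmed diagnosis, is that the index arrays of the sparse matrix are 64-bit integers while the compiled routine expects 32-bit ones. The sketch below shows how the dtypes of a SciPy CSR matrix could be inspected and its index arrays cast to int32. The helper names index_dtypes and with_int32_indices are hypothetical, and the tiny matrix is made up for the example.

```python
import numpy as np
from scipy import sparse


def index_dtypes(X):
    """Return the dtypes of the three arrays backing a CSR matrix."""
    return {"data": X.data.dtype, "indices": X.indices.dtype, "indptr": X.indptr.dtype}


def with_int32_indices(X):
    """Return a copy of X whose index arrays are 32-bit integers.

    Only safe when the shape and the number of stored values fit in int32.
    """
    X = sparse.csr_matrix(X, copy=True)
    X.indices = X.indices.astype(np.int32)
    X.indptr = X.indptr.astype(np.int32)
    return X


# Toy matrix with index arrays forced to 64-bit, to mimic the suspected situation.
X = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]))
X.indices = X.indices.astype(np.int64)
X.indptr = X.indptr.astype(np.int64)
print(index_dtypes(X))

X32 = with_int32_indices(X)
print(index_dtypes(X32))
```

If the dtypes reported for the real training matrix turn out to be 32-bit already, the cause lies elsewhere and this cast will not help.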

Fix couple of typos.

Hi, I was going through your tutorial at Pycon 2015 and noticed a couple of minor typos. :)

Thanks for the talk by the way, really helpful to get up to speed with scikit learn as well as pandas :-)

opened by abhinav-upadhyay 1

small typo correction

Thanks for making this available!!

Going through "01 - An Introduction to Predictive Modeling in Python L..."

Below, "Let us not forget to imput the median age for passengers without age information:", I found a typo.

rich_features_final = features.fillna(features.dropna().median())

should be

rich_features_final = rich_features.fillna(rich_features.dropna().median())

rich_features_final.head(5)
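
To make the effect of the corrected line concrete, here is a minimal sketch on a made-up table. The column names and values are toy data invented for this example and are not the dataset used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's feature table, with one missing age.
rich_features = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# Per-column medians, computed only on rows without any missing value.
medians = rich_features.dropna().median()

# fillna with a Series fills each column with the matching median.
rich_features_final = rich_features.fillna(medians)
print(rich_features_final.head(5))
```

Note that dropna() discards every row containing a missing value before the medians are computed, so the result can differ from calling median() directly, which skips missing values column by column.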

opened by aaelony 1

Add input cells to be able to render in nbviewer

This adds the input field to the JSON/ipynb.

To fix these, I simply opened these in IPython notebook and re-saved them.

I'm currently going through to close issues on nbviewer, many of which are about specific notebooks. This one specifically comes from https://github.com/ipython/nbviewer/issues/54.

opened by rgbkrk 1

pylab module was replaced

Hi Olivier,

In the 6th IPython notebook of the tutorial, the plot was invisible. The pylab module was replaced by matplotlib.pyplot in model_selection.py, and IPython.parallel was replaced by ipyparallel.

Also, on my Mac, !ipcluster did not work well in the notebook, so I activate and deactivate it in a terminal instead. I don't know why this happens, though.

Thank you again for good tutorials. Namshik
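
For readers unfamiliar with the change described above, here is a minimal sketch of plotting through matplotlib.pyplot rather than pylab. The data and labels are invented for the example and do not reproduce the content of model_selection.py. The Agg backend is selected only so that the sketch runs without a display. In a notebook one would normally drop that line and use the %matplotlib inline magic so that the figure shows up under the cell.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, chosen so this runs headless

import matplotlib.pyplot as plt
import numpy as np

# Made-up curve standing in for a real validation plot.
x = np.linspace(0.0, 1.0, 50)

fig, ax = plt.subplots()
ax.plot(x, x ** 2, label="x squared")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.legend()
```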

opened by physhik 0

Learning curve

Hi Olivier, I found your GitHub while studying ipyparallel. Thank you for the nice tutorials. I tried something on your code, and I hope you don't mind. Thanks, Namshik

opened by physhik 0

00 - Tutorial Setup Deprecation warning

Code:

import IPython.parallel

Warning:

C:\anaconda\lib\site-packages\IPython\parallel.py:13: ShimWarning: The `IPython.parallel` package has been deprecated. You should import from ipyparallel instead.

"You should import from ipyparallel instead.", ShimWarning)
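
One way to avoid this warning while still supporting older installations is to try the new package first and fall back to the old location. This is a minimal sketch. The helper name import_parallel_client is hypothetical, and whether the fallback import works depends on the installed IPython version.

```python
import warnings


def import_parallel_client():
    """Return (Client, source_module_name), or (None, None) if unavailable."""
    try:
        # Preferred location, as recommended by the warning message.
        from ipyparallel import Client
        return Client, "ipyparallel"
    except ImportError:
        pass
    try:
        # Legacy location; silence the shim warning it may emit.
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            from IPython.parallel import Client
        return Client, "IPython.parallel"
    except ImportError:
        return None, None


Client, source = import_parallel_client()
print("parallel Client imported from:", source)
```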

opened by BenjaminKay 0

01 - Introduction second cell gives deprecation warning

Code:

# Import the example plot from the figures directory
from figures import plot_sgd_separator
plot_sgd_separator()

Gives the warning:

C:\anaconda\lib\site-packages\sklearn\utils\validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

DeprecationWarning)

The warning is repeated many times. The figure does load properly despite these warnings.
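
The two reshapes suggested by the warning message can be sketched with NumPy alone. The array values are made up for the example.

```python
import numpy as np

# A 1d array, the kind of input the warning complains about.
x = np.array([1.0, 2.0, 3.0])

# Three samples with a single feature each: one column.
as_single_feature = x.reshape(-1, 1)

# A single sample with three features: one row.
as_single_sample = x.reshape(1, -1)

print(as_single_feature.shape, as_single_sample.shape)
```

Which of the two is right depends on what the 1d array represents, so the choice has to be made at each call site that triggers the warning.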

opened by BenjaminKay 1

Owner

Olivier Grisel

Machine Learning Engineer at Inria Saclay (Parietal team).

GitHub

Python package for Bayesian Machine Learning with scikit-learn API

Python package for Bayesian Machine Learning with scikit-learn API Installing & Upgrading package pip install https://github.com/AmazaspShumik/sklearn

482 Jan 4, 2023

scikit-learn: machine learning in Python

scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license. The project was started

52.5k Jan 8, 2023

This repository is related to an Arabic tutorial, within the tutorial we discuss the common data structure and algorithms and their worst and best case for each, then implement the code using Python.

Data Structure and Algorithms with Python This repository is related to the Arabic tutorial here, within the tutorial we discuss the common data struc

33 Dec 2, 2022

Using python and scikit-learn to make stock predictions

MachineLearningStocks in python: a starter project and guide EDIT as of Feb 2021: MachineLearningStocks is no longer actively maintained MachineLearni

1.3k Dec 29, 2022

Regression Metrics Calculation Made easy for tensorflow2 and scikit-learn

Regression Metrics Installation To install the package from the PyPi repository you can execute the following command: pip install regressionmetrics I

11 Dec 16, 2022

A real-time speech emotion recognition application using Scikit-learn and gradio

Speech-Emotion-Recognition-App A real-time speech emotion recognition application using Scikit-learn and gradio. Requirements librosa==0.6.3 numpy sou

6 Oct 4, 2022

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)

Welcome to the PaddlePaddle GitHub. PaddlePaddle, as the only independent R&D deep learning platform in China, has been officially open

19.4k Jan 4, 2023

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

H2O H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Fl

6.1k Jan 5, 2023

A scikit-learn compatible neural network library that wraps PyTorch

Scikit-learn compatible estimation of general graphical models

scikit-learn inspired API for CRFsuite

Genetic Programming in Python, with a scikit-learn inspired API

Genetic feature selection module for scikit-learn

Use evolutionary algorithms instead of gridsearch in scikit-learn

SigOpt wrappers for scikit-learn methods

A scikit-learn-compatible module for estimating prediction intervals.

Convert scikit-learn models to PyTorch modules