Tutorial on scikit-learn and IPython for parallel machine learning

Overview

Parallel Machine Learning with scikit-learn and IPython

Video Tutorial

A video recording of this tutorial was made at PyCon in 2013. The tutorial material has since been rearranged in part and extended. Look at the titles of the notebooks to follow along with the presentation.

Browse the static notebooks on nbviewer.ipython.org.

Scope of this tutorial:

  • Learn common machine learning concepts and how they match the scikit-learn Estimator API.

  • Learn about scalable feature extraction for text classification and clustering.

  • Learn how to perform parallel cross-validation and hyperparameter grid search with IPython.

  • Learn to analyze the kinds of common errors predictive models are subject to and how to refine your modeling to take this analysis into account.

  • Learn to optimize memory allocation on your computing nodes with numpy memory mapping features.

  • Learn how to run a cheap IPython cluster for interactive predictive modeling on the Amazon EC2 spot instances using StarCluster.
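As a rough illustration of the memory mapping item in the list above, here is a minimal sketch. It assumes a hypothetical array of random features and a temporary file named `features.npy`; neither comes from the tutorial notebooks. Loading a saved array with `mmap_mode` lets NumPy read pages from disk on demand, and several processes on the same machine that map the same file can rely on the operating system's page cache rather than each holding a private copy.

```python
import os
import tempfile

import numpy as np

# Hypothetical data for illustration: 1000 samples with 50 features each.
rng = np.random.RandomState(0)
data = rng.rand(1000, 50)

# Save the array to disk once (the file name is an arbitrary choice).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "features.npy")
np.save(path, data)

# Re-open it as a read-only memory map: the content is not copied into
# memory up front, pages are loaded lazily when they are accessed.
shared = np.load(path, mmap_mode="r")

print(type(shared).__name__)
print(shared.shape)
```

The memory-mapped array can be sliced and passed to most NumPy functions like a regular array; opening it read-only guards against workers accidentally modifying the shared file.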

Target audience

This tutorial targets developers with some experience with scikit-learn and machine learning concepts in general.

It is recommended to first go through one of the tutorials hosted at scikit-learn.org if you are new to scikit-learn.

You might also want to have a look at the SciPy Lecture Notes first if you are new to the NumPy / SciPy / matplotlib ecosystem.

Setup

Install NumPy, SciPy, matplotlib, IPython, psutil, and scikit-learn in their latest stable version (e.g. IPython 2.2.0 and scikit-learn 0.15.2 at the time of writing).

You can find up-to-date installation instructions on scikit-learn.org and ipython.org.

To check your installation, launch the ipython interactive shell in a console and type the following import statements to check each library:

>>> import numpy
>>> import scipy
>>> import matplotlib
>>> import psutil
>>> import sklearn

If you don't get any message, everything is fine. If you get an error message, please ask for help on the mailing list of the matching project and don't forget to mention the version of the library you are trying to install along with the type of platform and version (e.g. Windows 8.1, Ubuntu 14.04, OSX 10.9...).

You can exit the ipython shell by typing exit.
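Since a request for help should mention library versions and the platform, a small script can collect that information in one go. The following is only a sketch: the helper names `installed_versions` and `format_report` are made up for this example, and it assumes each library exposes a `__version__` attribute (it falls back to "unknown" otherwise).

```python
import importlib
import platform

# Import names of the libraries listed in the Setup section.
LIBRARIES = ["numpy", "scipy", "matplotlib", "IPython", "psutil", "sklearn"]


def installed_versions(names):
    """Map each import name to its version string, or None if the import fails."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Not every module defines __version__, so fall back to "unknown".
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


def format_report(versions):
    """Build a short text report suitable for pasting into a help request."""
    lines = [
        "Platform: " + platform.platform(),
        "Python: " + platform.python_version(),
    ]
    for name, version in versions.items():
        status = version if version is not None else "NOT INSTALLED"
        lines.append("{}: {}".format(name, status))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report(installed_versions(LIBRARIES)))
```

Any library reported as NOT INSTALLED is one whose import statement would have failed in the interactive check above.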

Fetching the data

It is recommended to fetch the datasets before diving into the tutorial material itself. To do so, run the fetch_data.py script in this folder:

python fetch_data.py

Using the IPython notebook to follow the tutorial

The tutorial material and exercises are hosted in a set of IPython executable notebook files.

To run them interactively do:

$ cd notebooks
$ ipython notebook

This should automatically open a new browser window listing all the notebooks of the folder.

You can then execute the cells in order by pressing "Shift-Enter": the output is displayed directly under each cell and the cursor moves on to the next cell. Go to the "Help" menu for links to the notebook tutorial.

Credits

Some of this material is adapted from the SciPy 2013 tutorial:

http://github.com/jakevdp/sklearn_scipy2013

Original authors:

Comments
  • Text Feature Extraction Notebook - Cell 2

    When running the line X_train = vectorizer.fit_transform(twenty_train_small.data)

    I get a ValueError thrown. Is this a general issue in this notebook or is it just me. I have attached the stack trace as well.

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-2-3cf347c25e00> in <module>()
         17 # Turn the text documents into vectors of word frequencies
         18 vectorizer = TfidfVectorizer(min_df=2)
    ---> 19 X_train = vectorizer.fit_transform(twenty_train_small.data)
         20 y_train = twenty_train_small.target
         21 
    
    /Users/adamwalz/.virtualenvs/scikit_learn/lib/python2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
       1236         # X is already a transformed view of raw_documents so
       1237         # we set copy to False
    -> 1238         return self._tfidf.transform(X, copy=False)
       1239 
       1240     def transform(self, raw_documents, copy=True):
    
    /Users/adamwalz/.virtualenvs/scikit_learn/lib/python2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/feature_extraction/text.pyc in transform(self, X, copy)
       1008 
       1009         if self.norm:
    -> 1010             X = normalize(X, norm=self.norm, copy=False)
       1011 
       1012         return X
    
    /Users/adamwalz/.virtualenvs/scikit_learn/lib/python2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/preprocessing/data.pyc in normalize(X, norm, axis, copy)
        540             inplace_csr_row_normalize_l1(X)
        541         elif norm == 'l2':
    --> 542             inplace_csr_row_normalize_l2(X)
        543     else:
        544         if norm == 'l1':
    
    /Users/adamwalz/.virtualenvs/scikit_learn/lib/python2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/utils/sparsefuncs.so in sklearn.utils.sparsefuncs.inplace_csr_row_normalize_l2 (sklearn/utils/sparsefuncs.c:2714)()
    
    ValueError: Buffer dtype mismatch, expected 'int' but got 'long'
    
    opened by adamwalz 6
  • Fix couple of typos.

    Hi, I was going through your tutorial at Pycon 2015 and noticed couple of minor typos. :)

    Thanks for the talk by the way, really helpful to get up to speed with scikit learn as well as pandas :-)

    opened by abhinav-upadhyay 1
  • small typo correction

    Thanks for making this available!!

    Going through "01 - An Introduction to Predictive Modeling in Python L..."

    Below, "Let us not forget to imput the median age for passengers without age information:", I found a typo.

    rich_features_final = features.fillna(features.dropna().median())

    should be

    rich_features_final = rich_features.fillna(rich_features.dropna().median())

    rich_features_final.head(5)

    opened by aaelony 1
  • Add input cells to be able to render in nbviewer

    This adds the input field to the JSON/ipynb.

    To fix these, I simply opened these in IPython notebook and re-saved them.

    I'm currently going through to close issues on nbviewer, many of which are about specific notebooks. This one specifically comes from https://github.com/ipython/nbviewer/issues/54.

    opened by rgbkrk 1
  • pylab module was replaced

    Hi Olivier,

    In the 6th ipython notebook tutorial, the plot was invisible. pylab module was replaced by matplotlib.pyplot in model_selection.py, and IPython.parallel was replaced by ipyparallel.

    And in my mac, !ipcluster did not work well in the notebook. I activate and deactivate it in terminal instead. I don't know why it happens though.

    Thank you again for good tutorials. Namshik

    opened by physhik 0
  • Learning curve

    Hi Olivier, I found your Github during studying ipyparallel. Thank you for the nice tutorials. I tried something on your code, and hope you don't mind it. Thanks, Namshik

    opened by physhik 0
  • 00 - Tutorial Setup Deprecation warning

    Code:

    import IPython.parallel
    

    Warning:

    C:\anaconda\lib\site-packages\IPython\parallel.py:13: ShimWarning: The `IPython.parallel` package has been deprecated. You should import from ipyparallel instead.
      "You should import from ipyparallel instead.", ShimWarning)

    opened by BenjaminKay 0
  • 01 - Introduction second cell gives a deprecation warning

    Code:

    # Import the example plot from the figures directory
    from figures import plot_sgd_separator
    plot_sgd_separator()
    

    Gives the warning:

    C:\anaconda\lib\site-packages\sklearn\utils\validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
      DeprecationWarning)

    The error is repeated many times. The figure does load properly despite these warnings.

    opened by BenjaminKay 1
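The reshape advice quoted in the last warning can be illustrated with a small NumPy sketch. The array values here are made up for the example and are not taken from the `plot_sgd_separator` figure code.

```python
import numpy as np

# A 1d array of shape (4,), the kind of input the warning complains about.
values = np.array([1.0, 2.0, 3.0, 4.0])

# Interpret the values as 4 samples with a single feature each.
as_samples = values.reshape(-1, 1)

# Interpret the values as a single sample with 4 features.
as_single_sample = values.reshape(1, -1)

print(as_samples.shape, as_single_sample.shape)
```

Which of the two forms is appropriate depends on what the values represent: scikit-learn expects 2d input with one row per sample and one column per feature.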
Owner
Olivier Grisel
Machine Learning Engineer at Inria Saclay (Parietal team).