Python Data Science Handbook: full text in Jupyter Notebooks

This repository contains the entire Python Data Science Handbook, in the form of (free!) Jupyter notebooks.

How to Use this Book

About

The book was written and tested with Python 3.5, though other Python versions (including Python 2.7) should work in nearly all cases.

The book introduces the core libraries essential for working with data in Python: particularly IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and related packages. Familiarity with Python as a language is assumed; if you need a quick introduction to the language itself, see the free companion project, A Whirlwind Tour of Python: it's a fast-paced introduction to the Python language aimed at researchers and scientists.

See Index.ipynb for an index of the notebooks available to accompany the text.

Software

The code in the book was tested with Python 3.5, though most (but not all) of it will also work correctly with Python 2.7 and other older Python versions.

The packages I used to run the code in the book are listed in requirements.txt. (Note that some of these exact version numbers may not be available on your platform; you may have to tweak them for your own use.) To install the requirements using conda, run the following at the command line:

$ conda install --file requirements.txt

To create a stand-alone environment named PDSH with Python 3.5 and all the required package versions, run the following:

$ conda create -n PDSH python=3.5 --file requirements.txt

You can read more about using conda environments in the Managing Environments section of the conda documentation.

License

Code

The code in this repository, including all code samples in the notebooks listed above, is released under the MIT license. Read more at the Open Source Initiative.

Text

The text content of the book is released under the CC-BY-NC-ND license. Read more at Creative Commons.

Comments
  • Missing basemap from requirements.txt.

    Required by 04.13-Geographic-Data-With-Basemap.

    This one is annoying because it can't be installed from PyPI using pip. You instead need to list the URL where it is located on SourceForge.

    • https://downloads.sourceforge.net/project/matplotlib/matplotlib-toolkits/basemap-1.0.7/basemap-1.0.7.tar.gz
    opened by GrahamDumpleton 14
  • Invalid requirements.txt file.

    The requirements.txt file has:

    numpy=1.11
    pandas=0.18.1
    scipy=0.17.1
    sklearn=0.17.1
    matplotlib=1.5.1
    jupyter
    notebook
    line_profiler
    memory_profiler
    

    which will be rejected by pip.

    Invalid requirement: 'numpy=1.11'
    = is not a valid operator. Did you mean == ?
    

    Need to change = to == in each instance where pinning the version.
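The change described above can be sketched in Python. This is only an illustration, not something taken from the repository: the helper name to_pip_pins and the regular expression are assumptions, and the sketch treats a lone = as a conda-style pin while leaving ==, >=, <=, != and ~= untouched.

```python
import re

# Matches a single "=" that is not part of "==", ">=", "<=", "!=" or "~=".
_SINGLE_EQUALS = re.compile(r"(?<![=<>!~])=(?!=)")


def to_pip_pins(requirements_text):
    """Rewrite conda-style 'name=version' pins as pip-style 'name==version'.

    Hypothetical helper for illustration; blank lines and comments are kept as-is.
    """
    fixed_lines = []
    for line in requirements_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            fixed_lines.append(line)
            continue
        # Only the first lone "=" on a line is treated as the version pin.
        fixed_lines.append(_SINGLE_EQUALS.sub("==", line, count=1))
    return "\n".join(fixed_lines)


sample = "numpy=1.11\npandas=0.18.1\njupyter\nscipy==0.17.1"
print(to_pip_pins(sample))
# numpy==1.11
# pandas==0.18.1
# jupyter
# scipy==0.17.1
```

Lines that are already valid for pip pass through unchanged, so running the helper twice gives the same result as running it once.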

    opened by GrahamDumpleton 11
  • Instructions to create PDSH conda environment fail

    Fetching package metadata .........
    Solving package specifications: ....

    UnsatisfiableError: The following specifications were found to be in conflict:

    • numpy ==1.11
    • python 3.5*

    Use "conda info <package>" to see the dependencies for each package.
    opened by dwelden 9
  • update scikit-image dependency

    There's a bug that's been introduced with an interaction between python 3.6, numpy, and the version of scikit-image that's pinned here. This should fix that, and AFAICT skimage actually isn't used in the notebooks, so this upgrade shouldn't be breaking yeah?

    https://github.com/jakevdp/PythonDataScienceHandbook/search?utf8=%E2%9C%93&q=skimage&type=

    would fix https://github.com/jakevdp/PythonDataScienceHandbook/issues/108

    opened by choldgraf 6
  • Pandas version and 03.02-Data-Indexing-and-Selection

    The requirements.txt file has:

    pandas==0.18.1
    

    yet 03.02-Data-Indexing-and-Selection fails with:

    AttributeError                            Traceback (most recent call last)
    <ipython-input-5-8721e0616114> in <module>()
    ----> 1 list(data.items())
    
    /opt/app-root/lib/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
       2666         if (name in self._internal_names_set or name in self._metadata or
       2667                 name in self._accessors):
    -> 2668             return object.__getattribute__(self, name)
       2669         else:
       2670             if name in self._info_axis:
    
    AttributeError: 'Series' object has no attribute 'items'
    

    So maybe it is meant to be pandas 0.19.1.

    opened by GrahamDumpleton 5
  • Example: Recipe Database - !gunzip command not working

    Hi,

    I am trying to reproduce the code on page 184 of the book (Chapter 3).

    !curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz

    !gunzip recipeitems-latest.json.gz

    the second command to unzip the file gives me the error: 'gunzip' is not recognized as an internal or external command, operable program or batch file.

    How can I unzip the file? Thanks,

    Marco
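One platform-independent way to do this, sketched here under the assumption that the .gz file has already been downloaded into the working directory, is to decompress it with Python's standard library instead of the gunzip shell command. The helper name gunzip below is hypothetical, and unlike the shell command it leaves the compressed file in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src, dest=None):
    """Decompress a .gz file using only the standard library.

    Hypothetical helper for illustration. If dest is not given, the output
    path is src with its final suffix (".gz") removed.
    """
    src = Path(src)
    dest = Path(dest) if dest is not None else src.with_suffix("")
    with gzip.open(src, "rb") as f_in, open(dest, "wb") as f_out:
        # Stream the data in chunks so large files need not fit in memory.
        shutil.copyfileobj(f_in, f_out)
    return dest
```

In a notebook cell this would be called as gunzip("recipeitems-latest.json.gz"), which should write recipeitems-latest.json next to the original file.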

    opened by marco-vene 4
  • correction to inches calculation in 02.06-Boolean-Arrays-and-Masks

    I wasn't able to recreate a little part of this notebook's output. It seemed that dividing rainfall by 254 made the numbers too small.

    May I propose rerunning it with the change below?

    Line 6 in Input 1:

    (-) inches = rainfall / 254 # 1/10mm -> inches
    (+) inches = rainfall / 25.4 # 1/10mm -> inches

    This puts the numbers on a scale that shows up in the plot.

    It does mean changing the number reported between cells 23 and 24. It will also affect the output of cells 23, 24, 25, and 29 once they are rerun, though I think it doesn't change the meaning of those examples. I'd have done this through a pull request, but notebooks in GitHub are an unfamiliar beast to me for now.
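Which divisor applies depends on the unit the rainfall values are stored in, so a quick unit check may help here. The sketch below is not from the notebook: the function names are hypothetical, and the only fact it relies on is that one inch is 25.4 mm. If the data really is in tenths of a millimetre, as the code comment says, the divisor is 254; if it is in whole millimetres, the divisor is 25.4.

```python
MM_PER_INCH = 25.4            # one inch is 25.4 mm by definition
TENTHS_OF_MM_PER_INCH = 254   # so one inch is also 254 tenths of a millimetre


def tenths_of_mm_to_inches(value):
    """Convert a length given in units of 0.1 mm to inches."""
    return value / TENTHS_OF_MM_PER_INCH


def mm_to_inches(value):
    """Convert a length given in whole millimetres to inches."""
    return value / MM_PER_INCH


# The same physical depth (one inch of rain) expressed in both units.
print(tenths_of_mm_to_inches(254))  # 1.0
print(mm_to_inches(25.4))           # 1.0
```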

    opened by adamkski 4
  • About Japanese translation

    Hi @jakevdp Thank you for publishing these great notebooks.

    I would like to translate your notebooks into Japanese (and publish it on GitHub). But I am concerned that the translation is contrary to the ND (NoDerivatives).

    I would like your comments on this CC-BY-NC-ND license issue.

    opened by kozo2 4
  • Chapter 03.11-Working-with-Time-Series

    Hi, jakevdp. I can't download the data using the code below:

    from pandas_datareader import data
    goog = data.DataReader('GOOG', start='2004', end='2016', data_source='google')

    Traceback (most recent call last):

    File "", line 1, in
      goog = data.DataReader('GOOG', start='2004', end='2016',data_source='google')
    File "D:\Anaconda3\lib\site-packages\pandas_datareader\data.py", line 137, in DataReader
      session=session).read()
    File "D:\Anaconda3\lib\site-packages\pandas_datareader\base.py", line 181, in read
      params=self._get_params(self.symbols))
    File "D:\Anaconda3\lib\site-packages\pandas_datareader\base.py", line 79, in _read_one_data
      out = self._read_url_as_StringIO(url, params=params)
    File "D:\Anaconda3\lib\site-packages\pandas_datareader\base.py", line 90, in _read_url_as_StringIO
      response = self._get_response(url, params=params)
    File "D:\Anaconda3\lib\site-packages\pandas_datareader\base.py", line 126, in _get_response
      headers=headers)
    File "D:\Anaconda3\lib\site-packages\requests\sessions.py", line 531, in get
      return self.request('GET', url, **kwargs)
    File "D:\Anaconda3\lib\site-packages\requests\sessions.py", line 518, in request
      resp = self.send(prep, **send_kwargs)
    File "D:\Anaconda3\lib\site-packages\requests\sessions.py", line 639, in send
      r = adapter.send(request, **kwargs)
    File "D:\Anaconda3\lib\site-packages\requests\adapters.py", line 502, in send
      raise ConnectionError(e, request=request)

    ConnectionError: HTTPConnectionPool(host='www.google.com', port=80): Max retries exceeded with url: /finance/historical?q=GOOG&startdate=Jan+01%2C+2004&enddate=Jan+01%2C+2016&output=csv (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x0000020DE647BFD0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or the connected host failed to respond.',))

    opened by Radar-Lei 3
  • Question: porting Ruby language (and publishing it on GitHub)

    I would like to port PythonDataScienceHandbook to the Ruby programming language (at first with the text in Japanese). At the moment the Ruby data science environment is immature, so a complete port is not possible, but I would like to borrow the dataset and the text outline. Is this attempt contrary to your PythonDataScienceHandbook license terms?

    opened by kozo2 3
  • max margin example: magic numbers?

    Hey jake. I'm about to steal your max margin example for my lecture. It looks like there are magic numbers that show the margin of the other candidates.

    Where do they come from? eyeballing?

    Also, I feel like we might put that into the sklearn example gallery?

    opened by amueller 3
  • New in version 0.20: SimpleImputer replaces the previous sklearn.prep…

    New in version 0.20: SimpleImputer replaces the previous sklearn.preprocessing.Imputer estimator, which is now removed. The import statement and three instances of Imputer --> SimpleImputer are changed in this commit.

    opened by kernelrich 0
  • Bump pillow from 3.4.2 to 9.3.0

    Bumps pillow from 3.4.2 to 9.3.0.

    Release notes

    Sourced from pillow's releases.

    9.3.0

    https://pillow.readthedocs.io/en/stable/releasenotes/9.3.0.html

    Changes

    ... (truncated)

    Changelog

    Sourced from pillow's changelog.

    9.3.0 (2022-10-29)

    • Limit SAMPLESPERPIXEL to avoid runtime DOS #6700 [wiredfool]

    • Initialize libtiff buffer when saving #6699 [radarhere]

    • Inline fname2char to fix memory leak #6329 [nulano]

    • Fix memory leaks related to text features #6330 [nulano]

    • Use double quotes for version check on old CPython on Windows #6695 [hugovk]

    • Remove backup implementation of Round for Windows platforms #6693 [cgohlke]

    • Fixed set_variation_by_name offset #6445 [radarhere]

    • Fix malloc in _imagingft.c:font_setvaraxes #6690 [cgohlke]

    • Release Python GIL when converting images using matrix operations #6418 [hmaarrfk]

    • Added ExifTags enums #6630 [radarhere]

    • Do not modify previous frame when calculating delta in PNG #6683 [radarhere]

    • Added support for reading BMP images with RLE4 compression #6674 [npjg, radarhere]

    • Decode JPEG compressed BLP1 data in original mode #6678 [radarhere]

    • Added GPS TIFF tag info #6661 [radarhere]

    • Added conversion between RGB/RGBA/RGBX and LAB #6647 [radarhere]

    • Do not attempt normalization if mode is already normal #6644 [radarhere]

    ... (truncated)

    opened by dependabot[bot] 0
  • Data science and Business Process Management(BPM)

    Hello Team,

    I am an IT student specialising in data science, and I have recently become interested in BPM (Business Process Management). I am here to ask for your help: what are the different ways in which data science can be applied to BPM?

    In the expectation of a positive response please receive my deepest greetings.

    Best regards.

    opened by christbryan 1
  • Appendix : Covariance Type

    The code is found in the above-mentioned appendix.

    Encountering error: cannot unpack non-iterable numpy.float64 object

    Error in callback <function install_repl_displayhook..post_execute at 0x0000028D8639F9D0> (for post_execute):

    It is related to draw_ellipse(model.means_[0], model.covariances_[0], ax[i], alpha=0.2)

    It happens when model.covariances_[0] is passed on to width, height = 2 * np.sqrt(covariance)

    I think the cause is that height is blank or has no value.

    What is the fix?

    Thank you.
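One possible cause, offered as a guess rather than a confirmed diagnosis: if the model uses a spherical covariance type, model.covariances_[0] is a single number, and unpacking 2 * np.sqrt(covariance) into width and height fails because there is only one value. The sketch below shows one way to handle the scalar, diagonal and full cases. The helper name ellipse_axes is hypothetical, and the SVD branch is an assumption about how a full 2x2 covariance would be treated, not a copy of the appendix code.

```python
import numpy as np


def ellipse_axes(covariance):
    """Return (width, height, angle_in_degrees) for a 2-D Gaussian's 1-sigma ellipse.

    Hypothetical helper for illustration. Accepts a scalar (spherical),
    a length-2 vector (diagonal) or a 2x2 matrix (full covariance).
    """
    covariance = np.asarray(covariance, dtype=float)
    if covariance.shape == (2, 2):
        # Full covariance: principal axes and orientation from the SVD.
        U, s, _ = np.linalg.svd(covariance)
        angle = np.degrees(np.arctan2(U[1, 0], U[0, 0]))
        width, height = 2 * np.sqrt(s)
    else:
        # Scalar or length-2 input: axis-aligned ellipse. broadcast_to turns
        # a single variance into two equal ones so the unpacking succeeds.
        angle = 0.0
        width, height = 2 * np.sqrt(np.broadcast_to(covariance, (2,)))
    return float(width), float(height), float(angle)


print(ellipse_axes(np.float64(4.0)))  # (4.0, 4.0, 0.0)
print(ellipse_axes([1.0, 4.0]))       # (2.0, 4.0, 0.0)
```

Inside draw_ellipse, the width, height and angle returned here would take the place of the values computed from covariance directly.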

    opened by FluffyG88 0
  • Alternative Documentation with Mkdocs

    Hi everyone. I have developed alternative documentation with MkDocs, Google Colab and GitHub Actions. The repository lives at fralfaro/PythonDataScienceHandbook (documentation: link). Each chapter has a Jupyter notebook connected to Google Colab. This documentation solves the pandas output issue:

    • Original Documentation: 1

    • Alternative Documentation: 2

    Another interesting point is that all the documentation is generated in a single branch (master), and it is not necessary to use website branch. With this, the costs of maintaining both branches will be reduced.

    I hope this is a good alternative. If you have any questions, happy to answer them!

    opened by fralfaro 0
Owner
Jake Vanderplas
Python, Astronomy, Data Science