Python Data Science Handbook: full text in Jupyter Notebooks

Jake Vanderplas

Last update: Dec 28, 2022

Related tags

Science python numpy scikit-learn jupyter-notebook pandas matplotlib

Overview

Python Data Science Handbook

This repository contains the entire Python Data Science Handbook, in the form of (free!) Jupyter notebooks.

How to Use this Book

Read the book in its entirety online at https://jakevdp.github.io/PythonDataScienceHandbook/
Run the code using the Jupyter notebooks available in this repository's notebooks directory.
Launch executable versions of these notebooks using Google Colab.
Launch a live notebook server with these notebooks using binder.
Buy the printed book through O'Reilly Media

About

The book was written and tested with Python 3.5, though other Python versions (including Python 2.7) should work in nearly all cases.

The book introduces the core libraries essential for working with data in Python: particularly IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and related packages. Familiarity with Python as a language is assumed; if you need a quick introduction to the language itself, see the free companion project, A Whirlwind Tour of Python: it's a fast-paced introduction to the Python language aimed at researchers and scientists.

See Index.ipynb for an index of the notebooks available to accompany the text.

Software

The code in the book was tested with Python 3.5, though most (but not all) will also work correctly with Python 2.7 and other older Python versions.

The packages I used to run the code in the book are listed in requirements.txt (note that some of these exact version numbers may not be available on your platform; you may have to tweak them for your own use). To install the requirements using conda, run the following at the command line:

$ conda install --file requirements.txt

To create a stand-alone environment named PDSH with Python 3.5 and all the required package versions, run the following:

$ conda create -n PDSH python=3.5 --file requirements.txt

You can read more about using conda environments in the Managing Environments section of the conda documentation.
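
After installing, it can help to confirm which versions actually ended up in the environment. The sketch below is not part of the repository: it assumes Python 3.8 or newer (for importlib.metadata), and the helper name installed_versions is made up for illustration. The list of names is also an assumption; note that the distribution providing sklearn is published as scikit-learn.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Distribution names as published on PyPI; adjust to match requirements.txt.
    wanted = ["numpy", "pandas", "scipy", "scikit-learn", "matplotlib"]
    for name, ver in installed_versions(wanted).items():
        print(f"{name}: {ver or 'not installed'}")
```

Comparing this report with the pins in requirements.txt shows quickly whether the environment drifted from the versions the book was tested with.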

License

Code

The code in this repository, including all code samples in the notebooks listed above, is released under the MIT license. Read more at the Open Source Initiative.

Text

The text content of the book is released under the CC-BY-NC-ND license. Read more at Creative Commons.

Comments

Missing basemap from requirements.txt.
Required by 04.13-Geographic-Data-With-Basemap.

This is annoying because it can't be installed from PyPI using pip. You instead need to list the URL where it is located on SourceForge.

https://downloads.sourceforge.net/project/matplotlib/matplotlib-toolkits/basemap-1.0.7/basemap-1.0.7.tar.gz
opened by GrahamDumpleton 14

Invalid requirements.txt file.

The requirements.txt file has:

numpy=1.11
pandas=0.18.1
scipy=0.17.1
sklearn=0.17.1
matplotlib=1.5.1
jupyter
notebook
line_profiler
memory_profiler

which will be rejected by pip.

Invalid requirement: 'numpy=1.11'
= is not a valid operator. Did you mean == ?

Need to change = to == in each instance where pinning the version.
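
The change described above can be sketched as a small rewrite step. This is an illustration rather than a fix taken from the repository: the helper name conda_to_pip and the regular expression are assumptions, and only the simple name=version form is handled.

```python
import re

# A package name, a single "=", then a version that does not start with "=".
_CONDA_PIN = re.compile(r"^([A-Za-z0-9_.\-]+)=([^=].*)$")


def conda_to_pip(line):
    """Rewrite a conda-style pin (name=version) as a pip-style pin (name==version)."""
    stripped = line.strip()
    match = _CONDA_PIN.match(stripped)
    if match:
        return f"{match.group(1)}=={match.group(2)}"
    # Unpinned names and lines that already use "==" pass through unchanged.
    return stripped


if __name__ == "__main__":
    for entry in ["numpy=1.11", "pandas=0.18.1", "jupyter"]:
        print(conda_to_pip(entry))
```

Lines without a pin, such as jupyter, and lines that already use == are returned unchanged, so the function can be applied to every line of the file.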

opened by GrahamDumpleton 11

Instructions to create PDSH conda environment fail
Fetching package metadata ......... Solving package specifications: ....

UnsatisfiableError: The following specifications were found to be in conflict:

numpy ==1.11

python 3.5*

Use "conda info <package>" to see the dependencies for each package.
opened by dwelden 9
update scikit-image dependency

There's a bug that's been introduced with an interaction between python 3.6, numpy, and the version of scikit-image that's pinned here. This should fix that, and AFAICT skimage actually isn't used in the notebooks, so this upgrade shouldn't be breaking yeah?

https://github.com/jakevdp/PythonDataScienceHandbook/search?utf8=%E2%9C%93&q=skimage&type=

would fix https://github.com/jakevdp/PythonDataScienceHandbook/issues/108

opened by choldgraf 6

Pandas version and 03.02-Data-Indexing-and-Selection

The requirements.txt file has:

pandas==0.18.1

yet 03.02-Data-Indexing-and-Selection fails with:

AttributeErrorTraceback (most recent call last)
<ipython-input-5-8721e0616114> in <module>()
----> 1 list(data.items())

/opt/app-root/lib/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
   2666         if (name in self._internal_names_set or name in self._metadata or
   2667                 name in self._accessors):
-> 2668             return object.__getattribute__(self, name)
   2669         else:
   2670             if name in self._info_axis:

AttributeError: 'Series' object has no attribute 'items'

So maybe the pin is meant to be pandas 0.19.1.
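
One way to get the same pairs without depending on the items method is to zip the index with the values, since both attributes exist on a pandas Series. The sketch below is an assumption-laden illustration, not the notebook's code: series_pairs is a made-up helper, and a plain stand-in object with hypothetical data is used so the example runs without pandas installed.

```python
from types import SimpleNamespace


def series_pairs(series):
    """Return (label, value) pairs using only the .index and .values attributes."""
    return list(zip(series.index, series.values))


# Stand-in object exposing the same two attributes as a pandas Series
# (hypothetical data, used only to keep the example self-contained).
data = SimpleNamespace(index=["a", "b", "c"], values=[0.25, 0.5, 0.75])
print(series_pairs(data))
```

Passing a real Series in place of the stand-in should give the same kind of list that list(data.items()) is expected to produce in the notebook.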

opened by GrahamDumpleton 5

Example: Recipe Database - !gunzip command not working

Hi,

I am trying to reproduce the code on page 184 of the book (Chapter 3).

!curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz

!gunzip recipeitems-latest.json.gz

the second command to unzip the file gives me the error: 'gunzip' is not recognized as an internal or external command, operable program or batch file.

How can I unzip the file? Thanks,

Marco
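
On systems without a gunzip command, the file can be unpacked with Python's standard library instead. The following is a minimal sketch, not code from the book: the helper name gunzip and its default output naming are assumptions.

```python
import gzip
import shutil


def gunzip(src, dest=None):
    """Decompress a .gz file using only the standard library; return the output path."""
    if dest is None:
        # Drop a trailing ".gz"; otherwise fall back to a distinct name.
        dest = src[:-3] if src.endswith(".gz") else src + ".out"
    with gzip.open(src, "rb") as f_in, open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # streams, so large files are fine
    return dest


# In a notebook cell, after the download has finished:
# gunzip("recipeitems-latest.json.gz")
```

Unlike the gunzip command, this sketch leaves the original .gz file in place.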

opened by marco-vene 4
correction to inches calculation in 02.06-Boolean-Arrays-and-Masks

I wasn't able to recreate a little part of this notebook's output. It seemed that dividing rainfall by 254 made the numbers too small.

May I propose rerunning it with the change below?

Line 6 in Input 1:
(-) inches = rainfall / 254 # 1/10mm -> inches
(+) inches = rainfall / 25.4 # 1/10mm -> inches

This puts the numbers on a scale that shows up in the plot.

It does mean changing the number reported between cells 23 and 24. It will also affect the output of cells 23, 24, 25, and 29 once they are rerun, though I don't think it changes the meaning of those examples. I'd have done this through a pull request, but notebooks on GitHub are an unfamiliar beast to me for now.
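
The two divisors correspond to two different assumptions about the units of the rainfall column, and the arithmetic can be checked directly. The sketch below only illustrates the conversion; the function names are made up, and which assumption matches the data file is not settled in this thread.

```python
MM_PER_INCH = 25.4


def tenths_mm_to_inches(value):
    """Convert a reading in tenths of a millimetre to inches (same as value / 254)."""
    return value / 10 / MM_PER_INCH


def mm_to_inches(value):
    """Convert a reading in whole millimetres to inches (value / 25.4)."""
    return value / MM_PER_INCH


# A reading of 254 means one inch only if the unit is 1/10 mm;
# if the unit were mm, the same reading would be ten inches.
print(tenths_mm_to_inches(254), mm_to_inches(254))
```

If the column really is in tenths of a millimetre, as the code comment says, dividing by 254 matches the comment; dividing by 25.4 would be the right conversion only for values recorded in millimetres.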

opened by adamkski 4
About Japanese translation

Hi @jakevdp Thank you for publishing these great notebooks.

I would like to translate your notebooks into Japanese (and publish it on GitHub). But I am concerned that the translation is contrary to the ND (NoDerivatives).

I would like your comments on this CC-BY-NC-ND license issue.

opened by kozo2 4
Chapter 03.11-Working-with-Time-Series

Hi jakevdp, I can't download the data using the code below:

from pandas_datareader import data
goog = data.DataReader('GOOG', start='2004', end='2016', data_source='google')

Traceback (most recent call last):
  File "", line 1, in
    goog = data.DataReader('GOOG', start='2004', end='2016',data_source='google')
  File "D:\Anaconda3\lib\site-packages\pandas_datareader\data.py", line 137, in DataReader
    session=session).read()
  File "D:\Anaconda3\lib\site-packages\pandas_datareader\base.py", line 181, in read
    params=self._get_params(self.symbols))
  File "D:\Anaconda3\lib\site-packages\pandas_datareader\base.py", line 79, in _read_one_data
    out = self._read_url_as_StringIO(url, params=params)
  File "D:\Anaconda3\lib\site-packages\pandas_datareader\base.py", line 90, in _read_url_as_StringIO
    response = self._get_response(url, params=params)
  File "D:\Anaconda3\lib\site-packages\pandas_datareader\base.py", line 126, in _get_response
    headers=headers)
  File "D:\Anaconda3\lib\site-packages\requests\sessions.py", line 531, in get
    return self.request('GET', url, **kwargs)
  File "D:\Anaconda3\lib\site-packages\requests\sessions.py", line 518, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\Anaconda3\lib\site-packages\requests\sessions.py", line 639, in send
    r = adapter.send(request, **kwargs)
  File "D:\Anaconda3\lib\site-packages\requests\adapters.py", line 502, in send
    raise ConnectionError(e, request=request)
ConnectionError: HTTPConnectionPool(host='www.google.com', port=80): Max retries exceeded with url: /finance/historical?q=GOOG&startdate=Jan+01%2C+2004&enddate=Jan+01%2C+2016&output=csv (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x0000020DE647BFD0>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.',))

opened by Radar-Lei 3
Question: porting to the Ruby language (and publishing it on GitHub)

I would like to port PythonDataScienceHandbook to the Ruby programming language (at first with the text in Japanese). At the moment the Ruby data science ecosystem is immature, so this cannot be a complete port, but I would like to borrow the datasets and the outline of the text. Would this attempt be contrary to the PythonDataScienceHandbook license terms?

opened by kozo2 3
max margin example: magic numbers?

Hey jake. I'm about to steal your max margin example for my lecture. It looks like there are magic numbers that show the margin of the other candidates.

Where do they come from? eyeballing?

Also, I feel like we might put that into the sklearn example gallery?

opened by amueller 3
New in version 0.20: SimpleImputer replaces the previous sklearn.preprocessing.Imputer estimator which is now removed.

The import statement and three instances of Imputer --> SimpleImputer are changed in this commit.

opened by kernelrich 0
Bump pillow from 3.4.2 to 9.3.0
Bumps pillow from 3.4.2 to 9.3.0.

Release notes

Sourced from pillow's releases.

9.3.0

https://pillow.readthedocs.io/en/stable/releasenotes/9.3.0.html

Changes

Initialize libtiff buffer when saving #6699 [@radarhere]

Limit SAMPLESPERPIXEL to avoid runtime DOS #6700 [@wiredfool]

Inline fname2char to fix memory leak #6329 [@nulano]

Fix memory leaks related to text features #6330 [@nulano]

Use double quotes for version check on old CPython on Windows #6695 [@hugovk]

GHA: replace deprecated set-output command with GITHUB_OUTPUT file #6697 [@nulano]

Remove backup implementation of Round for Windows platforms #6693 [@cgohlke]

Upload fribidi.dll to GitHub Actions #6532 [@nulano]

Fixed set_variation_by_name offset #6445 [@radarhere]

Windows build improvements #6562 [@nulano]

Fix malloc in _imagingft.c:font_setvaraxes #6690 [@cgohlke]

Only use ASCII characters in C source file #6691 [@cgohlke]

Release Python GIL when converting images using matrix operations #6418 [@hmaarrfk]

Added ExifTags enums #6630 [@radarhere]

Do not modify previous frame when calculating delta in PNG #6683 [@radarhere]

Added support for reading BMP images with RLE4 compression #6674 [@npjg]

Decode JPEG compressed BLP1 data in original mode #6678 [@radarhere]

pylint warnings #6659 [@marksmayo]

Added GPS TIFF tag info #6661 [@radarhere]

Added conversion between RGB/RGBA/RGBX and LAB #6647 [@radarhere]

Do not attempt normalization if mode is already normal #6644 [@radarhere]

Fixed seeking to an L frame in a GIF #6576 [@radarhere]

Consider all frames when selecting mode for PNG save_all #6610 [@radarhere]

Don't reassign crc on ChunkStream close #6627 [@radarhere]

Raise a warning if NumPy failed to raise an error during conversion #6594 [@radarhere]

Only read a maximum of 100 bytes at a time in IMT header #6623 [@radarhere]

Show all frames in ImageShow #6611 [@radarhere]

Allow FLI palette chunk to not be first #6626 [@radarhere]

If first GIF frame has transparency for RGB_ALWAYS loading strategy, use RGBA mode #6592 [@radarhere]

Round box position to integer when pasting embedded color #6517 [@radarhere]

Removed EXIF prefix when saving WebP #6582 [@radarhere]

Pad IM palette to 768 bytes when saving #6579 [@radarhere]

Added DDS BC6H reading #6449 [@ShadelessFox]

Added support for opening WhiteIsZero 16-bit integer TIFF images #6642 [@JayWiz]

Raise an error when allocating translucent color to RGB palette #6654 [@jsbueno]

Moved mode check outside of loops #6650 [@radarhere]

Added reading of TIFF child images #6569 [@radarhere]

Improved ImageOps palette handling #6596 [@PososikTeam]

Defer parsing of palette into colors #6567 [@radarhere]

Apply transparency to P images in ImageTk.PhotoImage #6559 [@radarhere]

Use rounding in ImageOps contain() and pad() #6522 [@bibinhashley]

Fixed GIF remapping to palette with duplicate entries #6548 [@radarhere]

Allow remap_palette() to return an image with less than 256 palette entries #6543 [@radarhere]

Corrected BMP and TGA palette size when saving #6500 [@radarhere]

... (truncated)

Changelog

Sourced from pillow's changelog.

9.3.0 (2022-10-29)

Limit SAMPLESPERPIXEL to avoid runtime DOS #6700 [wiredfool]

Initialize libtiff buffer when saving #6699 [radarhere]

Inline fname2char to fix memory leak #6329 [nulano]

Fix memory leaks related to text features #6330 [nulano]

Use double quotes for version check on old CPython on Windows #6695 [hugovk]

Remove backup implementation of Round for Windows platforms #6693 [cgohlke]

Fixed set_variation_by_name offset #6445 [radarhere]

Fix malloc in _imagingft.c:font_setvaraxes #6690 [cgohlke]

Release Python GIL when converting images using matrix operations #6418 [hmaarrfk]

Added ExifTags enums #6630 [radarhere]

Do not modify previous frame when calculating delta in PNG #6683 [radarhere]

Added support for reading BMP images with RLE4 compression #6674 [npjg, radarhere]

Decode JPEG compressed BLP1 data in original mode #6678 [radarhere]

Added GPS TIFF tag info #6661 [radarhere]

Added conversion between RGB/RGBA/RGBX and LAB #6647 [radarhere]

Do not attempt normalization if mode is already normal #6644 [radarhere]

... (truncated)

Commits

d594f4c Update CHANGES.rst [ci skip]

909dc64 9.3.0 version bump

1a51ce7 Merge pull request #6699 from hugovk/security-libtiff_buffer

2444cdd Merge pull request #6700 from hugovk/security-samples_per_pixel-sec

744f455 Added release notes

0846bfa Add to release notes

799a6a0 Fix linting

00b25fd Hide UserWarning in logs

05b175e Tighter test case

13f2c5a Prevent DOS with large SAMPLESPERPIXEL in Tiff IFD

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0
Data science and Business Process Management (BPM)

Hello Team,

I am an IT student specialising in data science, and I have recently become interested in Business Process Management (BPM). I would like your help with one question: in what ways can data science be applied to BPM?

Thank you in advance for your response.

Best regards.

opened by christbryan 1
Appendix : Covariance Type

The code is found in the above-mentioned appendix.

Encountering error: cannot unpack non-iterable numpy.float64 object
Error in callback <function install_repl_displayhook..post_execute at 0x0000028D8639F9D0> (for post_execute):

It is related to draw_ellipse(model.means_[0], model.covariances_[0], ax[i], alpha=0.2)

The failure occurs when model.covariances_[0] is passed on to the line width, height = 2 * np.sqrt(covariance)

I think the cause is that height ends up with no value.

What is the fix?

Thank you.
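
One possible reading of the error: with covariance_type='spherical', scikit-learn's GaussianMixture stores a single variance per component, so model.covariances_[0] is a scalar and cannot be unpacked into width and height. The sketch below is not the book's draw_ellipse; ellipse_params is a hypothetical helper that returns the 1-sigma width, height, and rotation for the scalar, length-2, and 2x2 cases, using only the math module.

```python
import math


def ellipse_params(covariance):
    """Return (width, height, angle_in_degrees) of a 1-sigma ellipse.

    Accepts a scalar variance (spherical), two variances (diag),
    or a symmetric 2x2 matrix (full); lists and NumPy arrays both work.
    """
    try:
        len(covariance)
    except TypeError:
        # Spherical: one variance shared by both axes.
        side = 2 * math.sqrt(covariance)
        return side, side, 0.0
    try:
        len(covariance[0])
    except TypeError:
        # Diagonal: one variance per axis, no rotation.
        var_x, var_y = covariance
        return 2 * math.sqrt(var_x), 2 * math.sqrt(var_y), 0.0
    # Full: closed-form eigenvalues of the symmetric matrix [[a, b], [b, c]].
    (a, b), (_, c) = covariance
    mean = (a + c) / 2
    spread = math.hypot((a - c) / 2, b)
    angle = math.degrees(0.5 * math.atan2(2 * b, a - c))
    major = 2 * math.sqrt(mean + spread)
    minor = 2 * math.sqrt(max(mean - spread, 0.0))
    return major, minor, angle
```

The scalar branch is the one that addresses the reported error: a single variance gives equal width and height, that is, a circle.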

opened by FluffyG88 0
Alternative Documentation with Mkdocs
Hi everyone. I have developed an alternative documentation site with mkdocs, Google Colab and GitHub Actions. This repository lives at fralfaro/PythonDataScienceHandbook (documentation: link). Each chapter has a Jupyter notebook connected to Google Colab. This documentation fixes how pandas output is rendered:

Original Documentation:

Alternative Documentation:

Another interesting point is that all the documentation is generated in a single branch (master), so a separate website branch is not necessary. This reduces the cost of maintaining both branches.

I hope this is a good alternative. If you have any questions, happy to answer them!
opened by fralfaro 0

Owner

Jake Vanderplas

Python, Astronomy, Data Science

GitHub http://jakevdp.github.io/PythonDataScienceHandbook

Data intensive science for everyone.

The latest information about Galaxy can be found on the Galaxy Community Hub. Community support is available at Galaxy Help. Galaxy Quickstart Galaxy

1k Jan 8, 2023

CS 506 - Computational Tools for Data Science

CS 506 - Computational Tools for Data Science Code, slides, and notes for Boston University CS506 Fall 2021 The Final Project Repository can be found

14 Mar 23, 2022

A logical, reasonably standardized, but flexible project structure for doing and sharing data science work.

Cookiecutter Data Science A logical, reasonably standardized, but flexible project structure for doing and sharing data science work. Project homepage

0 Sep 5, 2021

A framework for feature exploration in Data Science

Beehive A framework for feature exploration in Data Science Background What do we do when we finish one episode of feature exploration in a jupyter no

1 Jan 3, 2022

ReproZip is a tool that simplifies the process of creating reproducible experiments from command-line executions, a frequently-used common denominator in computational science.

ReproZip ReproZip is a tool aimed at simplifying the process of creating reproducible experiments from command-line executions, a frequently-used comm

267 Jan 1, 2023

collection of interesting Computer Science resources

137 Dec 22, 2022

PsychoPy is an open-source package for creating experiments in behavioral science.

PsychoPy is an open-source package for creating experiments in behavioral science. It aims to provide a single package that is: precise enoug

1.3k Dec 31, 2022

Algorithms covered in the Bioinformatics Course part of the Cambridge Computer Science Tripos

Bioinformatics This is a repository of all the algorithms covered in the Bioinformatics Course part of the Cambridge Computer Science Tripos Algorithm

16 Jun 30, 2022

3D visualization of scientific data in Python

Mayavi: 3D visualization of scientific data in Python Mayavi docs: http://docs.enthought.com/mayavi/mayavi/ TVTK docs: http://docs.enthought.com/mayav

1.1k Jan 6, 2023

Efficient Python Tricks and Tools for Data Scientists

Why efficient Python? Because using Python more efficiently will make your code more readable and run more efficiently.

944 Dec 28, 2022

An interactive explorer for single-cell transcriptomics data

an interactive explorer for single-cell transcriptomics data cellxgene (pronounced "cell-by-gene") is an interactive data explorer for single-cell tra

424 Dec 15, 2022

🍊 :bar_chart: :bulb: Orange: Interactive data analysis

Orange Data Mining Orange is a data mining and visualization toolbox for novice and expert alike. To explore data with Orange, one requires no program

3.9k Jan 5, 2023

Datamol is a python library to work with molecules

Datamol is a python library to work with molecules. It's a layer built on top of RDKit and aims to be as light as possible.

276 Dec 19, 2022

Incubator for useful bioinformatics code, primarily in Python and R

Collection of useful code related to biological analysis. Much of this is discussed with examples at Blue collar bioinformatics. All code, images and

560 Dec 24, 2022

Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

Karate Club is an unsupervised machine learning extension library for NetworkX. Please look at the Documentation, relevant Paper, Promo Video, and Ext

1.8k Dec 31, 2022

Probabilistic Programming in Python: Bayesian Modeling and Probabilistic Machine Learning with Aesara

PyMC3 is a Python package for Bayesian statistical modeling and Probabilistic Machine Learning focusing on advanced Markov chain Monte Carlo (MCMC) an

7.2k Dec 30, 2022

Statsmodels: statistical modeling and econometrics in Python

About statsmodels statsmodels is a Python package that provides a complement to scipy for statistical computations including descriptive statistics an

8.1k Dec 30, 2022

A computer algebra system written in pure Python

SymPy See the AUTHORS file for the list of authors. And many more people helped on the SymPy mailing list, reported bugs, helped organize SymPy's part

9.9k Jan 8, 2023

PennyLane is a cross-platform Python library for differentiable programming of quantum computers.

PennyLane is a cross-platform Python library for differentiable programming of quantum computers. Train a quantum computer the same way as a neural network.

1.6k Jan 4, 2023
