Practical-statistics-for-data-scientists - Code repository for O'Reilly book

Overview

Code repository

Practical Statistics for Data Scientists:

50+ Essential Concepts Using R and Python

by Peter Bruce, Andrew Bruce, and Peter Gedeck

Online

View the notebooks online: nbviewer

Execute the notebooks in Binder: Binder

This can take some time if the Binder environment needs to be rebuilt.

Other language versions

English:
Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python
2020: ISBN 149207294X
Google books, Amazon
Japanese:
データサイエンスのための統計学入門 第2版 ―予測、分類、統計モデリング、統計的機械学習とR/Pythonプログラミング
2020: ISBN 487311926X, Shinya Ohashi (supervised), Toshiaki Kurokawa (translated)
Google books, Amazon
German:
Praktische Statistik für Data Scientists: 50+ essenzielle Konzepte mit R und Python 
2021: ISBN 3960091532, Marcus Fraaß (Übersetzer)
Google books, Amazon
Korean:
Practical Statistics for Data Scientists: 데이터 과학을 위한 통계(2판)
2021: ISBN 9791162244180, Junyong Lee (translation)
Google books, Hanbit media
Polish:
Statystyka praktyczna w data science. 50 kluczowych zagadnień w językach R i Python
2021: ISBN 9788328374270
Google books, Amazon, Helion

Setup R and Python environments

R

Run the following commands in R to install all required packages:

if (!require(vioplot)) install.packages('vioplot')
if (!require(corrplot)) install.packages('corrplot')
if (!require(gmodels)) install.packages('gmodels')
if (!require(matrixStats)) install.packages('matrixStats')

if (!require(lmPerm)) install.packages('lmPerm')
if (!require(pwr)) install.packages('pwr')

if (!require(FNN)) install.packages('FNN')
if (!require(klaR)) install.packages('klaR')
if (!require(DMwR)) install.packages('DMwR')

if (!require(xgboost)) install.packages('xgboost')

if (!require(ellipse)) install.packages('ellipse')
if (!require(mclust)) install.packages('mclust')
if (!require(ca)) install.packages('ca')

Python

We recommend using a conda environment to run the Python code.

conda create -n sfds python
conda activate sfds
conda env update -n sfds -f environment.yml
Comments
  • Chapter 7 Unsupervised Learning Cell #18 Dendrogram is giving error

    Cell #18 (Dendrogram) of the Chapter 7 Unsupervised Learning notebook gives the following error. Please fix the error and upload the corrected code to the GitHub page. Thanks

    ValueError Traceback (most recent call last) in
          1 fig, ax = plt.subplots(figsize=(5, 5))
          2
    ----> 3 dendrogram(Z, labels=df.index, color_threshold=0)
          4 plt.xticks(rotation=90)
          5 ax.set_ylabel('distance')

    C:\ProgramData\Anaconda3\lib\site-packages\scipy\cluster\hierarchy.py in dendrogram(Z, p, truncate_mode, color_threshold, get_leaves, orientation, labels, count_sort, distance_sort, show_leaf_counts, no_plot, no_labels, leaf_font_size, leaf_rotation, leaf_label_func, show_contracted, link_color_func, ax, above_threshold_color)
       3275                          "'bottom', or 'right'")
       3276
    -> 3277     if labels and Z.shape[0] + 1 != len(labels):
       3278         raise ValueError("Dimensions of Z and labels must be consistent.")
       3279

    C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in __nonzero__(self)
       2148     def __nonzero__(self):
       2149         raise ValueError(
    -> 2150             f"The truth value of a {type(self).__name__} is ambiguous. "
       2151             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
       2152         )

    ValueError: The truth value of a Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

    opened by SSJUSA 14
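One possible workaround for the dendrogram error reported above is to pass the labels as a plain list, so that the `if labels and ...` check in the traceback has a well-defined truth value. The sketch below is not taken from the repository: the small data frame is a hypothetical stand-in for the notebook's `df`, and `no_plot=True` is set only so the example runs without a display.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical stand-in for the notebook's data frame (index = row labels).
df = pd.DataFrame(
    np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]),
    index=['A', 'B', 'C', 'D'],
)
Z = linkage(df.to_numpy(), method='complete')

# Converting the pandas Index to a list avoids the ambiguous truth-value check.
result = dendrogram(Z, labels=list(df.index), color_threshold=0, no_plot=True)
leaf_labels = result['ivl']  # leaf labels in plotting order
print(leaf_labels)
```

In the notebook itself, the corresponding change would be `labels=list(df.index)` in the existing `dendrogram(...)` call, assuming the rest of the cell stays as shown in the traceback.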
  • Changing the method of getting path to make it Robust

    This change makes the code more robust. Using 'inspect' to find the filename and path works seamlessly with the execfile(...) function and with PyCharm's Alt+Shift+E shortcut (run the highlighted lines), as mentioned here: https://stackoverflow.com/a/6209894/1484916

    opened by xordux 7
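As a rough illustration of the 'inspect' approach described in the item above, the following sketch derives the directory of the currently executing code. It is not the code from the pull request; the variable names `filename` and `path` are assumptions made for this example.

```python
import inspect
import os

# File name of the code object that is currently executing.
filename = inspect.getframeinfo(inspect.currentframe()).filename
# Directory containing that file, as an absolute path.
path = os.path.dirname(os.path.abspath(filename))
print(path)
```

When the code runs in an interactive session rather than from a file, `filename` is a placeholder such as `<stdin>`, so `path` then falls back to the current working directory.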
  • Python code for Chapter 3 - Web Stickiness - TypeError in the original code

    Running the Chapter 3 Web Stickiness notebook raises a TypeError:

    The line: print(np.mean(perm_diffs > mean_b - mean_a))

    results in the following TypeError: '>' not supported between instances of 'list' and 'float'

    which can be fixed using a mapObj such as:

    mapObj = map(lambda _: _>(mean_b-mean_a), perm_diffs)
    print (f'{sum(mapObj)*100/len(perm_diffs):4.2f}%')
    
    opened by Slepetys 5
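The TypeError above arises because a Python list cannot be compared element-wise with a float. A minimal sketch of two equivalent fixes follows; the numbers are hypothetical stand-ins for the notebook's `perm_diffs`, `mean_a`, and `mean_b`, not results from the book.

```python
import numpy as np

# Hypothetical values standing in for the notebook's permutation results.
perm_diffs = [-12.0, 3.5, 40.2, 18.9, -7.3, 36.1]
mean_a, mean_b = 126.3, 162.0
observed = mean_b - mean_a

# Converting the list to an array makes the comparison element-wise.
p_array = np.mean(np.array(perm_diffs) > observed)

# The map-based variant counts the same elements without NumPy.
p_map = sum(map(lambda d: d > observed, perm_diffs)) / len(perm_diffs)
print(p_array, p_map)
```

Both variants give the same proportion; the array version keeps the original `np.mean(...)` line almost unchanged.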
  • Graphs in Chapter 5 Classification are not displaying in the Jupyter Notebook

    The Jupyter notebook for Chapter 5 Classification gives the following error:

    Matplotlib is currently using agg, which is a non-GUI backend, so can't show the figure.

    Please fix these errors and update the notebook's code files on this book's GitHub page.

    Thanks and best regards, SSJ

    opened by SSJUSA 4
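The message in the item above indicates that Matplotlib is running with the non-interactive Agg backend. In a Jupyter notebook, the `%matplotlib inline` magic is the usual way to get figures displayed in the output cells. Outside a notebook, a figure can still be written to a file or buffer, as in this sketch (the plotted data is hypothetical and the backend is forced to Agg only to mirror the report):

```python
import io

import matplotlib
matplotlib.use('Agg')  # non-GUI backend, as in the reported message
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_xlabel('x')

# With a non-GUI backend, plt.show() cannot open a window,
# but the figure can still be rendered to a file or buffer.
buffer = io.BytesIO()
fig.savefig(buffer, format='png')
png_bytes = buffer.getvalue()
print(matplotlib.get_backend(), len(png_bytes))
```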
  • Again in Ch 5, 6, 7

    Naive Bayes, The Naive Solution

    The predicted probability results are different. They should be 0.4798964 (paid off) and 0.5201036 (default).

    I ran the code in Colab. Would you check this notebook? https://colab.research.google.com/drive/1ChitMlzaMHYDru6ngI1qBHhJGcIP-RhI#scrollTo=1EnynWD14l2R&line=7&uniqifier=1

    Variable importance

    Need line-break in line 318. https://github.com/gedeck/practical-statistics-for-data-scientists/blob/3e1bf1cb3ca3f345195f702addc3a3b01e67c58c/python/code/Chapter%206%20-%20Statistical%20Machine%20Learning.py#L318

    Hyperparameters and Cross-Validation

    Need line-break in 453. https://github.com/gedeck/practical-statistics-for-data-scientists/blob/3e1bf1cb3ca3f345195f702addc3a3b01e67c58c/python/code/Chapter%206%20-%20Statistical%20Machine%20Learning.py#L453

    And line 452 has a type error. Would you check this line? "TypeError: Object with dtype category cannot perform the numpy op subtract"

    Python XGBoost code in Ch6

    It would be better to set eval_metric='error' in the Python code too.

    opened by deulkkae 3
  • Errors and Questions in Ch5, 6, 7

    1. In Chapter 5, some notebook code results differ from the printed book's.

    [Confusion Matrix]

    In [18]:
    # Confusion matrix
    pred <- predict(logistic_gam, newdata=loan_data)
    pred_y <- as.numeric(pred > 0)
    true_y <- as.numeric(loan_data$outcome=='default')
    true_pos <- (true_y==1) & (pred_y==1)
    true_neg <- (true_y==0) & (pred_y==0)
    false_pos <- (true_y==0) & (pred_y==1)
    false_neg <- (true_y==1) & (pred_y==0)
    conf_mat <- matrix(c(sum(true_pos), sum(false_pos),
                         sum(false_neg), sum(true_neg)), 2, 2)
    colnames(conf_mat) <- c('Yhat = 1', 'Yhat = 0')
    rownames(conf_mat) <- c('Y = 1', 'Y = 0')
    conf_mat
    

              Yhat = 1  Yhat = 0
    Y = 1        14293      8378
    Y = 0         8051     14620

    In the R notebook, the correctly predicted defaults number 14,293 and the incorrectly predicted ones 8,378. But in the printed book they are 14,295 and 8,376.

    And in Python, I got yet another set of numbers.

                    Yhat = default  Yhat = paid off
    Y = default              14336             8335
    Y = paid off              8148            14523

    Which one is correct? If the notebook's results are right, the numbers in the first paragraph of page 222 should be edited.

    2. This is also about code results that differ from the printed book.

    [AUC]

    In [21]: 
    sum(roc_df$recall[-1] * diff(1-roc_df$specificity))
    head(roc_df)
    0.692623197044616
    

    The result in the notebook is 0.692623197044616, but it is 0.6926172 in the book. Please check the Python code and result too.

    3. XGBoost was updated to 1.3.0, which causes some errors in the code in Chapters 6 and 7 (pages 272, 275, 276, 280).

    The code runs fine up to page 276. But without explicitly setting eval_metric="error", you will eventually get errors on page 280. I think it would be better to edit the code on GitHub.

    4. In Chapter 7, K-Means Clustering - A Simple Example

    In [12]:
    set.seed(1010103)
    df <- sp500_px[row.names(sp500_px)>='2011-01-01', c('XOM', 'CVX')]
    km <- kmeans(df, centers=4, nstart=1)
    
    df$cluster <- factor(km$cluster)
    head(df)
    XOM	CVX	cluster
    2011-01-03	0.73680496	0.2406809	1
    2011-01-04	0.16866845	-0.5845157	4
    2011-01-05	0.02663055	0.4469854	1
    2011-01-06	0.24855834	-0.9197513	4
    2011-01-07	0.33732892	0.1805111	1
    2011-01-10	0.00000000	-0.4641675	4
    

    In the notebook, the first six records are assigned to either cluster 1 or cluster 4. The means of the clusters are shown below.

    In [13]:
    centers <- data.frame(cluster=factor(1:4), km$centers)
    centers
    
    cluster	XOM	CVX
    1	 0.2315403	 0.3169645
    2	 0.9270317	 1.3464117
    3	-1.1439800	-1.7502975
    4	-0.3287416	-0.5734695
    

    But the execution results in the book are a little different: there, the records are assigned to cluster 1 or 2. However, as you can see in [Figure 7-5], clusters 3 and 4 are in the negative area (lower left of the graph), and they appear to represent a "down" market. So I think the code results and some sentences on pages 296-297 should be changed.

    5. In Chapter 7, on page 323, the first line of the data table shows the wrong columns.

    > x <- loan_data[1:5, c('dti', 'payment_inc_ratio', 'home_', 'purpose_')]
    > x
    
        dti payment_inc_ratio   home             purpose  
      <dbl>             <dbl> <fctr>             <fctr>
    1  1.00           2.39320   RENT                car
    ...
    
    

    It should be changed like this.

    > x <- loan_data[1:5, c('dti', 'payment_inc_ratio', 'home_', 'purpose_')]
    > x
    
        dti payment_inc_ratio   home_             purpose_  
      <dbl>             <dbl> <fctr>             <fctr>
    1  1.00           2.39320   RENT                major_purchase
    ...
    
    

    Please check them all and let me know if I got something wrong. :) Thanks in advance.

    opened by deulkkae 3
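For item 2 of the issue above, the R expression `sum(roc_df$recall[-1] * diff(1-roc_df$specificity))` can be mirrored in Python as a rectangle-rule sum. This is only a sketch: the three ROC points are hypothetical, and it assumes the rows are ordered so that recall rises while specificity falls.

```python
import numpy as np

# Hypothetical ROC points, ordered so that recall rises and specificity falls.
recall = np.array([0.0, 0.5, 1.0])
specificity = np.array([1.0, 0.5, 0.0])

# Same sum as the R line: recall[-1] drops the first element in R,
# which corresponds to recall[1:] in Python.
auc = np.sum(recall[1:] * np.diff(1 - specificity))
print(auc)
```

Because each rectangle uses the right-hand recall value, this sum can differ from a trapezoidal AUC computed on the same points.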
  • sp500_data.csv.gz  & kc_tax.csv.gz

    Hi Peter, I am new to this platform, Python, and your book. I was able to download all the data files to follow along, except for the two compressed files above, which give "Error 79 - Inappropriate file type or format". I am on macOS Catalina 10.15.6.

    Please upload a better copy. Thanks

    opened by mg-nyc 3
  • perm_fun use of set()

    Using the perm_fun(x, nA, nB) function for the permutation tests on pages 99-101 now results in a deprecation warning.

    "FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead."

    opened by akthe-at 2
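A minimal sketch of how the warning above could be avoided, assuming `perm_fun` builds its two index groups as sets and uses them with `.loc` (the function body here is a reconstruction for illustration, and the series `x` is hypothetical):

```python
import random

import pandas as pd


def perm_fun(x, nA, nB):
    """Difference in means between two random groups of sizes nB and nA."""
    n = nA + nB
    idx_B = set(random.sample(range(n), nB))
    idx_A = set(range(n)) - idx_B
    # Index with lists rather than sets to avoid the deprecation warning.
    return x.loc[list(idx_B)].mean() - x.loc[list(idx_A)].mean()


# Hypothetical data with a default integer index 0..n-1.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
random.seed(1)
diff = perm_fun(x, nA=3, nB=3)
print(diff)
```

Keeping the sets for the set difference and converting only at the indexing step leaves the sampling logic unchanged.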
  • Different histogram under the same number of bins

    In Chapter 1, in the section "Frequency Tables and Histograms", I tried to replicate the histogram code with a different Python package, lets-plot, which should behave like the hist() plot in R. However, the y-axis (the frequency) differs from what R and Python generate with the same number of bins.

    The histogram generated from the textbook code: image

    Code:

    ax = (state['Population'] / 1_000_000).plot.hist(bins=10)  
    ax.set_xlabel('Population (millions)')
    

    The histogram generated by lets-plot (aka ggplot in Python): image

    Code:

    temp_df = pd.DataFrame(state['Population'] / 1_000_000)  
    ggplot(temp_df, aes(x="Population")) + geom_histogram(bins=10)
    
    opened by hchen98 2
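One way to investigate a discrepancy like the one above is to print the bin edges and counts directly. Matplotlib-based histograms such as `plot.hist(bins=10)` use NumPy-style binning, that is, equal-width bins spanning the minimum to the maximum of the data; another plotting library may place its bin boundaries differently, which would change the counts. The values below are hypothetical stand-ins for `state['Population'] / 1_000_000`.

```python
import numpy as np

# Hypothetical populations in millions, standing in for state['Population'].
population = np.array([0.6, 0.7, 1.3, 2.9, 4.8, 6.5, 9.9, 12.8, 19.4, 37.3])

# Ten equal-width bins spanning min..max, as np.histogram computes them.
counts, edges = np.histogram(population, bins=10)
print(edges)
print(counts)
```

Comparing these edges with the bin boundaries drawn by the other library would show whether the two plots are counting over the same intervals.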
  • Python Jupyter Notebook program output is different from what is shown there

    This is in reference to the Python Jupyter notebook for Chapter 5: Classification, section Undersampling.

    The code and output, as shown in the notebook, are below:

    original

    However, when I rerun that notebook, the output is as shown below

    actual

    Needless to say, the output is drastically different from what is in the original notebook. I have rerun the same code in a different notebook, and the output still differs from the original.

    Please look into this.

    opened by mayankkaizen 2
  • Ch 3. Line 77 in Python Code

    https://github.com/gedeck/practical-statistics-for-data-scientists/blob/0db4dbbcdb3ea61f2adb29e3912873168cf0bb92/python/code/Chapter%203%20-%20Statistial%20Experiments%20and%20Significance%20Testing.py#L77

    This line raises a TypeError: '>' not supported between instances of 'list' and 'float'

    It would be better to change this line to print(np.mean(np.array(perm_diffs) > mean_b - mean_a))

    opened by deulkkae 2
  • Suggested - break down the Chapter..N...py files into smaller files ch_1_01...py etc.

    ...or possibly move this into a contrib/ folder?

    The only changes are in practical-statistics-for-data-scientists/python/code, plus the addition of practical-statistics-for-data-scientists/python/data

    opened by pdxrod 0
  • Pull request

    I'm trying to do a pull request for some files which I've added to this project. They are the Python files Chapter..N...py broken down into smaller files to make them easier to read. I couldn't see how to do a pull request unless I had write access to this repo, so I cloned, and created my own, at https://github.com/pdxrod/practical-statistics-for-data-scientists. I'll delete this repo if requested to do so by Peter Gedeck.

    The main purpose of this branch (small-files) was to make it easier for me to read the book and understand it, being able to see the code in smaller sections, whereas the Chapter..N...py files are 395 lines on average.

    opened by pdxrod 16