Practical-statistics-for-data-scientists - Code repository for O'Reilly book


Code repository

Practical Statistics for Data Scientists:

50+ Essential Concepts Using R and Python

by Peter Bruce, Andrew Bruce, and Peter Gedeck


View the notebooks online: nbviewer

Execute the notebooks in Binder: Binder

This can take some time if the binder environment needs to be rebuilt.

Other language versions

Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python
2020: ISBN 149207294X
Google books, Amazon
データサイエンスのための統計学入門 第2版 ―予測、分類、統計モデリング、統計的機械学習とR/Pythonプログラミング
2020: ISBN 487311926X, Shinya Ohashi (supervisor), Toshiaki Kurokawa (translator)
Google books, Amazon
Praktische Statistik für Data Scientists: 50+ essenzielle Konzepte mit R und Python
2021: ISBN 3960091532, Marcus Fraaß (translator)
Google books, Amazon
Practical Statistics for Data Scientists: 데이터 과학을 위한 통계(2판)
2021: ISBN 9791162244180, Junyong Lee (translator)
Google books, Hanbit media
Statystyka praktyczna w data science. 50 kluczowych zagadnień w językach R i Python
2021: ISBN 9788328374270
Google books, Amazon, Helion

See also

Setup R and Python environments


Run the following commands in R to install all required packages:

if (!require(vioplot)) install.packages('vioplot')
if (!require(corrplot)) install.packages('corrplot')
if (!require(gmodels)) install.packages('gmodels')
if (!require(matrixStats)) install.packages('matrixStats')

if (!require(lmPerm)) install.packages('lmPerm')
if (!require(pwr)) install.packages('pwr')

if (!require(FNN)) install.packages('FNN')
if (!require(klaR)) install.packages('klaR')
if (!require(DMwR)) install.packages('DMwR')

if (!require(xgboost)) install.packages('xgboost')

if (!require(ellipse)) install.packages('ellipse')
if (!require(mclust)) install.packages('mclust')
if (!require(ca)) install.packages('ca')


We recommend using a conda environment to run the Python code:

conda create -n sfds python
conda activate sfds
conda env update -n sfds -f environment.yml
  • Chapter 7 Unsupervised Learning Cell #18 Dendrogram is giving error

    Cell #18 (dendrogram) of the Chapter 7 Unsupervised Learning notebook raises a ValueError; the traceback follows. Please fix the error and upload the corrected code to the GitHub page. Thanks.

    ValueError Traceback (most recent call last) in
          1 fig, ax = plt.subplots(figsize=(5, 5))
          2
    ----> 3 dendrogram(Z, labels=df.index, color_threshold=0)
          4 plt.xticks(rotation=90)
          5 ax.set_ylabel('distance')

    C:\ProgramData\Anaconda3\lib\site-packages\scipy\cluster\ in dendrogram(Z, p, truncate_mode, color_threshold, get_leaves, orientation, labels, count_sort, distance_sort, show_leaf_counts, no_plot, no_labels, leaf_font_size, leaf_rotation, leaf_label_func, show_contracted, link_color_func, ax, above_threshold_color)
       3275                          "'bottom', or 'right'")
       3276
    -> 3277     if labels and Z.shape[0] + 1 != len(labels):
       3278         raise ValueError("Dimensions of Z and labels must be consistent.")
       3279

    C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\ in __nonzero__(self)
       2148     def __nonzero__(self):
       2149         raise ValueError(
    -> 2150             f"The truth value of a {type(self).__name__} is ambiguous. "
       2151             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
       2152         )

    ValueError: The truth value of a Index is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

    opened by SSJUSA 14
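
The traceback above shows scipy evaluating `if labels and ...`, which fails when `labels` is a pandas Index. One possible workaround, sketched here with a small made-up data frame rather than the notebook's data, is to pass the labels as a plain list. The data values and label names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical stand-in for the notebook's data frame; only the index matters here.
df = pd.DataFrame(
    np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [1.2, 0.9]]),
    index=['AAA', 'BBB', 'CCC', 'DDD'],
)
Z = linkage(df.to_numpy(), method='complete')

# A plain list can be truth-tested safely, unlike a pandas Index.
# no_plot=True skips drawing so the sketch runs without a display.
result = dendrogram(Z, labels=list(df.index), color_threshold=0, no_plot=True)
leaf_labels = result['ivl']  # leaf labels in left-to-right plotting order
```

In the notebook cell itself, the equivalent change would be `labels=list(df.index)` in the existing `dendrogram(...)` call.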
  • Changing the method of getting path to make it robust

    This change makes the code more robust. Using 'inspect' to find the filename and path works seamlessly with the execfile(...) function and with PyCharm's Alt+Shift+E shortcut (run the highlighted lines). As mentioned here:

    opened by xordux 7
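
The idea described above can be sketched as follows. This is a minimal illustration, not the code from the pull request; the function name `current_dir` and the `data` folder are hypothetical.

```python
import inspect
from pathlib import Path


def current_dir() -> Path:
    """Return the directory of the file in which this function is defined."""
    # getframeinfo reports the filename recorded in the running code object,
    # so it does not depend on the __file__ global being defined.
    filename = inspect.getframeinfo(inspect.currentframe()).filename
    return Path(filename).resolve().parent


DATA = current_dir() / 'data'  # hypothetical data folder next to the script
```

When the code is run from an interactive session, the recorded filename is a placeholder, so the result falls back to the current working directory.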
  • Python code for Chapter 3 - Web Stickiness - TypeError in the original code

    Running the Chapter 3 Web Stickiness notebook raises a TypeError.

    The line: print(np.mean(perm_diffs > mean_b - mean_a))

    results in the following TypeError: '>' not supported between instances of 'list' and 'float'

    which can be fixed using a map object such as:

    mapObj = map(lambda _: _>(mean_b-mean_a), perm_diffs)
    print (f'{sum(mapObj)*100/len(perm_diffs):4.2f}%')
    opened by Slepetys 5
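
The TypeError arises because `perm_diffs` is a plain Python list, which does not support elementwise comparison. A shorter alternative to the map-based workaround is to convert the list to a NumPy array first. The numbers below are made up and only stand in for the notebook's values.

```python
import numpy as np

# Hypothetical values standing in for the notebook's results.
perm_diffs = [-12.5, 3.0, 40.2, 18.7, -5.1]  # differences from permutations
mean_a, mean_b = 126.3, 162.0                # observed group means
observed = mean_b - mean_a

# Converting the list to an array makes the comparison elementwise, so the
# mean of the boolean array is the share of permutations above the observed value.
p_value = np.mean(np.array(perm_diffs) > observed)
print(p_value)
```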
  • Graphs in Chapter 5 Classification are not displaying in the Jupyter Notebook

    The Jupyter notebook for Chapter 5 Classification gives the following error:

    Matplotlib is currently using agg, which is a non-GUI backend, so can't show the figure.

    Please fix this error and update the notebook files on the book's GitHub page.

    Thanks and best regards, SSJ

    opened by SSJUSA 4
  • Again in Ch 5, 6, 7

    Naive Bayes, The Naive Solution

    The predicted probabilities are different. They should be 0.4798964 (paid off) and 0.5201036 (default).

    I ran the code in Colab. Could you check this notebook?

    Variable importance

    A line break is needed in line 318.

    Hyperparameters and Cross-Validation

    A line break is needed in line 453.

    Line 452 also raises a type error. Could you check this line? "TypeError: Object with dtype category cannot perform the numpy op subtract"

    Python XGBoost code in Ch 6

    It would be better to set eval_metric='error' in the Python code too.

    opened by deulkkae 3
  • Errors and Questions in Ch5, 6, 7

    1. In Chapter 5, some notebook results differ from those in the printed book.

    [Confusion Matrix]

    In [18]:
    # Confusion matrix
    pred <- predict(logistic_gam, newdata=loan_data)
    pred_y <- as.numeric(pred > 0)
    true_y <- as.numeric(loan_data$outcome=='default')
    true_pos <- (true_y==1) & (pred_y==1)
    true_neg <- (true_y==0) & (pred_y==0)
    false_pos <- (true_y==0) & (pred_y==1)
    false_neg <- (true_y==1) & (pred_y==0)
    conf_mat <- matrix(c(sum(true_pos), sum(false_pos),
                         sum(false_neg), sum(true_neg)), 2, 2)
    colnames(conf_mat) <- c('Yhat = 1', 'Yhat = 0')
    rownames(conf_mat) <- c('Y = 1', 'Y = 0')

             Yhat = 1  Yhat = 0
    Y = 1       14293      8378
    Y = 0        8051     14620

    In the R notebook, there are 14,293 correctly predicted defaults and 8,378 incorrectly predicted ones. In the printed book, however, they are 14,295 and 8,376.

    In Python, I got yet another set of numbers.

                  Yhat = default  Yhat = paid off
    Y = default            14336             8335
    Y = paid off            8148            14523

    Which one is correct? If the notebook's results are right, the numbers in the first paragraph of page 222 should be edited.

    2. This is also about code results that differ from the printed book.


    In [21]: 
    sum(roc_df$recall[-1] * diff(1-roc_df$specificity))

    The result in the notebook is 0.692623197044616, but it is 0.6926172 in the book. Please check the Python code and result too.

    3. XGBoost was updated to 1.3.0, which causes some errors in the code in Chapters 6 and 7 (pages 272, 275, 276, 280).

    The code runs fine up to page 276. But without explicitly setting eval_metric="error", you will get errors on page 280. I think it would be better to update the code on GitHub.

    4. In Chapter 7, K-Means Clustering - A Simple Example

    In [12]:
    df <- sp500_px[row.names(sp500_px)>='2011-01-01', c('XOM', 'CVX')]
    km <- kmeans(df, centers=4, nstart=1)
    df$cluster <- factor(km$cluster)
    XOM	CVX	cluster
    2011-01-03	0.73680496	0.2406809	1
    2011-01-04	0.16866845	-0.5845157	4
    2011-01-05	0.02663055	0.4469854	1
    2011-01-06	0.24855834	-0.9197513	4
    2011-01-07	0.33732892	0.1805111	1
    2011-01-10	0.00000000	-0.4641675	4

    In the notebook, the first six records are assigned to either cluster 1 or cluster 4. The means of the clusters are shown below.

    In [13]:
    centers <- data.frame(cluster=factor(1:4), km$centers)
    cluster	XOM	CVX
    1	 0.2315403	 0.3169645
    2	 0.9270317	 1.3464117
    3	-1.1439800	-1.7502975
    4	-0.3287416	-0.5734695

    But the execution results in the book are a little different: the records are assigned to cluster 1 or 2. However, as Figure 7-5 shows, clusters 3 and 4 are in the negative area (lower left of the graph), and they appear to represent a "down" market. So I think the code results and some sentences on pages 296-297 should be changed.

    5. In Chapter 7, on page 323, the first line of the data table shows the wrong columns.

    > x <- loan_data[1:5, c('dti', 'payment_inc_ratio', 'home_', 'purpose_')]
    > x
        dti payment_inc_ratio   home             purpose  
      <dbl>             <dbl> <fctr>             <fctr>
    1  1.00           2.39320   RENT                car

    It should be changed like this.

    > x <- loan_data[1:5, c('dti', 'payment_inc_ratio', 'home_', 'purpose_')]
    > x
        dti payment_inc_ratio   home_             purpose_  
      <dbl>             <dbl> <fctr>             <fctr>
    1  1.00           2.39320   RENT                major_purchase

    Please check them all and let me know if I misunderstood or did something wrong. :) Thanks in advance.

    opened by deulkkae 3
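
For point 2 of the issue above, the R expression `sum(roc_df$recall[-1] * diff(1-roc_df$specificity))` has a direct NumPy counterpart, sketched below. The ROC points are made up for illustration and are not the loan data, so the result says nothing about which of the reported values is correct.

```python
import numpy as np

# Hypothetical ROC points ordered by increasing false positive rate.
specificity = np.array([1.0, 0.8, 0.5, 0.0])
recall = np.array([0.0, 0.5, 0.8, 1.0])

# Same rectangle sum as the R line: drop the first recall value and multiply
# by the step widths of the false positive rate (1 - specificity).
auc = np.sum(recall[1:] * np.diff(1 - specificity))
print(auc)
```

Small differences in the last digits, like those reported, can come from a few observations being classified differently, since each one shifts the ROC points slightly.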
  • sp500_data.csv.gz  & kc_tax.csv.gz

    Hi Peter, I am new to this platform, Python, and your book. I was able to download all the data files to follow along except the two compressed files above; they give "Error 79 - Inappropriate file type or format". I am on macOS Catalina 10.15.6.

    Please upload a better copy. Thanks.

    opened by mg-nyc 3
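
Files ending in `.csv.gz` do not have to be unpacked by the operating system before use. As a minimal sketch with only the standard library, the example below writes a tiny gzip-compressed CSV (a made-up stand-in for the repository's files) and reads it back directly.

```python
import csv
import gzip
import tempfile
from pathlib import Path

# Write a tiny gzip-compressed CSV as a stand-in for the repository's files.
path = Path(tempfile.mkdtemp()) / 'example.csv.gz'
with gzip.open(path, 'wt', newline='') as f:
    csv.writer(f).writerows([['date', 'XOM'], ['2011-01-03', '0.7368']])

# gzip.open in text mode decompresses on the fly; no separate unzip step is needed.
with gzip.open(path, 'rt', newline='') as f:
    rows = list(csv.reader(f))
print(rows)
```

`pandas.read_csv` also reads gzip-compressed files directly, inferring the compression from the `.gz` extension.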
  • perm_fun use of set()

    Using perm_fun(x, nA, nB) for the permutation tests on pages 99-101 now results in a deprecation warning.

    "FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead."

    opened by akthe-at 2
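
The warning can be avoided by indexing with lists instead of sets. The function body below is an assumed reconstruction of what a `perm_fun(x, nA, nB)` for a two-group permutation test might look like, not the book's exact code, and the data values are made up.

```python
import random

import pandas as pd


def perm_fun(x, nA, nB):
    """Difference in means between two randomly drawn groups of sizes nB and nA."""
    n = nA + nB
    # Sample positions for group B and keep the remaining ones for group A.
    idx_B = random.sample(range(n), nB)
    chosen = set(idx_B)
    idx_A = [i for i in range(n) if i not in chosen]
    # Index with lists: passing a set as an indexer is deprecated in pandas.
    return x.loc[idx_B].mean() - x.loc[idx_A].mean()


x = pd.Series([10.0, 12.0, 9.0, 11.0, 30.0, 28.0])  # made-up values
diff = perm_fun(x, nA=3, nB=3)
print(diff)
```

The set is still used for the membership test, where it is appropriate; only the indexers passed to `.loc` are lists.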
  • Different histogram under the same number of bins

    In Chapter 1, in the section "Frequency Tables and Histograms", I tried to replicate the histogram code with a different Python package, lets-plot, which should produce a plot similar to hist() in R. However, the y-axis (the frequency) differs from what R and Python generate with the same number of bins.

    The histogram generated from the textbook code: image


    ax = (state['Population'] / 1_000_000).plot.hist(bins=10)  
    ax.set_xlabel('Population (millions)')

    The histogram generated by lets-plot (aka ggplot in Python): image


    temp_df = pd.DataFrame(state['Population'] / 1_000_000)  
    ggplot(temp_df, aes(x="Population")) + geom_histogram(bins=10)
    opened by hchen98 2
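
One possible explanation, not confirmed in the issue, is that the two libraries place their bin edges differently even when the number of bins is the same. The sketch below uses made-up values and NumPy only; "convention B" is an assumption chosen for illustration and is not taken from the lets-plot documentation.

```python
import numpy as np

# Made-up, right-skewed values; not the state population data.
data = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0, 37.0])
bins = 10

# Convention A: split [min, max] into equal-width bins (NumPy's default).
counts_a, edges_a = np.histogram(data, bins=bins)

# Convention B (assumed for illustration): center the first and last bins
# on the minimum and maximum, which widens and shifts every bin.
width = (data.max() - data.min()) / (bins - 1)
edges_b = np.linspace(data.min() - width / 2, data.max() + width / 2, bins + 1)
counts_b, _ = np.histogram(data, bins=edges_b)

print(counts_a)
print(counts_b)
```

Both calls count every observation exactly once, yet the per-bin frequencies differ because the edges differ.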
  • Python Jupyter Notebook program output is different from what is shown there

    This is in reference to Python Jupyter Notebook for Chapter 5: Classification, section: Undersampling.

    The code and outputs, as they appear in the notebook, are shown below:


    However, when I rerun that notebook, the output is as shown below


    Needless to say, the output is drastically different from the original notebook's. I reran the same code in a different notebook, and the output still differs from the original.

    Please look into this.

    opened by mayankkaizen 2
  • Ch 3. Line 77 in Python Code

    This line raises a TypeError: '>' not supported between instances of 'list' and 'float'

    It would be better to change this line to print(np.mean(np.array(perm_diffs) > mean_b - mean_a))

    opened by deulkkae 2
  • Suggested - break down the files into smaller files etc.

    ...or possibly move this into a contrib/ folder?

    The only changes are in practical-statistics-for-data-scientists/python/code, plus the addition of practical-statistics-for-data-scientists/python/data

    opened by pdxrod 0
  • Pull request

    I'm trying to do a pull request for some files which I've added to this project. They are the Python files broken down into smaller files to make them easier to read. I couldn't see how to do a pull request unless I had write access to this repo, so I cloned, and created my own, at I'll delete this repo if requested to do so by Peter Gedeck.

    The main purpose of this branch (small-files) was to make it easier for me to read the book and understand it, being able to see the code in smaller sections, whereas the files are 395 lines on average.

    opened by pdxrod 16
