100 data puzzles for pandas, ranging from short and simple to super tricky (60% complete)

Overview

100 pandas puzzles

Puzzles notebook

Solutions notebook

Inspired by 100 Numpy exercises, here are 100* short puzzles for testing your knowledge of pandas' power.

Since pandas is a large library with many different specialist features and functions, these exercises focus mainly on the fundamentals of manipulating data (indexing, grouping, aggregating, cleaning), making use of the core DataFrame and Series objects. Many of the exercises here are straightforward in that the solutions require no more than a few lines of code (in pandas or NumPy - don't go using pure Python!). Choosing the right methods and following best practices is the underlying goal.

The exercises are loosely divided into sections. Each section has a difficulty rating; these ratings are subjective, of course, but should be seen as a rough guide to how elaborate the required solution needs to be.
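
As a rough illustration of the "no pure Python" guideline, the sketch below computes a per-group mean twice: once with an explicit loop and once with a grouped aggregation. The DataFrame and its column names ('animal', 'age') are invented for this example and are not taken from the puzzles.

```python
import pandas as pd

# Hypothetical data, invented for this illustration.
df = pd.DataFrame({
    "animal": ["cat", "dog", "cat", "dog", "cat"],
    "age": [2.0, 5.0, 4.0, 7.0, 3.0],
})

# Pure-Python approach: loop over the rows and accumulate by hand.
totals, counts = {}, {}
for animal, age in zip(df["animal"], df["age"]):
    totals[animal] = totals.get(animal, 0.0) + age
    counts[animal] = counts.get(animal, 0) + 1
loop_means = {animal: totals[animal] / counts[animal] for animal in totals}

# pandas approach: a single grouped aggregation does the same work.
pandas_means = df.groupby("animal")["age"].mean()

print(loop_means)
print(pandas_means.to_dict())
```

Both approaches give the same numbers; the grouped version is shorter and states the intent directly, which is the kind of solution the puzzles are looking for.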

Good luck solving the puzzles!

* the list of puzzles is not yet complete! Pull requests or suggestions for additional exercises, corrections and improvements are welcomed.

Overview of puzzles

| Section Name | Description | Difficulty |
|---|---|---|
| Importing pandas | Getting started and checking your pandas setup | Easy |
| DataFrame basics | A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames | Easy |
| DataFrames: beyond the basics | Slightly trickier: you may need to combine two or more methods to get the right answer | Medium |
| DataFrames: harder problems | These might require a bit of thinking outside the box... | Hard |
| Series and DatetimeIndex | Exercises for creating and manipulating Series with datetime data | Easy/Medium |
| Cleaning Data | Making a DataFrame easier to work with | Easy/Medium |
| Using MultiIndexes | Go beyond flat DataFrames with additional index levels | Medium |
| Minesweeper | Generate the numbers for safe squares in a Minesweeper grid | Hard |
| Plotting | Explore pandas' part of plotting functionality to see trends in data | Medium |

Setting up

To tackle the puzzles on your own computer, you'll need a Python 3 environment with the dependencies (namely pandas) installed.

One way to do this is as follows. I'm using a bash shell; the procedure on macOS should be essentially the same. I'm not sure about Windows.

  1. Check you have Python 3 installed by printing the version of Python:
     python -V
  2. Clone the puzzle repository using Git:
     git clone https://github.com/ajcr/100-pandas-puzzles.git
  3. Install the dependencies from the cloned directory (caution: if you don't want to modify any Python modules in your active environment, consider using a virtual environment instead):
     python -m pip install -r 100-pandas-puzzles/requirements.txt
  4. Launch a Jupyter notebook server:
     jupyter notebook --notebook-dir=100-pandas-puzzles

You should be able to see the notebooks and launch them in your web browser.
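
If the notebooks fail to start, a quick check of the environment can help narrow things down. The snippet below is a minimal sketch using only the standard library; the helper name check_environment is made up for this example.

```python
import importlib.util
import sys


def check_environment(package="pandas"):
    """Return True if Python 3 is running and `package` can be found."""
    is_python3 = sys.version_info.major == 3
    # find_spec returns None when a top-level package is not installed.
    is_installed = importlib.util.find_spec(package) is not None
    return is_python3 and is_installed


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    print("pandas available:", check_environment("pandas"))
```

If the second line prints False, the dependencies were most likely installed into a different Python environment from the one running the script.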

Contributors

This repository has benefitted from numerous contributors, with those who have sent puzzles and fixes listed in CONTRIBUTORS.

Thanks to everyone who has raised an issue too.

Other links

If you feel like reading up on pandas before starting, the official documentation is useful and very extensive. Good places to get a broader overview of pandas are:

There are many other excellent resources and books that are easily searchable and purchasable.

Comments
  • Add another solution to distance-to-zero.

    Another solution to the distance to zero problem. This uses a group by. I think this one is more clear than the other two, but that could certainly just be to me.

    opened by madrury 4
  • Adding 5 plotting puzzles

    Really enjoyed your puzzles, so I tried my hand at making some. Let me know if there's anything I can change, or if you believe there is some other pandas functionality that's deserving of puzzles as I'm happy to come up with more.

    opened by johink 2
  • Join forces?

    Hi Alex,

    As part of a quest to learn pandas better, I created a series of exercises in a different form from yours. I would like to know whether you might be interested in contributing in any way to my repo, or whether I can use your exercises.

    Thanks

    opened by guipsamora 2
  • Fixed solution b ex 29 related to issue 17

    Implemented the fix suggested by @Arten013 in issue #17.

    This consists of a single, small change to the solutions notebook: in the second solution of exercise 29, replace

    x = (df['X'] != 0).cumsum()
    y = x != x.shift()

    with

    y = df['X'] != 0

    opened by pleydier 1
  • The second solution to Q29 does not work properly.

    The solution below (Q29-2) outputs the wrong answer when the input DataFrame's values start with zero.

    x = (df['X'] != 0).cumsum()
    y = x != x.shift()
    df['Y'] = y.groupby((y != y.shift()).cumsum()).cumsum()
    

    In this code, the Series y should be True where the value of X is not zero and False otherwise. However, the first value of y is always True: x.shift() is NaN in the first position, and comparing any value with NaN using != gives True.

    e.g.

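    import pandas as pd

    df1 = pd.DataFrame({'X': [0, 2, 0, 3]})  # starts with zero
    df2 = pd.DataFrame({'X': [1, 2, 0, 3]})  # starts with a non-zero value
    
    x = (df1['X'] != 0).cumsum()
    y = x != x.shift()  # x.shift() is NaN in the first position
    print(y[0])
    
    x = (df2['X'] != 0).cumsum()
    y = x != x.shift()
    print(y[0])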
    

    outputs

    True
    True
    

    This bug can be fixed by replacing the first two lines with y = df['X'] != 0

    Here's the code to compare the results of solution 1, solution 2 and the modified solution 2.

    import pandas as pd
    import numpy as np
    df = pd.DataFrame({'X': [0, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
    
    def solution1(df):
        # indices of zeros (Series.nonzero() was removed in pandas 1.0, so go via NumPy)
        izero = np.r_[-1, (df['X'] == 0).to_numpy().nonzero()[0]]
        idx = np.arange(len(df))
        return pd.Series(idx - izero[np.searchsorted(izero - 1, idx) - 1])
    
    def solution2(df):
        x = (df['X'] != 0).cumsum()
        y = x != x.shift()
        return y.groupby((y != y.shift()).cumsum()).cumsum()
    
    def solution2_modified(df):
        y = df['X'] != 0
        return y.groupby((y != y.shift()).cumsum()).cumsum()
    
    check_df = pd.concat([df, solution1(df), solution2(df), solution2_modified(df)], axis=1)
    check_df.columns = ['input_df', 'solution1', 'solution2', 'solution2_modified']
    display(check_df)
    
    

    |input_df|solution1|solution2|solution2_modified|
    |-------:|--------:|--------:|-----------------:|
    | 0| 0| 1| 0|
    | 2| 1| 2| 1|
    | 0| 0| 0| 0|
    | 3| 1| 1| 1|
    | 4| 2| 2| 2|
    | 2| 3| 3| 3|
    | 5| 4| 4| 4|
    | 0| 0| 0| 0|
    | 3| 1| 1| 1|
    | 4| 2| 2| 2|

    I executed this code with Python 3.6.7 and pandas 0.24.0.

    opened by Arten013 1
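
One way to guard against this kind of edge case is to compare the grouped solution with a deliberately simple reference. The sketch below does that for inputs that start with zero and with a non-zero value. The function names are made up for this example, and the flags are cast to int (an addition not present in the issue) so that the cumulative sum is plainly numeric.

```python
import pandas as pd


def count_since_zero(df):
    # Corrected approach from the issue: flag non-zero entries, then
    # count up within each run of consecutive equal flags.
    y = (df["X"] != 0).astype(int)
    return y.groupby((y != y.shift()).cumsum()).cumsum()


def count_since_zero_reference(values):
    # Plain-Python reference used only for checking: reset the
    # counter at every zero, otherwise increment it.
    result, count = [], 0
    for value in values:
        count = 0 if value == 0 else count + 1
        result.append(count)
    return result


# Inputs that start with zero and with a non-zero value.
for values in ([0, 2, 0, 3], [1, 2, 0, 3], [0, 0, 5, 6, 0]):
    df = pd.DataFrame({"X": values})
    assert count_since_zero(df).tolist() == count_since_zero_reference(values)
```

The loop is only there as a check; the grouped version remains the pandas-style answer.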
  • Please add a license file

    It would be nice if these exercises would have a license, so one knows under which conditions one can make use of them.

    I don't have any particular license in mind myself, and of course that's not my call to make, though in the name of reducing license proliferation I would suggest using the same license as pandas itself uses: https://github.com/pandas-dev/pandas/blob/master/LICENSE .

    opened by jabl 1
  • Sorting not needed in solution to question 27.

    27. A DataFrame has a column of groups 'grps' and a column of numbers 'vals'. For each group, find the sum of the three greatest values.

    The solution starts by sorting the 'vals' column - this is not needed. The nlargest method selects the three greatest values irrespective of the order of the elements.

    Suggestion: delete the sorting; the solution is provided by just the second line of code.

    opened by ibah 1
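
The point about sorting can be checked with a small experiment. This is a sketch on made-up data, not the notebook's own solution: it sums the three largest values per group with and without sorting first and confirms that the results agree.

```python
import pandas as pd

# Hypothetical data, deliberately left unsorted.
df = pd.DataFrame({
    "grps": ["a", "b", "a", "b", "a", "a", "b"],
    "vals": [5, 1, 9, 7, 2, 8, 3],
})

# Without sorting: nlargest picks the top three values in each group,
# and the second groupby sums them per group.
top3_sum = df.groupby("grps")["vals"].nlargest(3).groupby(level=0).sum()

# With an explicit sort first: same result, extra work.
sorted_first = (
    df.sort_values("vals", ascending=False)
    .groupby("grps")["vals"]
    .nlargest(3)
    .groupby(level=0)
    .sum()
)

assert top3_sum.to_dict() == sorted_first.to_dict()
print(top3_sum.to_dict())
```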
  • NaN problem with question 21

    Thanks for the project. When working on question 21 with pandas 1.2.4, I found that the NaN values in 'age' need to be filled first.

    df['age'] = df['age'].fillna(0)
    df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')
    
    opened by ProfFL028 0
  • Question 20.1 added

    Question 20.1 added to convert the age and visits columns to float and int respectively. Having the object datatype for age and visits causes an issue when creating the pivot table in question 21.

    opened by rushikeshjaisur11 0
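
A conversion along the lines described in this pull request might look like the sketch below. The data is invented for illustration, and the columns are created with the object dtype on purpose to mimic the situation reported above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with numeric values stored as generic Python objects.
df = pd.DataFrame({
    "animal": ["cat", "dog", "cat", "dog"],
    "age": pd.Series([2.5, np.nan, 3.0, 7.0], dtype=object),
    "visits": pd.Series([1, 3, 3, 1], dtype=object),
})

# Convert to proper numeric dtypes before aggregating.
df["age"] = df["age"].astype(float)
df["visits"] = df["visits"].astype(int)

# With numeric columns, the mean ignores missing ages.
table = df.pivot_table(index="animal", columns="visits", values="age", aggfunc="mean")
print(table)
```

With a float column, missing ages are skipped by the mean rather than filled in, so combinations with no known age simply show up as NaN in the table.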
  • add alternate version for q20

    Added an alternate version of q20. It is a little longer than just .replace, but it showcases how you could use lambda and apply in that situation. Let me know what you think.

    opened by cullinap 0
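
The trade-off mentioned in the last comment, .replace versus apply with a lambda, can be sketched as follows. The Series and its values are assumptions made for this example rather than the notebook's actual data.

```python
import pandas as pd

# Hypothetical column and values, chosen only for illustration.
animals = pd.Series(["cat", "snake", "dog", "snake"], name="animal")

# Built-in approach: replace matching values directly.
with_replace = animals.replace("snake", "python")

# Longer alternative: apply a lambda to each element.
with_apply = animals.apply(lambda name: "python" if name == "snake" else name)

assert with_replace.tolist() == with_apply.tolist()
print(with_replace.tolist())
```

Both give the same values here; .replace is the shorter choice for a fixed substitution, while apply leaves room for more elaborate per-element logic.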
Owner
Alex Riley