100 data puzzles for pandas, ranging from short and simple to super tricky (60% complete)

Alex Riley

Last update: Jan 8, 2023

Overview

100 pandas puzzles

Solutions notebook

Inspired by 100 Numpy exercises, here are 100* short puzzles for testing your knowledge of pandas' power.

Since pandas is a large library with many different specialist features and functions, these exercises focus mainly on the fundamentals of manipulating data (indexing, grouping, aggregating, cleaning), making use of the core DataFrame and Series objects. Many of the exercises here are straightforward in that the solutions require no more than a few lines of code (in pandas or NumPy - don't go using pure Python!). Choosing the right methods and following best practices is the underlying goal.

The exercises are loosely divided into sections. Each section has a difficulty rating; these ratings are subjective, of course, but should be seen as a rough guide as to how elaborate the required solution needs to be.

Good luck solving the puzzles!

* The list of puzzles is not yet complete! Pull requests or suggestions for additional exercises, corrections and improvements are welcome.

Overview of puzzles

Section Name	Description	Difficulty
Importing pandas	Getting started and checking your pandas setup	Easy
DataFrame basics	A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames	Easy
DataFrames: beyond the basics	Slightly trickier: you may need to combine two or more methods to get the right answer	Medium
DataFrames: harder problems	These might require a bit of thinking outside the box...	Hard
Series and DatetimeIndex	Exercises for creating and manipulating Series with datetime data	Easy/Medium
Cleaning Data	Making a DataFrame easier to work with	Easy/Medium
Using MultiIndexes	Go beyond flat DataFrames with additional index levels	Medium
Minesweeper	Generate the numbers for safe squares in a Minesweeper grid	Hard
Plotting	Explore pandas' plotting functionality to see trends in data	Medium

Setting up

To tackle the puzzles on your own computer, you'll need a Python 3 environment with the dependencies (namely pandas) installed.

One way to do this is as follows. I'm using a bash shell; the procedure on macOS should be essentially the same. I'm not sure about Windows.

Check you have Python 3 installed by printing the version of Python:

python -V

Clone the puzzle repository using Git:

git clone https://github.com/ajcr/100-pandas-puzzles.git

Install the dependencies (caution: if you don't want to modify any Python modules in your active environment, consider using a virtual environment instead):

python -m pip install -r requirements.txt

Launch a Jupyter notebook server:

jupyter notebook --notebook-dir=100-pandas-puzzles

You should be able to see the notebooks and launch them in your web browser.
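Once a notebook is open, a quick sanity check can confirm that pandas is importable in the active environment. The snippet below is a minimal sketch: it only reports whichever versions happen to be installed and assumes nothing about the versions the puzzles were written against.

```python
import sys

import pandas as pd

# Report the interpreter and pandas versions found in this environment.
python_version = "{}.{}.{}".format(*sys.version_info[:3])
pandas_version = pd.__version__

print("Python:", python_version)
print("pandas:", pandas_version)
```

If the import fails, the dependencies were most likely installed into a different environment from the one running the notebook.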

Contributors

This repository has benefitted from numerous contributors, with those who have sent puzzles and fixes listed in CONTRIBUTORS.

Thanks to everyone who has raised an issue too.

Other links

If you feel like reading up on pandas before starting, the official documentation is useful and very extensive. Good places to get a broader overview of pandas are:

There are many other excellent resources and books that are easily searchable and purchasable.

Comments

Add another solution to distance-to-zero.

Another solution to the distance-to-zero problem. This uses a group by. I think this one is clearer than the other two, but that could certainly just be me.

opened by madrury 4
Adding 5 plotting puzzles

Really enjoyed your puzzles, so I tried my hand at making some. Let me know if there's anything I can change, or if you believe there is some other pandas functionality that's deserving of puzzles as I'm happy to come up with more.

opened by johink 2
Join forces?

Hi Alex,

As part of a quest to better learn pandas, I created a series of exercises in a different form from yours. I would like to know whether you might be interested in contributing in any way to my repo, or whether I can use your exercises.

Thanks

opened by guipsamora 2
Fixed solution b ex 29 related to issue 17

Implemented the fix suggested by @Arten013 in issue #17.

This consists of a single, small change to the solutions notebook: in the second solution of exercise 29, replace

    x = (df1['X'] != 0).cumsum()
    y = x != x.shift()

with

    y = df['X'] != 0

opened by pleydier 1
The second solution to Q29 does not work properly.
The solution below (Q29-2) outputs a wrong answer when the input DataFrame's values start with zero.

    x = (df['X'] != 0).cumsum()
    y = x != x.shift()
    df['Y'] = y.groupby((y != y.shift()).cumsum()).cumsum()

In this code, Series y should be True where the value of X is non-zero and False otherwise. However, the first value of y becomes True in every case.

e.g.

    df1 = pd.DataFrame({'X': [0, 2, 0, 3]})
    df2 = pd.DataFrame({'X': [1, 2, 0, 3]})

    x = (df1['X'] != 0).cumsum()
    y = x != x.shift()
    print(y[0])

    x = (df2['X'] != 0).cumsum()
    y = x != x.shift()
    print(y[0])

outputs

    True
    True
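One way to see why the first value is always True: shift() has nothing to move into the first position, so it puts NaN there, and NaN compares as unequal to every value. The sketch below uses a hand-written Series standing in for the cumulative count x when the data starts with a zero.

```python
import pandas as pd

# Stand-in for x = (df['X'] != 0).cumsum() when X starts with a zero.
x = pd.Series([0, 1, 1, 2])

shifted = x.shift()       # the first element has no predecessor, so it is NaN
changed = x != shifted    # comparing anything with NaN yields True

print(shifted.iloc[0])
print(changed.tolist())
```

The first comparison is therefore True regardless of whether the first value of X is zero.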

This bug can be fixed by replacing the first two lines with y = df['X'] != 0
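The suggested fix can be sketched as a small function. The function name and the test Series are hypothetical, and the non-zero flag is cast to int before grouping (an assumption made here so that the running count is an integer regardless of pandas version); otherwise the logic follows the replacement described above.

```python
import pandas as pd


def distance_from_last_zero(values):
    """Count steps since the previous zero (or since the start of the Series)."""
    # 1 where the value is non-zero, 0 where it is zero.
    nonzero = (values != 0).astype(int)
    # A new run starts wherever the flag differs from the previous element.
    run_id = (nonzero != nonzero.shift()).cumsum()
    # Counting up inside each run of non-zeros gives the distance; zeros stay 0.
    return nonzero.groupby(run_id).cumsum()


leading_zero = pd.Series([0, 2, 0, 3])
leading_nonzero = pd.Series([7, 2, 0, 3, 4, 0])

result_a = distance_from_last_zero(leading_zero)
result_b = distance_from_last_zero(leading_nonzero)

print(result_a.tolist())  # [0, 1, 0, 1]
print(result_b.tolist())  # [1, 2, 0, 1, 2, 0]
```

Both inputs give the expected counts, including the case where the data starts with a zero.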

Here's the code to compare the results of solution 1, solution 2 and the modified solution 2.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'X': [0, 2, 0, 3, 4, 2, 5, 0, 3, 4]})

    def solution1(df):
        izero = np.r_[-1, (df['X'] == 0).nonzero()[0]]  # indices of zeros
        idx = np.arange(len(df))
        return pd.Series(idx - izero[np.searchsorted(izero - 1, idx) - 1])

    def solution2(df):
        x = (df['X'] != 0).cumsum()
        y = x != x.shift()
        return y.groupby((y != y.shift()).cumsum()).cumsum()

    def solution2_modified(df):
        y = df['X'] != 0
        return y.groupby((y != y.shift()).cumsum()).cumsum()

    check_df = pd.concat([df, solution1(df), solution2(df), solution2_modified(df)], axis=1)
    check_df.columns = ['input_df', 'solution1', 'solution2', 'solution2_modified']
    display(check_df)

| input_df | solution1 | solution2 | solution2_modified |
|---------:|----------:|----------:|-------------------:|
|        0 |         0 |         1 |                  0 |
|        2 |         1 |         2 |                  1 |
|        0 |         0 |         0 |                  0 |
|        3 |         1 |         1 |                  1 |
|        4 |         2 |         2 |                  2 |
|        2 |         3 |         3 |                  3 |
|        5 |         4 |         4 |                  4 |
|        0 |         0 |         0 |                  0 |
|        3 |         1 |         1 |                  1 |
|        4 |         2 |         2 |                  2 |

I executed this code with Python 3.6.7 and pandas 0.24.0.
opened by Arten013 1
Please add a license file

It would be nice if these exercises would have a license, so one knows under which conditions one can make use of them.

I don't have any particular license in mind myself, and of course that's not my call to make, though in the name of reducing license proliferation I would suggest using the same license as pandas itself uses: https://github.com/pandas-dev/pandas/blob/master/LICENSE .

opened by jabl 1
Sorting not needed in solution to question 27.
A DataFrame has a column of groups 'grps' and a column of numbers 'vals'. For each group, find the sum of the three greatest values.

The solution starts by sorting the 'vals' column, which is not needed. The nlargest method selects the three greatest values irrespective of the order of the elements.

Suggestion: delete the sorting; the solution is provided by just the second line of code.
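The point about ordering can be checked with a small sketch. The data and the helper name top3_sum are made up for illustration and are not the notebook's own; the sketch compares the result on unsorted input with the result on input sorted beforehand.

```python
import pandas as pd

# Hypothetical data, deliberately not sorted by 'vals'.
df = pd.DataFrame({
    "grps": ["a", "b", "a", "c", "b", "a", "b", "a"],
    "vals": [5, 2, 9, 3, 8, 1, 4, 7],
})


def top3_sum(frame):
    # nlargest picks the three greatest values per group, whatever the row order.
    top3 = frame.groupby("grps")["vals"].nlargest(3)
    # The group label is the first index level; sum the selected values per group.
    return top3.groupby(level=0).sum()


unsorted_result = top3_sum(df)
sorted_result = top3_sum(df.sort_values("vals", ascending=False))

print(unsorted_result.to_dict())  # {'a': 21, 'b': 14, 'c': 3}
print(unsorted_result.to_dict() == sorted_result.to_dict())  # True
```

Groups with fewer than three rows (here 'c') simply contribute all of their values.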
opened by ibah 1
NaN problem with question 21
Thanks for the project. When I worked on question #21 using pandas 1.2.4, I needed to call fillna first:

    df['age'] = df['age'].fillna(0)
    df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')
opened by ProfFL028 0
Question 20.1 added

Added question 20.1, which converts the age and visits columns to float and int respectively. Having the object dtype for age and visits causes an issue when creating the pivot table in question 21.
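A minimal sketch of that conversion follows, on a small made-up DataFrame rather than the puzzle's own data, and assuming the object dtype is what gets in the way of the pivot table. It also shows, for comparison, that filling missing ages with 0 beforehand changes the resulting means, which may not be what is wanted.

```python
import numpy as np
import pandas as pd

# Hypothetical data with object-dtype columns, standing in for the puzzle's DataFrame.
df = pd.DataFrame({
    "animal": ["cat", "cat", "cat", "dog", "dog"],
    "age": [2.5, np.nan, 3.0, 5.0, 3.0],
    "visits": [1, 1, 3, 2, 2],
}).astype({"age": object, "visits": object})

# Convert to numeric dtypes; missing ages stay NaN and are skipped by the mean.
df["age"] = pd.to_numeric(df["age"])
df["visits"] = df["visits"].astype(int)

table = df.pivot_table(index="animal", columns="visits", values="age", aggfunc="mean")

# For comparison: treating missing ages as 0 pulls the cat / 1-visit mean down.
filled = df.assign(age=df["age"].fillna(0))
table_filled = filled.pivot_table(index="animal", columns="visits", values="age", aggfunc="mean")

print(table.loc["cat", 1])         # 2.5
print(table_filled.loc["cat", 1])  # 1.25
```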

opened by rushikeshjaisur11 0
add alternate version for q20

Added an alternate version of q20. It is a little longer than just .replace, but it showcases how you could use lambda and apply in that situation. Let me know what you think.
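The two approaches can be compared side by side. The Series and the values being swapped below are hypothetical and only illustrate the general pattern of replacing one value with another; they are not taken from the pull request.

```python
import pandas as pd

# Hypothetical column of labels in which one value should be swapped for another.
animals = pd.Series(["cat", "snake", "dog", "snake"])

# Short form: replace the matching value directly.
via_replace = animals.replace("snake", "python")

# Longer form: apply a lambda to each element.
via_apply = animals.apply(lambda name: "python" if name == "snake" else name)

print(via_replace.tolist())
print(via_apply.tolist())
```

Both produce the same result here; replace is shorter, while apply generalises to arbitrary per-element logic.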

opened by cullinap 0


