dplyr for python

Chris Riederer

Last update: Nov 21, 2022

Overview

Dplython: Dplyr for Python

Welcome to Dplython: Dplyr for Python.

Dplyr is a library for the language R designed to make data analysis fast and easy. The philosophy of Dplyr is to constrain data manipulation to a few simple functions that correspond to the most common tasks. This brings the way you think about an analysis closer to the code you write, helping you analyze data at the "speed of thought".

The goal of this project is to implement the functionality of the R package Dplyr on top of Python's pandas.

This is version 0.0.7. It's experimental and subject to change.

Introductory Video

Here is a 20 minute video explaining dplython, given at PyGotham 2016.

Click the awkward picture above to see the talk! Note that sound doesn't start until about 1 minute in due to microphone issues.

Installation

To install, use pip:

pip install dplython

To get the latest development version, you can clone this repo or use the command:

pip install git+https://github.com/dodger487/dplython.git

Contributing

We welcome your feature requests, open issues, bug reports, and pull requests! Please use GitHub's interface. Also consider joining the dplython mailing list.

Example usage

import pandas
from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
    sample_frac, head, arrange, mutate, group_by, summarize, DelayFunction) 

# The example `diamonds` DataFrame is included in this package, but you can 
# cast a DataFrame to a DplyFrame in this simple way:
# diamonds = DplyFrame(pandas.read_csv('./diamonds.csv'))

# Select specific columns of the DataFrame using select, and 
#   get the first few using head
diamonds >> select(X.carat, X.cut, X.price) >> head(5)
"""
Out:
   carat        cut  price
0   0.23      Ideal    326
1   0.21    Premium    326
2   0.23       Good    327
3   0.29    Premium    334
4   0.31       Good    335
"""

# Filter out rows using sift
diamonds >> sift(X.carat > 4) >> select(X.carat, X.cut, X.depth, X.price)
"""
Out:
       carat      cut  depth  price
25998   4.01  Premium   61.0  15223
25999   4.01  Premium   62.5  15223
27130   4.13     Fair   64.8  17329
27415   5.01     Fair   65.5  18018
27630   4.50     Fair   65.8  18531
"""

# Sample with sample_n or sample_frac, sort with arrange
(diamonds >> 
  sample_n(10) >> 
  arrange(X.carat) >> 
  select(X.carat, X.cut, X.depth, X.price))
"""
Out:
       carat        cut  depth  price
37277   0.23  Very Good   61.5    484
17728   0.30  Very Good   58.8    614
33255   0.32      Ideal   61.1    825
38911   0.33      Ideal   61.6   1052
31491   0.34    Premium   60.3    765
37227   0.40    Premium   61.9    975
2578    0.81    Premium   60.8   3213
15888   1.01       Fair   64.6   6353
26594   1.74      Ideal   62.9  16316
25727   2.38    Premium   62.4  14648
"""

# You can: 
#   add columns with mutate (referencing other columns!)
#   group rows into dplyr-style groups with group_by
#   collapse rows into single rows using summarize
(diamonds >> 
  mutate(carat_bin=X.carat.round()) >> 
  group_by(X.cut, X.carat_bin) >> 
  summarize(avg_price=X.price.mean()))
"""
Out:
       avg_price  carat_bin        cut
0     863.908535          0      Ideal
1    4213.864948          1      Ideal
2   12838.984078          2      Ideal
...
27  13466.823529          3       Fair
28  15842.666667          4       Fair
29  18018.000000          5       Fair
"""

# If you have column names that don't work as attributes, you can use an 
# alternate "get item" notation with X.
diamonds["column w/ spaces"] = range(len(diamonds))
diamonds >> select(X["column w/ spaces"]) >> head()
"""
Out:
   column w/ spaces
0                 0
1                 1
2                 2
3                 3
4                 4
5                 5
6                 6
7                 7
8                 8
9                 9
"""

# It's possible to pass the entire dataframe using X._ 
diamonds >> sample_n(6) >> select(X.carat, X.price) >> X._.T
"""
Out:
         18966    19729   9445   49951    3087    33128
carat     1.16     1.52     0.9    0.3     0.74    0.31
price  7803.00  8299.00  4593.0  540.0  3315.00  816.00
"""

# To pass the DataFrame or columns into functions, apply @DelayFunction
@DelayFunction
def PairwiseGreater(series1, series2):
  index = series1.index
  newSeries = pandas.Series([max(s1, s2) for s1, s2 in zip(series1, series2)])
  newSeries.index = index
  return newSeries

diamonds >> PairwiseGreater(X.x, X.y)


# Passing entire dataframe and plotting with ggplot
from ggplot import ggplot, aes, geom_point, facet_wrap
ggplot = DelayFunction(ggplot)  # Simple installation
(diamonds >> ggplot(aes(x="carat", y="price", color="cut"), data=X._) + 
  geom_point() + facet_wrap("color"))

(diamonds >>
  sift((X.clarity == "I1") | (X.clarity == "IF")) >> 
  ggplot(aes(x="carat", y="price", color="color"), X._) + 
    geom_point() + 
    facet_wrap("clarity"))

# Matplotlib works as well!
import pylab as pl
pl.scatter = DelayFunction(pl.scatter)
diamonds >> sample_frac(0.1) >> pl.scatter(X.carat, X.price)

This is very new and I'm making changes. Let me know if you'd like to see a feature or think there's a better way I can do something.

Other approaches

pandas-ply

Development of dplython began before I knew pandas-ply existed. After I found it, I chose "X" as the manager to be consistent. Pandas-ply is a great approach and worth a look. The main contrasts between the two are that:

dplython uses dplyr-style groups, as opposed to the SQL-style groups of pandas and pandas-ply.
dplython maps a little more directly onto dplyr, for example having mutate instead of an expanded select.
dplython uses operators to connect operations instead of method chaining.
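
The difference between the two grouping styles can be sketched in plain pandas. This is only an illustration on a small hypothetical DataFrame and does not use dplython itself: a SQL-style aggregation collapses each group to one row, while a dplyr-style grouped mutate keeps every row and repeats the group statistic on each of them.

```python
import pandas as pd

# Hypothetical toy data, not the diamonds dataset.
df = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Fair", "Fair"],
    "price": [300, 500, 1000, 2000],
})

# SQL-style grouping: one output row per group.
collapsed = df.groupby("cut", as_index=False)["price"].mean()

# dplyr-style grouped mutate: same number of rows as the input,
# with the group mean repeated on every row of its group.
annotated = df.assign(avg_price=df.groupby("cut")["price"].transform("mean"))
```

Here `collapsed` has two rows and `annotated` has four, which is roughly the distinction between summarize and a mutate applied after group_by.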

Comments

Rename dfilter to filter

As I suggested in #22, I think we could safely "overload" the built-in filter function by testing whether it is called with a callable and an iterable in that order. Since neither DplyFrames nor DataFrames are callable, I think we'll be safe.

opened by danrobinson 8
Add filtering joins, tests

semi_join() and anti_join() are now functional, and there is a fair bit of testing added for them. One possibly minor note: they do require pandas v. 0.17.0 (current on v. 0.18.1). Not sure if this is too much to ask, but they do require it (specifically, it's an option that was added to pandas.DataFrame.merge).

opened by bleearmstrong 7
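
For readers unfamiliar with filtering joins, here is a minimal sketch in plain pandas. It assumes the merge option mentioned above is `indicator`, which pandas added in 0.17.0; the helper names `semi_join_sketch` and `anti_join_sketch` are hypothetical and are not the functions from this pull request.

```python
import pandas as pd

# Hypothetical toy frames.
left = pd.DataFrame({"key": [1, 2, 3], "val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4]})


def _merge_with_indicator(x, y, on):
    # Deduplicate the keys of y so the merge never multiplies rows of x.
    keys = y[[on]].drop_duplicates()
    return x.merge(keys, on=on, how="left", indicator=True)


def semi_join_sketch(x, y, on):
    """Rows of x that have a match in y; only the columns of x are kept."""
    merged = _merge_with_indicator(x, y, on)
    return merged.loc[merged["_merge"] == "both", x.columns]


def anti_join_sketch(x, y, on):
    """Rows of x that have no match in y; only the columns of x are kept."""
    merged = _merge_with_indicator(x, y, on)
    return merged.loc[merged["_merge"] == "left_only", x.columns]


kept = semi_join_sketch(left, right, on="key")
dropped = anti_join_sketch(left, right, on="key")
```

Both helpers only filter the rows of the left frame, which is why they are called filtering joins rather than mutating joins.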

Or condition in sift()

I tried these two commands and they produce different output:

df >> sift(X.a == 1) 
df >> sift(X.a == 1 or X.b == 1) # this is equivalent to the line above

df >>sift(X.a == 1 | X.b == 1) # produces different results than the lines above and I don't know what the result represents

So is it possible to use the or condition inside sift at the moment?

opened by guocheng 5
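
A likely explanation is Python operator precedence rather than anything specific to sift. `|` binds more tightly than `==`, so `X.a == 1 | X.b == 1` is parsed as the chained comparison `X.a == (1 | X.b) == 1`. `or` cannot be overloaded at all: Python tests the truthiness of the left operand and, if it is truthy, returns it unchanged, which would explain why the second line behaved like the first. The README example parenthesizes each comparison, as in `sift((X.clarity == "I1") | (X.clarity == "IF"))`. The same rule can be seen in plain pandas, sketched here on a hypothetical toy frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 0, 0], "b": [0, 1, 0]})

# Parenthesize each comparison so that `|` combines two boolean Series.
mask = (df["a"] == 1) | (df["b"] == 1)
result = df[mask]
```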

add mutating joins

This should implement the four mutating joins (left, right, inner, outer/full). I did have to modify the @ApplyToDataframe function slightly. The functions were tested with the dplyr two-table verbs vignette (https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html) and achieved the same output. djoin was used as the generic name so as to not pollute the namespace.

opened by bleearmstrong 5
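
As a point of reference, the four mutating joins correspond to the `how` argument of `pandas.DataFrame.merge`. The sketch below uses plain pandas on hypothetical toy frames; the signature of djoin is not shown above, so it is not used here.

```python
import pandas as pd

# Hypothetical toy frames sharing a "key" column.
x = pd.DataFrame({"key": [1, 2, 3], "x_val": ["a", "b", "c"]})
y = pd.DataFrame({"key": [2, 3, 4], "y_val": ["B", "C", "D"]})

# One merged frame per join type; "outer" is dplyr's full join.
joins = {
    how: x.merge(y, on="key", how=how)
    for how in ("left", "right", "inner", "outer")
}
```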

Slowness when grouping on a large number of keys

Hi, maybe you already know about this, but it's something important to have on your radar. When grouping on a large number of keys, things can get very slow. I had to switch back to regular pandas when an operation was taking more than 10 minutes.

# Grouping variable with 5 values -> Get results immediately
diamonds >> group_by(X.cut) >> mutate(m = X.x.mean())

# Grouping variable with 273 values -> Get results after 10 seconds. 
# For larger data frames, can take more than 10 minutes
diamonds >> group_by(X.carat) >> mutate(m = X.x.mean())

# The same operation in standard pandas happens instantaneously
diamonds.groupby('carat').mean()

opened by csaid 5

Modifying dplython functions to take DataFrame arguments
I have an idea for a change and want to get feedback before making it.

In dplyr, you can call each function on the dataframe itself. So:

# with piping
df %>% mutate(foo=bar)
# same as
mutate(df, foo=bar)

Currently in dplython, the dplython functions all return other functions which are then applied to the DataFrame. If you want to replicate the above, you would have to do something that looks like this:

# with piping
df >> mutate(foo=X.bar)
# same as
mutate(foo=X.bar)(df)

My proposal is to modify the dplython functions to check the type of the first argument. If the first argument is a DataFrame (or inherits from DataFrame), then instead of returning a function, the function is applied to the dataframe. I think this will be more readable. So:

# old
mutate(foo=X.bar)(df)
# new, also works
mutate(df, foo=X.bar)

Note that this will not break the old way of doing things. I wanted to see what people think before making the change!
opened by dodger487 5
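
The proposal can be sketched as follows. This is a toy illustration under stated assumptions, not dplython's implementation: `mutate_sketch` is a hypothetical name, and plain callables stand in for X expressions.

```python
import pandas as pd


def mutate_sketch(*args, **new_columns):
    """Hypothetical verb: apply immediately if the first positional
    argument is a DataFrame, otherwise return a function to apply later."""

    def apply(df):
        # DataFrame.assign accepts callables that receive the DataFrame.
        return df.assign(**new_columns)

    if args and isinstance(args[0], pd.DataFrame):
        return apply(args[0])
    return apply


df = pd.DataFrame({"bar": [1, 2, 3]})

# New style: the DataFrame is passed as the first argument.
direct = mutate_sketch(df, foo=lambda d: d["bar"] * 2)

# Old style: the verb returns a function, which is then applied.
curried = mutate_sketch(foo=lambda d: d["bar"] * 2)(df)
```

Because the check uses isinstance, subclasses of DataFrame take the same path, and the function-returning form keeps working unchanged.
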
Python 3 support
Hi, the main improvement here is the addition of python 3 support. Apart from that:

Add travis-ci for continuous integration (the entire test suite passes on python 2.7 and 3.3/4/5).

Fix TestMutates.test_multi issue #5.
opened by Nagasaki45 5
No ordering of mutate kwargs arguments
As uncovered in https://github.com/dodger487/dplython/issues/5.

In Dplyr, the user is able to add a column in a mutate statement derived from a column that he or she just wrote. I want to make this feature available in dplython. So: diamonds_dp = self.diamonds >> mutate(foo=X.x, bar=X.foo.mean()) should be valid. foo should be the first column added, followed by bar.

This is difficult to accomplish, though, as Python throws away the order information of **kwargs. Currently, this example code would sometimes work and sometimes not work, depending on the dictionary ordering of **kwargs. There's a PEP out to accomplish this (https://www.python.org/dev/peps/pep-0468/) but it doesn't look like this feature is currently supported.

In a bad case, this will mean the user doesn't know what order to expect columns to be in. In a worse case, this will inconsistently cause errors when a user tries to create a column derived from another one.

Some potential solutions, none of which seem great:

Restrict mutate statements to one additional column per mutate (yuck)

Sort the dictionary in some way, like alphabetical

Examine each added Later to see if it derives from a column that isn't yet present in the DataFrame, add these last.

Something else?
opened by dodger487 5
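
Since this issue was written, PEP 468 has been implemented: from Python 3.6 onward, **kwargs preserves the order in which keyword arguments were written. On those versions the ordering problem can be handled by adding columns one at a time. The sketch below assumes Python 3.6+ and uses a hypothetical helper, `mutate_in_order`, with plain callables standing in for X expressions; it is not dplython's code.

```python
import pandas as pd


def mutate_in_order(df, **new_columns):
    """Add columns one at a time, in the order the keywords were written.

    Relies on keyword-argument order being preserved (Python 3.6+, PEP 468).
    """
    out = df.copy()
    for name, make_column in new_columns.items():
        # Each callable sees the columns added before it.
        out[name] = make_column(out)
    return out


df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
result = mutate_in_order(
    df,
    foo=lambda d: d["x"],
    bar=lambda d: d["foo"].mean(),  # refers to the column created just above
)
```

On Python versions before 3.6 this sketch would show the same nondeterminism described in the issue.
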
dplython cannot be imported into Python v3.6.2?

I attempted to import dplython:

import pandas
from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
    sample_frac, head, arrange, mutate, group_by, summarize, DelayFunction)

This error was displayed:

Traceback (most recent call last):
  File "C:\Users\Shane\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in
    from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2017.2\helpers\pydev_pydev_bundle\pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
ModuleNotFoundError: No module named 'dplython'

The dplython module appears to be installed, as I can see it in the list of packages in PyCharm. However, I cannot see it in the list of packages in Anaconda, which seems suspicious.

To install, I checked out the git repository, then used:

python setup.py install

Result:

E:\git\dplython>python setup.py install
running install
running bdist_egg
running egg_info
writing dplython.egg-info\PKG-INFO
writing dependency_links to dplython.egg-info\dependency_links.txt
writing requirements to dplython.egg-info\requires.txt
writing top-level names to dplython.egg-info\top_level.txt
reading manifest file 'dplython.egg-info\SOURCES.txt'
writing manifest file 'dplython.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
warning: install_lib: 'build\lib' does not exist -- no Python modules to install
creating build\bdist.win-amd64\egg
creating build\bdist.win-amd64\egg\EGG-INFO
copying dplython.egg-info\PKG-INFO -> build\bdist.win-amd64\egg\EGG-INFO
copying dplython.egg-info\SOURCES.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying dplython.egg-info\dependency_links.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying dplython.egg-info\requires.txt -> build\bdist.win-amd64\egg\EGG-INFO
copying dplython.egg-info\top_level.txt -> build\bdist.win-amd64\egg\EGG-INFO
zip_safe flag not set; analyzing archive contents...
creating 'dist\dplython-0.0.7-py3.6.egg' and adding 'build\bdist.win-amd64\egg' to it
removing 'build\bdist.win-amd64\egg' (and everything under it)
Processing dplython-0.0.7-py3.6.egg
Removing c:\users\shane\anaconda3\lib\site-packages\dplython-0.0.7-py3.6.egg
Copying dplython-0.0.7-py3.6.egg to c:\users\shane\anaconda3\lib\site-packages
dplython 0.0.7 is already the active version in easy-install.pth
Installed c:\users\shane\anaconda3\lib\site-packages\dplython-0.0.7-py3.6.egg
Processing dependencies for dplython==0.0.7
Searching for six==1.10.0
Best match: six 1.10.0
Adding six 1.10.0 to easy-install.pth file
Using c:\users\shane\anaconda3\lib\site-packages
Searching for pandas==0.20.3
Best match: pandas 0.20.3
Adding pandas 0.20.3 to easy-install.pth file
Using c:\users\shane\anaconda3\lib\site-packages
Searching for numpy==1.12.1
Best match: numpy 1.12.1
Adding numpy 1.12.1 to easy-install.pth file
Using c:\users\shane\anaconda3\lib\site-packages
Searching for pytz==2017.2
Best match: pytz 2017.2
Adding pytz 2017.2 to easy-install.pth file
Using c:\users\shane\anaconda3\lib\site-packages
Searching for python-dateutil==2.6.1
Best match: python-dateutil 2.6.1
Adding python-dateutil 2.6.1 to easy-install.pth file
Using c:\users\shane\anaconda3\lib\site-packages
Finished processing dependencies for dplython==0.0.7
E:\git\dplython>

opened by sharpe5 4
Join verb

add more testing for join verbs

This PR changes mutating joins to verbs. This fixes the bug of failing or giving incorrect results on grouped data. It preserves the grouping of the left dataframe (e.g. x >> group_by(a) >> inner_join(y) results in a dataframe grouped on a).

When I initially did the joins a while back, I decided to use strings as the arguments, mainly because we only had one manager (X). However, after playing around with it more, I think it would be okay to use the one manager, e.g. diamonds >> left_join(other, by=[X.cut, (X.clarity, X.whatever)]). (Initially, I thought it would be better to have a second manager, say Y, so we could do diamonds >> left_join(other, by=[X.cut, (X.clarity, Y.whatever)]), but I think that would be significantly more work.)

If you'd prefer to use laters, and the first form (using only X) is acceptable, I wouldn't mind redoing this to work with laters. I've already thought about how to implement using functions in your join as well (x >> left_join(y, by=[(X.x, X.a.lower())])).

opened by bleearmstrong 4
Add filter joins

Add filtering joins. While I was working on implementing spread(), I realized the functions didn't work quite properly on grouped data. As of this pull request, grouping is removed when data is joined. In some cases, this makes sense; we can think of a mutating join as creating a new dataframe, so maybe grouping should be removed. For filtering joins, maybe not. For spread and gather, maybe not. How to deal with grouping should be discussed, not just for joins but for other functions. Should that be discussed here or is there somewhere else that is more appropriate?

opened by bleearmstrong 4
Has this package been deprecated?

Coming from an R background I was thrilled to see the existence of this package, but it looks like there has been no activity in many years. Has ownership transferred? Is this package still under development? I do not want to waste my time learning a package that is now obsolete, even though it is exactly what I was dreaming about!

opened by NateMietk 1
Can't import functions

import pandas
from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
    sample_frac, head, arrange, mutate, group_by, summarize, DelayFunction)

Yields: ImportError: cannot import name 'DplyFrame'

If I remove DplyFrame, I get: ImportError: cannot import name 'X'

opened by jaybundy 1
Build broken on README.md update

I updated the README to include a video, and now the build is broken on a seemingly unrelated issue. My first guess is that something changed in pandas to make comparison between Series even more difficult, but I'm not sure yet. I will investigate but will hopefully resolve this soon. If anyone has issues with dplython breaking after the latest update please let me know.

(Quick note: After a bit of an absence due to taking on a very intense, temporary job for a few months, I'm back improving and updating dplython.)

opened by dodger487 0
