Dplython: Dplyr for Python

Welcome to Dplython: Dplyr for Python.

Dplyr is a library for the R language designed to make data analysis fast and easy. The philosophy of dplyr is to constrain data manipulation to a few simple functions that correspond to the most common tasks. This brings coding closer to the way you think about analysis, helping you work at the "speed of thought".

The goal of this project is to implement the functionality of the R package Dplyr on top of Python's pandas.

This is version 0.0.7. It's experimental and subject to change.

Introductory Video

Here is a 20-minute video explaining dplython, given at PyGotham 2016.

PyGotham Dplython video

Click the awkward picture above to see the talk! Note that sound doesn't start until about 1 minute in due to microphone issues.

Installation

To install, use pip:

pip install dplython

To get the latest development version, you can clone this repo or use the command:

pip install git+https://github.com/dodger487/dplython.git

Contributing

We welcome your feature requests, open issues, bug reports, and pull requests! Please use GitHub's interface. Also consider joining the dplython mailing list.

Example usage

import pandas
from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
    sample_frac, head, arrange, mutate, group_by, summarize, DelayFunction) 

# The example `diamonds` DataFrame is included in this package, but you can 
# cast a DataFrame to a DplyFrame in this simple way:
# diamonds = DplyFrame(pandas.read_csv('./diamonds.csv'))

# Select specific columns of the DataFrame using select, and 
#   get the first few using head
diamonds >> select(X.carat, X.cut, X.price) >> head(5)
"""
Out:
   carat        cut  price
0   0.23      Ideal    326
1   0.21    Premium    326
2   0.23       Good    327
3   0.29    Premium    334
4   0.31       Good    335
"""

# Filter out rows using sift
diamonds >> sift(X.carat > 4) >> select(X.carat, X.cut, X.depth, X.price)
"""
Out:
       carat      cut  depth  price
25998   4.01  Premium   61.0  15223
25999   4.01  Premium   62.5  15223
27130   4.13     Fair   64.8  17329
27415   5.01     Fair   65.5  18018
27630   4.50     Fair   65.8  18531
"""

# Sample with sample_n or sample_frac, sort with arrange
(diamonds >> 
  sample_n(10) >> 
  arrange(X.carat) >> 
  select(X.carat, X.cut, X.depth, X.price))
"""
Out:
       carat        cut  depth  price
37277   0.23  Very Good   61.5    484
17728   0.30  Very Good   58.8    614
33255   0.32      Ideal   61.1    825
38911   0.33      Ideal   61.6   1052
31491   0.34    Premium   60.3    765
37227   0.40    Premium   61.9    975
2578    0.81    Premium   60.8   3213
15888   1.01       Fair   64.6   6353
26594   1.74      Ideal   62.9  16316
25727   2.38    Premium   62.4  14648
"""

# You can: 
#   add columns with mutate (referencing other columns!)
#   group rows into dplyr-style groups with group_by
#   collapse rows into single rows using summarize
(diamonds >> 
  mutate(carat_bin=X.carat.round()) >> 
  group_by(X.cut, X.carat_bin) >> 
  summarize(avg_price=X.price.mean()))
"""
Out:
       avg_price  carat_bin        cut
0     863.908535          0      Ideal
1    4213.864948          1      Ideal
2   12838.984078          2      Ideal
...
27  13466.823529          3       Fair
28  15842.666667          4       Fair
29  18018.000000          5       Fair
"""

# If you have column names that don't work as attributes, you can use an 
# alternate "get item" notation with X.
diamonds["column w/ spaces"] = range(len(diamonds))
diamonds >> select(X["column w/ spaces"]) >> head()
"""
Out:
   column w/ spaces
0                 0
1                 1
2                 2
3                 3
4                 4
5                 5
6                 6
7                 7
8                 8
9                 9
"""

# It's possible to pass the entire dataframe using X._ 
diamonds >> sample_n(6) >> select(X.carat, X.price) >> X._.T
"""
Out:
         18966    19729   9445   49951    3087    33128
carat     1.16     1.52     0.9    0.3     0.74    0.31
price  7803.00  8299.00  4593.0  540.0  3315.00  816.00
"""

# To pass the DataFrame or columns into functions, apply @DelayFunction
@DelayFunction
def PairwiseGreater(series1, series2):
  index = series1.index
  newSeries = pandas.Series([max(s1, s2) for s1, s2 in zip(series1, series2)])
  newSeries.index = index
  return newSeries

diamonds >> PairwiseGreater(X.x, X.y)
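As an aside, the same pairwise maximum can be computed without a Python-level loop. A sketch using numpy directly (plain pandas, not a dplython API):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 5, 3], index=["a", "b", "c"])
s2 = pd.Series([4, 2, 3], index=["a", "b", "c"])

# Element-wise maximum, preserving the index of s1:
greater = pd.Series(np.maximum(s1.values, s2.values), index=s1.index)
```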


# Passing entire dataframe and plotting with ggplot
from ggplot import ggplot, aes, geom_point, facet_wrap
ggplot = DelayFunction(ggplot)  # Simple installation
(diamonds >> ggplot(aes(x="carat", y="price", color="cut"), data=X._) + 
  geom_point() + facet_wrap("color"))

Ggplot example 1

(diamonds >>
  sift((X.clarity == "I1") | (X.clarity == "IF")) >> 
  ggplot(aes(x="carat", y="price", color="color"), X._) + 
    geom_point() + 
    facet_wrap("clarity"))

Ggplot example 2

# Matplotlib works as well!
import pylab as pl
pl.scatter = DelayFunction(pl.scatter)
diamonds >> sample_frac(0.1) >> pl.scatter(X.carat, X.price)

MPL example 2

This is very new and I'm actively making changes. Let me know if you'd like to see a feature or think there's a better way I can do something.

Other approaches

Development of dplython began before I knew pandas-ply existed. After I found it, I chose "X" as the manager name to be consistent. Pandas-ply is a great approach and worth a look. The main contrasts between the two are:

  • dplython uses dplyr-style groups, as opposed to the SQL-style groups of pandas and pandas-ply
  • dplython maps a little more directly onto dplyr, for example having mutate instead of an expanded select
  • dplython uses operators (>>) to connect operations instead of method chaining
Comments
  • Rename dfilter to filter

    As I suggested in #22, I think we could safely "overload" the built-in filter function by testing whether it is called with a callable and an iterable in that order. Since neither DplyFrames nor DataFrames are callable, I think we'll be safe.
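A minimal sketch of that dispatch idea (the names here are hypothetical, not dplython's actual API): built-in usage is always filter(function, iterable), so two arguments with a callable first argument can be routed to the built-in, and anything else treated as the dplython verb.

```python
import builtins

def _sift_like(*preds):
    """Stand-in for dplython's sift-style verb (hypothetical, for illustration)."""
    return lambda rows: [r for r in rows if all(p(r) for p in preds)]

def filter(*args):
    # Built-in usage is filter(function, iterable). Neither DplyFrame nor
    # DataFrame is callable, so this dispatch is unambiguous.
    if len(args) == 2 and callable(args[0]):
        return builtins.filter(*args)
    return _sift_like(*args)
```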

    opened by danrobinson 8
  • Add filtering joins, tests

    semi_join() and anti_join() are now functional, and there is a fair bit of testing added for them. One possibly minor note: they do require pandas v. 0.17.0 (current on v. 0.18.1). Not sure if this is too much to ask, but they do require it (specifically, it's an option that was added to pandas.DataFrame.merge).
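For context, here is a sketch of how filtering joins can be built on merge's indicator option (the pandas 0.17.0 feature mentioned above; the frames and column names are illustrative, not dplython's implementation):

```python
import pandas as pd

x = pd.DataFrame({"k": [1, 2, 3], "v": ["a", "b", "c"]})
y = pd.DataFrame({"k": [2, 3, 4]})

# indicator=True adds a _merge column marking each row's origin.
m = x.merge(y[["k"]].drop_duplicates(), on="k", how="left", indicator=True)
semi = m[m["_merge"] == "both"][x.columns]       # semi_join: rows of x with a match in y
anti = m[m["_merge"] == "left_only"][x.columns]  # anti_join: rows of x without a match
```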

    opened by bleearmstrong 7
  • Or condition in sift()

    I tried these two commands and they produce different output:

    df >> sift(X.a == 1) 
    df >> sift(X.a == 1 or X.b == 1) # this is equivalent to the line above
    
    df >>sift(X.a == 1 | X.b == 1) # produces different results than the lines above and I don't know what the result represents
    

    So is it possible to use the or condition inside sift at the moment?
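The surprising output comes from operator precedence: `|` binds tighter than `==`, so `X.a == 1 | X.b == 1` is parsed as a chained comparison against `1 | X.b`, and `or` cannot be overloaded element-wise at all. Parenthesizing each comparison gives the intended OR, as in plain pandas (a sketch with a toy frame):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0, 1, 9]})

# `|` binds tighter than `==`, so each comparison needs parentheses:
mask = (df["a"] == 1) | (df["b"] == 1)
rows = df[mask]
# Python's `or` short-circuits on truthiness and cannot be overloaded
# element-wise, which is why `X.a == 1 or X.b == 1` misbehaves silently.
```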

    opened by guocheng 5
  • add mutating joins

    This should implement the four mutating joins (left, right, inner, outer/full). I did have to modify the @ApplyToDataframe function slightly. The functions were tested with the dplyr two-table verbs vignette (https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html) and achieved the same output. djoin was used as the generic name so as to not pollute the namespace.

    opened by bleearmstrong 5
  • Slowness when grouping on a large number of keys

    Hi, maybe you already know about this, but just something important to have on your radar. When grouping on large number of keys, things can get very slow. I had to switch back to regular pandas when an operation was taking > 10 minutes.

    # Grouping variable with 5 values -> Get results immediately
    diamonds >> group_by(X.cut) >> mutate(m = X.x.mean())
    
    # Grouping variable with 273 values -> Get results after 10 seconds. 
    # For larger data frames, can take more than 10 minutes
    diamonds >> group_by(X.carat) >> mutate(m = X.x.mean())
    
    # The same operation in standard pandas happens instantaneously
    diamonds.groupby('carat').mean()
    
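As a point of comparison, the fast path in plain pandas for this pattern is groupby(...).transform, which computes the group-wise statistic vectorized and broadcasts it back to every row (a sketch, not a dplython API):

```python
import pandas as pd

df = pd.DataFrame({"carat": [0.3, 0.3, 0.4], "x": [4.3, 4.5, 4.7]})

# Vectorized group-wise mean, broadcast back to each row of the group:
df["m"] = df.groupby("carat")["x"].transform("mean")
```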
    opened by csaid 5
  • Modifying dplython functions to take DataFrame arguments

    I have idea on a change and want to get feedback before making it.

    In dplyr, you can call each function on the dataframe itself. So:

    # with piping
    df %>% mutate(foo=bar)
    
    # same as
    mutate(df, foo=bar)
    

    Currently in dplython, the dplython functions all return other functions which are then applied to the DataFrame. If you want to replicate the above, you would have to do something that looks like this:

    # with piping
    df >> mutate(foo=X.bar)
    
    # same as
    mutate(foo=X.bar)(df)
    

    My proposal is to modify the dplython functions to check the type of the first argument. If the first argument is a DataFrame (or inherits from DataFrame), then instead of returning a function, the function is applied to the dataframe. I think this will be more readable. So:

    # old
    mutate(foo=X.bar)(df)
    
    # new, also works
    mutate(df, foo=X.bar)
    

    Note that this will not break the old way of doing things. I wanted to see what people think before making the change!
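    A sketch of the proposed dispatch (a hypothetical implementation; real dplython verbs operate on Later expressions, simplified here to plain callables):

```python
import pandas as pd

def mutate(*args, **kwargs):
    def apply(df):
        out = df.copy()
        for name, expr in kwargs.items():
            out[name] = expr(out) if callable(expr) else expr
        return out
    # If the first positional argument is a DataFrame, apply immediately;
    # otherwise return a function to be piped, as dplython does today.
    if args and isinstance(args[0], pd.DataFrame):
        return apply(args[0])
    return apply
```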

    opened by dodger487 5
  • Python 3 support

    opened by Nagasaki45 5
  • No ordering of mutate kwargs arguments

    As uncovered in https://github.com/dodger487/dplython/issues/5.

    In Dplyr, the user is able to add a column in a mutate statement derived from a column that he or she just wrote. I want to make this feature available in dplython. So: diamonds_dp = self.diamonds >> mutate(foo=X.x, bar=X.foo.mean()) should be valid. foo should be the first column added, followed by bar.

    This is difficult to accomplish, though, as Python throws away the order information of **kwargs. Currently, this example code would sometimes work and sometimes not, depending on the dictionary ordering of **kwargs. There's a PEP out to accomplish this (https://www.python.org/dev/peps/pep-0468/) but it doesn't look like this feature is currently supported.

    In a bad case, this means the user doesn't know what order to expect columns to be in. In a worse case, it will inconsistently cause errors when a user tries to create a column derived from another one.

    Some potential solutions, none of which seem great:

    • Restrict mutate statements to one additional column per mutate (yuck)
    • Sort the dictionary in some way, like alphabetical
    • Examine each added Later to see if it derives from a column that isn't yet present in the DataFrame, and add those last.
    • Something else?
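For reference, PEP 468 was ultimately accepted in Python 3.6, so on 3.6+ keyword arguments do preserve call-site order, which makes the ordered-mutate behavior straightforward there:

```python
# On Python 3.6+ (PEP 468), **kwargs preserves the call-site order:
def column_order(**kwargs):
    return list(kwargs)

order = column_order(foo=1, bar=2)  # ["foo", "bar"], deterministically
```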
    opened by dodger487 5
  • dplython cannot be imported into Python v3.6.2?

    I attempted to import dplython:

    import pandas
    from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
        sample_frac, head, arrange, mutate, group_by, summarize, DelayFunction)

    This error was displayed:

    Traceback (most recent call last):
      File "C:\Users\Shane\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "", line 1, in
        from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
      File "C:\Program Files\JetBrains\PyCharm Community Edition 2017.2\helpers\pydev\_pydev_bundle\pydev_import_hook.py", line 21, in do_import
        module = self._system_import(name, *args, **kwargs)
    ModuleNotFoundError: No module named 'dplython'

    The dplython module appears to be installed, as I can see it in the list of packages in PyCharm. However, I cannot see it in the list of packages in Anaconda, which seems suspicious.

    To install, I checked out the git repository, then ran:

    python setup.py install

    Result:

    E:\git\dplython>python setup.py install
    running install
    running bdist_egg
    running egg_info
    writing dplython.egg-info\PKG-INFO
    writing dependency_links to dplython.egg-info\dependency_links.txt
    writing requirements to dplython.egg-info\requires.txt
    writing top-level names to dplython.egg-info\top_level.txt
    reading manifest file 'dplython.egg-info\SOURCES.txt'
    writing manifest file 'dplython.egg-info\SOURCES.txt'
    installing library code to build\bdist.win-amd64\egg
    running install_lib
    warning: install_lib: 'build\lib' does not exist -- no Python modules to install
    creating build\bdist.win-amd64\egg
    creating build\bdist.win-amd64\egg\EGG-INFO
    copying dplython.egg-info\PKG-INFO -> build\bdist.win-amd64\egg\EGG-INFO
    copying dplython.egg-info\SOURCES.txt -> build\bdist.win-amd64\egg\EGG-INFO
    copying dplython.egg-info\dependency_links.txt -> build\bdist.win-amd64\egg\EGG-INFO
    copying dplython.egg-info\requires.txt -> build\bdist.win-amd64\egg\EGG-INFO
    copying dplython.egg-info\top_level.txt -> build\bdist.win-amd64\egg\EGG-INFO
    zip_safe flag not set; analyzing archive contents...
    creating 'dist\dplython-0.0.7-py3.6.egg' and adding 'build\bdist.win-amd64\egg' to it
    removing 'build\bdist.win-amd64\egg' (and everything under it)
    Processing dplython-0.0.7-py3.6.egg
    Removing c:\users\shane\anaconda3\lib\site-packages\dplython-0.0.7-py3.6.egg
    Copying dplython-0.0.7-py3.6.egg to c:\users\shane\anaconda3\lib\site-packages
    dplython 0.0.7 is already the active version in easy-install.pth
    Installed c:\users\shane\anaconda3\lib\site-packages\dplython-0.0.7-py3.6.egg
    Processing dependencies for dplython==0.0.7
    Searching for six==1.10.0
    Best match: six 1.10.0
    Adding six 1.10.0 to easy-install.pth file
    Using c:\users\shane\anaconda3\lib\site-packages
    Searching for pandas==0.20.3
    Best match: pandas 0.20.3
    Adding pandas 0.20.3 to easy-install.pth file
    Using c:\users\shane\anaconda3\lib\site-packages
    Searching for numpy==1.12.1
    Best match: numpy 1.12.1
    Adding numpy 1.12.1 to easy-install.pth file
    Using c:\users\shane\anaconda3\lib\site-packages
    Searching for pytz==2017.2
    Best match: pytz 2017.2
    Adding pytz 2017.2 to easy-install.pth file
    Using c:\users\shane\anaconda3\lib\site-packages
    Searching for python-dateutil==2.6.1
    Best match: python-dateutil 2.6.1
    Adding python-dateutil 2.6.1 to easy-install.pth file
    Using c:\users\shane\anaconda3\lib\site-packages
    Finished processing dependencies for dplython==0.0.7
    E:\git\dplython>

    opened by sharpe5 4
  • Join verb

    add more testing for join verbs

    This PR changes mutating joins to verbs. This fixes the bug of failing or giving incorrect results on grouped data, and preserves the grouping of the left dataframe (e.g. x >> group_by(X.a) >> inner_join(y) results in a dataframe grouped on a).

    When I initially did the joins a while back, I decided to use strings as the arguments, mainly because we only had one manager (X). However, after playing around with it more, I think it would be okay to use the one manager, e.g. diamonds >> left_join(other, by=[X.cut, (X.clarity, X.whatever)]). (Initially, I thought it would be better to have a second manager, say Y, so we could do diamonds >> left_join(other, by=[X.cut, (X.clarity, Y.whatever)]), but I think that would be significantly more work.)

    If you'd prefer to use Laters, and the first form (using only X) is acceptable, I wouldn't mind redoing this to work with Laters. I've already thought about how to implement using functions in your join as well (x >> left_join(y, by=[(X.x, X.a.lower())])).

    opened by bleearmstrong 4
  • Add filter joins

    Add filtering joins. While I was working on implementing spread(), I realized the functions didn't work quite properly on grouped data. As of this pull request, grouping is removed when data is joined. In some cases, this makes sense; we can think of a mutating join as creating a new dataframe, so maybe grouping should be removed. For filtering joins, maybe not. For spread and gather, maybe not. How to deal with grouping should be discussed, not just for joins but for other functions. Should that be discussed here or is there somewhere else that is more appropriate?

    opened by bleearmstrong 4
  • Has this package been deprecated?

    Coming from an R background I was thrilled to see the existence of this package, but it looks like there has been no activity in many years. Has ownership transferred? Is this package still under development? I do not want to waste my time learning a package that is now obsolete, even though it is exactly what I was dreaming about!

    opened by NateMietk 1
  • Can't import functions

    import pandas
    from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
        sample_frac, head, arrange, mutate, group_by, summarize, DelayFunction)

    Yields: ImportError: cannot import name 'DplyFrame'

    If I remove DplyFrame, I get: ImportError: cannot import name 'X'

    opened by jaybundy 1
  • Build broken on README.md update

    I updated the README to include a video, and now the build is broken on a seemingly unrelated issue. My first guess is that something changed in pandas to make comparison between Series even more difficult, but I'm not sure yet. I will investigate but will hopefully resolve this soon. If anyone has issues with dplython breaking after the latest update please let me know.

    (Quick note: after a bit of an absence due to taking on a very intense, temporary job for a few months, I'm back to improving and updating dplython.)

    opened by dodger487 0
Owner
Chris Riederer