Tidy interface to polars

Overview

tidypolars

PyPI Latest Release

tidypolars is a data frame library built on top of the blazingly fast polars library that gives access to methods and functions familiar to R tidyverse users.

Installation

$ pip3 install tidypolars

General syntax

tidypolars methods are designed to work like tidyverse functions:

import tidypolars as tp
from tidypolars import col, desc

df = tp.Tibble(x = range(3), y = range(3, 6), z = ['a', 'a', 'b'])

(
    df
    .select('x', 'y', 'z')
    .filter(col('x') < 4, col('y') > 1)
    .arrange(desc('z'), 'x')
    .mutate(double_x = col('x') * 2,
            x_plus_y = col('x') + col('y'))
)
┌─────┬─────┬─────┬──────────┬──────────┐
│ xyzdouble_xx_plus_y │
│ ---------------      │
│ i64i64stri64i64      │
╞═════╪═════╪═════╪══════════╪══════════╡
│ 25b47        │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 03a03        │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 14a25        │
└─────┴─────┴─────┴──────────┴──────────┘

The key difference from R is that column names must be wrapped in col() in the following methods:

  • .filter()
  • .mutate()
  • .summarize()

The general idea - when doing calculations on a column you need to wrap it in col(). When doing simple column selections (like in .select()) you can pass the column names as strings.

Group by syntax

Methods operate by group by calling the by arg.

  • A single column can be passed with by = 'z'
  • Multiple columns can be passed with by = ['y', 'z']
(
    df
    .summarize(avg_x = tp.mean(col('x')),
               by = 'z')
)
┌─────┬───────┐
│ zavg_x │
│ ------   │
│ strf64   │
╞═════╪═══════╡
│ a0.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b2     │
└─────┴───────┘

Selecting/dropping columns

tidyselect functions can be mixed with normal selection when selecting columns:

df = tp.Tibble(x1 = range(3), x2 = range(3), y = range(3), z = range(3))

df.select(tp.starts_with('x'), 'z')
┌─────┬─────┬─────┐
│ x1x2z   │
│ --------- │
│ i64i64i64 │
╞═════╪═════╪═════╡
│ 000   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 111   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 222   │
└─────┴─────┴─────┘

To drop columns use the .drop() method:

df.drop(tp.starts_with('x'), 'z')
┌─────┐
│ y   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
├╌╌╌╌╌┤
│ 1   │
├╌╌╌╌╌┤
│ 2   │
└─────┘

Converting to/from pandas data frames

If you need to use a package that requires pandas data frames, you can convert from a tidypolars Tibble to a pandas DataFrame.

To do this you'll first need to install pyarrow:

pip3 install pyarrow

To convert to a pandas DataFrame:

df = df.to_pandas()

To convert from a pandas DataFrame to a tidypolars Tibble:

df = tp.from_pandas(df)

Speed Comparisons

A few notes:

  • Comparing times from separate functions typically isn't very useful. For example - the .summarize() tests were performed on a different dataset from .pivot_wider().
  • All tests are run 5 times. The times shown are the median of those 5 runs.
  • All timings are in milliseconds.
  • All tests can be found in the source code here.
  • FAQ - Why are some tidypolars functions faster than their polars counterpart?
    • Short answer - they're not! After all they're just using polars in the background.
    • Long answer - All python functions have some slight natural variation in their execution time. By chance the tidypolars runs were slightly shorter on those specific functions on this iteration of the tests. However one goal of these tests is to show that the "time cost" of translating syntax to polars is very negligible to the user (especially on medium-to-large datasets).
  • Lastly I'd like to mention that these tests were not rigorously created to cover all angles equally. They are just meant to be used as general insight into the performance of these packages.
┌─────────────┬────────────┬─────────┬──────────┐
│ func_testedtidypolarspolarspandas   │
│ ------------      │
│ strf64f64f64      │
╞═════════════╪════════════╪═════════╪══════════╡
│ arrange190.345169.478500.112  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ case_when87.34879.427152.623  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ distinct16.88816.28228.725   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ filter29.78929.91231.397  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ full_join236.784231.2831042.689 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ inner_join49.7147.563630.98   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ left_join113.7921151100.607 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ mutate7.9797.408117.283  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ pivot_wider42.76439.93949.048   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ summarize59.43458.011453.707  │
└─────────────┴────────────┴─────────┴──────────┘

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

Comments
  • `drop` with error `RuntimeError: Any(NotFound(

    `drop` with error `RuntimeError: Any(NotFound("^x.*$"))`

    import sys
    import tidypolars as tp
    sys.version
    # '3.9.7 (default, Sep 16 2021, 13:09:58) \n[GCC 7.5.0]'
    tp.__version__
    # '0.2.1'
    ## error
    df = tp.Tibble(x1 = range(3), x2 = range(3), y=range(3), z = range(3))
    df.drop([tp.starts_with('x'), 'z'])
    df.drop()
    `
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    /tmp/ipykernel_12815/866601321.py in <module>
    ----> 1 df.drop(tp.starts_with('x'))
    
    ~/miniconda3/envs/py39/lib/python3.9/site-packages/polars/eager/frame.py in drop(self, name)
       2253             return df
       2254 
    -> 2255         return wrap_df(self._df.drop(name))
       2256 
       2257     def drop_in_place(self, name: str) -> "pl.Series":
    
    RuntimeError: Any(NotFound("^x.*$"))
    `
    
    
    
    opened by ztsweet 9
  • `AttributeError: arrange not found`

    `AttributeError: arrange not found`

    import tidypolars as tp
    from tidypolars import col, desc
    import sys
    sys.version
    # '3.10.0 | packaged by conda-forge | (default, Oct 12 2021, 21:24:52) [GCC 9.4.0]'
    tp.__version__
    # '0.2.1'
    df = tp.Tibble({'x': ['a', 'a', 'b'], 'y': range(3)})
    df.arrange('x', 'y')
    `
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    ~/miniconda3/envs/py310/lib/python3.10/site-packages/polars/eager/frame.py in __getattr__(self, item)
        882         try:
    --> 883             return pl.eager.series.wrap_s(self._df.column(item))
        884         except RuntimeError:
    
    RuntimeError: Any(NotFound("arrange"))
    
    During handling of the above exception, another exception occurred:
    
    AttributeError                            Traceback (most recent call last)
    /tmp/ipykernel_21110/1194586334.py in <module>
    ----> 1 df.arrange('x', 'y')
    
    ~/miniconda3/envs/py310/lib/python3.10/site-packages/polars/eager/frame.py in __getattr__(self, item)
        883             return pl.eager.series.wrap_s(self._df.column(item))
        884         except RuntimeError:
    --> 885             raise AttributeError(f"{item} not found")
        886 
        887     def __iter__(self) -> Iterator[Any]:
    
    AttributeError: arrange not found
    `
    
    bug 
    opened by ztsweet 6
  • Missing attributes when chaining

    Missing attributes when chaining

    Hi Mark, thanks for putting this package together. It looks very cool.

    I'm having a tough time getting the motivating examples to work, though. For example, the following triggers an error:

    import tidypolars as tp
    from tidypolars import col, desc
    
    df = tp.Tibble(x = range(3), y = range(3, 6), z = ['a', 'a', 'b'])
    
    df.filter(col('x') < 2).arrange(desc('z'), 'x')
    
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    Untitled-1 in <cell line: 1>()
    ----> <a href='untitled:Untitled-1?line=7'>8</a> df.filter(col('x') < 2).arrange(desc('z'), 'x')
    
    AttributeError: 'DataFrame' object has no attribute 'arrange'
    

    What's genuinely odd about the above is that arrange works on its own and when it comes before filter.

    # All of these work as expected
    df.filter(col('x') < 2)
    df.arrange(desc('z'), 'x')
    df.arrange(desc('z'), 'x').filter(col('x') < 2)
    

    A seemingly related issue is that I can't pass two arguments to filter when it follows arrange (or other verbs likeselect for that matter).

    df.filter(col('x') < 2, col('y')>3) ## works
    df.arrange(desc('z'), 'x').filter(col('x') < 2, col('y')>3) ## errors with "filter() takes 2 positional arguments but 3 were given"
    

    Any ideas?

    I'm on Python 3.9.2 installed via Homebrew on a 2019 Macbook (so regular Intel chip) and running the latest version of tidypolars (0.2.15).

    bug 
    opened by grantmcdermott 5
  • Is it possible to have dplyr's `group_by` + `mutate` behavior?

    Is it possible to have dplyr's `group_by` + `mutate` behavior?

    First of all, I really like this package and I've started to use it a lot in my work. As a Pythonista whose first language is R, I really enjoy tidypolars.

    In R, we can do something like the following

    library(dplyr)
    data(iris)
    
    iris %>%
      group_by(Species) %>%
      mutate(
        result = Petal.Width - mean(Petal.Width)
      )
    

    Since we have a group_by(Species) call, dplyr will subtract the mean that corresponds to each group in the mutate() operation (not the mean across all observations from all species).

    As far as I understand, this is still not possible with tidypolars since we don't have a group_by function that behaves in a similar way to the one in dplyr. So my questions are

    • Is it possible to have this behavior in tidypolars now?
      • If yes, how?
      • If not, is it going to be possible? I could volunteer to try to implement it. I'm not familiar with the existing codebase, but I suspect that Python eager evaluation of function arguments is what makes it harder to have such a feature?

    Again, thanks for the fantastic library!

    opened by tomicapretto 5
  • idiomatic way to add list as column

    idiomatic way to add list as column

    Forgive what's probably a dumb question, but is there a way to get .mutate to return the same object as the .bind_cols line?

    import tidypolars as tp
    
    tb = tp.Tibble({'a': [1, 2, 3]})
    x = [4, 5, 6]
    # gives desired output 
    tb.bind_cols(tp.Tibble({'b': x}))
    # gives error: ValueError: could not convert value '[4, 5, 6]' as a Literal
    tb.mutate(b = x)
    
    feature 
    opened by eutwt 4
  • purrr functions!?

    purrr functions!?

    I noticed that in tidytable, you have purrr functions like map.(), but not in tidypolars.

    Using for loops + lambda functions are just not desirable for collaborative coding / code readability/comprehension. In Python, even if there is a bit of sacrifice in performance, if it allows better code readability, it would be really nice to have.

    Would something like map.() be in the scope of this repo?

    feature 
    opened by exsell-jc 3
  • ```ValueError``` with ```filter```

    ```ValueError``` with ```filter```

    When I chain filter expressions with | (error message said to use | and not or), I receive a ValueError message:

    tp.Tibble(chr_col = tp.Series(['this is a test 1', 'this is a test 2', 'this is a test 3']))\
        .filter(col('chr_col') == 'this is a test 1' |
                col('chr_col') == 'this is a test 2')
    
    ValueError: Since Expr are lazy, the truthiness of an Expr is ambiguous. 
    Hint: use '&' or '|' to chain Expr together, not and/or.
    

    It works fine if I do one or the other:

    tp.Tibble(chr_col = tp.Series(['this is a test 1', 'this is a test 2', 'this is a test 3']))\
        .filter(# col('chr_col') == 'this is a test 1' |
                col('chr_col') == 'this is a test 2')
    
    # chr_col
    #   --
    #   str
    # "this is a test 2"
    
    tp.Tibble(chr_col = tp.Series(['this is a test 1', 'this is a test 2', 'this is a test 3']))\
        .filter(col('chr_col') == 'this is a test 1' # |
                # col('chr_col') == 'this is a test 2'
                )
    
    # chr_col
    #   --
    #   str
    # "this is a test 1"
    
    opened by alexandro-ag 3
  • ```as_date``` with ```RuntimeError: please define a fmt```

    ```as_date``` with ```RuntimeError: please define a fmt```

    Good afternoon,

    I think I found an issue with the as_date method. In the example per the documentation, the following succeeds:

    import tidypolars as tp
    from tidypolars import col
    
    date_df = tp.Tibble(date = ['2021-12-31']) # Year-Month-Day (%Y-%m-%d)
    date_df.mutate(date_parsed = tp.as_date(col('date'))) # Success
    

    However when parsing different formats (using the fmt argument), the date fails to parse:

    import tidypolars as tp
    from tidypolars import col
    
    date_df = tp.Tibble(date = ['12/31/2021']) # Month/Day/Year (%m/%d/%Y)
    date_df.mutate(date_parsed = tp.as_date(col('date'), fmt='%m/%d/%Y')) # RuntimeError
    

    I also extend my appreciation for all the work on this package. I've been searching for a tidyverse implementation in python and this one knocks my expectations out of the park. Thank you.

    opened by alexandro-ag 3
  • Revisit `.rename()` syntax

    Revisit `.rename()` syntax

    Should the syntax be the same as pl.DataFrame.rename? Currently polars mimics pandas syntax. Or should it be something that attempts to mimic tidyverse syntax?

    Note: polars also has a .rename_col() with syntax df.rename_col('old', 'new').

    opened by markfairbanks 3
  • Compatibility with polars v0.14.0

    Compatibility with polars v0.14.0

    PR that caused the break: https://github.com/pola-rs/polars/pull/4309

    Old behavior that tidypolars relied on: https://github.com/pola-rs/polars/pull/2862

    feature 
    opened by markfairbanks 2
  • Basics: tp.read_csv(), df.drop(x1, x2, x3, ...), and df.colnames?

    Basics: tp.read_csv(), df.drop(x1, x2, x3, ...), and df.colnames?

    Really new to the library, but looking at the documentation did not really help with understanding.

    Problem 1

    import polars as pl
    import tidypolars as tp
    import csv
    import requests
    
    url = f'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-05-24/Scrumqueens-data-2022-05-23.csv'
    
    df = tp.read_csv(file = url) # does not work
    df = pl.read_csv(file = url) # works??
    

    Problem 2

    df = df.drop('...1', 'Notes') # does not work
    df = df.drop('...1') # works separately
    df = df.drop('Notes') # works separately
    

    Problem 3

    df.colnames
    df.names
    df.colnames()
    df.names()
    # None of these work
    

    What am I missing, exactly?

    opened by exsell-jc 2
  • plans for adding type hints

    plans for adding type hints

    Hi, it seems that the codebase is not annotated making the discoverability of methods difficult and static code analysis not working. Any plans on adding type hints?

    feature 
    opened by mr-majkel 1
  • `write_csv()` returns `'super' object has no attribute 'to_csv'`

    `write_csv()` returns `'super' object has no attribute 'to_csv'`

    Hi, There seems to be a problem with write_csv(). I can import tidypolars and the data just fine:

    import tidypolars as tp
    
    rents = tp.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-05/rent.csv")
    

    But when I try to export the data frame as a csv file:

    rents.write_csv("rents.csv")
    

    I get an error stating 'super' object has no attribute 'to_csv'.

    The data come from the Tidytuesday repo. Python version is 3.10.8 and tidypolars is 0.2.19. I'm on macOS 13.

    bug 
    opened by alesvomacka 1
  • Calculating time

    Calculating time

    In R with lubridate, it would look like this:

    one_year_before = some_date - years(1)
    one_year_before = some_date - months(12)
    

    But in tidypolars functions list, there doesn't seem to be a years or months function: https://tidypolars.readthedocs.io/en/latest/reference.html

    feature 
    opened by exsell-jc 4
Releases(v0.2.19)
ROS-UGV-Control-Interface - Control interface which can be used in any UGV

ROS-UGV-Control-Interface Cam Closed: Cam Opened:

Ahmet Fatih Akcan 1 Nov 4, 2022
Unified Interface for Constructing and Managing Workflows on different workflow engines, such as Argo Workflows, Tekton Pipelines, and Apache Airflow.

Couler What is Couler? Couler aims to provide a unified interface for constructing and managing workflows on different workflow engines, such as Argo

Couler Project 781 Jan 3, 2023
Web-interface + rest API for classification and regression (https://jeff1evesque.github.io/machine-learning.docs)

Machine Learning This project provides a web-interface, as well as a programmatic-api for various machine learning algorithms. Supported algorithms: S

Jeff Levesque 252 Dec 11, 2022
High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

Chao Ma 3k Jan 3, 2023
A Neural Net Training Interface on TensorFlow, with focus on speed + flexibility

Tensorpack is a neural network training interface based on TensorFlow. Features: It's Yet Another TF high-level API, with speed, and flexibility built

Tensorpack 6.2k Jan 1, 2023
High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

Chao Ma 2.8k Feb 12, 2021
Elegy is a framework-agnostic Trainer interface for the Jax ecosystem.

Elegy Elegy is a framework-agnostic Trainer interface for the Jax ecosystem. Main Features Easy-to-use: Elegy provides a Keras-like high-level API tha

null 435 Dec 30, 2022
A Neural Net Training Interface on TensorFlow, with focus on speed + flexibility

Tensorpack is a neural network training interface based on TensorFlow. Features: It's Yet Another TF high-level API, with speed, and flexibility built

Tensorpack 6.2k Jan 9, 2023
PyStan, a Python interface to Stan, a platform for statistical modeling. Documentation: https://pystan.readthedocs.io

PyStan NOTE: This documentation describes a BETA release of PyStan 3. PyStan is a Python interface to Stan, a package for Bayesian inference. Stan® is

Stan 229 Dec 29, 2022
Sequential model-based optimization with a `scipy.optimize` interface

Scikit-Optimize Scikit-Optimize, or skopt, is a simple and efficient library to minimize (very) expensive and noisy black-box functions. It implements

Scikit-Optimize 2.5k Jan 4, 2023
Python interface for SmartRF Sniffer 2 Firmware

#TI SmartRF Packet Sniffer 2 Python Interface TI Makes available a nice packet sniffer firmware, which interfaces to Wireshark. You can see this proje

Colin O'Flynn 3 May 18, 2021
Python interface for the DIGIT tactile sensor

DIGIT-INTERFACE Python interface for the DIGIT tactile sensor. For updates and discussions please join the #DIGIT channel at the www.touch-sensing.org

Facebook Research 35 Dec 22, 2022
Unofficial TensorFlow implementation of Protein Interface Prediction using Graph Convolutional Networks.

[TensorFlow] Protein Interface Prediction using Graph Convolutional Networks Unofficial TensorFlow implementation of Protein Interface Prediction usin

YeongHyeon Park 9 Oct 25, 2022
Selene is a Python library and command line interface for training deep neural networks from biological sequence data such as genomes.

Selene is a Python library and command line interface for training deep neural networks from biological sequence data such as genomes.

Troyanskaya Laboratory 323 Jan 1, 2023
Lux AI environment interface for RLlib multi-agents

Lux AI interface to RLlib MultiAgentsEnv For Lux AI Season 1 Kaggle competition. LuxAI repo RLlib-multiagents docs Kaggle environments repo Please let

Jaime 12 Nov 7, 2022
A geometric deep learning pipeline for predicting protein interface contacts.

A geometric deep learning pipeline for predicting protein interface contacts.

null 44 Dec 30, 2022
This package proposes simplified exporting pytorch models to ONNX and TensorRT, and also gives some base interface for model inference.

PyTorch Infer Utils This package proposes simplified exporting pytorch models to ONNX and TensorRT, and also gives some base interface for model infer

Alex Gorodnitskiy 11 Mar 20, 2022
YOLOv5 detection interface - PyQt5 implementation

所有代码已上传,直接clone后,运行yolo_win.py即可开启界面。 2021/9/29:加入置信度选择 界面是在ultralytics的yolov5基础上建立的,界面使用pyqt5实现,内容较简单,娱乐而已。 功能: 模型选择 本地文件选择(视频图片均可) 开关摄像头

null 487 Dec 27, 2022
A naive ROS interface for visualDet3D.

YOLO3D ROS Node This repo contains a Monocular 3D detection Ros node. Base on https://github.com/Owen-Liuyuxuan/visualDet3D All parameters are exposed

Yuxuan Liu 19 Oct 8, 2022