Data pipelines built with polars

Overview

valves

Warning: this project is very much a work in progress.

Valves is a collection of functions for your data .pipe()-lines.

This project aims to host a few performant implementations of functions that are common in industry. This gives us an opportunity to share sensible implementations, but it also allows us to compare performance across libraries. For now the project mainly targets polars, pandas and dask.
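As a sketch of what such a `.pipe()`-line function could look like (`remove_outliers` is a hypothetical example for illustration, not an existing valves function):

```python
import pandas as pd

def remove_outliers(df: pd.DataFrame, col: str,
                    lower: float = 0.05, upper: float = 0.95) -> pd.DataFrame:
    """Keep only rows where `col` falls between the given quantiles."""
    lo, hi = df[col].quantile([lower, upper])
    return df[df[col].between(lo, hi)]

df = pd.DataFrame({"value": [1, 2, 3, 4, 100]})
clean = df.pipe(remove_outliers, col="value")  # the extreme value 100 is dropped
```

The point of the `.pipe()` style is that each step stays a plain, testable function while the pipeline reads top to bottom.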

Comments
  • Switch subtitle of project

    Currently, the subtitle of the project lists "data pipelines built with polars".

    Considering that we don't just support polars here, might it be better to call it "general functions for your data .pipe()-lines."?

    opened by koaning 1
  • Group-Based Sampling

    If you're going to take a random sample of your data, you don't always want to uniformly sample the rows. Instead, you may want to uniformly sample users and keep all of their rows. That way, you'll have all interactions/sessions of each sampled user in your subset.

    opened by koaning 0
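A minimal pandas sketch of the idea (`sample_groups` is a hypothetical name, not an existing valves function): sample the group keys uniformly, then keep every row belonging to the sampled keys.

```python
import pandas as pd

def sample_groups(df: pd.DataFrame, group_col: str,
                  n_groups: int, seed: int = 42) -> pd.DataFrame:
    """Uniformly sample `n_groups` distinct values of `group_col`, keeping all their rows."""
    keys = pd.Series(df[group_col].unique()).sample(n_groups, random_state=seed)
    return df[df[group_col].isin(set(keys))]

df = pd.DataFrame({"user": ["a", "a", "b", "c", "c", "c"], "event": range(6)})
subset = df.pipe(sample_groups, group_col="user", n_groups=2)
```

Every sampled user keeps all of their rows, so sessions stay intact in the subset.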
  • Exponentially weighted functions

    Someone asked if we could support this: https://pandas.pydata.org/pandas-docs/stable/user_guide/window.html#window-exponentially-weighted

    I haven't looked at it much, but it seems like this should be possible with some cumulative expression kung fu.

    opened by ritchie46 1
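For reference, pandas already exposes this via `.ewm()`, and the `adjust=False` variant is exactly the cumulative recurrence y_t = α·x_t + (1 − α)·y_{t−1} — which hints at why a cumulative-expression implementation should be possible. A sketch checking the two against each other (not the proposed implementation):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
alpha = 0.5

# Built-in exponentially weighted mean (recursive form).
ewm = s.ewm(alpha=alpha, adjust=False).mean()

# Hand-rolled cumulative recurrence: y_t = alpha * x_t + (1 - alpha) * y_{t-1}
manual, prev = [], None
for x in s:
    prev = x if prev is None else alpha * x + (1 - alpha) * prev
    manual.append(prev)
```

Both produce [1.0, 1.5, 2.25, 3.125] for this series.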
  • user_item and item_item recommender tables

    Given a log of weighted user-item interactions, can we generate an item-item recommendation table and a user-item recommendation table?

    Kind of! We can calculate p(item_a | item_b) and p(item_a), which can be reweighted into a table with recommendations. We can also do something similar for users. After all, a user who interacted with items a, b and c will have a score for item x defined via:

    p(item_x | user) = p(item_x | item_a, item_b, item_c)
                     ∝ p(item_x | item_a) p(item_x | item_b) p(item_x | item_c)
    
    opened by koaning 4
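One rough way to get the p(item_a | item_b) table in pandas is a self-join on the user column: items that share a user co-occur, and dividing the co-occurrence count by the count of item_b gives the conditional probability. A toy sketch (unweighted counts stand in for the weighted interactions; not the final design):

```python
import pandas as pd

log = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u3"],
    "item": ["a", "b", "a", "b", "a"],
})

# Self-join on user to pair up items seen together, dropping self-pairs.
pairs = log.merge(log, on="user", suffixes=("_a", "_b"))
pairs = pairs[pairs["item_a"] != pairs["item_b"]]

co = pairs.groupby(["item_a", "item_b"]).size().rename("n_ab").reset_index()
counts = log.groupby("item").size().rename("n_b").reset_index()

# p(item_a | item_b) = count(a, b) / count(b)
tab = co.merge(counts, left_on="item_b", right_on="item")
tab["p"] = tab["n_ab"] / tab["n_b"]
```

Here every user who saw b also saw a, so p(a | b) = 1.0, while p(b | a) = 2/3. The user-item table would then multiply these conditionals per the proportionality above.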
  • Benchmarks

    As we compare different tools here, it would be cool to run benchmarks from this repo.

    Maybe in CI, and later maybe even a dedicated runner.

    These could then be shown on the website. I am already assuming here that polars does great. :smile:

    opened by ritchie46 7
  • Docs, any preference?

    @ritchie46 Do we have any preference for documentation? I could go for something like mkdocs, but I figured I'd check in first because it wouldn't fit the current documentation style.

    opened by koaning 1
Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set.

Tuplex 791 Jan 4, 2023
Streamz helps you build pipelines to manage continuous streams of data

Streamz helps you build pipelines to manage continuous streams of data. It is simple to use in simple cases, but also supports complex pipelines that involve branching, joining, flow control, feedback, back pressure, and so on.

Python Streamz 1.1k Dec 28, 2022
This tool parses log data and allows you to define analysis pipelines for anomaly detection.

logdata-anomaly-miner This tool parses log data and allows you to define analysis pipelines for anomaly detection. It was designed to run the analysis wit

AECID 32 Nov 27, 2022
Building house price data pipelines with Apache Beam and Spark on GCP

This project contains the process from building a web crawler to extract raw house price data to creating ETL pipelines using Google Cloud Platform services.

null 1 Nov 22, 2021
Python library for creating data pipelines with chain functional programming

PyFunctional Features PyFunctional makes creating data pipelines easy by using chained functional operators. Here are a few examples of what it can do

Pedro Rodriguez 2.1k Jan 5, 2023
A simple way to build declarative and distributed data pipelines with Python

unipipeline simple way to build the declarative and distributed data pipelines. Why you should use it Declarative strict config Scaffolding Fully type

aliaksandr-master 0 Jan 26, 2022
PipeChain is a utility library for creating functional pipelines.

PipeChain Motivation PipeChain is a utility library for creating functional pipelines. Let's start with a motivating example. We have a list of Austra

Michael Milton 2 Aug 7, 2022
Tablexplore is an application for data analysis and plotting built in Python using the PySide2/Qt toolkit.

Tablexplore is an application for data analysis and plotting built in Python using the PySide2/Qt toolkit.

Damien Farrell 81 Dec 26, 2022
Helper tools to construct probability distributions built from expert elicited data for use in monte carlo simulations.

Elicited Helper tools to construct probability distributions built from expert elicited data for use in monte carlo simulations. Credit to Brett Hoove

Ryan McGeehan 3 Nov 4, 2022
Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen 3.7k Jan 3, 2023
Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.

Data lineage made simple, reliable, and automated. Effortlessly track the flow of data, understand dependencies and analyze impact. Features Visualiza

null 898 Jan 9, 2023
TheMachineScraper 🐱‍👤 is an Information Grabber built for Machine Analysis

TheMachineScraper 🐱‍👤 is a tool made purely for analysing machine data for any reason.

doop 5 Dec 1, 2022
🧪 Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

The purpose of the panel-chemistry project is to make it really easy for you to do DATA ANALYSIS and build powerful DATA AND VIZ APPLICATIONS within the domain of Chemistry using Python and HoloViz Panel.

Marc Skov Madsen 97 Dec 8, 2022
fds is a tool for Data Scientists made by DAGsHub to version control data and code at once.

Fast Data Science, AKA fds, is a CLI for Data Scientists to version control data and code at once, by conveniently wrapping git and dvc

DAGsHub 359 Dec 22, 2022
A data parser for the internal syncing data format used by Fog of World.

A data parser for the internal syncing data format used by Fog of World. The parser is not designed to be a well-coded library with good performance, it is more like a demo for showing the data structure.

Zed(Zijun) Chen 40 Dec 12, 2022
Fancy data functions that will make your life as a data scientist easier.

WhiteBox Utilities Toolkit: Tools to make your life easier Fancy data functions that will make your life as a data scientist easier. Installing To ins

WhiteBox 3 Oct 3, 2022
A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Processing NYC Taxi Data using PySpark ETL pipeline Description This is a project to extract, transform, and load a large amount of data from NYC Taxi

Unnikrishnan 2 Dec 12, 2021
Created a COVID data pipeline using PySpark and MySQL that collects a data stream from an API, does some processing, and stores it into a MySQL database.

Created a COVID data pipeline using PySpark and MySQL that collects a data stream from an API, does some processing, and stores it into a MySQL database.

null 2 Nov 20, 2021