
Pypeln

Pypeln (pronounced as "pypeline") is a simple yet powerful Python library for creating concurrent data pipelines.

Main Features

  • Simple: Pypeln was designed to solve medium data tasks that require parallelism and concurrency where using frameworks like Spark or Dask feels exaggerated or unnatural.
  • Easy-to-use: Pypeln exposes a familiar functional API compatible with regular Python code.
  • Flexible: Pypeln enables you to build pipelines using Processes, Threads and asyncio.Tasks via the exact same API.
  • Fine-grained Control: Pypeln allows you to have control over the memory and CPU resources used at each stage of your pipelines.

For more information take a look at the Documentation.

Installation

Install Pypeln using pip:

pip install pypeln

Basic Usage

With Pypeln you can easily create multi-stage data pipelines using three types of workers:

Processes

You can create a pipeline based on multiprocessing.Process workers by using the process module:

import pypeln as pl
import time
from random import random

def slow_add1(x):
    time.sleep(random()) # <= some slow computation
    return x + 1

def slow_gt3(x):
    time.sleep(random()) # <= some slow computation
    return x > 3

data = range(10) # [0, 1, 2, ..., 9] 

stage = pl.process.map(slow_add1, data, workers=3, maxsize=4)
stage = pl.process.filter(slow_gt3, stage, workers=2)

data = list(stage) # e.g. [5, 6, 9, 4, 8, 10, 7]

At each stage you can specify the number of workers. The maxsize parameter limits the maximum number of elements that the stage can hold simultaneously.
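
Since workers run concurrently, the output order is not guaranteed, as the example output comment above shows. Every module also provides an ordered function (described in the Sync section below and in the 0.3.0 release notes) that re-emits elements in the order of the source iterable. A minimal sketch of using it as an extra stage, assuming pl.process.ordered simply takes a stage:

import pypeln as pl
import time
from random import random

def slow_add1(x):
    time.sleep(random()) # <= some slow computation
    return x + 1

data = range(10) # [0, 1, 2, ..., 9]

stage = pl.process.map(slow_add1, data, workers=3, maxsize=4)
stage = pl.process.ordered(stage) # re-emit elements in source order

data = list(stage) # [1, 2, 3, ..., 10]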

Threads

You can create a pipeline based on threading.Thread workers by using the thread module:

import pypeln as pl
import time
from random import random

def slow_add1(x):
    time.sleep(random()) # <= some slow computation
    return x + 1

def slow_gt3(x):
    time.sleep(random()) # <= some slow computation
    return x > 3

data = range(10) # [0, 1, 2, ..., 9] 

stage = pl.thread.map(slow_add1, data, workers=3, maxsize=4)
stage = pl.thread.filter(slow_gt3, stage, workers=2)

data = list(stage) # e.g. [5, 6, 9, 4, 8, 10, 7]

Here we have the exact same situation as in the previous case, except that the workers are Threads.
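
Threads are typically a good fit for IO-bound work, where the GIL is released while waiting. As an illustrative sketch only (the URLs and the fetch_length helper are made up for this example, and it assumes network access), a thread-based stage could fetch pages concurrently:

import pypeln as pl
from urllib.request import urlopen

urls = [
    "https://example.com",
    "https://example.org",
]

def fetch_length(url):
    # blocking network IO; the GIL is released while waiting
    with urlopen(url) as response:
        return url, len(response.read())

stage = pl.thread.map(fetch_length, urls, workers=2)

for url, size in stage:
    print(url, size)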

Tasks

You can create a pipeline based on asyncio.Task workers by using the task module:

import pypeln as pl
import asyncio
from random import random

async def slow_add1(x):
    await asyncio.sleep(random()) # <= some slow computation
    return x + 1

async def slow_gt3(x):
    await asyncio.sleep(random()) # <= some slow computation
    return x > 3

data = range(10) # [0, 1, 2, ..., 9] 

stage = pl.task.map(slow_add1, data, workers=3, maxsize=4)
stage = pl.task.filter(slow_gt3, stage, workers=2)

data = list(stage) # e.g. [5, 6, 9, 4, 8, 10, 7]

This is conceptually similar to the previous examples, except that everything runs in a single thread and Task workers are created dynamically. If your code is already running inside an async context, you can await the stage instead to avoid blocking:

import pypeln as pl
import asyncio
from random import random

async def slow_add1(x):
    await asyncio.sleep(random()) # <= some slow computation
    return x + 1

async def slow_gt3(x):
    await asyncio.sleep(random()) # <= some slow computation
    return x > 3


async def main():
    data = range(10) # [0, 1, 2, ..., 9] 

    stage = pl.task.map(slow_add1, data, workers=3, maxsize=4)
    stage = pl.task.filter(slow_gt3, stage, workers=2)

    data = await stage # e.g. [5, 6, 9, 4, 8, 10, 7]

asyncio.run(main())

Sync

The sync module implements all operations using synchronous generators. This module is useful for debugging or when you don't need to perform heavy CPU or IO tasks but still want to retain element order information that certain functions like pl.*.ordered rely on.

import pypeln as pl
import time
from random import random

def slow_add1(x):
    time.sleep(random()) # <= some slow computation
    return x + 1

def slow_gt3(x):
    time.sleep(random()) # <= some slow computation
    return x > 3

data = range(10) # [0, 1, 2, ..., 9] 

stage = pl.sync.map(slow_add1, data, workers=3, maxsize=4)
stage = pl.sync.filter(slow_gt3, stage, workers=2)

data = list(stage) # [4, 5, 6, 7, 8, 9, 10]

Common arguments such as workers and maxsize are accepted by this module's functions for API compatibility purposes but are ignored.
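
Because the signatures match, one convenient debugging pattern is to select the module in a single place and leave the rest of the pipeline untouched. This is just a sketch that follows from the API compatibility described above, not an official feature:

import pypeln as pl

def add1(x):
    return x + 1

def gt3(x):
    return x > 3

# Pick the worker type in one place; swap pl.sync for pl.process or
# pl.thread without changing the pipeline itself.
module = pl.sync

stage = module.map(add1, range(10), workers=3, maxsize=4)
stage = module.filter(gt3, stage, workers=2)

print(list(stage)) # [4, 5, 6, 7, 8, 9, 10] with pl.sync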

Mixed Pipelines

You can mix different worker types in one pipeline so that each stage uses the type best suited to its task, getting the maximum performance out of your code:

data = get_iterable()
data = pl.task.map(f1, data, workers=100)
data = pl.thread.flat_map(f2, data, workers=10)
data = filter(f3, data)
data = pl.process.map(f4, data, workers=5, maxsize=200)

Notice that here we even used a regular Python filter; since stages are iterables, Pypeln integrates smoothly with any Python code. Just be aware of how each stage behaves.
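
As a small illustration of that point, a stage can be consumed by any ordinary Python code that accepts an iterable, for example the built-in sum. A minimal sketch:

import pypeln as pl
import time
from random import random

def slow_add1(x):
    time.sleep(random()) # <= some slow computation
    return x + 1

stage = pl.thread.map(slow_add1, range(10), workers=3, maxsize=4)

# Stages are plain iterables, so regular Python code consumes them directly.
total = sum(x for x in stage if x > 3)
print(total) # 49 (= 4 + 5 + ... + 10)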

Pipe Operator

In the spirit of being a true pipeline library, Pypeln also lets you create your pipelines using the pipe | operator:

data = (
    range(10)
    | pl.process.map(slow_add1, workers=3, maxsize=4)
    | pl.process.filter(slow_gt3, workers=2)
    | list
)

Run Tests

A sample script is provided to run the tests in a container (either Docker or Podman is supported). To run the tests:

$ bash scripts/run-tests.sh

The script can also receive a Python version to run the tests against, e.g.:

$ bash scripts/run-tests.sh 3.7

Related Stuff

Contributors

License

MIT

Comments
  • BrokenPipeError [Errno 32] when using process

    BrokenPipeError [Errno 32] when using process

    First of all, love pypeln and thank you for your work.

    Submitting this issue because even the most basic scripts using process, like your Process example, raise a BrokenPipeError. I've tried pypeln versions 0.3.3 down to 0.2.0 in a clean venv with only pypeln & its requirements installed.

    [Errno 32] Broken pipe
    Process Process-3:
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
        self.run()
      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 99, in run
        self._target(*self._args, **self._kwargs)
      File "/Users/MYUSERNAME/.virtualenvs/pypeln-testl/lib/python3.7/site-packages/pypeln/process/stage.py", line 109, in run
        worker_namespace.done = True
      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 1127, in __setattr__
        return callmethod('__setattr__', (key, value))
      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/managers.py", line 818, in _callmethod
        conn.send((self._id, methodname, args, kwds))
      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 206, in send
        self._send_bytes(_ForkingPickler.dumps(obj))
      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
        self._send(header + buf)
      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/connection.py", line 368, in _send
        n = write(self._handle, buf)
    BrokenPipeError: [Errno 32] Broken pipe

    Please let me know if I can provide any further details. Unfortunately, I am not skilled enough to assist in the fix, hence why I lean on pypeln for multiprocessing and queuing :)

    opened by ghost 6
  • asyncio_task example fails on Jupyter Notebook

    asyncio_task example fails on Jupyter Notebook

    Maybe pypeln interferes with Jupyter's own event loop, or maybe I did something wrong. Do you have any idea?

    RuntimeError: Task <Task pending coro=<_run_task() running at /opt/conda/lib/python3.7/site-packages/pypeln/asyncio_task.py:203> cb=[gather.<locals>._done_callback() at /opt/conda/lib/python3.7/asyncio/tasks.py:691]> got Future <Future pending> attached to a different loop

    opened by kalkschneider 5
  • tqdm

    tqdm

    Hello! First of all, amazing library, I am a huge fan. I was wondering how I can add tqdm (https://github.com/tqdm/tqdm) to pypeln to see the progress.

    opened by FrancescoSaverioZuppichini 5
  • Task timeout

    Task timeout

    Hi there,

    Great project, thanks for your work!

    Do you have any way to force the timeout on long running tasks?

    pr.map(fn, stage, timeout=3)  # fn would time out after 3 seconds and skip the computation
    
    opened by muchas 4
  • Fix maxsize in process, task and thread

    Fix maxsize in process, task and thread

    This should solve https://github.com/cgarciae/pypeln/issues/64 and also https://github.com/cgarciae/pypeln/issues/55, as this bug is still there for process as well.

    I haven't fixed sync because the structure is different, but there is also a hardcoded maxsize=0, like here: https://github.com/cgarciae/pypeln/blob/master/pypeln/sync/stage.py#L93. How should this be fixed?

    It would be nice to have tests for this.

    opened by charlielito 3
  • Not working with python 3.9

    Not working with python 3.9

    I tried the Tasks example code from the pypeln README but it fails:

    Traceback (most recent call last):
      File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
      File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 790, in exec_module
      File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
      File "/Users/sebastian/test/venv/lib/python3.9/site-packages/pypeln/__init__.py", line 4, in <module>
        from . import thread
      File "/Users/sebastian/test/venv/lib/python3.9/site-packages/pypeln/thread/__init__.py", line 34, in <module>
        from .api.concat import concat
      File "/Users/sebastian/test/venv/lib/python3.9/site-packages/pypeln/thread/api/concat.py", line 8, in <module>
        from .to_stage import to_stage
      File "/Users/sebastian/test/venv/lib/python3.9/site-packages/pypeln/thread/api/to_stage.py", line 5, in <module>
        from ..stage import Stage
      File "/Users/sebastian/test/venv/lib/python3.9/site-packages/pypeln/thread/stage.py", line 8, in <module>
        from .queue import IterableQueue, OutputQueues
      File "/Users/sebastian/test/venv/lib/python3.9/site-packages/pypeln/thread/queue.py", line 17, in <module>
        class PipelineException(tp.NamedTuple, BaseException):
      File "/usr/local/Cellar/[email protected]/3.9.0_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/typing.py", line 1820, in _namedtuple_mro_entries
        raise TypeError("Multiple inheritance with NamedTuple is not supported")
    TypeError: Multiple inheritance with NamedTuple is not supported
    python-BaseException
    

    If I'm correct this has to do with python/cpython#19363

    opened by sebastianw 3
  • ordered in pypeln.task is not always ordered

    ordered in pypeln.task is not always ordered

    Hi, First of all, I would like to thank you for writing such a versatile, powerful and yet easy to use library for working with concurrent data pipelines. One of my office projects had a use case where I needed to make multiple independent POST requests to a REST API with certain payloads. We chose the pypeln module for making multiple concurrent requests. As we required the API responses in the same order as the POST requests, we tried using pypeln.task.ordered, but the received responses were not always in the expected order.

    Therefore I experimented with the following piece of code:

    import pypeln as pl
    import asyncio
    from random import random
    
    async def slow_add1(x):
        await asyncio.sleep(random())
        return x+1
    
    async def main():
        data = range(20)
        stage = pl.task.map(slow_add1, data, workers=1, maxsize=4)
        stage = pl.task.ordered(stage)
        out = await stage
    
        print("Output: ", out)
    
    for i in range(15):
        print("At Iteration:",i)
        asyncio.run(main())
    

    I observed the results over multiple runs and found that the responses are not always in the proper order. One such sample output is:

    [screenshot] Please notice that the output for iteration 3 as well as 11 is out of order (the others are OK). Since I am a new user, I might be misunderstanding something here. My doubt is: doesn't pypeln.task.ordered ensure that the responses are received in the same order as the requests, irrespective of uneven/unequal processing times? Am I missing something here?

    opened by nav181 3
  • maxsize not being respected for process.map

    maxsize not being respected for process.map

    Hello!
    First of all, let me just say that you changed my world yesterday when I found pypeln. I've wanted exactly this for a very long time. Thank you for writing it!!

    Since I'm a brand new user, I might be misunderstanding, but I think I may have found a bug. I am running the following

    • conda python 3.6.8
    • pypeln==0.4.4
    • Running in Jupyter Lab with the following installed to view progress bars
    pip install ipywidgets
    jupyter labextension install @jupyter-widgets/jupyterlab-manager
    

    Here is the code I am running

    from tqdm.auto import tqdm
    import pypeln as pyp
    import time
    
    in_list = list(range(300))
    bar1 = tqdm(total=len(in_list), desc='stage1')
    bar2 = tqdm(total=len(in_list), desc='stage2')
    bar3 = tqdm(total=len(in_list), desc='stage3')
    
    def func1(x):
        time.sleep(.01)
        bar1.update()
        return x
    
    def func2(x):
        time.sleep(.2)
        return x
        
    def func2_monitor(x):
        bar2.update()
        return x
        
    def func3(x):
        time.sleep(.6)
        bar3.update()
        return x
    
    (
        in_list
        | pyp.thread.map(func1, maxsize=1, workers=1)
        | pyp.process.map(func2, maxsize=1, workers=2)
        | pyp.thread.map(func2_monitor, maxsize=1, workers=1)
        | pyp.thread.map(func3, maxsize=1, workers=1)
        | list
        
    );
    
    

    This code runs the pipeline while showing progress bars for how much data each stage has processed. Here is what I am seeing.

    [screenshot of the progress bars]

    It appears that the first stage is consuming the entire source without respecting the maxsize argument. If this is expected behavior, I would like to understand more.

    Thank you.

    opened by robdmc 3
  • on_done is not called with on_start args

    on_done is not called with on_start args

    Hello Cristian,

    In your last release you changed the way the callback functions work. The return values of on_start are no longer passed to on_done as input arguments. I hope you didn't do that on purpose; it makes it hard to close open connections when a worker has finished.

    Your old code:

    args = params.on_start(worker_info)
    params.on_done(stage_status, *args)
    

    Your new code:

    f_kwargs = self.on_start(**on_start_kwargs)
    on_done_kwargs = {}
    done_resp = self.on_done(**on_done_kwargs)
    
    opened by kalkschneider 3
  • Create a buffering stage

    Create a buffering stage

    Love the package! Thanks for writing it.

    I have a question that I've spent about a day poking at without any good ideas. I'd like to make a stage that buffers and batches records from previous batches. For example, let's say I have an iterable that emits records and a map stage that does some transformation to each record. What I'm looking for is a stage that would combine records into groups of, say, 100 for batch processing. In other words:

    >>> (
        range(100)
        | aio.map(lambda x: x)
        | aio.buffer(10)  # <--- This is the functionality I'm looking for
        | aio.map(lambda x: sum(x))
        | list
    )
    [45, 145, 245, ...]
    

    Is this at all possible?

    Thanks!

    opened by stevenmanton 3
  • how to use on_start functions with arguments

    how to use on_start functions with arguments

    Hi @cgarciae

    I'm trying to use an on_start function that takes an extra argument. From the code I see in Stage.run, it seems that you planned to allow additional arguments apart from worker_info, but I don't see a way to pass these arguments in the end:

     def run(self) -> tp.Iterable:
    
        worker_info = WorkerInfo(index=0)
    
        on_start_args: tp.List[str] = (
            pypeln_utils.function_args(self.on_start) if self.on_start else []
        )
        on_done_args: tp.List[str] = (
            pypeln_utils.function_args(self.on_done) if self.on_done else []
        )
    
        if self.on_start is not None:
            on_start_kwargs = dict(worker_info=worker_info)
            kwargs = self.on_start(
                **{
                    key: value
                    for key, value in on_start_kwargs.items()
                    if key in on_start_args
                }
            )
    

    It seems you check for additional arguments, but on_start_kwargs is hard-coded to worker_info only. Any suggestions on how to solve this?

    Thanks Adrian

    opened by alpae 2
  • How to use process pooling to create task?[Feature Request]

    How to use process pooling to create task?[Feature Request]

    Is your feature request related to a problem? Please describe. How can process pooling be used to create tasks, instead of repeatedly creating new processes or threads?

    Describe the solution you'd like: pools.map(fn, data)

    enhancement 
    opened by liuzhuang1024 0
  • [Bug] any particular reason to set `pypeln.utils.TIMEOUT` to 0.0001?

    [Bug] any particular reason to set `pypeln.utils.TIMEOUT` to 0.0001?

    Describe the bug: ~10 thread-based workers saturate the CPU (Python 3.8 / Ubuntu / pypeln 0.4.9) by polling for new items in the input queue in a loop.

    What was the reason to set the timeout to such a low value? When I change it to 0.1 (my tasks are IO-bound and take around a second to complete) the pipeline still works fine. Is it safe to lower it? Will other pipeline types (i.e. task) be affected?

    Also, polling with a 0.0001 timeout is probably below the fidelity of the OS system timer, so the call either becomes non-blocking or blocks for much longer (i.e. on Windows the effective minimum sleep is 16ms, but maybe my knowledge is outdated).

    bug 
    opened by rudolfix 0
  • [Bug]

    [Bug]

    The error:

    Stage(process_fn=Map(f=<function allpkh at 0x000001BE3DA41318>), workers=4, maxsize=8, total_sources=1, timeout=0, dependencies=[Stage(process_fn=FromIterable(iterable=['1

    Minimal code to reproduce:

    stage = pl.task.map(allpkh, Company1, workers=4, maxsize=8)
    print(stage)

    # have tried with the process also

    Expected behavior: the function should print the results.

    Library Info (OS info and pypeln version):

    import pypeln
    print(pypeln.__version__)

    bug 
    opened by chinmoybasak 0
  • allow multiprocess dep instead of multiprocessing

    allow multiprocess dep instead of multiprocessing

    multiprocess external lib has other benefits like using dill instead of pickle, allowing us more leeway on certain edge cases that are not compatible with native multiprocessing.

    https://github.com/uqfoundation/multiprocess

    from their readme:

    multiprocess enables:

    objects to be transferred between processes using pipes or multi-producer/multi-consumer queues
    objects to be shared between processes using a server process or (for simple data) shared memory
    

    multiprocess provides:

    equivalents of all the synchronization primitives in threading
    a Pool class to facilitate submitting tasks to worker processes
    enhanced serialization, using dill
    

    Let me know your thoughts on this type of change. Happy to iterate on it.

    Thanks

    Related: https://github.com/cgarciae/pypeln/issues/53

    opened by lalo 0
  • Allow using a custom Process class

    Allow using a custom Process class

    Thank you for creating this great package.

    I would like to create a pipeline where some of the stages use PyTorch (with GPU usage). PyTorch cannot access the GPU from inside a multiprocessing.Process subprocess. For that reason PyTorch includes a torch.multiprocessing.Process class which has the same API as multiprocessing.Process.

    I would like the ability to use a custom Process class instead of the default multiprocessing.Process, so I can use PyTorch in the pipeline. Without it I'm afraid pypeln is unusable to me.

    For instance, add an optional process_class argument to map (and other functions) with a default value of multiprocessing.Process.

    Alternatively, maybe there's a workaround for what I need that I'm unaware of. In that case, please let me know.

    enhancement 
    opened by ShakedDovrat 4
Releases(0.4.9)
  • 0.4.9(Jan 6, 2022)

    Changes

    • @metataro: Fixes AttributeError when using process workers with mp start method 'spawn' #74
    • @SimonBiggs: Fixes for Python 3.9 #78
    • @cgarciae: Update dependencies + minimal python version support to 3.6.2 #89
  • 0.4.7(Jan 5, 2021)

  • 0.4.6(Oct 11, 2020)

  • 0.4.5(Oct 4, 2020)

  • 0.4.4(Jul 9, 2020)

  • 0.4.3(Jun 27, 2020)

  • 0.4.2(Jun 23, 2020)

  • 0.4.1(Jun 21, 2020)

  • 0.4.0(Jun 21, 2020)

    • Big internal refactor:
      • Reduces the risk of potential zombie workers
      • New internal Worker and Supervisor classes which make code more readable / maintainable.
      • Code is now split into individual files for each API function to make contribution easier and improve maintainability.
    • API Reference docs are now shown per function and a new Overview page was created per module.

    Breaking Changes

    • maxsize argument is removed from all from_iterable functions as it was not used.
    • worker_constructor parameter was removed from all from_iterable functions in favor of the simpler use_thread argument.
  • 0.3.3(May 31, 2020)

  • 0.3.0(Apr 6, 2020)

    Adds

    • ordered function in all modules; this orders output elements based on their order of creation in the source iterable.
    • Additional options and rules for the dependency injection mechanism. See Advanced Usage.
    • All pl.*.Stage classes now inherit from pl.BaseStage.
  • 0.2.0(Feb 18, 2020)

Owner
Cristian Garcia
ML Engineer at Quansight, working on Treex and Elegy.