Pipeline and Dataset helpers for complex algorithm evaluation.

Overview

tpcp - Tiny Pipelines for Complex Problems

A generic way to build object-oriented datasets and algorithm pipelines, and tools to evaluate them.

pip install tpcp

Why?

Evaluating Algorithms - in particular when they contain machine learning - is hard. Besides understanding the required steps (Cross-validation, Bias, Overfitting, ...), you need to implement the required concepts and make them work together with your algorithms and data. If you are doing something "regular" like training an SVM on tabular data, amazing libraries like sklearn, tslearn, pytorch, and many others have your back. By using their built-in tools (e.g. sklearn.model_selection.GridSearchCV) you prevent implementation errors, and you are provided with a sensible structure to organise your code that is well understood in the community.

However, often the problems we are trying to solve are not regular. They are complex. As an example, here is the summary of the method from one of our recent papers:

  • We have continuous multi-dimensional sensor recordings from multiple participants, covering a hospital visit and multiple days at home
  • For each participant we have global metadata (age, diagnosis) and daily annotations
  • We want to train a Hidden Markov Model that can find events in the data streams
  • We need to tune hyper-parameters of the algorithm using a participant-wise cross-validation
  • We want to evaluate the final performance of the algorithm for the settings "trained on hospital data -> tested on home data" and "trained on home data -> tested on home data"
  • Using the same structure we want to evaluate a state-of-the-art algorithm to compare the results

None of the standard frameworks can easily abstract this problem, because it involves non-tabular data, multiple data sources per participant, a non-traditional ML algorithm, and complex train-test split logic.

With tpcp we want to provide a flexible framework to approach such complex problems with structure and confidence.

How?

To make tpcp easy to use, we try to focus on a couple of key ideas:

  • Datasets are Python classes (think pytorch.datasets, but more flexible) that can be split, iterated over, and queried
  • Algorithms and Pipelines are Python classes with a simple run and optimize interface that can be implemented to fit any problem (see the sketch below)
  • Everything is a parameter and everything is optimization: In regular ML we differentiate between training and hyper-parameter optimization. In tpcp we consider everything that modifies parameters or weights an optimization. This allows us to use the same concepts and code interfaces for everything from simple algorithms that just require a grid search over one parameter to neural-network pipelines with hyper-parameter tuning
  • Provide what is difficult, allow changing everything else: tpcp implements complicated constructs like cross-validation and grid search and, whenever possible, tries to catch obvious errors in your approach. However, for the actual algorithm and dataset you are free to do whatever is required to solve your current research question.
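
To make these ideas concrete, here is a minimal sketch of a custom dataset and pipeline (the data and the "algorithm" are placeholders, not part of tpcp):

    import pandas as pd
    from tpcp import Dataset, Pipeline

    class ECGDataset(Dataset):
        def create_index(self) -> pd.DataFrame:
            # In a real dataset, the index would be derived from files on disk.
            return pd.DataFrame(
                {"participant": ["p1", "p1", "p2"], "recording": ["r1", "r2", "r1"]}
            )

    class MyPipeline(Pipeline):
        def __init__(self, threshold: float = 0.5):
            self.threshold = threshold  # everything configurable is an __init__ parameter

        def run(self, datapoint):
            # `datapoint` is a dataset subset with a single row left. Results are
            # stored on the instance with a trailing underscore, by convention.
            self.events_ = []  # the actual event detection would go here
            return self

    # Datasets can be iterated, split, and queried:
    for datapoint in ECGDataset():
        result = MyPipeline().run(datapoint)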
Comments
  • TODO notes left in docs

    re: openjournals/joss-reviews#4953

    There are a handful of leftover TODO notes that appear in the rendered documentation. Some of these might be reasonable to leave in for now if they refer to not-yet-implemented features or examples (e.g., the 4th & 5th ones below?), but I think the first 3 are probably worth resolving in order to make the documentation complete.

    opened by paxtonfitzpatrick 3
  • CI tests for Python 3.11

    re: openjournals/joss-reviews#4953

    Based on the Python version requirement in pyproject.toml https://github.com/mad-lab-fau/tpcp/blob/d88c875bd33f8c0b0ea9eb6bc8b55cd8802311ed/pyproject.toml#L16 and the PyPI classifiers that Poetry automatically generates for the package listing based on it, tpcp appears to support Python 3.8, 3.9, 3.10, and 3.11. However, the CI tests are currently run only on Python 3.8, 3.9, and 3.10: https://github.com/mad-lab-fau/tpcp/blob/d88c875bd33f8c0b0ea9eb6bc8b55cd8802311ed/.github/workflows/test-and-lint.yml#L14-L15

    I'm guessing this may be because Pytorch doesn't yet support Python 3.11 (pytorch/pytorch#86566) and is used in some of the tests. However, since the core of tpcp itself seems to support Python 3.11 just fine (I've now run most of the examples with Python 3.11.0 without issue, though of course pip install 'tpcp[torch]' fails), Pytorch is an optional dependency, and the other optional dependencies (optuna & attrs) do support Python 3.11, it would be great to have CI tests run against that version if it's going to be supported.

    You could do this by:

    1. adding "3.11" to your python-version array in line 15 of your workflow file
    2. editing your "Install dependencies" step from: https://github.com/mad-lab-fau/tpcp/blob/d88c875bd33f8c0b0ea9eb6bc8b55cd8802311ed/.github/workflows/test-and-lint.yml#L23-L28 to:
       - name: Install dependencies
         env:
           PYTHON_VERSION: ${{ matrix.python-version }}
         run: |
           python -m pip install --upgrade pip
           pip install poetry
           poetry config virtualenvs.in-project true
           poetry install -E optuna
           [[ "$PYTHON_VERSION" == "3.11" ]] || poetry install -E torch
      
    3. editing line 3 of test_hash.py from: https://github.com/mad-lab-fau/tpcp/blob/84a3048ff54b5d6912068009d15c65fdc5eb4592/tests/test_hash.py#L3 to:
      torch = pytest.importorskip("torch")
      

    Once Pytorch adds support for Python 3.11, you could simply revert steps 2 and 3.

    I also think it would be worth adding a quick note to the Dev Setup section of the README saying that development requires python>=3.8,<3.11 since the instructions there include installing Pytorch.

    If the CI tests end up revealing other, more extensive issues with Python 3.11 compatibility that I didn't run into, you could of course instead just reduce the maximum supported version in your pyproject.toml for the JOSS submission.

    opened by paxtonfitzpatrick 1
  • broken links in docs

    re: openjournals/joss-reviews#4953

    Hey @AKuederle, thanks for creating such a useful package and documenting everything so extensively! I found the "User Guides" and "Examples" sections particularly helpful. I came across just a few minor issues with the docs. The following hyperlinked text leads to readthedocs 404 pages:

    opened by paxtonfitzpatrick 1
  • Proper support for compound objects

    This allows set_params and get_params to be used on objects that contain lists of other objects as parameters.

    Note that we only support one "type" of list, namely one that consists of a sequence of (name, nested_obj) tuples. This is similar to how sklearn Pipelines work, with the difference that we allow multiple different parameters to be independent composite fields.
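
    A minimal sketch of the intended usage (the _composite_params marker and the class names here are assumptions based on the composite-object support announced in the v0.11.0 release notes below):

    from tpcp import Algorithm

    class Ensemble(Algorithm):
        _action_methods = ("run",)
        _composite_params = ("pipelines",)  # marks `pipelines` as a composite field

        def __init__(self, pipelines=None):
            # A composite parameter holds a sequence of (name, nested_obj) tuples,
            # like the `steps` of a sklearn Pipeline.
            self.pipelines = pipelines

    # Nested parameters then become addressable with the usual double-underscore
    # syntax, e.g.: ensemble.set_params(pipelines__algo_a__threshold=0.7)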

    opened by AKuederle 1
  • Allow to collect information for an optimize call

    In contrast to sklearn, tpcp is focused on running, not training, algorithms. This means we allow multiple results to be returned from a run call (by placing them as attributes on the instance). However, it is not easily possible to do the same with a call to self_optimize. While we only expect the optimized version of an algorithm to be the outcome of a self_optimize call, it might be helpful to have access to further information about the training.

    At the moment this conflicts with our fundamental API design. Further, pipelines are often cloned multiple times during optimization. This means information stored outside the parameters usually does not find its way back to user space. So whatever solution we come up with should conserve this information when optimizations are run within GridSearchCV (or similar).

    Other libraries often solve this using callbacks that can be called within the method to generate a log. I don't like that, as it is unclear how we would pass such callbacks to the optimize method and, further, how we would get the values from the callback recorder (in particular if everything is called in a nested loop like a GridSearch).

    Another option would be to specify one attribute with a fixed name that is allowed to be modified by the optimize method. As it is a fixed named attribute, higher-level wrappers could be made aware of it and collect it before the pipeline is cloned after an optimization.

    A first version of that is implemented in #45
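
    For orientation, a hedged sketch of this idea, roughly matching the self_optimize_with_info method that later landed in v0.12.0 (see the release notes below; the exact signature is assumed):

    from tpcp import OptimizablePipeline

    class MyPipeline(OptimizablePipeline):
        def self_optimize_with_info(self, dataset, **kwargs):
            # ... the actual parameter optimization would happen here ...
            info = {"final_loss": 0.3}  # hypothetical training log
            return self, info  # optimized instance plus the extra information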

    opened by AKuederle 1
  • Rudimentary first implementation of an Info object concept

    One remaining issue in the fundamental design of tpcp is that it was basically impossible to extract information from the training process. This PR tests an approach where it is possible to fill an "info" dictionary during a self_optimize (or run) call. The Optimize method has been made aware of this object and stores it as a result. This way it can be accessed when running GridSearchCV or other methods.

    opened by AKuederle 1
  • Pipelines can not be dataclasses

    At the moment pipelines can not be dataclasses due to the way we perform the initial checks (i.e. we run the checks in __init_subclass__, which is called before the dataclass wrapper is even applied).

    Maybe we can find a solution for that which is not horribly hacky.
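
    A plain-Python illustration of the ordering problem: class decorators like @dataclass only run after the class body has been created, i.e. after __init_subclass__ has already fired (the classes below are hypothetical):

    from dataclasses import dataclass

    class Base:
        def __init_subclass__(cls, **kwargs):
            super().__init_subclass__(**kwargs)
            # At this point @dataclass has not generated __init__ yet:
            print("__init__" in cls.__dict__)  # prints False

    @dataclass
    class MyPipeline(Base):
        threshold: float = 0.5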

    opened by AKuederle 1
  • Implement specific Optuna optimizer

    #27 implements a generic optimizer base class for optuna-based optimizers. However, it always requires the user to create subclasses to build custom optimizers.

    In the future it would be nice to provide specific GridSearch and GridSearchCV equivalent implementations. These would either just call the scorer or call cross_validate in their objective function.

    opened by AKuederle 1
  • Examples v1

    A first set of examples that are the foundations for all further examples.

    @richrobe I decided to go with ECG instead of sleep for now (I could just copy and paste stuff from the biosig exercise :) ). It might still make sense to add sleep stuff later when talking about ML/other more complicated algorithms. Another advantage of ECG is that the datasets are small enough to commit directly to the repo.

    closes #2

    opened by AKuederle 1
  • JOSS Paper

    @richrobe Could you have a first look regarding general structure and scope? Also, can you have a look at the sleep analysis section and add citations for that?

    Todos:

    • [x] Add image for dataset/algo/pipeline (same as in general docs)
    • [x] Fix missing dois
    opened by AKuederle 1
  • Access to data in custom aggregator

    It might be helpful to have access to the actual datapoints in the custom aggregator.

    This could be used to weight results (e.g. by the number of events in each datapoint, or by the type of "task" represented by each datapoint).
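
    A rough sketch of what a weighted aggregator could look like (the keyword-only parameter names are assumed from the v0.9.0 release notes below; dp.events is a hypothetical attribute):

    from tpcp.validate import Aggregator

    class WeightedAggregator(Aggregator):
        @classmethod
        def aggregate(cls, /, *, values, datapoints):
            # Weight each datapoint's score by its (hypothetical) number of events.
            weights = [len(dp.events) for dp in datapoints]
            return sum(v * w for v, w in zip(values, weights)) / sum(weights)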

    opened by AKuederle 0
  • Parameters with leading "_" can cause issues

    Parameters with leading "_" can lead to problems because of the way Python splits strings on a token in reverse mode:
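
    An assumed reconstruction of the problem: splitting the nested name "nested___param" (i.e. "nested" + "__" + "_param") from the right picks the wrong separator, so the leading "_" migrates to the wrong side.

    >>> "nested___param".rsplit("__", 1)  # intended: ['nested', '_param']
    ['nested_', 'param']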

    However, I would suggest simply not allowing leading "_" for parameters. It leads to strange names (with triple underscores) anyway.

    opened by AKuederle 0
  • Better Result Objects

    At the moment, all attributes with a trailing underscore are considered results. This works nicely in the general case.

    It becomes a little annoying when using dataclasses/attrs, as we need to exclude these fields explicitly when we provide type information for them.

    Further, to avoid making the API even more confusing, we only allow the run method to write these attributes. The optimize methods have no way to write results (see #46, #45).

    To overcome both issues, we could allow defining nested typed classes to store results in. With a little bit of magic (see below), we can enforce that each instance automatically gets a fresh instance of the class to write results to.

    We could either enforce a naming convention for these objects (e.g. r_, R_) for the actual results, or use the decorator to mark and find them.

    With that, we could also allow defining specific additional optimization results, either using a specific name or a separate decorator.

    from typing import Any, Literal, Optional, Type, Generic, TypeVar, Union, overload
    
    T = TypeVar("T")
    
    
    class _Generic:
        """Fallback result container that accepts arbitrary attributes."""
    
        def __setattr__(self, __name: str, __value: Any) -> None:
            super().__setattr__(__name, __value)
    
        def __getattr__(self, __name: str) -> Any:
            try:
                return self.__dict__[__name]
            except KeyError as e:
                raise AttributeError(__name) from e
    
    
    class result(Generic[T]):
        """Descriptor that lazily creates one fresh result object per instance."""
    
        name: str
        klass: Type[T]
    
        def __init__(self, klass: Type[T] = _Generic):
            self.klass = klass
    
        def __set_name__(self, owner, name):
            self.name = name
    
        @overload
        def __get__(self, obj: Literal[None], objtype: Any) -> Type[T]:
            pass
    
        @overload
        def __get__(self, obj: object, objtype: Any) -> T:
            pass
    
        def __get__(self, obj: Optional[object], objtype: Any = None) -> Union[T, Type[T]]:
            if obj is not None:
                new_instance = self.klass()
                name: str = getattr(self, "name", self.klass.__name__)
                # Cache the new instance in the object's __dict__. As this is a
                # non-data descriptor, later accesses return the cached object.
                setattr(obj, name, new_instance)
                return new_instance
            return self.klass
    
    
    class ParentClassWithTypedResults:
        @result
        class R_:
            test: str
    
    
    class ParentClassUntypedResults:
        r_ = result()
    
    
    a = ParentClassUntypedResults()
    a.r_.something = "bla"  # untyped results accept arbitrary attributes
    
    b = ParentClassWithTypedResults()
    b.R_.test = "blas"  # typed results could get autocomplete/type checking
    Unfortunately, PyCharm's autocomplete is not happy with the decorator, and we don't get type checking or autocomplete for the typed-results case. VS Code works.

    opened by AKuederle 0
  • Representations for objects

    #13 implements a basic version of a representation.

    However, it has some remaining issues:

    • [ ] Long representations are all shown in the same line (maybe use this: https://github.com/scikit-learn/scikit-learn/blob/056f993b411c1fa5cf6a2ced8e51de03617b25b4/sklearn/base.py#L104)
    • [ ] Objects with unusual representations (e.g. pd.DataFrames) can mess up the formatting when included as a parameter

    Some further notes:

    • sklearn does something very elegant and does not include parameters in the repr that are still at their default value.
    • Dataset already has a repr that does not include the parameters. It might be nice to include the custom user parameters in that representation as well.
    Labels: good first issue, help wanted
    opened by AKuederle 0
Releases(v0.12.2)
  • v0.12.2(Dec 14, 2022)

  • v0.12.1(Dec 14, 2022)

    Changed

    • The safe_run method unintentionally double-wrapped the run method if it already had a make_action_safe decorator. This is now fixed.

    Fixed

    • Under certain conditions, hashing of an object defined in the __main__ module failed. This release implements a workaround for this issue that should hopefully resolve most cases.
  • v0.12.0(Nov 15, 2022)

    Added

    • Added the concept of the self_optimize_with_info method, which can be implemented instead of or in addition to the self_optimize method. This method should be used when an optimize method needs to return additional information besides the main result, and it is supported by the Optimize wrapper. (https://github.com/mad-lab-fau/tpcp/pull/49)
    • Added a new method called __clone_param__ that gives a class control over how params are cloned. This can be helpful if, for some reason, objects don't behave well with deepcopy.
    • Added a new method called __repr_parameters__ that gives a class control over how params are represented. This can be used to customize the representation of individual parameters in the __repr__ method.
    • Added a proper repr for CloneFactory
  • v0.11.0(Oct 17, 2022)

    [0.11.0] - 2022-10-17

    Added

    • Support for Optuna >3.0
    • Example on how to use attrs and dataclass with tpcp
    • Added versions for Dataset and CustomOptunaOptimize that work with dataclasses and attrs.
    • Added first-class support for composite objects (e.g. objects that need a list of other objects as parameters). This is basically sklearn Pipelines with fewer restrictions (https://github.com/mad-lab-fau/tpcp/pull/48).

    Changed

    • CustomOptunaOptimize now expects a callable to define the study, instead of taking a study object itself. This ensures that the study objects can be independent when the class is called as part of cross_validate.
    • Parameters are only validated when get_params is called. This reduces the reliance on __init_subclass__ and on correctly wrapping the init, which makes it easier to support attrs and dataclasses.
  • v0.10.0(Sep 9, 2022)

    [0.10.0] - 2022-09-09

    Changed

    • Reworked once again when and how annotations for tpcp classes are processed. Processing is now delayed until you actually use the annotations (i.e. as part of the "safe wrappers"). The only user-facing change is that the chance of running into edge cases is lower and that __field_annotations__ is now only available on class instances and not on the class itself anymore.
  • v0.9.1(Sep 9, 2022)

    [0.9.1] - 2022-09-08

    Fixed

    • Classes without an __init__ can now pass the tpcp checks

    Added

    • You can nest parameter annotations into ClassVar and they will still be processed. This is helpful when using dataclasses and annotating nested parameter values.
  • v0.9.0(Aug 11, 2022)

    This release drops Python 3.7 support!

    Added

    • A bunch of new high-level documentation
    • Added submission version of JOSS paper

    Changed

    • The aggregate methods of custom aggregators now get the list of datapoints in addition to the scores. Both parameters are now passed as keyword-only arguments.
  • v0.8.0(Aug 9, 2022)

    [0.8.0] - 2022-08-09

    Added

    • An example on how to use the dataclass decorator with tpcp classes. (https://github.com/mad-lab-fau/tpcp/pull/41)
    • In case you need complex aggregations of scores across data points, you can now wrap the return values of score functions in custom Aggregators. The best place to learn about this feature is the new "Custom Scorer" example. (https://github.com/mad-lab-fau/tpcp/pull/42)
    • All cross_validation based methods now have a new parameter called mock_labels. This can be used to provide a "y" value to the split method of a sklearn-cv splitter. This is required e.g. for Stratified KFold splitters. (https://github.com/mad-lab-fau/tpcp/pull/43)
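
    A rough sketch of how mock_labels could be used with a stratified splitter (the pipeline, dataset, scoring function, and column name here are placeholders, not part of tpcp):

    from sklearn.model_selection import StratifiedKFold
    from tpcp.optimize import Optimize
    from tpcp.validate import cross_validate

    results = cross_validate(
        Optimize(my_pipeline),  # placeholder optimizable pipeline
        my_dataset,             # placeholder tpcp Dataset
        scoring=my_score_func,  # placeholder scoring function
        cv=StratifiedKFold(n_splits=3),
        mock_labels=my_dataset.index["diagnosis"].to_list(),  # "y" passed to .split()
    )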

    Changed

    • Most of the class processing and sanity checks now happen in the init (or rather a post-init hook) instead of during class initialisation. This increases the chance of some edge cases, but allows classes to be post-processed before tpcp checks are run. Most importantly, it allows the use of the dataclass decorator in combination with tpcp classes. For the end user, this change will have minimal impact. Only if you relied on accessing special tpcp class parameters (e.g. __field_annotations__) before the class was initialised will you get an error now. Other than that, you will only notice a very slight overhead on class initialisation, as we now need to run some basic checks when you call the init or get_params. (https://github.com/mad-lab-fau/tpcp/pull/41)
    • The API of the Scorer class was modified. If you used custom Scorers before, they will likely not work anymore. Further, we removed the error_score parameter from the Scorer and all related methods that forwarded this parameter (e.g. GridSearch). Errors that occur in the score function will now always be raised! If you need special handling of error cases, handle them in your score function yourself (i.e. using try-except). This gives more granular control and makes handling the expected score-function return values much easier on the tpcp side. (https://github.com/mad-lab-fau/tpcp/pull/42)
  • v0.7.0(Jun 23, 2022)

    [0.7.0] - 2022-06-23

    Added

    • The Dataset class now has a new parameter group, which returns the group/row information if there is only a single group/row left in the dataset. It returns either a string or a namedtuple to make it easy to access the group/row information.
    • The Dataset.groups parameter now returns a list of namedtuples where it previously returned a list of normal tuples.
    • New is_single_group and assert_is_single_group methods for the Dataset class are added. They are shortcuts for calling self.is_single(groupby_cols=self.groupby_cols) and self.assert_is_single(groupby_cols=self.groupby_cols).
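
    A hypothetical illustration of the new accessors (dataset and column names assumed):

    subset = dataset.get_subset(participant="p1", recording="r1")
    subset.group              # namedtuple of the single remaining group/row
    dataset.groups            # list of such namedtuples, one per group/row
    subset.is_single_group()  # same as subset.is_single(groupby_cols=subset.groupby_cols)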

    Removed

    • We removed the OptimizableAlgorithm base class, as it is not really useful. We recommend implementing your own base class or mixin if you are implementing a set of algorithms that need a normal and an optimizable version.
  • v0.6.3(May 31, 2022)

  • v0.6.1(Apr 5, 2022)

    Changed

    • Fixed bug with tensor hashing (https://github.com/mad-lab-fau/tpcp/pull/37)
    • Fixed an issue with memoization during hashing (https://github.com/mad-lab-fau/tpcp/pull/37)
    • Fixed an issue that the safe_optimize_wrapper could not correctly detect changes to mutable objects. This is now fixed by pre-calculating all the hashes. (https://github.com/mad-lab-fau/tpcp/pull/38)
  • v0.6.0(Apr 4, 2022)

    Added

    • A new class to wrap the optimization framework Optuna. CustomOptunaOptimize can be used to create custom wrapper classes for various Optuna optimizations, that play nicely with tpcp and can be nested within tpcp operations. (https://github.com/mad-lab-fau/tpcp/pull/27)
    • A new example for the CustomOptunaOptimize wrapper that explains how to create complex custom optimizers using Optuna and the new Scorer callbacks (see below) (https://github.com/mad-lab-fau/tpcp/pull/27)
    • Scorer now supports an optional callback function, which will be called after each datapoint is scored (see the sketch after this list). (https://github.com/mad-lab-fau/tpcp/pull/29)
    • Pipelines, Optimize objects, and Scorer are now Generic. This improves typing (in particular with VsCode), but means a little bit more typing (pun intended) when creating new Pipelines and Optimizers (https://github.com/mad-lab-fau/tpcp/pull/29)
    • Added option for scoring function to return arbitrary additional information using the NoAgg wrapper (https://github.com/mad-lab-fau/tpcp/pull/31)
    • (experimental) Torch compatibility for hash-based comparisons (e.g. in the safe_run wrapper). Before, the wrapper would fail with torch module subclasses, as their pickle-based hashes were not consistent. We implemented a custom hash function that should solve this. For now, we consider this feature experimental, as we are not sure if it breaks in certain use cases. (https://github.com/mad-lab-fau/tpcp/pull/33)
    • tpcp.types now exposes a bunch of internal types that might be helpful to type custom Pipelines and Optimizers. (https://github.com/mad-lab-fau/tpcp/pull/34)
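
    A rough sketch combining the NoAgg wrapper and the new Scorer callback (the callback parameter name and its keyword arguments are assumptions; events_ is a hypothetical result attribute):

    from tpcp.validate import NoAgg, Scorer

    def score(pipeline, datapoint):
        pipeline = pipeline.safe_run(datapoint)
        return {
            "n_events": len(pipeline.events_),      # aggregated (averaged) as usual
            "raw_events": NoAgg(pipeline.events_),  # passed through without aggregation
        }

    def on_single_score(**kwargs):
        # Called after each datapoint is scored; kwargs assumed to carry progress info.
        print("scored one datapoint")

    scorer = Scorer(score, single_score_callback=on_single_score)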

    Changed

    • The return type for the individual values in the Scorer class is now List[float] instead of np.ndarray. This also affects the output of cross_validate, GridSearch.gs_results_ and GridSearchCV.cv_results_ (https://github.com/mad-lab-fau/tpcp/pull/29)
    • cf now has a "faked" return type, so that type checkers in user code do not complain anymore. (https://github.com/mad-lab-fau/tpcp/pull/29)
    • All TypeVar Variables are now called SomethingT instead of Something_ (https://github.com/mad-lab-fau/tpcp/pull/34)
  • v0.5.0(Mar 15, 2022)

    [0.5.0] - 2022-03-15

    Added

    • The make_optimize_safe decorator (and hence the Optimize method) makes use of the parameter annotations to check that only parameters marked as OptimizableParameter are changed by the self_optimize method. This check also supports nested parameters, in case the optimization involves optimizing nested objects. A sketch follows below. (https://github.com/mad-lab-fau/tpcp/pull/9)
    • All tpcp objects now have a basic representation that is automatically generated based on their parameters (https://github.com/mad-lab-fau/tpcp/pull/13)
    • Added algo optimization and evaluation guide and improved docs overall (https://github.com/mad-lab-fau/tpcp/pull/26)
    • Added examples for all fundamental concepts (https://github.com/mad-lab-fau/tpcp/pull/23)
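
    A hedged sketch of how these annotations look in practice (the "template learning" is a stand-in for a real optimization):

    from typing import List, Optional
    from tpcp import OptimizablePipeline, OptimizableParameter, Parameter

    class TemplatePipeline(OptimizablePipeline):
        threshold: Parameter[float]  # must NOT be modified by self_optimize
        template: OptimizableParameter[Optional[List[float]]]  # may be optimized

        def __init__(self, threshold: float = 0.5, template: Optional[List[float]] = None):
            self.threshold = threshold
            self.template = template

        def self_optimize(self, dataset, **kwargs):
            self.template = [0.0, 1.0]  # hypothetical "learned" template
            return self

    Wrapping this pipeline in Optimize (or applying make_optimize_safe) would raise if self_optimize changed threshold instead of template.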
  • v0.2.0-alpha.3(Nov 18, 2021)

  • v0.2.0-alpha.1(Nov 18, 2021)

Owner
Machine Learning and Data Analytics Lab FAU
Public projects of the Machine Learning and Data Analytics Lab at the Friedrich-Alexander-University Erlangen-Nürnberg