Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Tuplex

Last update: Jan 4, 2023


Overview

Tuplex: Blazing Fast Python Data Science

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set. Under the hood, Tuplex is based on data-driven compilation and dual-mode processing, two key techniques that make it possible for Tuplex to provide speed comparable to a pipeline written in hand-optimized C++.
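The two techniques named above can be illustrated with a small plain-Python sketch. It is a conceptual toy, not Tuplex's implementation: the helper names (infer_common_type, specialize_for_int, run_dual_mode) are hypothetical, and the "specialized" function merely stands in for generated native code. The idea is that a sample of the input decides the common-case type, rows of that type take a fast path, and all other rows fall back to the general Python path.

```python
from collections import Counter


def udf(x):
    # User code: works for ints and for numeric strings.
    return int(x) ** 2


def infer_common_type(sample):
    # Data-driven step: pick the most frequent Python type in a sample.
    return Counter(type(v) for v in sample).most_common(1)[0][0]


def specialize_for_int(x):
    # Stand-in for generated native code; only valid when x is an int.
    return x * x


def run_dual_mode(rows, sample_size=3):
    common = infer_common_type(rows[:sample_size])
    fast = specialize_for_int if common is int else None
    out = []
    for row in rows:
        if fast is not None and type(row) is common:
            out.append(fast(row))  # normal-case path
        else:
            out.append(udf(row))   # general-case fallback path
    return out


result = run_dual_mode([1, 2, "3", 4])
print(result)  # [1, 4, 9, 16]
```

Here the string "3" does not match the sampled common type, so it is processed by the original UDF while the integers go through the specialized function.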

You can join the discussion on Tuplex on our Gitter community or read up more on the background of Tuplex in our SIGMOD'21 paper.

Contributions welcome!

Installation
- Docker image
- PyPI
Building
Example
License

Installation

To install Tuplex, use the PyPI package on Linux, or the Docker container on macOS, which launches a Jupyter notebook with Tuplex preinstalled.

Docker

docker run -p 8888:8888 tuplex/tuplex

PyPI

pip install tuplex

Building

Tuplex is available for macOS and Linux. The current version has been tested under macOS 10.13-10.15 and Ubuntu 18.04 and 20.04 LTS. To build Tuplex from source, install the dependencies first and then build the package.

macOS build from source

Building Tuplex requires several other packages, which can be installed via brew.

brew install llvm@9 boost boost-python3 aws-sdk-cpp pcre2 antlr4-cpp-runtime googletest gflags yaml-cpp celero
python3 -m pip install cloudpickle numpy
python3 setup.py install

Ubuntu build from source

To facilitate installing the dependencies on Ubuntu, we provide two scripts (scripts/ubuntu1804/install_reqs.sh for Ubuntu 18.04, or scripts/ubuntu2004/install_reqs.sh for Ubuntu 20.04). To build an up-to-date version of Tuplex on Ubuntu 18.04, run the following (on Ubuntu 20.04, substitute the ubuntu2004 script):

./scripts/ubuntu1804/install_reqs.sh
python3 -m pip install cloudpickle numpy
python3 setup.py install

Customizing the build

Besides building a pip package, you can also invoke cmake directly. To compile the package via cmake:

mkdir build
cd build
cmake ..
make -j$(nproc)

The Python package for Tuplex can then be found in build/dist/python, and the googletest-based C++ test executables in build/dist/bin.

To customize the cmake build, the following options are available to be passed via -D:

option	values	description
`CMAKE_BUILD_TYPE`	`Release` (default), `Debug`, `RelWithDebInfo`, `tsan`, `asan`, `ubsan`	select compile mode. Tsan/Asan/Ubsan correspond to Google Sanitizers.
`BUILD_WITH_AWS`	`ON` (default), `OFF`	build with AWS SDK or not. On Ubuntu this will build the Lambda executor.
`GENERATE_PDFS`	`ON`, `OFF` (default)	in Debug mode, output PDF files for ASTs of UDFs, query plans, ... if graphviz is installed (e.g., `brew install graphviz`)
`PYTHON3_VERSION`	`3.6`, ...	select the Python 3 version to build against, given as `major.minor`. To specify the Python executable instead, use the options provided by cmake.
`LLVM_ROOT_DIR`	e.g. `/usr/lib/llvm-9`	specify which LLVM version to use
`BOOST_DIR`	e.g. `/opt/boost`	specify which Boost version to use. Note that the python component of boost has to be built against the python version used to build Tuplex

For example, to create a debug build which outputs PDFs use the following snippet:

cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..
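
When the build is driven from a script, the -D options from the table can be assembled programmatically. The following is a minimal sketch; cmake_command is a hypothetical helper introduced here for illustration and is not part of Tuplex. It only builds the argument list and does not run cmake.

```python
import shlex


def cmake_command(options, source_dir=".."):
    # Hypothetical helper: turn a dict of options into a cmake argument list.
    args = ["cmake"]
    for name, value in options.items():
        if isinstance(value, bool):
            value = "ON" if value else "OFF"  # cmake-style boolean
        args.append(f"-D{name}={value}")
    args.append(source_dir)
    return args


cmd = cmake_command({"CMAKE_BUILD_TYPE": "Debug", "GENERATE_PDFS": True})
print(shlex.join(cmd))  # cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..
```

The resulting list can be passed to subprocess.run(cmd, check=True) from inside the build directory.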

Example

Tuplex can be used in Python interactive mode, in a Jupyter notebook, or by copying the code below to a file. To try it out, run the following example:

from tuplex import *
c = Context()
res = c.parallelize([1, 2, None, 4]).map(lambda x: (x, x * x)).collect()
# this prints [(1, 1), (2, 4), (4, 16)]
print(res)

More examples can be found here.
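
The printed result of the example omits the None row: in plain Python, None * None raises a TypeError, and the output shown above contains only the rows for which the lambda succeeded. The sketch below reproduces that behavior without Tuplex, assuming that failing rows are simply set aside. emulate_pipeline is a hypothetical helper and does not reflect how Tuplex processes rows internally.

```python
def emulate_pipeline(values, udf):
    # Apply udf to each value; collect rows that raise separately.
    results, failed = [], []
    for v in values:
        try:
            results.append(udf(v))
        except Exception as exc:
            failed.append((v, type(exc).__name__))
    return results, failed


res, failed = emulate_pipeline([1, 2, None, 4], lambda x: (x, x * x))
print(res)     # [(1, 1), (2, 4), (4, 16)]
print(failed)  # [(None, 'TypeError')]
```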

License

Tuplex is available under the Apache 2.0 License. To cite the paper, use:

@inproceedings{10.1145/3448016.3457244,
author = {Spiegelberg, Leonhard and Yesantharao, Rahul and Schwarzkopf, Malte and Kraska, Tim},
title = {Tuplex: Data Science in Python at Native Code Speed},
year = {2021},
isbn = {9781450383431},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3448016.3457244},
doi = {10.1145/3448016.3457244},
booktitle = {Proceedings of the 2021 International Conference on Management of Data},
pages = {1718–1731},
numpages = {14},
location = {Virtual Event, China},
series = {SIGMOD/PODS '21}
}

Comments

installation errors

Hi, when running "pip install tuplex", I get the following errors:

ERROR: Could not find a version that satisfies the requirement tuplex (from versions: none)
ERROR: No matching distribution found for tuplex

Thanks,

opened by rubenSaro 5
[BUG] Tuplex fails to decode the type when calling .unique() for 311 benchmark.
I downloaded a 311 requests dataset which contains 6 columns (one of them is "Incident Zip") in order to run the 311 benchmark (tuplex/benchmarks/311/runtuplex.py)

When the following code is executed:

tstart = time.time()
df = ctx.csv(
    ",".join(perf_paths),
    null_values=["Unspecified", "NO CLUE", "NA", "N/A", "0", ""],
    type_hints={
        0: typing.Optional[str],
        1: typing.Optional[str],
        2: typing.Optional[str],
        3: typing.Optional[str],
        4: typing.Optional[str],
        5: typing.Optional[str],
    },
)

# Do the pipeline
df = df.mapColumn("IncidentZip", fix_zip_codes).unique()

# Output to csv
df.tocsv(output_path)
job_time = time.time() - tstart
print(json.dumps({"startupTime": startup_time, "jobTime": job_time}))

I receive the following error: [2022-04-26 16:57:10.185] [global] [error] decoding of other types not yet supported...

Initially I suspected that the issue might be that Incident Zip column is inferred as f64 so I explicitly set the type of all columns to be str, as seen in the code snippet above.

However, the error still persists. Any ideas?

PS: I believe the error is in the unique() method, because the pipeline works fine when I remove the call.

Thanks in advance.
bug
opened by kchasialis 4
Fix aggregateByKey python binding

The python binding for aggregateByKey (in dataset.py) currently crashes due to the deprecated function get_lambda_source - update to the new function get_udf_source.

opened by rahulyesantharao 4
[BUG] Tuplex crashes in exception resolver when take/collect() is called twice
Tuplex crashes because of a corrupted exception partition when the pipeline is executed twice. This may be caused by partition invalidation.

Example pipeline that triggers the crash:

ds = c.parallelize([(1, "A"),(2, "a"),(3, 2)]).filter(lambda a, b: a > 1)
ds.collect()
ds.collect()
bug
opened by KorlaMarch 3
Weld benchmarks - Python 3.

Hello!

I tried building Weld for Python 3 but had no success. I managed, however, to get pyweld and pygrizzly installed using python2 pip. Python 2 is deprecated and I would like to build pygrizzly and pyweld for Python 3.

Did you manage to build pyweld and pygrizzly for Python 3?

Thanks in advance!

opened by kchasialis 2
[BUG] Is keyword crash
Tuplex crashes when the is keyword is used with an argument that is neither a boolean nor None.

Example:

c = Context()
c.parallelize([1, 2, 3]).filter(lambda x: x is 2).collect()
opened by bgivertz 2
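
For context, in plain Python is tests object identity rather than equality, and CPython 3.8 and later emit a SyntaxWarning when is is used with a literal such as 2. The sketch below shows the difference and expresses the likely intent of the filter with ==. It runs without Tuplex, and keep_twos is a hypothetical helper; whether switching to == avoids the crash in Tuplex itself is an assumption that was not verified here.

```python
def keep_twos(values):
    # '==' compares values, which is what the filter most likely intends.
    return [x for x in values if x == 2]


filtered = keep_twos([1, 2, 3])
print(filtered)  # [2]

# Two integers with the same value are equal, but need not be the same object.
big = 10 ** 20
same_value = int("1" + "0" * 20)
print(big == same_value)  # True
print(big is same_value)  # identity result depends on the interpreter
```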

Ubuntu 20.04 installation issues

Hello :)

Instead of opening a new issue I thought I would piggyback onto this one since it's very related. (Do let me know if you'd rather have me open a new one and I will do so.)

I'm following the Ubuntu 20.04 build options specified in the readme and I can't get past the python3 setup.py install step (I'm using Python 3.8). The error I get is the following:

    ERROR: Command errored out with exit status 1:
     command: /home/gorka/dev/tuplex/venv38/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/gorka/dev/tuplex/setup.py'"'"'; __file__='"'"'/home/gorka/dev/tuplex/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps
         cwd: /home/gorka/dev/tuplex/
    Complete output (62 lines):
    running develop
    running egg_info
    writing tuplex/python/tuplex.egg-info/PKG-INFO
    writing dependency_links to tuplex/python/tuplex.egg-info/dependency_links.txt
    writing requirements to tuplex/python/tuplex.egg-info/requires.txt
    writing top-level names to tuplex/python/tuplex.egg-info/top_level.txt
    reading manifest file 'tuplex/python/tuplex.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    warning: no files found matching '*' under directory 'tuplex/libexec'
    warning: no files found matching '*' under directory 'tuplex/tuplex/libexec'
    warning: no files found matching '*.so'
    warning: no files found matching '*.dll'
    adding license file 'LICENSE'
    writing manifest file 'tuplex/python/tuplex.egg-info/SOURCES.txt'
    running build_ext
    -- Building dev version
    CMake Error at CMakeLists.txt:9 (project):
      Running
    
       '/tmp/pip-build-env-8hwvbpaz/overlay/bin/ninja' '--version'
    
      failed with:
    
       No such file or directory
    
    
    -- Configuring incomplete, errors occurred!
    See also "/home/gorka/dev/tuplex/build/temp.linux-x86_64-3.8/CMakeFiles/CMakeOutput.log".
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/gorka/dev/tuplex/setup.py", line 308, in <module>
        setup(name="tuplex",
      File "/tmp/pip-build-env-wpou7eah/overlay/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
        return distutils.core.setup(**attrs)
      File "/usr/lib/python3.8/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/usr/lib/python3.8/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/tmp/pip-build-env-wpou7eah/overlay/lib/python3.8/site-packages/setuptools/command/develop.py", line 34, in run
        self.install_for_development()
      File "/tmp/pip-build-env-wpou7eah/overlay/lib/python3.8/site-packages/setuptools/command/develop.py", line 114, in install_for_development
        self.run_command('build_ext')
      File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/tmp/pip-build-env-wpou7eah/overlay/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
        _build_ext.run(self)
      File "/usr/lib/python3.8/distutils/command/build_ext.py", line 340, in run
        self.build_extensions()
      File "/usr/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
        self._build_extensions_serial()
      File "/usr/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
        self.build_extension(ext)
      File "/home/gorka/dev/tuplex/setup.py", line 205, in build_extension
        subprocess.check_call(
      File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['cmake', '/home/gorka/dev/tuplex/tuplex', '-DPYTHON_EXECUTABLE=/home/gorka/dev/tuplex/venv38/bin/python', '-DCMAKE_BUILD_TYPE=Release', '-DPYTHON3_VERSION=3.8', '-GNinja']' returned non-zero exit status 1.
    configuring cmake with: cmake /home/gorka/dev/tuplex/tuplex -DPYTHON_EXECUTABLE=/home/gorka/dev/tuplex/venv38/bin/python -DCMAKE_BUILD_TYPE=Release -DPYTHON3_VERSION=3.8 -GNinja
    ----------------------------------------
ERROR: Command errored out with exit status 1: /home/gorka/dev/tuplex/venv38/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/gorka/dev/tuplex/setup.py'"'"'; __file__='"'"'/home/gorka/dev/tuplex/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.

CMakeFiles/CMakeOutput.log contains the following:

The system is: Linux - 5.4.0-89-generic - x86_64

Thank you in advance.

Cheers, Gorka.

Originally posted by @gorkaerana in https://github.com/tuplex/tuplex/issues/8#issuecomment-948719143

opened by LeonhardFS 2

Iterator built-in functions
Add the following built-in functions in UDFs. Here, iterable can be a list, homogeneous tuple, string, range or iterator object.
(i) iter(iterable): Returns an iterator.
(ii) reversed(seq): seq can be a list, homogeneous tuple, string or range object.
(iii) enumerate(iterable[, start]): start=0 by default if not provided. Returns an iterator with yieldType=(I64, iter(iterable).yieldType).
(iv) zip(*iterables): Currently at least one iterable must be provided. Returns an iterator with yieldType=(iter(iterable_1).yieldType, iter(iterable_2).yieldType, ..., iter(iterable_N).yieldType).
(v) next(iterator[, default]): Returns the next item from iterator. If iterator is exhausted, returns default if default is provided, otherwise raises StopIteration.

Reference: Built-in Functions -- Python 3.9.7 documentation

Using unsupported types in the functions above (a dictionary as iterable or seq, a non-homogeneous tuple as iterable or seq, a default in a next call whose type differs from iterator.yieldType) or returning an iterator from a UDF is resolved through fallback mode. The error-handling code is adapted from ASTHelper and is now being refactored into IFailable.

Add support for using iterators (generated from iter, enumerate or zip) as testlist in for loops (i.e. for i in iteratorType: ...).
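
As a reference for the semantics these built-ins are meant to match, the following snippet shows their behavior in standard Python. It does not use Tuplex and says nothing about which cases are compiled versus handled in fallback mode.

```python
letters = ["a", "b", "c"]

it = iter(letters)
first = next(it)                        # 'a'

rev = list(reversed(letters))           # ['c', 'b', 'a']
enum = list(enumerate(letters, 1))      # [(1, 'a'), (2, 'b'), (3, 'c')]
pairs = list(zip(letters, range(3)))    # [('a', 0), ('b', 1), ('c', 2)]

# next() with a default returns the default once the iterator is exhausted.
empty = iter([])
fallback = next(empty, -1)              # -1

# Without a default, an exhausted iterator raises StopIteration.
try:
    next(empty)
    exhausted = False
except StopIteration:
    exhausted = True

# Iterators can drive a for loop directly.
total = 0
for i, ch in enumerate("xyz"):
    total += i                          # 0 + 1 + 2
```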
opened by yunzhi-jake 2
Historyserver

This adds support for the WebUI. It supports the previous functionality of a much older version of Tuplex, and adds new support for splitting the plan into stages, tracking stage dependencies, join operators, aggregate operators, and showing UDFs for filter operators.

It would be great if you could do a code review and tell me what I should change.

opened by colby-anderson 2

SegFault after `c = Context()`

After installing in a venv with:

python -m pip install --upgrade --upgrade-strategy eager tuplex

... and attempting the first example:

python
from tuplex import *
Welcome to

  _____            _
 |_   _|   _ _ __ | | _____  __
   | || | | | '_ \| |/ _ \ \/ /
   | || |_| | |_) | |  __/>  <
   |_| \__,_| .__/|_|\___/_/\_\ 0.3.0
            |_|
    
using Python 3.8.10 (default, May  4 2021, 00:00:00) 
[GCC 10.2.1 20201125 (Red Hat 10.2.1-9)] on linux
 Interactive Shell mode
>>> c = Context()
Segmentation fault (core dumped)

opened by rickhg12hs 2

[BUG] ResolveOperator schema mismatch

Currently, if a resolver's schema does not match that of its corresponding operator, the job fails. However, this should be allowed, even though it effectively means triggering the fallback path.

The bug was discovered through the random shuffling failure on CI, where tracing within the resolver triggered a schema mismatch.
bug

opened by LeonhardFS 1
CVE-2007-4559 Patch

Patching CVE-2007-4559

Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

If you have further questions you may contact us through this project's lead researcher Kasimir Schulz.
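
The kind of check described above can be sketched as follows. This is an illustrative version, not the patch from the pull request: safe_extractall and make_tar are hypothetical names, and the check only covers member paths (it does not inspect symlink or hardlink targets). Recent Python versions also provide extraction filters for tarfile (for example filter="data"), which may be preferable where available.

```python
import io
import os
import tarfile
import tempfile


def safe_extractall(tar, target_dir):
    # Refuse to extract if any member would land outside target_dir.
    root = os.path.realpath(target_dir)
    for member in tar.getmembers():
        dest = os.path.realpath(os.path.join(root, member.name))
        if os.path.commonpath([root, dest]) != root:
            raise ValueError(f"unsafe path in archive: {member.name}")
    tar.extractall(target_dir)


def make_tar(names):
    # Build a small in-memory archive for the demonstration.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            data = b"demo"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


with tempfile.TemporaryDirectory() as tmp:
    # A well-behaved archive extracts normally.
    with tarfile.open(fileobj=make_tar(["ok.txt"]), mode="r") as tar:
        safe_extractall(tar, tmp)
    extracted = os.path.exists(os.path.join(tmp, "ok.txt"))

    # An archive with a path-traversal member is rejected before extraction.
    try:
        with tarfile.open(fileobj=make_tar(["../evil.txt"]), mode="r") as tar:
            safe_extractall(tar, tmp)
        blocked = False
    except ValueError:
        blocked = True

print(extracted, blocked)  # True True
```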

opened by TrellixVulnTeam 0
[FEATURE] Missing projection pushdown for aggregates

Aggregates should be improved (perhaps after the struct type is enabled) so that selects are pushed down from both the combine and aggregate UDFs. Error handling also has to be added, and possibly higher-level functions such as count, mean, std, min, max, ... that provide standard aggregates.
enhancement

opened by LeonhardFS 0

Owner

Tuplex

Python Data Science at Native Code Speed

GitHub https://tuplex.cs.brown.edu

Utilize data analytics skills to solve real-world business problems using Humana’s big data

Humana-Mays-2021-HealthCare-Analytics-Case-Competition- The goal of the project is to utilize data analytics skills to solve real-world business probl

1 Dec 27, 2021

Python library for creating data pipelines with chain functional programming

PyFunctional Features PyFunctional makes creating data pipelines easy by using chained functional operators. Here are a few examples of what it can do

2.1k Jan 5, 2023

simple way to build the declarative and destributed data pipelines with python

unipipeline simple way to build the declarative and distributed data pipelines. Why you should use it Declarative strict config Scaffolding Fully type

0 Jan 26, 2022

NumPy and Pandas interface to Big Data

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar inte

3.1k Jan 5, 2023

The official repository for ROOT: analyzing, storing and visualizing big data, scientifically

About The ROOT system provides a set of OO frameworks with all the functionality needed to handle and analyze large amounts of data in a very efficien

2k Dec 29, 2022

BigDL - Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems

Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems.

1 Jan 6, 2022

Integrate bus data from a variety of sources (batch processing and real time processing).

Purpose: This is integrate bus data from a variety of sources such as: csv, json api, sensor data ... into Relational Database (batch processing and r

1 Nov 25, 2021

Streamz helps you build pipelines to manage continuous streams of data

Streamz helps you build pipelines to manage continuous streams of data. It is simple to use in simple cases, but also supports complex pipelines that involve branching, joining, flow control, feedback, back pressure, and so on.

1.1k Dec 28, 2022

This tool parses log data and allows to define analysis pipelines for anomaly detection.

logdata-anomaly-miner This tool parses log data and allows to define analysis pipelines for anomaly detection. It was designed to run the analysis wit

32 Nov 27, 2022

Building house price data pipelines with Apache Beam and Spark on GCP

This project contains the process from building a web crawler to extract the raw data of house price to create ETL pipelines using Google Could Platform services.

1 Nov 22, 2021

Data pipelines built with polars

valves Warning: the project is very much work in progress. Valves is a collection of functions for your data .pipe()-lines. This project aimes to host

14 Jan 3, 2023

PipeChain is a utility library for creating functional pipelines.

PipeChain Motivation PipeChain is a utility library for creating functional pipelines. Let's start with a motivating example. We have a list of Austra

2 Aug 7, 2022

Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Data Scientist Learning Plan Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

27 Nov 1, 2022

Lale is a Python library for semi-automated data science.

Lale is a Python library for semi-automated data science. Lale makes it easy to automatically select algorithms and tune hyperparameters of pipelines that are compatible with scikit-learn, in a type-safe fashion.

293 Dec 29, 2022

Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

2 May 26, 2022

A Pythonic introduction to methods for scaling your data science and machine learning work to larger datasets and larger models, using the tools and APIs you know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

This tutorial's purpose is to introduce Pythonistas to methods for scaling their data science and machine learning work to larger datasets and larger models, using the tools and APIs they know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

102 Nov 10, 2022
