
Tuplex: Blazing Fast Python Data Science

Website Documentation

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set. Under the hood, Tuplex is based on data-driven compilation and dual-mode processing, two key techniques that make it possible for Tuplex to provide speed comparable to a pipeline written in hand-optimized C++.

You can join the discussion on Tuplex in our Gitter community or read more about the background of Tuplex in our SIGMOD'21 paper.

Contributions welcome!

Installation

To install Tuplex, you can use the PyPI package on Linux, or the Docker container on macOS, which launches a Jupyter notebook with Tuplex preinstalled.

Docker

docker run -p 8888:8888 tuplex/tuplex
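
The container exposes Jupyter on port 8888; once it is running, open the notebook URL (including the access token) that Jupyter prints to the console.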

PyPI

pip install tuplex
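
To check that the package was installed correctly, a quick import suffices:

python3 -c "import tuplex"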

Building

Tuplex is available for macOS and Linux. The current version has been tested under macOS 10.13-10.15 and Ubuntu 18.04 and 20.04 LTS. To install Tuplex, simply install the dependencies first and then build the package.

macOS build from source

To build Tuplex, you first need several other packages, which can be easily installed via brew.

brew install llvm@9 boost boost-python3 aws-sdk-cpp pcre2 antlr4-cpp-runtime googletest gflags yaml-cpp celero
python3 -m pip install cloudpickle numpy
python3 setup.py install

Ubuntu build from source

To facilitate installing the dependencies on Ubuntu, we provide two scripts (scripts/ubuntu1804/install_reqs.sh for Ubuntu 18.04, or scripts/ubuntu2004/install_reqs.sh for Ubuntu 20.04). To build an up-to-date version of Tuplex, simply run

./scripts/ubuntu1804/install_reqs.sh
python3 -m pip install cloudpickle numpy
python3 setup.py install

Customizing the build

Besides building a pip package, cmake can also be invoked directly. To compile the package via cmake:

mkdir build
cd build
cmake ..
make -j$(nproc)

The python package corresponding to Tuplex can then be found in build/dist/python, with C++ test executables (based on googletest) in build/dist/bin.

To customize the cmake build, the following options can be passed via -D:

CMAKE_BUILD_TYPE
  values: Release (default), Debug, RelWithDebInfo, tsan, asan, ubsan
  selects the compile mode; tsan/asan/ubsan correspond to the Google Sanitizers

BUILD_WITH_AWS
  values: ON (default), OFF
  build with the AWS SDK or not; on Ubuntu this will also build the Lambda executor

GENERATE_PDFS
  values: ON, OFF (default)
  in Debug mode, output PDF files (if graphviz is installed, e.g. via brew install graphviz) for ASTs of UDFs, query plans, ...

PYTHON3_VERSION
  values: 3.6, ...
  selects the python3 version to build against, specified as major.minor; to specify the python executable directly, use the options provided by cmake

LLVM_ROOT_DIR
  values: e.g. /usr/lib/llvm-9
  specifies which LLVM installation to use

BOOST_DIR
  values: e.g. /opt/boost
  specifies which Boost installation to use; note that the python component of Boost has to be built against the python version used to build Tuplex

For example, to create a debug build which outputs PDFs use the following snippet:

cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..
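
Similarly, to pin the build to a particular LLVM installation and Python version (the path below is illustrative):

cmake -DLLVM_ROOT_DIR=/usr/lib/llvm-9 -DPYTHON3_VERSION=3.8 ..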

Example

Tuplex can be used in Python's interactive mode, in a Jupyter notebook, or by copying the code below to a file. To try it out, run the following example:

from tuplex import *
c = Context()
res = c.parallelize([1, 2, None, 4]).map(lambda x: (x, x * x)).collect()
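# note: the None row raises a TypeError inside the map UDF; Tuplex's dual-mode
# processing routes it to the exception path instead of failing the whole job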
# this prints [(1, 1), (2, 4), (4, 16)]
print(res)
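
Rows that fail inside a UDF, like the None above, can also be repaired explicitly with a resolver. A minimal sketch, assuming the resolve API described in the documentation (the resolver receives the failing input row):

from tuplex import Context

c = Context()
res = c.parallelize([1, 2, None, 4]) \
       .map(lambda x: (x, x * x)) \
       .resolve(TypeError, lambda x: (-1, -1)) \
       .collect()
print(res)  # expected: [(1, 1), (2, 4), (-1, -1), (4, 16)]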

More examples can be found here.

License

Tuplex is available under the Apache 2.0 License. To cite the paper, use:

@inproceedings{10.1145/3448016.3457244,
author = {Spiegelberg, Leonhard and Yesantharao, Rahul and Schwarzkopf, Malte and Kraska, Tim},
title = {Tuplex: Data Science in Python at Native Code Speed},
year = {2021},
isbn = {9781450383431},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3448016.3457244},
doi = {10.1145/3448016.3457244},
booktitle = {Proceedings of the 2021 International Conference on Management of Data},
pages = {1718–1731},
numpages = {14},
location = {Virtual Event, China},
series = {SIGMOD/PODS '21}
}

(c) 2017-2021 Tuplex contributors

Comments
  • installation errors

    Hi, when trying "pip install tuplex", I am getting the following errors:

    ERROR: Could not find a version that satisfies the requirement tuplex (from versions: none)
    ERROR: No matching distribution found for tuplex

    Thanks,

    opened by rubenSaro 5
  • [BUG] Tuplex fails to decode the type when calling .unique() for 311 benchmark.

    I downloaded a 311 requests dataset which contains 6 columns (one of them is "Incident Zip") in order to run the 311 benchmark (tuplex/benchmarks/311/runtuplex.py)

    When the following code is executed:

    tstart = time.time()
    df = ctx.csv(
        ",".join(perf_paths),
        null_values=["Unspecified", "NO CLUE", "NA", "N/A", "0", ""],
        type_hints={0: typing.Optional[str],
                    1: typing.Optional[str],
                    2: typing.Optional[str],
                    3: typing.Optional[str],
                    4: typing.Optional[str],
                    5: typing.Optional[str],
        },
    )
    # Do the pipeline
    df = df.mapColumn("IncidentZip", fix_zip_codes).unique()
    # Output to csv
    df.tocsv(output_path)
    job_time = time.time() - tstart
    print(json.dumps({"startupTime": startup_time, "jobTime": job_time}))
    

    I receive the following error: [2022-04-26 16:57:10.185] [global] [error] decoding of other types not yet supported...

    Initially I suspected that the issue might be that the Incident Zip column is inferred as f64, so I explicitly set the type of all columns to str, as seen in the code snippet above.

    However, the error still persists. Any ideas?

    PS: I believe the error is in the unique() method, because when I remove the call it works fine.

    Thanks in advance.

    bug 
    opened by kchasialis 4
  • Fix aggregateByKey python binding

    The python binding for aggregateByKey (in dataset.py) currently crashes due to the deprecated function get_lambda_source - update to the new function get_udf_source.
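
    For reference, a minimal pipeline exercising this binding might look as follows (a sketch; the exact argument order of aggregateByKey and the row access form should be checked against the Tuplex docs):

    from tuplex import Context

    c = Context()
    # sum the v column per key k: the first UDF combines two partial aggregates,
    # the second folds one row into the running aggregate, 0 is the initial value
    res = c.parallelize([("a", 1), ("a", 2), ("b", 3)], columns=["k", "v"]) \
           .aggregateByKey(lambda a, b: a + b,
                           lambda agg, row: agg + row["v"],
                           0,
                           ["k"]) \
           .collect()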

    opened by rahulyesantharao 4
  • [BUG] Tuplex crashes in exception resolver when take/collect() is called twice

    Tuplex crashes from a corrupted exception partition when the pipeline is executed twice. This could be an issue with partition invalidation.

    Example pipeline when it crashes:

    ds = c.parallelize([(1, "A"),(2, "a"),(3, 2)]).filter(lambda a, b: a > 1)
    ds.collect()
    ds.collect()
    
    bug 
    opened by KorlaMarch 3
  • Weld benchmarks - Python 3.

    Hello!

    I tried building Weld for Python 3 but had no success. I managed, however, to get pyweld and pygrizzly installed using python2 pip. Python 2 is deprecated and I would like to build pygrizzly and pyweld for Python 3.

    Did you manage to build pyweld and pygrizzly for Python 3?

    Thanks in advance!

    opened by kchasialis 2
  • [BUG] Is keyword crash

    Tuplex crashes when the is keyword is used with a non-boolean or non-None argument.

    Example:

    c = Context()
    c.parallelize([1, 2, 3]).filter(lambda x: x is 2).collect()
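
    Note: in CPython, x is 2 compares object identity rather than value and only appears to behave like == because of small-integer interning, so the compiled path has to emulate this semantics explicitly.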
    
    opened by bgivertz 2
  • Ubuntu 20.04 installation issues

    Hello :)

    Instead of opening a new issue I thought I would piggyback onto this one since it's very related. (Do let me know if you'd rather have me open a new one and I will do so.)

    I'm following the Ubuntu 20.04 build options specified in the readme and I can't get past the python3 setup.py install step (I'm using Python 3.8). The error I get is the following:

        ERROR: Command errored out with exit status 1:
         command: /home/gorka/dev/tuplex/venv38/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/gorka/dev/tuplex/setup.py'"'"'; __file__='"'"'/home/gorka/dev/tuplex/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps
             cwd: /home/gorka/dev/tuplex/
        Complete output (62 lines):
        running develop
        running egg_info
        writing tuplex/python/tuplex.egg-info/PKG-INFO
        writing dependency_links to tuplex/python/tuplex.egg-info/dependency_links.txt
        writing requirements to tuplex/python/tuplex.egg-info/requires.txt
        writing top-level names to tuplex/python/tuplex.egg-info/top_level.txt
        reading manifest file 'tuplex/python/tuplex.egg-info/SOURCES.txt'
        reading manifest template 'MANIFEST.in'
        warning: no files found matching '*' under directory 'tuplex/libexec'
        warning: no files found matching '*' under directory 'tuplex/tuplex/libexec'
        warning: no files found matching '*.so'
        warning: no files found matching '*.dll'
        adding license file 'LICENSE'
        writing manifest file 'tuplex/python/tuplex.egg-info/SOURCES.txt'
        running build_ext
        -- Building dev version
        CMake Error at CMakeLists.txt:9 (project):
          Running
        
           '/tmp/pip-build-env-8hwvbpaz/overlay/bin/ninja' '--version'
        
          failed with:
        
           No such file or directory
        
        
        -- Configuring incomplete, errors occurred!
        See also "/home/gorka/dev/tuplex/build/temp.linux-x86_64-3.8/CMakeFiles/CMakeOutput.log".
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/home/gorka/dev/tuplex/setup.py", line 308, in <module>
            setup(name="tuplex",
          File "/tmp/pip-build-env-wpou7eah/overlay/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
            return distutils.core.setup(**attrs)
          File "/usr/lib/python3.8/distutils/core.py", line 148, in setup
            dist.run_commands()
          File "/usr/lib/python3.8/distutils/dist.py", line 966, in run_commands
            self.run_command(cmd)
          File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
            cmd_obj.run()
          File "/tmp/pip-build-env-wpou7eah/overlay/lib/python3.8/site-packages/setuptools/command/develop.py", line 34, in run
            self.install_for_development()
          File "/tmp/pip-build-env-wpou7eah/overlay/lib/python3.8/site-packages/setuptools/command/develop.py", line 114, in install_for_development
            self.run_command('build_ext')
          File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
            self.distribution.run_command(command)
          File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
            cmd_obj.run()
          File "/tmp/pip-build-env-wpou7eah/overlay/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
            _build_ext.run(self)
          File "/usr/lib/python3.8/distutils/command/build_ext.py", line 340, in run
            self.build_extensions()
          File "/usr/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
            self._build_extensions_serial()
          File "/usr/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
            self.build_extension(ext)
          File "/home/gorka/dev/tuplex/setup.py", line 205, in build_extension
            subprocess.check_call(
          File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
            raise CalledProcessError(retcode, cmd)
        subprocess.CalledProcessError: Command '['cmake', '/home/gorka/dev/tuplex/tuplex', '-DPYTHON_EXECUTABLE=/home/gorka/dev/tuplex/venv38/bin/python', '-DCMAKE_BUILD_TYPE=Release', '-DPYTHON3_VERSION=3.8', '-GNinja']' returned non-zero exit status 1.
        configuring cmake with: cmake /home/gorka/dev/tuplex/tuplex -DPYTHON_EXECUTABLE=/home/gorka/dev/tuplex/venv38/bin/python -DCMAKE_BUILD_TYPE=Release -DPYTHON3_VERSION=3.8 -GNinja
        ----------------------------------------
    ERROR: Command errored out with exit status 1: /home/gorka/dev/tuplex/venv38/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/gorka/dev/tuplex/setup.py'"'"'; __file__='"'"'/home/gorka/dev/tuplex/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.
    

    CMakeFiles/CMakeOutput.log contains the following:

    The system is: Linux - 5.4.0-89-generic - x86_64
    

    Thank you in advance.

    Cheers, Gorka.

    Originally posted by @gorkaerana in https://github.com/tuplex/tuplex/issues/8#issuecomment-948719143

    opened by LeonhardFS 2
  • Iterator built-in functions

    1. Add the following built-in functions in UDFs (iterable can be a list, homogeneous tuple, string, range, or iterator object):
       (i) iter(iterable): returns an iterator.
       (ii) reversed(seq): seq can be a list, homogeneous tuple, string, or range object.
       (iii) enumerate(iterable[, start]): start=0 by default if not provided. Returns an iterator with yieldType=(I64, iter(iterable).yieldType).
       (iv) zip(*iterables): currently at least one iterable must be provided. Returns an iterator with yieldType=(iter(iterable_1).yieldType, iter(iterable_2).yieldType, ..., iter(iterable_N).yieldType).
       (v) next(iterator[, default]): returns the next item from the iterator. If the iterator is exhausted, returns default if provided, otherwise raises StopIteration.

      Reference: Built-in Functions -- Python 3.9.7 documentation

    2. Using unsupported types in the functions above (a dictionary as iterable or seq, a non-homogeneous tuple as iterable or seq, or a default type in a next call that differs from iterator.yieldType), or returning an iterator from a UDF, is resolved through fallback mode. Error-handling code is adapted from ASTHelper and is now refactored into IFailable.

    3. Add support for using iterators (generated from iter, enumerate or zip) as testlist in for loops (i.e. for i in iteratorType: ...).
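
    A minimal sketch of how the new built-ins could be used inside a UDF (hypothetical usage; the exact supported forms follow the PR description above):

    from tuplex import Context

    c = Context()
    # iter()/next() inside a UDF: take the first element of each tuple
    res = c.parallelize([(1, 2), (3, 4)]).map(lambda t: next(iter(t))).collect()
    print(res)  # expected: [1, 3]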

    opened by yunzhi-jake 2
  • Historyserver

    This adds support for the WebUI. It supports the previous functionality of a much older version of Tuplex, as well as new support for splitting the plan into stages, tracking stage dependencies, join operators, aggregate operators, and showing UDFs for filter operators.

    It would be great if you could do a code review and tell me what I should change.

    opened by colby-anderson 2
  • SegFault after `c = Context()`

    After installing in a venv with:

    python -m pip install --upgrade --upgrade-strategy eager tuplex
    

    ... and attempting the first example:

    python
    from tuplex import *
    Welcome to
    
      _____            _
     |_   _|   _ _ __ | | _____  __
       | || | | | '_ \| |/ _ \ \/ /
       | || |_| | |_) | |  __/>  <
       |_| \__,_| .__/|_|\___/_/\_\ 0.3.0
                |_|
        
    using Python 3.8.10 (default, May  4 2021, 00:00:00) 
    [GCC 10.2.1 20201125 (Red Hat 10.2.1-9)] on linux
     Interactive Shell mode
    >>> c = Context()
    Segmentation fault (core dumped)
    
    opened by rickhg12hs 2
  • [BUG] ResolveOperator schema mismatch

    Currently, if a resolver has a schema mismatch with its corresponding operator, the job fails. However, this should be allowed, even though it basically means triggering the fallback path!

    Bug discovered due to the random shuffling failure on CI, where tracing within the resolver triggered a schema mismatch.
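
    A minimal pipeline that could exhibit such a mismatch (a sketch, assuming the resolve API; the resolver yields a str where the operator yields an int):

    from tuplex import Context

    c = Context()
    # the resolver's output type (str) differs from the map operator's (int)
    res = c.parallelize([1, 2, 0]) \
           .map(lambda x: 10 // x) \
           .resolve(ZeroDivisionError, lambda x: "div0") \
           .collect()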

    bug 
    opened by LeonhardFS 1
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.
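
    The pattern of such a patch is roughly the following (a sketch of the member check described above):

    import os
    import tarfile

    def is_within_directory(directory, target):
        # resolve both paths and check that the target stays inside the directory
        abs_directory = os.path.abspath(directory)
        abs_target = os.path.abspath(target)
        return os.path.commonprefix([abs_directory, abs_target]) == abs_directory

    def safe_extract(tar, path="."):
        # refuse extraction if any member would escape the destination directory
        for member in tar.getmembers():
            member_path = os.path.join(path, member.name)
            if not is_within_directory(path, member_path):
                raise Exception("Attempted path traversal in tar file")
        tar.extractall(path)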

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
  • [FEATURE] Missing projection pushdown for aggregates

    Aggregates should be improved (perhaps after the struct type is enabled) to push down column selections from both the combine and aggregate UDFs. Error handling also has to be added, and possibly higher-level functions like count, mean, std, min, max, ... that provide standard aggregates.

    enhancement 
    opened by LeonhardFS 0
Owner
Tuplex: Python Data Science at Native Code Speed

Related projects

Utilize data analytics skills to solve real-world business problems using Humana’s big data

Humana-Mays-2021-HealthCare-Analytics-Case-Competition- The goal of the project is to utilize data analytics skills to solve real-world business probl

Yongxian (Caroline) Lun 1 Dec 27, 2021
Python library for creating data pipelines with chain functional programming

PyFunctional Features PyFunctional makes creating data pipelines easy by using chained functional operators. Here are a few examples of what it can do

Pedro Rodriguez 2.1k Jan 5, 2023
simple way to build the declarative and distributed data pipelines with python

unipipeline simple way to build the declarative and distributed data pipelines. Why you should use it Declarative strict config Scaffolding Fully type

aliaksandr-master 0 Jan 26, 2022
NumPy and Pandas interface to Big Data

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar inte

Blaze 3.1k Jan 5, 2023
The official repository for ROOT: analyzing, storing and visualizing big data, scientifically

About The ROOT system provides a set of OO frameworks with all the functionality needed to handle and analyze large amounts of data in a very efficien

ROOT 2k Dec 29, 2022
BigDL - Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems

Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems.

Vo Cong Thanh 1 Jan 6, 2022
Integrate bus data from a variety of sources (batch processing and real time processing).

Purpose: This is integrate bus data from a variety of sources such as: csv, json api, sensor data ... into Relational Database (batch processing and r

null 1 Nov 25, 2021
Streamz helps you build pipelines to manage continuous streams of data

Streamz helps you build pipelines to manage continuous streams of data. It is simple to use in simple cases, but also supports complex pipelines that involve branching, joining, flow control, feedback, back pressure, and so on.

Python Streamz 1.1k Dec 28, 2022
This tool parses log data and allows to define analysis pipelines for anomaly detection.

logdata-anomaly-miner This tool parses log data and allows to define analysis pipelines for anomaly detection. It was designed to run the analysis wit

AECID 32 Nov 27, 2022
Building house price data pipelines with Apache Beam and Spark on GCP

This project contains the process from building a web crawler to extract the raw data of house price to create ETL pipelines using Google Could Platform services.

null 1 Nov 22, 2021
Data pipelines built with polars

valves Warning: the project is very much work in progress. Valves is a collection of functions for your data .pipe()-lines. This project aimes to host

null 14 Jan 3, 2023
PipeChain is a utility library for creating functional pipelines.

PipeChain Motivation PipeChain is a utility library for creating functional pipelines. Let's start with a motivating example. We have a list of Austra

Michael Milton 2 Aug 7, 2022
Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Data Scientist Learning Plan Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Trung-Duy Nguyen 27 Nov 1, 2022
Lale is a Python library for semi-automated data science.

Lale is a Python library for semi-automated data science. Lale makes it easy to automatically select algorithms and tune hyperparameters of pipelines that are compatible with scikit-learn, in a type-safe fashion.

International Business Machines 293 Dec 29, 2022
Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

Thomas 2 May 26, 2022
A Pythonic introduction to methods for scaling your data science and machine learning work to larger datasets and larger models, using the tools and APIs you know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

This tutorial's purpose is to introduce Pythonistas to methods for scaling their data science and machine learning work to larger datasets and larger models, using the tools and APIs they know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

Coiled 102 Nov 10, 2022
Orchest is a browser based IDE for Data Science.

Orchest is a browser based IDE for Data Science. It integrates your favorite Data Science tools out of the box, so you don’t have to. The application is easy to use and can run on your laptop as well as on a large scale cloud cluster.

Orchest 3.6k Jan 9, 2023
A lightweight, hub-and-spoke dashboard for multi-account Data Science projects

A lightweight, hub-and-spoke dashboard for cross-account Data Science Projects Introduction Modern Data Science environments often involve many indepe

AWS Samples 3 Oct 30, 2021
Data Science Environment Setup in single line

datascienv is package that helps your to setup your environment in single line of code with all dependency and it is also include pyforest that provide single line of import all required ml libraries

Ashish Patel 55 Dec 16, 2022