BlazingSQL is a lightweight, GPU-accelerated SQL engine for Python, built on RAPIDS cuDF.

Overview

A lightweight, GPU-accelerated SQL engine built on the RAPIDS.ai ecosystem.

Get Started on app.blazingsql.com

Getting Started | Documentation | Examples | Contributing | License | Blog | Try Now

BlazingSQL is a GPU accelerated SQL engine built on top of the RAPIDS ecosystem. RAPIDS is based on the Apache Arrow columnar memory format, and cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.

BlazingSQL is a SQL interface for cuDF, with various features to support large scale data science workflows and enterprise datasets.

  • Query Data Stored Externally - a single line of code can register remote storage solutions, such as Amazon S3.
  • Simple SQL - incredibly easy to use; run a SQL query and the results are GPU DataFrames (GDFs).
  • Interoperable - GDFs are immediately accessible to any RAPIDS library for data science workloads, as shown in the sketch below.
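
A minimal sketch of that interoperability (not from the official docs; it uses only the BlazingContext calls shown in Getting Started below plus standard cuDF methods): bc.sql returns a cudf.DataFrame, so the result can be handed straight to other RAPIDS or pandas tooling.

import cudf
from blazingsql import BlazingContext

bc = BlazingContext()
bc.create_table('scores', cudf.DataFrame({'name': ['a', 'b', 'c'], 'score': [1.0, 2.5, 4.0]}))

gdf = bc.sql('SELECT name, score FROM scores WHERE score > 2')  # returns a GPU DataFrame
print(gdf.describe())   # keep working on the result with the cuDF API
pdf = gdf.to_pandas()   # or move it to pandas / the rest of the PyData ecosystem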

Try our 5-min Welcome Notebook to start using BlazingSQL and RAPIDS AI.

Getting Started

Here are two copy-and-paste reproducible BlazingSQL snippets; keep scrolling to find example Notebooks below.

Create and query a table from a cudf.DataFrame with progress bar:

import cudf

df = cudf.DataFrame()

df['key'] = ['a', 'b', 'c', 'd', 'e']
df['val'] = [7.6, 2.9, 7.1, 1.6, 2.2]

from blazingsql import BlazingContext
bc = BlazingContext(enable_progress_bar=True)

bc.create_table('game_1', df)

bc.sql('SELECT * FROM game_1 WHERE val > 4') # the query progress will be shown
key val
0 a 7.6
1 c 7.1

Create and query a table from an AWS S3 bucket:

from blazingsql import BlazingContext
bc = BlazingContext()

bc.s3('blazingsql-colab', bucket_name='blazingsql-colab')

bc.create_table('taxi', 's3://blazingsql-colab/yellow_taxi/taxi_data.parquet')

bc.sql('SELECT passenger_count, trip_distance FROM taxi LIMIT 2')
passenger_count trip_distance
0 1.0 1.1
1 1.0 0.7
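
Tables can also be created directly from files; here is a brief sketch (the path below is hypothetical, and Parquet is just one of the supported formats alongside CSV, ORC, and JSON):

from blazingsql import BlazingContext
bc = BlazingContext()

# a local file path works the same way as the S3 path above
bc.create_table('orders', '/path/to/orders.parquet')

bc.sql('SELECT COUNT(*) FROM orders')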

Examples

Notebook Title Description Try Now
Welcome Notebook An introduction to BlazingSQL Notebooks and the GPU Data Science Ecosystem. Launch on BlazingSQL Notebooks
The DataFrame Learn how to use BlazingSQL and cuDF to create GPU DataFrames with SQL and Pandas-like APIs. Launch on BlazingSQL Notebooks
Data Visualization Plug in your favorite Python visualization packages, or use GPU accelerated visualization tools to render millions of rows in a flash. Launch on BlazingSQL Notebooks
Machine Learning Learn about cuML, mirrored after the Scikit-Learn API, it offers GPU accelerated machine learning on GPU DataFrames. Launch on BlazingSQL Notebooks

Documentation

You can find our full documentation at docs.blazingdb.com.

Prerequisites

  • Anaconda or Miniconda installed
  • OS Support
    • Ubuntu 16.04/18.04 LTS
    • CentOS 7
  • GPU Support
    • Pascal or Better
    • Compute Capability >= 6.0
  • CUDA Support
    • 10.1.2
    • 10.2
  • Python Support
    • 3.7
    • 3.8

Install Using Conda

BlazingSQL can be installed with conda (miniconda, or the full Anaconda distribution) from the blazingsql channel:

Stable Version

conda install -c blazingsql -c rapidsai -c nvidia -c conda-forge -c defaults blazingsql python=$PYTHON_VERSION cudatoolkit=$CUDA_VERSION

Where $CUDA_VERSION is 10.1, 10.2, or 11.0 and $PYTHON_VERSION is 3.7 or 3.8. For example, for CUDA 10.1 and Python 3.7:

conda install -c blazingsql -c rapidsai -c nvidia -c conda-forge -c defaults blazingsql python=3.7 cudatoolkit=10.1

Nightly Version

conda install -c blazingsql-nightly -c rapidsai-nightly -c nvidia -c conda-forge -c defaults blazingsql python=$PYTHON_VERSION  cudatoolkit=$CUDA_VERSION

Where $CUDA_VERSION is 10.1, 10.2, or 11.0 and $PYTHON_VERSION is 3.7 or 3.8. For example, for CUDA 10.1 and Python 3.7:

conda install -c blazingsql-nightly -c rapidsai-nightly -c nvidia -c conda-forge -c defaults blazingsql python=3.7  cudatoolkit=10.1
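
After either install, a quick sanity check (a minimal sketch, using only the API shown in Getting Started above):

from blazingsql import BlazingContext

bc = BlazingContext()   # should report "BlazingContext ready" once the engine is up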

Build/Install from Source (Conda Environment)

This is the recommended way of building all of the BlazingSQL components and dependencies from source. It ensures that all the dependencies are available to the build process.

Stable Version

Install build dependencies

conda create -n bsql python=$PYTHON_VERSION
conda activate bsql
conda install --yes -c conda-forge spdlog=1.7.0 google-cloud-cpp=1.16 ninja
conda install --yes -c rapidsai -c nvidia -c conda-forge -c defaults dask-cuda=0.18 dask-cudf=0.18 cudf=0.18 ucx-py=0.18 ucx-proc=*=gpu python=3.7 cudatoolkit=$CUDA_VERSION
conda install --yes -c conda-forge cmake=3.18 gtest gmock cppzmq cython=0.29 openjdk=8.0 maven jpype1 netifaces pyhive tqdm ipywidgets

Where $CUDA_VERSION is 10.1, 10.2, or 11.0 and $PYTHON_VERSION is 3.7 or 3.8. For example, for CUDA 10.1 and Python 3.7:

conda create -n bsql python=3.7
conda activate bsql
conda install --yes -c conda-forge spdlog=1.7.0 google-cloud-cpp=1.16 ninja
conda install --yes -c rapidsai -c nvidia -c conda-forge -c defaults dask-cuda=0.18 dask-cudf=0.18 cudf=0.18 ucx-py=0.18 ucx-proc=*=gpu python=3.7 cudatoolkit=10.1
conda install --yes -c conda-forge cmake=3.18 gtest gmock cppzmq cython=0.29 openjdk=8.0 maven jpype1 netifaces pyhive tqdm ipywidgets

Build

The build process will check out the BlazingSQL repository and build and install it into the conda environment.

cd $CONDA_PREFIX
git clone https://github.com/BlazingDB/blazingsql.git
cd blazingsql
git checkout main
export CUDACXX=/usr/local/cuda/bin/nvcc
./build.sh

NOTE: You can do ./build.sh -h to see more build options.

$CONDA_PREFIX now has a folder for the blazingsql repository.

Nightly Version

Install build dependencies

conda create -n bsql python=$PYTHON_VERSION
conda activate bsql
conda install --yes -c conda-forge spdlog=1.7.0 google-cloud-cpp=1.16 ninja
conda install --yes -c rapidsai-nightly -c nvidia -c conda-forge -c defaults dask-cuda=0.19 dask-cudf=0.19 cudf=0.19 ucx-py=0.19 ucx-proc=*=gpu python=3.7 cudatoolkit=$CUDA_VERSION
conda install --yes -c conda-forge cmake=3.18 gtest==1.10.0=h0efe328_4 gmock cppzmq cython=0.29 openjdk=8.0 maven jpype1 netifaces pyhive tqdm ipywidgets

Where $CUDA_VERSION is 10.1, 10.2, or 11.0 and $PYTHON_VERSION is 3.7 or 3.8. For example, for CUDA 10.1 and Python 3.7:

conda create -n bsql python=3.7
conda activate bsql
conda install --yes -c conda-forge spdlog=1.7.0 google-cloud-cpp=1.16 ninja
conda install --yes -c rapidsai-nightly -c nvidia -c conda-forge -c defaults dask-cuda=0.19 dask-cudf=0.19 cudf=0.19 ucx-py=0.19 ucx-proc=*=gpu python=3.7 cudatoolkit=10.1
conda install --yes -c conda-forge cmake=3.18 gtest==1.10.0=h0efe328_4 gmock cppzmq cython=0.29 openjdk=8.0 maven jpype1 netifaces pyhive tqdm ipywidgets

Build

The build process will check out the BlazingSQL repository and build and install it into the conda environment.

cd $CONDA_PREFIX
git clone https://github.com/BlazingDB/blazingsql.git
cd blazingsql
export CUDACXX=/usr/local/cuda/bin/nvcc
./build.sh

NOTE: You can do ./build.sh -h to see more build options.

NOTE: You can perform static analysis with cppcheck with the command cppcheck --project=compile_commands.json in any of the cpp project build directories.

$CONDA_PREFIX now has a folder for the blazingsql repository.

Storage plugins

To build without the storage plugins (AWS S3, Google Cloud Storage), use the following arguments:

# Disable all storage plugins
./build.sh disable-aws-s3 disable-google-gs

# Disable AWS S3 storage plugin
./build.sh disable-aws-s3

# Disable Google Cloud Storage plugin
./build.sh disable-google-gs

NOTE: By disabling the storage plugins you do not need to previously install the AWS SDK for C++ or the Google Cloud Storage SDK (nor any of their dependencies).

Documentation

User guides and public API documentation can be found here.

Our internal code architecture documentation can be built using Sphinx.

pip install recommonmark exhale
conda install -c conda-forge doxygen
cd $CONDA_PREFIX
cd blazingsql/docs
make html

The generated documentation can be viewed in a browser at blazingsql/docs/_build/html/index.html

Community

Contributing

Have questions or feedback? Post a new GitHub issue.

Please see our guide for contributing to BlazingSQL.

Contact

Feel free to join our channel (#blazingsql) in the RAPIDS-GoAi Slack: join RAPIDS-GoAi workspace.

You can also email us at [email protected] or find out more details on BlazingSQL.com.

License

Apache License 2.0

RAPIDS AI - Open GPU Data Science

The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

Apache Arrow on GPU

The GPU version of Apache Arrow is a common API that enables efficient interchange of tabular data between processes running on the GPU. End-to-end computation on the GPU avoids unnecessary copying and converting of data off the GPU, reducing compute time and cost for high-performance analytics common in artificial intelligence workloads. As the name implies, cuDF uses the Apache Arrow columnar data format on the GPU. Currently, a subset of the features in Apache Arrow are supported.
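
As a small illustration (an assumption about typical usage, not part of the original text), cuDF can round-trip tabular data through Arrow:

import cudf

gdf = cudf.DataFrame({'x': [1, 2, 3], 'y': [0.1, 0.2, 0.3]})

arrow_table = gdf.to_arrow()                         # cuDF DataFrame -> pyarrow.Table
gdf_again = cudf.DataFrame.from_arrow(arrow_table)   # pyarrow.Table -> cuDF DataFrame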

Comments
  • Communication c++ layer

    Communication c++ layer

    We had a pretty interesting experience trying to get performance and correctness while sending all of our messages between nodes using ucx-py and dask. The single-threaded nature of Python, the fact that dask uses tornado.ioloop, and things like coroutines running at the same time while we were awaiting a ucx.send have made it really hard to troubleshoot, and the performance isn't there for us.

    We need to send and receive messages in the C++ layer to remove the issues we had. Seeing as how we have often been hasty in trying to implement UCX as fast as possible, we are going to try to be smart and slow the heck down. Hell, if it takes 3 times as long to develop and 1/2 as long to debug we will come out ahead :).

    I kind of envision a few classes like this

    
    template <typename SerializerFunction, typename BufferCommunicator>
    class SenderClass {
    public:
        SenderClass(SerializerFunction serializer, BufferCommunicator bufferCommunicator);

        // stuff like broadcast, send_to_node, ...
    };

    template <typename DeserializerFunction, typename BufferAssembler>
    class ReceiverClass {
    public:
        ReceiverClass(DeserializerFunction deserializer, BufferAssembler bufferAssembler);
    };
    
    SerializerFunc ==> f(vector<column_views>) returns a list of views to rmm buffers, plus metadata for reassembling these buffers into views
    DeserializerFunc ==> takes a list of buffers and metadata and gives us a unique_ptr to a BlazingTable
    BufferAssembler ==> collects all the buffers of a message and associates them with their metadata
    BufferCommunicator ==> can send a single buffer from one node to another and have it arrive at the other node's buffer assembler
    
    

    We can use a combination of these things to send a message and receive it on the other end with a listener that passes the buffer to the BufferAssembler; when the buffer assembler is done, the deserializer converts it to a cudf::table and metadata that we can use to add the message to the appropriate class.
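
    To make the division of responsibilities concrete, here is an illustrative sketch of that flow written in Python (all names are hypothetical; the actual proposal is the templated C++ classes above):

    def send_table(table, serialize, communicator, destination):
        # SerializerFunction: table -> (buffers, metadata)
        buffers, metadata = serialize(table)
        for i, buf in enumerate(buffers):
            # BufferCommunicator: ships one buffer at a time to the destination node
            communicator.send(destination, metadata, part_index=i, buffer=buf)

    def on_buffer_arrived(metadata, part_index, buffer, assembler, deserialize, output_cache):
        # BufferAssembler: collects the buffers of a message and tracks completeness
        assembler.add(metadata, part_index, buffer)
        if assembler.is_complete(metadata):
            # DeserializerFunction: (buffers, metadata) -> table
            table = deserialize(assembler.take(metadata), metadata)
            output_cache.add(table, metadata)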

    Design 
    opened by felipeblazing 28
  • Blazingsql cannot process large files

    Blazingsql cannot process large files

    BlazingSQL cannot process CSV files larger than 3 GB; the message "out of memory" is displayed. The number of GPUs is 4 and the total memory is 6 GB. The failure occurs whether a single GPU or multiple GPUs are used.

    If multiple GPUs are used, the following information is displayed: distributed.nanny - WARNING - Restarting worker

    question 
    opened by Wxinxiny 21
  • [REVIEW] Enabling E2E tests with null data

    [REVIEW] Enabling E2E tests with null data

    Enabling this new env flag BLAZINGSQL_E2E_TEST_WITH_NULLS, I got:

    TOTAL SUMMARY for test suite: 
    PASSED: 1839/1945
    FAILED: 106/1945
    CRASH: 0/1945
    TOTAL: 1945
    saveLog = false
    MAX DELTA: 192.0
    
    opened by rommelDB 19
  • Allow for concurrent queries from a single BlazingContext

    Allow for concurrent queries from a single BlazingContext

    Right now when you call bc.sql(), execution of the Python script halts until that function call returns with the result of the query. You used to be able to use the option return_futures, but that feature is now obsolete due to https://github.com/BlazingDB/blazingsql/pull/1289

    On the other hand https://github.com/BlazingDB/blazingsql/pull/1289 makes it easy to implement multiple concurrent queries.

    This feature request is to propose an API and user experience for multiple concurrent queries support from a single BlazingContext.

    The proposed API would be something as follows (Proposed API A):

    query0 = 'SELECT * FROM my_table where columnA > 0'
    query1 = 'SELECT * FROM my_table where columnB < 0'
    token0 = bc.sql(query0, return_token=True)  
    token1 = bc.sql(query1, return_token=True)
    result0 = bc.fetch(token0)
    result1 = bc.fetch(token1)
    

    In this case token0 and token1 would be int32s, which are actually just the queryId, and bc.fetch would halt execution until the results are available. We would also implement an optional function that would look like this: done = bc.is_query_done(token0), which would return a boolean simply indicating if the query is done.
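
    For illustration, a sketch of how Proposed API A might be used (hypothetical, since the API is only a proposal here; query0, query1 and bc are as defined above):

    import time

    token0 = bc.sql(query0, return_token=True)
    token1 = bc.sql(query1, return_token=True)

    # optionally poll while doing other work
    while not (bc.is_query_done(token0) and bc.is_query_done(token1)):
        time.sleep(0.1)

    result0 = bc.fetch(token0)  # would return immediately once the query is done
    result1 = bc.fetch(token1)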

    Other ways we could do this are: Proposed API B:

    token0 = bc.async_sql(query)
    is_done = bc.async_sql(token0, get_status=True)  #this is the optional is_query_done API
    result = bc.async_sql(token0)  # here its the same API, but since we are passing in an int instead of a string we would know that we are getting the result
    

    Proposed API C:

    token0 = bc.sql(query, return_token=True)
    is_done = bc.sql(token0, get_status=True)  #this is the optional is_query_done API
    result = bc.sql(token0)  # here its the same API, but since we are passing in an int instead of a string we would know that we are getting the result
    

    Feel free to propose other APIs.

    Internally, this would just use the APIs that are now part of https://github.com/BlazingDB/blazingsql/pull/1289, which allow us to start a query, check its status and get the results. When multiple queries are running at the same time, each query has its own graph, and each graph generates compute tasks. The compute tasks are then processed by the executor as resources allow. Right now the tasks would be processed FIFO (with a certain amount of parallelism depending on resources and configuration). Eventually we can set prioritization policies for which tasks get done first. For example, tasks from the first query to start are given priority, or tasks which are most likely to reduce memory pressure are prioritized, etc.

    opened by wmalpica 19
  • [REVIEW] Fix `CC`/`CXX` variables in CI

    [REVIEW] Fix `CC`/`CXX` variables in CI

    This PR adds CC, CXX, and CUDAHOSTCXX entries to the build.script_env section of the conda recipe, so that those environment variables get passed through the build.sh script and ultimately to CMake. This enables CMake to use the correct versions of gcc and g++ when compiling.

    Additionally, it includes some fixes for the upstream cudf changes in https://github.com/rapidsai/cudf/pull/8142

    opened by ajschmidt8 16
  • Implement a prototype to a create table from other RDBMS

    Implement a prototype to a create table from other RDBMS

    Proposal 1 Direct approach using only create table semantic:

    bc.create_table("dept_emp", "mysql://lucho:admin@localhost:3306/sampledb/dept_emp")
    bc.create_table("titles", "postgres://luis:12345@localhost:3306/testdb/subchema/titles")
    

    Proposal 2 2 step approach, similar to what we have with storage registration:

    bc.mysql("mysqldb1", "mysql://lucho:admin@localhost:3306/sampledb")
    bc.create_table("dept_emp", "mysqldb1/dept_emp")
    
    bc.postgres("pgdb1", "postgres://luis:12345@localhost:3306/testdb")
    bc.create_table("titles", "pgdb1/subchema/titles")
    

    cc @williamBlazing @felipeblazing @rommelDB

    feature request 
    opened by aucahuasi 15
  • [BUG] parallel_for failed: cudaErrorIllegalAddress: an illegal memory access was encountered

    [BUG] parallel_for failed: cudaErrorIllegalAddress: an illegal memory access was encountered

    Describe the bug Crash when using example from: https://blog.blazingdb.com/data-visualization-with-blazingsql-12095862eb73

    Steps/Code to reproduce bug: run the attached sample code (s3-test.py.txt).

    Expected behavior No illegal memory access exception.

    Environment overview (please complete the following information)

    • Environment location: Bare metal; conda; Python 3.7.7 (default, Mar 23 2020, 22:36:06) [GCC 7.3.0] :: Anaconda, Inc. on linux; Debian 10; CUDA 10.2

    • Method of cuDF install:

    conda install -c blazingsql/label/cuda10.2 -c blazingsql -c rapidsai -c nvidia -c conda-forge -c defaults blazingsql python=3.7

    Environment details PATH=/opt/miniconda3/bin:/opt/miniconda3/condabin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/usr/local/go/bin

    Additional context Code attached. Other tests using blazingsql worked fine on this box.

    Output:

    listening: tcp://*:22758
    2020-06-14T15:56:41Z|-78920688|TRACE|deregisterFileSystem: filesystem authority not found
    CacheDataLocalFile: /tmp/.blazing-temp-D63WqK6ZgzRBOMd0kxS4CzTDNC69hqAn1vlzzPGIjU8ijs78nLFqpShVKo8Qkdmm.orc
    terminate called after throwing an instance of 'thrust::system::system_error'
      what(): parallel_for failed: cudaErrorIllegalAddress: an illegal memory access was encountered
    distributed.nanny - WARNING - Restarting worker
    BlazingContext ready
    distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing

    After the crash, nvidia-smi shows the following, and the main python process is hung:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  TITAN V             On   | 00000000:01:00.0 Off |                  N/A |
    | 29%   43C    P8    26W / 250W |    640MiB / 12066MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0     31262      C   python                                       627MiB |
    +-----------------------------------------------------------------------------+
    
    

    The first time I ran it, it created a number of .orc files in /tmp before crashing with the above error. Another time it gave:

    listening: tcp://*:22170
    BlazingContext ready
    2020-06-14T16:59:14Z|-682139984|TRACE|deregisterFileSystem: filesystem authority not found
    distributed.nanny - WARNING - Restarting worker
    distributed.nanny - WARNING - Worker process still alive after 3 seconds, killing
    Unable to start CUDA Context
    Traceback (most recent call last):
      File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/dask_cuda/initialize.py", line 108, in dask_setup
        numba.cuda.current_context()
      File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 212, in get_context
        return _runtime.get_or_create_context(devnum)
      File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 138, in get_or_create_context
        return self._get_or_create_context_uncached(devnum)
      File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 153, in _get_or_create_context_uncached
        return self._activate_context_for(0)
      File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/devices.py", line 169, in _activate_context_for
        newctx = gpu.get_primary_context()
      File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 529, in get_primary_context
        driver.cuDevicePrimaryCtxRetain(byref(hctx), self.id)
      File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 295, in safe_cuda_api_call
        self._check_error(fname, retcode)
      File "/opt/miniconda3/envs/py37/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py", line 330, in _check_error
        raise CudaAPIError(retcode, msg)
    numba.cuda.cudadrv.driver.CudaAPIError: [304] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_OPERATING_SYSTEM
    pure virtual method called
    terminate called without an active exception

    Edit: Tried accessing the files locally instead of from S3 and reproduced the same error. As soon as memory fills up, or after several ORC files are created (it varies), it gets an illegal memory access or the worker dies. If the worker restarts successfully it does not appear to process anything. Using the non-dask/cluster version of BlazingContext, it says 'Killed' as soon as it runs out of memory on the GPU. Processing only one input file works fine as it does not run out of memory.

    bug 
    opened by threedliteguy 15
  • [BUG]  b'In function ddlCreateTableService: cannot create the table: Could not create table'

    [BUG] b'In function ddlCreateTableService: cannot create the table: Could not create table'

    (rapids_blazing) sh-4.2$ conda list | grep blazing
    blazingdb-toolchain        0.4.0  py37hf484d3e_0    blazingsql
    blazingsql-calcite         0.4.0  py37_0            blazingsql
    blazingsql-communication   0.4.0  py37_80           blazingsql
    blazingsql-io              0.4.0  py37_31           blazingsql
    blazingsql-orchestrator    0.4.0  py37_19           blazingsql
    blazingsql-protocol        0.4.0  py37_25           blazingsql
    blazingsql-python          0.4.0  cuda10.0_py37_14  blazingsql/label/cuda10.0
    blazingsql-ral             0.4.0  cuda10.0_py37_5   blazingsql/label/cuda10.0

    Yesterday I was able to create a table based on a cudf DataFrame, but today I'm having some errors. I have already recreated the whole instance and repeated all the steps, but I'm facing errors similar to the following: b'In function ddlCreateTableService: cannot create the table: Could not create table'

    There is no more information. Any ideas how I can trace the error? In addition, it seems that in version 0.4.2 there will be some changes in how BlazingContext launches processes; could it be related to this? When will this release be conda installable?

    If I execute BlazingContext() again and try to create the table I can get two different kinds of errors:

    1) Already connected to the Orchestrator
    b'In function ddlCreateTableService: cannot create the table: Connection to server failed.'

    2)
    WARNING: blazingsql-orchestrator was not automatically started, it's probably already running
    WARNING: blazingsql-engine was not automatically started, it's probably already running
    WARNING: blazingsql-algebra was not automatically started, it's probably already running
    Already connected to the Orchestrator
    Unexpected error on create_table, can only concatenate str (not "tuple") to str

    bug 
    opened by ivenzor 15
  • [REVIEW] Implement string REPLACE

    [REVIEW] Implement string REPLACE

    This PR:

    • Implements the REPLACE operator for string columns using cudf::strings::replace
    • Adds a new end-to-end test in stringsTests.py, and updates the runTest.py
    • Refactors the removal of the string encapsulation characters (i.e., the single quotes in LIKE '%the%') in several parts of LogicalProjection.cpp to use a string utility function

    If the implementation looks fine, I'll push the new parquet file to https://github.com/BlazingDB/blazingsql-testing-files and update the CHANGELOG to unblock gpuCI.

    This closes https://github.com/BlazingDB/blazingsql/issues/1175

    • [x] passes e2e tests locally
    • [x] PR to update testing files https://github.com/BlazingDB/blazingsql-testing-files/pull/1
    from pyspark.sql import SparkSession
    from blazingsql import BlazingContext
    import pandas as pd


    spark = SparkSession.builder \
        .master("local") \
        .getOrCreate()

    bc = BlazingContext()


    df = pd.DataFrame({
        "a": ["Felipe", "William", "Rodrigo"],
        "b": [2, 4, 6],
        "c": ["2020-11-20", "2020-11-19", "2020-11-18"]
    })

    bc.create_table("df", df)
    sdf = spark.createDataFrame(df)
    sdf.createOrReplaceTempView("df")

    query = """
    SELECT
        a,
        REPLACE(a, 'i', '##') as a_new,
        c,
        REPLACE(c, '2020', '1999') as c_new
    FROM df
    """

    spark.sql(query).show()

    print(bc.explain(query))
    print(bc.sql(query))
    +-------+---------+----------+----------+
    |      a|    a_new|         c|     c_new|
    +-------+---------+----------+----------+
    | Felipe|  Fel##pe|2020-11-20|1999-11-20|
    |William|W##ll##am|2020-11-19|1999-11-19|
    |Rodrigo| Rodr##go|2020-11-18|1999-11-18|
    +-------+---------+----------+----------+
    
    LogicalProject(a=[$0], a_new=[REPLACE($0, 'i', '##')], c=[$1], c_new=[REPLACE($1, '2020', '1999')])
      BindableTableScan(table=[[main, df]], projects=[[0, 2]], aliases=[[a, a_new, c, c_new]])
    
             a      a_new           c       c_new
    0   Felipe    Fel##pe  2020-11-20  1999-11-20
    1  William  W##ll##am  2020-11-19  1999-11-19
    2  Rodrigo   Rodr##go  2020-11-18  1999-11-18
    
    opened by beckernick 13
  • Barriers Required for Distributed execution.

    Barriers Required for Distributed execution.

    Right now Kernels handle distribution and ensuring completeness so that they continue when they have to communicate. Here is an example of what that looks like in aggregation.

    Below we are iterating through batches that this kernel gets from its input cache, partitioning them and sending each node its corresponding partition. We store a count of how many partitions we sent to each node and how many we kept for ourselves.

    while (input.wait_for_next()) {
        auto batch = input.next();
        CudfTableView batch_view = batch->view();
        std::vector<CudfTableView> partitioned;
        std::unique_ptr<CudfTable> hashed_data; // Keep table alive in this scope
        if (batch_view.num_rows() > 0) {
            std::vector<cudf::size_type> hased_data_offsets;
            std::tie(hashed_data, hased_data_offsets) = cudf::hash_partition(batch->view(), columns_to_hash, num_partitions);
            // the offsets returned by hash_partition will always start at 0, which is a value we want to ignore for cudf::split
            std::vector<cudf::size_type> split_indexes(hased_data_offsets.begin() + 1, hased_data_offsets.end());
            partitioned = cudf::split(hashed_data->view(), split_indexes);
        } else {
            //  copy empty view
            for (auto i = 0; i < num_partitions; i++) {
                partitioned.push_back(batch_view);
            }
        }
    
        ral::cache::MetadataDictionary metadata;
        for(int i = 0; i < this->context->getTotalNodes(); i++ ){
            auto partition = std::make_unique<ral::frame::BlazingTable>(partitioned[i], batch->names());
            if (this->context->getNode(i) == self_node){
                this->output_.get_cache()->addToCache(std::move(partition),"",true);
                node_count[self_node.id()]++;
            } else {
                node_count[this->context->getNode(i).id()]++;
                output_cache->addCacheData(std::make_unique<ral::cache::GPUCacheDataMetaData>(std::move(partition), metadata),"",true);
            }
        }
        batch_count++;
    }
    
    

    After this code executes we send each node a count of how many partitions we sent them.

    
    auto self_node = ral::communication::CommunicationData::getInstance().getSelfNode();
    auto nodes = context->getAllNodes();
    std::string worker_ids = "";
    
    
    for(std::size_t i = 0; i < nodes.size(); ++i) {
        if(!(nodes[i] == self_node)) {
            ral::cache::MetadataDictionary metadata;
            messages_to_wait_for.push_back(metadata.get_values()[ral::cache::QUERY_ID_METADATA_LABEL] + "_" +
                                    metadata.get_values()[ral::cache::KERNEL_ID_METADATA_LABEL] +	"_" +
                                    metadata.get_values()[ral::cache::WORKER_IDS_METADATA_LABEL]);
            this->query_graph->get_output_cache()->addCacheData(
                std::unique_ptr<ral::cache::GPUCacheData>(new 
                   ral::cache::GPUCacheDataMetaData(ral::utilities::create_empty_table({}, {}), metadata)),"",true);
        }
    }
    
    
    

    Then we collect all of the partition counts from each worker node. After this we sum them up and wait for our output cache to have that many partitions before we can say this kernel is finished.

    
    auto self_node = ral::communication::CommunicationData::getInstance().getSelfNode();
    int total_count = node_count[self_node.id()];
    for (auto message : messages_to_wait_for){
        auto meta_message = this->query_graph->get_input_cache()->pullCacheData(message);
        total_count += std::stoi(static_cast<ral::cache::GPUCacheDataMetaData *>(meta_message.get())->getMetadata().get_values()[ral::cache::PARTITION_COUNT]);
    }
    this->output_cache()->wait_for_count(total_count);
    
    

    We want to abstract away a few of the things that are happening here. We are often following this pattern of spreading data out and then there's a barrier before we can continue. We want to remove this code from the kernel run function itself and have a more generic way of expressing it.

    As we discuss and implement the movement towards scheduling tasks to be run, we need primitives that can do things like:

    • create a broadcast to all and expect broadcast from all primitive
    • create a method for preventing tasks from either being scheduled or run by the scheduler until some kind of condition is met (e.g. wait_for_count but disassociated from the actual run function so it is something that can be "injected" preferably through something like composition).
    Design 
    opened by felipeblazing 13
  • [REVIEW] fix latest cudf dependencies

    [REVIEW] fix latest cudf dependencies

    This PR contains fixes for build issues (cc @romulo-auccapuclla). It also contains fixes for the new Arrow 4.0.1 API (due to cudf-nightly) and a fix due to https://github.com/rapidsai/cudf/pull/8692 (related to the conda env name). Note: something that worries me is that HiveFileTest lately crashes randomly with a std::bad_alloc (query 01 CSV, which was commented out).

    opened by Christian8491 11
  • Bump calcite-core from 1.23.0 to 1.32.0 in /algebra/blazingdb-calcite-core

    Bump calcite-core from 1.23.0 to 1.32.0 in /algebra/blazingdb-calcite-core

    Bumps calcite-core from 1.23.0 to 1.32.0.

    Commits
    • 413eded [CALCITE-5275] Release Calcite 1.32.0
    • 57aafa3 Cosmetic changes to release notes
    • 2624925 [CALCITE-5262] Add many spatial functions, including support for WKB (well-kn...
    • 479afa6 [CALCITE-5278] Upgrade Janino from 3.1.6 to 3.1.8
    • 1167b12 [CALCITE-5270] JDBC adapter should not generate 'FILTER (WHERE)' in Firebolt ...
    • 89c940c [CALCITE-5241] Implement CHAR function for MySQL and Spark, also JDBC '{fn CH...
    • d20fd09 [CALCITE-5274] Improve DocumentBuilderFactory in DiffRepository test class by...
    • 6302e6f [CALCITE-5277] Make EnumerableRelImplementor stashedParameters order determin...
    • baeecc8 [CALCITE-5251] Support SQL hint for Snapshot
    • ba80b91 [CALCITE-5263] Improve XmlFunctions by using an XML DocumentBuilder
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies java 
    opened by dependabot[bot] 0
  • [BUG] Cannot import BlazingContext when processor type unknown

    [BUG] Cannot import BlazingContext when processor type unknown

    Describe the bug Cannot import BlazingContext when processor type unknown.

    Steps/Code to reproduce bug Code and output from ipython (personal info hidden).

    In [1]: from blazingsql import BlazingContext
    ---------------------------------------------------------------------------
    FileNotFoundError                         Traceback (most recent call last)
    <ipython-input-7-0b19b5b41f48> in <module>
    ----> 1 from blazingsql import BlazingContext
    
    ~/miniconda3/envs/blazingsql/lib/python3.7/site-packages/blazingsql/__init__.py in <module>
          1 from pyblazing.apiv2 import S3EncryptionType
          2 from pyblazing.apiv2 import DataType
    ----> 3 from pyblazing.apiv2.context import BlazingContext
          4
          5 from cio import getProductDetailsCaller
    
    ~/miniconda3/envs/blazingsql/lib/python3.7/site-packages/pyblazing/apiv2/context.py in <module>
        105         )
        106
    --> 107 jpype.startJVM("-ea", convertStrings=False, jvmpath=jvm_path)
        108 # jpype.startJVM()
        109
    
    ~/miniconda3/envs/blazingsql/lib/python3.7/site-packages/jpype/_core.py in startJVM(*args, **kwargs)
        225     try:
        226         _jpype.startup(jvmpath, tuple(args),
    --> 227                        ignoreUnrecognized, convertStrings, interrupt)
        228         initializeResources()
        229     except RuntimeError as ex:
    
    FileNotFoundError: [Errno 2] JVM DLL not found: /home/{my_username}/miniconda3/envs/blazingsql/lib/server/libjvm.so
    
    
    In [2]: !uname -p
    unknown
    

    Expected behavior Should be imported without any errors.

    Environment overview (please complete the following information)

    • Environment location: Bare-metal
    • Method of BlazingSQL install: conda
    • BlazingSQL Version
    BlazingSQL version (git hash): 13618d177a37bd34bb20ac832fb8a14f8243ff5c
    BlazingSQL branch name: HEAD
    BlazingSQL branch tag: v21.08.02
    BlazingSQL build id: 0
    BlazingSQL compiler version: GNU /usr/local/gcc9/bin/g++ 9.4.0
    BlazingSQL cuda flags: -Xcompiler -Wno-parentheses -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_75,code=compute_75 --expt-extended-lambda --expt-relaxed-constexpr -Werror=cross-execution-space-call -Xcompiler -Wall,-Wno-error=deprecated-declarations --default-stream=per-thread -DHT_DEFAULT_ALLOCATOR
    BlazingSQL Operating system kernel: Linux-5.4.0-1054-aws
    BlazingSQL Operating system architecture: x86_64
    BlazingSQL Linux Operating system release: NAME=CentOS Linux|VERSION=7 (Core)|ID=centos|ID_LIKE=rhel fedora|VERSION_ID=7|PRETTY_NAME=CentOS Linux 7 (Core)|ANSI_COLOR=031|CPE_NAME=cpe:/o:centos:centos:7|HOME_URL=https://www.centos.org/|BUG_REPORT_URL=https://bugs.centos.org/|CENTOS_MANTISBT_PROJECT=CentOS-7|CENTOS_MANTISBT_PROJECT_VERSION=7|REDHAT_SUPPORT_PRODUCT=centos|REDHAT_SUPPORT_PRODUCT_VERSION=7
    

    ----For BlazingSQL Developers---- Suspected source of the issue https://github.com/BlazingDB/blazingsql/blob/branch-21.08/pyblazing/pyblazing/apiv2/context.py#L70

    machine_processor = platform.processor()
    
    if machine_processor in ("x86_64", "x64"):
        machine_processor = "amd64"
    

    When uname -p is unknown, platform.processor() returns '', thus machine_processor is empty, which leads to a wrong JVM lib path.
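
    A possible workaround sketch (an assumption, not the project's actual fix) would be to fall back to platform.machine() when platform.processor() comes back empty:

    import platform

    # platform.processor() returns '' when `uname -p` reports "unknown"
    machine_processor = platform.processor() or platform.machine()

    if machine_processor in ("x86_64", "x64"):
        machine_processor = "amd64"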

    bug ? - Needs Triage 
    opened by callofdutyops 2
  • [BUG] app.blazing.com website not reachable

    [BUG] app.blazing.com website not reachable

    Describe the bug Not able to open any app on the app deployment

    https://app.blazingsql.com/jupyter/user-redirect/lab/workspaces/auto-b/tree/Welcome_to_BlazingSQL_Notebooks/welcome.ipynb

    Steps/Code to reproduce bug Go to browser and try to open https://app.blazingsql.com Error: Site could not be reached

    Expected behavior Should be able to open the website

    Other design considerations It is very difficult to set up BlazingSQL on Google Colab; it is giving me too many version compatibility issues. If someone can help me with that, it would work as well.

    bug ? - Needs Triage 
    opened by shailee-m 1
  • Bump liquibase-core from 3.6.2 to 4.8.0 in /algebra/blazingdb-calcite-application

    Bump liquibase-core from 3.6.2 to 4.8.0 in /algebra/blazingdb-calcite-application

    Bumps liquibase-core from 3.6.2 to 4.8.0.

    Release notes

    Sourced from liquibase-core's releases.

    v4.8.0

    Liquibase 4.8.0 release

    Please report any issues to https://github.com/liquibase/liquibase/issues.

    Notable Changes

    Liquibase 4.8.0 introduces the following functionality:

    • The init hub subcommand that connects your local Liquibase activity to Liquibase Hub and sets up the Liquibase environment to use Liquibase Hub. [DAT-8769]

    Note: For more information, see init hub and Getting Started with Liquibase Hub.

    • [PRO] The sqlcmd utility support to process complex SQL for MSSQL Server. Liquibase provides the liquibase.sqlcmd.conf file to pass arguments to your executor when running Liquibase Pro. [DAT-7447]

    Note: For more information, see Using the SQLCMD integration and runWith attribute with Liquibase Pro and MSSQL Server.

    • Changes to the behavior of the XML parser, which no longer allows referencing external DTD files for security reasons. If you use externally defined entities or any other potentially insecure XML feature in your changelogs, set liquibase.secureParsing=false. [PR#2384] [LB-2218]

    Note: For more information about the ways to set the parameter, see Command Parameters.

    • The upgrade of the postgresql (from 42.2.12 to 42.3.2) and h2 (from 2.0.206 to 2.1.210) drivers that Liquibase includes in the installation package. If you use those drivers and upgrade an existing Liquibase installation, remove the earlier versions of drivers from the LIQUIBASE_HOME/lib directory.

    Enhancements

    • Implemented the SimpleObjectConstructor interface for DB2 on z/OS [DAT-8580]
    • Included the CLI instructions on how to use the properties file with a nonstandard name when running the init project subcommand [DAT-9041]
    • Improved the output message for init start-h2 when the H2 database driver is specified, but there is no connection detected [DAT-8992]
    • Added validation errors for the enableCheckConstraint, disableCheckConstraint, dropPackage, dropPackageBody Change Types [DAT-9017]
    • [PR#2367] [Mike Olivas] Added example rollback scripts to the example-changelog.sql file [LB-2220]
    • [PR#1648] [Daniel Gray] Improved the exception error message for the customChange node with no class attribute [LB-1144]
    • [PR#2222] [msimko81] Added the offline mode support for the rollback-sql <tag> operation [LB-2198]
    • [PR#2273] [Tsvi Zandany] Added the autocomplete quality checks commands for macOS
    • [PR#2308] [Valentin Blistin] Added the close method for the ClassLoaderResourceAccessor class [LB-2205]

    Fixes

    ... (truncated)

    Changelog

    Sourced from liquibase-core's changelog.

    Liquibase Core Changelog

    Changes in version 4.8.0 (2022.02.23)

    Notable Changes

    Liquibase 4.8.0 introduces a built-in SQLCMD integration that allows you to specify the sqlcmd custom executor with the runWith parameter to process complex SQL for MSSQL Server. Liquibase provides the liquibase.sqlcmd.conf file to pass arguments to your executor when running Liquibase Pro.

    For new and existing Liquibase Hub users, Liquibase 4.8.0 introduces the init hub command, used in Hub’s Getting Started on-boarding. Users can get defaults and changelog files setup, working, and registered to Hub with just this one command.

    Enhancements

    • Implemented the SimpleObjectConstructor interface for DB2 on z/OS [DAT-8580]
    • Implemented the init hub command to complete Liquibase Hub onboarding
    • Included the CLI instructions on how to use the properties file with a nonstandard name when running the init project subcommand [DAT-9041]
    • Added to init start-h2 a clearer message when the H2 database driver is specified, but there is no connection detected. [DAT-8992]
    • Added validation errors for the enableCheckConstraint, disableCheckConstraint, dropPackage, dropPackageBody Change Types [DAT-9017]
    • [PR#2367] [Mike Olivas] Added example rollback scripts to the example-changelog.sql file [LB-2220]
    • [PR#1648] [Daniel Gray] Improved the exception error message for the customChange node with no class attribute [LB-1144]
    • [PR#2222] [msimko81] Added the offline mode support for the rollback-sql operation [LB-2198]

    Fixes

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies java 
    opened by dependabot[bot] 0
  • FileNotFoundError: [Errno 2] No such file or directory: 'blazingsql-orchestrator': 'blazingsql-orchestrator'

    FileNotFoundError: [Errno 2] No such file or directory: 'blazingsql-orchestrator': 'blazingsql-orchestrator'

    When I am trying to run BlazingSQL in Google Colab I am getting the following error. Link of Colab notebook: https://blog.blazingdb.com/blazingsql-rapids-ai-now-free-on-google-colab-b8646f1ea948

    <module 'subprocess' from '/usr/lib/python3.7/subprocess.py'>

    FileNotFoundError                         Traceback (most recent call last)
    in ()
          8 import subprocess
          9 print(subprocess)
    ---> 10 subprocess.Popen(['blazingsql-orchestrator', '9100', '8889', '127.0.0.1', '8890'], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
         11 subprocess.Popen(['java', '-jar', '/usr/local/lib/blazingsql-algebra.jar', '-p', '8890'])
         12 import pyblazing.apiv2.context as cont

    1 frames
    /usr/lib/python3.7/subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, start_new_session)
       1549             if errno_num == errno.ENOENT:
       1550                 err_msg += ': ' + repr(err_filename)
    -> 1551             raise child_exception_type(errno_num, err_msg, err_filename)
       1552         raise child_exception_type(err_msg)
       1553

    FileNotFoundError: [Errno 2] No such file or directory: 'blazingsql-orchestrator': 'blazingsql-orchestrator'

    bug ? - Needs Triage 
    opened by SoumyaB57 0
Releases(v21.08.00)
  • v21.08.00(Aug 16, 2021)

    Improvements

    • Update ucx-py versions to 0.21
    • return ok for filesystems
    • Setting up default value for max_bytes_chunk_read to 256 MB

    Bug Fixes

    • Fix build due to changes in rmm device buffer
    • Fix reading decimal columns from orc file
    • Fix CC/CXX variables in CI
    • Fix latest cudf dependencies
    • Fix concat suite E2E test for nested calls
    • Fix for GCS credentials from filepath
    • Fix decimal support using float64
    • Fix build issue with thrust package
    Source code(tar.gz)
    Source code(zip)
  • v21.06.00(Aug 16, 2021)

    Note new versioning system from Major.Minor to Year.Month. Previous version was 0.19.

    New Features

    • Limited support of unbounded partitioned windows
    • Support for CURRENT_DATE, CURRENT_TIME and CURRENT_TIMESTAMP
    • Support for right outer join
    • Support for DURATION type
    • Support for IS NOT FALSE condition
    • Support ORDERing by null values
    • Support for multiple columns inside COUNT() statement

    Improvements

    • Support for concurrency in E2E tests
    • Better Support for unsigned types in C++ side
    • Folder refactoring related to caches, kernels, execution_graph, BlazingTable
    • Improve data loading when the algebra contains only BindableScan/Scan and Limit
    • Enable support for spdlog 1.8.5
    • Update RAPIDS version references

    Bug Fixes

    • Fix IS NOT DISTINCT FROM with joins
    • Fix wrong results from timestampdiff/add
    • Fixed build issues due to cudf aggregation API change
    • Comparing param set to true for e2e
    • Fixed provider unit_tests
    • Fix orc statistic building
    • Fix Decimal/Fixed Point issue
    • Fix for max_bytes_chunk_read param to csv files
    • Fix ucx-py versioning specs
    • Reading chunks of max bytes for csv files
    Source code(tar.gz)
    Source code(zip)
  • v0.19.0(Apr 21, 2021)

    New Features

    • New API that supports concurrent queries, by starting a query and obtaining a token, and then retrieving the result with that token.
    • Support for string CONCAT using the CONCAT keyword, instead of '||'.
    • New API to get the physical execution plan: bc.explain(query, detail = True)
    • Support for querying PostgreSQL tables
    • New documentation page

    Improvements

    • Improvements and expansion to the end-to-end testing framework, including adding testing for data with nulls
    • Improved performance of joins by adding a timeout to the concatenating CacheMachine
    • Improved kernel row output estimation

    Bug Fixes

    • Fixed bugs in uninitialized variables in orc metadata and improvements to handling the parseMetadata exceptions
    • Fixed bugs in handling nulls in case conditions with strings
    • Fixed issue with deleting allocated host memory
    • Fixed issues in capturing error messages from exceptions
    • Fixed bug when there are no projects in a BindableTableScan
    • Fixed issues from cuda when freeing pinned memory
    • Fixed bug in DistributeAggregationKernel where the wrong columns were being hashed
    • Fixed bug with empty row group ids for parquet
    • Fixed issues with int64 literal values
    • Fixed issue when CAST was applied to a literal
    • Fixed bug when getting ORC metadata for decimal type
    • Fixed bug with substrings with nulls
    • Fixed support for minus unary operator
    • Fixed bug with calculating number of batches in BindableTableScan
    • Fixed bug with full outer join when both tables contained nulls
    • Fixed bug with COUNT DISTINCT
    • Fixed issue with columns aliases when there was a Join operation
    • Fixed issue with python side exceptions
    • Fixed various issues due to changes in cudf or other dependencies

    Window Functions (Experimental)

    This release now provides limited Window Functions support. Window Functions that have the partition by clause support the following aggregations:

    • MIN
    • MAX
    • COUNT
    • SUM
    • AVG
    • ROW_NUMBER
    • LEAD
    • LAG

    Window Functions that do not have a partition by clause and have a bounded window frame using ROWS BETWEEN (the window frame does not use the keyword UNBOUNDED) support the following aggregations:
    • MIN
    • MAX
    • COUNT
    • SUM
    • AVG

    At this moment, window frames using the keywords UNBOUNDED and CURRENT ROW don't fully work.
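
    As an illustration, a hypothetical query (assuming a table t with columns grp and val registered in a BlazingContext bc) using a window function with a partition by clause:

    result = bc.sql("""
        SELECT
            grp,
            val,
            SUM(val) OVER (PARTITION BY grp) AS total_per_grp,
            ROW_NUMBER() OVER (PARTITION BY grp ORDER BY val) AS rn
        FROM t
    """)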

    Deprecated Features

    • Disabled support for outer joins with inequalities
    Source code(tar.gz)
    Source code(zip)
  • v0.18.0(Feb 26, 2021)

    New SQL Functions

    The following SQL commands are now supported:

    • REGEXP_REPLACE
    • INITCAP

    New Features

    • New centralized task executor for all query execution
    • New pinned memory buffer pool for improved performance in communication
    • New host memory buffer pool for improved performance in caching data to system memory
    • Support for UCX communications which enables usage of high performance communication hardware such as using InfiniBand
    • Creating table from ORC files now collects metadata from ORC files and can perform predicate pushdown on metadata
    • Progress bar when executing queries
    • Added ability to try to retry tasks when getting out of memory errors
    • Added ability to get maximum gpu memory used

    Improvements

    • Improved support for concurrent queries
    • Improvements to query execution logs
    • Added/improved communication logs
    • Added ability to disable logs
    • Improved storage plugin output messages
    • Improved support for creating tables from JSON files

    Bug Fixes

    • Fixed distribution so that it evenly distributes data loading based on row groups
    • Fixed cython exception handling
    • Support FileSystems (GS, S3) when extension of the files are not provided
    • Fixed issue when creating tables from a local dir relative path
    • Misc bug fixes

    Codebase improvements

    • Code base clean up, improved code organization and refactoring
    • No longer depending on gtest for runtime
    • Reduced number of compilation warnings
    Source code(tar.gz)
    Source code(zip)
  • v0.17.0(Dec 14, 2020)

    New SQL Functions

    The following SQL commands are now supported:

    • TO_DATE / TO_TIMESTAMP
    • DAYOFWEEK
    • TRIM / LTRIM / RTRIM
    • LEFT / RIGHT
    • UPPER / LOWER
    • REPLACE
    • REVERSE

    New Features

    • New communications architecture with support for both TCP and UCX (UCX support is in beta)
    • Allow to create tables from compressed text delimited files
    • Allow to create tables off of Hive partitioned folder structure, where BlazingSQL will infer columns and types.
    • Added powerPC building script and instructions
    • Added local logging directory option to BlazingContext to help resolve logging file permission issues
    • Added option to read csv files in chunks
    • Logs are now configurable to have max size and be rotated

    Improvements

    • Added Apache Calcite rule for window functions. (Window functions not supported yet)
    • Add validation for the kwargs when BlazingContext.create_table API is called
    • Added validation for s3 buckets
    • Added scheduler file support for e2e testing framework
    • Improved how sampling is done for ORDER BY
    • Several changes to keep up with cuDF API changes
    • Remove temp files when an error occurs
    • Added new end-to-end tests
    • Added new unit tests
    • Improved contribution documentation
    • Code refactoring and removing dead or duplicate code

    Improvements in error logging

    • Improvement to error messaging when validating any GCP bucket
    • Added error logging in DataSourceSequence
    • Showing an appropriate error to indicate that we don't support opening directories with wildcards
    • Showing an appropriate error for invalid or unsupported expressions on the logical plan

    Changes or improvements in technology stack or CI

    • Added output compile json option for cppcheck
    • Bump junit from 4.12 to 4.13.1 in /algebra
    • Improved gpuCI scripts
    • Removed need to specify cuda version via a label for conda packages
    • Fixed cmake version to be 3.18.4
    • Fix SSL errors for conda

    Bug Fixes

    • Fixed issue when loading parquet files with local_files=True
    • Fixed logging directory setup
    • Fixed issues with config_options
    • Fixed issue in float columns when parsing parquet metadata
    • Fixed bug in MergeAggregations when single node has multiple batches
    • Fix graph thread pool hang when exception is thrown
    • Fix ignore headers when multiple CSV files was provided
    • Fix column_names (table) always as list of string
    • Fixed literal type inference for integers

    Deprecated features

    • Deprecated bc.partition
    Source code(tar.gz)
    Source code(zip)
  • v0.16.0(Oct 23, 2020)

    Improvements

    • Activate End-to-end test result validation for GPU_CI.
    • Add capacity to set the transport memory
    • Update conda recipe, remove cxx11 abi from cmake
    • Just one initialize() function at beginning and add logs related to allocation stuff
    • Make possible to read the system environment variables to setup config_option for BlazingContext
    • Update TPCH queries for end to end tests: converting implicit joins into explicit joins
    • Removing cudf source code dependency as some cudf utilities headers were exposed
    • Can now set manually BLAZING_CACHE_DIRECTORY

    Bug Fixes

    • Fixed issue due to cudf orc api change
    • Fixed issue parsing fixed width string literals
    • Fixed issue with hive string columns
    • Fixed issue due to an rmm include
    • Fixed build issues with latest rmm 0.16 and columnBasisTest due to deprecated drop_column() function
    • Fixed metadata mismatch in parsedMetadata caused by Parquet files that had only nulls in certain columns for only some files
    • Removed workaround for parquet read schema
    • Fixed dtype mismatch when creating tables from multiple CSV files and having BSQL infer the data types
    • Avoided reading _metadata files
    • Fixed issues with parsers, in particular ORC parser was misbehaving
    • Fixed issue with logging directories in distributed environments
    • Pinned Google Cloud version to 1.16
    • Partial revert of some changes on parquet rowgroups flow with local_files=True
    • Fixed issue when loading paths with wildcards
    • Fixed issue with concat_all in concatenating cache
    • Fixed Arrow and spdlog compilation issues
    • Fixed intra-query memory leak in joins
    • Fixed crash when loading an empty folder
    • Fixed exception handling in parseSchemaPython
    Source code(tar.gz)
    Source code(zip)
  • v0.15.0(Aug 31, 2020)

    New Features:

    • Added a memory monitor for better memory management for out of core processing
    • Added list_tables() and describe_table() functions
    • Added support for constant expressions evaluation by Calcite
    • Added support for cross join
    • Added rand() and support for running unary operations on literals
    • Added get_free_memory() function (see the sketch of these helpers below)
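
    A hedged sketch of the new helpers, assuming they are methods on BlazingContext and that describe_table takes a table name:

      import cudf
      from blazingsql import BlazingContext

      bc = BlazingContext()
      bc.create_table('taxi', cudf.DataFrame({'fare': [7.5, 3.2], 'passengers': [1, 2]}))

      print(bc.list_tables())           # expected to include 'taxi'
      print(bc.describe_table('taxi'))  # column names and types (exact return shape assumed)
      print(bc.get_free_memory())       # available GPU memory (units/shape assumed)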

    Improvements

    Performance improvements:

    • Implemented unordered pull from cache to improve performance
    • Improved the concatenating cache and replaced PartwiseJoin::load_set with a concatenating cache
    • Added a maximum thread count for the kernel thread pool
    • Added a new separate threshold for the concatenating cache

    Stability improvements:

    • Added checks for concatenation to prevent String overflow
    • Added nogil statements for pure C functions in Cython
    • Round-robin dask workers on single-GPU queries
    • Re-raise query errors in context.py
    • Implemented a thread pool for outgoing messages

    Documentation improvements:

    • Added exhale to generate doxygen for sphinx docs
    • Added Sphinx based code architecture documentation
    • Added doxygen comments to CacheMachine.h
    • Added more documentation about memory management
    • Updated readme
    • Added doxygen comments to some kernels and the batch processing

    Building improvements:

    • Updated Calcite to the most recent version 1.23
    • Added check for CUDF_HOME to allow build to use an existing prebuilt cudf source tree
    • Added Python/Cython code style checks
    • Make AWS and GCS optional

    Logging improvements:

    • The logging flush level (flush_on) is now configurable
    • Set log_level when using LOGGING_LEVEL param

    Testing improvements:

    • Added unit tests on Calcite to check how logical plans are affected when rulesets are updated
    • Updated set of TPCH queries on the E2E tests
    • Added initial set of unit tests for WaitingQueue and nullptr checks around spdlog calls
    • Add unit test for Project kernel

    Other improvements:

    • Removed a lot of dead code from the codebase
    • Replaced random_generator with cudf::sample
    • Added extern "C" to include files
    • Use default client and network interface from Dask. BlazingSQL should now be able to infer the network interface.
    • Updated the GPUManager functions
    • Handle exceptions from pool_threads

    Bug Fixes

    • Various fixing of issues due to updates to cudf
    • Fixed issue with Hive partitions when doing SELECT *
    • Normalize columns before distribution in JoinPartitionKernel
    • Fixed issue with hive partitions base folder
    • Fixed interops operators' output types
    • Fixed handling when the algebra plan was provided as a one-line logical plan
    • Fixed issue related to Hive metadata
    • Removed temp files from data cached to disk
    • Fixed check for plans containing only Limit and Scan kernels
    • Load one file at a time (LimitKernel and ScanKernel)
    • Fixed small issue with hive types conversion
    • Fix for literal cast
    • Fixed issue with start and length of substring being different types
    • Fixed issue on logical plans when there is an EXISTS clause
    • Fixed issue with casting string to string
    • Fixed issue with getting table scan info
    • Fixed row_groups issue in ParquetParser.cpp
    • Fixed issue with some constant expressions not evaluated by calcite
    • Fixed issue with log directory creation in a distributed environment
    • Fixed issue where we were including testing hpp in our code
    • Fixed optimization regression on the select count(*) case
    • Fixed issue caused by using new arrow_io_source
    • Fixed e2e string comparison
    • Fixed random segfault issue in parser
    • Fixed issue with column names on sample function
    • Introduced config param for max orderby samples and fixed issue with oversampling in ORDER BY
    Source code(tar.gz)
    Source code(zip)
  • v0.14.0(Jun 24, 2020)

    New Features:

    • New execution architecture, supporting executing queries on data that does not fit in the GPU. The new architecture features the following:

      • The execution model is an acyclic graph of execution nodes with a cache in between execution nodes.
      • Each execution node operates independently on batches of data, allowing steps to be processed in parallel as much as possible rather than sequentially.
      • Each cache between every execution step can hold the data in GPU, in system memory or on disk.
      • Has support for multi-partition dask.cudf.DataFrame result set outputs.
    • Added ability to set configuration options

    • Added support for using NULL as a literal value

    • Implemented CHAR_LENGTH function

    • Added ability to specify the region for S3 buckets (see the sketch after this list)

    • Added type normalization for UNION ALL

    • Added support for MinIO Storage
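
    A sketch of registering an S3 bucket pinned to a region; the region keyword name is an assumption, and the bucket and paths are hypothetical:

      from blazingsql import BlazingContext

      bc = BlazingContext()

      # Register the bucket under the name 'my_bucket' in a specific region.
      bc.s3('my_bucket', bucket_name='my-company-data', region='us-east-1')

      bc.create_table('trips', 's3://my_bucket/trips/*.parquet')
      bc.sql('SELECT COUNT(*) FROM trips')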

    Improvements:

    • Improved support for CAST function to include TINYINT and SMALLINT
    • Handled the case where the optimized plan contains a LogicalValues
    • Improvements to exception handling
    • Support modern compilers (>= g++-7.x)
    • Improved logging now uses spdlog
    • Added event logging
    • BlazingSQL engine no longer needs to concatenate dask.cudf.DataFrame partitions prior to running a query on a dask.cudf.DataFrame table
    • Improved expression parser, including support for expression trees of unlimited size.
    • Optimized data loading for queries of the type: SELECT * FROM table LIMIT N
    • Added built in end to end testing framework
    • Added logging to condition variables that are waiting too long

    Bug Fixes:

    • Fixed bug in size estimation for tables before joins
    • Fixed issue with excessive thread creation in communication
    • Fixed bug in expression parsing for joins
    • Fixed bug caused by sharing data loaders when a query has one table more than once
    • Fixed Hive file format inference
    Source code(tar.gz)
    Source code(zip)
  • v0.13.0(Apr 7, 2020)

    New Features:

    • Support for AVG in distributed mode
    • Added ability to use existing memory allocator
    • Implemented the unify_partitions function for preparing dask_cudf DataFrames prior to creating BlazingSQL tables (see the sketch after this list)
    • Implemented ROUND function
    • Implemented support for CASE with strings
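
    A sketch of preparing a dask_cudf DataFrame before creating a table; the import location of unify_partitions and the dask_client keyword are assumptions:

      from dask.distributed import Client
      from dask_cuda import LocalCUDACluster
      import dask_cudf
      from blazingsql import BlazingContext, unify_partitions  # import path assumed

      client = Client(LocalCUDACluster())
      bc = BlazingContext(dask_client=client)  # kwarg name assumed

      ddf = dask_cudf.read_parquet('/data/trips/*.parquet')  # hypothetical path
      ddf = unify_partitions(ddf)  # normalize partitions prior to create_table
      bc.create_table('trips', ddf)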

    Improvements:

    • Local files can be referenced with relative file paths when creating tables.
    • Automatic casting for joins on similar data types (e.g. joining an int32 with an int64 will cast the int32 to an int64)
    • Updated AWS SDK version
    • More changes related to the migration from libcudf to libcudf++
    • Added docstrings to main python APIs

    Bug Fixes:

    • Fixed bug when joining against an empty DataFrame
    • Fixed bug with GROUP BY ignoring nulls
    • Fixed various issues related to creating tables from dask_cudf DataFrames
    • Fixed various bugs with creating tables from Hive Cursor
    • Fixed bugs related to new libcudf++ functionality
    • Fixed bug in LIMIT statement
    • Fixed bug in timestamp processing
    • Fixed bug in SUM0 aggregation (which enables COUNT DISTINCT)
    • Fixed bug when querying single file with multiple workers
    • Fixed bug with distributed COUNT aggregation without GROUP BY
    • Fixed bug when creating and querying a table with several Apache Parquet files and one is empty
    • Fixed bug with joins with nulls in the join key columns

    Other:

    • Temporarily deprecated JSON reader. In the meantime we recommend using: cudf.read_json
    Source code(tar.gz)
    Source code(zip)
  • v0.12.0(Feb 6, 2020)

    New Features:

    • Ability to skip reading and processing row groups when querying Apache Parquet files by applying predicates on metadata
    • Ability to do SELECT COUNT (DISTINCT column)
    • Ability to use and set a pool memory allocator for increased performance and/or a managed (UVM) allocator, which provides robustness against running out of GPU memory (see the sketch below)
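
    A sketch of opting into these allocators; the keyword names below (allocator, pool, initial_pool_size) are assumptions used for illustration:

      from blazingsql import BlazingContext

      # Hypothetical kwargs: a pooled allocator for performance, or a managed
      # (UVM) allocator for robustness against GPU out-of-memory.
      bc = BlazingContext(allocator='managed', pool=True, initial_pool_size=4 * 2**30)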

    Improvements:

    • New building scripts thanks to @dillon-cullinan

    Bug Fixes:

    • Fixed various bugs in the Apache Arrow provider
    • Fixed bug with incorrect data type in CASE statements
    • Fixed bug and memory leak in distributed joins
    • Fixed bug in usage of Google Cloud Storage plugin
    Source code(tar.gz)
    Source code(zip)
  • v0.11(Dec 17, 2019)

    New Features:

    • Merged all the code repos for the whole stack into one repo
    • Pythonization of the whole BlazingSQL stack. See our blog post for more information
    • New API for querying performance and execution logs
    • Ability to create BlazingSQL tables from Hive tables
    • Partial support for non-equality joins, for example: SELECT * FROM tableA AS A INNER JOIN tableB AS B ON A.key = B.key AND A.this_date > B.that_date
    • Added arrow-provider

    Improvements:

    • Optimized simple queries that only have COUNT(*)
    • Removed limitation on number of operands for outer joins
    • Improved error messaging
    • Improvements to relational algebra optimization

    Bug Fixes:

    • Fixed bug where a python script running BlazingSQL would hang at the end of a script
    • Fixed bug when using wildcards for file paths and using dask distribution
    • Fixed bug with HDFS
    • Fixed bug with projects with large amounts of transformations on large GPUs
    • Fixed bug with multiple projections on the same column
    • Fixed COUNT(*) to properly ignore nulls
    • Fixed stability issues with certain queries running on 3 or more nodes
    • Fixed bug with querying a GDF and no transformations are applied
    • Fixed bug with empty result sets
    • Fixed bug with empty column names
    Source code(tar.gz)
    Source code(zip)
  • v0.4.6(Nov 12, 2019)

    New Features

    • Implemented string concat operator
    • Implemented substring operator

    Improvements:

    • Improved management of services
    • Changed Apache Calcite schema database to an in-memory database
    • Improved performance of communication between nodes by enabling parallel messaging
    • Improved performance of data loading by enabling parallel file reading
    • Added new distributed join method for joining small tables

    Bug Fixes:

    • Fixed various issues with Timestamp data types
    • Fixed issue when column names were too long
    • Fixed bug in relational algebra generation
    • Fixed various bugs in communication layer
    • Fixed bug with order by with strings
    • Fixed issue with parsing Apache Parquet file schemas
    • Fixed memory leak in joins
    • Fixed memory leak in communication layer
    • Fixed bug in table concatenation in distribution algorithms
    • Fixed bug when trying to join on columns of integers of different byte widths, or floats of different byte widths
    • Fixed bug when trying to do a union on columns of integers of different byte widths, or floats of different byte widths
    • Fixed bug in passing error message to user
    Source code(tar.gz)
    Source code(zip)
  • v0.4.5(Oct 22, 2019)

    New Features

    • Completely revamped data transport layer, which is now much faster and more robust
    • Added support for LIKE operator
    • Added ability to create tables from Dask dataframes.
    • Improved how services are launched from BlazingContext, including a new ready() function that checks whether all services are online and a shutdown() function that shuts down all services (see the sketch below)
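
    A sketch of the new service-management calls; exact signatures are not shown in these notes, so only the basic usage is illustrated:

      from blazingsql import BlazingContext

      bc = BlazingContext()

      # ready() reports whether all services are online; shutdown() stops them.
      if bc.ready():
          print('all BlazingSQL services are up')

      bc.shutdown()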

    Improvements

    • Improved performance logging
    • Now using in-memory H2 database for Apache Calcite table catalog
    • Updated to cudf v0.10

    Bug Fixes

    • Fixed bug in expression parsing
    • Fixed various bugs with date literals, date functions and GDF_TIMESTAMP data type
    • Fixed bug with aliases
    • Fixed bug in order by for distributed queries when there are empty partitions
    • Fixed bug in creating tables from S3 directories
    • Fixed bug where predicate pushdown was not happening in certain types of queries
    Source code(tar.gz)
    Source code(zip)
  • v0.4.4(Oct 22, 2019)

    New Features

    • Added support for CAST
    • Added a file_format parameter to create_table. This parameter is used when the file format cannot be determined from the file extension (see the sketch below).
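
    A sketch of the file_format parameter; the path below is hypothetical and has no extension, so the format cannot be inferred from it:

      from blazingsql import BlazingContext

      bc = BlazingContext()

      # Without an extension, the format must be given explicitly.
      bc.create_table('sensor_data', '/data/exports/sensor_dump', file_format='parquet')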

    Bug Fixes

    • Fixed bug where aliases would sometimes not be set correctly
    Source code(tar.gz)
    Source code(zip)
  • v0.4.3(Sep 26, 2019)

    New Features

    • Added file_format parameter to create_table to help create tables from files that don't have extensions

    Bug Fixes

    • Fixed how releases are versioned for Conda
    • Fixed bug with joining against an empty table
    Source code(tar.gz)
    Source code(zip)
  • v0.4.2(Sep 20, 2019)

    New Features

    • Added support for CASE
    • Improved support for Boolean columns
    • Creating tables using wildcards in file paths (see the sketch after this list)
    • Added support for Google Cloud Storage
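
    A sketch of creating a table from a wildcard file path; the path and table name are hypothetical:

      from blazingsql import BlazingContext

      bc = BlazingContext()

      # Every CSV file matching the wildcard backs the single table 'weblogs'.
      bc.create_table('weblogs', '/data/logs/2019-*.csv')
      bc.sql('SELECT COUNT(*) FROM weblogs')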

    Bug Fixes

    • Fixed bug in GROUP BYs with strings in a distributed cluster
    • Fixed issues in how BlazingContext launches processes
    • Fixed issue where releases were being done in Debug mode
    • Fixed bug related to creating multiple tables with the same name
    Source code(tar.gz)
    Source code(zip)
  • v0.4.1(Sep 20, 2019)

    New Features:

    • Ability to compile and install using Conda
    • Creating a BlazingContext now automatically launches processes
    • Support for creating tables from JSON and ORC files
    • Added more CSV parsing parameters for creating tables from CSV files
    • Updated to use cudf v0.9 release
    • Added support for LIMIT

    Bug fixes

    • Fixed bug with processing queries using date literals
    • Fixed distribution issues with data with nulls
    Source code(tar.gz)
    Source code(zip)
  • v0.4.0(Aug 16, 2019)

    A great deal has happened since we last released.

    • We now support distributed query execution!
    • Distributed results output to dask-cudf
    • Updated to cuDF 0.9
    • Millions, literally millions, of bug fixes.
    • No longer use main. before any table names. That was awful. bc.sql('select * from main.table_name') --> bc.sql('select * from table_name')
    Source code(tar.gz)
    Source code(zip)
  • simple-distribution-tcp-cudf0.7(Jun 14, 2019)
