This repository contains a set of benchmarks of different implementations of Parquet (storage format) <-> Arrow (in-memory format).

Overview

Parquet benchmarks

This repository contains a set of benchmarks of different implementations of Parquet (storage format) <-> Arrow (in-memory format).

The results on Azure's Standard D4s v3 (4 vcpus, 16 GiB memory) are available here.

Read uncompressed

read uncompressed i64

read uncompressed bool

read uncompressed utf8

(Note: neither pyarrow nor arrow validate utf8, which can result in undefined behavior.)

read uncompressed dict utf8

(Note: neither pyarrow nor arrow validate utf8, which can result in undefined behavior.)

Read compressed (snappy)

read compressed i64

read compressed bool

read compressed utf8

(Note: neither pyarrow nor arrow validate utf8, which can result in undefined behavior.)

Write uncompressed

write uncompressed i64

write uncompressed bool

write uncompressed utf8

Write compressed (snappy)

write compressed i64

write compressed bool

write compressed utf8

(Note: neither pyarrow nor arrow validate utf8, which can result in undefined behavior.)

Run benchmarks

To reproduce, use

python3 -m venv venv
venv/bin/pip install -U pip
venv/bin/pip install pyarrow

# create files
venv/bin/python write_parquet.py

# run benchmarks
venv/bin/python run.py

# print results to stdout as csv
venv/bin/python summarize.py

Details

The benchmark reads a single column from a file pre-loaded into memory, decompresses and deserializes the column to an arrow array.

The benchmark includes different configurations:

  • dictionary-encoded vs plain encoding
  • single page vs multiple pages
  • compressed vs uncompressed
  • different types:
    • i64
    • bool
    • utf8
You might also like...
This repository contnains sample problems with test cases using Cormen-Lib

Cormen Lib Sample Problems Description This repository contnains sample problems with test cases using Cormen-Lib. These problems were made for the pu

Repository for JIDA SNP Browser Web Application: Local Deployment

JIDA JIDA is a web application that retrieves SNP information for a genomic region of interest in Homo sapiens and calculates specific summary statist

A torch.Tensor-like DataFrame library supporting multiple execution runtimes and Arrow as a common memory format

TorchArrow (Warning: Unstable Prototype) This is a prototype library currently under heavy development. It does not currently have stable releases, an

🖍️This is a feature-complete clone of the awesome Chalk (JavaScript) library.
🖍️This is a feature-complete clone of the awesome Chalk (JavaScript) library.

Terminal string styling done right This is a feature-complete clone of the awesome Chalk (JavaScript) library. All credits go to Sindre Sorhus. Highli

Python-geoarrow - Storing geometry data in Apache Arrow format

geoarrow Storing geometry data in Apache Arrow format Installation $ pip install

IMGUR5K handwriting set. It is a handwritten in-the-wild dataset, which contains challenging real world handwritten samples from different writers.The dataset is shared as a set of image urls with annotations. This code downloads the images and verifies the hash to the image to avoid data contamination.
This repository contains the implementations related to the experiments of a set of publicly available datasets that are used in the time series forecasting research space.

TSForecasting This repository contains the implementations related to the experiments of a set of publicly available datasets that are used in the tim

The Dual Memory is build from a simple CNN for the deep memory and Linear Regression fro the fast Memory
The Dual Memory is build from a simple CNN for the deep memory and Linear Regression fro the fast Memory

Simple-DMA a simple Dual Memory Architecture for classifications. based on the paper Dual-Memory Deep Learning Architectures for Lifelong Learning of

Ralph is a command-line tool to fetch, extract, convert and push your tracking logs from various storage backends to your LRS or any other compatible storage or database backend.

Ralph is a command-line tool to fetch, extract, convert and push your tracking logs (aka learning events) from various storage backends to your

DNA Storage Simulator that analyzes and simulates DNA storage

DNA Storage Simulator This monorepository contains code for a research project by Mayank Keoliya and supervised by Djordje Jevdjic, that analyzes and

Qtas(Quite a Storage)is an experimental distributed storage system developed by Q-team in BJFU Advanced Computer Network sources.

Qtas(Quite a Storage)is a experimental distributed storage system developed by Q-team in BJFU Advanced Computer Network sources.

Qtas(Quite a Storage)is an experimental distributed storage system developed by Q-team in BJFU Advanced Computer Network sources.

Qtas(Quite a Storage)is a experimental distributed storage system developed by Q-team in BJFU Advanced Computer Network sources.

Storage-optimizer - Identify potintial optimizations on the cloud storage accounts

Storage Optimizer Identify potintial optimizations on the cloud storage accounts

Standard implementations of FedLab and its provided benchmarks.

FedLab-benchmarks This repo contains standard implementations of FedLab and its provided benchmarks. Currently, following algorithms or benchrmarks ar

Gateware for the Terasic/Arrow DECA board, to become a USB2 high speed audio interface

DECA USB Audio Interface DECA based USB 2.0 High Speed audio interface Status / current limitations enumerates as class compliant audio device on Linu

Small Arrow Vortex clipboard processing library

Description Small Arrow Vortex clipboard processing library. Install You can install this library from PyPI with pip install av-clipboard-lib or compi

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

AWS Data Wrangler Pandas on AWS Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretMana

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

AWS Data Wrangler Pandas on AWS Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretMana

Owner
null
This repository contains a testing script for nmigen-boards that tries to build blinky for all the platforms provided by nmigen-boards.

Introduction This repository contains a testing script for nmigen-boards that tries to build blinky for all the platforms provided by nmigen-boards.

S.J.R. van Schaik 4 Jul 23, 2022
PacketPy is an open-source solution for stress testing network devices using different testing methods

PacketPy About PacketPy is an open-source solution for stress testing network devices using different testing methods. Currently, there are only two c

null 4 Sep 22, 2022
A set of pytest fixtures to test Flask applications

pytest-flask An extension of pytest test runner which provides a set of useful tools to simplify testing and development of the Flask extensions and a

pytest-dev 433 Dec 23, 2022
A set of pytest fixtures to test Flask applications

pytest-flask An extension of pytest test runner which provides a set of useful tools to simplify testing and development of the Flask extensions and a

pytest-dev 354 Feb 17, 2021
A configurable set of panels that display various debug information about the current request/response.

Django Debug Toolbar The Django Debug Toolbar is a configurable set of panels that display various debug information about the current request/respons

Jazzband 7.3k Jan 2, 2023
pywinauto is a set of python modules to automate the Microsoft Windows GUI

pywinauto is a set of python modules to automate the Microsoft Windows GUI. At its simplest it allows you to send mouse and keyboard actions to windows dialogs and controls, but it has support for more complex actions like getting text data.

null 3.8k Jan 6, 2023
Avocado is a set of tools and libraries to help with automated testing.

Welcome to Avocado Avocado is a set of tools and libraries to help with automated testing. One can call it a test framework with benefits. Native test

Ana Guerrero Lopez 1 Nov 19, 2021
Set your Dynaconf environment to testing when running pytest

pytest-dynaconf Set your Dynaconf environment to testing when running pytest. Installation You can install "pytest-dynaconf" via pip from PyPI: $ pip

David Baumgold 3 Mar 11, 2022
A simple serverless create api test repository. Please Ignore.

serverless-create-api-test A simple serverless create api test repository. Please Ignore. Things to remember: Setup workflow Change Name in workflow e

Sarvesh Bhatnagar 1 Jan 18, 2022
This repository has automation content to test Arista devices.

Network tests automation Network tests automation About this repository Requirements Requirements on your laptop Requirements on the switches Quick te

Netdevops Community 17 Nov 4, 2022