This repository contains a set of benchmarks comparing different implementations of the conversion between Parquet (storage format) and Arrow (in-memory format).

Last update: Dec 21, 2022

Parquet benchmarks

The results on an Azure Standard D4s v3 instance (4 vCPUs, 16 GiB memory) are available here.

Read uncompressed

(Note: neither pyarrow nor arrow validates utf8, which can result in undefined behavior.)

Read compressed (snappy)

(Note: neither pyarrow nor arrow validates utf8, which can result in undefined behavior.)

Write uncompressed

Write compressed (snappy)

(Note: neither pyarrow nor arrow validates utf8, which can result in undefined behavior.)

Run benchmarks

To reproduce, run:

python3 -m venv venv
venv/bin/pip install -U pip
venv/bin/pip install pyarrow

# create files
venv/bin/python write_parquet.py

# run benchmarks
venv/bin/python run.py

# print results to stdout as csv
venv/bin/python summarize.py

Details

The benchmark reads a single column from a file pre-loaded into memory, decompresses it, and deserializes it into an Arrow array.
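As a rough illustration of that measurement pattern (not the code in this repository), the sketch below times a decompress-and-deserialize step on a buffer that already sits in memory, so disk I/O is excluded from the measurement. It uses only the Python standard library: zlib stands in for snappy, and a raw int64 buffer stands in for a Parquet column. The names make_buffer, decode, and time_decode are hypothetical.

```python
import time
import zlib
from array import array


def make_buffer(n_values: int) -> bytes:
    """Build a compressed buffer of n_values 64-bit integers, held in memory.

    Assumption: zlib is a stand-in codec here; the benchmarks use snappy.
    """
    values = array("q", range(n_values))
    return zlib.compress(values.tobytes())


def decode(buffer: bytes) -> array:
    """Decompress the buffer and deserialize it back into an int64 array."""
    out = array("q")
    out.frombytes(zlib.decompress(buffer))
    return out


def time_decode(buffer: bytes, repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        decode(buffer)
        best = min(best, time.perf_counter() - start)
    return best


# The buffer is created once, outside the timed region.
buffer = make_buffer(100_000)
decoded = decode(buffer)
elapsed = time_decode(buffer)
print(f"decoded {len(decoded)} values, best of 5: {elapsed:.6f} s")
```

Taking the best of several runs is one common way to reduce noise from the operating system; whether run.py does the same is not stated here.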

The benchmark includes different configurations:

- dictionary-encoded vs plain encoding
- single page vs multiple pages
- compressed vs uncompressed
- different types:
  - i64
  - bool
  - utf8
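The options above can be thought of as a grid of configurations. The sketch below enumerates the full cross product with the standard library; the dictionary keys and the configurations helper are hypothetical names, and it is an assumption that every combination is actually benchmarked (some, such as dictionary-encoded bool, may not apply).

```python
from itertools import product

# Option values taken from the list above; "snappy" is the codec named
# in the benchmark section titles.
ENCODINGS = ("dictionary", "plain")
PAGING = ("single page", "multiple pages")
COMPRESSION = ("uncompressed", "snappy")
TYPES = ("i64", "bool", "utf8")


def configurations():
    """Yield one dict per combination of encoding, paging, compression, type."""
    for encoding, paging, compression, dtype in product(
        ENCODINGS, PAGING, COMPRESSION, TYPES
    ):
        yield {
            "encoding": encoding,
            "paging": paging,
            "compression": compression,
            "type": dtype,
        }


configs = list(configurations())
print(f"{len(configs)} configurations")
for config in configs[:3]:
    print(config)
```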

