# H5Record

Large dataset (> 100 GB, <= 1 TB) storage format for PyTorch (work in progress).

Supports Python 3.

`pip install h5record`
## Why?

Writing large datasets is still a wild west in PyTorch. Approaches seen in the wild include:

- a large directory with lots of small files: slow I/O, as files are fetched and deserialized frequently
- a database: depending on the database engine used, multi-process reads are usually not supported
- both of the above scale non-linearly in data-to-storage size

TFRecord solves these problems well: multi-process fetching, (de)compression, and fast serialization (protobuf).

However, the TFRecord port does not support dataset size evaluation (used frequently by DataLoader), and no index-level access is available (important for data evaluation or verification).

H5Record aims to tackle these problems by compressing the dataset into an HDF5 file, with an easy-to-use interface through predefined column types (String, Image, Sequences, Integer).
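HDF5 gives both of the missing properties essentially for free; below is a minimal h5py sketch of what that buys you (the file path and the 'label' column name are hypothetical, mirroring the usage example further down, and the exact on-disk layout H5Record produces is an assumption here):

```python
import h5py

# Hypothetical file/column layout, mirroring the schema in the usage example
with h5py.File('./question_pair.h5', 'r') as f:
    labels = f['label']
    print(len(labels))  # O(1) dataset size, as needed by DataLoader
    print(labels[0])    # O(1) index-level access for spot checks
```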
Some advantages of using H5Record:

- Supports multi-process reads (see the DataLoader sketch after the usage example below)
- Relatively simple to use, with low technical debt
- Supports compression/decompression on the fly
- Quick loading into memory if required (see the sketch after this list)
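For the last point, a whole HDF5 file can be pulled into RAM at open time via h5py's "core" driver; a minimal sketch (whether H5Record exposes this option directly is not covered in this README, so the sketch goes through h5py):

```python
import h5py

# driver='core' reads the entire file into memory on open,
# so all subsequent reads are served from RAM
with h5py.File('./question_pair.h5', 'r', driver='core') as f:
    print(list(f.keys()))
```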
## Simple usage

`pip install h5record`

### Sentence similarity
```python
from h5record import H5Dataset, Float, String

# Schema: each column is declared by type and name
schema = (
    String(name='sentence1'),
    String(name='sentence2'),
    Float(name='label')
)

data = [
    ['Sent 1.', 'Sent 2', 0.1],
    ['Sent 3', 'Sent 4', 0.2],
]

# Rows are supplied as an iterator of dicts keyed by column name
def pair_iter():
    for row in data:
        yield {
            'sentence1': row[0],
            'sentence2': row[1],
            'label': row[2]
        }

# Writes ./question_pair.h5 on construction, then supports len() and indexing
dataset = H5Dataset(schema, './question_pair.h5', pair_iter())
for idx in range(len(dataset)):
    print(dataset[idx])
```
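Because the dataset supports `len()` and index access, it should also plug into a standard PyTorch `DataLoader` as a map-style dataset; a minimal sketch (the batch size and worker count are illustrative, not values from this README):

```python
from torch.utils.data import DataLoader

# num_workers > 0 exercises the multi-process read support
loader = DataLoader(dataset, batch_size=2, num_workers=2)
for batch in loader:
    print(batch)  # default collation: lists of strings plus a tensor of labels
```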
## Note

As development is still in progress, this package should be used with care on FAT or FAT-32 formatted storage (note that FAT-32 in particular caps individual files at 4 GB, well below the dataset sizes targeted here).
### Comparison between different compression algorithms

No chunking is used.

| Compression type | File size | Read speed (rows/second) |
|---|---|---|
| no compression | 2.0 GB | 2084.55 it/s |
| lzf | 1.7 GB | 1496.14 it/s |
| gzip | 1.1 GB | 843.78 it/s |

Benchmarked on an i7-9700 with a 1 TB NVMe SSD.
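For context, compression in HDF5 is chosen per dataset at creation time, which is where the size/speed trade-off above comes from; a minimal h5py sketch (not H5Record's internal code, with a made-up array for illustration):

```python
import h5py
import numpy as np

data = np.random.rand(10_000, 128).astype('float32')

with h5py.File('bench.h5', 'w') as f:
    f.create_dataset('none', data=data)                      # fastest reads, largest file
    f.create_dataset('lzf',  data=data, compression='lzf')   # speed-oriented compression
    f.create_dataset('gzip', data=data, compression='gzip')  # size-oriented compression
```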
If you are interested in learning more, feel free to check out the note as well!