S3-plugin is a high performance PyTorch dataset library to efficiently access datasets stored in S3 buckets.

Amazon Web Services

Last update: Jan 3, 2023

Overview

S3 Plugin

S3-plugin is a high performance PyTorch dataset library to efficiently access datasets stored in S3 buckets. It provides streaming data access to datasets of any size and thus eliminates the need to provision local storage capacity. The library is designed to leverage the high throughput that S3 offers to access objects with minimal latency.

Users can choose either the map-style or the iterable-style dataset interface, depending on their needs. The library itself is file-format agnostic and presents objects in S3 as a binary buffer (blob). Users are free to apply any additional transformation to the data received from S3.

Installation

You can install this package by following the instructions below.

Prerequisite

Python 3.6 or Python 3.7 is required for this installation.
AWS CLI, for configuring S3 access.
PyTorch >= 1.5 (if it is not available, S3-plugin installs the latest Torch).
Note: To run on Mac, AWS_SDK_CPP must be installed.

Installing S3-Plugin via Wheel

# TODO Add final public wheels
aws s3 cp <S3 URI> .
pip install <whl name awsio-0.0.1-cp...whl>

Configuration

Before reading data from an S3 bucket, you need to provide the bucket region parameter:

AWS_REGION: By default, regional endpoint is used for S3, with region controlled by AWS_REGION. If AWS_REGION is not specified, then us-west-2 is used by default.
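
The default described above can be mirrored in a few lines of Python. This is a minimal sketch, not part of the awsio API: `resolve_region` is a hypothetical helper that only reports which region would apply, assuming the variable is read from the process environment as documented.

```python
import os

DEFAULT_REGION = "us-west-2"  # documented fallback when AWS_REGION is unset


def resolve_region(env=None):
    """Return the region that would apply for the given environment mapping."""
    env = os.environ if env is None else env
    return env.get("AWS_REGION", DEFAULT_REGION)


# Set the region for this process before creating any dataset.
os.environ["AWS_REGION"] = "us-east-1"
print(resolve_region())
```

Setting the variable in the shell before launching Python (for example with `export AWS_REGION=us-east-1`) has the same effect for the whole process.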

To read objects in a bucket that is not publicly accessible, AWS credentials must be provided through one of the following methods:

Install the AWS CLI and configure it by running aws configure.
Set credentials in the AWS credentials profile file on the local system, located at ~/.aws/credentials on Linux, macOS, or Unix.
Set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
If you are using this library on an EC2 instance, specify an IAM role and then give the EC2 instance access to that role.
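
To see which of the local credential sources listed above are present on a machine, a quick check along the following lines may help. This is only a sketch: `available_credential_sources` is a hypothetical helper, it does not reproduce the AWS SDK's own resolution order, and it cannot detect an IAM role attached to an EC2 instance.

```python
import os
from pathlib import Path


def available_credential_sources(env=None, home=None):
    """List the local credential sources that appear to be configured."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    sources = []
    # Both variables are needed for environment-based credentials.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        sources.append("environment variables")
    # Written by `aws configure` or edited by hand.
    if (home / ".aws" / "credentials").is_file():
        sources.append("credentials file")
    return sources


print(available_credential_sources())
```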

Smoke Test

To test your setup, run:

bash tests/smoke_tests/import_awsio.sh

The test first makes sure that the package imports correctly by printing the commit hash related to the build. It then prompts the user for an S3 URL to a file and reports whether or not the file exists.

For example:

$ bash tests/smoke_tests/import_awsio.sh 
Testing: import awsio
0.0.1+b119a6d
import awsio succeeded
S3 URL : 's3://path/to/bucket/test_0.JPEG'
Testing: checking setup by quering whether or not 's3://path/to/bucket/test_0.JPEG' is an existing file
file_exists: True
Smoke test was successful.

Usage

Once the above setup is complete, you can interact with an S3 bucket in the following ways.

Accepted input S3 URL formats:

Single URL:

url = 's3://path/to/bucket/abc.tfrecord'

List of URLs:

urls = ['s3://path/to/bucket/abc.tfrecord','s3://path/to/bucket/def.tfrecord']

Prefix of an S3 bucket, here including all files under the 's3_prefix' folder whose names start with '0':

urls = 's3://path/to/s3_prefix/0'

The list_files() function, whose returned list of URLs can be manipulated before fetching:

from awsio.python.lib.io.s3.s3dataset import list_files
urls = list_files('s3://path/to/s3_prefix/0')
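
The prefix semantics and the kind of manipulation meant here can be illustrated without touching S3. In the sketch below, `listing` is a hand-written stand-in for what list_files() might return, and `select_urls` is a hypothetical helper, not part of awsio.

```python
def select_urls(all_urls, prefix, suffix=None):
    """Keep URLs that start with `prefix` and, optionally, end with `suffix`."""
    selected = [u for u in all_urls if u.startswith(prefix)]
    if suffix is not None:
        selected = [u for u in selected if u.endswith(suffix)]
    return sorted(selected)


# Stand-in for the list that list_files() would return for a real bucket.
listing = [
    "s3://path/to/s3_prefix/0001.tar",
    "s3://path/to/s3_prefix/0002.json",
    "s3://path/to/s3_prefix/1001.tar",
]

# Everything whose key starts with '0', then only the tar shards among them.
print(select_urls(listing, "s3://path/to/s3_prefix/0"))
print(select_urls(listing, "s3://path/to/s3_prefix/0", suffix=".tar"))
```

The resulting list can be passed to either dataset class in place of a prefix string.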

Map-Style Dataset

If each object in S3 contains a single training sample, then the map-style dataset, i.e. S3Dataset, can be used. To partition data across nodes and to shuffle data, this dataset can be used with the PyTorch distributed sampler. Additionally, pre-processing can be applied to the data in S3 by extending the S3Dataset class. The following example illustrates the use of the map-style S3Dataset for image datasets:

from awsio.python.lib.io.s3.s3dataset import S3Dataset
from torch.utils.data import DataLoader
from torchvision import transforms
from PIL import Image
import io

class S3ImageSet(S3Dataset):
    def __init__(self, urls, transform=None):
        super().__init__(urls)
        self.transform = transform

    def __getitem__(self, idx):
        img_name, img = super(S3ImageSet, self).__getitem__(idx)
        # Convert bytes object to image
        img = Image.open(io.BytesIO(img)).convert('RGB')
        
        # Apply preprocessing functions on data
        if self.transform is not None:
            img = self.transform(img)
        return img

batch_size = 32

preproc = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    transforms.Resize((100, 100))
])

# urls can be S3 prefix containing images or list of all individual S3 images
urls = 's3://path/to/s3_prefix/'

dataset = S3ImageSet(urls, transform=preproc)
dataloader = DataLoader(dataset,
        batch_size=batch_size,
        num_workers=64)
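
The partitioning that a distributed sampler performs on a map-style dataset can be pictured with a small pure-Python sketch. This is a simplified illustration, not PyTorch's implementation: `partition_indices` is a hypothetical function, and the permutation it produces differs from the one torch.utils.data.distributed.DistributedSampler would generate. It only shows the idea of an epoch-seeded shuffle followed by padding and a strided split.

```python
import math
import random


def partition_indices(num_samples, num_replicas, rank, epoch=0, shuffle=True):
    """Return the sample indices assigned to one replica for one epoch."""
    if num_samples <= 0 or not 0 <= rank < num_replicas:
        raise ValueError("need num_samples > 0 and 0 <= rank < num_replicas")
    indices = list(range(num_samples))
    if shuffle:
        # Seeding with the epoch gives every replica the same permutation.
        random.Random(epoch).shuffle(indices)
    per_replica = math.ceil(num_samples / num_replicas)
    total = per_replica * num_replicas
    # Pad by repeating indices so that every replica gets the same count.
    repeats = math.ceil(total / num_samples)
    indices = (indices * repeats)[:total]
    # Strided slice: replica r takes positions r, r + num_replicas, ...
    return indices[rank:total:num_replicas]


parts = [partition_indices(10, 4, rank, epoch=1) for rank in range(4)]
for rank, part in enumerate(parts):
    print(rank, part)
```

In real training code the sampler object is passed to the DataLoader through its sampler argument, and its set_epoch method is called once per epoch so that the shuffle changes between epochs.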

Iterable-style dataset

If each object in S3 contains multiple training samples, e.g. archive files containing multiple small images or TFRecord files/shards containing multiple records, then it is advisable to use the iterable-style dataset implementation, i.e. S3IterableDataset. For the specific case of zip/tar archive files, each file contained in the archive is returned during each iteration in a streaming fashion. For all other file formats, the binary blob for the whole shard is returned and users need to implement the appropriate parsing logic. In addition, S3IterableDataset takes care of partitioning the data across nodes and workers in a distributed setting.

Note: For datasets consisting of a large number of smaller objects, accessing each object individually can be inefficient. For such datasets, it is recommended to create shards of the training data and use S3IterableDataset for better performance.

# tar file containing label and image files as below
tar --list --file=file1.tar | sed 4q

1234.cls
1234.jpg
5678.cls
5678.jpg

Consider a tar file for image classification. It can be easily loaded by writing a custom Python generator function using the iterator returned by S3IterableDataset. (Note: To create shards from a file dataset, refer to this link.)
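
One way to produce a shard with the layout listed above is Python's standard tarfile module. The following is a minimal sketch, not the procedure from the linked guide: `build_shard` is a hypothetical helper, and the byte strings stand in for real JPEG data.

```python
import io
import tarfile


def build_shard(samples):
    """Pack (key, label, image_bytes) samples into an in-memory tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, label, image_bytes in samples:
            files = (
                (f"{key}.cls", str(label).encode()),  # label stored as text
                (f"{key}.jpg", image_bytes),          # raw image bytes
            )
            for name, payload in files:
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


shard = build_shard([("1234", 3, b"fake-jpeg-bytes"), ("5678", 7, b"more-bytes")])

# Read the archive back to confirm the member order.
with tarfile.open(fileobj=io.BytesIO(shard)) as tar:
    member_names = tar.getnames()
print(member_names)
```

The returned bytes can be written to a local file and uploaded to the bucket with the AWS CLI.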

from torch.utils.data import IterableDataset
from awsio.python.lib.io.s3.s3dataset import S3IterableDataset
from PIL import Image
import io
import numpy as np
from torchvision import transforms

class ImageS3(IterableDataset):
    def __init__(self, urls, shuffle_urls=False, transform=None):
        self.s3_iter_dataset = S3IterableDataset(urls,
                                                 shuffle_urls)
        self.transform = transform

    def data_generator(self):
        try:
            while True:
                # Based on alphabetical order of files, sequence of label and image may change.
                label_fname, label_fobj = next(self.s3_iter_dataset_iterator)
                image_fname, image_fobj = next(self.s3_iter_dataset_iterator)
                
                label = int(label_fobj)
                image_np = Image.open(io.BytesIO(image_fobj)).convert('RGB')
                
                # Apply torch vision transforms if provided
                if self.transform is not None:
                    image_np = self.transform(image_np)
                yield image_np, label

        except StopIteration:
            return
            
    def __iter__(self):
        self.s3_iter_dataset_iterator = iter(self.s3_iter_dataset)
        return self.data_generator()
        
    def set_epoch(self, epoch):
        self.s3_iter_dataset.set_epoch(epoch)

# urls can be an S3 prefix containing all the shards or a list of S3 paths for all the shards
urls = ["s3://path/to/file1.tar", "s3://path/to/file2.tar"]

# Example Torchvision transforms to apply on data    
preproc = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    transforms.Resize((100, 100))
])

dataset = ImageS3(urls, transform=preproc)
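
The generator above relies on the label file arriving before the image file. If that order is not guaranteed, the two files can be matched by their shared name instead. The sketch below is an assumption-labelled variant, not part of awsio: `pair_samples` is a hypothetical helper, and `entries` is a hand-written stand-in for the (filename, bytes) pairs that the iterator yields for a tar shard.

```python
import os


def pair_samples(entries, label_ext=".cls", image_ext=".jpg"):
    """Yield (image_bytes, label) once both files of a sample have been seen."""
    pending = {}
    for name, payload in entries:
        stem, ext = os.path.splitext(os.path.basename(name))
        parts = pending.setdefault(stem, {})
        parts[ext] = payload
        if label_ext in parts and image_ext in parts:
            del pending[stem]
            # int() accepts ASCII digits given as bytes.
            yield parts[image_ext], int(parts[label_ext])


# Stand-in stream: the first sample arrives image-first, the second label-first.
entries = [
    ("1234.jpg", b"img1"),
    ("1234.cls", b"3"),
    ("5678.cls", b"7"),
    ("5678.jpg", b"img2"),
]
pairs = list(pair_samples(entries))
print(pairs)
```

Matching by name costs a small dictionary of unmatched files, which stays tiny when the two files of a sample sit next to each other in the archive.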

This dataset can be easily used with a DataLoader for parallel data loading and preprocessing:

import torch

dataloader = torch.utils.data.DataLoader(dataset, num_workers=4, batch_size=32)

We can shuffle the order in which shards are fetched by setting shuffle_urls=True and calling the set_epoch method at the beginning of every epoch:

dataset = ImageS3(urls, transform=preproc, shuffle_urls=True)
for epoch in range(epochs):
    dataset.set_epoch(epoch)
    # training code ...
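
The reason for passing the epoch number can be illustrated with a short sketch. It assumes, without confirmation from the library source, that the shard order is derived from a random generator seeded by the epoch, so that every worker computes the same order independently. `shuffled_for_epoch` is a hypothetical function, not the awsio implementation.

```python
import random


def shuffled_for_epoch(urls, epoch):
    """Return a copy of `urls` in an order determined only by `epoch`."""
    order = list(urls)
    random.Random(epoch).shuffle(order)
    return order


shards = [f"s3://path/to/file{i}.tar" for i in range(1, 6)]
for epoch in range(3):
    print(epoch, shuffled_for_epoch(shards, epoch))
```

Because the order depends only on the epoch, two processes that never communicate still agree on it, which is what makes a consistent partition of shards across workers possible.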

Note that the above code only shuffles the sequence of shards; the individual training samples within each shard are fetched in the same order. To shuffle the order of training samples across shards, use ShuffleDataset. ShuffleDataset maintains a buffer of data samples read from multiple shards and returns a random sample from it. The number of samples to be buffered is specified by buffer_size. To use ShuffleDataset, update the above example as follows:

dataset = ShuffleDataset(ImageS3(urls), buffer_size=4000)
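
The buffering idea described above can be sketched in plain Python. This is a generic shuffle buffer written for illustration, not ShuffleDataset's actual code: `shuffle_buffer` and its `seed` parameter are hypothetical.

```python
import random


def shuffle_buffer(samples, buffer_size, seed=None):
    """Yield items from `samples` in a locally shuffled order."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for sample in samples:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(sample)
            continue
        # Emit a random buffered item and put the new sample in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = sample
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(shuffle_buffer(range(100), buffer_size=10, seed=0))
print(shuffled[:10])
```

A larger buffer mixes samples from more shards at the cost of memory; with a buffer as large as the dataset the result is a full shuffle.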

Iterable-style dataset (NLP)

The dataset can similarly be used for NLP tasks. The following example demonstrates the use of S3IterableDataset for BERT data loading.

# Consider S3 prefix containing hdf5 files.
# Each hdf5 file contains numpy arrays for different variables required for BERT 
# training such as next sentence labels, masks etc.
aws s3 ls --human-readable s3://path/to/s3_prefix | sed 3q

file_1.hdf5
file_2.hdf5
file_3.hdf5

import torch
from torch.utils.data import IterableDataset, DataLoader
from itertools import islice
import h5py
import numpy as np
import io
from awsio.python.lib.io.s3.s3dataset import S3IterableDataset

def create_data_samples_from_file(fileobj):
    # Converts bytes data to numpy arrays
    keys = ['input_ids', 'input_mask', 'segment_ids', \
        'masked_lm_positions', 'masked_lm_ids', 'next_sentence_labels']
    dataset = io.BytesIO(fileobj)
    with h5py.File(dataset, "r") as f:
        data_file = [np.asarray(f[key][:]) for key in keys]
    return data_file

class s3_dataset(IterableDataset):

    def __init__(self, urls):
        self.urls = urls
        self.dataset = S3IterableDataset(self.urls, shuffle_urls=True)

    def data_generator(self):
        try:
            while True:
                filename, fileobj = next(self.dataset_iter)
                # data_samples: list of six numpy arrays 
                data_samples = create_data_samples_from_file(fileobj)
                
                for sample in zip(*data_samples):
                    # Preprocess sample if required and then yield
                    yield sample

        except StopIteration:
            return

    def __iter__(self):
        self.dataset_iter = iter(self.dataset)
        return self.data_generator()

urls = "s3://path/to/s3_prefix"
train_dataset = s3_dataset(urls)
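
The step that turns the per-variable arrays into per-sample tuples is the zip(*data_samples) call. The sketch below shows the same transposition on plain lists with made-up values; `rows_from_columns` is a hypothetical name, and only two of the six variables are included to keep it short.

```python
def rows_from_columns(columns):
    """Turn a list of equally long columns into a list of per-sample tuples."""
    return list(zip(*columns))


# Two samples, stored column-wise as in the hdf5 files.
input_ids = [[101, 7, 102], [101, 9, 102]]
next_sentence_labels = [0, 1]

samples = rows_from_columns([input_ids, next_sentence_labels])
for sample in samples:
    print(sample)
```

Each tuple holds one entry per variable, in the order of the keys list, which is the shape the generator yields to the DataLoader.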

Test Coverage

To check Python test coverage, install coverage.py as follows:

pip install coverage

To make sure that all tests are run, please also install pytest, boto3, and pandas as follows:

pip install pytest boto3 pandas

To run tests and calculate coverage:

coverage erase
coverage run -p --source=awsio -m pytest -v tests/py-tests/test_regions.py \
tests/py-tests/test_utils.py \
tests/py-tests/test_s3dataset.py \
tests/py-tests/test_s3iterabledataset.py \
tests/py-tests/test_read_datasets.py \
tests/py-tests/test_integration.py
coverage combine
coverage report -m

Comments

installation via pip or conda

Is there a way to use the package through pip/conda without docker? I'd also like to know what additional steps are needed to set this up in jupyter lab.

opened by vaibhavnayel 8

Installation not working

I am trying to install this package as: pip install --no-cache-dir -U https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl

(Grabbed the wheel location from the docker file)

And when I run the test: bash tests/smoke_tests/import_awsio.sh

I see the following error: from awsio._version import version ModuleNotFoundError: No module named 'awsio._version'

opened by sravya8 7

ValueError: curlCode: 77, Problem with the SSL CA cert (path? access rights?)

Hi, I'm attempting to test S3 datasets in a Jupyter notebook on an EC2 instance. I've configured the cli with aws configure and the following command lists files successfully:

url_path = 's3://data-sm/webdataset/dataloading_benchmarks/dogs/shards/'
!aws s3 ls {url_path}

However, the following results in an error:

from awsio.python.lib.io.s3.s3dataset import list_files
urls = list_files(url_path)

ValueError                                Traceback (most recent call last)
Input In [216], in
      7 # Errors
      8 from awsio.python.lib.io.s3.s3dataset import list_files
----> 9 urls = list_files(url_path)

File ~/anaconda3/envs/py38/lib/python3.8/site-packages/awsio/python/lib/io/s3/s3dataset.py:97, in list_files(url)
     94 """Returns a list of entries under the same prefix.
     95 """
     96 handler = _pywrap_s3_io.S3Init()
---> 97 return handler.list_files(url)

ValueError: curlCode: 77, Problem with the SSL CA cert (path? access rights?) with address : 3.5.84.1 with address : 3.5.84.1

Any idea what's up with this error?

opened by austinmw 5

segmentation fault on Plugin tests

Hi, I am trying to build and install the plugin from git source on EC2 (with RHEL 8). When I run the smoke test, or any other test, a segmentation fault is thrown as soon as I hit _pywrap_s3_io.s3Init(). Do you have any suggestions? Does the plugin work on RHEL?

opened by mahb324 4

Installation

Please provide clear documentation for installing on clusters other than AWS-specific ones. I guess this plugin would be helpful for everyone who consumes data from S3.

Documentation is needed w.r.t. a Docker Hub image or a Python package for installing this.

opened by anmoldhingra1 2

Thread safety: Dataloaders (with multiple workers) only supports multiprocessing_context "spawn"

This may be a documentation issue:

PyTorch dataloaders only appear to work when multiprocessing_context is set to "spawn" (or when workers=0).

At least with PyTorch 1.10:

>>> torch.__version__
'1.10.0+cu113'

I obtain the error

ERROR: Unexpected segmentation fault encountered in worker.

unless multiprocessing_context = "spawn" is explicitly set.

opened by rehno-lindeque 2

Possible to use with Sagemaker?

Hello,

Is it possible to use this plugin with Sagemaker? In the documentation, it looks like each Pytorch Estimator needs a .tar.gz file on S3, which seems incompatible with using this Dataset and keeping the data on S3 (rather than copying and unzipping). https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html

opened by jbohnslav 2

Support S3_ENDPOINT_URL for non-AWS storage.

Add a flag to support endpointOverride configuration, which enables connecting to and using object stores that require it.

As an example of other tools using this pattern see s5cmd.

Manually tested and validated to work.

Issue #, if available:

Description of changes: Add command line flag that plumbs into AWS client configuration.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

opened by joshuarobinson 2

Where the wheel file

I am trying to install this plugin from a wheel file on Windows, but I can't find the wheel file. Can you provide instructions for installing the plugin on Windows? Thank you.

opened by Di0826 2

How does it compare to the recently released torchdata?

Torchdata includes some DataPipes, such as the fsspec and iopath ones, that allow connecting to cloud providers (such as AWS S3).

How does the amazon-s3-plugin-for-pytorch compare to torchdata?

opened by gcheron 1

Data files being downloaded in working directory

I've been using the S3Dataset classes to read data from S3. It works; however, all the image files are downloaded into my working directory when reading the URL via

fileobj = self.handler.s3_read(filename)

where self.handler is _pywrap_s3_io.S3Init(): https://github.com/aws/amazon-s3-plugin-for-pytorch/blob/919729ca97f154f297e6038177b16cbcc04293ef/awsio/python/lib/io/s3/s3dataset.py#L146

Is there a way to specify a different directory to where the images should be downloaded, or re-direct the download to a separate temp directory?

opened by SamTNeoX 1

cert Error

url_objects = handler.list_files(url)
ValueError: curlCode: 77, Problem with the SSL CA cert (path? access rights?) with address : 52.**.**.** with address : 52.**.**.**

This function throws an error when trying to read the bucket. I'm able to access the bucket via the CLI with aws s3 ls <s3-bucket-name>, so my .aws file is well configured.

I would like some help.

opened by dorbittonn 0

Allow increasing executorPoolSize

A p4d.24xl offers 4x100Gbps of throughput. 25 threads will most likely not max out available bandwidth. Allowing configuration of executorPoolSize would allow for more threads, and faster s3 throughput.

I'd need to run a test with this library, but recently I saw 100Gibps of throughput to an m5n.24xl using ~90 threads downloading from S3, whereas with 25 threads downloading from S3 I got just 44.286Gibps of throughput.

Currently this library hard-codes 25 threads for s3 downloads: https://github.com/aws/amazon-s3-plugin-for-pytorch/blob/38284c8a5e92be3bbf47b08e8c90d94be0cb79e7/awsio/csrc/io/s3/s3_io.cpp#L46

opened by cobookman 1

Reading object metadata

I was wondering if there is a way to also fetch an object's metadata when reading the object itself. I am trying to use ImageNet to train an image classification model, similar to what is done in s3_imagenet_example.py, but I am trying to add the image class as metadata for the object itself.

opened by rkoo19 4
