A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.

Overview

TorchData ( 🚨 Warning: Unstable Prototype 🚨 )

Why torchdata? | Install guide | What are DataPipes? | Prototype Usage and Feedback | Contributing | Future Plans

This is a prototype library currently under heavy development. It does not currently have stable releases, so it will likely be modified significantly in BC-breaking ways until the beta release (targeting early 2022), and it can only be used with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear thoughts and feedback.

torchdata is a prototype library of common modular data loading primitives for easily constructing flexible and performant data pipelines.

It aims to provide composable iter-style and map-style building blocks called DataPipes that work well out of the box with the PyTorch DataLoader. Right now it contains only the basic functionality needed to reproduce several datasets in TorchVision and TorchText, including loading, parsing, caching, and several other utilities (e.g. hash checking). We plan to expand and harden this set considerably over the coming months.

To understand the basic structure of DataPipes, please see What are DataPipes? below, and to see how DataPipes can be practically composed into datasets, please see our examples/ directory.

Note that because many features of the original DataLoader have been modularized into DataPipes, some now live as standard DataPipes in pytorch/pytorch rather than torchdata to preserve BC functional parity within torch.

Why composable data loading?

Over many years of feedback and organic community usage of the PyTorch DataLoader and DataSets, we've found that:

  1. The original DataLoader bundled too many features together, making them difficult to extend, manipulate, or replace. This has created a proliferation of use-case specific DataLoader variants in the community rather than an ecosystem of interoperable elements.
  2. Many libraries, including each of the PyTorch domain libraries, have rewritten the same data loading utilities over and over again. We can save OSS maintainers time and effort rewriting, debugging, and maintaining these table-stakes elements.

Installation

Colab

Follow the instructions in this Colab notebook

Local pip or conda

First, set up an environment. We will be installing a nightly PyTorch binary as well as torchdata. If you're using conda, create a conda environment:

conda create --name torchdata
conda activate torchdata

If you wish to use venv instead:

python -m venv torchdata-env
source torchdata-env/bin/activate

Next, install one of the following PyTorch nightly binaries.

# For CUDA 10.2
pip install --pre torch -f https://download.pytorch.org/whl/nightly/cu102/torch_nightly.html
# For CUDA 11.1
pip install --pre torch -f https://download.pytorch.org/whl/nightly/cu111/torch_nightly.html
# For CPU-only build
pip install --pre torch -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html

If you already have a PyTorch nightly installed and want to upgrade it (recommended!), append --upgrade to one of those commands.
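
For example, upgrading an existing CPU-only nightly in place looks like this:

pip install --pre --upgrade torch -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html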

Install torchdata:

pip install --user "git+https://github.com/pytorch/data.git"

Run a quick sanity check in python:

from torchdata.datapipes.iter import HttpReader
URL = "https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv"
ag_news_train = HttpReader([URL]).parse_csv().map(lambda t: (int(t[0]), " ".join(t[1:])))
agn_batches = ag_news_train.batch(2).map(lambda batch: {'labels': [sample[0] for sample in batch],
                                                        'text': [sample[1].split() for sample in batch]})
batch = next(iter(agn_batches))
assert batch['text'][0][0:8] == ['Wall', 'St.', 'Bears', 'Claw', 'Back', 'Into', 'the', 'Black']

From source

$ pip install -e git+https://github.com/pytorch/data#egg=torchdata

What are DataPipes?

Early on, we observed widespread confusion between the PyTorch DataSets that represented reusable loading tooling (e.g. TorchVision's ImageFolder) and those that represented pre-built iterators/accessors over actual data corpora (e.g. TorchVision's ImageNet). This led to an unfortunate pattern of siloed inheritance of data tooling rather than composition.

DataPipe is simply a renaming and repurposing of the PyTorch DataSet for composed usage. A DataPipe takes in some access function over Python data structures, __iter__ for IterDataPipes and __getitem__ for MapDataPipes, and returns a new access function with a slight transformation applied. For example, take a look at this JsonParser, which accepts an IterDataPipe over file names and raw streams and produces a new iterator over the file names and deserialized data:

import json

from torchdata.datapipes.iter import IterDataPipe

class JsonParserIterDataPipe(IterDataPipe):
    def __init__(self, source_datapipe, **kwargs):
        self.source_datapipe = source_datapipe
        self.kwargs = kwargs

    def __iter__(self):
        for file_name, stream in self.source_datapipe:
            data = stream.read()
            yield file_name, json.loads(data)

    def __len__(self):
        return len(self.source_datapipe)

You can see in this example how DataPipes can be easily chained together to compose graphs of transformations that reproduce sophisticated data pipelines, with streamed operation as a first-class citizen.

Under this naming convention, DataSet simply refers to a graph of DataPipes, and a dataset module like ImageNet can be rebuilt as a factory function returning the requisite composed DataPipes. Note that the vast majority of initial support is focused on IterDataPipes, while more MapDataPipe support will come later.
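
As a rough sketch of that idea, such a factory function might look like the following, reusing the JsonParserIterDataPipe defined above; the folder argument and the .json filter are purely illustrative:

import torch.utils.data.datapipes as dp

def json_dataset(folder):
    # List files under the folder and keep only JSON files (illustrative filter)
    pipe = dp.iter.FileLister([folder]).filter(fn=lambda name: name.endswith('.json'))
    # Open each file as a (file name, stream) pair
    pipe = dp.iter.FileLoader(pipe)
    # Deserialize each stream using the JsonParserIterDataPipe defined above
    return JsonParserIterDataPipe(pipe)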

Implementing DataPipes

As a guiding example, let's implement an IterDataPipe that applies a callable to the input iterator. For MapDataPipes, take a look at the map folder for examples; the steps below still apply, with the __getitem__ method taking the place of __iter__.

Naming

The naming convention for DataPipes is "Operation"-er, followed by IterDataPipe or MapDataPipe, as each DataPipe is essentially a container to apply an operation to data yielded from a source DataPipe. For succinctness, we alias to just "Operation-er" in init files. For our IterDataPipe example, we'll name the module MapperIterDataPipe and alias it as iter.Mapper under datapipes.
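
As a rough illustration of that aliasing (the exact file path and module name are assumptions for the example), the init file entry could look like:

# e.g. in datapipes/iter/__init__.py (illustrative path and module name)
from .callable import MapperIterDataPipe as Mapper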

Constructor

DataSets are now generally constructed as stacks of DataPipes, so each DataPipe typically takes a source DataPipe as its first argument.

class MapperIterDataPipe(IterDataPipe):
    def __init__(self, dp, fn):
        super().__init__()
        self.dp = dp
        self.fn = fn

Note:

  • Avoid loading data from the source DataPipe in the __init__ function, in order to support lazy data loading and save memory.
  • If an IterDataPipe instance holds data in memory, be aware of in-place modification of that data. When a second iterator is created from the instance, the data may have already changed. Take the IterableWrapper class as a reference for deep-copying data for each iterator, as sketched below.
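
A minimal sketch of that deep-copy pattern, assuming the held data is a plain in-memory sequence (the class name is illustrative):

import copy

class InMemorySequenceIterDataPipe(IterDataPipe):
    def __init__(self, data):
        self.data = data

    def __iter__(self):
        # Hand each iterator its own deep copy, so in-place modification by one
        # consumer does not leak into iterators created later from this instance.
        yield from copy.deepcopy(self.data)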

Iterator

For IterDataPipes, an __iter__ method is needed to consume data from the source IterDataPipe, apply the operation, and then yield the result.

class MapperIterDataPipe(IterDataPipe):
    ...

    def __iter__(self):
        for d in self.dp:
            yield self.fn(d)

Length

In many cases, as in our MapperIterDataPipe example, the __len__ method of a DataPipe returns the length of the source DataPipe.

class MapperIterDataPipe(IterDataPipe):
    ...

    def __len__(self):
        return len(self.dp)

However, note that __len__ is optional for IterDataPipe and often inadvisable. For the CSVParserIterDataPipe in the Using DataPipes section below, __len__ is not implemented because the number of rows in each file is unknown before loading it. In some special cases, __len__ can be made to either return an integer or raise an error depending on the input. In those cases, the error must be a TypeError to support Python's built-in functions like list(dp).
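
As a sketch of that special case (the pass-through DataPipe below is hypothetical), __len__ can forward the source's length when it is known and otherwise re-raise a TypeError:

class PassthroughIterDataPipe(IterDataPipe):
    def __init__(self, dp):
        self.dp = dp

    def __iter__(self):
        yield from self.dp

    def __len__(self):
        try:
            return len(self.dp)  # valid when the source DataPipe has a known length
        except TypeError:
            # Must remain a TypeError so built-ins like list(dp) fall back to iteration
            raise TypeError(f"{type(self).__name__} instance doesn't have valid length") from None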

Registering DataPipes with the functional API

Each DataPipe can be registered to support functional invocation using the decorator functional_datapipe.

@functional_datapipe("map")
class MapperIterDataPipe(IterDataPipe):
    ...

The stack of DataPipes can then be constructed in functional form:

>>> import torch.utils.data.datapipes as dp
>>> datapipes1 = dp.iter.FileLoader(['a.file', 'b.file']).map(fn=decoder).shuffle().batch(2)

>>> datapipes2 = dp.iter.FileLoader(['a.file', 'b.file'])
>>> datapipes2 = dp.iter.Mapper(datapipes2, fn=decoder)
>>> datapipes2 = dp.iter.Shuffler(datapipes2)
>>> datapipes2 = dp.iter.Batcher(datapipes2, 2)

In the above example, datapipes1 and datapipes2 represent the exact same stack of IterDataPipes.

Using DataPipes

For a complete example, suppose we want to load data from CSV files with the following steps:

  • List all csv files in a directory
  • Load csv files
  • Parse csv file and yield rows

To support the above pipeline, CSVParser is registered as parse_csv_files to consume file streams and expand them as rows.

import csv

@functional_datapipe("parse_csv_files")
class CSVParserIterDataPipe(IterDataPipe):
    def __init__(self, dp, **fmtparams):
        self.dp = dp
        self.fmtparams = fmtparams

    def __iter__(self):
        for filename, stream in self.dp:
            reader = csv.reader(stream, **self.fmtparams)
            for row in reader:
                yield filename, row

Then, the pipeline can be assembled as follows:

>>> import torch.utils.data.datapipes as dp

>>> FOLDER = 'path/2/csv/folder'
>>> datapipe = dp.iter.FileLister([FOLDER]).filter(fn=lambda filename: filename.endswith('.csv'))
>>> datapipe = dp.iter.FileLoader(datapipe, mode='rt')
>>> datapipe = datapipe.parse_csv_files(delimiter=' ')

>>> for d in datapipe: # Start loading data
...     pass
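
Because DataPipes work with the existing DataLoader out of the box (see above), the assembled pipeline can also be consumed through it; the batch size and pass-through collate_fn below are arbitrary, illustrative choices:

>>> from torch.utils.data import DataLoader
>>> dl = DataLoader(dataset=datapipe, batch_size=2, collate_fn=lambda batch: batch)
>>> for batch in dl:  # each batch is a list of 2 (filename, row) pairs
...     pass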

Contributing

We welcome PRs! See the CONTRIBUTING file.

Prototype Usage and Feedback

We'd love to hear from and work with early adopters to shape our designs. Please reach out by raising an issue if you're interested in using this tooling for your project.

Future Plans

We hope to sufficiently expand the library, harden APIs, and gather feedback to enable a beta release at the time of the PyTorch 1.11 release (early 2022).

License

TorchData is BSD licensed, as found in the LICENSE file.

Comments
  • S3 datapipes

    Changes

    • Added S3FileLister and S3FileLoader IterDataPipes.
    • Added pybind11 build for s3 io cpp files and python scripts.

    TODO

    • [x] clean up setup files and link pybind11 in CMAKE_PREFIX automatically.
    • [x] remove aws-cpp-sdk dependency at build with BUILD_S3 env var & pop exceptions when missing dependencies at usage.
    • [x] new api changes for list_files.
    • [x] clean up cpp files (naming, new structure, new logic etc.)
    • [x] expose timeouts, regions.
    • [x] thorough tests
      • [x] different correct usage: bucket (with or without / at last), folder (with or without / at last), prefix, item.
      • [x] different incorrect usage: non-existing files, wrong s3 urls, etc.
      • [x] region changes
      • [x] choice of public datasets
    • [x] benchmarks
      • [x] performance test
    • [x] README.md
      • [x] user guide & recommendations
      • [x] dependencies
    CLA Signed 
    opened by ydaiming 39
  • Add list_file() functional API to FSSpecFileLister and IoPathFileLister

    Fixes #387

    Changes

    • Adds list_file() method on IoPathFileListerIterDataPipe
    • Adds list_file() method on FSSpecFileListerIterDataPipe
    • Add tests for those methods

    Additional comments

    I feel as if the implementation is quite naive. Would appreciate any feedback on it.

    CLA Signed 
    opened by xiurobert 25
  • Graph traversal is broken for custom iter datapipes

    from torch.utils.data.graph import traverse
    from torchdata.datapipes.iter import IterDataPipe, IterableWrapper
    
    
    class CustomIterDataPipe(IterDataPipe):
        def noop(self, x):
            return x
    
        def __init__(self):
            self._dp = IterableWrapper([]).map(self.noop)
    
        def __iter__(self):
            yield from self._dp
    
    
    traverse(CustomIterDataPipe())
    
    RecursionError: maximum recursion depth exceeded
    

    Without the .map() call it works fine. I don't think this is specific to .map() though. From trying a few datapipes, this always happens if self._dp is composed in some way.

    bug high priority 
    opened by pmeier 24
  • Refactoring and renaming KeyZipper to IterKeyZipper and MapZipper to MapKeyZipper

    Stack from ghstack:

    • -> #50

    Since MapZipper has been added, the name KeyZipper is confusing and should be changed to IterKeyZipper instead. We are also changing MapZipper to MapKeyZipper to ensure the names stay matching.

    Note that this renaming is BC breaking for users.

    Differential Revision: D31487393

    CLA Signed Merged 
    opened by NivekT 22
  • [RFC] Disable the multiple Iterators per IterDataPipe (Make Iterator singleton)

    This is the initial draft. I will complete it shortly.

    State of Iterator is attached to each IterDataPipe instance. This is super useful for:

    • Determinism
    • Snapshotting
    • Benchmarking -> It becomes easier to register each DataPipe since they have different ID in the graph.

    Implementation Options:

    • Each DataPipe has an attribute _iterator as the placeholder for __iter__ calls.
    • Implement __next__. (My preference)
      • It would make the instance picklable. Previously, the generator function (__iter__) is not picklable (helps multiprocessing and snapshotting).
      • __iter__ returns self (Forker(self) may be another option, not 100% sure).
      • IMO, this is super useful as we can track the number of __next__ calls to do a fast forward. The state of iteration is attached to the DataPipe instance, rather than to a temporary instance created from __iter__, whose internal state we couldn't track. (We can easily track states like RNG, iteration number, buffer, etc., as they are going to be attached to the self instance.)
      • The source DataPipe is attached to each DataPipe, but the actual iteration happens at the Iterator level, so the graph constructed by DataLoaderV2 doesn't match the actual execution graph.

    DataLoader triggers an error if there are two DataPipe instances with the same id in the graph. (Another option is for DataLoader to do an automatic fork.) Users should use Forker for each DataPipe they want to appear twice in the graph.

    cc: @VitalyFedyunin @NivekT

    opened by ejguan 22
  • Issue during import of portalocker on windows

    🐛 Describe the bug

    Currently, TorchText CI is broken on Windows due to the following error:

    ImportError: DLL load failed while importing win32file: The specified module could not be found
    

    The error occurred during import of portalocker.

    cc: @vitaly-fedyunin

    Versions

    Latest from main

    opened by parmeet 19
  • Refactor OnDiskCache

    Fixes https://github.com/facebookexternal/torchdata/issues/114 and https://github.com/facebookexternal/torchdata/issues/140

    Stack from ghstack:

    • #61 Refactor OnDiskCache

    ~This PR relies on a patch in PyTorch Core https://github.com/pytorch/pytorch/pull/67783~ (Landed)

    Refactor OnDiskCacheHolder to track a sequence of DataPipe operations.

    • Yield filepath rather than file handle
    • filepath_fn also supports multiple outputs like list or tuple of file paths or generator function to yield multiple file paths.
    • hash_dict and hash_type are used to support hash checking. If specified, the pipeline will check the data against the hash before saving it to the local file system, and will raise an error when the data doesn't match the hash.
    • Optional extra_check_fn can be used to do extra check for each file (This function should take filepath as input)
    • To track the sequence of DataPipe operations, users could use functional API or DataPipe constructor
    • The returned data at the end of operations should be (metadata, bytes/string) or (metadata, filehandle)

    For end_caching:

    • Refactor it to a separate DataPipe class
    • mode is used to determine how to save the data or how to read from file handles
    • filepath_fn is an optional function to be applied to the metadata of result DataPipe
    • same_filepath_fn is used to indicate that the same filepath_fn from OnDiskCacheHolder will be used.
    • skip_read is a flag to skip reading from file handles before saving to the local file system.

    Features

    • Supports both functional API and DataPipe constructor
    • Supports multiple on_disk_cache in the pipeline

    Use case

    • Single file with hash check
    temp_dir = tempfile.TemporaryDirectory()
    
    tar_file_dp = IterableWrapper([tar_file_url])
    
    def _filepath_fn(url):
        filename = os.path.basename(url)
        return os.path.join(temp_dir.name, filename)
    
    tar_hash_dict = {"xxxx": "yyyy"}
    
    tar_cache_dp = tar_file_dp.on_disk_cache(filepath_fn=_filepath_fn, hash_dict=tar_hash_dict, hash_type="md5")
    
    # Option 1
    # Add map function to transform url to file path
    # tar_cache_dp = HttpReader(tar_cache_dp).map(fn=_filepath_fn, input_col=0)
    # tar_cache_dp = tar_cache_dp.end_caching(mode="wb")
    
    # Option2 use `same_filepath_fn`
    tar_cache_dp = HttpReader(tar_cache_dp).end_caching(mode="wb", same_filepath_fn=True)
    
    • Multiple files
    # - csv.tar
    # | - 0.csv
    # | - 1.csv
    # | - 2.csv
    
    archive_dp = IterableWrapper([archive_file_path])
    
    def _gen_filepath_fn(archive_path): # Generator function
        for i in range(3):
            yield os.path.join(os.path.dirname(archive_path), "csv", "{}.csv".format(i))
    
    file_cache_dp = OnDiskCacheHolder(archive_dp, filepath_fn=_gen_filepath_fn)
    file_cache_dp = FileLoader(file_cache_dp, mode="rb")
    file_cache_dp = TarArchiveReader(file_cache_dp)
    file_cache_dp = file_cache_dp.map(fn=lambda x: x.read().decode(), input_col=1)
    
    def _csv_filepath_fn(csv_path):
        return os.path.join(os.path.dirname(os.path.dirname(csv_path)), "csv", os.path.basename(csv_path))
    
    # Text mode and skip_read as the data is read and decoded
    file_cache_dp = EndOnDiskCacheHolder(file_cache_dp, mode="w", filepath_fn=_csv_filepath_fn, skip_read=True)
    

    cc: @pmeier

    Differential Revision: D31734382

    CLA Signed Merged ciflow/slow 
    opened by ejguan 18
  • [DataPipe] Adding kwargs for `fs.open()` in fsspec DataPipes

    Stack from ghstack:

    • -> #804

    Fixes #803

    I left FSSpecFileLister untouched since I don't think it will be useful for fs.ls() to accept kwargs.

    Differential Revision: D40038331

    CLA Signed 
    opened by NivekT 17
  • Exception: Could not get the file at https://s3.amazonaws.com/... [RequestException] None.

    🐛 Describe the bug

    Code that throws the exception:

    from torchtext.datasets import WikiText2
    train_iter, val_iter, test_iter = WikiText2()

    The code refuses to download via torchtext.datasets, but I can download the data right from the browser just fine.

    Exception raised:

    Traceback (most recent call last):
      ...
      File "C:\python\Python310\lib\ssl.py", line 1341, in do_handshake
        self._sslobj.do_handshake()
    ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:997)

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      ...
    urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='s3.amazonaws.com', port=443): Max retries exceeded with url: /research.metamind.io/wikitext/wikitext-2-v1.zip (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:997)')))

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      ...
    requests.exceptions.SSLError: HTTPSConnectionPool(host='s3.amazonaws.com', port=443): Max retries exceeded with url: /research.metamind.io/wikitext/wikitext-2-v1.zip (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:997)')))

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "c:\python\my\1. pytorch\language_modeling_transf.py", line 83
        vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=[''])
      ...
      File "C:\python\Python310\lib\site-packages\torchdata\datapipes\iter\load\online.py", line 24, in _get_response_from_http
        raise Exception(f"Could not get the file at {url}. [RequestException] {e.response}.")
    Exception: Could not get the file at https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip. [RequestException] None.

    Versions

    PyTorch version: 1.11.0+cu113
    Is debug build: False
    CUDA used to build PyTorch: 11.3
    ROCM used to build PyTorch: N/A

    OS: Microsoft Windows 10
    GCC version: Could not collect
    Clang version: Could not collect
    CMake version: Could not collect
    Libc version: N/A

    Python version: 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
    Python platform: Windows-10-10.0.19044
    Is CUDA available: True
    CUDA runtime version: 11.3.58
    GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060 Ti
    Nvidia driver version: 512.15
    cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.3\bin\cudnn_ops_train64_8.dll
    HIP runtime version: N/A
    MIOpen runtime version: N/A
    Is XNNPACK available: True

    Versions of relevant libraries:
    [pip3] numpy==1.22.3
    [pip3] torch==1.11.0+cu113
    [pip3] torchaudio==0.11.0+cu113
    [pip3] torchdata==0.3.0
    [pip3] torchtext==0.12.0
    [pip3] torchvision==0.12.0+cu113
    [conda] Could not collect

    bug 
    opened by Creative-Ataraxia 17
  • Implement DistributedReadingService

    Add DistributedReadingService

    • Single process
    • Share shuffle seeds across distributed processes
    • Automatic distributed sharding

    Add tests for both DataLoader2 and DataLoader.

    • Spawn processes
    • Elastic training
    CLA Signed 
    opened by ejguan 16
  • update s3 test cases

    Please read through our contribution guide prior to creating your pull request.

    • Note that there is a section on requirements related to adding a new DataPipe.

    Fixes #460

    Changes

    • update the s3 test cases due to an update in the public dataset
    CLA Signed 
    opened by ydaiming 16
  • Weird behaviour of `InMemoryCacheHolder` not really speeding things up

    🐛 Describe the bug

    Weird behaviour of InMemoryCacheHolder not really speeding things up

    First iteration took 9s, all the others 4s. Why? Shouldn't it be cached?

    # download camvid and place it here
    import torchdata.datapipes.iter as pipes
    from pathlib import Path
    from torchvision.io import read_image
    from torch.utils.data import DataLoader
    from time import perf_counter
    from PIL import Image
    
    dataset_dir = Path('./camvid')
    
    pipe = pipes.Zipper(
        pipes.FileLister([dataset_dir / "images"], masks='*png'),
    ).map(lambda x: (read_image(x[0])))
    
    pipe = pipes.InMemoryCacheHolder(pipe, size=32000).sharding_filter() # 8GB
    dl = DataLoader(pipe, batch_size=32, num_workers=8, persistent_workers=True, prefetch_factor=2)
    
    for i in range(10):
        start = perf_counter()
        for data in dl:
            # print(image.shape)
            continue
    
        print(f"[{i}]Elapsed {perf_counter() - start: .2f}")
    

    Output

    [0]Elapsed  18.8
    [1]Elapsed  4.41
    [2]Elapsed  4.47
    [3]Elapsed  4.75
    [4]Elapsed  4.53
    [5]Elapsed  4.41
    [6]Elapsed  4.38
    [7]Elapsed  4.41
    [8]Elapsed  4.41
    [9]Elapsed  4.41
    

    If I set num_workers=1, the first iteration is faster, and then all the others are the same

    If I use .batch(32) (useless in RL since, to my understanding, I need more workers to prepare the next batches), I see a speed up

    ...
    pipe = pipes.Zipper(
        pipes.FileLister([dataset_dir / "images"], masks='*png'),
    ).map(lambda x: (read_image(x[0])))
    
    pipe = pipes.InMemoryCacheHolder(pipe, size=32000).batch(32) # 8GB
    
    for i in range(10):
        start = perf_counter()
        for data in pipe:
            # print(image.shape)
            continue
    
        print(f"[{i}]Elapsed {perf_counter() - start: .2f}")
    
    [0]Elapsed  15.99
    [1]Elapsed  0.03
    [2]Elapsed  0.03
    [3]Elapsed  0.03
    [4]Elapsed  0.03
    [5]Elapsed  0.03
    [6]Elapsed  0.03
    [7]Elapsed  0.03
    [8]Elapsed  0.03
    [9]Elapsed  0.03
    

    Thanks!

    Versions

    Collecting environment information...
    PyTorch version: 1.13.1+cu117
    Is debug build: False
    CUDA used to build PyTorch: 11.7
    ROCM used to build PyTorch: N/A
    
    OS: Ubuntu 20.04.5 LTS (x86_64)
    GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
    Clang version: Could not collect
    CMake version: Could not collect
    Libc version: glibc-2.31
    
    Python version: 3.8.10 (default, Nov 14 2022, 12:59:47)  [GCC 9.4.0] (64-bit runtime)
    Python platform: Linux-5.15.0-1026-aws-x86_64-with-glibc2.29
    Is CUDA available: True
    CUDA runtime version: 10.1.243
    CUDA_MODULE_LOADING set to: LAZY
    GPU models and configuration: GPU 0: Tesla V100-SXM2-16GB
    Nvidia driver version: 470.161.03
    cuDNN version: Could not collect
    HIP runtime version: N/A
    MIOpen runtime version: N/A
    Is XNNPACK available: True
    
    Versions of relevant libraries:
    [pip3] numpy==1.24.1
    [pip3] torch==1.13.1
    [pip3] torchdata==0.5.1
    [pip3] torchvision==0.14.1
    [conda] Could not collect
    
    opened by FrancescoSaverioZuppichini 0
  • `_DataPipeSerializationWrapper` doesn't work with multiprocessing Queue

    🐛 Describe the bug

    After https://github.com/pytorch/data/pull/919 landed, a hanging problem happens on macOS or Windows, where spawn is used to create subprocesses by default. See: https://github.com/pytorch/data/actions/runs/3794926183. I was able to mitigate the issue by removing the SerializationWrapper from https://github.com/pytorch/data/blob/e15e1453967ce2f25f6fcd2838caadfd0e2fa811/torchdata/dataloader2/dataloader2.py#L112

    The reason the SerializationWrapper doesn't work is that a multiprocessing.Queue is attached to a DataPipe and sent to subprocesses. Even though I am able to solve my hanging problem in a different way, it would be better to solve this problem directly via the SerializationWrapper.

    The following should be a minimum repro example

    import multiprocessing as mp

    from torchdata.dataloader2 import DataLoader2, PrototypeMultiProcessingReadingService
    from torchdata.datapipes.iter import IterableWrapper

    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    dp = IterableWrapper(list(range(10)))
    # Attach a Queue
    dp.q = q
    dl = DataLoader2(dp, reading_service=PrototypeMultiProcessingReadingService(2, "spawn"))
    for d in dl:
        pass
    

    cc: @NivekT

    Versions

    main

    opened by ejguan 1
  • Start to graduate `PrototypeMultiProcessingReadingService` from "prototype mode"

        unrelated: we should start to graduate it from "prototype mode" and starts find initial pioneering use case adoption~
    

    Originally posted by @wenleix in https://github.com/pytorch/data/pull/919#discussion_r1055161856

    opened by ejguan 3
  • Potential circular import in `prefetcher`

    🐛 Describe the bug

    In prefetcher.py, dataloader2 is imported. There is a potential circular import issue if dataloader2 needs to take some dependency on DataPipe. https://github.com/pytorch/data/blob/fbee6f75c9e630ea793116caea58911d5ad7d6e0/torchdata/datapipes/iter/util/prefetcher.py#L13

    We need to guarantee that the dependency flows one way: dataloader2 depends on datapipes, but not vice versa.

    Versions

    main

    opened by ejguan 10
  • Emphasize `shutdown` should be called for `DataLoader2` at the end of loop

    📚 The doc issue

    When a ReadingService is present, it's better to call shutdown on DataLoader2 to properly clean up either the distributed process group or the persistent worker processes.

    • [ ] We should add a note regarding shutdown to DataLoader2.
    • [ ] We need to add a tutorial section for DataLoader2
    • [ ] Beyond documentation, we need to clean up our test cases to make sure shutdown is called as examples for our customers.

    Suggest a potential alternative/fix

    No response

    opened by ejguan 0
Releases (v0.5.1)
  • v0.5.1 (Dec 16, 2022)

  • v0.5.0 (Oct 27, 2022)

    TorchData 0.5.0 Release Notes

    • Highlights
    • Backwards Incompatible Change
    • Deprecations
    • New Features
    • Improvements
    • Bug Fixes
    • Performance
    • Documentation
    • Future Plans
    • Beta Usage Note

    Highlights

    We are excited to announce the release of TorchData 0.5.0. This release is composed of about 236 commits since 0.4.1, including ones from PyTorch Core since 1.12.1, made by more than 35 contributors. We want to sincerely thank our community for continuously improving TorchData.

    TorchData 0.5.0 updates are focused on consolidating the DataLoader2 and ReadingService APIs and benchmarking. Highlights include:

    • Added support to load data from more cloud storage providers, now covering AWS, Google Cloud Storage, and Azure. Detailed tutorial can be found here
    • Consolidated API for DataLoader2 and provided a few ReadingServices, with detailed documentation now available here
    • Provided more comprehensive DataPipe operations, e.g., random_split, repeat, set_length, and prefetch.
    • Provided pre-compiled torchdata binaries for arm64 Apple Silicon

    Backwards Incompatible Change

    DataPipe

    Changed the returned value of MapDataPipe.shuffle to an IterDataPipe (https://github.com/pytorch/pytorch/pull/83202)

    An IterDataPipe is used to preserve data order.

    MapDataPipe.shuffle
    0.4.1:
    >>> from torch.utils.data import IterDataPipe, MapDataPipe
    >>> from torch.utils.data.datapipes.map import SequenceWrapper
    >>> dp = SequenceWrapper(list(range(10))).shuffle()
    >>> isinstance(dp, MapDataPipe)
    True
    >>> isinstance(dp, IterDataPipe)
    False

    0.5.0:
    >>> from torch.utils.data import IterDataPipe, MapDataPipe
    >>> from torch.utils.data.datapipes.map import SequenceWrapper
    >>> dp = SequenceWrapper(list(range(10))).shuffle()
    >>> isinstance(dp, MapDataPipe)
    False
    >>> isinstance(dp, IterDataPipe)
    True
          

    on_disk_cache no longer accepts generator functions for the filepath_fn argument (https://github.com/pytorch/data/pull/810)

    on_disk_cache
    0.4.1:
    >>> url_dp = IterableWrapper(["https://path/to/filename", ])
    >>> def filepath_gen_fn(url):
    ...     yield from [url + f"/{i}" for i in range(3)]
    >>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)

    0.5.0:
    >>> url_dp = IterableWrapper(["https://path/to/filename", ])
    >>> def filepath_gen_fn(url):
    ...     yield from [url + f"/{i}" for i in range(3)]
    >>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)
    # AssertionError
          

    DataLoader2

    Imposed single iterator constraint on DataLoader2 (https://github.com/pytorch/data/pull/700)

    DataLoader2 with a single iterator
    0.4.1:
    >>> dl = DataLoader2(IterableWrapper(range(10)))
    >>> it1 = iter(dl)
    >>> print(next(it1))
    0
    >>> it2 = iter(dl)  # No reset here
    >>> print(next(it2))
    1
    >>> print(next(it1))
    2

    0.5.0:
    >>> dl = DataLoader2(IterableWrapper(range(10)))
    >>> it1 = iter(dl)
    >>> print(next(it1))
    0
    >>> it2 = iter(dl)  # DataLoader2 resets with the creation of a new iterator
    >>> print(next(it2))
    0
    >>> print(next(it1))
    # Raises exception, since it1 is no longer valid
          

    Deep copy DataPipe during DataLoader2 initialization or restoration (https://github.com/pytorch/data/pull/786, https://github.com/pytorch/data/pull/833)

    Previously, if a DataPipe is being passed to multiple DataLoaders, the DataPipe's state can be altered by any of those DataLoaders. In some cases, that may raise an exception due to the single iterator constraint; in other cases, some behaviors can be changed due to the adapters (e.g. shuffling) of another DataLoader.

    Deep copy DataPipe during DataLoader2 constructor
    0.4.1:
    >>> dp = IterableWrapper([0, 1, 2, 3, 4])
    >>> dl1 = DataLoader2(dp)
    >>> dl2 = DataLoader2(dp)
    >>> for x, y in zip(dl1, dl2):
    ...     print(x, y)
    # RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe...

    0.5.0:
    >>> dp = IterableWrapper([0, 1, 2, 3, 4])
    >>> dl1 = DataLoader2(dp)
    >>> dl2 = DataLoader2(dp)
    >>> for x, y in zip(dl1, dl2):
    ...     print(x, y)
    0 0
    1 1
    2 2
    3 3
    4 4
          

    Deprecations

    DataLoader2

    Deprecated traverse function and only_datapipe argument (https://github.com/pytorch/pytorch/pull/85667)

    Please use traverse_dps, which behaves the same as traverse with only_datapipe=True (https://github.com/pytorch/data/pull/793).

    DataPipe traverse function
    0.4.1:
    >>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)

    0.5.0:
    >>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)
    FutureWarning: `traverse` function and only_datapipe argument will be removed after 1.13.
          

    New Features

    DataPipe

    • Added AIStore DataPipe (https://github.com/pytorch/data/pull/545, https://github.com/pytorch/data/pull/667)
    • Added support for IterDataPipe to trace DataFrames operations (https://github.com/pytorch/pytorch/pull/71931,
    • Added support for DataFrameMakerIterDataPipe to accept dtype_generator to solve unserializable dtype (https://github.com/pytorch/data/pull/537)
    • Added graph snapshotting by counting number of successful yields for IterDataPipe (https://github.com/pytorch/pytorch/pull/79479, https://github.com/pytorch/pytorch/pull/79657)
    • Implemented drop operation for IterDataPipe to drop column(s) (https://github.com/pytorch/data/pull/725)
    • Implemented FullSyncIterDataPipe to synchronize distributed shards (https://github.com/pytorch/data/pull/713)
    • Implemented slice and flatten operations for IterDataPipe (https://github.com/pytorch/data/pull/730)
    • Implemented repeat operation for IterDataPipe (https://github.com/pytorch/data/pull/748)
    • Added LengthSetterIterDataPipe (https://github.com/pytorch/data/pull/747)
    • Added RandomSplitter (without buffer) (https://github.com/pytorch/data/pull/724)
    • Added padden_tokens to max_token_bucketize to bucketize samples based on total padded token length (https://github.com/pytorch/data/pull/789)
    • Implemented thread based PrefetcherIterDataPipe (https://github.com/pytorch/data/pull/770, https://github.com/pytorch/data/pull/818, https://github.com/pytorch/data/pull/826, https://github.com/pytorch/data/pull/842)

    DataLoader2

    • Added CacheTimeout Adapter to redefine cache timeout of the DataPipe graph (https://github.com/pytorch/data/pull/571)
    • Added DistributedReadingService to support uneven data sharding (https://github.com/pytorch/data/pull/727)
    • Added PrototypeMultiProcessingReadingService
      • Added prefetching (https://github.com/pytorch/data/pull/826)
      • Fixed process termination (https://github.com/pytorch/data/pull/837)
      • Enabled deterministic training in distributed/non-distributed environment (https://github.com/pytorch/data/pull/827)
      • Handled empty queue exception properly (https://github.com/pytorch/data/pull/785)

    Releng

    • Provided pre-compiled torchdata binaries for arm64 Apple Silicon (https://github.com/pytorch/data/pull/692)

    Improvements

    DataPipe

    • Fixed error message coming from the single iterator constraint (https://github.com/pytorch/pytorch/pull/79547)
    • Enabled profiler record context in __next__ for IterDataPipe (https://github.com/pytorch/pytorch/pull/79757)
    • Raised warning for unpicklable local function (#547) (https://github.com/pytorch/pytorch/pull/80232, https://github.com/pytorch/data/pull/547)
    • Cleaned up opened streams on a best-effort basis (https://github.com/pytorch/data/pull/560, https://github.com/pytorch/pytorch/pull/78952)
    • Used streaming reading mode for unseekable streams in TarArchiveLoader (https://github.com/pytorch/data/pull/653)
    • Improved GDrive 'content-disposition' error message (https://github.com/pytorch/data/pull/654)
    • Added as_tuple argument for CSVParserIterDataPipe to convert output from list to tuple (https://github.com/pytorch/data/pull/646)
    • Raised an error when HTTPReader gets a 404 response (#160) (https://github.com/pytorch/data/pull/569)
    • Added default no-op behavior for flatmap (https://github.com/pytorch/data/pull/749)
    • Added support to validate input_col with the provided map function for DataPipe (https://github.com/pytorch/pytorch/pull/80267, https://github.com/pytorch/data/pull/755, https://github.com/pytorch/pytorch/pull/84279)
    • Made ShufflerIterDataPipe support snapshotting (#83535)
    • Unified the implementations of in_batch_shuffle and shuffle for IterDataPipe (https://github.com/pytorch/data/pull/745)
    • Made IterDataPipe.to_map_datapipe load data lazily (https://github.com/pytorch/data/pull/765)
    • Added kwargs to open files for FSSpecFileLister and FSSpecSaver (https://github.com/pytorch/data/pull/804)
    • Added missing functional name for FileLister (#86497)

    DataLoader

    • Controlled shuffle option for all DataPipes with the set_shuffle API (https://github.com/pytorch/pytorch/pull/83741)
    • Made distributed process group lazily initialized & share seed via the process group (https://github.com/pytorch/pytorch/pull/85279)

    DataLoader2

    • Improved graph traverse function
      • Added support for unhashable DataPipe (https://github.com/pytorch/pytorch/pull/80509, https://github.com/pytorch/data/pull/559)
      • Added support for all python collection objects (https://github.com/pytorch/pytorch/pull/84079, https://github.com/pytorch/data/pull/773)
    • Ensured finalize and finalize_iteration are called during shutdown or exception (https://github.com/pytorch/data/pull/846)

    Releng

    • Enabled conda release to support GLIBC_2.27 (https://github.com/pytorch/data/pull/859)

    Bug Fixes

    DataPipe

    • Fixed error for static typing (https://github.com/pytorch/data/pull/572, https://github.com/pytorch/data/pull/645, https://github.com/pytorch/data/pull/651, https://github.com/pytorch/pytorch/pull/81275, https://github.com/pytorch/data/pull/758)
    • Fixed fork and unzip operations for the case of a single child (https://github.com/pytorch/pytorch/pull/81502)
    • Corrected the type of exception that is being raised by ShufflerMapDataPipe (https://github.com/pytorch/pytorch/pull/82666)
    • Fixed buffer overflow for unzip when columns_to_skip is specified (https://github.com/pytorch/data/pull/658)
    • Fixed TarArchiveLoader to skip open for opened TarFile stream (https://github.com/pytorch/data/pull/679)
    • Fixed mishandling of exception message in IterDataPipe (https://github.com/pytorch/pytorch/pull/84676)
    • Fixed interface generation in setup.py (#87081)

    Performance

    DataLoader2

    • Added benchmarking for DataLoader2
      • Added AWS cloud configurations (https://github.com/pytorch/data/pull/680)
      • Added benchmark from torchvision training references (https://github.com/pytorch/data/pull/714)

    Documentation

    DataPipe

    • Added examples for data loading with DataPipe
      • Read Criteo TSV and Parquet files and apply TorchArrow operations (https://github.com/pytorch/data/pull/561)
      • Read caltech256 and coco with AIStoreDataPipe (https://github.com/pytorch/data/pull/582)
      • Read from tigergraph database (https://github.com/pytorch/data/pull/783)
    • Improved docstring for DataPipe
      • DataPipe converters (https://github.com/pytorch/data/pull/710)
      • S3 DataPipe (https://github.com/pytorch/data/pull/784)
      • FileOpenerIterDataPipe (https://github.com/pytorch/pytorch/pull/81407)
      • buffer_size for MaxTokenBucketizer (https://github.com/pytorch/data/pull/834)
      • Prefetcher (https://github.com/pytorch/data/pull/835)
    • Added tutorial to load from Cloud Storage Provider including AWS S3, Google Cloud Platform and Azure Blob Storage (https://github.com/pytorch/data/pull/812, https://github.com/pytorch/data/pull/836)
    • Improved tutorial
      • Fixed tutorial for newline on Windows in generate_csv (https://github.com/pytorch/data/pull/675)
      • Improved note on shuffling behavior (https://github.com/pytorch/data/pull/688)
      • Fixed tutorial about shuffling before sharding (https://github.com/pytorch/data/pull/715)
      • Added random_split example (https://github.com/pytorch/data/pull/843)
    • Simplified long type names for online doc (https://github.com/pytorch/data/pull/838)

    DataLoader2

    • Improved docstring for DataLoader2 (https://github.com/pytorch/data/pull/581, https://github.com/pytorch/data/pull/817)
    • Added training examples using DataLoader2, ReadingService and DataPipe (https://github.com/pytorch/data/pull/563, https://github.com/pytorch/data/pull/664, https://github.com/pytorch/data/pull/670, https://github.com/pytorch/data/pull/787)

    Releng

    • Added contribution guide for third-party library (https://github.com/pytorch/data/pull/663)

    Future Plans

    We will continue benchmarking over datasets on local disk and cloud storage using TorchData. We will also continue making DataLoader2 and the related ReadingServices more stable and provide more features, like snapshotting the data pipeline and restoring it from the serialized state. Stay tuned, and we welcome any feedback.

    Beta Usage Note

    This library is currently in the Beta stage and does not have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear thoughts and feedback.

    Source code(tar.gz)
    Source code(zip)
  • v0.4.1 (Aug 5, 2022)

    TorchData 0.4.1 Release Notes

    Bug fixes

    • Fixed DataPipe working with DataLoader in the distributed environment (https://github.com/pytorch/pytorch/pull/80348, https://github.com/pytorch/pytorch/pull/81071, https://github.com/pytorch/pytorch/pull/81071)

    Documentation

    • Updated TorchData tutorial (#675, #688, #715)

    Releng

    • Provided pre-compiled torchdata binaries for arm64 Apple Silicon (#692)
      • Python [3.8~3.10]
    Source code(tar.gz)
    Source code(zip)
  • v0.4.0 (Jun 28, 2022)

    TorchData 0.4.0 Release Notes

    • Highlights
    • Backwards Incompatible Change
    • Deprecations
    • New Features
    • Improvements
    • Performance
    • Documentation
    • Future Plans
    • Beta Usage Note

    Highlights

    We are excited to announce the release of TorchData 0.4.0. This release is composed of about 120 commits since 0.3.0, made by 23 contributors. We want to sincerely thank our community for continuously improving TorchData.

    TorchData 0.4.0 updates are focused on consolidating the DataPipe APIs and supporting more remote file systems. Highlights include:

    • DataPipe graph is now backward compatible with DataLoader regarding dynamic sharding and shuffle determinism in single-process, multiprocessing, and distributed environments. Please check the tutorial here.
    • AWSSDK is integrated to support listing/loading files from AWS S3.
    • Added support to read from TFRecord and Hugging Face Hub.
    • DataLoader2 became available in prototype mode. For more details, please check our future plans.

    Backwards Incompatible Change

    DataPipe

    Updated Multiplexer (functional API mux) to stop merging multiple DataPipes whenever the shortest one is exhausted (https://github.com/pytorch/pytorch/pull/77145)

    Please use MultiplexerLongest (functional API mux_longest) to achieve the previous functionality.

    0.3.0:
    >>> dp1 = IterableWrapper(range(3))
    >>> dp2 = IterableWrapper(range(10, 15))
    >>> dp3 = IterableWrapper(range(20, 25))
    >>> output_dp = dp1.mux(dp2, dp3)
    >>> list(output_dp)
    [0, 10, 20, 1, 11, 21, 2, 12, 22, 3, 13, 23, 4, 14, 24]
    >>> len(output_dp)
    13

    0.4.0:
    >>> dp1 = IterableWrapper(range(3))
    >>> dp2 = IterableWrapper(range(10, 15))
    >>> dp3 = IterableWrapper(range(20, 25))
    >>> output_dp = dp1.mux(dp2, dp3)
    >>> list(output_dp)
    [0, 10, 20, 1, 11, 21, 2, 12, 22]
    >>> len(output_dp)
    9
          

    Enforced a single valid iterator for IterDataPipes with or without multiple outputs (https://github.com/pytorch/pytorch/pull/70479, https://github.com/pytorch/pytorch/pull/75995)

    If you need to reference the same IterDataPipe multiple times, please apply .fork() on the IterDataPipe instance.

    IterDataPipe with a single output
    0.3.0:
    >>> source_dp = IterableWrapper(range(10))
    >>> it1 = iter(source_dp)
    >>> list(it1)
    [0, 1, ..., 9]
    >>> it1 = iter(source_dp)
    >>> next(it1)
    0
    >>> it2 = iter(source_dp)
    >>> next(it2)
    0
    >>> next(it1)
    1
    # Multiple references of DataPipe
    >>> source_dp = IterableWrapper(range(10))
    >>> zip_dp = source_dp.zip(source_dp)
    >>> list(zip_dp)
    [(0, 0), ..., (9, 9)]

    0.4.0:
    >>> source_dp = IterableWrapper(range(10))
    >>> it1 = iter(source_dp)
    >>> list(it1)
    [0, 1, ..., 9]
    >>> it1 = iter(source_dp)  # This doesn't raise any warning or error
    >>> next(it1)
    0
    >>> it2 = iter(source_dp)
    >>> next(it2)  # Invalidates `it1`
    0
    >>> next(it1)
    RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))
    This may be caused multiple references to the same IterDataPipe. We recommend using `.fork()` if that is necessary.
    For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
    # Multiple references of DataPipe
    >>> source_dp = IterableWrapper(range(10))
    >>> zip_dp = source_dp.zip(source_dp)
    >>> list(zip_dp)
    RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))
    This may be caused multiple references to the same IterDataPipe. We recommend using `.fork()` if that is necessary.
    For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
          

    IterDataPipe with multiple outputs
    0.3.0:
    >>> source_dp = IterableWrapper(range(10))
    >>> cdp1, cdp2 = source_dp.fork(num_instances=2)
    >>> it1, it2 = iter(cdp1), iter(cdp2)
    >>> list(it1)
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> list(it2)
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> it1, it2 = iter(cdp1), iter(cdp2)
    >>> it3 = iter(cdp1)
    # Basically share the same reference as `it1`
    # doesn't reset because `cdp1` hasn't been read since reset
    >>> next(it1)
    0
    >>> next(it2)
    0
    >>> next(it3)
    1
    # The next line resets all ChildDataPipe
    # because `cdp2` has started reading
    >>> it4 = iter(cdp2)
    >>> next(it3)
    0
    >>> list(it4)
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

    0.4.0:
    >>> source_dp = IterableWrapper(range(10))
    >>> cdp1, cdp2 = source_dp.fork(num_instances=2)
    >>> it1, it2 = iter(cdp1), iter(cdp2)
    >>> list(it1)
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> list(it2)
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> it1, it2 = iter(cdp1), iter(cdp2)
    >>> it3 = iter(cdp1)  # This invalidates `it1` and `it2`
    >>> next(it1)
    RuntimeError: This iterator has been invalidated, because a new iterator has been created from one of the ChildDataPipes of _ForkerIterDataPipe(buffer_size=1000, num_instances=2).
    For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
    >>> next(it2)
    RuntimeError: This iterator has been invalidated, because a new iterator has been created from one of the ChildDataPipes of _ForkerIterDataPipe(buffer_size=1000, num_instances=2).
    For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
    >>> next(it3)
    0
    # The next line should not invalidate anything, as there was no new iterator created
    # for `cdp2` after `it2` was invalidated
    >>> it4 = iter(cdp2)
    >>> next(it3)
    1
    >>> list(it4)
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
          

    Deprecations

    DataPipe

    Deprecated functional APIs of open_file_by_fsspec and open_file_by_iopath for IterDataPipe (https://github.com/pytorch/pytorch/pull/78970, https://github.com/pytorch/pytorch/pull/79302)

    Please use open_files_by_fsspec and open_files_by_iopath

    0.3.0:
    >>> dp = IterableWrapper([file_path, ])
    >>> dp = dp.open_file_by_fsspec()  # No Warning
    >>> dp = IterableWrapper([file_path, ])
    >>> dp = dp.open_file_by_iopath()  # No Warning

    0.4.0:
    >>> dp = IterableWrapper([file_path, ])
    >>> dp = dp.open_file_by_fsspec()
    FutureWarning: `FSSpecFileOpener()`'s functional API `.open_file_by_fsspec()` is deprecated since 0.4.0 and will be removed in 0.6.0.
    See https://github.com/pytorch/data/issues/163 for details.
    Please use `.open_files_by_fsspec()` instead.
    >>> dp = IterableWrapper([file_path, ])
    >>> dp = dp.open_file_by_iopath()
    FutureWarning: `IoPathFileOpener()`'s functional API `.open_file_by_iopath()` is deprecated since 0.4.0 and will be removed in 0.6.0.
    See https://github.com/pytorch/data/issues/163 for details.
    Please use `.open_files_by_iopath()` instead.
          

    The argument drop_empty_batches of Filter (functional API: filter) is deprecated and will be removed in a future release (https://github.com/pytorch/pytorch/pull/76060)

    0.3.0:
    >>> dp = IterableWrapper([(1, 1), (2, 2), (3, 3)])
    >>> dp = dp.filter(lambda x: x[0] > 1, drop_empty_batches=True)

    0.4.0:
    >>> dp = IterableWrapper([(1, 1), (2, 2), (3, 3)])
    >>> dp = dp.filter(lambda x: x[0] > 1, drop_empty_batches=True)
    FutureWarning: The argument `drop_empty_batches` of `FilterIterDataPipe()` is deprecated since 1.12 and will be removed in 1.14.
    See https://github.com/pytorch/data/issues/163 for details.
          

    New Features

    DataPipe

    • Added utility to visualize DataPipe graphs (https://github.com/pytorch/data/pull/330)
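
    Assuming graphviz is installed, the new utility can be used roughly as follows (a sketch; to_graph from torchdata.datapipes.utils is the assumed entry point, and the render call is illustrative):

    >>> from torchdata.datapipes.iter import IterableWrapper
    >>> from torchdata.datapipes.utils import to_graph
    >>> dp = IterableWrapper(range(10)).map(lambda x: x + 1).shuffle()
    >>> graph = to_graph(dp)  # returns a graphviz.Digraph of the DataPipe graph
    >>> graph.render("pipeline")  # writes the rendered graph to disk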

    IterDataPipe

    • Added Bz2FileLoader with functional API of load_from_bz2 (https://github.com/pytorch/data/pull/312)
    • Added BatchMapper (functional API: map_batches) and FlatMapper (functional API: flat_map) (https://github.com/pytorch/data/pull/359)
    • Added support for WebDataset-style archives (https://github.com/pytorch/data/pull/367)
    • Added MultiplexerLongest with functional API of mux_longest (https://github.com/pytorch/data/pull/372)
    • Added ZipperLongest with functional API of zip_longest (https://github.com/pytorch/data/pull/373); see the sketch after this list
    • Added MaxTokenBucketizer with functional API of max_token_bucketize (https://github.com/pytorch/data/pull/283)
    • Added S3FileLister (functional API: list_files_by_s3) and S3FileLoader (functional API: load_files_by_s3) integrated with the native AWSSDK (https://github.com/pytorch/data/pull/165)
    • Added HuggingFaceHubReader (https://github.com/pytorch/data/pull/490)
    • Added TFRecordLoader with functional API of load_from_tfrecord (https://github.com/pytorch/data/pull/308)
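
    As a quick illustration of the new *Longest variants (a minimal sketch; the outputs follow the documented semantics of padding with None and round-robining until every input is exhausted):

    >>> from torchdata.datapipes.iter import IterableWrapper
    >>> dp1, dp2 = IterableWrapper(range(3)), IterableWrapper(range(5))
    >>> list(dp1.zip_longest(dp2))
    [(0, 0), (1, 1), (2, 2), (None, 3), (None, 4)]
    >>> list(dp1.mux_longest(dp2))
    [0, 0, 1, 1, 2, 2, 3, 4]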

    MapDataPipe

    • Added UnZipper with functional API of unzip (https://github.com/pytorch/data/pull/325)
    • Added MapToIterConverter with functional API of to_iter_datapipe (https://github.com/pytorch/data/pull/327); see the sketch after this list
    • Added InMemoryCacheHolder with functional API of in_memory_cache (https://github.com/pytorch/data/pull/328)
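
    A minimal sketch of how the new MapDataPipe utilities compose (SequenceWrapper is used purely for illustration):

    >>> from torchdata.datapipes.map import SequenceWrapper
    >>> map_dp = SequenceWrapper(range(5)).in_memory_cache()
    >>> iter_dp = map_dp.to_iter_datapipe()
    >>> list(iter_dp)
    [0, 1, 2, 3, 4]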

    Releng

    • Added nightly releases for TorchData. Users should be able to install nightly TorchData via
      • pip install --pre torchdata -f https://download.pytorch.org/whl/nightly/cpu
      • conda install -c pytorch-nightly torchdata
    • Added support for AWSSDK-enabled DataPipes. See: README
      • AWSSDK was pre-compiled and assembled in TorchData for both nightly and 0.4.0 releases

    Improvements

    DataPipe

    • Added optional encoding argument to FileOpener (https://github.com/pytorch/pytorch/pull/72715)
    • Renamed BucketBatcher argument to avoid name collision (https://github.com/pytorch/data/pull/304)
    • Removed default parameter of ShufflerIterDataPipe (https://github.com/pytorch/pytorch/pull/74370)
    • Made the profiler wrapper delegate function calls to the DataPipe iterator (https://github.com/pytorch/pytorch/pull/75275)
    • Added input_col argument to flatmap for applying fn to the specific column(s) (https://github.com/pytorch/data/pull/363)
    • Improved debug message when exceptions are raised within IterDataPipe (https://github.com/pytorch/pytorch/pull/75618)
    • Improved debug message when argument is a tuple/list of DataPipes (https://github.com/pytorch/pytorch/pull/76134)
    • Added functional APIs to StreamReader (functional API: read_from_stream) and FileOpener (functional API: open_files) (https://github.com/pytorch/pytorch/pull/76233)
    • Enabled graph traversal for MapDataPipe (https://github.com/pytorch/pytorch/pull/74851)
    • Added input_col argument to filter for applying filter_fn to the specific column(s) (https://github.com/pytorch/pytorch/pull/76060); see the sketch after this list
    • Added functional APIs for OnlineReaders (https://github.com/pytorch/data/pull/369)
      • HTTPReaderIterDataPipe: read_from_http
      • GDriveReaderDataPipe: read_from_gdrive
      • OnlineReaderIterDataPipe: read_from_remote
    • Cleared buffer for DataPipe during __del__ (https://github.com/pytorch/pytorch/pull/76345)
    • Overrode wrong python https proxy on Windows (https://github.com/pytorch/data/pull/371)
    • Exposed functional API of 'to_map_datapipe' from IterDataPipe's pyi interface (https://github.com/pytorch/data/pull/326)
    • Moved buffer for IterDataPipe from iterator to instance (self) (https://github.com/pytorch/data/pull/388)
    • Improved DataPipe serialization:
      • Enabled serialization of ForkerIterDataPipe (https://github.com/pytorch/pytorch/pull/73118)
      • Fixed issue with DataPipe serialization with dill (https://github.com/pytorch/pytorch/pull/72896)
      • Applied special serialization when dill is installed (https://github.com/pytorch/pytorch/pull/74958)
      • Applied dill serialization for demux and added cache to graph traverse (https://github.com/pytorch/pytorch/pull/75034)
      • Revamp serialization logic of DataPipes (https://github.com/pytorch/pytorch/pull/74984)
      • Prevented automatic reset after state is restored (https://github.com/pytorch/pytorch/pull/77774)
    • Moved IterDataPipe buffers from iter to instance (self) (#76999)
    • Refactored buffer of Multiplexer from __iter__ to instance (self) (https://github.com/pytorch/pytorch/pull/77775)
    • Made GDriveReader handle the Virus Scan Warning (https://github.com/pytorch/data/pull/442)
    • Added **kwargs arguments to HttpReader to specify extra parameters for HTTP requests (https://github.com/pytorch/data/pull/392)
    • Updated FSSpecFileLister and IoPathFileLister to support multiple root paths and updated FSSpecFileLister to support S3 urls (https://github.com/pytorch/data/pull/383)
    • Fixed race condition issue when writing files in multiprocessing
      • Added filelock to IoPathSaver to prevent race condition (https://github.com/pytorch/data/pull/413)
      • Added lock mechanism to prevent on_disk_cache downloading twice (https://github.com/pytorch/data/pull/409)
      • Added instructions about ImportError for portalocker (https://github.com/pytorch/data/pull/506)
    • Added an 's' to the functional names of open/list DataPipes (https://github.com/pytorch/data/pull/479)
    • Added list_file functional API to FSSpecFileLister and IoPathFileLister (https://github.com/pytorch/data/pull/463)
    • Added list_files functional API to FileLister (https://github.com/pytorch/pytorch/pull/78419)
    • Improved FSSpec DataPipes to accept extra keyword arguments (https://github.com/pytorch/data/pull/495)
    • Passed through kwargs to the json.loads call in JsonParser (https://github.com/pytorch/data/pull/518)
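
    For instance, the new input_col argument lets filter_fn inspect a single column of each sample while keeping the full sample in the output (a minimal sketch):

    >>> from torchdata.datapipes.iter import IterableWrapper
    >>> dp = IterableWrapper([(1, "a"), (2, "b"), (3, "c")])
    >>> list(dp.filter(lambda x: x > 1, input_col=0))
    [(2, 'b'), (3, 'c')]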

    DataLoader

    • Added ability to use dill to pass DataPipes in multiprocessing (https://github.com/pytorch/pytorch/pull/77288)
    • DataLoader now automatically applies sharding to the DataPipe graph in single-process, multi-process, and distributed environments; see the sketch after this list (https://github.com/pytorch/pytorch/pull/78762, https://github.com/pytorch/pytorch/pull/78950, https://github.com/pytorch/pytorch/pull/79041, https://github.com/pytorch/pytorch/pull/79124, https://github.com/pytorch/pytorch/pull/79524)
    • Made ShufflerDataPipe deterministic with DataLoader in single-process, multi-process and distributed environments (https://github.com/pytorch/pytorch/pull/77741, https://github.com/pytorch/pytorch/pull/77855, https://github.com/pytorch/pytorch/pull/78765, https://github.com/pytorch/pytorch/pull/79829)
    • Prevented overriding shuffle settings in DataLoader for DataPipe (https://github.com/pytorch/pytorch/pull/75505)
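
    With automatic sharding, placing a sharding_filter in the graph is enough for DataLoader to split the data across workers; a minimal sketch (worker behavior depends on num_workers and the environment):

    >>> from torch.utils.data import DataLoader
    >>> from torchdata.datapipes.iter import IterableWrapper
    >>> dp = IterableWrapper(range(10)).shuffle().sharding_filter()
    >>> dl = DataLoader(dp, batch_size=2, num_workers=2)
    >>> # Each worker now reads a disjoint shard; no manual worker_init_fn-based sharding is required.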

    Releng

    • Made requirements.txt the single source of truth for the TorchData version (https://github.com/pytorch/data/pull/414)
    • Prohibited Release GHA workflows from running on forked branches (https://github.com/pytorch/data/pull/361)

    Performance

    DataPipe

    • Lazily generated exception messages for performance (https://github.com/pytorch/pytorch/pull/78673)
      • This fixes a regression introduced by the PRs related to the single-iterator constraint.
    • Disabled profiler for IterDataPipe by default (https://github.com/pytorch/pytorch/pull/78674)
      • Skipping the record function when the profiler is not enabled speeds up DataPipes by up to 5-6x when their internal operations are very simple (e.g. IterableWrapper).

    Documentation

    DataPipe

    • Fixed typo in TorchVision example (https://github.com/pytorch/data/pull/311)
    • Updated DataPipe naming guidelines (https://github.com/pytorch/data/pull/428)
    • Updated documents from DataSet to PyTorch Dataset (https://github.com/pytorch/data/pull/292)
    • Added examples for graphs, meshes and point clouds using DataPipe (https://github.com/pytorch/data/pull/337)
    • Added examples for semantic segmentation and time series using DataPipe (https://github.com/pytorch/data/pull/340)
    • Expanded the contribution guide, especially including instructions to add a new DataPipe (https://github.com/pytorch/data/pull/354)
    • Updated tutorial about placing sharding_filter (https://github.com/pytorch/data/pull/487)
    • Improved graph visualization documentation (https://github.com/pytorch/data/pull/504)
    • Added instructions about ImportError for portalocker (https://github.com/pytorch/data/pull/506)
    • Updated examples to avoid lambdas (https://github.com/pytorch/data/pull/524)
    • Updated documentation for S3 DataPipes (https://github.com/pytorch/data/pull/534)
    • Updated links for tutorial (https://github.com/pytorch/data/pull/543)

    IterDataPipe

    • Fixed documentation for IterToMapConverter, S3FileLister and S3FileLoader (https://github.com/pytorch/data/pull/381)
    • Updated documentation for S3 DataPipes (https://github.com/pytorch/data/pull/534)

    MapDataPipe

    • Updated contributing guide and added guidance for MapDataPipe (https://github.com/pytorch/data/pull/379)
      • Rather than re-implementing the same functionalities twice for both IterDataPipe and MapDataPipe, we encourage users to use the built-in functionalities of IterDataPipe and use the converter to MapDataPipe as needed.

    DataLoader/DataLoader2

    • Fixed tutorial about DataPipe working with DataLoader (https://github.com/pytorch/data/pull/458)
    • Updated examples and tutorial after automatic sharding has landed (https://github.com/pytorch/data/pull/505)
    • Added README for DataLoader2 (https://github.com/pytorch/data/pull/526, https://github.com/pytorch/data/pull/541)

    Releng

    • Added nightly documentation for TorchData in https://pytorch.org/data/main/
    • Fixed instruction to install TorchData (https://github.com/pytorch/data/pull/455)

    Future Plans

    For DataLoader2, we are introducing new ways for DataPipes, the data loading API, and backends (aka ReadingServices) to interact. The feature is stable in terms of API but not yet functionally complete. We welcome early adopters and feedback, as well as potential contributors.

    Beta Usage Note

    This library is currently in the Beta stage and does not yet have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear thoughts and feedback.

  • v0.3.0 (Mar 10, 2022)

    0.3.0 Release Notes

    We are delighted to present the Beta release of TorchData. This is a library of common modular data loading primitives for easily constructing flexible and performant data pipelines. Based on community feedback, we have found that the existing DataLoader bundled too many features together and can be difficult to extend. Moreover, different use cases often have to rewrite the same data loading utilities over and over again. The goal here is to enable composable data loading through Iterable-style and Map-style building blocks called “DataPipes” that work well out of the box with PyTorch's DataLoader.

    • Highlights
      • What are DataPipes?
      • Usage Example
    • New Features
    • Documentation
    • Usage in Domain Libraries
    • Future Plans
    • Beta Usage Note

    Highlights

    We are releasing DataPipes - both Iterable-style DataPipes (IterDataPipe) and Map-style DataPipes (MapDataPipe).

    What are DataPipes?

    Early on, we observed widespread confusion between the PyTorch DataSets which represented reusable loading tooling (e.g. TorchVision's ImageFolder), and those that represented pre-built iterators/accessors over actual data corpora (e.g. TorchVision's ImageNet). This led to an unfortunate pattern of siloed inheritance of data tooling rather than composition.

    DataPipe is simply a renaming and repurposing of the PyTorch DataSet for composed usage. A DataPipe takes in some access function over Python data structures, __iter__ for IterDataPipes and __getitem__ for MapDataPipes, and returns a new access function with a slight transformation applied. For example, take a look at this JsonParser, which accepts an IterDataPipe over file names and raw streams, and produces a new iterator over the filenames and deserialized data:

    import json
    
    from torchdata.datapipes.iter import IterDataPipe
    
    class JsonParserIterDataPipe(IterDataPipe):
        def __init__(self, source_datapipe, **kwargs) -> None:
            self.source_datapipe = source_datapipe
            self.kwargs = kwargs
    
        def __iter__(self):
            for file_name, stream in self.source_datapipe:
                data = stream.read()
                yield file_name, json.loads(data)
    
        def __len__(self):
            return len(self.source_datapipe) 
    

    You can see in this example how DataPipes can be easily chained together to compose graphs of transformations that reproduce sophisticated data pipelines, with streamed operation as a first-class citizen.
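
    For example, the JsonParserIterDataPipe above could be chained onto the built-in file-listing and file-opening DataPipes (a hypothetical sketch; the directory and file mask are illustrative):

    from torchdata.datapipes.iter import FileLister, FileOpener
    
    # List all JSON files in the current directory, open them as binary streams,
    # then parse each stream into (file_name, deserialized_data) pairs.
    datapipe = FileLister(root=".", masks="*.json")
    datapipe = FileOpener(datapipe, mode="b")
    datapipe = JsonParserIterDataPipe(datapipe)
    for file_name, parsed_json in datapipe:
        print(file_name, parsed_json)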

    Under this naming convention, DataSet simply refers to a graph of DataPipes, and a dataset module like ImageNet can be rebuilt as a factory function returning the requisite composed DataPipes.

    Usage Example

    In this example, we have a compressed TAR archive file stored in Google Drive and accessible via a URL. We demonstrate how you can use DataPipes to download the archive, cache the result, decompress the archive, filter for specific files, parse and return the CSV content. The full example with detailed explanation is included in the example folder.

    url_dp = IterableWrapper([URL])
    # cache_compressed_dp = ...  # on-disk caching setup elided; see source file for full code example
    # Downloads the archive from Google Drive.
    cache_compressed_dp = GDriveReader(cache_compressed_dp)
    # cache_decompressed_dp = ... # See source file for full code example
    # Opens and loads the content of the TAR archive file.
    cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").load_from_tar()
    # Filters for specific files based on the file name.
    cache_decompressed_dp = cache_decompressed_dp.filter(
        lambda fname_and_stream: _EXTRACTED_FILES[split] in fname_and_stream[0]
    )
    # Saves the decompressed file onto disk.
    cache_decompressed_dp = cache_decompressed_dp.end_caching(mode="wb", same_filepath_fn=True)
    data_dp = FileOpener(cache_decompressed_dp, mode="b")
    # Parses the content of the decompressed CSV file and returns the result line by line.
    return data_dp.parse_csv().map(fn=lambda t: (int(t[0]), " ".join(t[1:])))
    

    New Features

    [Beta] IterDataPipe

    We have implemented over 50 Iterable-style DataPipes across 10 different categories. They cover different functionalities, such as opening files, parsing texts, transforming samples, caching, shuffling, and batching. If you are interested in connecting to cloud providers (such as Google Drive or AWS S3), the fsspec and iopath DataPipes allow you to do so. The documentation provides detailed explanations and usage examples of each IterDataPipe.

    [Beta] MapDataPipe

    Similar to IterDataPipe, we have a variety of MapDataPipes available, though a more limited number, covering different functionalities. Support for more MapDataPipes will come later. If the existing ones do not meet your needs, you can write a custom DataPipe.

    Documentation

    The documentation for TorchData is now live. It contains a tutorial that covers how to use DataPipes, use them with DataLoader, and implement custom ones.

    Usage in Domain Libraries

    In this release, some of the PyTorch domain libraries have migrated their datasets to use DataPipes. In TorchText, the popular datasets provided by the library are implemented using DataPipes, and a section of its SST-2 binary text classification tutorial demonstrates how you can use DataPipes to preprocess data for your model. There are also other prototype implementations of datasets with DataPipes in TorchVision (available in nightly releases) and in TorchRec. You can find more specific examples here.

    Future Plans

    There will be a new version of DataLoader in the next release. At a high level, the plan is that DataLoader V2 will only be responsible for multiprocessing, distributed, and similar functionalities, not data processing logic. All data processing features, such as shuffling and batching, will be moved out of DataLoader and into DataPipes. At the same time, the current/old version of DataLoader will still be available, and you can use DataPipes with it as well.

    Beta Usage Note

    This library is currently in the Beta stage and does not yet have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear thoughts and feedback.
