A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way :chestnut:

Overview

Squirrel Core

Share, load, and transform data in a collaborative, flexible, and efficient way

What is Squirrel?

Squirrel is a Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way.

  1. SPEED: Avoid data stalls, i.e. your expensive GPU does not sit idle while waiting for data.

  2. COSTS: Besides avoiding GPU stalls, Squirrel lets you shard & cluster your data and store & load it in bundles, decreasing the cost of your cloud storage bucket.

  3. FLEXIBILITY: Work with a flexible standard data scheme that adapts to any setting, including multimodal data.

  4. COLLABORATION: Make it easier to share data & code between teams and projects in a self-service model.

Stream data from anywhere to your machine learning model as easily as:

it = (Catalog.from_plugins()["imagenet"].get_driver()
      .get_iter("train")
      .map(lambda r: (augment(r["image"]), r["label"]))
      .batched(100))
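
The stream is a regular Python iterable, so you can consume it directly in a training loop (train_step below is a placeholder for your own training code):

for batch in it:
    train_step(batch)  # placeholder for your forward/backward pass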

Check out our full getting started tutorial notebook. If you have any questions or would like to contribute, join our Slack community.

Installation

You can install squirrel-core via pip:

pip install "squirrel-core[all]"

Documentation

Read our documentation at ReadTheDocs.

Example Notebooks

Check out the Squirrel-datasets repository for open-source, community-contributed tutorials and example notebooks on using Squirrel.

Contributing

Squirrel is open source and community contributions are welcome!

Check out the contribution guide to learn how to get involved.

The humans behind Squirrel

We are Merantix Momentum, a team of ~30 machine learning engineers developing machine learning solutions for industry and research. Each project comes with its own challenges, data types, and learnings, but one issue we always faced was scalable data loading, transformation, and sharing. We were looking for a solution that would let us load data in a fast and cost-efficient way, while keeping the flexibility to work with any possible dataset and integrate with any API. That's why we built Squirrel – and we hope you'll find it as useful as we do! By the way, we are hiring!

Citation

If you use Squirrel in your research, please cite it using:

@article{2022squirrelcore,
  title={Squirrel: A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way.},
  author={Squirrel Developer Team},
  journal={GitHub. Note: https://github.com/merantix-momentum/squirrel-core},
  doi={10.5281/zenodo.6418280},
  year={2022}
}
Comments
  • Update Doc-String of MapDriver.get_iter

    • Better document the behavior of max_workers and link to official ThreadPoolExecutor documentation.
    • Update *_map doc-strings that use ThreadPoolExecutor and link to official ThreadPoolExecutor documentation.

    Fixes #60 issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [X] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [X] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [X] I have kept the PR small so that it can be easily reviewed
    • [X] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by kai-tub 14
  • Refactoring DataFrameDriver and related drivers

    Description

    Refactors the DataFrameDriver and all data frame-related drivers. In particular:

    • Fixes a current bug (I believe) where storage options are not passed down when reading via the pandas/dask interface. This affected the implementation of CsvDriver.
    • Refactors the DataFrameDriver base class to provide a common interface for all drivers that use read functionality from pandas or dask. The base class now handles the storage-options and read-argument precedence for all derived classes (see the sketch after this list).
    • Using the new abstraction, adds FeatherDriver, JsonDriver, ParquetDriver, and XlsDriver, and refactors CsvDriver.
    • This does break some datasets using the CsvDriver, as read_csv_kwargs is now renamed to a common read_kwargs. However, so far only two research datasets used this property; see the corresponding PR in squirrel-datasets. So far this is a bit of a rough sketch: I tested the existing CsvDriver-based datasets, but otherwise this requires a bit more cleanup, I suppose.
    • Renames the previous use_dask option to engine across all data frame drivers.
    • Changes the default DataFrame engine to pandas.
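
    To make the precedence handling concrete, a minimal illustrative sketch of how such a shared base class could merge init-time and per-call read arguments might look as follows. Class and method names are assumptions for illustration only, not the actual squirrel implementation:

    import pandas as pd

    class DataFrameDriverSketch:
        """Illustrative only: shared storage-options / read-kwargs handling."""

        def __init__(self, url: str, storage_options: dict = None, read_kwargs: dict = None):
            self.url = url
            self.storage_options = storage_options or {}
            self.read_kwargs = read_kwargs or {}

        def get_df(self, **kwargs):
            # per-call kwargs take precedence over the read_kwargs given at init
            merged = {**self.read_kwargs, **kwargs}
            return self._read(self.url, storage_options=self.storage_options, **merged)

    class CsvDriverSketch(DataFrameDriverSketch):
        def _read(self, url: str, **kwargs) -> pd.DataFrame:
            return pd.read_csv(url, **kwargs)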

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [x] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [x] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [x] I have made corresponding changes to the documentation
    • [x] I have added tests that prove my fix is effective or that my feature works
    • [x] All dependency changes have been reflected in the pip requirement files.
    opened by MaxSchambach 8
  • zip multiple iterables as a source

    Description

    The use case: we have a store with samples, and 1..n other stores that each contain only features. These stores must have the same keys and same number of samples per shard.

    IterableZipSource makes it possible to zip items from several iterables and use that as a source. For instance:

    it1 = MessagepackDriver(url1).get_iter()
    it2 = MessagepackDriver(url2).get_iter()
    
    it3 = IterableZipSource(iterables=[it1, it2]).collect()
    
    

    Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by AlirezaSohofi 5
  • Add csv driver option to specify csv read args

    Description

    Adds a read_csv_kwargs argument to CsvDriver initialization, which is used in all read_csv calls in the class. This does not break backward compatibility, as get_df and get_iter still allow specifying kwargs for read_csv, which take precedence over the ones given at initialization.

    This makes the creation of new catalog entries based on the CsvDriver much easier, as dataset-specific read options (such as separator, dtypes, etc.) can be specified in the driver_kwargs.
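
    For illustration, usage could then look roughly like this (the import path, argument names, and option values below are assumptions):

    from squirrel.driver import CsvDriver  # assumed import path

    driver = CsvDriver("path/to/data.csv", read_csv_kwargs={"sep": ";", "dtype": {"id": str}})
    df = driver.get_df()          # uses the read_csv_kwargs given at initialization
    df2 = driver.get_df(sep=",")  # per-call kwargs take precedence over the init-time ones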

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [x] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [x] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [x] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by MaxSchambach 4
  • Quantify randomness of shuffle in squirrel

    Description

    Introduce a function to measure the randomness of a shuffle operation in the squirrel pipeline by implementing a simple example driver, random sampling and comparing the distances of sampled trajectories.
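
    One way such a measure could be sketched (an illustrative stand-in, not the function introduced in this PR) is to compare the mean displacement of item indices produced by a buffer-based shuffle against the displacement expected from a full uniform shuffle:

    import random
    from statistics import mean

    def displacement_score(n_items: int, buffer_size: int, seed: int = 0) -> float:
        """Mean |new_pos - old_pos| of a buffer shuffle, normalised by a full shuffle."""
        rng = random.Random(seed)
        buffer, out = [], []
        for idx in range(n_items):  # emulate a streaming shuffle with a bounded buffer
            buffer.append(idx)
            if len(buffer) >= buffer_size:
                out.append(buffer.pop(rng.randrange(len(buffer))))
        while buffer:
            out.append(buffer.pop(rng.randrange(len(buffer))))
        observed = mean(abs(new - old) for new, old in enumerate(out))
        expected_full = (n_items ** 2 - 1) / (3 * n_items)  # E|i - j| for independent uniform indices
        return observed / expected_full

    print(displacement_score(10_000, buffer_size=100))     # small buffer -> score close to 0
    print(displacement_score(10_000, buffer_size=10_000))  # full buffer -> score close to 1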

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by winfried-ripken 4
  • Explain automatic version iteration

    Description

    Adding explanation of the default version iteration behaviour of the catalog, which was not clearly stated before.

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [x] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [x] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by AdemFr 3
  • Add storage options kwargs to FPGen

    Description

    FilePathGenerator does not expose storage options, which are used by fsspec when instantiating a filesystem. This can be problematic when advanced options are needed for accessing the data, e.g. when the requester_pays argument is needed to access data in a Google Cloud Storage bucket. This change adds such kwargs to the constructor of FilePathGenerator, which are passed on to fsspec.
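
    For illustration, usage might look roughly like this (the import path and the exact iteration API are assumptions):

    from squirrel.iterstream.source import FilePathGenerator  # assumed import path

    # hypothetical: keyword arguments such as requester_pays are forwarded to the fsspec filesystem
    gen = FilePathGenerator("gs://some-requester-pays-bucket/dataset", requester_pays=True)
    paths = gen.collect()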

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [x] All dependency changes have been reflected in the pip requirement files.
    opened by mg515 3
  • Warn when creating driver that points to an empty directory

    Description

    A warning is shown when we iterate over a driver that points to an empty or non-existent directory.

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [x] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by pzdkn 3
  • Nvidia DALI external source integration

    Description

    Motivation: Squirrel is a fast data loading framework, and Nvidia DALI is a fast, GPU-accelerated library for complex ML workloads such as run-time augmentations. The aim is to provide users with an intuitive interface to use Squirrel as a backend for Nvidia DALI.

    Context: For more context, check out the internal benchmarks below. Running the Squirrel pipeline without any augmentations yields approx. 33k samples/sec. If you use Squirrel as an external source with an affine image augmentation from DALI, you reach approx. 28k samples/sec. This suggests that DALI can make full use of Squirrel's speed, as data loading is barely slowed down by the run-time augmentations (33k vs. 28k). DALI brings two things to the table: you can augment data in batches rather than one-by-one, as is necessary with other frameworks, and you can do it on the GPU.

    (Screenshot of internal benchmark results, 2022-08-03, not reproduced here.)

    Code Design

    • DALI comes with a concept called "pipeline" (docs), that defines how data should be read and transformed by DALI.
    • We use the external_source data reader API in the DALI pipeline, which we can provide with a modified Squirrel Iterable, the squirrel.iterstream.DaliExternalSource.
    • As suggested by Nvidia DALI staff, I benchmarked loading the samples one-by-one and let DALI do the batching. It turned out that batching in Squirrel was much faster (18.2k sps vs 32.2 sps). This suggests that DALI profits from the async loading in Squirrel here.
    • As suggested by Nvidia DALI staff, I tried using the parallel external source, which is multi-proc data loading by DALI. As stated in their docs, DALI prefers single samples (un-batched) here so that it can handle the multi-proc logic of parallel data fetching. The problem is that DALI requires a Callable external source for this; Iterables are not allowed for parallel fetching. While this is technically possible (e.g. fit your dataset in one shard and then access the items by their keys, i.e. shard names), indexability is not straightforward and not yet integrated in squirrel. Since DALI already makes nearly full use of Squirrel's performance, we don't see that DALI could speed things up here. But it's worth investigating once the feature is implemented in Squirrel.
    • There was no performance increase by returning cupy arrays on the GPU to the external_source reader. Numpy was slightly faster, so users are advised to return numpy arrays in their collation function.

    Usage Pattern

    • users will simply turn their iterable into an external source with the iterstream API.
    # define a dummy DALI pipeline that reads from the Squirrel external source
    @pipeline_def
    def pipeline(it: DaliExternalSource, device: str) -> Tuple[DataNode]:
        img, label = fn.external_source(source=it, num_outputs=2, device=device)
        enhanced = fn.brightness_contrast(img, contrast=2)  # do other augmentations here
        return enhanced, label

    # wrap the squirrel iterstream as an external source, then build the pipeline
    it = squirrel_iterator.to_dali_external_source(BATCH_SIZE, my_collation_fn)
    pipe = pipeline(it, device, batch_size=BATCH_SIZE)
    pipe.build()

    # iterate over batches as usual
    loader = DALIGenericIterator([pipe], ["img", "label"])
    for item in loader:
        ...  # training / processing step goes here

    Things to Discuss

    1. I tried turning the iterstream into a DALIGenericIterator directly and abstracting the above code away, but in my mind that does not make a lot of sense, as DALI users are used to the above API and we are really just an external source. The user will need to define their custom pipeline for their use case anyway, so I don't see a big benefit of abstracting the above code away into a squirrel functionality - possibly adding some assumptions here and there and thereby limiting the original functionality of DALI (wdyt @AlirezaSohofi ?).
    2. We would need to find out if the self.i and self.n parameters need to be set for the external source as indicated here. For now, it seems to work out of the box, but maybe for more complex use-cases these variables are needed for DALI to keep track of the loaded samples. Sidenote: Currently DaliExternalSource could also simply be replaced with squirrel_iterable.batched(bs, fn), but I assume that self.i and self.n are needed somehow (input from NVIDIA needed here), so it's useful to have DaliExternalSource where we can add more features.
    3. Please check out the test_to_dali_external_source_gpu_multi_epoch. After iterating over Squirrel's generator once the iterable is empty. Hence after each epoch we need to create a new DALIGenericIterator. Afaik this is also how e.g. Pytorch Lightning handles it. Let me know if that sounds ok, or if we need to loop over the data.
    4. Tests & Requirements: Note that I added pytests for the code, but did not update the requirements accordingly, because the CI currently doesn't run GPU tasks. Moreover, we won't ask users to install DALI for now (also, there are many different versions for different cuda drivers), so we assume people will prefer installing themselves. The DaliExternalSource doesn't depend on any DALI code, so the DALI install is technically not required.

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by axkoenig 3
  • Store the processing steps in a stream

    Description

    Store more information in Composables:

    • Which Squirrel version is used
    • Git info e.g. commit-hash, remote repository
    • Log processing steps when chaining Composables

    This aims to provide the user with more information about the stream. When a Composable stores sensitive information, e.g. the url in FilePathGenerator, this should not be logged.
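
    A rough sketch of the kind of metadata that could be attached to a stream (names and structure are assumptions for illustration, not the implementation in this PR):

    import subprocess
    from dataclasses import dataclass, field
    from importlib.metadata import version

    @dataclass
    class StreamInfo:
        squirrel_version: str
        git_commit: str
        steps: list = field(default_factory=list)  # processing steps, appended when chaining Composables

    def current_commit() -> str:
        return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

    info = StreamInfo(squirrel_version=version("squirrel-core"), git_commit=current_commit())
    info.steps.append("map")  # log only the step name, never sensitive arguments such as urls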

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [x] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [x] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by pzdkn 3
  • [FEATURE] Make `get_iter` method documentation about `max_workers` more explicit

    Hey, I've stumbled across a potentially easy-to-misunderstand part of the MapDriver.get_iter documentation:

    https://github.com/merantix-momentum/squirrel-core/blob/8e2942313c7d7dd974b1ca2f2308895f660d3d26/squirrel/driver/driver.py#L68-L155

    The documentation of max_workers states that by default None will be used and also mentions that this will cause async_map to be called, but I missed these parts of the documentation and was surprised to see that so many threads were allocated.

    I am/was not too familiar with the ThreadPoolExecutor interface and find it somewhat surprising that None equals number_of_processors x 5 according to the ThreadPoolExecutor definition. Maybe it would be helpful to explicitly state that by default ThreadPoolExecutor will be used with that many threads? The documentation string reads a bit unintuitively, as it starts out stating that max_workers defines how many items are fetched simultaneously and then continues to state that otherwise map is used. From that perspective, max_workers=None doesn't sound like it should be using any threads at all. Without knowing the default values of ThreadPoolExecutor, I would make it more explicit that to disable threading one has to set max_workers=0/1 and that by default many threads are used.
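
    For reference, the default that None maps to can be inspected directly; the exact formula depends on the Python version (number_of_processors x 5 before 3.8, min(32, cpu_count + 4) from 3.8 on). _max_workers is a private attribute, used here only for illustration:

    import os
    from concurrent.futures import ThreadPoolExecutor

    with ThreadPoolExecutor(max_workers=None) as pool:  # None -> version-dependent default
        print(f"cpu_count={os.cpu_count()}, threads allocated={pool._max_workers}")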

    I am happy to add a PR with my suggested doc-string update if you agree! :)

    enhancement 
    opened by kai-tub 3
  • Interaction Nvidia DALI and Squirrel

    Description

    Describes in detail how Squirrel and DALI can work together. Also includes benchmarks on how to best utilize DALI and how it compares to transforms in Torchvision.

    Attaching PDF rendered version of the Sphinx documentation here. Unfortunately, I couldn't get syntax highlighting to work.

    Apparent next steps are figuring out how Squirrel and DALI can work together in multi-processing. It is not obvious how we could implement this, and if this provides a performance boost. Using a DALI parallel external source would probably be the way to go, but DALI expects a callable here that fetches individual images given a specific image index. This can be implemented easily if we set shard-size=1, but our initial experiments showed that larger shard sizes are more desirable.

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [x] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [x] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [x] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by axkoenig 0
  • Bugfix deserializer kwargs

    Description

    Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

    Fixes # issue

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by mg515 0
  • PoC to cache data

    driver = MessagepackDriver(url=url, cache_url=another_url)
    

    Description

    Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by AlirezaSohofi 0
  • Safety checks for store and driver using FilePathGenerator

    Description

    For both store and driver we need to assess if a URL points to an empty directory or nested empty directories.

    • For drivers, warning the user when using empty directories alerts the user early on that the url might be invalid
    • For stores, we want to only overwrite an existing non-empty directory when it is explicitly allowed

    In both cases, checking whether the directories/nested directories are empty is done through the FilePathGenerator.
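
    A generic emptiness check along these lines could be sketched with fsspec as follows (illustrative only, assuming the URL is fsspec-compatible; not the actual FilePathGenerator-based implementation):

    import fsspec

    def is_empty_dir(url: str, **storage_options) -> bool:
        """True if url does not exist or contains no files (searched recursively)."""
        fs, path = fsspec.core.url_to_fs(url, **storage_options)
        if not fs.exists(path):
            return True
        return len(fs.find(path)) == 0  # find() lists files recursively

    if is_empty_dir("gs://some-bucket/dataset"):
        print("warning: url points to an empty or non-existent directory")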

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by pzdkn 2
  • [DRAFT] Support for different SquirrelStore compression modes

    Description

    See #59

    Fixes #59 issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [X] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [X] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [X] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [X] I have kept the PR small so that it can be easily reviewed
    • [X] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [X] All dependency changes have been reflected in the pip requirement files.

    Draft State!

    This is a draft PR to make it easier to discuss the different pros and cons of various solutions. This is not in a final state.

    I tried to add some tests and verify that they pass locally, but the tests spam a lot of `ValueError: Bucket is requester pays. Set requester_pays=True when creating the GCSFileSystem.` and it is hard to tell where these tests/errors are coming from. The contributing guideline provides no further information on how to run the tests.

    opened by kai-tub 9
  • [FEATURE] Allow configuring compression mode in MessagepackSerializer

    Hey,

    Thank you for working on this library! I think it has a huge potential, especially for dataset creators to provide their dataset in an optimized deep-learning format that is well suited for distribution. The performance of the MessagepackSerializer is amazing and being able to distribute subsets of the dataset (shards) is something I never wanted but really want to utilize in the future!

    I have played around with some "MessagepackSerializer" configurations and according to some internal benchmarks, it would be helpful to allow the user to configure the compression algorithm.

    https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/serialization/msgpack.py#L28-L48

    Currently, the compression mode is "locked" to gzip. I assume the main reason is due to the wide usage of gzip and to keep the code 'simple' as it makes it easy in the deserializer to know that the gzip compression was used:

    https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/serialization/msgpack.py#L58-L81

    Here I would like to note that given the extension, fsspec (default) could also infer the compression by inspecting the filename suffix. But I can see how this might cause problems if somebody would like to switch out fsspec with something else (although I would have no idea with what and why :D )

    Other spots within the codebase that are coupled to this compression assumption are the methods from the SquirrelStore:

    https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L12-L67

    Or to show the significant parts:

    • get: https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L40-L41

    • set: https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L59-L60

    • keys: https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L66-L67

    In my internal benchmarks, I was able to greatly speed up the data loading by simply using no compression at all (None). Although I am fully aware that the correct compression mode heavily depends on the specific hardware/use case. But even in a network limited domain, I can see reasons to then prefer xz instead due to its better compression ratio and relatively similar decompression speed to gzip.

    IMHO, I think it should be ok to not store any suffix at all for the squirrel store. If I/a user looks inside of the squirrel store URL I think it is not mandatory to show what compression algorithm was used. The user could/should use the designated driver/metadata that comes bundled with the dataset and let the driver handle the correct decompression.

    If you don't agree I still think the gz extension doesn't have to be 'hardcoded' into these functions. This is actually something that confused me when I was looking at the internals of the code base. So instead, we could use something like:

    comp = kwargs.get("compression", "gzip")
    ext = comp_to_ext_dict[comp]  # just to show the concept: map codec name -> file extension
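
    For illustration, writing and reading a shard through fsspec with a configurable codec could then look roughly like this (paths and the suffix convention are assumptions, not squirrel's actual store layout):

    import fsspec

    payload = b"..."  # serialized messagepack bytes

    # write with an explicit codec instead of the hardcoded gzip
    with fsspec.open("shard_0000.msgpack.xz", "wb", compression="xz") as f:
        f.write(payload)

    # read back; fsspec can also infer the codec from the file suffix
    with fsspec.open("shard_0000.msgpack.xz", "rb", compression="infer") as f:
        data = f.read()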
    

    With these modifications, it should be possible to utilize different compression modes and make them easily configurable. I would be very happy to create a PR and contribute to this project!

    enhancement 
    opened by kai-tub 3
Releases(v0.18.0)
  • v0.18.0(Nov 10, 2022)

    What's Changed

    • zip_index method for Composable by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/92
    • Quantify randomness of shuffle in squirrel by @winfried-ripken in https://github.com/merantix-momentum/squirrel-core/pull/86
    • Change Catalog repr to sorted set by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/94
    • Installation instruction by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/96
    • Upgrade requirements by @winfried-ripken in https://github.com/merantix-momentum/squirrel-core/pull/97
    • Reference Huggingface, Hub and Torchvision Drivers by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/99
    • Update requirements by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/101
    • Refactoring DataFrameDriver and related drivers by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/98

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.17.7...v0.18.0

  • v0.17.7(Oct 7, 2022)

    What's Changed

    • Add hooks to check backwards compatibility with py3.6+ by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/87
    • Add pyupgrade, yaml formatting and update all hooks by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/88
    • Fix file driver storage options by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/85
    • Peng add kwargs to map by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/90
    • Add hooks to csv driver by @winfried-ripken in https://github.com/merantix-momentum/squirrel-core/pull/91
    • Explain automatic version iteration by @AdemFr in https://github.com/merantix-momentum/squirrel-core/pull/84
    • Add csv driver option to specify csv read args by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/93

    New Contributors

    • @MaxSchambach made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/93

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.17.4...v0.17.7

  • v0.17.4(Aug 31, 2022)

    What's Changed

    • Make this repo installable with all python versions by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/82
    • Fix storage options by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/83

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.17.2...v0.17.4

  • v0.17.2(Aug 25, 2022)

    What's Changed

    • Make CatalogSource visible in the API by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/71
    • Minor tweaks in documentation by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/73
    • Introduce rst linting via precommit hook by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/74
    • Remove binary file in tests dir by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/75
    • Unifies folder-creation behaviour when instantiation SquirrelStore by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/72
    • Bugfix - Register Torch Composables by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/78
    • Upgrade infra to py3.9 by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/79
    • Add storage options kwargs to FPGen by @mg515 in https://github.com/merantix-momentum/squirrel-core/pull/81

    New Contributors

    • @axkoenig made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/78
    • @mg515 made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/81

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.16.0...v0.17.2

  • v0.16.0(Jul 26, 2022)

    What's Changed

    • introduce loop and fixed size iterable by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/47
    • Move cla assistant to workflows by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/62
    • *add tutorials, *ignore test in api-ref, *remove unused execption by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/63
    • First draft of advanced section for iterstreams by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/55
    • Update Doc-String of MapDriver.get_iter by @kai-tub in https://github.com/merantix-momentum/squirrel-core/pull/61
    • Composable.compose gets source as kwarg, which is equal to self by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/66
    • Peng add pytorch convenience functions to composable by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/69
    • partial function for keys method by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/70

    New Contributors

    • @kai-tub made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/61

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/0.14.2...v0.16.0

  • 0.14.2(Jun 23, 2022)

    What's Changed

    • change squirrel test using a tmp public bucket by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/46
    • Update fs.open mode for catalog by @AdemFr in https://github.com/merantix-momentum/squirrel-core/pull/48
    • CatalogKey can be used to index catalog by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/49
    • accept callable as source for composable to make it completly lazy by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/44
    • add sphinxcontrib-mermaid by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/51
    • Architecture overview by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/54
    • *add advanced store *reorganize sections *add icon,favicon by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/53
    • Create codeql-analysis.yml by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/52
    • Upgrade numpy & numba by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/57
    • Winnie bump pyjwt by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/58

    New Contributors

    • @AdemFr made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/48
    • @pzdkn made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/53
    • @winfried-loetzsch made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/57

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.13.2...0.14.2

  • v0.13.2(May 18, 2022)

    What's Changed

    • Fix SourceCombiner.get_iter() not interleaving correctly by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/45

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.13.1...v0.13.2

  • v0.13.1(May 18, 2022)

    What's Changed

    • Add community files by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/38
    • Minor requirement changes by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/40
    • messagepack unpacker set use_list argument to False by default by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/39

    New Contributors

    • @AlpAribal made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/40

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.3...v0.13.1

  • v0.12.3(Apr 11, 2022)

    What's Changed

    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/31
    • pin numpy and update PR template by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/34
    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/33
    • update document links by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/36
    • update version to 0.12.3 by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/37

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.2...v0.12.3

  • v0.12.2(Apr 6, 2022)

    What's Changed

    • update img to github raw file so public pypi can load it by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/26
    • Tiansu add readthedocs.yml by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/27
    • add dependencies for readthedoc by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/28
    • fix readthedoc by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/29
    • update readthedocs links by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/30
    • Tiansu move leftover commits by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/32

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.1...v0.12.2

  • v0.12.1(Apr 5, 2022)

    What's Changed

    • update docs link by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/12
    • add logo by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/13
    • remove old extra file by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/14
    • add back keyring until public release by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/16
    • key_hook param of get_iter accepts SplitByRank and SplitByWorker, par… by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/15
    • fix install instruction by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/18
    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/19
    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/20
    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/21
    • Tiansu update black by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/22
    • add CLA bot by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/23
    • switch to publish in public pypi by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/24
    • update version to 0.12.1 by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/25

    New Contributors

    • @ThomasWollmann made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/13
    • @AlirezaSohofi made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/15

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.0...v0.12.1

  • v0.12.0(Mar 12, 2022)

    What's Changed

    • add basic files to get infrastructure running by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/3
    • new semantic versioning format for dev release by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/4
    • tiansu copy squirrel codebase by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/5
    • Tiansu add docs by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/9
    • add pypi classifiers by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/10
    • change version norm by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/11

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/commits/v0.12.0
