A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way :chestnut:

Merantix Momentum

Last update: Dec 7, 2022

Related tags

Deep Learning python nlp data-science machine-learning natural-language-processing ai computer-vision deep-learning tensorflow cv ml collaboration pytorch distributed dataops cloud-computing datasets data-ingestion jax data-mesh

Overview

Squirrel Core

Share, load, and transform data in a collaborative, flexible, and efficient way

What is Squirrel?

Squirrel is a Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way.

SPEED: Avoid data stall, i.e. the expensive GPU will not be idle while waiting for the data.
COSTS: First, avoid GPU stalling, and second allow to shard & cluster your data and store & load it in bundles, decreasing the cost for your data bucket cloud storage.
FLEXIBILITY: Work with a flexible standard data scheme which is adaptable to any setting, including multimodal data.
COLLABORATION: Make it easier to share data & code between teams and projects in a self-service model.

Stream data from anywhere to your machine learning model as easy as:

it = (Catalog.from_plugins()["imagenet"].get_driver()
      .get_iter("train")
      .map(lambda r: (augment(r["image"]), r["label"]))
      .batched(100))

Check out our full getting started tutorial notebook. If you have any questions or would like to contribute, join our Slack community.

Installation

You can install squirrel-core by

pip install "squirrel-core[all]"

Documentation

Read our documentation at ReadTheDocs

Example Notebooks

Check out the Squirrel-datasets repository for open source and community-contributed tutorial and example notebooks of using Squirrel.

Contributing

Squirrel is open source and community contributions are welcome!

Check out the contribution guide to learn how to get involved.

The humans behind Squirrel

We are Merantix Momentum, a team of ~30 machine learning engineers, developing machine learning solutions for industry and research. Each project comes with its own challenges, data types and learnings, but one issue we always faced was scalable data loading, transforming and sharing. We were looking for a solution that would allow us to load the data in a fast and cost-efficient way, while keeping the flexibility to work with any possible dataset and integrate with any API. That's why we build Squirrel – and we hope you'll find it as useful as we do! By the way, we are hiring!

Citation

If you use Squirrel in your research, please cite it using:

@article{2022squirrelcore,
  title={Squirrel: A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way.},
  author={Squirrel Developer Team},
  journal={GitHub. Note: https://github.com/merantix-momentum/squirrel-core},
  doi={10.5281/zenodo.6418280},
  year={2022}
}

Comments

Update Doc-String of MapDriver.get_iter
Better document the behavior of max_workers and link to official ThreadPoolExecutor documentation.

Update *_map doc-strings that use ThreadPoolExecutor and link to official ThreadPoolExecutor documentation.

Fixes #60 issue

Type of change

[ ] Bug fix (non-breaking change which fixes an issue)

[ ] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

[X] Documentation update

[ ] Refactoring including code style reformatting

[ ] Other (please describe):

Checklist:

[X] I have read the contributing guideline doc (external contributors only)

[ ] Lint and unit tests pass locally with my changes

[X] I have kept the PR small so that it can be easily reviewed

[X] I have made corresponding changes to the documentation

[ ] I have added tests that prove my fix is effective or that my feature works

[ ] All dependency changes have been reflected in the pip requirement files.
opened by kai-tub 14
Refactoring DataFrameDriver and related drivers
Description

Refactors the DataFrameDriver and all data frame-related drivers. In particular:

Fixes a current bug (I believe) that storage options are not passed down when reading using the pandas/dask interface. This affected the implementation of CsvDriver.

Refactors the DataFrameDriver base class to provide a common interface for all drivers that use some read functionality from pandas or dask. The base class now handles the storage options and read arguments precedence for all derived classes

Using the new abstraction, adds FeatherDriver, JsonDriver, ParquetDriver, and XlsDriver and refactors CsvDriver

This does break some datasets using the CsvDriver as the read_csv_kwargs are now renamed to a common read_kwargs. However, so far, only two research datasets used this property. See the corresponding PR in squirrel-datasets. So far, this is a bit of a rough sketch. I tested the existing CsvDriver-based datasets but otherwise, this requires a bit more cleanup I suppose.

Renames the previous use_dask option to engine across all data frame drivers.

Change the default DataFrame engine to Pandas.

Type of change

[x] Bug fix (non-breaking change which fixes an issue)

[x] New feature (non-breaking change which adds functionality)

[x] Breaking change (fix or feature that would cause existing functionality to not work as expected)

[x] Documentation update

[ ] Refactoring including code style reformatting

[ ] Other (please describe):

Checklist:

[ ] I have read the contributing guideline doc (external contributors only)

[x] Lint and unit tests pass locally with my changes

[x] I have kept the PR small so that it can be easily reviewed

[x] I have made corresponding changes to the documentation

[x] I have added tests that prove my fix is effective or that my feature works

[x] All dependency changes have been reflected in the pip requirement files.
opened by MaxSchambach 8
zip multiple iterables as a source
Description

The use case: we have a store with samples, and 1..n other stores that each contain only features. These stores must have the same keys and same number of samples per shard.

IterableZipSource makes it possible to zip items from several iterables and use that as a source. For instance:

it1 = MessagepackDriver(url1).get_iter() it2 = MessagepackDriver(url2).get_iter() it3 = IterableZipSource(iterables=[it1, it2]).collect()

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes # issue

Type of change

[ ] Bug fix (non-breaking change which fixes an issue)

[ ] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

[ ] Documentation update

[ ] Refactoring including code style reformatting

[ ] Other (please describe):

Checklist:

[ ] I have read the contributing guideline doc (external contributors only)

[ ] Lint and unit tests pass locally with my changes

[ ] I have kept the PR small so that it can be easily reviewed

[ ] I have made corresponding changes to the documentation

[ ] I have added tests that prove my fix is effective or that my feature works

[ ] All dependency changes have been reflected in the pip requirement files.
opened by AlirezaSohofi 5
Add csv driver option to specify csv read args
Description

Adds a read_csv_kwargs argument to CsvDriver initialization which is used in all read_csv calls in the class. Does not break backward compatibility, as get_df and get_iter still allow to specify kwargs for read_csv which will take precedence over the ones given at initialization.

This makes the creation of new catalog entries based on the CsvDriver much easier as dataset-specific read options (such as seperator, dtypes, etc.) can be specified in the driver_kwargs.

Type of change

[ ] Bug fix (non-breaking change which fixes an issue)

[x] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

[x] Documentation update

[ ] Refactoring including code style reformatting

[ ] Other (please describe):

Checklist:

[x] I have read the contributing guideline doc (external contributors only)

[x] Lint and unit tests pass locally with my changes

[x] I have kept the PR small so that it can be easily reviewed

[x] I have made corresponding changes to the documentation

[ ] I have added tests that prove my fix is effective or that my feature works

[ ] All dependency changes have been reflected in the pip requirement files.
opened by MaxSchambach 4
Quantify randomness of shuffle in squirrel
Description

Introduce a function to measure the randomness of a shuffle operation in the squirrel pipeline by implementing a simple example driver, random sampling and comparing the distances of sampled trajectories.

Type of change

[ ] Bug fix (non-breaking change which fixes an issue)

[x] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

[ ] Documentation update

[ ] Refactoring including code style reformatting

[ ] Other (please describe):

Checklist:

[ ] I have read the contributing guideline doc (external contributors only)

[x] Lint and unit tests pass locally with my changes

[x] I have kept the PR small so that it can be easily reviewed

[ ] I have made corresponding changes to the documentation

[ ] I have added tests that prove my fix is effective or that my feature works

[ ] All dependency changes have been reflected in the pip requirement files.
opened by winfried-ripken 4
Explain automatic version iteration
Description

Adding explanation of the default version iteration behaviour of the catalog, which was not clearly stated before.

Fixes # issue

Type of change

[ ] Bug fix (non-breaking change which fixes an issue)

[ ] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

[ x ] Documentation update

[ ] Refactoring including code style reformatting

[ ] Other (please describe):

Checklist:

[x] I have read the contributing guideline doc (external contributors only)

[ ] Lint and unit tests pass locally with my changes

[ ] I have kept the PR small so that it can be easily reviewed

[ ] I have made corresponding changes to the documentation

[ ] I have added tests that prove my fix is effective or that my feature works

[ ] All dependency changes have been reflected in the pip requirement files.
opened by AdemFr 3
Add storage options kwargs to FPGen
Description

FilePathGenerator does not expose storage options, used by fsspec when instantiating a filesystem. This can prove to be problematic, when advanced options are needed for accessing the data, e.g. when needing requester_pays argument for accessing data within google bucket. This change adds such kwargs to the constructor of FilePathGenerator object, which are passed onto fsspec.

Fixes # issue

Type of change

[ ] Bug fix (non-breaking change which fixes an issue)

[x] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

[ ] Documentation update

[ ] Refactoring including code style reformatting

[ ] Other (please describe):

Checklist:

[ ] I have read the contributing guideline doc (external contributors only)

[x] Lint and unit tests pass locally with my changes

[x] I have kept the PR small so that it can be easily reviewed

[ ] I have made corresponding changes to the documentation

[ ] I have added tests that prove my fix is effective or that my feature works

[x] All dependency changes have been reflected in the pip requirement files.
opened by mg515 3
Warn when creating driver that points to an empty directory
Description

A warning is shown when we iterate over a driver, that points to an empty or non-existent directory.

Type of change

[ ] Bug fix (non-breaking change which fixes an issue)

[x] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

[ ] Documentation update

[ ] Refactoring including code style reformatting

[ ] Other (please describe):

Checklist:

[ ] I have read the contributing guideline doc (external contributors only)

[x] Lint and unit tests pass locally with my changes

[x] I have kept the PR small so that it can be easily reviewed

[ ] I have made corresponding changes to the documentation

[x] I have added tests that prove my fix is effective or that my feature works

[ ] All dependency changes have been reflected in the pip requirement files.
opened by pzdkn 3
Nvidia DALI external source integration
Description

Motivation Squirrel is a fast data loading framework, and Nvidia DALI is a fast, gpu-accelerated library for complex ML workloads such as run-time augmentations. The aim is to provide users with an intuitive interface to use Squirrel as a backend for Nvidia DALI.

Context For more context, check out these internal benchmarks below. Running the Squirrel pipeline without any augmentations is approx. 33k samples / sec. If you use Squirrel as an external source and an affine image augmentation from DALI you can reach approx. 28k samples / sec. This suggests that DALI can make use full use of Squirrel's speed as the data loading speed is almost not slowed down by the run-time augmentations (33k vs 28k). DALI brings two things to the table: you can augment data in batches and not one-by-one as is necessary with the other frameworks, and you can do it on the GPU.

Code Design

DALI comes with a concept called "pipeline" (docs), that defines how data should be read and transformed by DALI.

We use the external_source data reader API in the DALI pipeline, which we can provide with a modified Squirrel Iterable, the squirrel.iterstream.DaliExternalSource.

As suggested by Nvidia DALI staff, I benchmarked loading the samples one-by-one and let DALI do the batching. It turned out that batching in Squirrel was much faster (18.2k sps vs 32.2 sps). This suggests that DALI profits from the async loading in Squirrel here.

As suggested by Nvidia DALI staff, I tried using parallel external source, which is multi-proc dataloading by DALI. As stated in their docs, DALI prefers single samples (un-batched) here to let DALI handle the multi-proc logic of parallel data fetching. The problem here is that DALI would like a Callable external source here, Iterables are not allowed for parallel fetching. While this is technically possible (e.g. fit your dataset in one shard and then access the items by their keys, i.e. shard names), indexability is not straightforward and not yet integrated in squirrel. Since DALI already nearly makes use of Squirrel's full performance, we don't see that DALI could speed things up here. But it's worth investigating once the feature is implemented in Squirrel.

There was no performance increase by returning cupy arrays on the GPU to the external_source reader. Numpy was slightly faster, so users are advised to return numpy arrays in their collation function.

Usage Pattern

users will simply turn their iterable into an external source with the iterstream API.

# define a dummy pipeline @pipeline_def def pipeline(it: DaliExternalSource, device: str) -> Tuple[DataNode]: img, label = fn.external_source(source=it, num_outputs=2, device=device) enhanced = fn.brightness_contrast(img, contrast=2) # do other augmentations here return enhanced, label it = squirrel_iterator.to_dali_external_source(batch_size, my_collation_fn) pipe = pipeline(it, device, batch_size=BATCH_SIZE) pipe.build() loader = DALIGenericIterator([pipe], ["img", "label"]) for item in loader: # ...

Things to Discuss

I tried turning the iterstream into a DALIGenericIterator directly and abstract the above code away, but in my mind that does not make a lot of sense, as DALI users are used to the above API and we are really just an external source. The user will need to define their custom pipeline anyway for their use-case, so I don't see a big benefit of abstracting the below code away into a squirrel functionality - possibly adding some assumptions here and there and thereby limiting the original functionality of DALI (wdyt @AlirezaSohofi ?).

We would need to find out if the self.i and self.n parameters need to be set for the external source as indicated here. For now, it seems to work out of the box, but maybe for more complex use-cases these variables are needed for DALI to keep track of the loaded samples. Sidenote: Currently DaliExternalSource could also simply be replaced with squirrel_iterable.batched(bs, fn), but I assume that self.i and self.n are needed somehow (input from NVIDIA needed here), so it's useful to have DaliExternalSource where we can add more features.

Please check out the test_to_dali_external_source_gpu_multi_epoch. After iterating over Squirrel's generator once the iterable is empty. Hence after each epoch we need to create a new DALIGenericIterator. Afaik this is also how e.g. Pytorch Lightning handles it. Let me know if that sounds ok, or if we need to loop over the data.

Tests & Requirements: Note that I added pytests for the code, but did not update the requirements accordingly, because the CI currently doesn't run GPU tasks. Moreover, we won't ask users to install DALI for now (also, there are many different versions for different cuda drivers), so we assume people will prefer installing themselves. The DaliExternalSource doesn't depend on any DALI code, so the DALI install is technically not required.

Type of change

[ ] Bug fix (non-breaking change which fixes an issue)

[x] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

[ ] Documentation update

[ ] Refactoring including code style reformatting

[ ] Other (please describe):

Checklist:

[ ] I have read the contributing guideline doc (external contributors only)

[ ] Lint and unit tests pass locally with my changes

[ ] I have kept the PR small so that it can be easily reviewed

[ ] I have made corresponding changes to the documentation

[ ] I have added tests that prove my fix is effective or that my feature works

[ ] All dependency changes have been reflected in the pip requirement files.
opened by axkoenig 3
Store the processing steps in a stream
Description

Store more information in Composables:

Which Squirrel version is used

Git info e.g. commit-hash, remote repository

Log processing steps when chaining Composables

This aims to provide the user more information about the stream. When a Composable stores sensitive information e.g. url in FilePathGenerator, then this should not be logged.

Fixes # issue

Type of change

[ ] Bug fix (non-breaking change which fixes an issue)

[x] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

[ ] Documentation update

[x] Refactoring including code style reformatting

[ ] Other (please describe):

Checklist:

[ ] I have read the contributing guideline doc (external contributors only)

[x] Lint and unit tests pass locally with my changes

[x] I have kept the PR small so that it can be easily reviewed

[ ] I have made corresponding changes to the documentation

[x] I have added tests that prove my fix is effective or that my feature works

[ ] All dependency changes have been reflected in the pip requirement files.
opened by pzdkn 3
[FEATURE] Make `get_iter` method documentation about `max_workers` more explicit

Hey, I've stumbled across a potentially easy-to-misunderstand part of the MapDriver.get_iter documentation:

https://github.com/merantix-momentum/squirrel-core/blob/8e2942313c7d7dd974b1ca2f2308895f660d3d26/squirrel/driver/driver.py#L68-L155

The documentation of max_workers states that by default None will be used and also mentions that this will cause async_map to be called but I missed these parts of the documentation and was surprised to see that so many threads were allocated.

I am/was not too familiar with the ThreadPoolExecutor interface and find it somewhat surprising that None equals numer_of_processors x 5 according to the ThreadPoolExecutor definition. Maybe it would be helpful to explicitly state that by default ThreadPoolExecutor will be used with so many threads? The documentation string reads a bit unintuitive as the starts out that max_worker defines how many items are fetched simultaneously with max_worker and then continues to state that otherwise map is used. From that perspective, max_workers=None doesn't sound like it should be using any threads at all. Without knowing the default values of ThreadPoolExecutor I would make it more explicit that to disable threading one has to set max_workers=0/1 and that by default many threads are used.

I am happy to add a PR with my suggested doc-string update if you agree! :)
enhancement

opened by kai-tub 3
Interaction Nvidia DALI and Squirrel
Description

Describes in detail how Squirrel and DALI can work together. Also includes benchmarks on how to best utilize DALI and how it compares to transforms in Torchvision.

Attaching PDF rendered version of the Sphinx documentation here. Unfortunately, I couldn't get syntax highlighting to work.

Apparent next steps are figuring out how Squirrel and DALI can work together in multi-processing. It is not obvious how we could implement this, and if this provides a performance boost. Using a DALI parallel external source would probably be the way to go, but DALI expects a callable here that fetches individual images given a specific image index. This can be implemented easily if we set shard-size=1, but our initial experiments showed that larger shard sizes are more desirable.

Type of change

[ ] Bug fix (non-breaking change which fixes an issue)

[ ] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

[x] Documentation update

[ ] Refactoring including code style reformatting

[ ] Other (please describe):

Checklist:

[x] I have read the contributing guideline doc (external contributors only)

[x] Lint and unit tests pass locally with my changes

[x] I have kept the PR small so that it can be easily reviewed

[x] I have made corresponding changes to the documentation

[ ] I have added tests that prove my fix is effective or that my feature works

[ ] All dependency changes have been reflected in the pip requirement files.
opened by axkoenig 0
Bugfix deserializer kwargs
Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes # issue

Type of change

[x] Bug fix (non-breaking change which fixes an issue)

[ ] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

[ ] Documentation update

[ ] Refactoring including code style reformatting

[ ] Other (please describe):

Checklist:

[ ] I have read the contributing guideline doc (external contributors only)

[ ] Lint and unit tests pass locally with my changes

[x] I have kept the PR small so that it can be easily reviewed

[ ] I have made corresponding changes to the documentation

[ ] I have added tests that prove my fix is effective or that my feature works

[ ] All dependency changes have been reflected in the pip requirement files.
opened by mg515 0
PoC to cache data
driver = MessagepackDriver(url=url, cache_url=another_url)

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes # issue

Type of change

[ ] Bug fix (non-breaking change which fixes an issue)

[ ] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

[ ] Documentation update

[ ] Refactoring including code style reformatting

[ ] Other (please describe):

Checklist:

[ ] I have read the contributing guideline doc (external contributors only)

[ ] Lint and unit tests pass locally with my changes

[ ] I have kept the PR small so that it can be easily reviewed

[ ] I have made corresponding changes to the documentation

[ ] I have added tests that prove my fix is effective or that my feature works

[ ] All dependency changes have been reflected in the pip requirement files.
opened by AlirezaSohofi 0
Safety checks for store and driver using FilePathGenerator
Description

For both store and driver we need to asses if a URL points to an empty directory or nested empty directories.

For drivers, warning the user when using empty directories alerts the user early on that the url might be invalid

For stores, we want to only overwrite an existing non-empty directory when it is explicitly allowed

In both cases, checking if the directories/nested directories are empty are done through the FilePathGenerator

Type of change

[x] Bug fix (non-breaking change which fixes an issue)

[ ] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

[ ] Documentation update

[ ] Refactoring including code style reformatting

[ ] Other (please describe):

Checklist:

[ ] I have read the contributing guideline doc (external contributors only)

[ ] Lint and unit tests pass locally with my changes

[ ] I have kept the PR small so that it can be easily reviewed

[ ] I have made corresponding changes to the documentation

[ ] I have added tests that prove my fix is effective or that my feature works

[ ] All dependency changes have been reflected in the pip requirement files.
opened by pzdkn 2
[DRAFT] Support for different SquirrelStore compression modes
Description

See #59

Fixes #59 issue

Type of change

[ ] Bug fix (non-breaking change which fixes an issue)

[X] New feature (non-breaking change which adds functionality)

[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

[X] Documentation update

[ ] Refactoring including code style reformatting

[ ] Other (please describe):

Checklist:

[X] I have read the contributing guideline doc (external contributors only)

[ ] Lint and unit tests pass locally with my changes

[X] I have kept the PR small so that it can be easily reviewed

[X] I have made corresponding changes to the documentation

[ ] I have added tests that prove my fix is effective or that my feature works

[X] All dependency changes have been reflected in the pip requirement files.

Draft State!

This is a draft PR to make it easier to discuss the different pros and cons of various solutions. This is not in a final state.

I tried to add some test and verify that they pass locally, but the tests spam a lot of ValueError: Bucket is requester pays. Setrequester_pays=Truewhen creating the GCSFileSystem. and it is hard to tell where these tests/errors are coming from. The contributing guideline provides no further information on how to run the tests.
opened by kai-tub 9
[FEATURE] Allow configuring compression mode in MessagepackSerializer
Hey,

Thank you for working on this library! I think it has a huge potential, especially for dataset creators to provide their dataset in an optimized deep-learning format that is well suited for distribution. The performance of the MessagepackSerializer is amazing and being able to distribute subsets of the dataset (shards) is something I never wanted but really want to utilize in the future!

I have played around with some "MessagepackSerializer" configurations and according to some internal benchmarks, it would be helpful to allow the user to configure the compression algorithm.

https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/serialization/msgpack.py#L28-L48

Currently, the compression mode is "locked" to gzip. I assume the main reason is due to the wide usage of gzip and to keep the code 'simple' as it makes it easy in the deserializer to know that the gzip compression was used:

https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/serialization/msgpack.py#L58-L81

Here I would like to note that given the extension, fsspec (default) could also infer the compression by inspecting the filename suffix. But I can see how this might cause problems if somebody would like to switch out fsspec with something else (although I would have no idea with what and why :D )

Other spots within the codebase that are coupled to this compression assumption are the methods from the SquirrelStore:

https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L12-L67

Or to show the significant parts:

get: https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L40-L41

set: https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L59-L60

keys: https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L66-L67

In my internal benchmarks, I was able to greatly speed up the data loading by simply using no compression at all (None). Although I am fully aware that the correct compression mode heavily depends on the specific hardware/use case. But even in a network limited domain, I can see reasons to then prefer xz instead due to its better compression ratio and relatively similar decompression speed to gzip.

IMHO, I think it should be ok to not store any suffix at all for the squirrel store. If I/a user looks inside of the squirrel store URL I think it is not mandatory to show what compression algorithm was used. The user could/should use the designated driver/metadata that comes bundled with the dataset and let the driver handle the correct decompression.

If you don't agree I still think the gz extension doesn't have to be 'hardcoded' into these functions. This is actually something that confused me when I was looking at the internals of the code base. So instead, we could use something like:

comp = kwargs.get("compression", "gzip") comp_to_ext_dict[comp] # just to show the concept

With these modifications, it should be possible to utilize different compression modes and make them easily configurable. I would be very happy to create a PR and contribute to this project!
enhancement
opened by kai-tub 3

Releases(v0.18.0)

v0.18.0(Nov 10, 2022)
What's Changed

zip_index method for Composable by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/92

Quantify randomness of shuffle in squirrel by @winfried-ripken in https://github.com/merantix-momentum/squirrel-core/pull/86

Change Catalog repr to sorted set by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/94

Installation instruction by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/96

Upgrade requirements by @winfried-ripken in https://github.com/merantix-momentum/squirrel-core/pull/97

Reference Huggingface, Hub and Torchvision Drivers by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/99

Update requirements by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/101

Refactoring DataFrameDriver and related drivers by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/98

Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.17.7...v0.18.0
Source code(tar.gz)
Source code(zip)
v0.17.7(Oct 7, 2022)
What's Changed

Add hooks to check backwards compatibility with py3.6+ by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/87

Add pyupgrade, yaml formatting and update all hooks by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/88

Fix file driver storage options by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/85

Peng add kwargs to map by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/90

Add hooks to csv driver by @winfried-ripken in https://github.com/merantix-momentum/squirrel-core/pull/91

Explain automatic version iteration by @AdemFr in https://github.com/merantix-momentum/squirrel-core/pull/84

Add csv driver option to specify csv read args by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/93

New Contributors

@MaxSchambach made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/93

Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.17.4...v0.17.7
Source code(tar.gz)
Source code(zip)
v0.17.4(Aug 31, 2022)
What's Changed

Make this repo installable with all python versions by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/82

Fix storage options by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/83

Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.17.2...v0.17.4
Source code(tar.gz)
Source code(zip)
v0.17.2(Aug 25, 2022)
What's Changed

Make CatalogSource visible in the API by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/71

Minor tweaks in documentation by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/73

Introduce rst linting via precommit hook by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/74

Remove binary file in tests dir by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/75

Unifies folder-creation behaviour when instantiation SquirrelStore by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/72

Bugfix - Register Torch Composables by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/78

Upgrade infra to py3.9 by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/79

Add storage options kwargs to FPGen by @mg515 in https://github.com/merantix-momentum/squirrel-core/pull/81

New Contributors

@axkoenig made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/78

@mg515 made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/81

Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.16.0...v0.17.2
Source code(tar.gz)
Source code(zip)
v0.16.0(Jul 26, 2022)
What's Changed

introduce loop and fixed size iterable by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/47

Move cla assistant to workflows by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/62

*add tutorials, *ignore test in api-ref, *remove unused execption by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/63

First draft of advanced section for iterstreams by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/55

Update Doc-String of MapDriver.get_iter by @kai-tub in https://github.com/merantix-momentum/squirrel-core/pull/61

Composable.compose gets source as kwarg, which is equal to self by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/66

Peng add pytorch convenience functions to composable by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/69

partial function for keys method by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/70

New Contributors

@kai-tub made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/61

Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/0.14.2...v0.16.0
Source code(tar.gz)
Source code(zip)
0.14.2(Jun 23, 2022)
What's Changed

change squirrel test using a tmp public bucket by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/46

Update fs.open mode for catalog by @AdemFr in https://github.com/merantix-momentum/squirrel-core/pull/48

CatalogKey can be used to index catalog by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/49

accept callable as source for composable to make it completly lazy by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/44

add sphinxcontrib-mermaid by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/51

Architecture overview by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/54

*add advanced store *reorganize sections *add icon,favicon by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/53

Create codeql-analysis.yml by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/52

Upgrade numpy & numba by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/57

Winnie bump pyjwt by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/58

New Contributors

@AdemFr made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/48

@pzdkn made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/53

@winfried-loetzsch made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/57

Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.13.2...0.14.2
Source code(tar.gz)
Source code(zip)
v0.13.2(May 18, 2022)
What's Changed

Fix SourceCombiner.get_iter() not interleaving correctly by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/45

Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.13.1...v0.13.2
Source code(tar.gz)
Source code(zip)
v0.13.1(May 18, 2022)
What's Changed

Add community files by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/38

Minor requirement changes by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/40

messagepack unpacker set use_list argument to False by default by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/39

New Contributors

@AlpAribal made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/40

Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.3...v0.13.1
Source code(tar.gz)
Source code(zip)
v0.12.3(Apr 11, 2022)
What's Changed

Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/31

pin numpy and update PR template by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/34

Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/33

update document links by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/36

update version to 0.12.3 by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/37

Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.2...v0.12.3
Source code(tar.gz)
Source code(zip)
v0.12.2(Apr 6, 2022)
What's Changed

update img to github raw file so public pypi can load it by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/26

Tiansu add readthedocs.yml by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/27

add dependencies for readthedoc by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/28

fix readthedoc by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/29

update readthedocs links by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/30

Tiansu move leftover commits by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/32

Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.1...v0.12.2
Source code(tar.gz)
Source code(zip)
v0.12.1(Apr 5, 2022)
What's Changed

update docs link by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/12

add logo by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/13

remove old extra file by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/14

add back keyring until public release by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/16

key_hook param of get_iter accepts SplitByRank and SplitByWorker, par… by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/15

fix install instruction by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/18

Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/19

Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/20

Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/21

Tiansu update black by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/22

add CLA bot by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/23

switch to publish in public pypi by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/24

update version to 0.12.1 by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/25

New Contributors

@ThomasWollmann made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/13

@AlirezaSohofi made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/15

Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.0...v0.12.1
Source code(tar.gz)
Source code(zip)
v0.12.0(Mar 12, 2022)
What's Changed

add basic files to get infrastructure running by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/3

new semantic versioning format for dev release by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/4

tiansu copy squirrel codebase by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/5

Tiansu add docs by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/9

add pypi classifiers by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/10

change version norm by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/11

Full Changelog: https://github.com/merantix-momentum/squirrel-core/commits/v0.12.0
Source code(tar.gz)
Source code(zip)

A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way :chestnut:

Related tags

Overview

Squirrel Core

What is Squirrel?

Installation

Documentation

Example Notebooks

Contributing

The humans behind Squirrel

Citation

Comments

Type of change

Checklist:

Description

Type of change

Checklist:

Description

Type of change

Checklist:

Description

Type of change

Checklist:

Description

Type of change

Checklist:

Description

Type of change

Checklist:

Description

Type of change

Checklist:

Description

Type of change

Checklist:

Description

Type of change

Checklist:

Description

Type of change

Checklist:

Description

Type of change

Checklist:

Description

Type of change

Checklist:

Description

Type of change

Checklist:

Description

Type of change

Checklist:

Description

Type of change

Checklist:

Draft State!

Releases(v0.18.0)

v0.18.0(Nov 10, 2022)

What's Changed

v0.17.7(Oct 7, 2022)

What's Changed

New Contributors

v0.17.4(Aug 31, 2022)

What's Changed

v0.17.2(Aug 25, 2022)

What's Changed

New Contributors

v0.16.0(Jul 26, 2022)

What's Changed

New Contributors

0.14.2(Jun 23, 2022)

What's Changed

New Contributors

v0.13.2(May 18, 2022)

What's Changed

v0.13.1(May 18, 2022)

What's Changed

New Contributors

v0.12.3(Apr 11, 2022)