Hub is a dataset format with a simple API for creating, storing, and collaborating on AI datasets of any size.

Overview


Dataset Format for AI

Documentation · Getting Started · API Reference · Examples · Blog · Slack Community · Twitter

About Hub

Hub is a dataset format with a simple API for creating, storing, and collaborating on AI datasets of any size. The hub data layout enables rapid transformations and streaming of data while training models at scale. Hub is used by Google, Waymo, Red Cross, Oxford University, and Omdena.

Hub includes the following features:

  • Storage agnostic API: Use the same API to upload, download, and stream datasets to/from AWS S3/S3-compatible storage, GCP, Activeloop cloud, local storage, or in-memory storage.
  • Compressed storage: Store images, audio, and videos in their native compression, decompressing them only when needed, e.g., when training a model.
  • Lazy NumPy-like slicing: Treat your S3 or GCP datasets as if they are a collection of NumPy arrays in your system's memory. Slice them, index them, or iterate through them. Only the bytes you ask for will be downloaded!
  • Dataset version control: Commits, branches, checkout - concepts you are already familiar with from your code repositories can now be applied to your datasets as well (see the sketch after this list).
  • Third-party integrations: Hub comes with built-in integrations for PyTorch and TensorFlow. Train your model with a few lines of code - we even take care of dataset shuffling. :)
  • Distributed transforms: Rapidly apply transformations on your datasets using multi-threading, multi-processing, or our built-in Ray integration.
  • Instant visualization support: Hub datasets are instantly visualized with bounding boxes, masks, annotations, etc. in Activeloop Platform (see below).
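
As a rough sketch of how a couple of these features look in code (the CIFAR-10 path is a public Hub dataset; the local path, tensor name, branch name, and appended value below are illustrative):

import hub

ds = hub.load('hub://activeloop/cifar10-train')  # streamed, nothing downloaded yet
batch = ds.images[0:16].numpy()                  # lazy slicing: only these bytes are fetched

# Version control on a throwaway local dataset
ds2 = hub.empty('./vc_demo', overwrite=True)
ds2.create_tensor('labels')
with ds2:
    ds2.labels.append(1)
commit_id = ds2.commit('add first label')        # snapshot the current state
ds2.checkout('experiment', create=True)          # create and switch to a new branch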

Getting Started with Hub

🚀 How to install Hub

Hub is written in 100% Python and can be quickly installed using pip.

pip3 install hub
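
To verify the installation, you can print the installed package version (a quick, optional check):

python3 -c "import hub; print(hub.__version__)"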

🧠 Training a PyTorch model on a Hub dataset

Load CIFAR-10, one of the readily available datasets in Hub:

import hub
import torch
from torchvision import transforms, models

ds = hub.load('hub://activeloop/cifar10-train')

Inspect tensors in the dataset:

ds.tensors.keys()    # dict_keys(['images', 'labels'])
ds.labels[0].numpy() # array([6], dtype=uint32)
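
The class names behind those labels are stored as tensor metadata; they are used further below to size the model's output layer (list abbreviated here):

ds.labels.info.class_names  # ['airplane', 'automobile', ..., 'truck']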

Train a PyTorch model on the CIFAR-10 dataset without needing to download it

First, define a transform for the images and use Hub's built-in PyTorch one-line dataloader to connect the data to the compute:

tform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
])

hub_loader = ds.pytorch(num_workers=0, batch_size=4, transform={
                        'images': tform, 'labels': None}, shuffle=True)
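
Before training, it can help to pull a single batch and sanity-check the shapes (the sizes shown assume CIFAR-10's 32x32 RGB images and the batch_size=4 set above):

batch = next(iter(hub_loader))
batch['images'].shape  # torch.Size([4, 3, 32, 32])
batch['labels'].shape  # torch.Size([4, 1])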

Next, define the model, loss and optimizer:

net = models.resnet18(pretrained=False)
net.fc = torch.nn.Linear(net.fc.in_features, len(ds.labels.info.class_names))
    
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

Finally, the training loop for 2 epochs:

for epoch in range(2):
    running_loss = 0.0
    for i, data in enumerate(hub_loader):
        images, labels = data['images'], data['labels']
        
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(images)
        loss = criterion(outputs, labels.reshape(-1))
        loss.backward()
        optimizer.step()
        
        # print statistics
        running_loss += loss.item()
        if i % 100 == 99:    # print every 100 mini-batches
            print('[%d, %5d] loss: %.3f' %
                (epoch + 1, i + 1, running_loss / 100))
            running_loss = 0.0

🏗️ How to create a Hub Dataset

A Hub dataset can be created in various locations (storage providers). Here is how the path for each of them looks:

Storage provider         Example path
Activeloop cloud         hub://user_name/dataset_name
AWS S3 / S3-compatible   s3://bucket_name/dataset_name
GCP                      gcp://bucket_name/dataset_name
Local storage            path to local directory
In-memory                mem://dataset_name
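
For example, a minimal sketch of initializing datasets at two of these locations (the bucket and folder names are placeholders, and S3 credentials are assumed to be configured in your environment):

import hub

ds_local = hub.empty("./my_local_dataset")         # local storage
ds_cloud = hub.empty("s3://my-bucket/my_dataset")  # AWS S3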

Let's create a dataset in the Activeloop cloud. Activeloop cloud provides free storage up to 300 GB per user (more info here). Create a new account with Hub from the terminal using activeloop register if you haven't already. You will be asked for a user name, email ID, and password. The user name you enter here will be used in the dataset path.

$ activeloop register
Enter your details. Your password must be at least 6 characters long.
Username:
Email:
Password:

Initialize an empty dataset in the Activeloop Cloud:

import hub

ds = hub.empty("hub://<username>/test-dataset")

Next, create a tensor to hold images in the dataset we just initialized:

images = ds.create_tensor("images", htype="image", sample_compression="jpg")
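
Additional tensors are created in the same way; for instance, a hedged sketch of a class_label tensor for annotations (the tensor name and class names here are illustrative, not part of the walkthrough):

labels = ds.create_tensor("labels", htype="class_label", class_names=["cat", "dog"])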

Assuming you have a list of image file paths, let's upload them to the dataset:

image_paths = ...
with ds:
    for image_path in image_paths:
        image = hub.read(image_path)
        ds.images.append(image)

Alternatively, you can also upload NumPy arrays. Since the images tensor was created with sample_compression="jpg", the arrays will be compressed using JPEG compression.

import numpy as np

with ds:
    for _ in range(1000):  # 1000 random images
        random_image = np.random.randint(0, 256, (100, 100, 3), dtype=np.uint8)  # 100x100 uint8 image with 3 channels (uint8 is expected for jpg compression)
        ds.images.append(random_image)

🚀 How to load a Hub Dataset

You can load the dataset you just created with a single line of code:

import hub

ds = hub.load("hub://<username>/test-dataset")

You can also access other publicly available Hub datasets, not just the ones you created. Here is how you would load the Objectron Bikes Dataset:

import hub

ds = hub.load('hub://activeloop/objectron_bike_train')

To get the first image in the Objectron Bikes dataset in numpy format:

image_arr = ds.image[0].numpy()
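
If samples vary in shape, they can be fetched as a list of arrays instead of a single stacked array (a sketch using the aslist option of numpy()):

image_arrs = ds.image[0:5].numpy(aslist=True)  # list of 5 arrays; shapes may differ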

📚 Documentation

Getting started guides, examples, tutorials, API reference, and other useful information can be found on our documentation page.

🎓 For Students and Educators

Hub users can access and visualize a variety of popular datasets through a free integration with Activeloop's Platform. Users can also create and store their own datasets and make them available to the public. Free storage of up to 300 GB is available for students and educators:

Storage for public datasets hosted by Activeloop     200GB Free
Storage for private datasets hosted by Activeloop    100GB Free

👩‍💻 Comparisons to Familiar Tools

Hub vs DVC

Hub and DVC offer dataset version control similar to git for data, but their methods for storing data differ significantly. Hub converts and stores data as chunked compressed arrays, which enables rapid streaming to ML models, whereas DVC operates on top of data stored in less efficient traditional file structures. The Hub format makes dataset versioning significantly easier than the traditional file structures used by DVC when datasets are composed of many files (i.e., many images). An additional distinction is that DVC primarily uses a command-line interface, whereas Hub is a Python package. Lastly, Hub offers an API to easily connect datasets to ML frameworks and other common ML tools, and enables instant dataset visualization through Activeloop's visualization tool.

Activeloop Hub vs TensorFlow Datasets (TFDS)

Hub and TFDS seamlessly connect popular datasets to ML frameworks. Hub datasets are compatible with both PyTorch and TensorFlow, whereas TFDS datasets are only compatible with TensorFlow. A key difference between Hub and TFDS is that Hub datasets are designed for streaming from the cloud, whereas TFDS must be downloaded locally prior to use. As a result, with Hub, one can import datasets directly from TensorFlow Datasets and stream them either to PyTorch or TensorFlow. In addition to providing access to popular publicly available datasets, Hub also offers powerful tools for creating custom datasets, storing them on a variety of cloud storage providers, and collaborating with others via a simple API. TFDS is primarily focused on giving the public easy access to commonly available datasets, and management of custom datasets is not its primary focus. A full comparison article can be found here.

Activeloop Hub vs HuggingFace

Hub and HuggingFace both offer access to popular datasets, but Hub primarily focuses on computer vision, whereas HuggingFace focuses on natural language processing. HuggingFace Transformers and other computational tools for NLP are not analogous to features offered by Hub.

Community

Join our Slack community to learn more about unstructured dataset management using Hub and to get help from the Activeloop team and other users.

We'd love your feedback by completing our 3-minute survey.

As always, thanks to our amazing contributors!

Please read CONTRIBUTING.md to get started with making contributions to Hub.

README Badge

Using Hub? Add a README badge to let everyone know:

[![hub](https://img.shields.io/badge/powered%20by-hub%20-ff5a1f.svg)](https://github.com/activeloopai/Hub)

Disclaimers

Dataset Licenses

Hub users may have access to a variety of publicly available datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have a license to use the datasets. It is your responsibility to determine whether you have permission to use the datasets under their license.

If you're a dataset owner and do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thank you for your contribution to the ML community!

Usage Tracking

By default, we collect usage data using Bugout (here's the code that does it). It does not collect user data other than anonymized IP address data, and it only logs the Hub library's own actions. This helps our team understand how the tool is used and how to build features that matter to you! After you register with Activeloop, data is no longer anonymous. You can always opt-out of reporting using the CLI command below:

activeloop reporting --off

Acknowledgment

This technology was inspired by our research work at Princeton University. We would like to thank William Silversmith @SeungLab for his awesome cloud-volume tool.

Comments
  • [2.0] writing/reading fixed-shape arrays to chunks

    support chunked writing (not appending) for np.arrays -> storage providers with the following qualities:

    • batched/unbatched
    • fixed shape (all samples have same shape)

    also adds:

    • in .circleci/config.yaml run pytest & pytest-benchmark separately (with --benchmark-skip & --benchmark-only flags)

    things this branch does not do:

    • appending (writing arrays to a key that already had arrays written to it)
    • caching
    • index map chunking
    • compression
    enhancement 
    opened by nollied 415
  • Create a tutorial on Colab

    Users should be able to load a dataset, train a model, and upload the dataset. Feel free to start from a small example and then make the example comprehensive.

    good first issue hacktoberfest 
    opened by davidbuniat 33
  • v1-alpha candidate

    1. Ability to modify datasets on the fly. Datasets are no longer immutable and can be modified over time.
    2. Larger datasets can now be uploaded, as we removed some RAM-limiting components from Hub.
    3. Caching is introduced to improve IO performance.
    4. Dynamic shaping enables support for very large images/data. You can have large images/data stored in Hub.
    opened by edogrigqv2 31
  • [FEATURE] Adding FFHQ dataset

    I have the 1024 and 128 scale pngs from the FFHQ dataset. I'd like to upload this as a hub:// dataset so that you can copy it to the activeloop namespace.

    Currently I am considering how to structure the dataset, and what splits it should be uploaded as.

    Below is the schema I have used so far. It includes all of the metadata from the original dataset including the URLs to the original files, and the pixel_md5 hashes match when looping back over the dataset and recomputing them.

    ds = hub.empty("./ffhq-1024", overwrite=True)
    
    with ds:
        ds.create_tensor("metadata/author", htype="text")
        ds.create_tensor("metadata/country", htype="text")
        ds.create_tensor("metadata/date_crawled", htype="text")
        ds.create_tensor("metadata/date_uploaded", htype="text")
        ds.create_tensor("metadata/license", htype="text")
        ds.create_tensor("metadata/license_url", htype="text")
        ds.create_tensor("metadata/photo_title", htype="text")
        ds.create_tensor("metadata/photo_url", htype="text")
    
        ds.create_tensor("images/image", htype="image", sample_compression="png")
        ds.create_tensor("images/face_landmarks", dtype=np.float32)
        ds.create_tensor("images/file_md5", htype="text")
        ds.create_tensor("images/file_path", htype="text")
        ds.create_tensor("images/file_url", htype="text")
        ds.create_tensor("images/file_size", dtype=np.int32)
        ds.create_tensor("images/pixel_md5", htype="text")
    
        ds.create_tensor("thumbs/image", htype="image", sample_compression="png")
        ds.create_tensor("thumbs/face_landmarks", dtype=np.float32)
        ds.create_tensor("thumbs/file_md5", htype="text")
        ds.create_tensor("thumbs/file_path", htype="text")
        ds.create_tensor("thumbs/file_url", htype="text")
        ds.create_tensor("thumbs/file_size", dtype=np.int32)
        ds.create_tensor("thumbs/pixel_md5", htype="text")
    
        ds.create_tensor("wilds/face_landmarks", dtype=np.float32)
        ds.create_tensor("wilds/face_rect", dtype=np.float32)
        ds.create_tensor("wilds/file_md5", htype="text")
        ds.create_tensor("wilds/file_path", htype="text")
        ds.create_tensor("wilds/file_url", htype="text")
        ds.create_tensor("wilds/file_size", dtype=np.int32)
        ds.create_tensor("wilds/pixel_md5", htype="text")
        ds.create_tensor("wilds/pixel_size", dtype=np.int32)
    

    Does this structure abide by Hub best practices?

    Would it be a good idea to also upload a "ffhq-128" without the 1024 images, and "ffhq-meta" without the 128 images also?

    >>> next(ds.tensorflow().as_numpy_iterator())
    {
      'metadata/author': array([b'Jeremy Frumkin'], dtype=object), 
      'metadata/country': array([b''], dtype=object), 
      'metadata/date_crawled': array([b'2018-10-10'], dtype=object), 
      'metadata/date_uploaded': array([b'2007-08-16'], dtype=object), 
      'metadata/license': array([b'Attribution-NonCommercial License'], dtype=object), 
      'metadata/license_url': array([b'https://creativecommons.org/licenses/by-nc/2.0/'], dtype=object), 
      'metadata/photo_title': array([b'DSCF0899.JPG'], dtype=object), 
      'metadata/photo_url': array([b'https://www.flickr.com/photos/frumkin/1133484654/'], dtype=object), 
      
      'images/image': array([[[  0, 133, 147], ..., [132, 157, 164]]], dtype=uint8), 
      'images/face_landmarks': array([[131.62, 453.8 ], ..., [521.04, 715.26]], dtype=float32), 
      'images/file_md5': array([b'ddeaeea6ce59569643715759d537fd1b'], dtype=object), 
      'images/file_path': array([b'images1024x1024/00000/00000.png'], dtype=object), 
      'images/file_size': array([1488194], dtype=int32), 
      'images/file_url': array([b'https://drive.google.com/uc?id=1xJYS4u3p0wMmDtvUE13fOkxFaUGBoH42'], dtype=object), 
      'images/pixel_md5': array([b'47238b44dfb87644460cbdcc4607e289'], dtype=object), 
      
      'thumbs/image': array([[[  0, 130, 146], ..., [134, 157, 163]]], dtype=uint8), 
      'thumbs/face_landmarks': array([[ 16.4525 ,  56.725  ], ..., [ 65.13   ,  89.4075 ]], dtype=float32), 
      'thumbs/file_md5': array([b'bd3e40b2ba20f76b55dc282907b89cd1'], dtype=object), 
      'thumbs/file_path': array([b'thumbnails128x128/00000/00000.png'], dtype=object), 
      'thumbs/file_size': array([29050], dtype=int32), 
      'thumbs/file_url': array([b'https://drive.google.com/uc?id=1fUMlLrNuh5NdcnMsOpSJpKcDfYLG6_7E'], dtype=object), 
      'thumbs/pixel_md5': array([b'38d7e93eb9a796d0e65f8c64de8ba161'], dtype=object), 
      
      'wilds/face_landmarks': array([[ 562.5,  697.5], ..., [1060.5,  996.5]], dtype=float32), 
      'wilds/face_rect': array([ 667.,  410., 1438., 1181.], dtype=float32), 
      'wilds/file_md5': array([b'1dc0287e73e485efb0516a80ce9d42b4'], dtype=object), 
      'wilds/file_path': array([b'in-the-wild-images/00000/00000.png'], dtype=object), 
      'wilds/file_size': array([3991569], dtype=int32), 
      'wilds/file_url': array([b'https://drive.google.com/uc?id=1yT9RlvypPefGnREEbuHLE6zDXEQofw-m'], dtype=object), 
      'wilds/pixel_md5': array([b'86b3470c42e33235d76b979161fb2327'], dtype=object), 
      'wilds/pixel_size': array([2016, 1512], dtype=int32)
    }
    

    Getting the 900GB Wilds images, along with the TFRecords that are pre-resized for each intermediate scale, is proving harder to acquire. But just hosting the 1024-scale images would already be a huge improvement in making the dataset accessible.

    enhancement 
    opened by JossWhittle 28
  • [Feature] pretty prints of objects

    🚨🚨 Feature Request

    If your feature will improve HUB

    To explore the structure of a dataset, it is convenient to have nicer and more informative prints of dataset objects and samples.

    Description of the possible solution

    1) show ds

    now

    > ds
    Dataset(path='hub://activeloop/abalone_full_dataset', tensors=['length', 'diameter', 'height', 'weight'])
    

    Something along these lines would work (taken from SQLite):

    > ds
    path: "hub://activeloop/abalone_full_dataset", samples:  1532596
    
    tensor    htype        dtype    shape       compression
    ------    ------       ------   ------      -----------
    length    image        uint8    256x256x3   jpeg
    diameter  image        float32  512x512x3   zstd
    height    image        float32  512x512x3   zstd
    weight    class_label  int32    32          None
    
    

    and in a Jupyter notebook, shown as a table similar to pandas.

    2) show ds.tensor

    now

    > ds.height
    Tensor(key='Length')
    

    At least provide full information about the tensor:

    > ds.height
    Tensor(
        key='height', 
        htype='image', 
        dtype='uint8', 
        shape=(256, 256, 3), 
        sample_compression='jpeg'
    )
    

    or, to make it consistent with 1):

    > ds.height
    tensor    htype    dtype     shape       compression
    ------    ------   ------    ------      -----------
    height    image    float32   512x512x3   zstd
    

    3) show ds[0:5] sample

    > ds[0:5]
        length    diameter     height     weight
        ------    --------     ------     ------
    0      0.5    [[0.,...,0]] "sent.."      dog   
    0      0.5    [[0.,...,0]] "text a"      dog   
    0      0.5    [[0.,...,0]] "text b"      dog   
    

    and in a Jupyter notebook, visualize images (and other htypes).

    Notes

    • [ ] Feel free to provide a better format for printing dataset, tensor, and sample classes
    • [ ] Feel free to suggest other important classes/objects that need to be printed properly for exploring the structure
    enhancement good first issue 
    opened by davidbuniat 25
  • [FEATURE] Benchmarking memory

    🚨🚨 Feature Request

    • [ ] Related to an existing Issue
    • [X] A new implementation (Improvement, Extension)

    We should benchmark memory usage when fetching from a Hub dataset.

    If your feature will improve HUB

    In the near term, well-scoped memory benchmarks will help assess new features. In the long term, they can be used to compare performance with other libraries such as Zarr and Tile.

    Description of the possible solution

    We could start with a client-side benchmark reading from a local volume, perhaps with memory-profiler.

    help wanted good first issue 
    opened by mynameisvinn 25
  • [BUG] Tests fail in Windows Environment specifically

    🐛🐛 Bug Report

    In the current test sequence, 11 tests fail with the error AttributeError: module 'numcodecs' has no attribute 'MsgPack'; however, this error does not occur in Colab environments.

    ⚗️ Current Behavior

    When pytest . is run on a Windows 10 environment, 11 tests fail and 6 of them have the error message as described above.

    Expected behavior/code: These errors should not be thrown.

    ⚙️ Environment

    • Python version(s):
      • Python 3.7.9
    • OS: Windows 10

    🖼 Additional context/Screenshots (optional)

    Add any other context about the problem here. If applicable, add screenshots to help explain.

    opened by DebadityaPal 21
  • MPII Human Pose Dataset

    Describe the dataset

    Add the MPII Human Pose Dataset to Hub, so this would work:

    import hub
    ds = hub.load("username/mpii-human-pose-dataset")
    

    Steps

    1. Please take a look at the docs on uploading datasets.

    2. The uploading script should be added to the examples folder.

    Example

    You can find an example of large dataset loading and upload here:

    • https://github.com/activeloopai/Hub/blob/master/examples/coco/upload_coco2017.py
    good first issue hacktoberfest dataset 
    opened by kristinagrig06 21
  • [FEATURE] Append MPL headers on source

    🚨🚨 Feature Request

    • [x] A new implementation (Improvement, Extension)

    Is your feature request related to a problem?

    Hub currently uses Mozilla Public License (MPL), which requires the following header (from Exhibit A of the license) to be attached to source.

    This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at https://mozilla.org/MPL/2.0/.
    

    We need help appending MPL headers on source (where appropriate).

    good first issue 
    opened by mynameisvinn 20
  • hub-2.0 chunk generator

    This is an essential part of the chunk engine. This contribution is narrow in scope (it does not implement the whole chunk engine). I also added explicit type checking during pytests using the pytest-mypy package.

    This contribution converts bytes -> chunks and has tests covering as many edge cases as possible.

    Note: this chunk generator chunks with respect to the primary axis. It does not support slicing, but I came up with a modification that will support it.

    Let's merge this into release/2.0 first to get the ball rolling, and I will make another PR with the modification to support slicing.

    enhancement v2 
    opened by nollied 19
  • Add the Fine-Grained Visual Categorization IMET 2020 dataset

    Describe the dataset

    Add the IMET 2020 FGVC7 dataset to Hub, so this would work:

    import hub
    ds = hub.load("username/imet-2020-fgvc7")
    

    Steps

    1. Please take a look at the docs on uploading datasets.

    2. The uploading script should be added to the examples folder.

    Example

    You can find an example of large dataset loading and upload here:

    • https://github.com/activeloopai/Hub/blob/master/examples/coco/upload_coco2017.py
    good first issue hacktoberfest dataset 
    opened by mikayelh 19
  • [DL-943] Nones + transform fix

    🚀 🚀 Pull Request

    Checklist:

    • [ ] My code follows the style guidelines of this project and the Contributing document
    • [ ] I have commented my code, particularly in hard-to-understand areas
    • [ ] I have kept the coverage-rate up
    • [ ] I have performed a self-review of my own code and resolved any problems
    • [ ] I have checked to ensure there aren't any other open Pull Requests for the same change
    • [ ] I have described and made corresponding changes to the relevant documentation
    • [ ] New and existing unit tests pass locally with my changes

    Changes

    opened by farizrahman4u 1
  • [BUG] pytorch dataloader index error

    🐛🐛 Bug Report

    I'm trying to understand an issue that is making the PyTorch data loader from deeplake throw an index error for some samples unexpectedly. When I try to fetch the data directly from the dataset, the behaviour is not reproducible.

    The error first appeared during model training. I was able to reproduce it with the following code:

    import numpy as np
    import torchvision.transforms as T  # assumed import for T.ToTensor() below

    def deeplake_transform(sample_in, patch_size: int, num_seg_classes: int):
        seg_indices = sample_in["masks/label"]
        partial_mask = sample_in["masks/mask"].astype("float32")
        full_mask = np.zeros((num_seg_classes, patch_size, patch_size), dtype=np.float32)
        for i, idx in enumerate(seg_indices):
            full_mask[idx] = partial_mask[i]

        return dict(
            inputs=dict(image=T.ToTensor()(sample_in["images"])),
            targets=dict(
                segmentations=full_mask,
                classifications=sample_in["labels"].astype("float32"),
            ),
        )

    data_loader = ds.pytorch(
        transform=deeplake_transform,
        decode_method={"images": "numpy"},
        batch_size=1,
        num_workers=1,
        transform_kwargs={"num_seg_classes": 67, "patch_size": 512},
    )
    iter_loader = iter(data_loader)

    idx = 0  # sample counter
    while True:
        try:
            sample = next(iter_loader)
        except Exception as e:
            print(e)
            break

        idx += 1
        if idx == len(ds):
            print("finished")
            break
    

    The following error is thrown without much context.

    Caught IndexError in DataLoader worker process 0.
    Original Traceback (most recent call last):
      File ".venv/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
        data = fetcher.fetch(index)
      File ".venv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
        data.append(next(self.dataset_iter))
      File "/home/test/.venv/lib/python3.8/site-packages/deeplake/integrations/pytorch/dataset.py", line 472, in __iter__
        for data in stream:
      File ".venv/lib/python3.8/site-packages/deeplake/core/io.py", line 311, in read
        yield from self.stream(block)
      File "/home/test/.venv/lib/python3.8/site-packages/deeplake/core/io.py", line 355, in stream
        data = engine.read_sample_from_chunk(
      File ".venv/lib/python3.8/site-packages/deeplake/core/chunk_engine.py", line 1528, in read_sample_from_chunk
        return chunk.read_sample(
      File ".venv/lib/python3.8/site-packages/deeplake/core/chunk/uncompressed_chunk.py", line 213, in read_sample
        sb, eb = bps[local_index]
      File ".venv/lib/python3.8/site-packages/deeplake/core/meta/encode/base_encoder.py", line 247, in __getitem__
        self._encoded[row_index], row_index, local_sample_index
    IndexError: index 7133 is out of bounds for axis 0 with size 7133
    

    But the following code produces no errors and exhausts the iterator.

    for sample in ds:
        try: # try to read all the data that is used in the 
            sample["images"].data()['value']
            sample["masks/mask"].data()['value']
            sample["masks/label"].data()['value']
            sample["labels"].data()['value']
        except:
            break
        
    

    I'm looking for help here since it may be related to the chunk_engine behaviour. It could help if the internal exception handler were more explicit about the error.

    ⚙️ Environment

    • Python version(s): 3.8.10
    • OS: Ubuntu 18.04
    • IDE: VS-Code
    • Packages: [torch==1.13.1, deeplake==3.1.7]
    bug 
    opened by lspinheiro 2
  • Tweaks to readme

    🚀 🚀 Pull Request

    Checklist:

    • [ ] My code follows the style guidelines of this project and the Contributing document
    • [ ] I have commented my code, particularly in hard-to-understand areas
    • [ ] I have kept the coverage-rate up
    • [ ] I have performed a self-review of my own code and resolved any problems
    • [ ] I have checked to ensure there aren't any other open Pull Requests for the same change
    • [ ] I have described and made corresponding changes to the relevant documentation
    • [ ] New and existing unit tests pass locally with my changes

    Changes

    opened by istranic 1
  • Parquet reader

    🚀 🚀 Pull Request

    Checklist:

    • [ ] My code follows the style guidelines of this project and the Contributing document
    • [ ] I have commented my code, particularly in hard-to-understand areas
    • [ ] I have kept the coverage-rate up
    • [ ] I have performed a self-review of my own code and resolved any problems
    • [ ] I have checked to ensure there aren't any other open Pull Requests for the same change
    • [ ] I have described and made corresponding changes to the relevant documentation
    • [ ] New and existing unit tests pass locally with my changes

    Changes

    opened by farizrahman4u 0
  • Add support for saving query in query.json

    🚀 🚀 Pull Request

    Checklist:

    • [ ] My code follows the style guidelines of this project and the Contributing document
    • [ ] I have commented my code, particularly in hard-to-understand areas
    • [ ] I have kept the coverage-rate up
    • [ ] I have performed a self-review of my own code and resolved any problems
    • [ ] I have checked to ensure there aren't any other open Pull Requests for the same change
    • [ ] I have described and made corresponding changes to the relevant documentation
    • [ ] New and existing unit tests pass locally with my changes

    Changes

    opened by adolkhan 0
  • Added support for audio/video support in hub.ingest

    🚀 🚀 Pull Request

    Checklist:

    • [ ] My code follows the style guidelines of this project and the Contributing document
    • [ ] I have commented my code, particularly in hard-to-understand areas
    • [ ] I have kept the coverage-rate up
    • [ ] I have performed a self-review of my own code and resolved any problems
    • [ ] I have checked to ensure there aren't any other open Pull Requests for the same change
    • [ ] I have described and made corresponding changes to the relevant documentation
    • [ ] New and existing unit tests pass locally with my changes

    Changes

    Resolves #1556

    opened by aadityasinha-dotcom 1
Releases(v3.1.7)
  • v3.1.7(Dec 30, 2022)

    🧭 What's Changed

    • [AL-2069] Adds tensorflow support to enterprise dataloader (#2079) @AbhinavTuli
    • Removed pandas dependency (#2085) @adolkhan
    • Fix Random split + views issue (#2084) @AbhinavTuli
    • [CUS-64] Enterprise dataloader support kwargs, fixes issue with pytorch lightning (#2080) @AbhinavTuli
    • [BUGFIX] [CUS-62] Transform append with empty samples (#2077) @FayazRahman

    🗂 Documentation

    • Added ingest_coco to docs (#2082) @ProgerDav

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman, @ProgerDav and @adolkhan

  • v3.1.6(Dec 28, 2022)

    🧭 What's Changed

    • [AL-2067] Add NIFTI support (#2076) @FayazRahman
    • Print hint to forward the visualizer port. (#2069) @khustup
    • Remove torch dependency (#2074) @levongh

    🚀 New

    • [DL-824] Ingestion for COCO format (#2027) @ProgerDav

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman, @ProgerDav, @khustup and @levongh

  • v3.1.5(Dec 22, 2022)

    🧭 What's Changed

    • Tests fix for python 3.8 after numpy update (#2070) @farizrahman4u
    • [BUGFIX] Fix PIL decode method with multiple workers and shuffling (#2068) @FayazRahman
    • [AL-2078] Switch random split doc section (#2064) @AbhinavTuli
    • [AL-1976] Adds downsampling support (#2034) @AbhinavTuli
    • mmdet_test_fix (#2067) @adolkhan
    • [CUS-50] MMdet Mask Fix (#2052) @adolkhan

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman, @adolkhan, @farizrahman4u and @istranic

  • v3.1.4(Dec 15, 2022)

    🧭 What's Changed

    • [Bug Fix] Pickling fix for DDP + Enterprise loader (#2059) @AbhinavTuli
    • [AL-1995] Adds ability to randomly split Deep Lake datasets (#2035) @AbhinavTuli
    • [CUS-48] MMDet DDP test fix (#2040) @farizrahman4u
    • [AL-2045] Fix corruption caused by pop (#2057) @farizrahman4u
    • Remove pandas imports (#2053) @farizrahman4u
    • MMDet + DDP progressbar fix (#2050) @farizrahman4u
    • [AL-2054] Rechunk bug fix and speedup (#2056) @FayazRahman
    • [AL-2037] Print error that sequences are not allowed with the pytorch dataloader (#2046) @farizrahman4u
    • [AL-2053] Log .dataloader instead of .numpy and .pytorch (#2054) @AbhinavTuli
    • [DL-920] Better version control for views (#2032) @FayazRahman
    • [DL-805] Groups + Loader fixes (#2045) @farizrahman4u

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman and @farizrahman4u

  • v3.1.3(Dec 9, 2022)

    🧭 What's Changed

    • Fix _temp_tensors attribute error (#2044) @FayazRahman
    • [CUS-57] [CUS-58] In place ds connect (#2041) @ProgerDav
    • [AL-2002] Cache libdeeplake dataset to speed up repeated use (#2036) @AbhinavTuli
    • Fix transform readonly tests (#2047) @AbhinavTuli
    • [DL-761] mesh htype support (#1940) @adolkhan
    • [DL-815] Unifying src_token and dest_token to token (#2038) @adolkhan
    • Fixing torch import(#2042) @adolkhan
    • [CUS-56] Restrict characters in dataset names (#2037) @FayazRahman
    • [DL-793][CUS-46] Add wandb logging to indra loader (#2039) @farizrahman4u
    • Allow the use of compute functions on read-only datasets (#2019) @daniel-falk
    • MMDet Augmentations Fix (#2033) @adolkhan

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman, @ProgerDav, @adolkhan, @daniel-falk and @farizrahman4u

  • v3.1.2(Dec 1, 2022)

    🧭 What's Changed

    • [DL-888] Dataset copying speedup and fixes (#2005) @FayazRahman
    • Do not hide S3 access errors (#1884) @daniel-falk
    • [DL-905] [DL-916] Consistent progressbar arg + example for decode_method (#2021) @FayazRahman

    ⚙️ Who Contributes

    @FayazRahman and @daniel-falk

  • v3.1.1(Nov 29, 2022)

    🧭 What's Changed

    • Mmdet integration (#2026) @adolkhan
    • Allow persistent workers in dataloader (#2028) @AbhinavTuli
    • [AL-2012] speedup pop element from dataset (#2024) @levongh
    • [AL-2036] remove tiled image extraction (#2017) @levongh
    • Handle repeated samples in shuffle (#2018) @AbhinavTuli
    • [DL-910] Tensorflow iteration fix (#2013) @FayazRahman

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman, @adolkhan and @levongh

  • v3.1.0(Nov 17, 2022)

    🧭 What's Changed

    • [DL-896] pip install deeplake[enterprise] (#2008) @farizrahman4u
    • [AL-2017] Add decode method to Pytorch API (#1991) @AbhinavTuli
    • [DL-885] Fix iteration warnings (#1989) @FayazRahman
    • [CUS-35] Fix merging class labels when class names aren't populated (#2007) @AbhinavTuli
    • Allow np.array as sampler weights. Update docs. (#1999) @khustup
    • [DL-893] Fast UUID + speedup sample id tensor (#1988) @farizrahman4u
    • [AL-2024] Add MPL license to Deep Lake in Pypi (#1998) @AbhinavTuli

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman, @farizrahman4u and @khustup

  • v3.0.18(Nov 11, 2022)

    🧭 What's Changed

    • Bump libdeeplake version to fix issue with dataloader crashing over multiple epochs(#2000) @AbhinavTuli
    • [DL-811] [DL-857] API reference updates (#1977) @FayazRahman

    ⚙️ Who Contributes

    @AbhinavTuli and @FayazRahman

  • v3.0.17(Nov 10, 2022)

    🧭 What's Changed

    • [CUS-32] Fix dataloader behaviour for json and list tensors (#1995) @AbhinavTuli
    • [CUS-30] Add support for bytes in json tensors (#1994) @AbhinavTuli
    • Add timeout to Pypi version check (#1996) @AbhinavTuli

    ⚙️ Who Contributes

    @AbhinavTuli

  • v3.0.16(Nov 9, 2022)

    🧭 What's Changed

    • Libdeeplake update to fix issue with linked tensors on certain systems (#1992) @levongh
    • [AL-1850] [CUS-29] Version control diff and merge improvements (#1862) @AbhinavTuli
    • Adds support for sampling. (#1987) @khustup
    • [DL-879] Improve download API (#1986) @FayazRahman
    • [AL-1992] [CUS-18] Fixes token expiration issue using hub:// datasets (#1983) @AbhinavTuli
    • Mesh & Point Cloud htype's docs (#1979) @adolkhan

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman, @adolkhan, @khustup and @levongh

  • v3.0.15(Nov 4, 2022)

    🧭 What's Changed

    • Serve link creds for non deeplake datasets in ds.visualize (#1974) @khustup
    • [DL-790] Speedup extend (#1936) @farizrahman4u

    ⚙️ Who Contributes

    @farizrahman4u and @khustup

  • v3.0.14(Nov 1, 2022)

    🧭 What's Changed

    • [AL-2010] Fixes verification of linked samples during rechunking (#1980) @AbhinavTuli
    • No Wheels (fix for pip install on Windows) (#1976) @farizrahman4u
    • [AL-2011] Fixes a bug with popping samples (#1975) @AbhinavTuli
    • [AL-1964] Expose path for linked tensors (#1963) @AbhinavTuli
    • [DL-759] Deeplake connect (#1951) @ProgerDav

    ⚙️ Who Contributes

    @AbhinavTuli, @ProgerDav and @farizrahman4u

  • v3.0.13(Oct 28, 2022)

    🧭 What's Changed

    • Update libdeeplake version (#1970) @AbhinavTuli
    • Update shuffle buffer to handle bytes (#1968) @AbhinavTuli

    ⚙️ Who Contributes

    @AbhinavTuli

  • v3.0.12(Oct 27, 2022)

    🧭 What's Changed

    • Libdeeplake fixes and improvements (#1964) @AbhinavTuli
      • Greatly improves performance when working with compressed jpeg and png data
      • Experimental dataloader transforms now receive PIL images instead of numpy arrays, ToPILImage transform should not be included
      • Fixes deadlocking issue when multiple nested dataloaders are created
      • Fixed unexpected segmentation faults
      • Added wheels for centOS
      • Added wheels for arm64 and x86_64 (fixed linking errors during lib import)
    • [DL-819] Add error messages related to user not being logged in (#1955) @adolkhan
    • [DL-804] Dont support group.info (#1960) @FayazRahman
    • [DL-782] Delete temp tensors in case append fails during transforms (#1924) @FayazRahman
    • Improves experimental dataloader performance for tensors with jpeg and png images (#1961) @AbhinavTuli
    • [AL-1999] [Bug fix] Info not being updated after using Deep Lake compute on dataset. (#1956) @AbhinavTuli
    • Fixed shape polygon fix (#1959) @FayazRahman
    • [DL-821] Fix allowing commit on views (#1953) @farizrahman4u
    • [DL-814][CUS-14][CUS-17] Pytorch fixes (#1949) @farizrahman4u
    • [CUS-22] Update query and htypes api reference (#1948) @FayazRahman
    • [CUS-24] Fix polygons bug with fixed shape inputs (#1950) @farizrahman4u
    • [DL-756] Log loading creds except in transforms (#1937) @FayazRahman
    • [Dl 706] Improve speed of materialization (#1902) @adolkhan
    • [AL-1990] add shuffle argument to .shuffle for experimental dataloader(#1942) @levongh
    • [DL-726][DL-789] Ignore corrupt tensors + fetch_chunks for .data(), .text() etc (#1932) @farizrahman4u
    • [DL-798] Fix partial read skip for chunk compressed chunks (#1939) @farizrahman4u

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman, @adolkhan, @davidbuniat, @farizrahman4u, @istranic and @levongh

  • v3.0.10(Oct 13, 2022)

    🧭 What's Changed

    • libdeeplake upgrade (#1938) @davidbuniat
      • Query shape(image) bug fixed
      • Query regex for the contains function deployed. Example: SELECT * WHERE contains(labels, 'an') on ImageNet will return all samples whose class names contain 'an'. Two wildcards are supported: * (any number of characters, including 0) and ? (exactly one character).
    • fix read for wav compressed audio (#1935) @gorinars
    • [DL-730] Make sure hub.list does not report the token to bugout (#1917) @adolkhan
    • Update Deep Lake version after release (#1934) @AbhinavTuli

    ⚙️ Who Contributes

    @AbhinavTuli, @adolkhan, @davidbuniat, @gorinars and [email protected]

  • v3.0.9(Oct 11, 2022)

    🧭 What's Changed

    • Update libdeeplake version (#1933) @AbhinavTuli
    • [DL-764] API reference updates (#1929) @FayazRahman
    • Fix region issue with activeloop storage datasets (#1930) @AbhinavTuli
    • [DL-755] Specify transform kwargs in ds.pytorch call (#1925) @farizrahman4u
    • [DL-783] Rich compatibility (#1926) @farizrahman4u

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman and @farizrahman4u

  • v3.0.8(Oct 6, 2022)

    🧭 What's Changed

    • libdeeplake update to fix memory issues (#1927) @AbhinavTuli
    • [DL-777] Polygons bug fix (#1922) @farizrahman4u
    • Variable local cache prefix (#1839) @GMW99
    • [DL-763] Locking fix (#1921) @farizrahman4u
    • [DL-701] Columnar views (#1912) @farizrahman4u

    ⚙️ Who Contributes

    @AbhinavTuli, @GMW99 and @farizrahman4u

  • v3.0.7(Oct 5, 2022)

    🧭 What's Changed

    • Updated libdeeplake version, removes torch as dependency, fixes issue with strings in dataloader (#1919) @AbhinavTuli
    • [DL-753] [DL-722] Fix appending linked data with verify=False (#1914) @FayazRahman
    • Allow tensorflow dataset to fetch chunks (#1887) @daniel-falk
    • [DL-754] Add reporting for W&B integration (#1918) @FayazRahman

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman, @daniel-falk, @davidbuniat and @mikayelh

  • v3.0.6(Sep 30, 2022)

    🧭 What's Changed

    • Update libdeeplake version to fix issue with distributed mode (#1915) @AbhinavTuli
    • [AL-1967] Fixes issue with readonly mode error raised despite not trying to write to dataset (#1911) @AbhinavTuli

    ⚙️ Who Contributes

    @AbhinavTuli and @davidbuniat

  • v3.0.5(Sep 29, 2022)

    Introducing Deep Lake

    We are more than excited to transition into Deep Lake, the data lake for deep learning applications. Furthermore, we released:

    • an academic paper describing all technical details: https://arxiv.org/pdf/2209.10785.pdf
    • a business white paper, which you can find at https://deeplake.ai
    • the API reference, which has moved to https://docs.deeplake.ai/en/latest/

    Behind the scenes, these are the 5 key stepping stones of Deep Lake:

    1. Version Control: Git for data
    2. Visualize: In-browser visualization engine
    3. Query: Rapid queries with Tensor Query language
    4. Materialize: Format native to deep learning
    5. Stream: Streaming Data Loaders

    If you wonder...

    • Why did we rename Hub to Deep Lake?

    Hub originally was a chunked array format that naturally grew version control, a streaming engine, and query capabilities while iterating with community members. The name had become too generic to describe the tool, often leading to confusion with dataset hubs. Inspired by A. Pinhassi's blog post, we renamed the package from hub to deeplake:

     > pip3 install deeplake
    
    • Where does the Deep Lakehouse come into the picture?

    The format, including versioning and lineage, is fully open-source, while the query, streaming, and visualization engines built in C++ are as yet closed source. They are accessible to all users through a Python interface. While committed to open-source principles, we plan to open-source the high-performance engines as they commoditize.

    🧭 What's Changed

    • Update README.zh-cn.md (#1910) @tatevikh
    • Update README.md (#1909) @istranic
    • Staging 3.0.5 (#1908) @farizrahman4u
    • Tiling Fix (#1907) @farizrahman4u
    • 3.0.3 (#1906) @farizrahman4u
    • [DL-746] hub->deeplake (#1895) @farizrahman4u
    • [DL-747] API Reference updates: new compressions + new Htypes page (#1892) @FayazRahman
    • Tensor Query Language documentation (#1896) @FayazRahman
    • Added more file formats for compression (#1597) @aadityasinha-dotcom
    • Indra import fix (#1891) @farizrahman4u
    • API Reference updates (#1886) @FayazRahman
    • Update version to 2.8.6 (#1889) @AbhinavTuli

    🐛 Bug Fixes

    • Passing token down (#1903) @ProgerDav

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman, @ProgerDav, @aadityasinha-dotcom, @artgish, @davidbuniat, @farizrahman4u, @istranic, @mikayelh and @tatevikh

  • v2.8.5(Sep 20, 2022)

    🧭 What's Changed

    • [DL-717] Add installation instructions to API reference (#1882) @FayazRahman
    • [DL-702] API reference updates (#1883) @FayazRahman
    • [DL-711] Allow view optimization when read_only=True (#1865) @farizrahman4u
    • Fixes bug with is_sequence (#1880) @AbhinavTuli
    • [DL-714] Add Ellipsis support for indexing (#1878) @farizrahman4u
    • [DL-645] Fix memory leak in transforms (#1871) @adolkhan
    • [DL-715] Fix wandb integration path issue (#1879) @farizrahman4u
    • Add docstrings for experimental features(#1876) @levongh
    • [DL-693] Disable label sync for dataset copy transform (#1875) @FayazRahman
    • [DL-709] Docker build fix (#1860) @farizrahman4u
    • Improve indra error message in case of missing dependencies (#1873) @farizrahman4u
    • [DL-710] Fix locking issue with deepcopy (#1864) @farizrahman4u

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman, @adolkhan, @davidbuniat, @farizrahman4u and @levongh

  • v2.8.4(Sep 15, 2022)

    🧭 What's Changed

    • Fixes import issue on Python 3.10 (#1867) @adolkhan
    • Big speedup for experimental dataloader initialization (#1869) @AbhinavTuli
    • Adds docstrings for experimental features (#1868) @levongh

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman, @adolkhan, @davidbuniat and @levongh

  • v2.8.3(Sep 14, 2022)

    🧭 What's Changed

    • Fixes type mismatch for expiration(#1858) @levongh
    • Flag to disable wandb integration (#1863) @farizrahman4u
    • Fixes wandb+local datasets (#1861) @hakanardo
    • [DL-668] Make pytorch() work with views (#1855) @farizrahman4u
    • [AL-1949] Make experimental pytorch dataloader consistent with existing implementation (#1853) @AbhinavTuli
    • [DL-650] Better error handling when not passing a tensor name to ds.append (#1817) @adolkhan
    • Update docs URL in readme (#1857) @FayazRahman
    • Speedup conversion of hub storage datasets->deeplake for experimental features (#1856) @levongh
    • [DL-611] New API reference (#1830) @FayazRahman
    • Wandb update: report datasets created with deepcopy (#1848) @farizrahman4u
    • [Bugfix] 1828 raising UserNotLoggedInException when invalid path is provided (#1829) @adolkhan
    • [DL-655] Added min and max length options (#1841) @adolkhan

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman, @adolkhan, @davidbuniat, @farizrahman4u, @hakanardo and @levongh

  • v2.8.1(Sep 9, 2022)

    🧭 What's Changed

    • Ensure that new format for chunk id isn't used for encoders with version <= 2.7.6 (#1850) @AbhinavTuli

    ⚙️ Who Contributes

    @AbhinavTuli and @davidbuniat

  • v2.8.0(Sep 7, 2022)

    🧭 What's Changed

    • Release Candidate 0 for new experimental dataloader and queries (#1819) @AbhinavTuli
    • [AL-1946] Fix delete group + reset bug (#1843) @AbhinavTuli
    • [DL-652] Add append_empty arg to ds.append (#1846) @farizrahman4u
    • Avoid printing syncing labels message when no labels were added (#1845) @FayazRahman
    • [DL-684] Fix ds.reset bug with local datasets (#1842) @FayazRahman
    • Use staging visualizer in tests. Correct dev visualizer url. (#1838) @khustup
    • Changes default chunk id size to 8 bits from 4 bits to reduce possibility of collisions (#1835) @AbhinavTuli
    • wandb integration (#1739) @farizrahman4u

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman, @farizrahman4u and @khustup

  • v2.7.5(Aug 24, 2022)

    🧭 What's Changed

    • [AL-1775] Point Cloud htype (#1685) @adolkhan
    • [AL-1912] Don't allow generic htypes with link (#1824) @AbhinavTuli
    • [Bugfix] Fixes rechunking with hub link + cloud paths (#1825) @AbhinavTuli
    • Enable progressbar for syncing labels (#1820) @FayazRahman
    • [Bug fix] Ensure None/"ENV" isn't added to used_creds_keys for linked data (#1823) @AbhinavTuli

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman and @adolkhan

  • v2.7.4(Aug 15, 2022)

    🧭 What's Changed

    • Fix get_incompatible_dtype bug (#1814) @farizrahman4u
    • [AL-1888] Enable rechunking for text like htypes (#1815) @AbhinavTuli
    • [AL-1858] Treat empty list as None (#1813) @AbhinavTuli
    • Older reporting configurations were not properly handling username (#1806) @zomglings

    ⚙️ Who Contributes

    @AbhinavTuli, @farizrahman4u and @zomglings

  • v2.7.3(Aug 10, 2022)

    🧭 What's Changed

    • [AL-1884] Fixes bug with ds.reset for newly added/deleted tensors (#1797) @AbhinavTuli
    • [DL-618] Appending to class labels with text using multiple workers (#1794) @FayazRahman
    • [AL-1848] New agreements handling (#1796) @AbhinavTuli
    • [DL-590] S3: Always show retry warnings (#1807) @farizrahman4u
    • [DL-620] Prevent saving of dataset views for public datasets when user is not logged in (#1803) @farizrahman4u

    ⚙️ Who Contributes

    @AbhinavTuli, @FayazRahman and @farizrahman4u

  • v2.7.2(Jul 26, 2022)

    🧭 What's Changed

    • [DL-593] Bugout correctly identifying the user's username when tokens are used (#1792) @adolkhan
    • Fix double indexing when saving strided views (#1793) @farizrahman4u

    🚀 New

    • Gcp support for connected datasets (#1736) @ProgerDav

    ⚙️ Who Contributes

    @ProgerDav, @adolkhan, @davidbuniat and @farizrahman4u
