Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era.

Overview


  • Free software: Apache 2.0 license

What is Hangar?

Hangar is based on the belief that too much time is spent collecting, managing, and creating home-brewed version control systems for data. At its core, Hangar is designed to solve many of the same problems faced by traditional code version control systems (e.g. Git), just adapted for numerical data:

  • Time travel through the historical evolution of a dataset
  • Zero-cost Branching to enable exploratory analysis and collaboration
  • Cheap Merging to build datasets over time (with multiple collaborators)
  • Completely abstracted organization and management of data files on disk
  • Ability to only retrieve a small portion of the data (as needed) while still maintaining complete historical record
  • Ability to push and pull changes directly to collaborators or a central server (i.e. a truly distributed version control system)

The ability of version control systems to perform these tasks for codebases is largely taken for granted by almost every developer today; however, we are in fact standing on the shoulders of giants, with decades of engineering resulting in these phenomenally useful tools. Now that a new era of "data-defined software" is taking hold, there is a strong need for analogous version control systems designed to handle numerical data at large scale... Welcome to Hangar!

The Hangar Workflow:

   Checkout Branch
          |
          ▼
 Create/Access Data
          |
          ▼
Add/Remove/Update Samples
          |
          ▼
       Commit
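
A minimal end-to-end sketch of this workflow (column and sample names are illustrative; the column API shown follows the v0.5 examples further down this page):

from hangar import Repository
import numpy as np

repo = Repository(path='/path/to/repo')
repo.init(user_name='Your Name', user_email='you@example.com')

co = repo.checkout(write=True)                  # checkout a branch (staging area)
co.add_ndarray_column('images', prototype=np.zeros((28, 28)))
col = co.columns['images']                      # access the data container
col['sample-0'] = np.random.rand(28, 28)        # add / update a sample
co.commit('add first image sample')             # commit
co.close()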

Log Style Output:

*   5254ec (master) : merge commit combining training updates and new validation samples
|\
| * 650361 (add-validation-data) : Add validation labels and image data in isolated branch
* | 5f15b4 : Add some metadata for later reference and add new training samples received after initial import
|/
*   baddba : Initial commit adding training images and labels

Learn more about what Hangar is all about at https://hangar-py.readthedocs.io/

Installation

Hangar is in an early alpha development release!

pip install hangar

Documentation

https://hangar-py.readthedocs.io/

Development

To run all the tests, run:

tox

Note: to combine the coverage data from all the tox environments, run:

Windows:

set PYTEST_ADDOPTS=--cov-append
tox

Other:

PYTEST_ADDOPTS=--cov-append tox
Comments
  • Dataloaders for PyTorch & Tensorflow

    Motivation and Context

    PyTorch DataLoader for loading data from hangar directly into PyTorch

    If it fixes an open issue, please link to the issue here:

    #13
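
    For context, a hedged sketch of how the resulting loader is used, based on the usage shown in the multiprocess bug report further down this page (aset is an arrayset from an open checkout; train_model is a placeholder):

    from torch.utils.data import DataLoader
    from hangar import make_torch_dataset  # import path is an assumption

    torch_dset = make_torch_dataset(aset, index_range=slice(1, 100))
    # num_workers=0: see the multiprocess-worker bug report further down
    loader = DataLoader(torch_dset, batch_size=16, num_workers=0)
    for batch in loader:
        train_model(batch)  # placeholder training step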

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [ ] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [x] Ready for review
    • [ ] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [ ] Current tests cover modifications made
    • [ ] New tests have been added to the test suite
    • [ ] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [ ] I have added tests to cover my changes.
    • [ ] All new and existing tests passed.
    opened by hhsecond 27
  • API Redesign

    Motivation and Context

    Why is this change required? What problem does it solve?:

    To simplify user interface with arraysets and provide some concept of a dataset as a view across arraysets.

    NOTE: This initial PR is a proof of concept only, and will require extensive discussion before the final design is agreed upon

    If it fixes an open issue, please link to the issue here:

    related to #79 and many conversations on the Hangar Users Slack Channel

    Description

    Describe your changes in detail:

    Added a CheckoutIndexer class which is inherited by ReaderCheckout and WriterCheckout to enable the following API (originally proposed by @lantiga and @elistevens):

    dset = repo.checkout(write=True)
    # get an arrayset of the dataset (i.e. a "column" of the dataset?)
    aset = dset['foo']
    
    # get a specific array from 'foo' (returns a named tuple)
    arr = dset['foo', '1']
    # set it too
    dset['foo', '1'] = arr
    
    # get data from dset (returns a named tuple)
    subarr = dset['foo', '1']
    # and set into it
    dset['foo', '1'] = subarr + 1
    
    # get a sample of a dataset across 'foo' and 'bar' (returns a named tuple)
    sample = dset[('foo', 'bar'), '1']
    
    # get a sample of all arraysets in the checkout (returns a named tuple)
    sample = dset[:, '1']
    sample = dset[..., '1']
    
    # get multiple samples
    sample_ids = ['1', '2', '3']
    batch = dset[('foo', 'bar'), sample_ids]
    batch = dset[:, sample_ids]
    batch = dset[..., sample_ids]
    

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [ ] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [x] Ready for review
    • [ ] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [ ] Current tests cover modifications made
    • [x] New tests have been added to the test suite
    • [ ] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [x] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    opened by rlizzo 14
  • Rename datasets to datacell

    Motivation and Context

    Why is this change required? What problem does it solve?:

    The name Datasets no longer fits as an appropriate description of a container of tensor/array data.

    To make this clearer, datasets has been replaced by the term datacells, since this more accurately describes the ability for a single sample in a dataset to be made up of individual pieces spread across datacells.

    Description

    Describe your changes in detail:

    The new explanation is given in full HERE, but for the sake of brevity, this diagram illustrates the "unique" relationship that a dataset has to samples and datacells:

       A Dataset is thought of as containing Samples, but is actually defined by
        Datacells, which store parts of fully defined Samples in structures
           common across the full aggregation of Samples in the Dataset
    
       _____________________________________
             S1     |    S2    |     S3     |  <------------------------|
       --------------------------------------                           |
           image    |  image   |   image    |  <- Datacell 1  <--|      |
         filename   | filename |  filename  |  <- Datacell 2  <--|-- Dataset
           label    |  label   |   label    |  <- Datacell 3  <--|
         annotation |    -     | annotation |  <- Datacell 4  <--|
    
    
       If a sample does not have a piece of data, lack of info in the Datacell
             makes no difference in any way to the larger picture.
    

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [ ] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [x] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [x] Ready for review
    • [ ] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [x] Current tests cover modifications made
    • [ ] New tests have been added to the test suite
    • [x] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [x] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.

    Please give this a review @lantiga @hhsecond

    Awaiting Review 
    opened by rlizzo 14
  • Arrayset Subsamples

    Motivation and Context

    Why is this change required? What problem does it solve?:

    This is a large PR which started with the motivation of allowing arraysets to contain subsamples under a common key. Though minimal work was needed for the technical implementation (with essentially no changes made to the hangar core record parsing, history traversal, or tensor storage backends), integrating the API into the current model proved difficult, and required some major refactoring of what were previously known as the ArraysetDataReader and ArraysetDataWriter classes.

    Description

    Describe your changes in detail:

    Rather than trying to combine every possible API method needed by flat and nested arrayset access into a Frankenstein monster class, each access convention implements its own API class methods (fully independent from one another). The appropriate constructors are selected based on the contains_subsamples argument in init_arrayset(). The argument is recorded in the schema so the correct type can be identified in subsequent checkouts.

    I'm working on putting together a summary of the API. That will follow shortly.

    At the moment, about half the tests for the new nested sample container are missing, and I need to re-evaluate some implementation details for how backend file handles are dealt with.
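
    A hedged sketch of the nested-subsample access this enables (names are illustrative, and the exact write syntax may differ from the merged implementation):

    import numpy as np

    co = repo.checkout(write=True)
    aset = co.arraysets.init_arrayset(
        'scans', prototype=np.zeros((64, 64)), contains_subsamples=True)

    # subsamples live under a common sample key
    aset['patient-1'] = {'slice-0': np.random.rand(64, 64),
                         'slice-1': np.random.rand(64, 64)}
    sub = aset['patient-1']['slice-0']   # retrieve a single subsample
    co.commit('add nested subsamples')
    co.close()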

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [ ] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [ ] Ready for review
    • [x] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [ ] Current tests cover modifications made
    • [x] New tests have been added to the test suite
    • [x] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [ ] I have added tests to cover my changes.
    • [ ] All new and existing tests passed.
    enhancement WIP Awaiting Review 
    opened by rlizzo 13
  • Hangar Real World Quick Start Tutorial

    Motivation and Context

    Why is this change required? What problem does it solve?:

    New tutorial covering only the basic stuff for version 0.5 release, such as Repository creation and initialization, adding data to columns and committing changes.

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [x] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [x] Ready for review
    • [ ] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [x] Current tests cover modifications made
    • [ ] New tests have been added to the test suite
    • [ ] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [ ] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [ ] I have added tests to cover my changes.
    • [ ] All new and existing tests passed.
    opened by alessiamarcolini 11
  • [BUG REPORT] Multiprocess pytorch dataloaders

    Describe the bug

    The pytorch dataloader cannot currently be run with multiprocess workers:

    >>> torch_dset = make_torch_dataset(aset, index_range=slice(1, 100))
    >>> loader = DataLoader(torch_dset, batch_size=16, num_workers=2)
    >>> for batch in loader:
    ...     train_model(batch)
    Exception: Cannot pickle `hangar.dataloaders.TorchDataset.BatchTuple`
    

    This is because the BatchTuple wrappers passed to hangar.dataloaders.TorchDataset are dynamically defined namedtuple classes, whose definitions are not appropriately scoped for a forked subprocess to introspect their names/contents upon pickling.

    https://github.com/tensorwerk/hangar-py/blob/e2c7a89ccb9ddb379e8a3fa8f20dae20fcfb6345/src/hangar/dataloaders/torchloader.py#L63-L74

    As the structure needs to be dynamically defined based on user arguments, we cannot just place the BatchTuple definition in the main body of the module.
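
    A self-contained illustration of the underlying Python behavior (standard library only, not Hangar code):

    import pickle
    from collections import namedtuple

    def make_wrapper(field_names):
        # Created at call time, so 'BatchTuple' is never bound at module
        # top level and pickle cannot find the class by qualified name.
        return namedtuple('BatchTuple', field_names)

    wrapper = make_wrapper(['image', 'label'])
    sample = wrapper(image=1, label=2)

    try:
        pickle.dumps(sample)
    except pickle.PicklingError as err:
        print('pickling failed:', err)

    # Registering the class under its declared name (option 1 below)
    # makes the lookup succeed:
    globals()['BatchTuple'] = wrapper
    assert pickle.loads(pickle.dumps(sample)) == sample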

    Two possible solutions:

    @lantiga and @hhsecond, let me know what you prefer, or any other solutions you might have.

    1) keep the current return type exactly the same, but add the definition of BatchTuple to globals() before it is passed to TorchDataset

            wrapper = namedtuple('BatchTuple', field_names=field_names)
        else:
            wrapper = namedtuple('BatchTuple', field_names=gasets.arrayset_names, rename=True)
    
        globals()['BatchTuple'] = wrapper
        return TorchDataset(gasets.arrayset_array, gasets.sample_names, wrapper)
    

    This works, but is generally bad practice to manually modify global scope.

    2) return a dict of field_names and tensors instead of a namedtuple

            wrapper = tuple(field_names)
        else:
            wrapper = tuple(gasets.arrayset_names)
    
        return TorchDataset(gasets.arrayset_array, gasets.sample_names, wrapper)
    

    And in TorchDataset replace: https://github.com/tensorwerk/hangar-py/blob/e2c7a89ccb9ddb379e8a3fa8f20dae20fcfb6345/src/hangar/dataloaders/torchloader.py#L135 with

        return dict(zip(self.wrapper, out))
    

    Which still works and does not modify globals(), but changes the output of the function to something "not quite as nice" as a namedtuple.

    Severity

    Select an option:

    • [ ] Data Corruption / Loss of Any Kind
    • [ ] Unexpected Behavior, Exceptions or Error Thrown
    • [x] Performance Bottleneck
    Bug: Priority 3 
    opened by rlizzo 10
  • Plugins revamp

    Motivation and Context

    Revamping the plugin system to make it actually pluggable for different data types etc.

    Description

    • Introducing a new io module. This will help users use the import/export functionality of plugins through the program if they don't want to interact with the low-level hangar APIs
    • We can potentially move other modules like dataset inside the io module
    • Test cases are work in progress
    • Docs are work in progress

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [x] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [ ] Ready for review
    • [x] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [x] Current tests cover modifications made
    • [x] New tests have been added to the test suite
    • [x] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [x] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    Awaiting Review 
    opened by hhsecond 10
  • [BUG REPORT] Commit inside context manager throws RuntimeError

    Describe the bug If we try to commit inside the context manager (before __exit__()), Hangar throws a RuntimeError saying "No changes made in the staging area. Cannot commit.". We should allow the user to commit inside the context manager IMO, but probably with a warning about the performance hit

    Severity

    Select an option:

    • [ ] Data Corruption / Loss of Any Kind
    • [x] Unexpected Behavior, Exceptions or Error Thrown
    • [ ] Performance Bottleneck

    To Reproduce

    import numpy as np
    
    from hangar import Repository
    repo = Repository(path='myhangarrepo')
    repo.init(user_name='Sherin Thomas', user_email='[email protected]', remove_old=True)
    
    # generate data
    data = []
    for i in range(1000):
        data.append(np.random.rand(28, 28))
    data = np.array(data)
    
    co = repo.checkout(write=True)
    data_dset = co.datasets.init_dataset('mnist_data', prototype=data[0])
    co.commit('datasets init')
    co.close()
    co = repo.checkout(write=True)
    data_dset = co.datasets['mnist_data']
    
    with data_dset:
        for i in range(len(data)):
            sample_name = str(i)
            data_dset[sample_name] = data[i]
            co.commit('dataset curation: stage 1')  # this throws error
    co.close()
    

    Expected behavior It should not break the program; instead, it should raise a warning about the performance hit

    Bug: Priority 2 PR In Progress 
    opened by hhsecond 9
  • Dataloaders for PyTorch

    @rlizzo I was thinking of an API like hangar.dataloaders.pytorch (let me know if you have another structure in mind). Basically, the idea is to load data in batches synchronously or asynchronously and enable the features of PyTorch's DataLoader. My plan is to have a Dataset class and a DataLoader class, which is essentially the way PyTorch's data loading works.

    enhancement Resolved 
    opened by hhsecond 9
  • [BUG REPORT] New repo creation is unfriendly

    Describe the bug

    The message HANGAR RUNTIME WARNING: no repository exists at /some/path/__hangar, please use init_repo function makes me think my script is doing something wrong, even though the next thing that I do is call repo.init().

    Additionally, if the path does not exist, there should be an option to have it be created.

    A single default param exists=True flag could handle both cases, creating the directory and suppressing the init warning when set to False.
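
    A hypothetical sketch of that proposal (exists is not an existing Hangar parameter; it is only illustrative):

    from hangar import Repository

    # exists=False would mean "the repo may not exist yet": create the
    # path and suppress the init warning (hypothetical parameter).
    repo = Repository(path='/some/path/__hangar', exists=False)
    repo.init(user_name='me', user_email='me@example.com')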

    Severity

    Select an option:

    • [ ] Data Corruption / Loss of Any Kind
    • [x] Unexpected Behavior, Exceptions or Error Thrown
    • [ ] Performance Bottleneck
    Bug: Priority 2 
    opened by elistevens 8
  • Import export in CLI

    Motivation and Context

    Why is this change required? What problem does it solve?:

    Introducing import-export utility over the command line

    If it fixes an open issue, please link to the issue here:

    This PR won't solve issue #72 completely, but starts with image import and export. Also, the release is going to be experimental and the APIs are subject to change

    Description

    Describe your changes in detail:

    • Moving CLI as a module
    • Introducing import option using click
    • Introducing export option using click
    • Base class for the introduction of plugin system

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [x] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [ ] Ready for review
    • [x] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [ ] Current tests cover modifications made
    • [x] New tests have been added to the test suite
    • [ ] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [ ] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    opened by hhsecond 7
  • WIP Switch To mkdocs

    Motivation and Context

    Why is this change required? What problem does it solve?:

    A better documentation generator.

    Does not work at the moment (unhappy with the API plugin for numpy-style docstrings).

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [x] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [ ] Ready for review
    • [x] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [ ] Current tests cover modifications made
    • [ ] New tests have been added to the test suite
    • [ ] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [ ] My code follows the code style of this project.
    • [ ] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [ ] I have read the CONTRIBUTING document.
    • [ ] I have signed (or will sign when prompted) the tensorwork CLA.
    • [ ] I have added tests to cover my changes.
    • [ ] All new and existing tests passed.
    opened by rlizzo 3
  • [FEATURE REQUEST] Read checkout to be able to read data from staging area

    Is your feature request related to a problem? Please describe. Currently, a read checkout is not able to read data from the staging area. I was wondering what the technical difficulties would be in reading data from the staging area.

    Describe the solution you'd like repo.checkout(stage=True) or something similar could give access to the staging area. I think the major bottleneck to implementing this is the fact that the staging area could change and might even make the data unavailable. Maybe we could have a flag that changes when the data changes; on read from a stage checkout, it could let the reading process know that the data changed (not the specific information, just that it changed) and invalidate the checkout?

    enhancement 
    opened by hhsecond 1
  • [QUESTION & DOCS]: hangar versus DVID?

    Executive Summary How does the approach of Hangar compare with DVID? I am looking at solutions for managing really large datasets and stored ML models. The DVID project seems to be doing something similar to Hangar?

    I don't know enough about devops to be able to determine what kind of solution I could or should choose or what one is buying into when they choose one.

    question documentation 
    opened by kurtsansom 0
  • [BUG REPORT] Diff status always returns CLEAN inside CM

    Describe the bug Diff status always returns CLEAN inside CM

    Severity

    Select an option:

    • [ ] Data Corruption / Loss of Any Kind
    • [x] Unexpected Behavior, Exceptions or Error Thrown
    • [ ] Performance Bottleneck

    To Reproduce

    from hangar import Repository
    import numpy as np
    
    
    repo = Repository('.')
    repo.init(user_name='me', user_email='[email protected]', remove_old=True)
    co = repo.checkout(write=True)
    co.add_ndarray_column('x', prototype=np.array([1]))
    co.commit('added columns')
    co.close()
    
    co = repo.checkout(write=True)
    x = co.columns['x']
    
    
    with x:
        for i in range(10):
            x[i] = np.array([i])
            print(co.diff.status())  # this should return DIRTY but returns CLEAN
    print(co.diff.status())  # this returns DIRTY as expected
    co.commit('adding file')
    
    
    Bug: Awaiting Priority Assignment 
    opened by hhsecond 0
  • [BUG REPORT] Transaction register closes early

    Describe the bug With multiple columns, if we open only a few of them in a context manager and still try to write to other columns inside the context manager, the transaction register is set to False. A sample script to reproduce is given below

    Severity

    Select an option:

    • [ ] Data Corruption / Loss of Any Kind
    • [x] Unexpected Behavior, Exceptions or Error Thrown
    • [ ] Performance Bottleneck

    To Reproduce

    from hangar import Repository
    import numpy as np
    
    
    repo = Repository('.')
    repo.init(user_name='me', user_email='[email protected]', remove_old=True)
    co = repo.checkout(write=True)
    co.add_ndarray_column('x', prototype=np.array([1]))
    co.add_ndarray_column('y', prototype=np.array([1]))
    co.commit('added columns')
    co.close()
    
    co = repo.checkout(write=True)
    x = co.columns['x']
    y = co.columns['y']
    
    
    with x:  # note that we are opening only `x` in the CM
        for i in range(10):
            y[i] = np.array([i])  # but we are trying to update `y` column
            x[i] = np.array([i])
    co.commit('adding file')
    co.close()
    

    Desktop (please complete the following information):

    • OS: Ubuntu 19.10
    • Python: 3.7
    • Hangar: 0.5.1.dev0 (master, at the time of writing)
    Bug: Priority 1 
    opened by hhsecond 2
  • Commit Level Metadata

    Maybe mention that this metadata is commit-level and will not be part of the history.

    Originally posted by @hhsecond in https://github.com/tensorwerk/hangar-py/pull/180

    enhancement 
    opened by rlizzo 0
Releases (v0.5.2)
  • v0.5.2(May 8, 2020)

    v0.5.2 (2020-05-08)

    New Features

    • New column data type supporting arbitrary bytes data. (#198) @rlizzo

    Improvements

    • str typed columns can now accept data containing any unicode code-point. In prior releases, data containing any non-ASCII character could not be written to this column type. (#198) @rlizzo

    Bug Fixes

    • Fixed issue where str and (newly added) bytes column data could not be fetched / pushed between a local client repository and remote server. (#198) @rlizzo
    Source code(tar.gz)
    Source code(zip)
  • v0.5.1(Apr 6, 2020)

  • v0.5.0(Apr 4, 2020)

    v0.5.0 (2020-04-04)

    Improvements

    • Python 3.8 is now fully supported. (#193) @rlizzo
    • Major backend overhaul which defines column layouts and data types in the same interchangeable / extensible manner as storage backends. This will allow rapid development of new layouts and data type support as new use cases are discovered by the community. (#184) @rlizzo
    • Column and backend classes are now fully serializable (pickleable) for read-only checkouts. (#180) @rlizzo
    • Modularized internal structure of API classes to easily allow new column layouts / data types to be added in the future. (#180) @rlizzo
    • Improved type / value checking of manual specification for column backend and backend_options. (#180) @rlizzo
    • Standardized column data access API to follow the python standard library dict methods API (see the sketch after this list). (#180) @rlizzo
    • Memory usage of arrayset checkouts has been reduced by ~70% by using C-structs for allocating sample record locating info. (#179) @rlizzo
    • Read times from the HDF5_00 and HDF5_01 backend have been reduced by 33-38% (or more for arraysets with many samples) by eliminating redundant computation of the chunked storage B-tree. (#179) @rlizzo
    • Commit times and checkout times have been reduced by 11-18% by optimizing record parsing and memory allocation. (#179) @rlizzo
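
    A hedged sketch of the dict-style column access (assumes a repository with an ndarray column named 'x' already committed; column access follows the bug-report examples above):

    from hangar import Repository

    repo = Repository('path/to/repo')
    co = repo.checkout()               # read-only checkout
    col = co.columns['x']

    print(len(col))                    # number of samples
    print('0' in col)                  # membership test, dict-style
    for key in col.keys():             # iterate sample keys
        arr = col[key]
    missing = col.get('no-such-key')   # returns None, like dict.get()
    co.close()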

    New Features

    • Added str type column with same behavior as ndarray column (supporting both single-level and nested layouts) added to replace functionality of removed metadata container. (#184) @rlizzo
    • New backend based on LMDB has been added (specifier of lmdb_30). (#184) @rlizzo
    • Added .diff() method to Repository class to enable diffing changes between any pair of commits / branches without needing to open the diff base in a checkout (usage sketched after this list). (#183) @rlizzo
    • New CLI command hangar diff which reports a summary view of changes made between any pair of commits / branches. (#183) @rlizzo
    • Added .log() method to Checkout objects so graphical commit graph or machine readable commit details / DAG can be queried when operating on a particular commit. (#183) @rlizzo
    • "string" type columns now supported alongside "ndarray" column type. (#180) @rlizzo
    • New "column" API, which replaces "arrayset" name. (#180) @rlizzo
    • Arraysets can now contain "nested subsamples" under a common sample key. (#179) @rlizzo
    • New API to add and remove samples from an arrayset. (#179) @rlizzo
    • Added repo.size_nbytes and repo.size_human to report disk usage of a repository on disk. (#174) @rlizzo
    • Added method to traverse the entire repository history and cryptographically verify integrity. (#173) @rlizzo
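
    A hedged sketch of the new diff and log conveniences (argument forms are assumptions based on the descriptions above):

    from hangar import Repository

    repo = Repository('path/to/repo')
    # diff two branches without checking out the diff base
    branch_diff = repo.diff('master', 'add-validation-data')

    co = repo.checkout()
    co.log()      # graphical commit graph from this checkout's head
    co.close()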

    Changes

    • Argument syntax of __getitem__() and get() methods of ReaderCheckout and WriterCheckout classes. The new format supports handling arbitrary arguments specific to retrieval of data from any column type. (#183) @rlizzo

    Removed

    • metadata container for str typed data has been completely removed. It is replaced by a highly extensible and much more user-friendly str typed column. (#184) @rlizzo
    • __setitem__() method in WriterCheckout objects. Writing data to columns via a checkout object is no longer supported. (#183) @rlizzo

    Bug Fixes

    • Backend data stores no longer use file symlinks, improving compatibility with some types of file systems. (#171) @rlizzo
    • All arrayset types ("flat" and "nested subsamples") and backend readers can now be pickled -- for parallel processing -- in a read-only checkout. (#179) @rlizzo

    Breaking changes

    • New backend record serialization format is incompatible with repositories written in version 0.4 or earlier.
    • New arrayset API is incompatible with Hangar API in version 0.4 or earlier.
    Source code(tar.gz)
    Source code(zip)
  • v0.5.0dev3(Apr 4, 2020)

  • v0.5.0dev2(Apr 4, 2020)

  • v0.4.0(Nov 26, 2019)

    Release Notes

    New Features

    • Added ability to delete branch names/pointers from a local repository via both API and CLI. #128 @rlizzo
    • Added local keyword arg to arrayset key/value iterators to return only locally available samples #131 @rlizzo
    • Ability to change the backend storage format and options applied to an arrayset after initialization. #133 @rlizzo
    • Added blosc compression to HDF5 backend by default on PyPi installations. #146 @rlizzo
    • Added Benchmarking Suite to Test for Performance Regressions in PRs. #155 @rlizzo
    • Added new backend optimized to increase speeds for fixed size arrayset access. #160 @rlizzo

    Improvements

    • Removed msgpack and pyyaml dependencies. Cleaned up and improved remote client/server code. #130 @rlizzo
    • Multiprocess Torch DataLoaders allowed on Linux and MacOS. #144 @rlizzo
    • Added CLI options commit, checkout, arrayset create, & arrayset remove. #150 @rlizzo
    • Plugin system revamp. #134 @hhsecond
    • Documentation Improvements and Typo-Fixes. #156 @alessiamarcolini
    • Removed implicit removal of arrayset schema from checkout if every sample was removed from arrayset. This could potentially result in dangling accessors which may or may not self-destruct (as expected) in certain edge-cases. #159 @rlizzo
    • Added type codes to hash digests so that calculation function can be updated in the future without breaking repos written in previous Hangar versions. #165 @rlizzo

    Bug Fixes

    • Programmatic access to repository log contents now returns branch heads alongside other log info. #125 @rlizzo
    • Fixed minor bug in types of values allowed for Arrayset names vs Sample names. #151 @rlizzo
    • Fixed issue where using checkout object to access a sample in multiple arraysets would try to create a namedtuple instance with invalid field names. Now incompatible field names are automatically renamed with their positional index. #161 @rlizzo
    • Explicitly raise error if commit argument is set while checking out a repository with write=True. #166 @rlizzo

    Breaking changes

    • New commit reference serialization format is incompatible with repositories written in version 0.3.0 or earlier.
    Source code(tar.gz)
    Source code(zip)
  • v0.4.0b0(Oct 19, 2019)

  • v0.3.0(Sep 10, 2019)

    New Features

    • API addition allowing reading and writing arrayset data from a checkout object directly. (#115) @rlizzo
    • Data importer, exporters, and viewers via CLI for common file formats. Includes plugin system for easy extensibility in the future. (#103) (@rlizzo, @hhsecond)

    Improvements

    • Added tutorial on working with remote data. (#113) @rlizzo
    • Added Tutorial on Tensorflow and PyTorch Dataloaders. (#117) @hhsecond
    • Large performance improvement to diff/merge algorithm (~30x previous). (#112) @rlizzo
    • New commit hash algorithm which is much more reproducible in the long term. (#120) @rlizzo
    • HDF5 backend updated to increase speed of reading/writing variable sized dataset compressed chunks (#120) @rlizzo

    Bug Fixes

    • Fixed ML Dataloaders errors for a number of edge cases surrounding partial-remote data and non-common keys. (#110) (@hhsecond, @rlizzo)

    Breaking changes

    • New commit hash algorithm is incompatible with repositories written in version 0.2.0 or earlier
    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Aug 9, 2019)

    See changelog for full details

    New Features

    • Numpy memory-mapped array file backend added.
    • Remote server data backend added.
    • Selection heuristics to determine appropriate backend from arrayset schema.
    • Partial remote clones and fetch operations now fully supported.
    • CLI has been placed under test coverage, added interface usage to docs.
    • TensorFlow and PyTorch Machine Learning Dataloader Methods (Experimental Release).

    Improvements

    • Record format versioning and standardization so as to not break backwards compatibility in the future.
    • Backend addition and update developer protocols and documentation.
    • Read-only checkout arrayset sample get methods now are multithread and multiprocess safe.
    • Read-only checkout metadata sample get methods are thread safe if used within a context manager.
    • Samples can be assigned integer names in addition to string names.
    • Forgetting to close a write-enabled checkout before terminating the python process will close the checkout automatically for many situations.
    • Repository software version compatibility methods added to ensure upgrade paths in the future.
    • Many tests added (including support for Mac OSX on Travis-CI).

    Bug Fixes

    • Diff results for fast forward merges now returns sensible results.
    • Many type annotations added, and developer documentation improved.

    Breaking changes

    • Renamed all references to datasets in the API / world-view to arraysets.
    • These are backwards incompatible changes. For all versions > 0.2, repository upgrade utilities will be provided if breaking changes occur.
    Source code(tar.gz)
    Source code(zip)
  • v0.1.1(May 24, 2019)

  • v0.1.0(May 24, 2019)

    New Features

    • Remote client-server config negotiation and administrator permissions (#10) @rlizzo
    • Allow single python process to access multiple repositories simultaneously (#20) @rlizzo
    • Fast-Forward and 3-Way Merge and Diff methods now fully supported and behaving as expected (#32) @rlizzo

    Improvements

    • Initial test-case specification (#14) @hhsecond
    • Checkout test-case work (#25) @hhsecond
    • Metadata test-case work (#27) @hhsecond
    • Any potential failure cases raise exceptions instead of silently returning (#16) @rlizzo
    • Many usability improvements in a variety of commits

    Bug Fixes

    • Ensure references to checkout dataset or metadata objects cannot operate after the checkout is closed. (#41) @rlizzo
    • Sensible exception classes and error messages raised on a variety of situations (Many commits) @hhsecond & @rlizzo
    • Many minor issues addressed.

    API Additions

    • Refer to API documentation (#23)

    Breaking changes

    • All repositories written with previous versions of Hangar are liable to break when using this version. Please upgrade versions immediately.
    Source code(tar.gz)
    Source code(zip)