Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era.

Overview


  • Free software: Apache 2.0 license

What is Hangar?

Hangar is based on the belief that too much time is spent collecting, managing, and creating home-brewed version control systems for data. At its core, Hangar is designed to solve many of the same problems faced by traditional code version control systems (e.g. Git), just adapted for numerical data:

  • Time travel through the historical evolution of a dataset
  • Zero-cost Branching to enable exploratory analysis and collaboration
  • Cheap Merging to build datasets over time (with multiple collaborators)
  • Completely abstracted organization and management of data files on disk
  • Ability to only retrieve a small portion of the data (as needed) while still maintaining complete historical record
  • Ability to push and pull changes directly to collaborators or a central server (i.e. a truly distributed version control system)

The ability of version control systems to perform these tasks for codebases is largely taken for granted by almost every developer today; however, we are in fact standing on the shoulders of giants, with decades of engineering resulting in these phenomenally useful tools. Now that a new era of "data-defined software" is taking hold, there is a strong need for analogous version control systems designed to handle numerical data at large scale... Welcome to Hangar!

The Hangar Workflow:

   Checkout Branch
          |
          ▼
 Create/Access Data
          |
          ▼
Add/Remove/Update Samples
          |
          ▼
       Commit
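
A minimal end-to-end sketch of this workflow (column and sample names are illustrative; the column API shown follows the v0.5 examples further down this page):

from hangar import Repository
import numpy as np

repo = Repository(path='/path/to/repo')
repo.init(user_name='Your Name', user_email='you@example.com')

co = repo.checkout(write=True)                  # checkout a branch (staging area)
co.add_ndarray_column('images', prototype=np.zeros((28, 28)))
col = co.columns['images']                      # access the data container
col['sample-0'] = np.random.rand(28, 28)        # add / update a sample
co.commit('add first image sample')             # commit
co.close()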

Log Style Output:

*   5254ec (master) : merge commit combining training updates and new validation samples
|\
| * 650361 (add-validation-data) : Add validation labels and image data in isolated branch
* | 5f15b4 : Add some metadata for later reference and add new training samples received after initial import
|/
*   baddba : Initial commit adding training images and labels

Learn more about what Hangar is all about at https://hangar-py.readthedocs.io/

Installation

Hangar is in an early alpha development release!

pip install hangar

Documentation

https://hangar-py.readthedocs.io/

Development

To run all the tests, run:

tox

Note: to combine the coverage data from all the tox environments, run:

Windows:

set PYTEST_ADDOPTS=--cov-append
tox

Other:

PYTEST_ADDOPTS=--cov-append tox
Comments
  • Dataloaders for PyTorch & Tensorflow

    Motivation and Context

    PyTorch DataLoader for loading data from hangar directly into PyTorch

    If it fixes an open issue, please link to the issue here:

    #13
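
    For context, a hedged sketch of how the resulting loader is used, based on the usage shown in the multiprocess bug report further down this page (aset is an arrayset from an open checkout; train_model is a placeholder):

    from torch.utils.data import DataLoader
    from hangar import make_torch_dataset  # import path is an assumption

    torch_dset = make_torch_dataset(aset, index_range=slice(1, 100))
    # num_workers=0: see the multiprocess-worker bug report further down
    loader = DataLoader(torch_dset, batch_size=16, num_workers=0)
    for batch in loader:
        train_model(batch)  # placeholder training step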

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [ ] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [x] Ready for review
    • [ ] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [ ] Current tests cover modifications made
    • [ ] New tests have been added to the test suite
    • [ ] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [ ] I have added tests to cover my changes.
    • [ ] All new and existing tests passed.
    opened by hhsecond 27
  • API Redesign

    Motivation and Context

    Why is this change required? What problem does it solve?:

    To simplify user interface with arraysets and provide some concept of a dataset as a view across arraysets.

    NOTE: This initial PR is a proof of concept only, and will require extensive discussion before the final design is agreed upon

    If it fixes an open issue, please link to the issue here:

    related to #79 and many conversations on the Hangar Users Slack Channel

    Description

    Describe your changes in detail:

    Added a CheckoutIndexer class which is inherited by ReaderCheckout and WriterCheckout to enable the following API (originally proposed by @lantiga and @elistevens):

    dset = repo.checkout(write=True)
    # get an arrayset of the dataset (i.e. a "column" of the dataset?)
    aset = dset['foo']
    
    # get a specific array from 'foo' (returns a named tuple)
    arr = dset['foo', '1']
    # set it too
    dset['foo', '1'] = arr
    
    # get data from dset (returns a named tuple)
    subarr = dset['foo', '1']
    # and set into it
    dset['foo', '1'] = subarr + 1
    
    # get a sample of a dataset across 'foo' and 'bar' (returns a named tuple)
    sample = dset[('foo', 'bar'), '1']
    
    # get a sample of all arraysets in the checkout (returns a named tuple)
    sample = dset[:, '1']
    sample = dset[..., '1']
    
    # get multiple samples
    sample_ids = ['1', '2', '3']
    batch = dset[('foo', 'bar'), sample_ids]
    batch = dset[:, sample_ids]
    batch = dset[..., sample_ids]
    

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [ ] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [x] Ready for review
    • [ ] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [ ] Current tests cover modifications made
    • [x] New tests have been added to the test suite
    • [ ] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [x] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    opened by rlizzo 14
  • Rename datasets to datacell

    Motivation and Context

    Why is this change required? What problem does it solve?:

    The name Datasets no longer fits as an appropriate description of a container of tensor/array data.

    To make this clearer, datasets has been replaced by the term datacells, since this more accurately describes the ability for a single sample in a dataset to be made up of individual pieces spread across datacells.

    Description

    Describe your changes in detail:

    The new explanation is given in full HERE, but for the sake of brevity, this diagram illustrates the "unique" relationship that a dataset has to samples and datacells:

       A Dataset is thought of as containing Samples, but is actually defined by
        Datacells, which store parts of fully defined Samples in structures
           common across the full aggregation of Samples in the Dataset
    
       _____________________________________
             S1     |    S2    |     S3     |  <------------------------|
       --------------------------------------                           |
           image    |  image   |   image    |  <- Datacell 1  <--|      |
         filename   | filename |  filename  |  <- Datacell 2  <--|-- Dataset
           label    |  label   |   label    |  <- Datacell 3  <--|
         annotation |    -     | annotation |  <- Datacell 4  <--|
    
    
       If a sample does not have a piece of data, lack of info in the Datacell
             makes no difference in any way to the larger picture.
    

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [ ] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [x] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [x] Ready for review
    • [ ] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [x] Current tests cover modifications made
    • [ ] New tests have been added to the test suite
    • [x] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [x] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.

    Please give this a review @lantiga @hhsecond

    Awaiting Review 
    opened by rlizzo 14
  • Arrayset Subsamples

    Motivation and Context

    Why is this change required? What problem does it solve?:

    This is a large PR which started with the motivation of allowing arraysets to contain subsamples under a common key. Though minimal work was needed for the technical implementation (with essentially no changes made to the hangar core record parsing, history traversal, or tensor storage backends), integrating the API into the current model proved difficult, and required some major refactoring of what were previously known as the ArraysetDataReader and ArraysetDataWriter classes.

    Description

    Describe your changes in detail:

    Rather than trying to combine every possible API method needed by flat and nested arrayset access into a Frankenstein monster class, each access convention implements its own API class methods (fully independent from one another). The appropriate constructors are selected based on the contains_subsamples argument in init_arrayset(). The argument is recorded in the schema so the correct type can be identified in subsequent checkouts.

    I'm working on putting together a summary of the API. That will follow shortly.

    At the moment, about half the tests for the new nested sample container are missing, and I need to re-evaluate some implementation details for how backend file handles are dealt with.
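
    A hedged sketch of the nested-subsample access this enables (names are illustrative, and the exact write syntax may differ from the merged implementation):

    import numpy as np

    co = repo.checkout(write=True)
    aset = co.arraysets.init_arrayset(
        'scans', prototype=np.zeros((64, 64)), contains_subsamples=True)

    # subsamples live under a common sample key
    aset['patient-1'] = {'slice-0': np.random.rand(64, 64),
                         'slice-1': np.random.rand(64, 64)}
    sub = aset['patient-1']['slice-0']   # retrieve a single subsample
    co.commit('add nested subsamples')
    co.close()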

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [ ] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [ ] Ready for review
    • [x] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [ ] Current tests cover modifications made
    • [x] New tests have been added to the test suite
    • [x] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [ ] I have added tests to cover my changes.
    • [ ] All new and existing tests passed.
    enhancement WIP Awaiting Review 
    opened by rlizzo 13
  • Hangar Real World Quick Start Tutorial

    Motivation and Context

    Why is this change required? What problem does it solve?:

    New tutorial covering only the basic stuff for version 0.5 release, such as Repository creation and initialization, adding data to columns and committing changes.

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [x] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [x] Ready for review
    • [ ] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [x] Current tests cover modifications made
    • [ ] New tests have been added to the test suite
    • [ ] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [ ] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [ ] I have added tests to cover my changes.
    • [ ] All new and existing tests passed.
    opened by alessiamarcolini 11
  • [BUG REPORT] Multiprocess pytorch dataloaders

    Describe the bug

    The pytorch dataloader cannot currently be run with multiprocess workers:

    >>> torch_dset = make_torch_dataset(aset, index_range=slice(1, 100))
    >>> loader = DataLoader(torch_dset, batch_size=16, num_workers=2)
    >>> for batch in loader:
    ...     train_model(batch)
    Exception: Cannot pickle `hangar.dataloaders.TorchDataset.BatchTuple`
    

    This is because the BatchTuple wrappers passed to hangar.dataloaders.TorchDataset are dynamically defined namedtuple classes, whose definitions are not appropriately scoped for a forked subprocess to introspect their names/contents upon pickling.

    https://github.com/tensorwerk/hangar-py/blob/e2c7a89ccb9ddb379e8a3fa8f20dae20fcfb6345/src/hangar/dataloaders/torchloader.py#L63-L74

    As the structure needs to be dynamically defined based on user arguments, we cannot just place the BatchTuple definition in the main body of the module.
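
    A self-contained illustration of the underlying Python behavior (standard library only, not Hangar code):

    import pickle
    from collections import namedtuple

    def make_wrapper(field_names):
        # Created at call time, so 'BatchTuple' is never bound at module
        # top level and pickle cannot find the class by qualified name.
        return namedtuple('BatchTuple', field_names)

    wrapper = make_wrapper(['image', 'label'])
    sample = wrapper(image=1, label=2)

    try:
        pickle.dumps(sample)
    except pickle.PicklingError as err:
        print('pickling failed:', err)

    # Registering the class under its declared name (option 1 below)
    # makes the lookup succeed:
    globals()['BatchTuple'] = wrapper
    assert pickle.loads(pickle.dumps(sample)) == sample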

    Two possible solutions:

    @lantiga and @hhsecond, let me know what you prefer, or any other solutions you might have.

    1) keep the current return type exactly the same, but add the definition of BatchTuple to globals() before it is passed to TorchDataset

            wrapper = namedtuple('BatchTuple', field_names=field_names)
        else:
            wrapper = namedtuple('BatchTuple', field_names=gasets.arrayset_names, rename=True)
    
        globals()['BatchTuple'] = wrapper
        return TorchDataset(gasets.arrayset_array, gasets.sample_names, wrapper)
    

    This works, but is generally bad practice to manually modify global scope.

    2) return a dict of field_names and tensors instead of a namedtuple

            wrapper = tuple(field_names)
        else:
            wrapper = tuple(gasets.arrayset_names)
    
        return TorchDataset(gasets.arrayset_array, gasets.sample_names, wrapper)
    

    And in TorchDataset replace: https://github.com/tensorwerk/hangar-py/blob/e2c7a89ccb9ddb379e8a3fa8f20dae20fcfb6345/src/hangar/dataloaders/torchloader.py#L135 with

        return dict(zip(self.wrapper, out))
    

    Which still works and does not modify globals(), but changes the output of the function to something "not quite as nice" as a namedtuple.

    Severity

    Select an option:

    • [ ] Data Corruption / Loss of Any Kind
    • [ ] Unexpected Behavior, Exceptions or Error Thrown
    • [x] Performance Bottleneck
    Bug: Priority 3 
    opened by rlizzo 10
  • Plugins revamp

    Motivation and Context

    Revamping the plugin system to make it actually pluggable for different data types etc.

    Description

    • Introducing a new io module. This will help users use the import/export functionality of plugins through the program if they don't want to interact with the low-level hangar APIs
    • We can potentially move other modules like dataset inside the io module
    • Test cases are work in progress
    • Docs are work in progress

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [x] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [ ] Ready for review
    • [x] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [x] Current tests cover modifications made
    • [x] New tests have been added to the test suite
    • [x] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [x] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    Awaiting Review 
    opened by hhsecond 10
  • [BUG REPORT] Commit inside context manager throws RuntimeError

    Describe the bug If we try to commit inside the context manager (before __exit__()), Hangar throws a RuntimeError saying "No changes made in the staging area. Cannot commit.". We should allow the user to commit inside the context manager IMO, but probably with a warning about the performance hit

    Severity

    Select an option:

    • [ ] Data Corruption / Loss of Any Kind
    • [x] Unexpected Behavior, Exceptions or Error Thrown
    • [ ] Performance Bottleneck

    To Reproduce

    import numpy as np
    
    from hangar import Repository
    repo = Repository(path='myhangarrepo')
    repo.init(user_name='Sherin Thomas', user_email='[email protected]', remove_old=True)
    
    # generate data
    data = []
    for i in range(1000):
        data.append(np.random.rand(28, 28))
    data = np.array(data)
    
    co = repo.checkout(write=True)
    data_dset = co.datasets.init_dataset('mnist_data', prototype=data[0])
    co.commit('datasets init')
    co.close()
    co = repo.checkout(write=True)
    data_dset = co.datasets['mnist_data']
    
    with data_dset:
        for i in range(len(data)):
            sample_name = str(i)
            data_dset[sample_name] = data[i]
            co.commit('dataset curation: stage 1')  # this throws error
    co.close()
    

    Expected behavior It should not break the program; instead, it should raise a warning about the performance hit

    Bug: Priority 2 PR In Progress 
    opened by hhsecond 9
  • Dataloaders for PyTorch

    @rlizzo I was thinking of an API like hangar.dataloaders.pytorch (let me know if you have another structure in mind). Basically, the idea is to load data in batches synchronously or asynchronously and enable the features of PyTorch's DataLoader. My plan is to have a Dataset class and a DataLoader class, which is essentially the way PyTorch's data loading works.

    enhancement Resolved 
    opened by hhsecond 9
  • [BUG REPORT] New repo creation is unfriendly

    Describe the bug

    The message HANGAR RUNTIME WARNING: no repository exists at /some/path/__hangar, please use init_repo function makes me think my script is doing something wrong, even though the next thing that I do is call repo.init().

    Additionally, if the path does not exist, there should be an option to have it be created.

    A single default param exists=True flag could handle both cases, creating the directory and suppressing the init warning when set to False.
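
    A hypothetical sketch of that proposal (exists is not an existing Hangar parameter; it is only illustrative):

    from hangar import Repository

    # exists=False would mean "the repo may not exist yet": create the
    # path and suppress the init warning (hypothetical parameter).
    repo = Repository(path='/some/path/__hangar', exists=False)
    repo.init(user_name='me', user_email='me@example.com')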

    Severity

    Select an option:

    • [ ] Data Corruption / Loss of Any Kind
    • [x] Unexpected Behavior, Exceptions or Error Thrown
    • [ ] Performance Bottleneck
    Bug: Priority 2 
    opened by elistevens 8
  • Import export in CLI

    Motivation and Context

    Why is this change required? What problem does it solve?:

    Introducing import-export utility over the command line

    If it fixes an open issue, please link to the issue here:

    This PR won't solve issue #72 completely, but starts with image import and export. Also, the release is going to be experimental and the APIs are subject to change

    Description

    Describe your changes in detail:

    • Moving CLI as a module
    • Introducing import option using click
    • Introducing export option using click
    • Base class for the introduction of plugin system

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [x] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [ ] Ready for review
    • [x] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [ ] Current tests cover modifications made
    • [x] New tests have been added to the test suite
    • [ ] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [x] My code follows the code style of this project.
    • [x] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [x] I have read the CONTRIBUTING document.
    • [x] I have signed (or will sign when prompted) the tensorwork CLA.
    • [ ] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    opened by hhsecond 7
  • WIP Switch To mkdocs

    Motivation and Context

    Why is this change required? What problem does it solve?:

    A better documentation generator.

    Does not work at the moment (unhappy with the API plugin for numpy-style docstrings).

    Types of changes

    What types of changes does your code introduce? Put an x in all the boxes that apply:

    • [x] Documentation update
    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Is this PR ready for review, or a work in progress?

    • [ ] Ready for review
    • [x] Work in progress

    How Has This Been Tested?

    Put an x in the boxes that apply:

    • [ ] Current tests cover modifications made
    • [ ] New tests have been added to the test suite
    • [ ] Modifications were made to existing tests to support these changes
    • [ ] Tests may be needed, but they are not included when the PR was proposed
    • [ ] I don't know. Help!

    Checklist:

    • [ ] My code follows the code style of this project.
    • [ ] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [ ] I have read the CONTRIBUTING document.
    • [ ] I have signed (or will sign when prompted) the tensorwork CLA.
    • [ ] I have added tests to cover my changes.
    • [ ] All new and existing tests passed.
    opened by rlizzo 3
  • [FEATURE REQUEST] Read checkout to be able to read data from staging area

    Is your feature request related to a problem? Please describe. Currently, a read checkout is not able to read data from the staging area. I was wondering what the technical difficulties would be in reading data from the staging area.

    Describe the solution you'd like repo.checkout(stage=True) or something similar could give access to the staging area. I think the major bottleneck to implementing this is the fact that the staging area could change and might even make the data unavailable. Maybe we could have a flag that changes when the data changes; on read from a stage checkout, it could let the reading process know that the data changed (not the specific information, just that it changed) and invalidate the checkout?

    enhancement 
    opened by hhsecond 1
  • [QUESTION & DOCS]: hangar versus DVID?

    Executive Summary How does the approach of Hangar compare with DVID? I am looking at solutions for managing really large datasets and stored ML models. The DVID project seems to be doing something similar to Hangar?

    I don't know enough about devops to be able to determine what kind of solution I could or should choose or what one is buying into when they choose one.

    question documentation 
    opened by kurtsansom 0
  • [BUG REPORT] Diff status always returns CLEAN inside CM

    Describe the bug Diff status always returns CLEAN inside CM

    Severity

    Select an option:

    • [ ] Data Corruption / Loss of Any Kind
    • [x] Unexpected Behavior, Exceptions or Error Thrown
    • [ ] Performance Bottleneck

    To Reproduce

    from hangar import Repository
    import numpy as np
    
    
    repo = Repository('.')
    repo.init(user_name='me', user_email='[email protected]', remove_old=True)
    co = repo.checkout(write=True)
    co.add_ndarray_column('x', prototype=np.array([1]))
    co.commit('added columns')
    co.close()
    
    co = repo.checkout(write=True)
    x = co.columns['x']
    
    
    with x:
        for i in range(10):
            x[i] = np.array([i])
            print(co.diff.status())  # this should return DIRTY but returns CLEAN
    print(co.diff.status())  # this returns DIRTY as expected
    co.commit('adding file')
    
    
    Bug: Awaiting Priority Assignment 
    opened by hhsecond 0
  • [BUG REPORT] Transaction register closes early

    Describe the bug With multiple columns, if we open only a few of them in a context manager and still try to write to other columns inside the context manager, the transaction register is set to False. A sample script to reproduce is given below

    Severity

    Select an option:

    • [ ] Data Corruption / Loss of Any Kind
    • [x] Unexpected Behavior, Exceptions or Error Thrown
    • [ ] Performance Bottleneck

    To Reproduce

    from hangar import Repository
    import numpy as np
    
    
    repo = Repository('.')
    repo.init(user_name='me', user_email='[email protected]', remove_old=True)
    co = repo.checkout(write=True)
    co.add_ndarray_column('x', prototype=np.array([1]))
    co.add_ndarray_column('y', prototype=np.array([1]))
    co.commit('added columns')
    co.close()
    
    co = repo.checkout(write=True)
    x = co.columns['x']
    y = co.columns['y']
    
    
    with x:  # note that we are opening only `x` in the CM
        for i in range(10):
            y[i] = np.array([i])  # but we are trying to update `y` column
            x[i] = np.array([i])
    co.commit('adding file')
    co.close()
    

    Desktop (please complete the following information):

    • OS: Ubuntu 19.10
    • Python: 3.7
    • Hangar: 0.5.1.dev0 (master, at the time of writing)
    Bug: Priority 1 
    opened by hhsecond 2
  • Commit Level Metadata

    Maybe mention that this metadata is commit-level and will not be part of the history.

    Originally posted by @hhsecond in https://github.com/tensorwerk/hangar-py/pull/180

    enhancement 
    opened by rlizzo 0
Releases (v0.5.2)
  • v0.5.2(May 8, 2020)

    v0.5.2 (2020-05-08)

    New Features

    • New column data type supporting arbitrary bytes data. (#198) @rlizzo

    Improvements

    • str typed columns can now accept data containing any unicode code-point. In prior releases, data containing any non-ASCII character could not be written to this column type. (#198) @rlizzo

    Bug Fixes

    • Fixed issue where str and (newly added) bytes column data could not be fetched / pushed between a local client repository and remote server. (#198) @rlizzo
    Source code(tar.gz)
    Source code(zip)
  • v0.5.1(Apr 6, 2020)

  • v0.5.0(Apr 4, 2020)

    v0.5.0 (2020-04-04)

    Improvements

    • Python 3.8 is now fully supported. (#193) @rlizzo
    • Major backend overhaul which defines column layouts and data types in the same interchangeable / extensible manner as storage backends. This will allow rapid development of new layouts and data type support as new use cases are discovered by the community. (#184) @rlizzo
    • Column and backend classes are now fully serializable (pickleable) for read-only checkouts. (#180) @rlizzo
    • Modularized internal structure of API classes to easily allow new column layouts / data types to be added in the future. (#180) @rlizzo
    • Improved type / value checking of manual specification for column backend and backend_options. (#180) @rlizzo
    • Standardized column data access API to follow the python standard library dict methods API (see the sketch after this list). (#180) @rlizzo
    • Memory usage of arrayset checkouts has been reduced by ~70% by using C-structs for allocating sample record locating info. (#179) @rlizzo
    • Read times from the HDF5_00 and HDF5_01 backend have been reduced by 33-38% (or more for arraysets with many samples) by eliminating redundant computation of the chunked storage B-tree. (#179) @rlizzo
    • Commit times and checkout times have been reduced by 11-18% by optimizing record parsing and memory allocation. (#179) @rlizzo
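
    A hedged sketch of the dict-style column access (assumes a repository with an ndarray column named 'x' already committed; column access follows the bug-report examples above):

    from hangar import Repository

    repo = Repository('path/to/repo')
    co = repo.checkout()               # read-only checkout
    col = co.columns['x']

    print(len(col))                    # number of samples
    print('0' in col)                  # membership test, dict-style
    for key in col.keys():             # iterate sample keys
        arr = col[key]
    missing = col.get('no-such-key')   # returns None, like dict.get()
    co.close()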

    New Features

    • Added str type column with same behavior as ndarray column (supporting both single-level and nested layouts) added to replace functionality of removed metadata container. (#184) @rlizzo
    • New backend based on LMDB has been added (specifier of lmdb_30). (#184) @rlizzo
    • Added .diff() method to Repository class to enable diffing changes between any pair of commits / branches without needing to open the diff base in a checkout (usage sketched after this list). (#183) @rlizzo
    • New CLI command hangar diff which reports a summary view of changes made between any pair of commits / branches. (#183) @rlizzo
    • Added .log() method to Checkout objects so graphical commit graph or machine readable commit details / DAG can be queried when operating on a particular commit. (#183) @rlizzo
    • "string" type columns now supported alongside "ndarray" column type. (#180) @rlizzo
    • New "column" API, which replaces "arrayset" name. (#180) @rlizzo
    • Arraysets can now contain "nested subsamples" under a common sample key. (#179) @rlizzo
    • New API to add and remove samples from an arrayset. (#179) @rlizzo
    • Added repo.size_nbytes and repo.size_human to report disk usage of a repository on disk. (#174) @rlizzo
    • Added method to traverse the entire repository history and cryptographically verify integrity. (#173) @rlizzo
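
    A hedged sketch of the new diff and log conveniences (argument forms are assumptions based on the descriptions above):

    from hangar import Repository

    repo = Repository('path/to/repo')
    # diff two branches without checking out the diff base
    branch_diff = repo.diff('master', 'add-validation-data')

    co = repo.checkout()
    co.log()      # graphical commit graph from this checkout's head
    co.close()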

    Changes

    • Argument syntax of __getitem__() and get() methods of ReaderCheckout and WriterCheckout classes. The new format supports handling arbitrary arguments specific to retrieval of data from any column type. (#183) @rlizzo

    Removed

    • metadata container for str typed data has been completely removed. It is replaced by a highly extensible and much more user-friendly str typed column. (#184) @rlizzo
    • __setitem__() method in WriterCheckout objects. Writing data to columns via a checkout object is no longer supported. (#183) @rlizzo

    Bug Fixes

    • Backend data stores no longer use file symlinks, improving compatibility with some types of file systems. (#171) @rlizzo
    • All arrayset types ("flat" and "nested subsamples") and backend readers can now be pickled -- for parallel processing -- in a read-only checkout. (#179) @rlizzo

    Breaking changes

    • New backend record serialization format is incompatible with repositories written in version 0.4 or earlier.
    • New arrayset API is incompatible with Hangar API in version 0.4 or earlier.
    Source code(tar.gz)
    Source code(zip)
  • v0.5.0dev3(Apr 4, 2020)

  • v0.5.0dev2(Apr 4, 2020)

  • v0.4.0(Nov 26, 2019)

    Release Notes

    New Features

    • Added ability to delete branch names/pointers from a local repository via both API and CLI. #128 @rlizzo
    • Added local keyword arg to arrayset key/value iterators to return only locally available samples #131 @rlizzo
    • Ability to change the backend storage format and options applied to an arrayset after initialization. #133 @rlizzo
    • Added blosc compression to HDF5 backend by default on PyPi installations. #146 @rlizzo
    • Added Benchmarking Suite to Test for Performance Regressions in PRs. #155 @rlizzo
    • Added new backend optimized to increase speeds for fixed size arrayset access. #160 @rlizzo

    Improvements

    • Removed msgpack and pyyaml dependencies. Cleaned up and improved remote client/server code. #130 @rlizzo
    • Multiprocess Torch DataLoaders allowed on Linux and MacOS. #144 @rlizzo
    • Added CLI options commit, checkout, arrayset create, & arrayset remove. #150 @rlizzo
    • Plugin system revamp. #134 @hhsecond
    • Documentation Improvements and Typo-Fixes. #156 @alessiamarcolini
    • Removed implicit removal of arrayset schema from checkout if every sample was removed from arrayset. This could potentially result in dangling accessors which may or may not self-destruct (as expected) in certain edge-cases. #159 @rlizzo
    • Added type codes to hash digests so that calculation function can be updated in the future without breaking repos written in previous Hangar versions. #165 @rlizzo

    Bug Fixes

    • Programmatic access to repository log contents now returns branch heads alongside other log info. #125 @rlizzo
    • Fixed minor bug in types of values allowed for Arrayset names vs Sample names. #151 @rlizzo
    • Fixed issue where using checkout object to access a sample in multiple arraysets would try to create a namedtuple instance with invalid field names. Now incompatible field names are automatically renamed with their positional index. #161 @rlizzo
    • Explicitly raise error if commit argument is set while checking out a repository with write=True. #166 @rlizzo

    Breaking changes

    • New commit reference serialization format is incompatible with repositories written in version 0.3.0 or earlier.
    Source code(tar.gz)
    Source code(zip)
  • v0.4.0b0(Oct 19, 2019)

  • v0.3.0(Sep 10, 2019)

    New Features

    • API addition allowing reading and writing arrayset data from a checkout object directly. (#115) @rlizzo
    • Data importer, exporters, and viewers via CLI for common file formats. Includes plugin system for easy extensibility in the future. (#103) (@rlizzo, @hhsecond)

    Improvements

    • Added tutorial on working with remote data. (#113) @rlizzo
    • Added Tutorial on Tensorflow and PyTorch Dataloaders. (#117) @hhsecond
    • Large performance improvement to diff/merge algorithm (~30x previous). (#112) @rlizzo
    • New commit hash algorithm which is much more reproducible in the long term. (#120) @rlizzo
    • HDF5 backend updated to increase speed of reading/writing variable sized dataset compressed chunks (#120) @rlizzo

    Bug Fixes

    • Fixed ML Dataloaders errors for a number of edge cases surrounding partial-remote data and non-common keys. (#110) (@hhsecond, @rlizzo)

    Breaking changes

    • New commit hash algorithm is incompatible with repositories written in version 0.2.0 or earlier
    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Aug 9, 2019)

    See changelog for full details

    New Features

    • Numpy memory-mapped array file backend added.
    • Remote server data backend added.
    • Selection heuristics to determine appropriate backend from arrayset schema.
    • Partial remote clones and fetch operations now fully supported.
    • CLI has been placed under test coverage, added interface usage to docs.
    • TensorFlow and PyTorch Machine Learning Dataloader Methods (Experimental Release).

    Improvements

    • Record format versioning and standardization so as to not break backwards compatibility in the future.
    • Backend addition and update developer protocols and documentation.
    • Read-only checkout arrayset sample get methods now are multithread and multiprocess safe.
    • Read-only checkout metadata sample get methods are thread safe if used within a context manager.
    • Samples can be assigned integer names in addition to string names.
    • Forgetting to close a write-enabled checkout before terminating the python process will close the checkout automatically for many situations.
    • Repository software version compatibility methods added to ensure upgrade paths in the future.
    • Many tests added (including support for Mac OSX on Travis-CI).

    Bug Fixes

    • Diff results for fast forward merges now returns sensible results.
    • Many type annotations added, and developer documentation improved.

    Breaking changes

    • Renamed all references to datasets in the API / world-view to arraysets.
    • These are backwards incompatible changes. For all versions > 0.2, repository upgrade utilities will be provided if breaking changes occur.
    Source code(tar.gz)
    Source code(zip)
  • v0.1.1(May 24, 2019)

  • v0.1.0(May 24, 2019)

    New Features

    • Remote client-server config negotiation and administrator permissions (#10) @rlizzo
    • Allow single python process to access multiple repositories simultaneously (#20) @rlizzo
    • Fast-Forward and 3-Way Merge and Diff methods now fully supported and behaving as expected (#32) @rlizzo

    Improvements

    • Initial test-case specification (#14) @hhsecond
    • Checkout test-case work (#25) @hhsecond
    • Metadata test-case work (#27) @hhsecond
    • Any potential failure cases raise exceptions instead of silently returning (#16) @rlizzo
    • Many usability improvements in a variety of commits

    Bug Fixes

    • Ensure references to checkout dataset or metadata objects cannot operate after the checkout is closed. (#41) @rlizzo
    • Sensible exception classes and error messages raised on a variety of situations (Many commits) @hhsecond & @rlizzo
    • Many minor issues addressed.

    API Additions

    • Refer to API documentation (#23)

    Breaking changes

    • All repositories written with previous versions of Hangar are liable to break when using this version. Please upgrade versions immediately.
    Source code(tar.gz)
    Source code(zip)