Manage large and heterogeneous data spaces on the file system.

Overview

signac - simple data management

The signac framework helps users manage and scale file-based workflows, facilitating data reuse, sharing, and reproducibility.

It provides a simple and robust data model to create a well-defined indexable storage layout for data and metadata. This makes it easier to operate on large data spaces, streamlines post-processing and analysis and makes data collectively accessible.

Installation

The recommended installation method for signac is through conda or pip. The software is tested for Python 3.6+ and is built for all major platforms.

To install signac via the conda-forge channel, execute:

conda install -c conda-forge signac

To install signac via pip, execute:

pip install signac

Detailed information about alternative installation methods can be found in the documentation.

Quickstart

The framework facilitates a project-based workflow. Set up a new project:

$ mkdir my_project
$ cd my_project
$ signac init MyProject

and access the project handle:

>>> import signac
>>> project = signac.get_project()

Testing

You can test this package by executing:

$ python -m pytest tests/

Acknowledgment

When using signac as part of your work towards a publication, we would really appreciate it if you acknowledged signac appropriately. We have prepared examples on how to do that here. Thank you very much!

The signac framework is a NumFOCUS Affiliated Project.

Comments
  • Added buffering to SyncedCollection

    Description

    Added buffering feature for SyncedCollection. The buffering will be provided by signac.buffered and SyncedCollection.buffered.
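
    The idea behind buffering can be sketched with the standard library alone. The class below is a hypothetical stand-in, not signac's SyncedCollection: the names BufferedJSONDict and write_count are assumptions made for illustration. Every assignment normally triggers a write to disk, while inside the buffered() context the writes are deferred and flushed once on exit.

```python
import json
import os
import tempfile
from contextlib import contextmanager


class BufferedJSONDict:
    """Hypothetical dict-like store; not signac's SyncedCollection."""

    def __init__(self, filename):
        self._filename = filename
        self._data = {}
        self._buffered = False
        self.write_count = 0  # counts disk writes, for illustration only

    def _save(self):
        if self._buffered:
            return  # defer the write until the buffer is flushed
        with open(self._filename, "w") as f:
            json.dump(self._data, f)
        self.write_count += 1

    def __setitem__(self, key, value):
        self._data[key] = value
        self._save()

    def __getitem__(self, key):
        return self._data[key]

    @contextmanager
    def buffered(self):
        """Collect changes in memory and write them once on exit."""
        self._buffered = True
        try:
            yield self
        finally:
            self._buffered = False
            self._save()


with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "doc.json")
    doc = BufferedJSONDict(filename)
    with doc.buffered():
        for i in range(100):
            doc[f"key{i}"] = i
    writes_buffered = doc.write_count  # one write instead of one hundred
    with open(filename) as f:
        on_disk = json.load(f)
```

    The trade-off in this sketch is that other processes do not see changes until the buffer is flushed.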

    Motivation and Context

    Related to #249. This is a continuation of work in PR

    Types of Changes

    • [ ] Documentation update
    • [ ] Bug fix
    • [x] New feature
    • [ ] Breaking change [1]

    [1] The change breaks (or has the potential to break) existing functionality.

    Checklist:

    If necessary:

    • [ ] I have updated the API documentation as part of the package doc-strings.
    • [ ] I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
    • [ ] I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

    Example for a changelog entry: Fix issue with launching rockets to the moon (#101, #212).

    GSoC 
    opened by vishav1771 30
  • Refactor path function handling

    This PR switched from a bug fix to a refactor. See #666 for the bug fix only.

    Original Title: Don't generate views with underspecified path provided by user

    Signac has unexpected behavior when generating a view if the user specifies a custom path that doesn't uniquely specify jobs.

    I would expect signac to link to all jobs matching the description in the view folder.

    What @Nipuli-Gunaratne and I found was that signac just picks one job.

    Description

    We check that the mapping of source -> link is one-to-one in other parts of import_export.py, but we don't check the user-provided path function.

    Two possible solutions I see are:

    1. Generate an error and exit. Suggest the fix in the error message.
    2. Try to fix the problem by adding job ids to the path.

    In this draft PR, I identify two places in the code where the error checking might fit: import_export.py::_make_path_function and linked_view.py::create_linked_view.

    I think it works best in _make_path_function.
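
    The first option (raise an error and suggest a fix) could look roughly like the sketch below. It does not use signac's internals: check_unique_paths, the plain dict of state points, and the placeholder ids are all assumptions made for illustration.

```python
from collections import defaultdict


def check_unique_paths(jobs, path_function):
    """Raise ValueError if path_function maps two jobs to the same path.

    jobs: mapping of job id -> state point dict (a stand-in for real jobs).
    path_function: callable taking (job_id, statepoint) and returning a str.
    """
    links = defaultdict(list)
    for job_id, statepoint in jobs.items():
        links[path_function(job_id, statepoint)].append(job_id)
    duplicates = {path: ids for path, ids in links.items() if len(ids) > 1}
    if duplicates:
        raise ValueError(
            "The path specification maps several jobs to the same path: "
            f"{duplicates}. Consider adding the job id to the path, "
            "for example 'a/{a}/id/{id}'."
        )
    # Every path has exactly one job here, so the mapping is one-to-one.
    return {path: ids[0] for path, ids in links.items()}


# Four hypothetical state points; the ids are placeholders.
jobs = {
    "id1": {"a": 1, "b": 1},
    "id2": {"a": 1, "b": 2},
    "id3": {"a": 2, "b": 1},
    "id4": {"a": 2, "b": 2},
}

try:
    check_unique_paths(jobs, lambda job_id, sp: "a/{a}".format(**sp))
    error = None
except ValueError as exc:
    error = str(exc)

unique = check_unique_paths(
    jobs, lambda job_id, sp: "a/{a}/id/{id}".format(id=job_id, **sp)
)
```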

    Motivation and Context

    If you make this test project

    #init.py
    import signac
    
    project = signac.init_project('Test-view')
    
    jobs = [dict(a=1,b=1),
            dict(a=1,b=2),
            dict(a=2,b=1),
            dict(a=2,b=2)
            ]
    
    
    for j in jobs:
        job = project.open_job(j)
        job.init()
        print(j, job.id)
    

    and generate a view with a user-specified custom path that does not uniquely identify jobs

    signac view test_error "a/{a}"
    

    Signac just picks one of the jobs to link. a=1 gets job 8aacdb17187e6acf2b175d4aa08d7213 (b=2) and not 386b19932c82f3f9749dd6611e846293 (b=1);
    a=2 gets job 5e4d14d82c320bafb2f1286fe486d1f8 (b=1) and not d48f81ad571306570e2eb9fe7920cd3c (b=2).

    Fix that we'd suggest to users OR try to do automatically:

    Remake the path specification as "a/{a}/id/{id}"
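
    The automatic variant of this fix might be sketched as follows. The helper make_unique_path_spec is hypothetical and works on a plain dict of the four jobs listed above rather than on a signac project; it appends "/id/{id}" only when the given specification produces colliding paths.

```python
from collections import Counter


def make_unique_path_spec(statepoints, path_spec):
    """Return path_spec unchanged if it separates all jobs, else append the id.

    statepoints: mapping of job id -> state point dict (plain data,
    not real signac jobs).
    """
    counts = Counter(
        path_spec.format(id=job_id, **sp) for job_id, sp in statepoints.items()
    )
    if all(n == 1 for n in counts.values()):
        return path_spec
    return path_spec.rstrip("/") + "/id/{id}"


statepoints = {
    "8aacdb17187e6acf2b175d4aa08d7213": {"a": 1, "b": 2},
    "386b19932c82f3f9749dd6611e846293": {"a": 1, "b": 1},
    "5e4d14d82c320bafb2f1286fe486d1f8": {"a": 2, "b": 1},
    "d48f81ad571306570e2eb9fe7920cd3c": {"a": 2, "b": 2},
}

fixed_spec = make_unique_path_spec(statepoints, "a/{a}")
paths = sorted(fixed_spec.format(id=i, **sp) for i, sp in statepoints.items())
```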

    Types of Changes

    • [ ] Documentation update
    • [x] Bug fix
    • [ ] New feature
    • [ ] Breaking change [1]

    [1] The change breaks (or has the potential to break) existing functionality.

    Checklist:

    If necessary:

    • [ ] I have updated the API documentation as part of the package doc-strings.
    • [ ] I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
    • [ ] I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.
    refactor 
    opened by cbkerr 24
  • Proposal: Unify dict classes and improve buffering and synchronization

    Tl;dr: We need to improve synchronization and caching logic, and I think the first step is to combine the _SyncedDict, SyncedAttrDict, and JSONDict classes.

    I apologize in advance for the lengthy nature of this issue. This issue will serve as a pseudo-signac Enhancement Proposal, I'll try and document very thoroughly and it can be a test case for the utility of such proposals :)

    In view of our recent push for deprecations and our discussion of reorganizing namespaces and subpackages to prepare for signac 2.0, I'd like to also revisit discussion of the different dict classes. We have various open bugs and features (#234, #196, #239, #238, #198) that are related to improving our synchronization and caching processes. Our synchronization clearly has some holes in it, and in the process of making #239 @bdice has raised concerns about inconsistencies with respect to cache correctness and cache coherence, e.g. the fact that a job that exists and is cached will still exist in the cache after it is deleted (Bradley, feel free to add more information).

    Fixing all of these is a complex problem, in part due to fragmentation in our implementation of various parts of the logic. I'd like to use this issue to broadly discuss the various problems that we need to fix, and we can spawn off related issues as needed once we have more of a plan of attack to address our problems. Planning this development more thoroughly is critical since the bugs that may arise touch on pretty much all critical code paths in signac. I think that a good first step is looking into simplifying the logic associated with our various dictionary classes. That change should make it easier to improve #198 since synchronization will be in one place. After that, I think it will be easier to consider the various levels of caching and properly define the invariants we want to preserve.

    With respect to the various dictionary classes, I think we need to reassess and simplify our hierarchy:

    • The core class is the _SyncedDict class in core/synceddict.py, and I think it should exist in more or less its current form.
    • I understand the logic of separating out the SyncedAttrDict in core/attrdict.py since attribute-based access to a dictionary is technically a feature unrelated to synchronization. However, the class is very minimal, and I think that the benefits of maintaining this level of purity in distinction are outweighed by the increased difficulty users and newer developers have in finding code in the code base. I would like to merge these classes.
    • The JSONDict class in core/jsondict.py is, in my opinion, harder to justify separating from _SyncedDict on a conceptual level. Although in principle one could argue for different types of file backends, in practice we're very tied to JSON. The bigger problem, though, is that in my understanding (please correct me if I'm wrong here) the primary distinction between the two classes is less about the file backend and more about buffering. We always set the parent of SyncedAttrDict, which is what we use for job statepoints, and this ensures that statepoint changes are immediately synced to disk. Conversely, job documents are JSONDict objects, which use buffering. The fact that the _SyncedDict has _load and _save methods that essentially must be implemented by a child class when parent is not set, and that the JSONDict is the only example we have of such a class, suggests that this is a level of abstraction that isn't very helpful and mainly complicates management of the code. At least for now, I would prefer to unify the JSONDict with _SyncedDict; the logic for when we buffer is already governed by the parent, but the logic for how we buffer is governed by the various other functions in jsondict.py. Afterwards, if we see a benefit to separating the choice of file backend, we could recreate JSONDict where the new version of the class would really only implement JSON-specific logic. This change would have the added benefit of unifying statepoints and documents: I don't think it is intuitive design to have the document and the statepoint be two different classes for reasons of buffering, and it makes it substantially more difficult to follow the logic of how the _sp_save_hook works and why it's necessary. Longer term, I would like to refactor the logic for persistence vs. buffering so that the roles of Job and _SyncedDict are more disjoint, but I recognize that there may not be a way to completely decouple the bidirectional link.
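
    To make the proposed merge more concrete, here is a rough sketch of a single class that combines the three roles discussed above: synchronization with a file, attribute-style access, and a JSON backend. UnifiedSyncedDict is a hypothetical name and the code is not taken from signac; it writes through on every change and leaves buffering out entirely.

```python
import json
import os
import tempfile
from collections.abc import MutableMapping


class UnifiedSyncedDict(MutableMapping):
    """Hypothetical merged class: JSON persistence plus attribute access."""

    def __init__(self, filename):
        # Bypass our own __setattr__ when storing internal state.
        object.__setattr__(self, "_filename", filename)

    def _load(self):
        if not os.path.exists(self._filename):
            return {}
        with open(self._filename) as f:
            return json.load(f)

    def _save(self, data):
        with open(self._filename, "w") as f:
            json.dump(data, f)

    # Mapping interface: every access reads from or writes to disk.
    def __getitem__(self, key):
        return self._load()[key]

    def __setitem__(self, key, value):
        data = self._load()
        data[key] = value
        self._save(data)

    def __delitem__(self, key):
        data = self._load()
        del data[key]
        self._save(data)

    def __iter__(self):
        return iter(self._load())

    def __len__(self):
        return len(self._load())

    # Attribute access, the feature that SyncedAttrDict adds.
    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)  # never look up internals on disk
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self[name] = value


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "doc.json")
    doc = UnifiedSyncedDict(path)
    doc.temperature = 1.5  # attribute-style write
    doc["steps"] = 1000    # key-style write
    reopened = dict(UnifiedSyncedDict(path))  # a second handle sees both
```

    Reading the file on every access keeps the sketch short; a real implementation would have to weigh that cost against the cache-coherence concerns raised above.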

    @csadorf @bdice @mikemhenry any commentary on this is welcome, also please tag any other devs who might have enough knowledge of these topics to provide useful feedback.

    enhancement proposal GSoC refactor 
    opened by vyasr 24
  • Lazy statepoint loading

    Description

    Changes behavior of Job to load its statepoint lazily, when opened by id.
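
    The pattern can be illustrated with a small stand-alone class. LazyJob, the loads counter, and the file name statepoint.json are assumptions for this sketch and do not reflect signac's actual Job implementation: constructing the handle from an id reads nothing, and the state point file is read on first access only.

```python
import json
import os
import tempfile


class LazyJob:
    """Hypothetical job handle; not signac's Job class."""

    def __init__(self, workspace, job_id):
        self.id = job_id
        self._manifest = os.path.join(workspace, job_id, "statepoint.json")
        self._statepoint = None  # nothing has been read from disk yet
        self.loads = 0           # counts disk reads, for illustration only

    @property
    def statepoint(self):
        if self._statepoint is None:
            with open(self._manifest) as f:
                self._statepoint = json.load(f)
            self.loads += 1
        return self._statepoint


with tempfile.TemporaryDirectory() as workspace:
    os.makedirs(os.path.join(workspace, "abc123"))
    with open(os.path.join(workspace, "abc123", "statepoint.json"), "w") as f:
        json.dump({"a": 1}, f)

    job = LazyJob(workspace, "abc123")
    loads_after_open = job.loads    # opening by id touches no file
    first = job.statepoint          # first access reads the file
    second = job.statepoint         # later accesses reuse the cached dict
    loads_after_access = job.loads
```

    A side effect of this design is that a missing or corrupted file is reported at first access instead of when the job is opened, which is one way such a change can break existing code.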

    Motivation and Context

    Implementation of #238.

    Types of Changes

    • [ ] Documentation update
    • [ ] Bug fix
    • [x] New feature
    • [x] Breaking change [1]

    [1] The change breaks (or has the potential to break) existing functionality.

    Checklist:

    If necessary:

    • [ ] I have updated the API documentation as part of the package doc-strings.
    • [ ] I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
    • [x] I have updated the changelog.
    enhancement 
    opened by bdice 24
  • Add id property for jobs

    Add id property to Job

    Description

    Adds id as a property of the Job class. I also added a test to ensure that it produces the correct string.
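
    As a minimal illustration of the property pattern, consider the sketch below. The Job class here is a hypothetical stand-in, and keeping a getter named get_id as a deprecated wrapper is an assumption of the sketch, not something this PR states.

```python
import warnings


class Job:
    """Hypothetical minimal job class used only for this illustration."""

    def __init__(self, job_id):
        self._id = job_id

    @property
    def id(self):
        """The job's unique identifier (read-only)."""
        return self._id

    def get_id(self):
        """Getter-style access, kept here as a thin deprecated wrapper."""
        warnings.warn("Use the id property instead.", DeprecationWarning)
        return self.id


job = Job("386b19932c82f3f9749dd6611e846293")
job_id = job.id  # attribute-style access, no parentheses
```

    Because the property defines no setter, assigning to job.id raises an AttributeError, which protects the identifier from accidental changes.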

    Motivation and Context

    Just follows the Python trend of using properties instead of getters and setters.

    Types of Changes

    • [ ] Documentation update
    • [ ] Bug fix
    • [x] New feature
    • [ ] Breaking change [1]

    [1] The change breaks (or has the potential to break) existing functionality.

    Checklist:

    If necessary:

    • [ ] I have updated the API documentation as part of the package doc-strings.
    • [ ] I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
    • [ ] I have updated the changelog.
    enhancement 
    opened by b-butler 23
  • Improve job.data, project.data (H5Store) examples.

    Description

    I created this PR to start addressing this issue: https://github.com/glotzerlab/signac-docs/issues/50

    @klywang Do you have specific suggestions on how to improve this? (I don't think I've "solved" the issue yet, since I haven't provided explicit examples for job.data as requested.) Feel free to edit this PR directly.

    Motivation and Context

    https://github.com/glotzerlab/signac-docs/issues/50

    Types of Changes

    • [x] Documentation update
    • [ ] Bug fix
    • [ ] New feature
    • [ ] Breaking change [1]

    [1] The change breaks (or has the potential to break) existing functionality.

    Checklist:

    If necessary:

    • [ ] I have updated the API documentation as part of the package doc-strings.
    • [ ] I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
    • [ ] I have updated the changelog.
    opened by bdice 22
  • Improve Sync Data Structures

    Description

    This PR is related to #249. In this PR, we are implementing SyncedCollection, SyncedAttrDict, SyncedList, JSONCollection, JSONDict, JSONList.

    Motivation and Context

    This refactor provides support for multiple backends and resolves #196.

    Types of Changes

    • [ ] Documentation update
    • [ ] Bug fix
    • [x] New feature
    • [x] Breaking change [1]

    [1] The change breaks (or has the potential to break) existing functionality.

    Checklist:

    If necessary:

    • [ ] I have updated the API documentation as part of the package doc-strings.
    • [ ] I have created a separate pull request to update the framework documentation on signac-docs and linked it here.
    • [ ] I have updated the changelog and added all related issue and pull request numbers for future reference (if applicable). See example below.

    Example for a changelog entry: Fix issue with launching rockets to the moon (#101, #212).

    GSoC 
    opened by vishav1771 20
  • Optimize `Collection` for internal use in `Project`

    A possible optimization for signac 2.0 (or later) would be to reduce the amount of code we use in the Collection class when calling Project.find_jobs. See: https://github.com/glotzerlab/signac/blob/b29e0485c7998ba0c6d041e9ec15b533334d9b64/signac/contrib/project.py#L717

    We copy a large amount of data used from the Project's internal caches during the construction of a Collection and calling its find method. At the very least, we never use the Collection's ability to interact with data on disk or its ability to automatically generate ids (primary keys) for new records in the context of a signac Project, so we could eliminate some logic there in a cut-down class.

    Originally posted by @bdice in https://github.com/glotzerlab/signac/issues/652#issuecomment-1002303783

    I attempted an optimization in 13f2a8fb205f65d49421f4f4009ee3d78d00f9bf but it was unclear if copy/reference semantics would be correct in the resulting indices outside the context of a signac Project's limited usage of Collection. A smaller class that is designed for the actual use case of signac's Project.find_jobs could act as an internal cache with only the necessary logic (e.g. no file I/O or id generation).
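
    A cut-down class of the kind described here might look like the following sketch. JobIndex and its find method are hypothetical names, and the matching supports only exact key/value equality; it illustrates the "no file I/O, no id generation, no copying" idea and does not reproduce the Collection API.

```python
class JobIndex:
    """Hypothetical cut-down index: in-memory only, ids supplied by caller."""

    def __init__(self, statepoints):
        # Keep a reference to the caller's mapping instead of copying it.
        self._docs = statepoints

    def find(self, query=None):
        """Return ids whose state point matches every key/value in query."""
        query = query or {}
        return [
            job_id
            for job_id, sp in self._docs.items()
            if all(key in sp and sp[key] == value for key, value in query.items())
        ]


# Placeholder cache contents standing in for a project's state point cache.
cache = {
    "id1": {"a": 1, "b": 1},
    "id2": {"a": 1, "b": 2},
    "id3": {"a": 2, "b": 1},
}
index = JobIndex(cache)
matches = index.find({"a": 1})
```

    Holding a reference rather than a copy means the index always reflects the current cache, but it also means the caller must not mutate the cache while iterating over results.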

    enhancement refactor 
    opened by bdice 19
  • Convert all docstrings to numpy style

    As part of our overall docs overhaul (see glotzerlab/signac-docs#64), we want to convert our docstrings to numpy style (as decided in glotzerlab/signac-docs#74). The best automated tool I'm familiar with for this task is pyment. In addition to converting docstrings, it will also generate docstrings for functions, classes, etc. that are missing docstrings entirely. However, the conversion will require significant manual review to ensure that all docstrings are converted correctly.

    enhancement documentation 
    opened by vyasr 19
  • Assigning to nested keys in a job document

    Original report by Bradley Dice (Bitbucket: bdice, GitHub: bdice).


    I would like to use nested keys in a job document. Presently, this does not work as one would expect from "normal" dictionaries. See code snippet below.

    >>> job.document # We start from an empty job document
    {}
    >>> job.document['a']['b'] = 'c' # This will error as expected, since job.document['a'] is unassigned
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/bdice/env/glotzer-software/signac/0.8.5/lib/python/signac-0.8.5-py3.5.egg/signac/core/jsondict.py", line 48, in __getitem__
    KeyError: 'a'
    >>> job.document['a'] = dict() # We create an empty dictionary as the value of key 'a'
    >>> job.document['a']['b'] = 'c' # We attempt to assign a nested key
    >>> job.document # This nested key does not appear to be set
    {'a': {}}
    >>> job.document['a'] = dict(b = 'c') # However, it is possible to set the value to be another dictionary, and this works
    >>> job.document
    {'a': {'b': 'c'}}
    

    After looking through the source code of the JSONDict with @vyasr, it is not immediately clear how to fix this in the source. It would probably involve some customization of the __getitem__ or __setitem__ functions. However, this workaround will function:

    >>> temp_dict = job.document['a']
    >>> temp_dict['b'] = 'c'
    >>> job.document['a'] = temp_dict
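
    One way such a symptom can arise, and one possible direction for a fix, is sketched below. This assumes that the nested value is handed out as a detached copy, which the report does not confirm; CopyingDocument, NestedView, and SyncingDocument are hypothetical classes and not signac code.

```python
class CopyingDocument:
    """Mimics the reported behavior: __getitem__ hands out a copy."""

    def __init__(self):
        self._data = {}

    def __setitem__(self, key, value):
        self._data[key] = value

    def __getitem__(self, key):
        value = self._data[key]
        return dict(value) if isinstance(value, dict) else value


class NestedView(dict):
    """A dict that reports every item assignment back to its parent."""

    def __init__(self, data, parent, key):
        super().__init__(data)
        self._parent, self._key = parent, key

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self._parent[self._key] = dict(self)  # write the change through


class SyncingDocument(CopyingDocument):
    """Returns parent-linked views so nested assignment is not lost."""

    def __getitem__(self, key):
        value = self._data[key]
        if isinstance(value, dict):
            return NestedView(value, self, key)
        return value


broken = CopyingDocument()
broken["a"] = {}
broken["a"]["b"] = "c"  # modifies a temporary copy only
lost = broken["a"]

fixed = SyncingDocument()
fixed["a"] = {}
fixed["a"]["b"] = "c"   # the view writes back to the document
kept = dict(fixed["a"])
```

    The sketch handles one level of nesting and item assignment only; methods such as update or deeper nesting would need the same write-through treatment.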
    

    Version:

    $ signac --version
    signac 0.8.5
    
    bug 
    opened by csadorf 19
  • Use more compact schema for root directory files

    Based on feature usage, the project root directory currently contains a number of special files. Some of them are hidden, some are not.

    • signac_project_document.json
    • signac.rc
    • .signac_sp_cache.json.gz
    • .signac_history.txt

    As suggested by @vyasr we may want to switch to a more compact storage format, for example by bundling all files within a .signac folder.

    enhancement pinned 
    opened by csadorf 18
  • Bump coverage from 6.5.0 to 7.0.1

    Bumps coverage from 6.5.0 to 7.0.1.

    Changelog

    Sourced from coverage's changelog.

    Version 7.0.1 — 2022-12-23

    • When checking if a file mapping resolved to a file that exists, we weren't considering files in .whl files. This is now fixed, closing issue 1511_.

    • File pattern rules were too strict, forbidding plus signs and curly braces in directory and file names. This is now fixed, closing issue 1513_.

    • Unusual Unicode or control characters in source files could prevent reporting. This is now fixed, closing issue 1512_.

    • The PyPy wheel now installs on PyPy 3.7, 3.8, and 3.9, closing issue 1510_.

    .. _issue 1510: nedbat/coveragepy#1510
    .. _issue 1511: nedbat/coveragepy#1511
    .. _issue 1512: nedbat/coveragepy#1512
    .. _issue 1513: nedbat/coveragepy#1513

    .. _changes_7-0-0:

    Version 7.0.0 — 2022-12-18

    Nothing new beyond 7.0.0b1.

    .. _changes_7-0-0b1:

    Version 7.0.0b1 — 2022-12-03

    A number of changes have been made to file path handling, including pattern matching and path remapping with the [paths] setting (see :ref:config_paths). These changes might affect you, and require you to update your settings.

    (This release includes the changes from 6.6.0b1 <changes_6-6-0b1_>_, since 6.6.0 was never released.)

    • Changes to file pattern matching, which might require updating your configuration:

      • Previously, * would incorrectly match directory separators, making precise matching difficult. This is now fixed, closing issue 1407_.

      • Now ** matches any number of nested directories, including none.

    • Improvements to combining data files when using the

    ... (truncated)

    Commits
    • c5cda3a docs: releases take a little bit longer now
    • 9d4226e docs: latest sample HTML report
    • 8c77758 docs: prep for 7.0.1
    • da1b282 fix: also look into .whl files for source
    • d327a70 fix: more information when mapping rules aren't working right.
    • 35e249f fix: certain strange characters caused reporting to fail. #1512
    • 152cdc7 fix: don't forbid plus signs in file names. #1513
    • 31513b4 chore: make upgrade
    • 873b059 test: don't run tests on Windows PyPy-3.9
    • 5c5caa2 build: PyPy wheel now installs on 3.7, 3.8, and 3.9. #1510
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 1
  • Bump pytest-xdist from 3.0.2 to 3.1.0

    Bumps pytest-xdist from 3.0.2 to 3.1.0.

    Changelog

    Sourced from pytest-xdist's changelog.

    pytest-xdist 3.1.0 (2022-12-01)

    Features

    • #789 (https://github.com/pytest-dev/pytest-xdist/issues/789): Users can now set a default distribution mode in their configuration file:

      .. code-block:: ini

      [pytest]
      addopts = --dist loadscope
      
    • #842 (https://github.com/pytest-dev/pytest-xdist/issues/842): Python 3.11 is now officially supported.

    Removals

    • #842 (https://github.com/pytest-dev/pytest-xdist/issues/842): Python 3.6 is no longer supported.
    Commits
    • 92a76bb Release 3.1.0
    • 6226965 Merge pull request #851 from nicoddemus/789-default-dist-mode
    • 7a0bc4c Let users configure dist mode in the configuration file
    • c6bcd20 [pre-commit.ci] pre-commit autoupdate (#849)
    • 99c80c3 Fix typo psutils -> psutil (#848)
    • e14895a [pre-commit.ci] pre-commit autoupdate (#846)
    • bb27210 Merge pull request #844 from pytest-dev/pre-commit-ci-update-config
    • 4a33933 Use ternary operator to remove mypy error
    • 41620d2 [pre-commit.ci] pre-commit autoupdate
    • 6b6f133 Merge pull request #842 from nicoddemus/drop-py36-add-py311
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 1
  • Bump redis from 4.3.5 to 4.4.0

    Bumps redis from 4.3.5 to 4.4.0.

    Release notes

    Sourced from redis's releases.

    Version 4.4.0

    Changes

    4.4.0rc4 release notes 4.4.0rc3 release notes 4.4.0rc2 release notes 4.4.0rc1 release notes

    🚀 New Features (since 4.4.0rc4)

    • Async clusters: Support creating locks inside async functions (#2471)

    🐛 Bug Fixes (since 4.4.0rc4)

    • Async: added 'blocking' argument to call lock method (#2454)
    • Added a replacement for the default cluster node in the event of failure. (#2463)
    • Fixed geosearch: Wrong number of arguments for geosearch command (#2464)

    🧰 Maintenance (since 4.4.0rc4)

    • Updating dev dependencies (#2475)
    • Removing deprecated LGTM (#2473)
    • Added an explicit index name in RediSearch example (#2466)
    • Adding connection step to bloom filter examples (#2478)

    Contributors (since 4.4.0rc4)

    We'd like to thank all the contributors who worked on this release!

    @Sibuken, @barshaul, @chayim, @dvora-h, @nermiller, @uglide and @utkarshgupta137

    4.4.0rc4

    Changes

    🚀 New Features

    • CredentialsProvider class added to support password rotation (#2261)
    • Enable AsyncIO cluster mode lock (#2446)

    🐛 Bug Fixes

    • Failover handling improvements for RedisCluster and Async RedisCluster (#2377)
    • Improved response parsing options handler for special cases (#2302)

    Contributors

    We'd like to thank all the contributors who worked on this release!

    @KMilhan, @barshaul, @dvora-h and @fadida

    4.4.0rc3

    Changes

    ... (truncated)

    dependencies 
    opened by dependabot[bot] 1
  • Bump numpy from 1.23.5 to 1.24.1

    Bumps numpy from 1.23.5 to 1.24.1.

    Release notes

    Sourced from numpy's releases.

    v1.24.1

    NumPy 1.24.1 Release Notes

    NumPy 1.24.1 is a maintenance release that fixes bugs and regressions discovered after the 1.24.0 release. The Python versions supported by this release are 3.8-3.11.

    Contributors

    A total of 12 people contributed to this release. People with a "+" by their names contributed a patch for the first time.

    • Andrew Nelson
    • Ben Greiner +
    • Charles Harris
    • Clément Robert
    • Matteo Raso
    • Matti Picus
    • Melissa Weber Mendonça
    • Miles Cranmer
    • Ralf Gommers
    • Rohit Goswami
    • Sayed Adel
    • Sebastian Berg

    Pull requests merged

    A total of 18 pull requests were merged for this release.

    • #22820: BLD: add workaround in setup.py for newer setuptools
    • #22830: BLD: CIRRUS_TAG redux
    • #22831: DOC: fix a couple typos in 1.23 notes
    • #22832: BUG: Fix refcounting errors found using pytest-leaks
    • #22834: BUG, SIMD: Fix invalid value encountered in several ufuncs
    • #22837: TST: ignore more np.distutils.log imports
    • #22839: BUG: Do not use getdata() in np.ma.masked_invalid
    • #22847: BUG: Ensure correct behavior for rows ending in delimiter in...
    • #22848: BUG, SIMD: Fix the bitmask of the boolean comparison
    • #22857: BLD: Help raspian arm + clang 13 about __builtin_mul_overflow
    • #22858: API: Ensure a full mask is returned for masked_invalid
    • #22866: BUG: Polynomials now copy properly (#22669)
    • #22867: BUG, SIMD: Fix memory overlap in ufunc comparison loops
    • #22868: BUG: Fortify string casts against floating point warnings
    • #22875: TST: Ignore nan-warnings in randomized out tests
    • #22883: MAINT: restore npymath implementations needed for freebsd
    • #22884: BUG: Fix integer overflow in in1d for mixed integer dtypes #22877
    • #22887: BUG: Use whole file for encoding checks with charset_normalizer.

    Checksums

    ... (truncated)

    Commits
    • a28f4f2 Merge pull request #22888 from charris/prepare-1.24.1-release
    • f8fea39 REL: Prepare for the NumPY 1.24.1 release.
    • 6f491e0 Merge pull request #22887 from charris/backport-22872
    • 48f5fe4 BUG: Use whole file for encoding checks with charset_normalizer [f2py] (#22...
    • 0f3484a Merge pull request #22883 from charris/backport-22882
    • 002c60d Merge pull request #22884 from charris/backport-22878
    • 38ef9ce BUG: Fix integer overflow in in1d for mixed integer dtypes #22877 (#22878)
    • bb00c68 MAINT: restore npymath implementations needed for freebsd
    • 64e09c3 Merge pull request #22875 from charris/backport-22869
    • dc7bac6 TST: Ignore nan-warnings in randomized out tests
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 1
  • Bump tables from 3.7.0 to 3.8.0

    Bump tables from 3.7.0 to 3.8.0

    Bumps tables from 3.7.0 to 3.8.0.

    Release notes

    Sourced from tables's releases.

    Release v3.8.0

    Changes from 3.7.0 to 3.8.0

    Improvements

    • Support for Python 3.11 has been added (PR #962).
    • Support for Python 3.6 and Python 3.7 has been dropped (PR #966).
    • Added a new (registered) HDF5 filter for Blosc2 compressor (PR #969).
    • Added optimized paths for Blosc2 reading and writing in tables. This bypasses the HDF5 filter pipeline by building the Blosc2 CFrames and sending them to the HDF5 direct chunking machinery (PR #969).
    • Internal C-Blosc sources updated to 1.21.2.
    • Thanks to Oscar Guiñon and Francesc Alted for implementing the Blosc2 support, and to NumFOCUS for providing a grant for that work.

    Other changes

    • Starting from this release, C source files generated by Cython are no longer included in the source distribution package.
    • Pre-built HTML documentation is no longer included in the source package.

    Commits
    • e34d1f7 Update copyright year
    • f1e9fc3 Add a performance comparison with pandas
    • ce2c32d Getting ready for release 3.8.0
    • 0f28388 Prevent PyPy builds on linux too
    • 53eaed0 Prevent building macos pypy wheels.
    • e241459 Merge pull request #979 from PyTables/tables-3.8.0
    • 0f14177 Continue silencing warnings for recent NumPy (1.24)
    • 5d02ad6 Merge branch 'master' into tables-3.8.0
    • a0fccf2 Silence more warnings for recent NumPy (1.24)
    • 8ea4b98 Silence a warning for recent NumPy
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 1
  • Move to a fully pyproject.toml based build

    Move to a fully pyproject.toml based build

    Description

    This PR removes setup.py and setup.cfg entirely, migrating all project and build configuration into pyproject.toml. In the process, all linter configurations have also been moved into pyproject.toml. The exception is flake8, which does not (and will not) support pyproject.toml, so its configuration is now stored in the linter-specific .flake8 file. bump2version also does not support pyproject.toml (although, unlike flake8, the proposal has not been entirely rejected, so support may eventually arrive), so that configuration has been moved to a project-specific .bumpversion file.

    Motivation and Context

    Various changes to Python packaging over the last 6 or 7 years have moved towards more static packaging and towards storing data in a backend-agnostic format. These changes allow the use of setuptools alternatives (like flit) as well as more reproducible builds based on build isolation into virtual environments that provide all necessary build dependencies. Direct invocation of setup.py has been deprecated in the process. The changes in this PR modernize signac's build system for compatibility with these new approaches.

    opened by vyasr 2
Releases(v1.8.0)
  • v1.8.0(Oct 5, 2022)

    [1.8.0] -- 2022-10-05

    Added

    • Official support for Python 3.10 (#631).
    • Benchmarks can be run using the asv (airspeed velocity) tool (#629).
    • Continuous integration tests run in parallel with pytest-xdist (#705).
    • The Project.path and Job.path properties (#685).

    Changed

    • Schema migration is now performed on directories rather than signac projects and supports a wider range of schemas (#654).
    • Deprecated features now use FutureWarning instead of DeprecationWarning, which is hidden by default (#687, #691, #692).
    • Project names have a default in anticipation of removing names entirely. Project names will be removed in signac 2.0 (#644).
    • Project.workspace is now a property, not a method (#685).
    • Continuous integration uses GitHub Actions instead of CircleCI (#776, #788).
    • Raise errors in testing when DeprecationWarnings or FutureWarnings are raised (#713).
    • Change GitHub PR to check for uncompleted tasks (i.e. unchecked checkboxes) (#686).
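
The switch from DeprecationWarning to FutureWarning matters because Python's default warning filters ignore a DeprecationWarning unless it is triggered by code in __main__, whereas a FutureWarning is shown. The sketch below illustrates this with the standard library only. It does not use signac, and the names fake_lib and old_api are hypothetical. A child interpreter is started so that the default filters apply regardless of how the parent process is configured.

```python
import subprocess
import sys

# Source executed by a fresh interpreter so that Python's default warning
# filters are in effect. "fake_lib" and "old_api" are hypothetical names.
CHILD_SOURCE = '''
import warnings

LIB_SOURCE = """
import warnings

def old_api(category):
    warnings.warn("old_api will be removed", category)
"""

# Execute the "library" in its own namespace so that warnings are attributed
# to a module other than __main__, as they would be for an installed package.
namespace = {"__name__": "fake_lib"}
exec(LIB_SOURCE, namespace)
namespace["old_api"](DeprecationWarning)
namespace["old_api"](FutureWarning)
'''


def visible_warnings():
    """Return everything the child interpreter writes to stderr."""
    # -I (isolated mode) makes the child ignore PYTHONWARNINGS and other
    # PYTHON* environment variables that could change the filters.
    result = subprocess.run(
        [sys.executable, "-I", "-c", CHILD_SOURCE],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stderr


stderr_text = visible_warnings()
print(stderr_text)
```

On a standard (non-debug) Python build, only the FutureWarning should appear in the printed output, which is why a library that wants end users to notice a deprecation may prefer that category.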

    Deprecated

    • Project methods read_statepoints, write_statepoints, and dump_statepoints are deprecated (#579, #197).
    • Project.index method is deprecated (#591, #588).
    • JobSearchIndex class is deprecated (#600).
    • index argument is deprecated in Project methods (#602, #588).
    • signac.cite module is deprecated (#611, #592).
    • The config module and all its methods are deprecated (#675, #753, #814).
    • Accessing Project.workspace as a method is deprecated; it should be accessed as a property (#685).
    • Project.num_jobs (#685).
    • ProjectSchema.__call__, ProjectSchema.detect (#685).

    Fixed

    • H5Store.mode returns the file mode (#607).
    • User-provided path functions now raise an error if not unique (#666).
    • Collection class no longer raises an error when searching by a primary key that does not exist (#676).
    • Relative paths on Windows are not used if the current directory has no common prefix (#777).
    • get_project() now raises an error if provided a root directory that does not exist (#779, #792).
    • Catch internally raised warnings on use of deprecated password cache (#754).
    • Catch KeyError from multithreading error (#710).
    • Tests now properly show raised warnings (#603).

    Removed

    • Removed upper bound of Python 4 on python_requires (#780, #781).
    • Dropped support for Python 3.6 and Python 3.7 (#715) following the recommended support schedules of NEP 29.
    • Dropped dependency on deprecation package (#687, #718).
    • Removed unused _extract utility function to avoid CVE-2007-4559 (#829).
  • v1.7.0(Jun 8, 2021)

    This release adds SyncedCollections, a new, performant, and flexible approach to syncing job state points and documents with an underlying resource. Thanks to all who contributed! 🎨

    Added

    • New SyncedCollection class and subclasses to replace JSONDict with more general support for different types of resources (such as MongoDB collections or Redis databases) and more complete support for different data types synchronized with files (#196, #234, #249, #316, #383, #397, #465, #484, #529, #530). This change introduces a minor backwards-incompatible change: for users making direct use of signac buffering, the force_write parameter is no longer respected. If the argument is passed, a warning will now be raised to indicate that it is ignored and will be removed in signac 2.0.
    • Unified querying for state point and document filters using 'sp' and 'doc' as prefixes (#332, #514). This change introduces a minor backwards-incompatible change to the Collection index schema ('statepoint'->'sp'), but this does not affect any APIs, only indexes saved to file using a previous version of signac. Indexing APIs will be removed in signac 2.0.
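
The idea of namespaced filters can be illustrated with a small, self-contained sketch. This is not signac's implementation: the helpers split_filter and matches are hypothetical, jobs are modeled as plain dictionaries, only flat keys are handled, and treating unprefixed keys as state point keys is an assumption made for this example.

```python
def split_filter(filter_):
    """Split a namespaced filter into state point and document parts."""
    parts = {"sp": {}, "doc": {}}
    for key, value in filter_.items():
        prefix, sep, rest = key.partition(".")
        if sep and prefix in parts:
            # "sp.a" -> state point key "a"; "doc.done" -> document key "done".
            parts[prefix][rest] = value
        else:
            # Assumption for this sketch: unprefixed keys refer to the state point.
            parts["sp"][key] = value
    return parts["sp"], parts["doc"]


def matches(job, filter_):
    """Return True if the job-like dict satisfies every key in the filter."""
    sp_filter, doc_filter = split_filter(filter_)
    sp_ok = all(k in job["sp"] and job["sp"][k] == v for k, v in sp_filter.items())
    doc_ok = all(k in job["doc"] and job["doc"][k] == v for k, v in doc_filter.items())
    return sp_ok and doc_ok


# Hypothetical jobs, each with a state point ("sp") and a document ("doc").
jobs = [
    {"sp": {"a": 1}, "doc": {"done": True}},
    {"sp": {"a": 1}, "doc": {"done": False}},
    {"sp": {"a": 2}, "doc": {"done": True}},
]

selected = [job for job in jobs if matches(job, {"sp.a": 1, "doc.done": True})]
print(selected)
```

A single filter dictionary thus expresses conditions on both the state point and the document, which is what replaces a separate document filter argument.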

    Changed

    • Optimized internal path joins to speed up project iteration (#515).

    Deprecated

    • doc_filter arguments, which are replaced by namespaced filters. Due to their long history, doc_filter arguments will still be accepted in signac 2.0 and will only be removed in 3.0 (#516).
    • The modules signac.core.attrdict, signac.core.json, signac.core.jsondict, and signac.core.synceddict are deprecated in favor of the new SyncedCollection classes and will be removed in signac 2.0 (#483).

    Fixed

    • Corrected docstrings for Job.update_statepoint and Project.update_statepoint (#506, #563).
  • v1.6.0(Jan 25, 2021)

    This release focuses on performance improvements and better docs. Large projects should see massive speedups (4-7x on an SSD) for iterating over the project and working with signac-flow. Now you can scale up your science! 🎨

    Added

    • Implemented JobsCursor.__contains__ check (#449).
    • Added documentation for JobsCursor class (#475).

    Changed

    • Optimized job hash and equality checks (#442, #455).
    • Optimized H5Store initialization (#443).
    • State points are loaded lazily when Job is opened by id (#238, #239).
    • Optimized Job and Project classes to cache internal properties and initialize on access (#451).
    • Python 3.6 is only tested with oldest dependencies (#474).
    • Improved documentation for updating and resetting state points (#444).

    Deprecated

    • Deprecate syncutil.copytree method (#439).

    Fixed

    • Zero-dimensional NumPy arrays can be used in state points and documents (#449).
  • v1.5.1(Dec 20, 2020)

    Added

    • Support for h5py version 3 (#411).
    • Added pyupgrade to pre-commit hooks (#413).
    • Code is formatted with black and isort pre-commit hooks (#415).
    • Added macOS to CircleCI testing pipeline (#281, #414).
    • Official support for Python 3.9 (#417).

    Changed

    • Optimized internal function _mkdir_p (#421).
    • Optimized performance of job initialization (#422).
    • Optimized performance of buffer storage (#428).
    • Optimized performance of creating/loading synced data structures (#429).
  • v1.5.0(Sep 21, 2020)

    Added

    • Type annotations are validated during continuous integration (#313).
    • Added _repr_html_ method in ProjectSchema class (#314, #324).
    • Allow grouping by variables that are not present in all jobs in the project in JobsCursor.groupby (#321, #323).
    • Added parameters usecols and flatten to allow selection of columns and flattening of nested data when converting signac data into a pandas DataFrame (#327, #330).
    • Added support for pre-commit hooks (#355, #358).
    • Expanded CLI documentation (#187, #359, #377).

    Changed

    Fixed

    • Fix the signac config verify command (previously broken) (#301, #302).
    • Warnings now appear when raised by the signac CLI (#317, #308).
    • Fix dots in synchronization error messages (#375, #376).

    Deprecated

    • Deprecate the create_access_modules method in Project, to be removed in 2.0 (#303, #308).
    • The MainCrawler class has replaced the MasterCrawler class. Both classes are deprecated (#342).

    Removed

    • Dropped support for Python 3.5 (#340). The signac project will follow the NEP 29 deprecation policy going forward.
    • Removed dependency on pytest-subtests (#379).
  • v1.4.0(Feb 29, 2020)

    Added

    • Added Windows to platforms tested with continuous integration (#264, #266).
    • Add command line option -m/--merge for signac sync (#280, #230).

    Changed

    • Workspace directory is created when Project is initialized (#267, #271).
    • Changed testing framework from unittest to pytest (#212, #275).
    • Refactored internal use of deprecated get_statepoint function (#227, #282).

    Fixed

    • Fixed issues on Windows with H5Store, project import/export, and operations that move files (#264, #266).
    • Calling items or values on _SyncedDict objects does not mutate nested dictionaries (#234, #269).
    • Fixed issue with project.data access from separate instances of H5StoreManager (#274, #278).
    • Fixed error when launching signac shell if permissions are denied for .signac_shell_history (#279).

    Removed

    • Removed vendored tqdm module and replaced it with a requirement (#289).
    • Removed support for rapidjson as an alternative JSON library (#285, #287).
    • Removed tuple of keys implementation of nested dictionaries (#272, #296).
  • v1.3.0(Dec 19, 2019)

    Added

    • Official support for Python 3.8 (#258).
    • Add properties Project.id and Job.id (#250).
    • Add signac.diff_jobs function to compare two or more state points (#248, #247).
    • Add function to initialize a sample data space for testing purposes (#215).
    • Add schema version to ensure compatibility and enable migrations in future package versions (#165, #253).

    Changed

    • Implemented Project.__contains__ check in constant time (#231).

    Fixed

    • Attempting to create a linked view for a Project on Windows now raises an informative error message (#214, #236).
    • Project configuration is initialized using ConfigObj, allowing the configuration to include commas and special characters (#251, #252).

    Deprecated

    • Deprecate the get_id method in Project and Job classes in favor of the id property, to be removed in 2.0 (#250).
    • In-memory modification of the project configuration, to be removed in 2.0 (#246).

    Removed

    • Dropped support for Python 2.7 (#232).
  • v1.2.0(Jul 22, 2019)

    Added

    • Keep signac shell command history on a per-project basis (#134, #194).
    • Add read_json() and to_json() methods to Collection class (#104, #200).

    Fixed

    • Fix issue where shallow copies of instances of Job would behave incorrectly (#153, #207).
    • Fix issue causing a failure of the automatic conversion of valid key types (#168, #205).
    • Improve the "dots in keys" error message to make it easier to fix related issues (#170, #205).
    • Update the __repr__ and _repr_html_ implementations of the Project, Job, and JobsCursor classes (#193).
    • Reduce the logging verbosity about a missing default host key in the configuration (#201).
    • Fix issue with incorrect detection of dict-like files managed with the DictManager class (e.g. job.stores) (#203).
    • Fix issue with generating views from the command line for projects with only one job (#208, #211).
    • Fix issue with heterogeneous types in state point values that are lists (#209, #210).
  • v1.1.0(May 19, 2019)

    Added

    • Add command line options --sp and --doc for signac find that allow users to display key-value pairs of the state point and document in combination with the job id (#97, #146).
    • Improve the representation (return value of repr()) of instances of H5Group and SyncedAttrDict.

    Fixed

    • Fix: Searches for whole numbers will match all numerically matching integers regardless of whether they are stored as decimals or whole numbers (#169).
    • Fix: Passing an instance of dict to H5Store.setdefault() will return an instance of H5Group instead of a dict (#180).
    • Fix error with storing numpy arrays and scalars in a synced dictionary (e.g. job.statepoint, job.document) (#184).
    • Fix issue with ResourceWarning originating from unclosed instance of Collection (#186).
    • Fix issue with using the get_project() function with a relative path and search=False (#191).

    Removed

    • Support for Python version 3.4 (no longer tested).
  • v1.0.0(Feb 28, 2019)

    Highlights

    • Native integration of HDF5 files with the H5Store and H5StoreManager, which are exposed as the job.data, job.stores, project.data, and project.stores properties respectively.
    • The newly added signac.get_job() function makes it easier to obtain instances of Job by calling the function from within a job's workspace directory or by directly providing the path to the job's workspace directory. This is especially useful for interactive work or when accessing jobs which are outside of the current project.
    • Simplified export of project and job data to pandas dataframes via the to_dataframe() function.
    • Projects and job search results are displayed nicely in Jupyter Notebooks.
    • Support for compressed Collection files.

    Added

    • Official support for Python 3.7.
    • The H5Store and H5StoreManager classes, which are useful for storing (numerical) array-like data with an HDF5-backend. These classes are exposed within the root namespace.
    • The job.data and project.data properties which present an instance of H5Store to access numerical data within the job workspace and project root directory.
    • The job.stores and project.stores properties, which present an instance of H5StoreManager to manage multiple instances of H5Store to store numerical array-like data within the project workspace and project root directory.
    • The signac.get_job() and the signac.Project.get_job() functions that allow users to get a job handle by switching into or providing the job's workspace directory.
    • The job variable is automatically set when opening a signac shell from within a job's workspace directory.
    • Add the signac shell -c option which allows the direct specification of Python commands to be executed within the shell.
    • Automatic cast of numpy arrays to lists when storing them within a JSONDict, e.g., a job.statepoint or job.document.
    • Enable Collection class to manage collections stored in compressed files (gzip, zip, etc.).
    • Enable deleting of JSONDict keys through the attribute interface, e.g., del job.doc.foo.
    • Pretty HTML representation of instances of Project and JobsCursor targeted at Jupyter Notebooks (requires pandas, automatically enabled when installed).
    • The to_dataframe() function to export the job state point and document data of a Project or a JobsCursor, e.g., the result of Project.find_jobs(), as a pandas.DataFrame (requires pandas).

    Changed

    • Dots (.) in keys are no longer allowed for JSONDict and Collection keys (previously deprecated).
    • The JSONDict module is exposed in the root namespace, which is useful for storing text-serializable data with a JSON-backend similar to the job.statepoint or job.document, etc.
    • The Job.init() method returns the job to allow one-line job creation and initialization.
    • The search argument was added to the signac.get_project() function, which when True (the default), will cause signac to search for a project within and above a specified root directory, not only within the root directory. The behavior without any arguments remains unchanged.
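
The effect of the search argument can be pictured with a minimal standard-library sketch. It is not signac's code: find_project_root is a hypothetical helper, and the assumption that a project root is marked by a file named signac.rc is made for illustration only.

```python
import tempfile
from pathlib import Path

# Assumption for this sketch: a project root is marked by a file with this name.
CONFIG_NAME = "signac.rc"


def find_project_root(root, search=True):
    """Return the nearest directory containing CONFIG_NAME, or None."""
    root = Path(root).resolve()
    # With search=True, check the given directory and then each parent in turn;
    # with search=False, check only the given directory.
    candidates = [root, *root.parents] if search else [root]
    for directory in candidates:
        if (directory / CONFIG_NAME).is_file():
            return directory
    return None


with tempfile.TemporaryDirectory() as tmp:
    project_dir = Path(tmp).resolve() / "my_project"
    nested = project_dir / "analysis" / "plots"
    nested.mkdir(parents=True)
    (project_dir / CONFIG_NAME).touch()

    found_with_search = find_project_root(nested, search=True)
    found_without_search = find_project_root(nested, search=False)
    found_is_project = found_with_search == project_dir

print(found_is_project, found_without_search)
```

Starting from a nested directory, the upward search locates the enclosing project, while search=False restricts the lookup to the given directory and finds nothing there.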

    Fixed

    • Fix Collection.update() behavior such that existing documents with identical primary key are updated. Previously, a KeyError would be raised.
    • Fix issue where the Job.move() would trigger a confusing DestinationExists exception when trying to move jobs across devices / file systems.
    • Fix issue that caused failures when the python-rapidjson package is installed. The python-rapidjson package is used as the primary JSON-backend when installed.
    • Fix issue where schema with multiple keys would subset incorrectly if the list of jobs or statepoints was provided as an iterator rather than a sequence.

    Removed

    • Removes the obsolete and deprecated core.search_engine module.
    • The previously deprecated Project.find_statepoints() and Project.find_job_documents() functions have been removed.
    • The Project.find_jobs() no longer accepts the obsolete index argument.
  • v0.9.5(Jan 31, 2019)

    Fixed

    • Ensure that the next() function can be called for a JobsIterator, e.g., project.find().
    • Pickling issue that occurs when a _SyncedDict (job.statepoint, job.document, etc.) contains a list.
    • Issue with the readline module that would cause signac shell to fail on Windows operating systems.
Owner
Glotzer Group
We develop molecular simulation tools to study the self-assembly of complex materials and explore matter at the nanoscale.
Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python.

Fast Laplacian Eigenmaps in python: Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python. Comes with a wrapper for NMS

null 17 Jul 9, 2022
An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify.

An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify. The ETL process flows from AWS's S3 into staging tables in AWS Redshift.

null 1 Feb 11, 2022
signac-flow - manage workflows with signac

signac-flow - manage workflows with signac The signac framework helps users manage and scale file-based workflows, facilitating data reuse, sharing, a

Glotzer Group 44 Oct 14, 2022
This is an example of how to automate Ridit Analysis for a dataset with a large number of questions and many item attributes

This is an example of how to automate Ridit Analysis for a dataset with a large number of questions and many item attributes

Ishan Hegde 1 Nov 17, 2021
CleanX is an open source python library for exploring, cleaning and augmenting large datasets of X-rays, or certain other types of radiological images.

cleanX CleanX is an open source python library for exploring, cleaning and augmenting large datasets of X-rays, or certain other types of radiological

Candace Makeda Moore, MD 20 Jan 5, 2023
TextDescriptives - A Python library for calculating a large variety of statistics from text

A Python library for calculating a large variety of statistics from text(s) using spaCy v.3 pipeline components and extensions. TextDescriptives can be used to calculate several descriptive statistics, readability metrics, and metrics related to dependency distance.

null 150 Dec 30, 2022
Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Amundsen 3.7k Jan 3, 2023
A forecasting system dedicated to smart city data

smart-city-predictions: A forecasting system dedicated to smart city data. Engineering thesis carried out by Michał Stawikowski and

Kevin Lai 1 Nov 8, 2021
Galvanalyser is a system for automatically storing data generated by battery cycling machines in a database

Galvanalyser is a system for automatically storing data generated by battery cycling machines in a database, using a set of "harvesters", whose job it

Battery Intelligence Lab 20 Sep 28, 2022
Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.

Data lineage made simple, reliable, and automated. Effortlessly track the flow of data, understand dependencies and analyze impact. Features Visualiza

null 898 Jan 9, 2023
🧪 Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

The purpose of the panel-chemistry project is to make it really easy for you to do DATA ANALYSIS and build powerful DATA AND VIZ APPLICATIONS within the domain of Chemistry using Python and HoloViz Panel.

Marc Skov Madsen 97 Dec 8, 2022
A COVID data pipeline built with PySpark and MySQL that collects a data stream from an API, processes it, and stores it in a MySQL database.

A COVID data pipeline built with PySpark and MySQL that collects a data stream from an API, processes it, and stores it in a MySQL database.

null 2 Nov 20, 2021
PrimaryBid - Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift

Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift This project is composed of two parts: Part1 and Part2

Emmanuel Boateng Sifah 1 Jan 19, 2022
Evidence enables analysts to deliver a polished business intelligence system using SQL and markdown.

Evidence enables analysts to deliver a polished business intelligence system using SQL and markdown

null 915 Dec 26, 2022
The songplays datamart provides details about the musical taste of our customers and can help us improve our recommendation system

Songplays User activity datamart The following document describes the model used to build the songplays datamart table and the respective ETL process.

Leandro Kellermann de Oliveira 1 Jul 13, 2021
Convert tables stored as images to a usable .csv file

Convert an image of numbers to a .csv file This Python program aims to convert images of array numbers to corresponding .csv files. It uses OpenCV for

null 711 Dec 26, 2022
fds is a tool for Data Scientists made by DAGsHub to version control data and code at once.

Fast Data Science, AKA fds, is a CLI for Data Scientists to version control data and code at once, by conveniently wrapping git and dvc

DAGsHub 359 Dec 22, 2022
Python data processing, analysis, visualization, and data operations

Python This is a Python data processing, analysis, visualization and data operations of the source code warehouse, book ISBN: 9787115527592 Descriptio

FangWei 1 Jan 16, 2022
Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Data Scientist Learning Plan Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Trung-Duy Nguyen 27 Nov 1, 2022