Quilt is a self-organizing data hub for S3

Overview

Python Quick start, tutorials

If you have Python and an S3 bucket, you're ready to create versioned datasets with Quilt. Visit the Quilt docs for installation instructions, a quick start, and more.

Quilt in action

Who is Quilt for?

Quilt is for data-driven teams and offers features for coders (data scientists, data engineers, developers) and business users alike.

What does Quilt do?

Quilt manages data like code so that teams in machine learning, biotech, and analytics can experiment faster, build smarter models, and recover from errors.

How does Quilt work?

Quilt consists of a Python client, a web catalog, and lambda functions, all of which are open source, plus a suite of backend services and Docker containers orchestrated by CloudFormation.

The backend services are available under a paid license on quiltdata.com.

Use cases

  • Share data at scale. Quilt wraps AWS S3 to add simple URLs, web preview for large files, and sharing via email address (no need to create an IAM role).
  • Understand data better through inline documentation (Jupyter notebooks, markdown) and visualizations (Vega, Vega-Lite).
  • Discover related data by indexing objects in ElasticSearch.
  • Model data by providing a home for large data and models that don't fit in git, and by providing immutable versions for objects and data sets (a.k.a. "Quilt packages").
  • Decide by broadening data access within the organization and supporting the documentation of decision processes through auditable versioning and inline documentation.

Roadmap

I - Performance and core services

  • Address performance issues with push (e.g. re-hash)
  • Provide Presto-DB-powered services for filtering package repos with SQL
  • Investigate and implement more efficient manifest formats (e.g. Parquet), that scale to 10M keys; consider abbreviated "fast manifests" for lazy browsing
  • Refactor s3://bucket/.quilt for improved listing and delete performance
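For context, package manifests today are JSON Lines files (see "load JSONL manifest faster" in the 3.1.9 notes): the first record carries package-level metadata and each subsequent line describes one object. A minimal reader, as a sketch with illustrative field names:

```python
import json

def load_manifest(text):
    """Parse a JSONL package manifest: the first line is package metadata,
    each later line is one entry. Field names here are illustrative."""
    lines = [l for l in text.splitlines() if l.strip()]
    meta = json.loads(lines[0])
    entries = [json.loads(l) for l in lines[1:]]
    return meta, entries

manifest = '{"version": "v0"}\n{"logical_key": "data.csv", "size": 42}\n'
meta, entries = load_manifest(manifest)
```

Because every entry must be parsed before the package can be browsed, a columnar format like Parquet (or an abbreviated "fast manifest") avoids scanning all 10M lines up front.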

II - CI/CD for data

  • Ability to fork/merge packages
  • Data quality monitoring

III - Storage agnostic (support Azure, GCP buckets)

  • Evaluate min.io and ceph.io as shims
  • Evaluate feasibility of on-prem local storage as a repo

IV - Cloud agnostic

  • Evaluate K8s and Terraform to replace CloudFormation
  • Shim lambdas (consider serverless.com)
  • Shim ElasticSearch (consider SOLR)
  • Shim IAM via RBAC
Comments
  • build.yml globbing

    build.yml globbing

    Uses syntax:

    contents:
      foo:
        "*.baz"   # Case insensitive.  All files are sub-nodes of 'foo'
    

    A few notes:

    • case insensitivity was actually kind of a pain, but it works now
    • ~~there are a couple tools in tensorflow that I didn't bring over -- like the backported pathlib in tools.compat, and file->node duplicate naming conflict resolver.~~ These have been brought over. There may be conflicts for the TensorFlow branch once this is merged into master.
    • subdirs are made into nodenames for now instead of making subnodes -- so subdir_foo_csv
      • This now uses only the filename of the found file, and appends a number if there's a conflict.
    • ~~still rough~~
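A case-insensitive glob of the kind described above can be sketched with the standard library (match_case_insensitive is a hypothetical helper, not the build.yml implementation):

```python
import fnmatch

def match_case_insensitive(filenames, pattern):
    """Return the filenames matching a glob pattern, ignoring case."""
    pat = pattern.lower()
    return [f for f in filenames if fnmatch.fnmatchcase(f.lower(), pat)]

match_case_insensitive(["A.BAZ", "b.baz", "c.txt"], "*.baz")  # -> ["A.BAZ", "b.baz"]
```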
    opened by eode 27
  • When on the CLI, catch ctrl-c early in execution and exit cleanly

    When on the CLI, catch ctrl-c early in execution and exit cleanly

    Normally I'd just do a try/except block on or in main(), but at load time we import a bunch of external modules that take a lot of time, during which ctrl-c will cause an exception that misses that block. So this was added to quilt/__init__.py.

    Also, sometimes during development we want to be able to see the traceback -- for example, if we're wondering what's taking so long, or what function is causing network activity. Toward that end, there's now a variable quilt._DEV_MODE which enables/disables traceback.

    If your first argument is --dev, or if the environment has QUILT_DEV_MODE=True, tracebacks will still be shown. Help for --dev is suppressed and doesn't show up in quilt help, quilt --help, etc.
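The behavior can be sketched like this (dev_mode is a hypothetical helper; the actual flag lives in quilt/__init__.py as quilt._DEV_MODE):

```python
import os
import sys

def dev_mode(argv, env):
    """True when tracebacks should be shown: the first argument is --dev,
    or QUILT_DEV_MODE=True is set in the environment."""
    return argv[:1] == ["--dev"] or env.get("QUILT_DEV_MODE") == "True"

def main(argv=None, env=None):
    argv = sys.argv[1:] if argv is None else argv
    env = os.environ if env is None else env
    try:
        pass  # slow module imports and CLI dispatch would happen here
    except KeyboardInterrupt:
        if dev_mode(argv, env):
            raise        # developers still get the full traceback
        sys.exit("Interrupted.")
```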

    opened by eode 19
  • System-wide local storage

    System-wide local storage

    I have a multi-user Linux server that I will be using for a Data Science class. I would like to use quilt to distribute datasets with Jupyter Notebooks to students. But I don't want all students to download their own copies of the data when using quilt. I see that quilt uses appdirs.user_data_dir() to get the directory to use, and that I can set XDG_DATA_HOME to override that location:

    https://github.com/ActiveState/appdirs/blob/master/appdirs.py#L92

    • Will this break quilt?
    • Will this give me the optimization of each unique dataset only being downloaded a single time for all users?
    • How will this affect users creating and pushing datasets?

    Thanks!
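For reference, appdirs resolves the Linux data directory roughly as below, so pointing XDG_DATA_HOME at a shared, world-readable path redirects the cache for every user (simplified sketch; "Quilt" as the app name is illustrative):

```python
import os

def user_data_dir(appname, env=None):
    """Mimic appdirs.user_data_dir on Linux: XDG_DATA_HOME overrides
    the per-user ~/.local/share default (simplified sketch)."""
    env = os.environ if env is None else env
    base = env.get("XDG_DATA_HOME") or os.path.expanduser("~/.local/share")
    return os.path.join(base, appname)

user_data_dir("Quilt", {"XDG_DATA_HOME": "/srv/shared"})  # -> "/srv/shared/Quilt"
```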

    opened by ellisonbg 18
  • Package list pagination

    Package list pagination

    Adds pagination to the package list. WIP.

    TODO:

    • [x] styling tweaks + use proper colors and sizes from the design system or smth
    • [x] integrate pagination into profile view
    • [x] intl
    • [x] unit tests

    Open questions:

    • [x] is UX okayish? (wording, results-per-page selection)
    • [x] where to store pagination state? now the component is a "controlled component" with its state in redux, but maybe it makes sense to use local state instead (it will make it easier to use)
      • locally
    • [x] should we make the PackageList paginated by default?
      • yes
    • [x] do we need to link url params to the pagination state?
      • no

    cc @asah
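Whether the state lives in redux or locally, the core slicing is the same; a minimal sketch (names are illustrative, not the catalog's actual components, which are JavaScript):

```python
def paginate(items, page, per_page=10):
    """Return (items for a 1-based page, total page count)."""
    pages = max(1, -(-len(items) // per_page))   # ceiling division
    page = min(max(page, 1), pages)              # clamp out-of-range pages
    start = (page - 1) * per_page
    return items[start:start + per_page], pages

paginate(list(range(25)), page=3)  # -> ([20, 21, 22, 23, 24], 3)
```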

    opened by nl0 14
  • Teams UI

    Teams UI

    TODO

    • [x] wire up member actions (remove, reset password)
    • [x] indicate activity while performing member actions (lock, spinner)
    • [x] wire up package actions (delete)
    • [x] indicate activity while performing package actions (lock, spinner)
    • [x] handle member addition
    • [x] styling tweaks
    • [x] overall polish
    • [ ] pagination?
    • [x] proper i18n?
    • [x] fix npm run build
    • [x] fix linter warnings
    • [x] confirmations for actions?
    opened by nl0 13
  • WIP Package: traffic stats (installs and views sparklines)

    WIP Package: traffic stats (installs and views sparklines)

    Show traffic stats on the package page.

    Depends on:

    • ~#587~
    • ~#588~
    • #594

    TODO

    • [x] tweak styling
    • [x] intl
    • [x] wire up the api
    • [x] fill up the sparkline timeseries with zeroes to the fixed length (52)? @akarve
      • will be done on the API side
    • [x] get total installs count somewhere or change display logic to not show it altogether @akarve
      • will be done on the API side
    • [x] wait for the API changes and adjust if necessary
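Zero-filling a weekly series to a fixed width (the 52 above) is straightforward on either side of the API; this sketch left-pads so the most recent weeks stay right-aligned:

```python
def pad_timeseries(values, length=52):
    """Left-pad a timeseries with zeroes to a fixed length,
    keeping only the most recent values if it is too long."""
    values = list(values)[-length:]
    return [0] * (length - len(values)) + values

pad_timeseries([3, 1, 4], length=6)  # -> [0, 0, 0, 3, 1, 4]
```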
    opened by nl0 12
  • Synchronous catalog config from a global var + types

    Synchronous catalog config from a global var + types

    In order to simplify access to the config data throughout the app (to a simple import cfg from 'utils/Config') and setup of some tools (such as sentry), I've decided to refactor the config system to use a synchronous JS file which sets a global variable that later can be accessed by the app code.

    This requires a change in deployment which will come soon (later today).

    UPD: Actually, deployment change is only relevant for OPEN, so we can merge this PR without waiting for that change.

    In order for a local catalog instance to continue functioning, rename the config.json file in the static-dev folder to config.js and prepend the contents with window.QUILT_CATALOG_CONFIG =
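The rename-and-prepend step above can be scripted; a sketch (json_to_config_js is a hypothetical helper, not part of the repo):

```python
import json

def json_to_config_js(json_text, var="window.QUILT_CATALOG_CONFIG"):
    """Turn a config.json payload into the config.js global-variable form."""
    cfg = json.loads(json_text)          # validate that it is real JSON first
    return f"{var} = {json.dumps(cfg)}"

json_to_config_js('{"apiGatewayEndpoint": "https://example.com"}')
```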

    TODO

    • [x] changelog
    opened by nl0 11
  • [ENH] upload parquet files directly

    [ENH] upload parquet files directly

    Hi team,

    thanks for the great work on this project. With the current behavior, when users have tabular data like CSV, it gets converted to Parquet upon quilt push. But when the data are already stored as a series of Parquet files, the binaries are simply uploaded to the Quilt server and the underlying dataframes aren't registered under tables/.

    Is it possible to allow uploading Parquet files directly so that the dataframes are registered and usable?

    opened by knaaptime 11
  • Export command

    Export command

    Summary

    Includes quilt export and command.export(...)

    Only handles FileNode objects, not TableNode objects (minimum for TF support).

    Depends On

    ~~#536 Minor test improvements~~ ~~#537 Build.py uncaught exception fix, other minor misc from export PR~~

    opened by eode 11
  • Update styles dependencies

    Update styles dependencies

    • highlight.js: 10.7.2 → 11.0.1
    • katex: 0.13.5 → 0.13.11
    • react-ace: 9.4.0 → 9.4.1
    • remarkable: 1.7.4 → 2.0.1
    • sanitize.css: 11.0.1 → 12.0.1
    • vega-embed: 6.17.0 → 6.18.2
    opened by fiskus 10
  • Fix config read+support tests on Windows

    Fix config read+support tests on Windows

    Description

    quilt3 fails to load config on Windows as of 3.1.12, giving me this error on import quilt3:

    AttributeError: 'WindowsPath' object has no attribute 'read'
    

    from inside the yaml code. Tracing around, I believe the root is that this check should accept all pathlib.Paths, not just pathlib.PosixPath. Without that check, quilt3 passes a pathlib object instead of a stream into yaml.safe_load, which triggers the above error.

    Changing this worked for me locally.

    opened by NathanDeMaria 10
  • Bump json5 from 1.0.1 to 1.0.2 in /catalog

    Bump json5 from 1.0.1 to 1.0.2 in /catalog

    Bumps json5 from 1.0.1 to 1.0.2.

    Release notes

    Sourced from json5's releases.

    v1.0.2

    • Fix: Properties with the name __proto__ are added to objects and arrays. (#199) This also fixes a prototype pollution vulnerability reported by Jonathan Gregson! (#295). This has been backported to v1. (#298)
    Changelog

    Sourced from json5's changelog.

    Unreleased [code, diff]

    v2.2.3 [code, diff]

    v2.2.2 [code, diff]

    • Fix: Properties with the name __proto__ are added to objects and arrays. (#199) This also fixes a prototype pollution vulnerability reported by Jonathan Gregson! (#295).

    v2.2.1 [code, diff]

    • Fix: Removed dependence on minimist to patch CVE-2021-44906. (#266)

    v2.2.0 [code, diff]

    • New: Accurate and documented TypeScript declarations are now included. There is no need to install @types/json5. (#236, #244)

    v2.1.3 [code, diff]

    • Fix: An out of memory bug when parsing numbers has been fixed. (#228, #229)

    v2.1.2 [code, diff]

    ... (truncated)


    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.



    js dependencies javascript 
    opened by dependabot[bot] 1
  • Preview values in collapsed state in JsonDisplay

    Preview values in collapsed state in JsonDisplay

    Description

    TODO

    • [ ] Unit tests
    • [ ] Automated tests (e.g. Preflight)
    • [ ] Documentation
      • [ ] Python: Run build.py for new docstrings
      • [ ] JavaScript: basic explanation and screenshot of new features
    • [ ] Changelog entry (skip if change is not significant to end users, e.g. docs only)
    opened by fiskus 1
  • Cross-account SNS topic notifications & subscription

    Cross-account SNS topic notifications & subscription

    Description

    • Documentation to describe how to get notifications for cross-account S3 bucket changes
    • All PNG screenshot images have private data blurred and have been crushed with ImageOptim
    opened by robnewman 1
  • Images in Perspective table

    Images in Perspective table

    opened by fiskus 1
  • Update dependency jsonwebtoken to 9.0.0 [SECURITY]

    Update dependency jsonwebtoken to 9.0.0 [SECURITY]

    Mend Renovate

    This PR contains the following updates:

    | Package | Change |
    |---|---|
    | jsonwebtoken | 8.5.1 -> 9.0.0 |

    GitHub Vulnerability Alerts

    CVE-2022-23540

    Overview

    In versions <=8.5.1 of jsonwebtoken library, lack of algorithm definition in the jwt.verify() function can lead to signature validation bypass due to defaulting to the none algorithm for signature verification.

    Am I affected?

    You will be affected if you do not specify algorithms in the jwt.verify() function

    How do I fix it?

    Update to version 9.0.0 which removes the default support for the none algorithm in the jwt.verify() method.

    Will the fix impact my users?

    There will be no impact, if you update to version 9.0.0 and you don’t need to allow for the none algorithm. If you need 'none' algorithm, you have to explicitly specify that in jwt.verify() options.

    CVE-2022-23541

    Overview

    Versions <=8.5.1 of jsonwebtoken library can be misconfigured so that passing a poorly implemented key retrieval function (referring to the secretOrPublicKey argument from the readme link) will result in incorrect verification of tokens. There is a possibility of using a different algorithm and key combination in verification than the one that was used to sign the tokens. Specifically, tokens signed with an asymmetric public key could be verified with a symmetric HS256 algorithm. This can lead to successful validation of forged tokens.

    Am I affected?

    You will be affected if your application is supporting usage of both symmetric key and asymmetric key in jwt.verify() implementation with the same key retrieval function.

    How do I fix it?

    Update to version 9.0.0.

    Will the fix impact my users?

    There is no impact for end users

    CVE-2022-23539

    Overview

    Versions <=8.5.1 of jsonwebtoken library could be misconfigured so that legacy, insecure key types are used for signature verification. For example, DSA keys could be used with the RS256 algorithm.

    Am I affected?

    You are affected if you are using an algorithm and a key type other than the combinations mentioned below

    | Key type | algorithm |
    |----------|------------------------------------------|
    | ec | ES256, ES384, ES512 |
    | rsa | RS256, RS384, RS512, PS256, PS384, PS512 |
    | rsa-pss | PS256, PS384, PS512 |

    And for Elliptic Curve algorithms:

    | alg | Curve |
    |-------|------------|
    | ES256 | prime256v1 |
    | ES384 | secp384r1 |
    | ES512 | secp521r1 |
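The allowed combinations in the tables above can be checked mechanically; a sketch in Python (jsonwebtoken itself is a JavaScript library, so this only mirrors its 9.0.0 validation logic):

```python
# Secure asymmetric key type / algorithm pairs, per the tables above
ALLOWED_COMBINATIONS = {
    "ec": {"ES256", "ES384", "ES512"},
    "rsa": {"RS256", "RS384", "RS512", "PS256", "PS384", "PS512"},
    "rsa-pss": {"PS256", "PS384", "PS512"},
}

def combination_is_secure(key_type, alg):
    """True if the key type / algorithm pair is a valid secure combination."""
    return alg in ALLOWED_COMBINATIONS.get(key_type, set())

combination_is_secure("ec", "ES256")   # a valid pairing
combination_is_secure("rsa", "ES256")  # a mismatch 9.0.0 rejects
```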

    How do I fix it?

    Update to version 9.0.0. This version validates asymmetric key type and algorithm combinations. Please refer to the above-mentioned algorithm / key type combinations for the valid secure configuration. After updating to version 9.0.0, if you still intend to continue signing or verifying tokens using invalid key type/algorithm combinations, you'll need to set the allowInvalidAsymmetricKeyTypes option to true in the sign() and/or verify() functions.

    Will the fix impact my users?

    There will be no impact if you update to version 9.0.0 and already use a valid secure combination of key type and algorithm. Otherwise, set the allowInvalidAsymmetricKeyTypes option to true in the sign() and verify() functions to continue using an invalid key type/algorithm combination in 9.0.0 for legacy compatibility.

    CVE-2022-23529

    Overview

    For versions <=8.5.1 of jsonwebtoken library, if a malicious actor has the ability to modify the key retrieval parameter (referring to the secretOrPublicKey argument from the readme link) of the jwt.verify() function, they can gain remote code execution (RCE).

    Am I affected?

    You are affected only if you allow untrusted entities to modify the key retrieval parameter of the jwt.verify() on a host that you control.

    How do I fix it?

    Update to version 9.0.0

    Will the fix impact my users?

    The fix has no impact on end users.

    Credits

    Palo Alto Networks



    This PR has been generated by Mend Renovate. View repository job log here.

    js dependencies 
    opened by renovate[bot] 1
Releases(3.2.1)
  • 3.2.1(Oct 15, 2020)

    Python API

    • [Performance] 2X to 5X faster multi-threaded hashing of S3 objects (#1816, #1788)
    • [Fixed] Bump minimum required version of tqdm. Fixes a crash (UnseekableStreamError) during upload retry. (#1853)
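The shape of that hashing speedup, sketched with the standard library (the real code hashes S3 object ranges rather than in-memory blobs):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def hash_blobs(blobs, workers=10):
    """SHA-256 many byte blobs concurrently; threads overlap I/O waits,
    which is where the gain against S3 comes from."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: hashlib.sha256(b).hexdigest(), blobs))

hash_blobs([b"first object", b"second object"])
```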

    CLI

    • [Added] Add --meta argument to push (#1793)
    • [Fixed] Fix crash in list-packages (#1852)
    Source code(tar.gz)
    Source code(zip)
  • v3.2.0(Sep 8, 2020)

    Python:

    • Refactors local and s3 storage-layer code around a new PackageRegistry base class (to support improved file layouts in future releases)
    • Multithreaded download for large files; large performance gains when installing packages with large files, especially on large instances
    • Package name added to Package.resolve_hash
    • Bugfix: remove package revision by shorthash
    • Performance improvements for build and push
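Shorthash removal works by prefix-matching against the known top hashes; a minimal sketch (resolve_shorthash is illustrative, not quilt3's exact API):

```python
def resolve_shorthash(prefix, known_hashes):
    """Expand a short hash prefix to the unique full top hash it names."""
    matches = [h for h in known_hashes if h.startswith(prefix)]
    if not matches:
        raise ValueError(f"no revision matches {prefix!r}")
    if len(matches) > 1:
        raise ValueError(f"short hash {prefix!r} is ambiguous")
    return matches[0]

resolve_shorthash("ab", ["abc123", "def456"])  # -> "abc123"
```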

    Catalog & Lambdas:

    • PDF previews
    • Browse full package contents (no longer limited to 1000 files)
    • Indexing and search package-level metadata
    • Fixed issue with download button for certain text files
    • FCS files: content indexing and preview
    • Catalog sign-in with email (or username)
    • Catalog support for sign-in with Okta
    Source code(tar.gz)
    Source code(zip)
  • v3.1.14(Jun 13, 2020)

    Catalog

    • .cef preview
    • allow hiding download button
    • only show stats for 2-level extensions for .gz files

    Python

    • quilt3.logged_in()
    • fix retries during hashing
    • improve progress bars
    • fix quilt3 catalog
    • expanded documentation
    • reduce pyyaml requirements to prevent version conflicts

    Backend

    • improve unit test coverage for indexing lambdas
    • fix real-time delete handling (incl. for unversioned objects)
    • handle all s3:ObjectCreated: and ObjectRemoved: events (fixes ES search state and bucket Overview)
    Source code(tar.gz)
    Source code(zip)
  • 3.1.13(Apr 15, 2020)

    Python API

    • Official support for Windows
    • Add support for Python 3.7, 3.8
    • Fix Package import in Python
    • Updated libraries for stability and security
    • Quiet TQDM for log files (export QUILT_MINIMIZE_STDOUT=true)
    • CLI setting of config parameters

    Catalog

    • new feature to filter large S3 directories with regex
    • more reliable bucket region inference
    • Support preview of larger Jupyter notebooks in S3 (via transparent GZIP)
    • JS (catalog) dependencies for stability and security
    • extended Parquet file support (for files without a .parquet extension)
    • Improvements to catalog signing logic for external and in-stack buckets

    Special thanks to @NathanDeMaria (CLI and Windows support) and @JacksonMaxfield for contributing code to this release.

    Source code(tar.gz)
    Source code(zip)
  • 3.1.12(Mar 11, 2020)

  • 3.1.11(Mar 10, 2020)

    Catalog

    • Updated JS dependencies
    • Display package truncation warning in Packages

    Python

    • quilt3 install foo/bar/subdirectory
    • Bug fixes for CopyObject and other exceptions
    Source code(tar.gz)
    Source code(zip)
  • 3.1.10(Jan 29, 2020)

    Python Client

    • Fix bug introduced in 3.1.9 where uploads fail due to incorrect error checking after a HEAD request to see if an object already exists (#1512)
    Source code(tar.gz)
    Source code(zip)
  • 3.1.9(Jan 29, 2020)

    Python Client

    • quilt3 install now displays the tophash of the installed package (#1461)
    • Added quilt3 --version (#1495)
    • Added quilt3 disable-telemetry CLI command (#1496)
    • CLI command to launch catalog directly to file viewer - quilt3 catalog $S3_URL (#1470, #1487)
    • No longer run local container for quilt3 catalog (#1504). See (#1468, #1483, #1482) for various bugs leading to this decision.
    • Add PhysicalKey class to abstract away local files vs unversioned s3 object vs versioned s3 object (#1456, #1473, #1478)
    • Changed cache directory location (#1466)
    • More informative progress bars (#1506)
    • Improve support for downloading from public buckets (#1503)
    • Always disable telemetry during tests (#1494)
    • Bug fix: prevent misleading CLI argument abbreviations (#1481) such as --to referring to --tophash
    • Bug fix: background upload/download threads are now killed if the main thread is interrupted (#1486)
    • Performance improvements: load JSONL manifest faster (#1480)
    • Performance improvement: If there is an error when copying files, fail quickly (#1488)

    Catalog

    • Better package listing UX (#1462)
    • Improve bucket stats visualization when there are many categories (#1469)
    Source code(tar.gz)
    Source code(zip)
  • 3.1.8(Dec 20, 2019)

  • 3.1.7(Dec 13, 2019)

    Catalog

    • New LOCAL mode for running the catalog on localhost

    Python API

    • quilt3 catalog command to run the Quilt catalog on your local machine
    • quilt3 verify compares the state of a directory to the contents of a package version
    • Added a local file cache for installed packages
    • Performance improvements for upload and download
    • Support for short hashes to identify package versions
    • Adding telemetry for API calls
    Source code(tar.gz)
    Source code(zip)
  • 3.1.6(Dec 3, 2019)

    API Improvements

    • Implement Package.rollback
    • Drop support for object metadata (outside of packages)
    • Change the number of threads used when installing and pushing from 4 to 10 (S3 default)
    • Misc bug fixes
    Source code(tar.gz)
    Source code(zip)
  • 3.1.5(Nov 20, 2019)

    Catalog

    • Fix package listing for packages with more than 100 revisions
    • Add stacked area charts for downloads
    • 2-level file-extensions for bucket summary

    Python

    • Fix uploads of very large files
    • Remove unnecessary copying during push
    Source code(tar.gz)
    Source code(zip)
  • 3.1.4(Oct 17, 2019)

  • 3.1.3(Oct 11, 2019)

    • Bug fix: when adding python objects to a package a temporary file would be created and then deleted when the object was pushed, leading to a crash if you tried to push that package again (PR #1264)
    Source code(tar.gz)
    Source code(zip)
  • 3.1.2(Oct 11, 2019)

    • Added support for adding an in-memory object (such as a pandas.DataFrame) to a package via package.set()
    • Fix to work with pyarrow 0.15.0
    • Performance improvements for list_packages and delete_package
    • Added list_package_versions function
    Source code(tar.gz)
    Source code(zip)
  • 3.0.0(May 24, 2019)

  • 2.9.15(Jan 9, 2019)

  • 2.9.14(Dec 20, 2018)

    Compiler

    • Adding a hash argument to quilt.push to allow pushing any package version to a registry.

    Registry

    • Make object sizes required.
    • Update urllib3 version for security patch

    Docs

    • Improved instructions for running registries.
    Source code(tar.gz)
    Source code(zip)
  • 2.9.13(Nov 12, 2018)

  • 2.9.12(Oct 11, 2018)

    Make Quilt work with pyarrow 0.11

    • Update Parquet reading code to match the API change in pyarrow 0.11.
    • Fix downloading of zero-byte files
    Source code(tar.gz)
    Source code(zip)
  • 2.9.11(Sep 11, 2018)

    Compiler

    • New helper function quilt.save adds an object (e.g., a Pandas DataFrame) to an existing package by performing a sub-package build and push in a single step
    • BugFix: quilt.load now correctly returns sub-packages (fixes issue #741)

    Registry

    • Send a welcome email to new users after activation
    Source code(tar.gz)
    Source code(zip)
  • 2.9.10(Aug 8, 2018)

    Compiler

    • fixes an issue with packages created on older versions of pyarrow
    • improves readability for quilt inspect
    • allow adding a node with metadata using sub-package build/push

    Registry

    • adds documentation for running a private registry in AWS
    Source code(tar.gz)
    Source code(zip)
  • v2.9.9(Jul 31, 2018)

  • 2.9.8(Jul 30, 2018)

    Compiler

    • Added support for sub-package build and push, allowing nodes to be added to large packages without materializing the whole package
    • First-class support for ndarray

    Registry

    • Replaced dependence on external OAuth2 provider with a built-in authentication and session management
    • Registry support for sub-package push

    Catalog

    • Updated to support new registry authentication
    Source code(tar.gz)
    Source code(zip)
  • 2.9.7(Jul 11, 2018)

    Compiler

    • added Bracket accessor for GroupNodes
    • asa.plot to show images in packages
    • asa.torch to convert packages to PyTorch Datasets
    • Enforce fragment store as read-only

    Catalog

    • Added source maps and CI for catalog testing
    Source code(tar.gz)
    Source code(zip)
  • 2.9.6(Jun 13, 2018)

    Documentation

    Expands and improves documentation for working with Quilt packages.

    Bug fixes and small improvements

    • Load packages by hash
    • Choose a custom loader for DataNodes with asa=

    Registry

    • Specify Ubuntu version in Dockerfiles
    Source code(tar.gz)
    Source code(zip)
  • 2.9.5(May 23, 2018)

    Catalog

    • display package traffic stats in catalog

    Compiler

    • filter packages based on per-node metadata
    • get/set metadata for package nodes
    • support custom loaders in the _data method

    Registry

    • package commenting
    Source code(tar.gz)
    Source code(zip)
  • 2.9.4(Apr 20, 2018)

    Compiler

    • Metadata-only package install
    • Build DataFrames from existing Parquet files
    • Remove HDF5 dependencies
    • Code cleanup and refactoring

    Registry

    • Option for metadata-only package installs
    • New endpoint for fetching missing fragments (e.g., from partially installed packages)
    • Improved full-text search
    Source code(tar.gz)
    Source code(zip)
  • 2.9.3(Mar 20, 2018)

    Compiler:

    • Allow building packages out of other packages and elements from other packages. A new build-file keyword, package, inserts a package (or sub-package) as an element in the package being built.

    Catalog:

    • Upgrade router and other dependencies
    • Display packages by author
    Source code(tar.gz)
    Source code(zip)
  • 2.9.2(Mar 1, 2018)

    Catalog Changes to support private registries

    • Admin UI for controlling users and access
    • Auditing views

    Globbing for package builds

    • Allow specifying sets of input files in build.yml

    Command-line support for private registries

    • Specify teams packages
    • Admin commands to create and activate/deactivate users
    Source code(tar.gz)
    Source code(zip)