Intake is a lightweight package for finding, investigating, loading and disseminating data.

Overview

Intake: A general interface for loading data

Intake is a lightweight set of tools for loading and sharing data in data science projects. Intake helps you:

  • Load data from a variety of formats (see the current list of known plugins) into containers you already know, like Pandas dataframes, Python lists, NumPy arrays, and more.
  • Convert boilerplate data loading code into reusable Intake plugins.
  • Describe data sets in catalog files for easy reuse and sharing between projects and with others.
  • Share catalog information (and data sets) over the network with the Intake server.
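
For example, a minimal session might look like this (the catalog path and entry name below are hypothetical):

import intake

# open a catalog file and load one of its entries into a pandas DataFrame
cat = intake.open_catalog('catalog.yml')   # hypothetical catalog file
df = cat.my_dataset.read()                 # 'my_dataset' is a made-up entry name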

Documentation is available at Read the Docs.

Status of intake and related packages is available at the Status Dashboard.

Weekly news about this repo and other related projects can be found on the wiki.

Install

Recommended method using conda:

conda install -c conda-forge intake

You can also install with pip, in which case you can choose how many of the optional dependencies to install; the simplest installation has the fewest requirements:

pip install intake

Additional extras are available as [server], [plot] and [dataframe]; to include everything:

pip install intake[complete]

Note that you may well need specific drivers and other plugins, which usually have additional dependencies of their own.

Development

  • Create a development Python environment with the required dependencies, ideally with conda. The requirements can be found in the yml files in the scripts/ci/ directory of this repo.
    • e.g. conda env create -f scripts/ci/environment-py38.yml and then conda activate test_env
  • Install intake using pip install -e .[complete]
  • Use pytest to run tests.
  • Create a fork on github to be able to submit PRs.
  • We respect, but do not enforce, PEP8 standards; all new code should be covered by tests.

Comments
  • Monthly dev meeting

    Monthly dev meeting

    I mentioned this idea on gitter, but the idea arose at the dask dev meeting that it would be a generally useful thing to have a monthly dev meeting where we can discuss maintenance, road-map and so on.

    Perhaps 10:30am eastern on the first Thursday of the month (that would put it right before the dask monthly meeting, so maybe it'd be easier to remember?).

    @danielballan @martindurant

    opened by jsignell 40
  • Use entrypoints to manage drivers. Add subcommand.

    Use entrypoints to manage drivers. Add subcommand.

    I would like it to be possible to:

    • Provide drivers that are discoverable by intake without necessarily packaging them in a package named intake*
    • Have the option to disable a specific intake driver from getting autodiscovered without uninstalling the package that provides it or disabling other drivers in that package

    I think the Jupyter notebook serverextension system has settled on a nice way to manage this kind of configuration (after many iterations and pivots over the years). This PR imitates that system. It's just a first pass to evaluate interest and would need more careful thought before being merged.
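
    For context, a third-party package would advertise its driver through the intake.drivers entry-point group, roughly like this (the package and class names are hypothetical):

    # hypothetical setup.py for a third-party driver package
    from setuptools import setup

    setup(
        name='offbrand-catalog',
        packages=['offbrand_catalog'],
        entry_points={
            'intake.drivers': [
                'mongo_metadatastore = offbrand_catalog:MongoMetadataStoreCatalog',
            ],
        },
    )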

    Demo:

    An intake drivers subcommand can list the drivers that are added to intake.registry at import time.

    $ intake drivers list
    netcdf                        intake_xarray.netcdf.NetCDFSource
    opendap                       intake_xarray.opendap.OpenDapSource
    rasterio                      intake_xarray.raster.RasterIOSource
    remote-xarray                 intake_xarray.xarray_container.RemoteXarray
    zarr                          intake_xarray.xzarr.ZarrSource
    

    A verbose option includes __file__ locations, potentially useful for untangling issues with environments.

    $ intake drivers list -v
    netcdf                        intake_xarray.netcdf.NetCDFSource @ /home/dallan/Repos/bnl/intake-xarray/intake_xarray/netcdf.py
    opendap                       intake_xarray.opendap.OpenDapSource @ /home/dallan/Repos/bnl/intake-xarray/intake_xarray/opendap.py
    rasterio                      intake_xarray.raster.RasterIOSource @ /home/dallan/Repos/bnl/intake-xarray/intake_xarray/raster.py
    remote-xarray                 intake_xarray.xarray_container.RemoteXarray @ /home/dallan/Repos/bnl/intake-xarray/intake_xarray/xarray_container.py
    zarr                          intake_xarray.xzarr.ZarrSource @ /home/dallan/Repos/bnl/intake-xarray/intake_xarray/xzarr.py
    

    Now suppose I want to disable the 'zarr' driver provided by intake_xarray. Perhaps I have a different implementation that I want to use with 'zarr' and I need to avoid the name collision.

    $ intake drivers disable intake_xarray.xzarr.ZarrSource
    $ intake drivers list
    netcdf                        intake_xarray.netcdf.NetCDFSource
    opendap                       intake_xarray.opendap.OpenDapSource
    rasterio                      intake_xarray.raster.RasterIOSource
    remote-xarray                 intake_xarray.xarray_container.RemoteXarray
    $ python -c "import intake; print('zarr' in intake.registry)"
    False
    

    I can later re-enable it:

    $ intake drivers enable intake_xarray.xzarr.ZarrSource
    $ python -c "import intake; print('zarr' in intake.registry)"
    True
    

    The enable/disable state is stored in a separate YAML file for each driver in ~/.intake/drivers.d, imitating the system used by Jupyter. For backward compatibility, drivers in packages that begin with intake* are included in the registry unless they are explicitly disabled. (That is, they need not have any configuration in ~/.intake/drivers.d.) Drivers in packages with other names can be explicitly enabled:

    $ intake drivers enable offbrand_catalog.MongoMetadataStoreCatalog
    $ intake drivers list
    netcdf                        intake_xarray.netcdf.NetCDFSource
    opendap                       intake_xarray.opendap.OpenDapSource
    rasterio                      intake_xarray.raster.RasterIOSource
    remote-xarray                 intake_xarray.xarray_container.RemoteXarray
    zarr                          intake_xarray.xzarr.ZarrSource
    mongo_metadatastore           offbrand_catalog.MongoMetadataStoreCatalog
    

    The enable command created the following file at ~/.intake/drivers.d/offbrand_catalog.MongoMetadataStoreCatalog.yml:

    offbrand_catalog.MongoMetadataStoreCatalog:                                                                                                                                                   
      enabled: true
    

    As documented by Jupyter, packages can automatically enable their drivers at install time by using data_files in setup.py to place the corresponding files in ~/.intake/drivers.d/.

    in progress 
    opened by danielballan 35
  • New Panel GUI

    New Panel GUI

    Try it locally

    At the moment this work depends on development versions of panel and bokeh:

    conda install -c conda-forge panel==0.5.1 hvplot==0.4.0 bokeh==1.1.0
    jupyter labextension install @pyviz/jupyterlab_pyviz  # if using jupyterlab
    

    Then take a look at examples/GUI.ipynb, or run panel serve intake/gui/server.py
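
    A minimal sketch of notebook usage (assuming the versions above are installed; attribute names in this branch may differ):

    import intake

    intake.gui   # in a notebook cell this displays the data browser panel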

    Towards #225

    Feature parity with the ipywidgets GUI:

    • [x] local files
    • [x] remote files
    • [x] catalog selection
    • [x] data source selection
    • [x] data source description
    • [x] catalog search
    • [x] initialization as intake.gui

    Additional functionality:

    • [x] windows paths
    • [x] make sure that control buttons reflect visible status
    • [x] add home button to local file browser
    • [x] unit tests
    • [x] make path editable
    • [x] disable buttons when they don't make sense
    • [x] use properties over methods
    • [x] try to abstract widget from functionality
    • [x] choose from defined plots
    opened by jsignell 33
  • nesting of catalogs in a deep directory structure

    nesting of catalogs in a deep directory structure

    Hi folks, especially @martindurant to whom I spoke briefly today. I am working with a large collection of data (CMIP6) containing mostly netcdf and zarr objects - consisting of thousands of files in a directory tree. Obviously I cannot write flat YAML catalogs in this case. What I am thinking is putting a 'config.yaml' file in EVERY directory and making the sources section point to config.yaml files in each of its subdirectories and so on ... The nesting would be about 6-7 levels deep.

    Here is my example.

    This works great, except for two (hopefully) small issues (for more details, see above link)

    1. <tab> completion will work to get from the parent to child catalog (sub-directory), but then will not get to the grandchild catalog (sub-sub-directory)

    2. The path in each YAML file is relative to the <dir> in the initial intake.open_catalog('<dir>/config.yaml'). This means that if I intake.open_catalog(config.yaml) in a subdirectory, the paths are going to need to be relative to this subdirectory, but I have had to hard-wire them to an arbitrary initial parent directory.

    If I am re-inventing the wheel and you could point me to an existing solution, that would be very much appreciated. Or maybe there is a parent/child setting or equivalent to make this work?
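
    To make the setup concrete, here is a rough sketch of how such a nested tree might be opened (the paths and entry names are invented for illustration):

    import intake

    # each directory holds a config.yaml whose sources point at the config.yaml
    # files of its sub-directories, and so on down the tree
    root = intake.open_catalog('CMIP6/config.yaml')
    child = root['CMIP']              # <tab> completion reaches this level
    grandchild = child['NOAA-GFDL']   # reachable by indexing, but not via <tab>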

    opened by naomi-henderson 27
  • Slim down requirements

    Slim down requirements

    Currently the requirements file looks like the following:

    jinja2
    msgpack-numpy
    msgpack-python
    numpy
    pandas
    pytest
    python-snappy
    ruamel.yaml >= 0.15.0
    requests
    appdirs
    six
    tornado >= 4.5.1
    dask[complete]
    holoviews
    

    This makes it difficult for downstream deployments to easily depend on it. I wonder if perhaps some of these requirements could be made optional for minimal deployments.

    opened by mrocklin 21
  • Let users never see Entry objects

    Let users never see Entry objects

    This is "unification light" - still have Entry objects, but the user doesn't see them. The catalog just gives you the default source directly, but you can override them with call or clone (maybe should be get??) to get a new version of the source. Note that .describe() will give you the original catalog definition (with user parameters), not your overridden version, but repr of the source now also gives you that YAML view with the current set of arguments.

    cc @danielballan @tacaswell

    opened by martindurant 18
  • problem under pip

    problem under pip

    I can't figure out why, but this instruction doesn't work, even though ruamel.yaml is installed (Python-3.7, same error on Python-3.6):

    from ruamel_yaml.constructor import DuplicateKeyError
    

    I get a

    C:\WinP\bd37\bu\winp64-3.7.x.1\python-3.7.0b5.amd64\lib\site-packages\intake\catalog\local.py in <module>()
          7 import yaml
          8 
    ----> 9 from ruamel_yaml.constructor import DuplicateKeyError
         10 
         11 from jinja2 import Template
    
    ModuleNotFoundError: No module named 'ruamel_yaml'
    

    Any clue?
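
    A possible workaround (assuming pip installs the module as ruamel.yaml while conda ships it as ruamel_yaml) would be a tolerant import:

    try:
        from ruamel_yaml.constructor import DuplicateKeyError      # conda's ruamel_yaml
    except ImportError:
        from ruamel.yaml.constructor import DuplicateKeyError      # pip's ruamel.yaml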

    opened by stonebig 18
  • (Compressed) Excel driver

    (Compressed) Excel driver

    Hi,

    My main data sources at the moment are remote zipped Excel files, so I was starting to build a data package and a driver package, but I need your feedback: should it be an Excel driver with a compression-type option, and is anyone already working on it? I would use the pandas Excel reader.

    If all lights are green I will start this intake-excel driver.

    Regards, Guillaume

    opened by gansanay 17
  • Metadata fields used for plotting

    Metadata fields used for plotting

    Catalogues allow defining arbitrary metadata to be associated with a dataset, which could be very useful to provide hints to the plotting system. Specifically there are two types of options that could be useful in this regard:

    • Plot options: Options passed directly to the plotting API, e.g. datashade, width, height, colorbar, logx, logy etc.
    • Field annotations: Additional metadata to associate with dataset fields (i.e. columns in a dataframe), including labels (used for axis labels), units (also for axis labels), ranges (to set axis and color range limits)

    It would be good to decide on the syntax to express these options. I'm currently imagining something like this:

    sources:
      nyc_taxi:
        description: NYC Taxi dataset
        driver: parquet
        args:
          urlpath: 's3://datashader-data/nyc_taxi_wide.parq'
        metadata:
          plot:
            datashade: true
          fields:
            dropoff_x:
              label: 'Longitude'
            dropoff_y:
              label: 'Latitude'
            fare_amount:
              label: 'Fare'
              unit: '$'
              range: [0, 100]
    

    Other suggestions (perhaps with less nesting) would be welcome though.
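
    As a rough sketch (not an agreed API), a plotting layer might consume this metadata roughly as follows, assuming a catalog file containing the entry above:

    import intake
    import hvplot.pandas  # noqa: registers .hvplot on DataFrames

    cat = intake.open_catalog('catalog.yml')       # hypothetical catalog containing nyc_taxi
    source = cat.nyc_taxi
    plot_opts = source.metadata.get('plot', {})    # e.g. {'datashade': True}
    fields = source.metadata.get('fields', {})     # per-column labels, units, ranges

    df = source.read()
    df.hvplot.scatter('dropoff_x', 'dropoff_y',
                      xlabel=fields['dropoff_x']['label'],
                      ylabel=fields['dropoff_y']['label'],
                      **plot_opts)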

    opened by philippjfr 16
  • Incompatible with pandas 1.0.0

    Incompatible with pandas 1.0.0

    Pandas chose to remove support for msgpack in order to guide users toward Arrow instead. This breaks this codepath in intake

    https://github.com/intake/intake/blob/33a096721765fc7fe79e958d3aa5f050a4c60937/intake/container/serializer.py#L56-L57

    which now raises an AttributeError because DataFrame.to_msgpack no longer exists.

    I'm not immediately sure what the right fix is here, but maybe we should take up the suggestion in the TODO comment in this same function and transition to relying on distributed.

    https://github.com/intake/intake/blob/33a096721765fc7fe79e958d3aa5f050a4c60937/intake/container/serializer.py#L48-L49
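
    Purely for illustration (not intake's actual fix), the msgpack round trip could be replaced with Arrow IPC serialization along these lines:

    import pandas as pd
    import pyarrow as pa

    def serialize_df(df: pd.DataFrame) -> bytes:
        # write the frame as an Arrow IPC stream instead of DataFrame.to_msgpack
        table = pa.Table.from_pandas(df)
        sink = pa.BufferOutputStream()
        with pa.ipc.new_stream(sink, table.schema) as writer:
            writer.write_table(table)
        return sink.getvalue().to_pybytes()

    def deserialize_df(data: bytes) -> pd.DataFrame:
        return pa.ipc.open_stream(data).read_all().to_pandas()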

    opened by danielballan 15
  • Windows support

    Windows support

    Adds Windows support and AppVeyor to make sure it doesn't break again.

    Some things I'm worried about not being tested properly:

    • does serialization work properly on windows?
    • should CATALOG_DIR have a trailing /?
    • does catalog flattening work properly on windows?
    opened by jsignell 15
  • Question: what is the expected type for `direct_access` ?

    Question: what is the expected type for `direct_access` ?

    When looking at the occurrences of direct_access the expected types don't align which leads me to wonder what the expected type should be?

    https://github.com/intake/intake/search?q=direct_access

    When loading a catalog from a YAML file, the type expected here appears to be a string. However, the default value assumed by a LocalCatalogEntry appears to be a boolean.

    If I ignore the type specified by type hints and the docstrings, I can properly serialize the catalog and read from a YAML file to instantiate the entries.

    opened by lukecampbell 1
  • Programmatically add a catalog to Intake

    Programmatically add a catalog to Intake

    On https://intake.readthedocs.io/en/latest/quickstart.html#adding-data-source-packages-using-the-intake-path I read: "Adding Data Source Packages using the Intake path: Intake checks the Intake config file for catalog_path or the environment variable "INTAKE_PATH" for a colon separated list of paths (semicolon on windows) to search for catalog files. When you import intake we will see all entries from all of the catalogues referenced as part of a global catalog called intake.cat"

    1. Should the title be Adding Catalog Packages... instead of Adding Data Source Packages?
    2. Is there an API to add a catalog to intake? (*)

    (*) eg.

    cat = intake.open_catalog('us_states.yml')
    intake.add_catalog(cat) # ???
    
    opened by echarles 4
  • Add CSV output to intake get CLI

    Add CSV output to intake get CLI

    This small change adds the ability to get CSV output from the intake get CLI. Previously it printed the string representation of a DataFrame. It also adds a --output option for writing to a file instead of stdout.

    Closes #684

    opened by edsu 4
  • intake get output CSV

    intake get output CSV

    The docs say this for intake get:

    Given the name of a catalog entry, this subcommand outputs the entire data source to standard output.

    But when I run it I see the string representation of the Pandas DataFrame, not the entire dataset as CSV:

    $ intake get catalogs/bodleian.yaml turkish
                                                      id  ...                                           contents
    0  https://iiif.bodleian.ox.ac.uk/iiif/manifest/a...  ...                                                NaN
    0  https://iiif.bodleian.ox.ac.uk/iiif/manifest/2...  ...                                                NaN
    0  https://iiif.bodleian.ox.ac.uk/iiif/manifest/a...  ...                                                NaN
    0  https://iiif.bodleian.ox.ac.uk/iiif/manifest/9...  ...                                                NaN
    0  https://iiif.bodleian.ox.ac.uk/iiif/manifest/2...  ...                                                NaN
    0  https://iiif.bodleian.ox.ac.uk/iiif/manifest/e...  ...                                                NaN
    0  https://iiif.bodleian.ox.ac.uk/iiif/manifest/6...  ...                                                NaN
    0  https://iiif.bodleian.ox.ac.uk/iiif/manifest/c...  ...  [Persian ghazals (ff. 1b-16b). Nesimi, Dīvān-i...
    
    [8 rows x 30 columns]
    

    Would a PR to implement CSV output to STDOUT or an optional file be a welcome addition?
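
    In the meantime, the same output can be produced from Python (a sketch, assuming the entry loads as a pandas DataFrame):

    import sys
    import intake

    cat = intake.open_catalog('catalogs/bodleian.yaml')
    cat.turkish.read().to_csv(sys.stdout, index=False)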

    PS. Thank you for a beautiful and extremely useful data tool!

    opened by edsu 0
  • Usage/design question: why specifying `driver` is compulsory?

    Usage/design question: why specifying `driver` is compulsory?

    For some use cases, I would like to store a path to a folder. I thought that this could be done by removing driver from the catalog entry, but this triggers an error. For example:

    ---------------------------------------------------------------------------
    ValidationError                           Traceback (most recent call last)
    Input In [6], in <cell line: 3>()
          1 from intake import open_catalog
    ----> 3 cat = open_catalog("../cat.yml")
          5 cat.f
    
    File ~/miniconda3/envs/intake/lib/python3.10/site-packages/intake/__init__.py:167, in open_catalog(uri, **kwargs)
        164 if driver not in registry:
        165     raise ValueError('Unknown catalog driver (%s), supply one of: %s'
        166                      % (driver, list(sorted(registry))))
    --> 167 return registry[driver](uri, **kwargs)
    
    File ~/miniconda3/envs/intake/lib/python3.10/site-packages/intake/catalog/local.py:573, in YAMLFileCatalog.__init__(self, path, autoreload, **kwargs)
        571 self.filesystem = kwargs.pop('fs', None)
        572 self.access = "name" not in kwargs
    --> 573 super(YAMLFileCatalog, self).__init__(**kwargs)
    
    File ~/miniconda3/envs/intake/lib/python3.10/site-packages/intake/catalog/base.py:110, in Catalog.__init__(self, entries, name, description, metadata, ttl, getenv, getshell, persist_mode, storage_options, user_parameters)
        108 self.updated = time.time()
        109 self._entries = entries if entries is not None else self._make_entries_container()
    --> 110 self.force_reload()
    
    File ~/miniconda3/envs/intake/lib/python3.10/site-packages/intake/catalog/base.py:168, in Catalog.force_reload(self)
        166 """Imperative reload data now"""
        167 self.updated = time.time()
    --> 168 self._load()
    
    File ~/miniconda3/envs/intake/lib/python3.10/site-packages/intake/catalog/local.py:608, in YAMLFileCatalog._load(self, reload)
        606     logger.warning("Use of '!template' deprecated - fixing")
        607     text = text.replace('!template ', '')
    --> 608 self.parse(text)
    
    File ~/miniconda3/envs/intake/lib/python3.10/site-packages/intake/catalog/local.py:687, in YAMLFileCatalog.parse(self, text)
        684 result = CatalogParser(data, context=context, getenv=self.getenv,
        685                        getshell=self.getshell)
        686 if result.errors:
    --> 687     raise exceptions.ValidationError(
        688         "Catalog '{}' has validation errors:\n\n{}"
        689         "".format(self.path, "\n".join(result.errors)), result.errors)
        691 cfg = result.data
        693 self._entries = {}
    
    ValidationError: Catalog '../cat.yml' has validation errors:
    
    ("missing required key 'driver'", {'args': {'urlpath': '{{ CATALOG_DIR }}/data'}})
    

    My question is: why is specifying driver compulsory? I understand that without it, a user could import the catalog and face an error at a later stage. Is that the main reason, or are there other considerations? (Apologies if this was covered in previous issues or documentation; I did a quick scan but couldn't find anything relevant.)

    opened by SultanOrazbayev 8
  • not getting optional dependencies for dataframe using poetry

    not getting optional dependencies for dataframe using poetry

    I'm not sure whether this behavior indicates a problem with poetry or with a setup.py file in intake or dask. I had to bug one of you first, and you drew the short straw (sorry!).

    I am working on a new Intake driver. I have a pyproject.toml file like this:

    [tool.poetry]
    name = "intake-xyz"
    version = "0.1.0"
    description = "XYZ plugin for Intake"
    authors = ["Ian Carroll <[email protected]>"]
    exclude = ['**/tests']
    
    [tool.poetry.plugins."intake.drivers"]
    "xyz" = "intake_xyz.source:XYZDataFrameSource"
    
    [tool.poetry.dependencies]
    python = "^3.8"
    intake = {extras = ["dataframe"], version = "^0.6.5"}
    
    [tool.poetry.dev-dependencies]
    pytest = "^7.1.2"
    
    [build-system]
    requires = ["poetry-core>=1.0.0"]
    build-backend = "poetry.core.masonry.api"
    

    With the above, poetry install will not install all the dependencies. It does this:

    % poetry install
    Creating virtualenv intake-xyz-1PWxoOBD-py3.10 in ...
    Updating dependencies
    Resolving dependencies... (0.3s)
    
    Writing lock file
    
    Package operations: 24 installs, 0 updates, 0 removals
    
      • Installing locket (1.0.0)
      • Installing pyparsing (3.0.9)
      • Installing toolz (0.12.0)
      • Installing cloudpickle (2.1.0)
      • Installing fsspec (2022.5.0)
      • Installing markupsafe (2.1.1)
      • Installing msgpack (1.0.4)
      • Installing numpy (1.23.1)
      • Installing packaging (21.3)
      • Installing partd (1.2.0)
      • Installing pyyaml (6.0)
      • Installing appdirs (1.4.4)
      • Installing attrs (21.4.0)
      • Installing dask (2022.7.0)
      • Installing entrypoints (0.4)
      • Installing iniconfig (1.1.1)
      • Installing jinja2 (3.1.2)
      • Installing msgpack-numpy (0.4.8)
      • Installing pluggy (1.0.0)
      • Installing py (1.11.0)
      • Installing pyarrow (8.0.0)
      • Installing tomli (2.0.1)
      • Installing intake (0.6.5)
      • Installing pytest (7.1.2)
    
    Installing the current project: intake-xyz (0.1.0)
    

    Notice that pandas is missing. Only by installing "dask[dataframe]" explicitly do I get pandas.

    % poetry run pip install "dask[dataframe]"
    Requirement already satisfied: dask[dataframe] in path/to/site-packages (2022.7.0)
    Requirement already satisfied: packaging>=20.0 in path/to/site-packages (from dask[dataframe]) (21.3)
    Requirement already satisfied: cloudpickle>=1.1.1 in path/to/site-packages (from dask[dataframe]) (2.1.0)
    Requirement already satisfied: fsspec>=0.6.0 in path/to/site-packages (from dask[dataframe]) (2022.5.0)
    Requirement already satisfied: toolz>=0.8.2 in path/to/site-packages (from dask[dataframe]) (0.12.0)
    Requirement already satisfied: partd>=0.3.10 in path/to/site-packages (from dask[dataframe]) (1.2.0)
    Requirement already satisfied: pyyaml>=5.3.1 in path/to/site-packages (from dask[dataframe]) (6.0)
    Collecting pandas>=1.0
      Using cached pandas-1.4.3-cp310-cp310-macosx_10_9_x86_64.whl (11.5 MB)
    Requirement already satisfied: numpy>=1.18 in path/to/site-packages (from dask[dataframe]) (1.23.1)
    Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in path/to/site-packages (from packaging>=20.0->dask[dataframe]) (3.0.9)
    Collecting python-dateutil>=2.8.1
      Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
    Collecting pytz>=2020.1
      Using cached pytz-2022.1-py2.py3-none-any.whl (503 kB)
    Requirement already satisfied: locket in path/to/site-packages (from partd>=0.3.10->dask[dataframe]) (1.0.0)
    Collecting six>=1.5
      Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
    Installing collected packages: pytz, six, python-dateutil, pandas
    Successfully installed pandas-1.4.3 python-dateutil-2.8.2 pytz-2022.1 six-1.16.0
    

    I have no trouble working around this by adding the "dask[dataframe]" dependency explicitly, but that shouldn't be necessary. Care to place blame?

    opened by itcarroll 0