Publish Xarray Datasets via a REST API.

Overview

Xpublish

Server-side: Publish an Xarray Dataset through a REST API

ds.rest.serve(host="0.0.0.0", port=9000)

Client-side: Connect to a published dataset

The published dataset can be accessed from various kinds of client applications. Here is an example of directly accessing the data from within Python:

import xarray as xr
import zarr
from fsspec.implementations.http import HTTPFileSystem

fs = HTTPFileSystem()
http_map = fs.get_mapper('http://0.0.0.0:9000')

# open as a zarr group
zg = zarr.open_consolidated(http_map, mode='r')

# or open as another Xarray Dataset
ds = xr.open_zarr(http_map, consolidated=True)

Why?

Xpublish lets you serve/share/publish Xarray Datasets via a web application.

The data and/or metadata in the Xarray Datasets can be exposed in various forms through pluggable REST API endpoints. Dask can be used on the server side to deliver large datasets efficiently and on demand.

We are exploring applications of Xpublish that include:

  • publishing on-demand or derived data products
  • turning Xarray objects into streaming services (e.g. OPeNDAP)

How?

Under the hood, Xpublish uses a web app (FastAPI) that exposes a REST-like API with built-in and/or user-defined endpoints.

For example, Xpublish provides by default a minimal Zarr-compatible REST-like API with the following endpoints:

  • .zmetadata: returns Zarr-formatted metadata keys as JSON strings.
  • var/0.0.0: returns a variable data chunk as a binary string.
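To make the chunk endpoint more concrete, here is a minimal sketch of how a key such as var/0.0.0 could be translated into index ranges of the full array. It assumes the Zarr v2 convention of dot-separated chunk indices; the function name chunk_key_to_slices and the example shape and chunk sizes are illustrative and not part of Xpublish.

```python
from typing import Tuple


def chunk_key_to_slices(
    key: str, shape: Tuple[int, ...], chunks: Tuple[int, ...]
) -> Tuple[str, Tuple[slice, ...]]:
    """Split a key like 'var/0.0.0' into the variable name and the
    slices that this chunk covers in the full array."""
    var_name, _, chunk_id = key.rpartition("/")
    indices = [int(i) for i in chunk_id.split(".")]
    if len(indices) != len(shape):
        raise ValueError(f"expected {len(shape)} chunk indices, got {len(indices)}")
    slices = tuple(
        # The last chunk along a dimension may be smaller than the chunk size.
        slice(i * c, min((i + 1) * c, n))
        for i, c, n in zip(indices, chunks, shape)
    )
    return var_name, slices


# Hypothetical variable "air" with shape (10, 4, 25) stored in (5, 4, 10) chunks.
name, slices = chunk_key_to_slices("air/1.0.2", shape=(10, 4, 25), chunks=(5, 4, 10))
print(name, slices)
```

A server answering this request would return the bytes of the array region selected by these slices, encoded as described in the variable's .zarray metadata.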
Issues
  • Refactor routes

    First step towards addressing #25.

    This moves all path operation functions out of RestAccessor and instead creates fastapi.APIRouter instances in a new routers sub-package. Each module in routers contains an APIRouter instance dedicated to a specific part of the API.

    Each function operates on the served dataset by overriding the get_dataset dependency for RestAccessor.app.
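The override mechanism can be illustrated without any web framework. The sketch below is a simplified stand-in: FastAPI exposes a similar mapping as app.dependency_overrides, while the resolve helper, the list_variables function and the dict used in place of a dataset are hypothetical and only serve the illustration.

```python
from typing import Any, Callable, Dict


def get_dataset() -> Any:
    """Placeholder dependency; it must be overridden before use."""
    raise NotImplementedError("get_dataset was not overridden")


# Plays the role of FastAPI's ``app.dependency_overrides`` mapping.
dependency_overrides: Dict[Callable[[], Any], Callable[[], Any]] = {}


def resolve(dependency: Callable[[], Any]) -> Any:
    """Call the override registered for a dependency, or the dependency itself."""
    return dependency_overrides.get(dependency, dependency)()


def list_variables() -> list:
    """A path-operation-like function that only knows about get_dataset."""
    dataset = resolve(get_dataset)
    return sorted(dataset)


# A plain dict stands in for the served xarray.Dataset (assumption).
served = {"air": [1.0, 2.0], "time": [0, 1]}
dependency_overrides[get_dataset] = lambda: served

print(list_variables())
```

The point of the pattern is that router modules never import the served dataset; they only declare that they need one, and the application decides which object is supplied.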

    TODO:

    • [x] move zarr-specific path operation functions after #21
    • ~~maybe refactor tests (if directly testing APIRouter instances is possible and a good idea)~~
    opened by benbovy 11
  • Publishing a collection of datasets

    It would be great if we could publish multiple datasets on the same server.

    I'm thinking of something like this:

    xpublish.serve(
        {'ds1': xarray.Dataset(...), 'ds2': xarray.Dataset(...)},
        host="127.0.0.1",
        port=9000
    )
    

    or

    # will launch the server
    ds1.rest.serve(host="127.0.0.1", port=9000, name="ds1")
    
    # same host/port -> will reuse the server
    ds2.rest.serve(host="127.0.0.1", port=9000, name="ds2")
    

    Would there be any technical challenge in supporting this?

    This will certainly break the current API end points, unless both cases (single dataset vs collection of datasets) are supported (perhaps not on the same running server).

    For the case of multiple datasets, all the current end points could for example have the prefix /datasets/<name>/. Some additional end points may be useful for listing the datasets in the collection.

    opened by benbovy 7
  • AttributeError: 'Dataset' object has no attribute 'rest'

    Hello,

    xpublish looks very promising and I want to use it for serving a few datasets in an experiment. I've installed xpublish in a conda environment, but I run into the exception AttributeError: 'Dataset' object has no attribute 'rest' when running this simple script:

    #!/opt/anaconda/envs/env_xpublish/bin/python
    
    import click
    import sys
    import pandas as pd
    import numpy as np
    import xarray as xr
    import xpublish
    
    ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))},
                     coords={'x': [10, 20, 30, 40],
                            'y': pd.date_range('2000-01-01', periods=5),
                            'z': ('x', list('abcd'))})
    
    
    ds.rest.serve(host='0.0.0.0', port=9000)
    

    Any help or tips would be really appreciated.

    question 
    opened by fabricebrito 6
  • Fix tests with last Xarray versions

    I guess the failing roundtrip tests are related to https://github.com/pydata/xarray/pull/2844 but I'm not sure what to do here to fix it. Any idea @jhamman @andersy005?

    opened by benbovy 6
  • Flexible routes

    Overview

    This PR modifies xpublish so that it can serve multiple datasets, based on @benbovy's prototype.

    This is an attempt to address #23 and #25.

    Notes

    Further analysis is needed to check whether Dask and caching work correctly; otherwise, serving multiple datasets seems to work.

    opened by lsetiawan 5
  • release 0.0.3?

    We've just merged some significant features and refactors. Is now a good time to make the 0.0.3 release?

    cc @benbovy

    opened by jhamman 4
  • Doc fixes, tweaks and improvements

    A couple of comments:

    • The rest accessor API is now documented using sphinx-autosummary-accessors.

    • I replaced the ipython directives with regular python code blocks. I don't think using ipython directives is worth relying on ipython plus all of xpublish's runtime dependencies for building the docs, given that we don't really leverage the interactive output here. I'm not against reverting this change in case anyone has objections.

    opened by benbovy 4
  • Move this project to a new GitHub organization?

    Recently, @lsetiawan and @benbovy have been making contributions to this repository. Would now be a good time to move the repository to a GitHub organization? I think xarray-contrib is a logical place but Pangeo would also be fine by me.

    opened by jhamman 4
  • Add init app method for custom app config

    Overview

    This adds an init_app method that applies additional settings to the FastAPI configuration, giving more control over the app and making it easier to extend.

    This is needed so that sub-applications can be used to build proxying for multiple datasets: https://fastapi.tiangolo.com/advanced/sub-applications-proxy/

    opened by lsetiawan 4
  • Publish a collection of datasets

    Closes #23.

    It took me a bit longer than I thought to implement this.

    TODO:

    • [x] add tests
    • [x] update docs

    This PR adds a top-level Rest class that has the same interface as the Dataset.rest accessor. In fact, the accessor now reuses this class internally and is just there for convenience:

    ds.rest.serve()
    

    is equivalent to

    # my_collection = {'': ds}
    
    # Rest(my_collection, unique=True).serve()
    
    # EDIT:
    Rest(ds).serve()
    

    When ~~unique=False~~ a mapping of datasets is given to Rest.__init__, all dataset-specific API endpoints have the prefix /datasets/{dataset_id}. There is also another endpoint, /datasets, that returns a list of all dataset ids. When ~~unique=True~~ a single dataset is given, the behavior stays unchanged compared to the current one.
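The two routing layouts described above can be sketched as a small function that lists the endpoint paths for each case. This is only an illustration: the endpoint_paths function, the FakeDataset stand-in and the two dataset-level paths are assumptions, not the code from this PR.

```python
from collections.abc import Mapping
from typing import List

# Illustrative subset of dataset-level endpoints (assumption).
DATASET_ENDPOINTS = ["/.zmetadata", "/{var}/{chunk}"]


class FakeDataset:
    """Stand-in for xarray.Dataset so the sketch has no dependencies."""


def endpoint_paths(datasets: object) -> List[str]:
    """Return the API paths for a single dataset or a mapping of datasets."""
    # Note: a real xarray.Dataset is itself a Mapping, so an actual
    # implementation would have to check for Dataset before this test.
    if isinstance(datasets, Mapping):
        prefix = "/datasets/{dataset_id}"
        return ["/datasets"] + [prefix + path for path in DATASET_ENDPOINTS]
    return list(DATASET_ENDPOINTS)


print(endpoint_paths(FakeDataset()))
print(endpoint_paths({"ds1": FakeDataset(), "ds2": FakeDataset()}))
```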

    Some other remarks on the implementation:

    • All the datasets in the published collection have an extra, xpublish-reserved global attribute for storing their id. This doesn't affect the original datasets (attributes are added on shallow copies) and it is not returned in the API results. I haven't found any better solution to access the dataset id (useful for setting cache keys) from within a path operation function than via the get_dataset dependency.

    • One thing that is still a bit annoying is that the dataset_id field is still required in the generated API docs when serving a single dataset (see https://github.com/tiangolo/fastapi/issues/1594). This is only a minor annoyance, though: the API works as expected and hopefully this issue will be fixed at some point.

    opened by benbovy 3
  • Support nested collections of datasets (datatree)

    Hi there,

    We want to use Datatree (a new package for working with hierarchies of xarray Datasets) together with Xpublish. A single datatree.DataTree can be written to a zarr dataset where subgroups typically contain an xarray.Dataset and optional subgroups.

    Our specific application is looking to serve data from a multi-dimensional data pyramid (see ndpyramid for more details) that looks something like:

    /
     ├── .zmetadata
     ├── .zgroup
     ├── .zattrs
     ├── 0
     │   ├── .zgroup
     │   └── tavg
     │       ├── .zarray
     │       └── 0.0
     ├── 1
     │   ├── .zgroup
     │   └── tavg
     │       ├── .zarray
     │       ├── 0.0
     │       ├── 0.1
     │       ├── 1.0
     │       └── 1.1
     ├── 2
    …
    

    We could serve each subgroup independently, but that is less desirable since the top-level group metadata (stored in .zattrs and in the consolidated .zmetadata) is needed to describe the relationship among groups.

    Proposed feature addition

    My assumption is that to serve a dataset like the one I described above, we need to build a custom router for DataTrees. This new router, we’ll call it the ZarrDataTreeRouter, would be able to reuse many of the existing zarr endpoints, but would support a more nested data model.
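To illustrate what such a nested router would have to serve, the sketch below flattens a hierarchy like the one shown earlier into the slash-separated keys that a Zarr client would request. The iter_store_keys function and the nested dict are hypothetical stand-ins; they are not part of Datatree or Xpublish.

```python
from typing import Dict, Iterator, Union

# A group is a dict; a leaf (metadata document or chunk) is raw bytes.
Tree = Dict[str, Union[bytes, "Tree"]]


def iter_store_keys(tree: Tree, prefix: str = "") -> Iterator[str]:
    """Yield every leaf of a nested group tree as a flat store key."""
    for name, node in tree.items():
        path = f"{prefix}{name}"
        if isinstance(node, dict):
            # Descend into the subgroup or array directory.
            yield from iter_store_keys(node, prefix=f"{path}/")
        else:
            yield path


# First pyramid level only, with empty placeholders for the contents.
pyramid: Tree = {
    ".zmetadata": b"",
    ".zgroup": b"",
    ".zattrs": b"",
    "0": {".zgroup": b"", "tavg": {".zarray": b"", "0.0": b""}},
}

keys = list(iter_store_keys(pyramid))
print(keys)
```

Each key corresponds to one URL path under the served root, so the nesting depth of the tree shows up directly in the route patterns the router needs to match.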

    In https://github.com/carbonplan/maps/issues/15, @benbovy suggested that this sort of support would make sense here so, perhaps we can simply ask for some pointers on how to architect the ZarrDataTreeRouter?

    One specific question we have is how an implementation of this should interface with #88 and #89. Both seem to be reshaping how complex, custom routers are developed.

    cc @jhamman

    opened by norlandrhagen 2
  • Add router factory classes

    This is very much inspired by titiler. It will provide more flexibility for, e.g., choosing alternative default values for some parameters (such as projection or colormap for xyz/wms services #88).

    One question: should we force users to subclass XpublishFactory or should we still allow them to provide "raw" FastAPI routers to xpublish applications?

    Using XpublishFactory exclusively would allow us to get rid of FastAPI's dependencies overriding (it always seemed hacky to me in this context) and some ugly workarounds like https://github.com/xarray-contrib/xpublish/commit/13bed4ba04b664d6b772222379c6889bcbe6278b.

    On the other hand, I like that users can simply create an APIRouter with a bunch of functions. Also no breaking change in this case.

    Thoughts @xarray-contrib/xpublish-devs?

    opened by benbovy 0
  • Feature/xyz routes

    So, the service is quite simple; however, there are a few design choices that may not be ideal:

    • Data projection: The service's tiling system uses morecantile, therefore one of the available CRS projections can be passed to the tiling function. The input data projection should match that CRS, however, in case there is a mismatch, the service does not try to re-project the data as it expects the users to take care of that. If that sounds OK, then I think there should be a check that raises an error if the CRS don't match.
    • Data validation: For simplicity the service requires that data spatial dimensions are named 'x' and 'y'. The function that checks that is defined in the router function. Is there a better way (or place) to implement a data input validation functionality?
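The two checks discussed in this list could be collected in one small validation function, sketched below. It is a simplification under stated assumptions: the function name validate_tile_input is hypothetical, and CRS values are compared as plain strings, whereas a real implementation would compare CRS objects.

```python
from typing import Iterable, Optional


def validate_tile_input(
    dims: Iterable[str], data_crs: Optional[str], tile_crs: str
) -> None:
    """Raise ValueError if the data cannot be tiled as-is."""
    # The service expects spatial dimensions named exactly 'x' and 'y'.
    missing = {"x", "y"} - set(dims)
    if missing:
        raise ValueError(f"missing spatial dimension(s): {sorted(missing)}")
    # No re-projection is attempted, so a CRS mismatch is an error.
    if data_crs != tile_crs:
        raise ValueError(
            f"data CRS {data_crs!r} does not match tile CRS {tile_crs!r}"
        )


# Passes silently: dimensions and CRS agree.
validate_tile_input(("time", "y", "x"), "EPSG:3857", "EPSG:3857")
print("input accepted")
```

Keeping the checks in a standalone function like this would also let them be reused outside the router, for example in a dependency that runs before every tile request.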
    opened by iacopoff 12
  • Can xpublish serve Datasets dynamically?

    Hi @jhamman, xpublish looks really neat.

    Does it provide a way to serve data holdings dynamically so that you could potentially serve millions of files? This would allow users to navigate an end-point that would dynamically read and serve an xarray Dataset on request (rather than in advance).
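One way to picture the on-request behavior asked about here is a read-only mapping that opens a dataset only when its id is first accessed and keeps a bounded cache of recent ones. This is a hypothetical sketch: the LazyDatasets class and the fake_loader function are illustrative, and the source does not confirm that xpublish accepts such an object.

```python
from collections.abc import Mapping
from functools import lru_cache
from typing import Any, Callable, Iterable, Iterator


class LazyDatasets(Mapping):
    """Mapping of dataset ids to datasets that are loaded on first access."""

    def __init__(
        self, ids: Iterable[str], loader: Callable[[str], Any], maxsize: int = 8
    ) -> None:
        self._ids = list(ids)
        # Keep at most ``maxsize`` loaded datasets in memory.
        self._load = lru_cache(maxsize=maxsize)(loader)

    def __getitem__(self, dataset_id: str) -> Any:
        if dataset_id not in self._ids:
            raise KeyError(dataset_id)
        return self._load(dataset_id)

    def __iter__(self) -> Iterator[str]:
        return iter(self._ids)

    def __len__(self) -> int:
        return len(self._ids)


calls = []


def fake_loader(dataset_id: str) -> dict:
    """Stand-in for something like opening a file as an xarray.Dataset."""
    calls.append(dataset_id)
    return {"id": dataset_id}


catalog = LazyDatasets(["a", "b"], fake_loader)
catalog["a"]
catalog["a"]  # served from the cache, the loader is not called again
print(calls)
```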

    opened by agstephens 2
  • return /air/mean as zarr, what's the best strategy to implement the routes?

    I really like the approach you implemented with:

    import xarray as xr
    from fastapi import APIRouter, Depends, HTTPException
    from xpublish.dependencies import get_dataset
    
    myrouter = APIRouter()
    
    @myrouter.get("/{var_name}/mean")
    def get_mean(var_name: str, dataset: xr.Dataset = Depends(get_dataset)):
        if var_name not in dataset.variables:
            raise HTTPException(
                status_code=404, detail=f"Variable '{var_name}' not found in dataset"
            )
    
        return float(dataset[var_name].mean())
    
    ds.rest(routers=[myrouter])
    
    ds.rest.serve()
    

    The example above returns a float. What I'd like to do is to implement API endpoints for a derived dataset (e.g. spatial subset) served as zarr, let's say:

    /datasets/{dataset_id}/{variable}/processes/position:aggregate-time/.zmetadata
    /datasets/{dataset_id}/{variable}/processes/position:aggregate-time/zgroups
    /datasets/{dataset_id}/{variable}/processes/position:aggregate-time/zattrs
    /datasets/{dataset_id}/{variable}/processes/position:aggregate-time/{var}/{chunk}

    The client would then do something like

    curl -X 'GET' \
      'http://0.0.0.0:9001/datasets/no2/tropospheric_no2_column_number_density/processes/position:aggregate-time/.zmetadata?location=2.12%2C48.75%2C2.52%2C48.99&function=mean&datetime=2018-05-01T00%3A00%3A00%2F2018-06-01T00%3A00%3A00' \
      -H 'accept: application/json'
    

    or

    from fsspec.implementations.http import HTTPFileSystem

    fs = HTTPFileSystem()
    
    http_map = fs.get_mapper('http://0.0.0.0:9001/datasets/no2/tropospheric_no2_column_number_density/processes/position:aggregate-time/.zmetadata?location=2.12%2C48.75%2C2.52%2C48.99&function=mean&datetime=2018-05-01T00%3A00%3A00%2F2018-06-01T00%3A00%3A00')
    

    What would be the best approach to implement this with xpublish? Any suggestion would be appreciated
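For readability, the percent-encoded query string used in the requests above can be built from plain values with the standard library. The sketch below only reconstructs the URL from this example; the parameter names location, function and datetime come from the request shown, and nothing here is an xpublish API.

```python
from urllib.parse import urlencode

base = (
    "http://0.0.0.0:9001/datasets/no2/tropospheric_no2_column_number_density"
    "/processes/position:aggregate-time/.zmetadata"
)

# Human-readable values; urlencode takes care of the percent-encoding.
params = {
    "location": "2.12,48.75,2.52,48.99",
    "function": "mean",
    "datetime": "2018-05-01T00:00:00/2018-06-01T00:00:00",
}

url = f"{base}?{urlencode(params)}"
print(url)
```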

    opened by fabricebrito 2
  • Is it possible to modify a dataset once the server is started?

    Hi, I'm interested in using xpublish to analyze some financial data that I have in a cluster. The data is a time series, so new data is concatenated to it every day, and when this happens the dataset that I published does not show the new dates. I would like to know how I can update that dataset without killing the server and publishing the datasets again. I can't find a method to do this in the documentation. Sorry for bothering you with this, but I think this is a great API for my use case.

    opened by josephnowak 4
  • DOC: add a dedicated section for Xpublish's built-in REST API endpoints

    Currently the built-in API routes are briefly documented in the Tutorial section, but if more routers are added (#50) it will be nice to have a separate, more detailed section.

    https://sphinxcontrib-openapi.readthedocs.io/ could be useful to automatically generate that page.

    documentation 
    opened by benbovy 0
  • Flexible Dask architecture

    Related to this blog post and this thread.

    A good way to support those architectures in xpublish would be to have a get_dask_client FastAPI dependency. Like the other resources (dataset, cache, etc.), this dependency would be overridden when initializing the FastAPI application instance in Rest with a dependency function implementing one of the architectures mentioned in the blog post. This would also be extensible to any user-defined architecture.

    Perhaps the dask client and cache resources can be independent or tightly coupled to each other, depending on the architecture? I guess accessing those resources as nested dependencies would probably help supporting both cases.

    Any thoughts @andersy005 @jhamman?

    enhancement 
    opened by benbovy 0
  • Would it be feasible to provide a WMS route?

    Hi

    I recently stumbled upon this neat little library. I'm currently building a dashboard using streamlit and folium. On the folium map, I'd like to overlay a raster, which can be achieved rather easily if it is hosted via WMS somewhere. However, I'd like to avoid hosting this single file via something as expensive as GeoServer or a THREDDS instance just for this single map.

    Would it be feasible to implement a WMS access option in Xpublish? I'm not sure about the boilerplate required for doing this... 🤷‍♂️

    opened by cwerner 12
  • Configure the application using Pydantic Settings

    See https://fastapi.tiangolo.com/advanced/settings/#pydantic-settings.

    This could be useful for development/staging/production deployments of an application created with xpublish.

    We could already expose some settings like the cache size. Later on, we could add some settings to connect to a Dask cluster (e.g., LocalCluster or via Dask-gateway).
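As a rough picture of what environment-driven settings could look like, here is a standard-library stand-in for the Pydantic approach linked above. Everything in it is an assumption for illustration: the Settings fields, the default cache size and the XPUBLISH_* variable names are hypothetical, and Pydantic's settings classes would add type validation on top of this.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Deployment settings with defaults suitable for development."""

    cache_size_bytes: int = 1_000_000
    dask_scheduler: Optional[str] = None

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        """Build settings from environment variables (hypothetical names)."""
        return cls(
            cache_size_bytes=int(
                env.get("XPUBLISH_CACHE_SIZE", cls.cache_size_bytes)
            ),
            dask_scheduler=env.get("XPUBLISH_DASK_SCHEDULER"),
        )


# Passing an explicit mapping keeps the example independent of the real environment.
production = Settings.from_env(
    {"XPUBLISH_CACHE_SIZE": "2048", "XPUBLISH_DASK_SCHEDULER": "tcp://scheduler:8786"}
)
print(production)
```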

    enhancement 
    opened by benbovy 0
Releases(0.1.0)
Owner
xarray-contrib
xarray compatible projects