A Pythonic introduction to methods for scaling your data science and machine learning work to larger datasets and larger models, using the tools and APIs you know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

Overview

Binder

Note: This repository is currently a work in progress. If you are joining for any given tutorial, please make sure to clone or pull the repository no more than 2 hours before the tutorial begins.

Material for any given tutorial will be in the notebooks directory: for example, material for the Data Umbrella & PyLadies NYC tutorial on October 27 is in the /data-umbrella-2020-10-27 subdirectory of /notebooks.

Data Science At Scale

This tutorial's purpose is to introduce Pythonistas to methods for scaling their data science and machine learning work to larger datasets and larger models, using the tools and APIs they know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

Prerequisites

Not a lot. It would help if you knew

  • programming fundamentals and the basics of the Python programming language (e.g., variables, for loops);
  • a bit about pandas, numpy, and scikit-learn (although not strictly necessary);
  • a bit about Jupyter Notebooks;
  • your way around the terminal/shell.

However, I have always found that the most important and beneficial prerequisite is a will to learn new things, so if you have that quality, you'll definitely get something out of this code-along session.

If you'd prefer to watch rather than code along, you'll still have a great time, and these notebooks will be available to download afterwards.

If you are going to code along and use the Anaconda distribution of Python 3 (see below), I ask that you install it before the session.

Getting set up computationally

Binder

The first option is to click on the Binder badge above. This will spin up the necessary computational environment for you so you can write and execute Python code from the comfort of your browser. Binder is a free service, so its resources are not guaranteed (though they usually work well). If you want something as close to a guarantee as possible, follow the instructions below to set up your computational environment locally (that is, on your own computer). Note that Binder will not work for all of the notebooks, particularly those in which we spin up Coiled Cloud; for these, you can follow along or set up your local environment as detailed below.

1. Clone the repository

To get set up for this live coding session, clone this repository. You can do so by executing the following in your terminal:

git clone https://github.com/coiled/data-science-at-scale

Alternatively, you can download the zip file of the repository from the top of its main page. If you prefer not to use git or don't have experience with it, this is a good option.

2. Download Anaconda (if you haven't already)

If you do not already have the Anaconda distribution of Python 3, go get it (n.b., you can also do this without Anaconda by using pip to install the required packages; however, Anaconda is great for data science and I encourage you to use it).

3. Create your conda environment for this session

Navigate to the relevant directory (data-science-at-scale) and install the required packages in a new conda environment:

conda env create -f binder/environment.yml

This will create a new environment called data-science-at-scale. To activate the environment on OSX/Linux, execute

source activate data-science-at-scale

On Windows, execute

activate data-science-at-scale

Then execute the following to get all the great Jupyter, Bokeh, and Dask dashboarding tools:

jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install @bokeh/jupyter_bokeh
jupyter labextension install dask-labextension

4. Open your Jupyter Lab

In the terminal, execute jupyter lab.

Then open the notebook 0-overview.ipynb in the relevant subdirectory of /notebooks and we're ready to get coding. Enjoy.

Comments
  • Make NBs production (training!) ready

    I've got the 1st NB for this tutorial at a place I'm happy with (final Coiled error aside): https://github.com/coiled/data-science-at-scale/blob/master/01-data-analysis-at-scale.ipynb

    As discussed, @davidventuri, if you could use this as inspiration for filling out text, code comments (and images, if you see fit), in the other NBs, that would be great.

    The remaining NBs I would like you to prioritize in the following order:

    • [x] 02a-scalable-dataframes-lab.ipynb
    • [x] 04b-scalable-machine-learning-advanced.ipynb (feel free to add stuff from the great posts you've written with us)
    • [x] 03-parallelization-basics.ipynb
    • [x] 02b-scalable-dataframes-lab.ipynb
    • [ ] 03a-parallelization-basics.ipynb
    • [ ] 04a-scalable-machine-learning.ipynb (not necessary at the moment; only do this after everything else, time permitting)

    A few things I've done here that will be needed in the other NBs:

    • Listing what we plan to do
    • Recap
    • Mentioning Coiled &/or Beta but not in a salesy way, merely to provide context
    • Enough code comments to give context but nothing over the top

    Don't edit any code but do raise issues if you think there's something funky.

    NBs 3a, 3b, and 4a are from here and need to be credited as such. I think you'll likely edit much of the text of them, so feel free to add in them something like "This material riffs off ..."

    Two more things:

    • [x] Could you add Coiled and Dask logos to the NBs, something like here? You can find Coiled logos here.

    • [x] In this NB, it would be great if you could add reminders about features and target variables in ML, training and test sets, train/test split, and cross-validation. We won't need too much on each, just a refresher (see the sketch below).

    Feel free to use anything I've written here for this.
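
    Along those lines, here is a minimal refresher sketch of those concepts (features/target variable, train/test split, cross-validation) using scikit-learn's built-in iris data; the notebook's actual dataset and models may differ.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    # X holds the features, y holds the target variable
    X, y = load_iris(return_X_y=True)

    # Hold out a test set to estimate performance on unseen data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))

    # Cross-validation: repeated train/validate splits within the training data
    scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
    print("mean CV accuracy:", scores.mean())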

    It may be clear, but NB 4b will likely be the only ML NB, and I may drop 4a.

    opened by hugobowne 5
  • Dask dashboards missing locally

    I've followed instructions from here to try to get this repo up and running locally.

    This is what I did from the readme:

    conda env create -f binder/environment.yml 
    conda activate data-science-at-scale
    jupyter labextension install @jupyter-widgets/jupyterlab-manager
    jupyter labextension install @bokeh/jupyter_bokeh
    

    then jupyter lab

    It all went smoothly but there are no dashboards when I open the notebooks (see screenshot).

    Any ideas, @jrbourbeau?

    [Screenshot: Screen Shot 2020-09-07 at 10 27 26 am]
    opened by hugobowne 4
  • Dask dashboards not working for Coiled cluster

    When I execute the Coiled part of our overview notebook, the Dask dashboarding doesn't seem to work.

    The relevant section of Jupyter Lab is greyed out like this.

    [Screenshot: Screen Shot 2020-10-27 at 3 21 32 pm]

    any thoughts, @jrbourbeau ?

    opened by hugobowne 2
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.
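
    For context, here is a hedged sketch of the kind of check such a patch performs (illustrative only, not the exact code in the pull request): every member's resolved path is verified to stay inside the destination directory before extraction.

    import os
    import tarfile

    def safe_extractall(tar: tarfile.TarFile, path: str = ".") -> None:
        # Refuse to extract if any member would escape the destination directory
        dest = os.path.realpath(path)
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(dest, member.name))
            if os.path.commonpath([dest, target]) != dest:
                raise Exception("Attempted path traversal in tar file")
        tar.extractall(path)

    # "archive.tar.gz" is a placeholder filename for illustration
    with tarfile.open("archive.tar.gz") as tar:
        safe_extractall(tar, "data/")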

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 1
  • Import of LocalCluster not needed in coiled sec

    In the section Multi-machine parallelism in the cloud with Coiled of 3-machine-learning.ipynb, there is currently an import that includes LocalCluster, which is not used/needed in that section and can cause confusion.
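
    For reference, a hedged sketch of what that section's setup could look like without LocalCluster (names and arguments assumed; the notebook's exact code may differ):

    import coiled
    from dask.distributed import Client

    # Spin up a multi-machine cluster in the cloud and connect a client to it
    cluster = coiled.Cluster(n_workers=4)
    client = Client(cluster)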

    cc: @pavithraes

    opened by ncclementi 1
  • Minor edits

    This PR:

    • Updates the dask components image used in the overview notebook
    • Removes unnecessary code cells like creating a Coiled software environment / cluster configuration

    cc @hugobowne for thoughts

    opened by jrbourbeau 1
  • Fix data download prep

    This PR ensures that the data download prep.py script can run successfully. It looks like only the flights dataset is being used in the tutorial, @hugobowne is that correct or am I missing something?

    opened by jrbourbeau 1
  • Dask notes to include

    Good Dask notes from @adbreind that I would like to include in this tutorial:

    About Dask

    Dask was created in 2014 as part of the Blaze project, a DARPA-funded project at Continuum/Anaconda. It has since grown into a multi-institution community project with developers from projects including NumPy, Pandas, Jupyter, and Scikit-Learn. Many of the core Dask maintainers are employed to work on the project by companies including Continuum/Anaconda, Prefect, NVIDIA, Capital One, Saturn Cloud, and Coiled.

    Fundamentally, Dask enables a variety of parallel workflows using existing Python constructs, patterns, and libraries, including dataframes, arrays (scaling out NumPy), bags (an unordered collection construct, a bit like Counter), and concurrent.futures.
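
    To make those constructs concrete, here is a minimal sketch (assuming only that dask and pandas are installed) of a few of the collections mentioned above, plus dask.delayed for building lazy task graphs:

    import dask
    import dask.bag as db
    import dask.dataframe as dd
    import pandas as pd

    # Dask DataFrame: a lazy, partitioned pandas DataFrame
    pdf = pd.DataFrame({"x": range(10), "y": range(10)})
    ddf = dd.from_pandas(pdf, npartitions=2)
    print(ddf.x.mean().compute())       # 4.5

    # Dask Bag: an unordered collection, a bit like a parallel list/Counter
    bag = db.from_sequence(["a", "b", "a", "c"], npartitions=2)
    print(bag.frequencies().compute())  # e.g. [('a', 2), ('b', 1), ('c', 1)]

    # dask.delayed: wrap ordinary Python functions into a lazy task graph
    @dask.delayed
    def inc(i):
        return i + 1

    total = dask.delayed(sum)([inc(i) for i in range(5)])
    print(total.compute())              # 15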

    In addition to working in conjunction with Python ecosystem tools, Dask's extremely low scheduling overhead (nanoseconds in some cases) allows it to work well even on single machines and to scale up smoothly.

    Dask supports a variety of use cases for industry and research: https://stories.dask.org/en/latest/

    With its recent 2.x releases and its integration with other projects (e.g., RAPIDS for GPU computation), many commercial enterprises are paying attention and jumping into parallel Python with Dask.

    Dask Ecosystem

    In addition to the core Dask library and its Distributed scheduler, the Dask ecosystem connects several additional initiatives, including...

    • Dask ML - parallel machine learning, with a scikit-learn-style API
    • Dask-kubernetes
    • Dask-XGBoost
    • Dask-YARN
    • Dask-image
    • Dask-cuDF
    • ... and some others

    What's Not Part of Dask?

    There are lots of capabilities that integrate with Dask but are not part of the core Dask ecosystem, including...

    • a SQL engine
    • data storage
    • data catalog
    • visualization
    • coarse-grained scheduling / orchestration
    • streaming

    ... although there are typically other Python packages that fill these needs (e.g., Kartothek or Intake for a data catalog).

    How Do We Set Up and/or Deploy Dask?

    The easiest way to install Dask is with Anaconda: conda install dask

    Schedulers and Clustering

    Dask has a simple default scheduler called the "single-machine scheduler" -- this is the scheduler that's used if you import dask and start running code without explicitly using a Client object. It can be handy for quick-and-dirty testing, but I would (warning! opinion!) suggest that a best practice is to use the newer "distributed scheduler" even for single-machine workloads.

    The distributed scheduler can work with

    • threads (although that is often not a great idea due to the GIL) in one process
    • multiple processes on one machine
    • multiple processes on multiple machines

    The distributed scheduler has additional useful features, including data locality awareness and real-time graphical dashboards.
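
    As a minimal sketch of that recommendation (assuming dask and distributed are installed): creating a Client starts a local cluster of worker processes and exposes the dashboard.

    from dask.distributed import Client

    # A local "cluster" of two worker processes, one thread each
    client = Client(n_workers=2, threads_per_worker=1)
    print(client.dashboard_link)   # URL of the real-time dashboard

    # Work submitted through the client runs on the workers
    future = client.submit(sum, range(100))
    print(future.result())         # 4950

    client.close()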

    opened by hugobowne 1
  • Tutorial structure

    I think I've got the structure down pretty well in this commit.

    Interested in @jrbourbeau's thoughts:

    • NB 1: motivating Dask with the NYC taxi example: a dataset too big for memory, so we do basic analytics on a Dask dataframe (see the sketch after this list)
    • NB 2a/2b: diving into Dask dataframes
    • NB3: diving into parallelization with Dask delayed
    • NB 4a/4b: Scalable machine learning -- ideally we could have a simple example of using Dask for CPU-bound ML & RAM-bound ML (TBD with @jrbourbeau)
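
    As a hedged sketch of the NB 1 idea (the file path and column name here are placeholders; the notebook's actual data source may differ):

    import dask.dataframe as dd

    # Read many CSV partitions lazily -- the full dataset never has to fit in memory
    df = dd.read_csv("data/nyc-taxi-2019-*.csv")   # placeholder path

    # Basic analytics run per partition and the results are aggregated
    print(df.npartitions)
    print(df["tip_amount"].mean().compute())       # "tip_amount" is an assumed column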
    opened by hugobowne 1
  • Dask DataFrame example

    Consider starting session with this example:

    https://github.com/coiled/coiled-examples/blob/master/pandas-dask-coiled.ipynb

    good for motivation and it's dataframes!

    opened by hugobowne 1
  • Attribute sources

    If you end up using content from the following repos, attribute them:

    https://github.com/adbreind/dask-mini-2019 https://github.com/dask/dask-tutorial

    opened by hugobowne 1
  • Add env small for binder

    If we check prep.py, it looks for the environment variable DASK_TUTORIAL_SMALL. The idea behind this is to use it when the tutorial is run on Binder: the variable would actually be set when launching Binder, so that a smaller subset of the data is used instead of the whole dataset.

    For comparison, this is how it's set in the main dask-tutorial. https://github.com/dask/dask-tutorial/blob/main/binder/start
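
    A hedged sketch of how a prep script can branch on that environment variable (the actual prep.py may differ; the sizes are made up for illustration):

    import os

    # Binder would set DASK_TUTORIAL_SMALL at launch; locally it is normally unset
    SMALL = os.environ.get("DASK_TUTORIAL_SMALL", "").lower() in ("1", "true", "yes")

    n_rows = 10_000 if SMALL else 10_000_000
    print(f"Preparing the {'small' if SMALL else 'full'} dataset ({n_rows} rows)")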

    cc: @pavithraes

    opened by ncclementi 0
  • Old logo and broken URLs in notebook 01-data-analysis-at-scale.ipynb

    The "Dask in the Cloud" blog links to the notebook 01-data-analysis-at-scale.ipynb. I noticed a few issues with this notebook:

    1. It contains the old Coiled logo
    2. Three of the URLs (to Coiled and Dask) don't work.

    Intro: "Coiled" and "free Beta here"

    Section 2: "Dask"

    Section 3: "Coiled"

    opened by rrpelgrim 0
  • Write to AWS

    I need to be able to write to an s3 bucket.

    @necaris is going to help. thanks, Rami!

    You can see the current error I get here:

    
    
    ---------------------------------------------------------------------------
    NoCredentialsError                        Traceback (most recent call last)
    <timed eval> in <module>
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/dask/dataframe/core.py in to_parquet(self, path, *args, **kwargs)
       3947         from .io import to_parquet
       3948 
    -> 3949         return to_parquet(self, path, *args, **kwargs)
       3950 
       3951     @derived_from(pd.DataFrame)
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py in to_parquet(df, path, engine, compression, write_index, append, ignore_divisions, partition_on, storage_options, write_metadata_file, compute, compute_kwargs, schema, **kwargs)
        461     # Engine-specific initialization steps to write the dataset.
        462     # Possibly create parquet metadata, and load existing stuff if appending
    --> 463     meta, schema, i_offset = engine.initialize_write(
        464         df,
        465         fs,
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py in initialize_write(df, fs, path, append, partition_on, ignore_divisions, division_info, schema, index_cols, **kwargs)
        876         if append and division_info is None:
        877             ignore_divisions = True
    --> 878         fs.mkdirs(path, exist_ok=True)
        879 
        880         if append:
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/fsspec/spec.py in mkdirs(self, path, exist_ok)
       1016     def mkdirs(self, path, exist_ok=False):
       1017         """Alias of :ref:`FilesystemSpec.makedirs`."""
    -> 1018         return self.makedirs(path, exist_ok=exist_ok)
       1019 
       1020     def listdir(self, path, detail=True, **kwargs):
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/s3fs/core.py in makedirs(self, path, exist_ok)
        458 
        459     def makedirs(self, path, exist_ok=False):
    --> 460         self.mkdir(path, create_parents=True)
        461 
        462     async def _rmdir(self, path):
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
         98     def wrapper(*args, **kwargs):
         99         self = obj or args[0]
    --> 100         return maybe_sync(func, self, *args, **kwargs)
        101 
        102     return wrapper
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/fsspec/asyn.py in maybe_sync(func, self, *args, **kwargs)
         78         if inspect.iscoroutinefunction(func):
         79             # run the awaitable on the loop
    ---> 80             return sync(loop, func, *args, **kwargs)
         81         else:
         82             # just call the blocking function
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/fsspec/asyn.py in sync(loop, func, callback_timeout, *args, **kwargs)
         49     if error[0]:
         50         typ, exc, tb = error[0]
    ---> 51         raise exc.with_traceback(tb)
         52     else:
         53         return result[0]
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/fsspec/asyn.py in f()
         33             if callback_timeout is not None:
         34                 future = asyncio.wait_for(future, callback_timeout)
    ---> 35             result[0] = await future
         36         except Exception:
         37             error[0] = sys.exc_info()
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/s3fs/core.py in _mkdir(self, path, acl, create_parents, **kwargs)
        444                         'LocationConstraint': region_name
        445                     }
    --> 446                 await self.s3.create_bucket(**params)
        447                 self.invalidate_cache('')
        448                 self.invalidate_cache(bucket)
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/aiobotocore/client.py in _make_api_call(self, operation_name, api_params)
         89             http, parsed_response = event_response
         90         else:
    ---> 91             http, parsed_response = await self._make_request(
         92                 operation_model, request_dict, request_context)
         93 
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/aiobotocore/client.py in _make_request(self, operation_model, request_dict, request_context)
        110                             request_context):
        111         try:
    --> 112             return await self._endpoint.make_request(operation_model,
        113                                                      request_dict)
        114         except Exception as e:
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/aiobotocore/endpoint.py in _send_request(self, request_dict, operation_model)
        224     async def _send_request(self, request_dict, operation_model):
        225         attempts = 1
    --> 226         request = self.create_request(request_dict, operation_model)
        227         context = request_dict['context']
        228         success_response, exception = await self._get_response(
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/botocore/endpoint.py in create_request(self, params, operation_model)
        113                 service_id=service_id,
        114                 op_name=operation_model.name)
    --> 115             self._event_emitter.emit(event_name, request=request,
        116                                      operation_name=operation_model.name)
        117         prepared_request = self.prepare_request(request)
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/botocore/hooks.py in emit(self, event_name, **kwargs)
        354     def emit(self, event_name, **kwargs):
        355         aliased_event_name = self._alias_event_name(event_name)
    --> 356         return self._emitter.emit(aliased_event_name, **kwargs)
        357 
        358     def emit_until_response(self, event_name, **kwargs):
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/botocore/hooks.py in emit(self, event_name, **kwargs)
        226                  handlers.
        227         """
    --> 228         return self._emit(event_name, kwargs)
        229 
        230     def emit_until_response(self, event_name, **kwargs):
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/botocore/hooks.py in _emit(self, event_name, kwargs, stop_on_response)
        209         for handler in handlers_to_call:
        210             logger.debug('Event %s: calling handler %s', event_name, handler)
    --> 211             response = handler(**kwargs)
        212             responses.append((handler, response))
        213             if stop_on_response and response is not None:
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/botocore/signers.py in handler(self, operation_name, request, **kwargs)
         88         # this method is invoked to sign the request.
         89         # Don't call this method directly.
    ---> 90         return self.sign(operation_name, request)
         91 
         92     def sign(self, operation_name, request, region_name=None,
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/botocore/signers.py in sign(self, operation_name, request, region_name, signing_type, expires_in, signing_name)
        155                     raise e
        156 
    --> 157             auth.add_auth(request)
        158 
        159     def _choose_signer(self, operation_name, signing_type, context):
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/botocore/auth.py in add_auth(self, request)
        423         self._region_name = signing_context.get(
        424             'region', self._default_region_name)
    --> 425         super(S3SigV4Auth, self).add_auth(request)
        426 
        427     def _modify_request_before_signing(self, request):
    
    ~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/botocore/auth.py in add_auth(self, request)
        355     def add_auth(self, request):
        356         if self.credentials is None:
    --> 357             raise NoCredentialsError
        358         datetime_now = datetime.datetime.utcnow()
        359         request.context['timestamp'] = datetime_now.strftime(SIGV4_TIMESTAMP)
    
    NoCredentialsError: Unable to locate credentials
    
    distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client
    _GatheringFuture exception was never retrieved
    future: <_GatheringFuture finished exception=CancelledError()>
    asyncio.exceptions.CancelledError
    
    
    
    opened by hugobowne 4
  • Small datasets on binder

    If learners use binder and thus need smaller versions of the data, use the same method as here: https://github.com/dask/dask-tutorial

    well, here: https://github.com/dask/dask-tutorial/blob/master/prep.py

    opened by hugobowne 0