A Pythonic introduction to methods for scaling your data science and machine learning work to larger datasets and larger models, using the tools and APIs you know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

Coiled

Last update: Nov 10, 2022

Related tags

Data Analysis data-science-at-scale

Overview

Note: This repository is currently a work in progress. If you are joining for any given tutorial, please make sure to clone // pull the repository 2 hours before the tutorial begins.

Material for any given tutorial will be in the notebooks directory: for example, material for the Data Umbrella & PyLadies NYC tutorial on October 27 is in a subdirectory of /notebooks called /data-umbrella-2020-10-27.

Data Science At Scale

This tutorial's purpose is to introduce Pythonistas to methods for scaling their data science and machine learning work to larger datasets and larger models, using the tools and APIs they know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

Prerequisites

Not a lot. It would help if you knew

programming fundamentals and the basics of the Python programming language (e.g., variables, for loops);
a bit about pandas, numpy, and scikit-learn (although not strictly necessary);
a bit about Jupyter Notebooks;
your way around the terminal/shell.

However, I have always found that the most important prerequisite is a will to learn new things, so if you have this quality, you'll definitely get something out of this code-along session.

If you'd rather watch than code along, you'll still have a great time, and the notebooks will be downloadable afterwards.

If you are going to code along and use the Anaconda distribution of Python 3 (see below), I ask that you install it before the session.

Getting set up computationally

The first option is to click on the Binder badge above. This will spin up the necessary computational environment for you so you can write and execute Python code from the comfort of your browser. Binder is a free service. Due to this, the resources are not guaranteed, though they usually work well. If you want as close to a guarantee as possible, follow the instructions below to set up your computational environment locally (that is, on your own computer). Note that Binder will not work for all of the notebooks, particularly when we spin up Coiled Cloud. For these, you can follow along or set up your local environment as detailed below.

1. Clone the repository

To get set up for this live coding session, clone this repository. You can do so by executing the following in your terminal:

git clone https://github.com/coiled/data-science-at-scale

Alternatively, you can download the zip file of the repository at the top of the main page of the repository. If you prefer not to use git or don't have experience with it, this is a good option.

2. Download Anaconda (if you haven't already)

If you do not already have the Anaconda distribution of Python 3, go get it (n.b., you can also do this without Anaconda by using pip to install the required packages; however, Anaconda is great for data science and I encourage you to use it).

3. Create your conda environment for this session

Navigate to the relevant directory data-science-at-scale and install required packages in a new conda environment:

conda env create -f binder/environment.yml

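This will create a new environment called data-science-at-scale. To activate the environment, execute

conda activate data-science-at-scale

This works on OSX, Linux, and Windows with conda 4.4 or later. On older conda versions, use source activate data-science-at-scale on OSX/Linux and activate data-science-at-scale on Windows.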

Then execute the following to get all the great Jupyter // Bokeh // Dask dashboarding tools.

jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install @bokeh/jupyter_bokeh
jupyter labextension install dask-labextension

4. Open your Jupyter Lab

In the terminal, execute jupyter lab.

Then open the notebook 0-overview.ipynb in the relevant subdirectory of /notebooks and we're ready to get coding. Enjoy.
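Before opening the notebooks, it can help to confirm that the activated environment actually contains the libraries the tutorial relies on. The following is a minimal sketch using only the standard library. The package list is an assumption based on the libraries named in this README (the contents of binder/environment.yml are not shown here), and missing_packages is a hypothetical helper, not part of the repository.

```python
from importlib.util import find_spec

# Assumed list of importable module names; adjust to match binder/environment.yml.
ASSUMED_PACKAGES = ["numpy", "pandas", "sklearn", "dask", "distributed"]


def missing_packages(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(ASSUMED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All assumed packages are importable.")
```

If anything is reported missing, the most likely cause is that the data-science-at-scale environment is not the one currently activated.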

Comments

Make NBs production (training!) ready
I've got the 1st NB for this tutorial at a place I'm happy with (final Coiled error aside): https://github.com/coiled/data-science-at-scale/blob/master/01-data-analysis-at-scale.ipynb

As discussed, @davidventuri, if you could use this as inspiration for filling out text, code comments (and images, if you see fit), in the other NBs, that would be great.

The remaining NBs I would like you to prioritize in the following order:

[x] 02a-scalable-dataframes-lab.ipynb

[x] 04b-scalable-machine-learning-advanced.ipynb (feel free to add stuff from the great posts you've written with us)

[x] 03-parallelization-basics.ipynb

[x] 02b-scalable-dataframes-lab.ipynb

[ ] 03a-parallelization-basics.ipynb

[ ] 04a-scalable-machine-learning.ipynb (not necessary at the moment; only do this after everything else, time permitting)

A few things I've done here that will be needed in the other NBs:

Listing what we plan to do

Recap

Mentioning Coiled &/or Beta but not in a salesy way, merely to provide context

Enough code comments to give context but nothing over the top

Don't edit any code but do raise issues if you think there's something funky.

NBs 3a, 3b, and 4a are from here and need to be credited as such. I think you'll likely edit much of their text, so feel free to add something like "This material riffs off ..."

Two more things:

[x] Could you add Coiled and Dask logos to the NBs, something like here? You can find Coiled logos here.

[x] In this NB, it would be great if you could add reminders about features and target variables in ML, training and test sets, train test split, and cross validation. We won't need too much about each, just a refresher.

Feel free to use anything I've written here for this.

It may be clear, but NB4b will likely be the only ML NB and I may drop 4a.
opened by hugobowne 5
Dask dashboards missing locally
I've followed instructions from here to try to get this repo up and running locally.

This is what I did from the readme:

conda env create -f binder/environment.yml
conda activate data-science-at-scale
jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install @bokeh/jupyter_bokeh

then jupyter lab

It all went smoothly but there are no dashboards when I open the notebooks (see screenshot).

Any ideas, @jrbourbeau?
opened by hugobowne 4
Dask dashboards not working for Coiled cluster

When I execute the Coiled part of our overview notebook, the Dask dashboarding doesn't seem to work.

The relevant section of Jupyter Lab is greyed out like this.

any thoughts, @jrbourbeau ?

opened by hugobowne 2
CVE-2007-4559 Patch

Patching CVE-2007-4559

Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

If you have further questions you may contact us through this project's lead researcher Kasimir Schulz.
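The kind of check described above can be sketched as follows. This is an illustrative reconstruction and not the actual patch from the pull request: is_within_directory and safe_extractall are hypothetical names, and the sketch only guards against member names that resolve outside the target directory (it does not inspect symlink targets). Recent Python versions also offer built-in extraction filters on extractall(), which may be preferable where available.

```python
import os
import tarfile


def is_within_directory(directory, target):
    """Return True if `target` resolves to a path inside `directory`."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extractall(tar, path="."):
    """Extract all members of an open TarFile, refusing any that escape `path`."""
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(f"Blocked path traversal in tar member: {member.name}")
    # Only reached when every member name stays inside `path`.
    tar.extractall(path)
```

A call site would then replace tar.extractall(path) with safe_extractall(tar, path), so that an archive containing a member such as ../evil.txt raises an exception before anything is written.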

opened by TrellixVulnTeam 1
Import of LocalCluster not needed in coiled sec

In the section "Multi-machine parallelism in the cloud with Coiled" of 3-machine-learning.ipynb, there is currently an import that includes LocalCluster, which is not used or needed in that section and can cause confusion.

cc: @pavithraes

opened by ncclementi 1
Minor edits
This PR:

Updates the dask components image used in the overview notebook

Removes unnecessary code cells like creating a Coiled software environment / cluster configuration

cc @hugobowne for thoughts
opened by jrbourbeau 1
Fix data download prep

This PR ensures that the data download prep.py script can run successfully. It looks like only the flights dataset is being used in the tutorial, @hugobowne is that correct or am I missing something?

opened by jrbourbeau 1
Dask notes to include
Good Dask notes from @adbreind that I would like to include in this tutorial:

About Dask

Dask was created in 2014 as part of the Blaze project, a DARPA funded project at Continuum/Anaconda. It has since grown into a multi-institution community project with developers from projects including NumPy, Pandas, Jupyter and Scikit-Learn. Many of the core Dask maintainers are employed to work on the project by companies including Continuum/Anaconda, Prefect, NVIDIA, Capital One, Saturn Cloud and Coiled.

Fundamentally, Dask allows a variety of parallel workflows using existing Python constructs, patterns, or libraries, including dataframes, arrays (scaling out NumPy), bags (an unordered collection construct a bit like Counter), and concurrent.futures.
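To make the concurrent.futures pattern concrete, here is a minimal sketch using only the standard library, not Dask itself. The function word_count and the sample data are made up for illustration. With dask.distributed, Client.submit plays a role analogous to pool.submit here and also returns futures.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(line):
    """Count whitespace-separated words in one line of text."""
    return len(line.split())


# Illustrative input; any list of independent work items would do.
lines = ["dask scales python", "pandas numpy scikit-learn", "one"]

with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() schedules each call and immediately returns a Future.
    futures = [pool.submit(word_count, line) for line in lines]
    # result() blocks until the corresponding call has finished.
    counts = [future.result() for future in futures]

print(counts)  # [3, 3, 1]
```

The appeal of this pattern is that the calling code stays the same whether the work runs in local threads or on a cluster; only the executor changes.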

In addition to working in conjunction with Python ecosystem tools, Dask's low scheduling overhead (roughly tens of microseconds to about a millisecond per task, depending on the scheduler) allows it to work well even on single machines and to scale up smoothly.

Dask supports a variety of use cases for industry and research: https://stories.dask.org/en/latest/

With its recent 2.x releases, and integration with other projects (e.g., RAPIDS for GPU computation), many commercial enterprises are paying attention and jumping in to parallel Python with Dask.

Dask Ecosystem

In addition to the core Dask library and its Distributed scheduler, the Dask ecosystem connects several additional initiatives, including...

Dask ML - parallel machine learning, with a scikit-learn-style API

Dask-kubernetes

Dask-XGBoost

Dask-YARN

Dask-image

Dask-cuDF

... and some others

What's Not Part of Dask?

There are lots of functions that integrate with Dask, but are not represented in the core Dask ecosystem, including...

a SQL engine

data storage

data catalog

visualization

coarse-grained scheduling / orchestration

streaming

... although there are typically other Python packages that fill these needs (e.g., Kartothek or Intake for a data catalog).

How Do We Set Up and/or Deploy Dask?

The easiest way to install Dask is with Anaconda: conda install dask

Schedulers and Clustering

Dask has a simple default scheduler called the "single machine scheduler" -- this is the scheduler that's used if you import dask and start running code without explicitly using a Client object. It can be handy for quick-and-dirty testing, but I would (warning! opinion!) suggest that a best practice is to use the newer "distributed scheduler" even for single-machine workloads.

The distributed scheduler can work with

threads (although that is often not a great idea due to the GIL) in one process

multiple processes on one machine

multiple processes on multiple machines

The distributed scheduler has additional useful features including data locality awareness and realtime graphical dashboards.
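The difference between the two schedulers can be sketched as follows, assuming dask (and, for the last function, dask.distributed) is installed. The task graph and the names square and run_on_distributed are made up for illustration. The first part runs on the single-machine scheduler; the function at the end shows one way to route the same computation through the distributed scheduler using threads in a single process.

```python
import dask


@dask.delayed
def square(x):
    """A tiny task; calling it builds a graph node instead of computing."""
    return x * x


# Build a lazy task graph: sum of the squares of 0..4.
total = dask.delayed(sum)([square(i) for i in range(5)])

# No Client has been created, so these run on the single-machine scheduler.
result_threads = total.compute(scheduler="threads")
result_sync = total.compute(scheduler="synchronous")  # single-threaded, handy for debugging
print(result_threads, result_sync)  # 30 30


def run_on_distributed():
    """Run the same graph on the distributed scheduler (threads, one process)."""
    from dask.distributed import Client

    # While a Client is active, compute() uses it by default.
    with Client(processes=False):
        return total.compute()
```

Calling Client() with no arguments would instead start multiple worker processes on the local machine, and passing a scheduler address connects to a multi-machine cluster.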
opened by hugobowne 1
Tutorial structure
I think I've got the structure down pretty well in this commit.

Interested in @jrbourbeau's thoughts:

NB 1: motivating Dask with NYC-taxi example: dataset too big for memory so doing basic analytics on a Dask dataframe

NB 2a/2b: diving into Dask dataframes

NB3: diving into parallelization with Dask delayed

NB 4a/4b: Scalable machine learning -- ideally we could have a simple example of using Dask for CPU-bound ML & RAM-bound ML (TBD with @jrbourbeau)
opened by hugobowne 1
Dask DataFrame example

Consider starting session with this example:

https://github.com/coiled/coiled-examples/blob/master/pandas-dask-coiled.ipynb

good for motivation and it's dataframes!

opened by hugobowne 1
Attribute sources

If you end up using content from the following repos, attribute them:

https://github.com/adbreind/dask-mini-2019 https://github.com/dask/dask-tutorial

opened by hugobowne 1
Add env small for binder

prep.py checks for the environment variable DASK_TUTORIAL_SMALL. The idea is to use it when the tutorial runs on Binder: this change sets the variable when Binder launches, so that a smaller version of the data is used instead of the whole dataset.

For comparison, this is how it's set in the main dask-tutorial. https://github.com/dask/dask-tutorial/blob/main/binder/start
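The environment-variable switch can be sketched like this. It is a hedged illustration, not the actual contents of prep.py: use_small_data and flights_row_limit are hypothetical helpers, the row limit is an arbitrary placeholder, and treating "0", "false", and "no" as off is an assumption of this sketch.

```python
import os


def use_small_data(environ=None):
    """Return True when DASK_TUTORIAL_SMALL is set to a truthy-looking value."""
    env = os.environ if environ is None else environ
    value = env.get("DASK_TUTORIAL_SMALL", "")
    # Assumption: empty, "0", "false", and "no" all mean "use the full data".
    return value.strip().lower() not in ("", "0", "false", "no")


def flights_row_limit(environ=None, small_rows=10_000):
    """Return a row limit for the small dataset, or None for the full dataset."""
    return small_rows if use_small_data(environ) else None
```

A Binder start script would then only need to export DASK_TUTORIAL_SMALL=1 before the data preparation step runs.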

cc: @pavithraes

opened by ncclementi 0
Old logo and broken URLs in notebook 01-data-analysis-at-scale.ipynb
The "Dask in the Cloud" blog links to the notebook 01-data-analysis-at-scale.ipynb. I noticed a few issues with this notebook:

It contains the old Coiled logo

3 of the URLs (to Coiled and Dask) don't function.

Intro: "Coiled" and "free Beta here"

Section 2: "Dask"

Section 3: "Coiled"
opened by rrpelgrim 0

Write to AWS

I need to be able to write to an s3 bucket.

@necaris is going to help. thanks, Rami!

you can see current error I get here:



---------------------------------------------------------------------------
NoCredentialsError                        Traceback (most recent call last)
<timed eval> in <module>

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/dask/dataframe/core.py in to_parquet(self, path, *args, **kwargs)
   3947         from .io import to_parquet
   3948 
-> 3949         return to_parquet(self, path, *args, **kwargs)
   3950 
   3951     @derived_from(pd.DataFrame)

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py in to_parquet(df, path, engine, compression, write_index, append, ignore_divisions, partition_on, storage_options, write_metadata_file, compute, compute_kwargs, schema, **kwargs)
    461     # Engine-specific initialization steps to write the dataset.
    462     # Possibly create parquet metadata, and load existing stuff if appending
--> 463     meta, schema, i_offset = engine.initialize_write(
    464         df,
    465         fs,

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/dask/dataframe/io/parquet/arrow.py in initialize_write(df, fs, path, append, partition_on, ignore_divisions, division_info, schema, index_cols, **kwargs)
    876         if append and division_info is None:
    877             ignore_divisions = True
--> 878         fs.mkdirs(path, exist_ok=True)
    879 
    880         if append:

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/fsspec/spec.py in mkdirs(self, path, exist_ok)
   1016     def mkdirs(self, path, exist_ok=False):
   1017         """Alias of :ref:`FilesystemSpec.makedirs`."""
-> 1018         return self.makedirs(path, exist_ok=exist_ok)
   1019 
   1020     def listdir(self, path, detail=True, **kwargs):

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/s3fs/core.py in makedirs(self, path, exist_ok)
    458 
    459     def makedirs(self, path, exist_ok=False):
--> 460         self.mkdir(path, create_parents=True)
    461 
    462     async def _rmdir(self, path):

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/fsspec/asyn.py in wrapper(*args, **kwargs)
     98     def wrapper(*args, **kwargs):
     99         self = obj or args[0]
--> 100         return maybe_sync(func, self, *args, **kwargs)
    101 
    102     return wrapper

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/fsspec/asyn.py in maybe_sync(func, self, *args, **kwargs)
     78         if inspect.iscoroutinefunction(func):
     79             # run the awaitable on the loop
---> 80             return sync(loop, func, *args, **kwargs)
     81         else:
     82             # just call the blocking function

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/fsspec/asyn.py in sync(loop, func, callback_timeout, *args, **kwargs)
     49     if error[0]:
     50         typ, exc, tb = error[0]
---> 51         raise exc.with_traceback(tb)
     52     else:
     53         return result[0]

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/fsspec/asyn.py in f()
     33             if callback_timeout is not None:
     34                 future = asyncio.wait_for(future, callback_timeout)
---> 35             result[0] = await future
     36         except Exception:
     37             error[0] = sys.exc_info()

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/s3fs/core.py in _mkdir(self, path, acl, create_parents, **kwargs)
    444                         'LocationConstraint': region_name
    445                     }
--> 446                 await self.s3.create_bucket(**params)
    447                 self.invalidate_cache('')
    448                 self.invalidate_cache(bucket)

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/aiobotocore/client.py in _make_api_call(self, operation_name, api_params)
     89             http, parsed_response = event_response
     90         else:
---> 91             http, parsed_response = await self._make_request(
     92                 operation_model, request_dict, request_context)
     93 

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/aiobotocore/client.py in _make_request(self, operation_model, request_dict, request_context)
    110                             request_context):
    111         try:
--> 112             return await self._endpoint.make_request(operation_model,
    113                                                      request_dict)
    114         except Exception as e:

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/aiobotocore/endpoint.py in _send_request(self, request_dict, operation_model)
    224     async def _send_request(self, request_dict, operation_model):
    225         attempts = 1
--> 226         request = self.create_request(request_dict, operation_model)
    227         context = request_dict['context']
    228         success_response, exception = await self._get_response(

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/botocore/endpoint.py in create_request(self, params, operation_model)
    113                 service_id=service_id,
    114                 op_name=operation_model.name)
--> 115             self._event_emitter.emit(event_name, request=request,
    116                                      operation_name=operation_model.name)
    117         prepared_request = self.prepare_request(request)

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/botocore/hooks.py in emit(self, event_name, **kwargs)
    354     def emit(self, event_name, **kwargs):
    355         aliased_event_name = self._alias_event_name(event_name)
--> 356         return self._emitter.emit(aliased_event_name, **kwargs)
    357 
    358     def emit_until_response(self, event_name, **kwargs):

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/botocore/hooks.py in emit(self, event_name, **kwargs)
    226                  handlers.
    227         """
--> 228         return self._emit(event_name, kwargs)
    229 
    230     def emit_until_response(self, event_name, **kwargs):

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/botocore/hooks.py in _emit(self, event_name, kwargs, stop_on_response)
    209         for handler in handlers_to_call:
    210             logger.debug('Event %s: calling handler %s', event_name, handler)
--> 211             response = handler(**kwargs)
    212             responses.append((handler, response))
    213             if stop_on_response and response is not None:

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/botocore/signers.py in handler(self, operation_name, request, **kwargs)
     88         # this method is invoked to sign the request.
     89         # Don't call this method directly.
---> 90         return self.sign(operation_name, request)
     91 
     92     def sign(self, operation_name, request, region_name=None,

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/botocore/signers.py in sign(self, operation_name, request, region_name, signing_type, expires_in, signing_name)
    155                     raise e
    156 
--> 157             auth.add_auth(request)
    158 
    159     def _choose_signer(self, operation_name, signing_type, context):

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/botocore/auth.py in add_auth(self, request)
    423         self._region_name = signing_context.get(
    424             'region', self._default_region_name)
--> 425         super(S3SigV4Auth, self).add_auth(request)
    426 
    427     def _modify_request_before_signing(self, request):

~/opt/anaconda3/envs/data-science-at-scale/lib/python3.8/site-packages/botocore/auth.py in add_auth(self, request)
    355     def add_auth(self, request):
    356         if self.credentials is None:
--> 357             raise NoCredentialsError
    358         datetime_now = datetime.datetime.utcnow()
    359         request.context['timestamp'] = datetime_now.strftime(SIGV4_TIMESTAMP)

NoCredentialsError: Unable to locate credentials

distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError
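The NoCredentialsError above means that botocore found no AWS credentials when signing the request. A quick way to see whether two of the most common credential sources are present is sketched below. find_aws_credential_sources is a hypothetical helper, and it checks only environment variables and the shared credentials file; botocore also consults other sources (for example, instance roles) that this sketch ignores.

```python
import os
from pathlib import Path


def find_aws_credential_sources(environ=None, home=None):
    """List which common AWS credential sources appear to be configured."""
    env = os.environ if environ is None else environ
    home_dir = Path.home() if home is None else Path(home)
    sources = []

    # Source 1: access key pair exported in the environment.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        sources.append("environment variables")

    # Source 2: shared credentials file, at its default or overridden location.
    default_file = home_dir / ".aws" / "credentials"
    cred_file = Path(env.get("AWS_SHARED_CREDENTIALS_FILE", default_file))
    if cred_file.is_file():
        sources.append(f"shared credentials file ({cred_file})")

    return sources
```

An empty list suggests that credentials still need to be configured. As the to_parquet signature in the traceback shows, a storage_options dictionary can also be passed; it is forwarded to the filesystem backend, which for s3fs accepts entries such as key and secret.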

opened by hugobowne 4

Small datasets on binder

If learners use binder and thus need smaller versions of the data, use the same method as here: https://github.com/dask/dask-tutorial

well, here: https://github.com/dask/dask-tutorial/blob/master/prep.py

opened by hugobowne 0

Owner

Coiled

Scalable Python with Dask

GitHub

Stream-Kafka-ELK-Stack - Weather data streaming using Apache Kafka and Elastic Stack.

Streaming Data Pipeline - Kafka + ELK Stack Streaming weather data using Apache Kafka and Elastic Stack. Data source: https://openweathermap.org/api O

2 Jan 20, 2022

NumPy and Pandas interface to Big Data

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar inte

3.1k Jan 5, 2023

Hidden Markov Models in Python, with scikit-learn like API

hmmlearn hmmlearn is a set of algorithms for unsupervised learning and inference of Hidden Markov Models. For supervised learning learning of HMMs and

2.7k Jan 3, 2023

Sensitivity Analysis Library in Python (Numpy). Contains Sobol, Morris, Fractional Factorial and FAST methods.

Sensitivity Analysis Library (SALib) Python implementations of commonly used sensitivity analysis methods. Useful in systems modeling to calculate the

663 Jan 5, 2023

Using Data Science with Machine Learning techniques (ETL pipeline and ML pipeline) to classify received messages after disasters.

1 Feb 11, 2022

Pandas and Dask test helper methods with beautiful error messages.

beavis Pandas and Dask test helper methods with beautiful error messages. test helpers These test helper methods are meant to be used in test suites.

18 Nov 28, 2022

Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Data Scientist Learning Plan Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

27 Nov 1, 2022

Improving your data science workflows with

Make Better Defaults Author: Kjell Wooding [email protected] This is the git repo for Makefiles: One great trick for making your conda environments mo

18 Dec 23, 2022

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set.

791 Jan 4, 2023

Using Python to scrape some basic player information from www.premierleague.com and then use Pandas to analyse said data.

PremiershipPlayerAnalysis Using Python to scrape some basic player information from www.premierleague.com and then use Pandas to analyse said data. No

5 Sep 6, 2021

A data analysis using python and pandas to showcase trends in school performance.

A data analysis using python and pandas to showcase trends in school performance. A data analysis to showcase trends in school performance using Panda

0 Sep 7, 2021

Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

2 May 26, 2022

Bearsql allows you to query pandas dataframe with sql syntax.

Bearsql adds sql syntax on pandas dataframe. It uses duckdb to speedup the pandas processing and as the sql engine

14 Jun 22, 2022

Python tools for querying and manipulating BIDS datasets.

PyBIDS is a Python library to centralize interactions with datasets conforming BIDS (Brain Imaging Data Structure) format.

180 Dec 18, 2022

Zipline, a Pythonic Algorithmic Trading Library

Zipline is a Pythonic algorithmic trading library. It is an event-driven system for backtesting. Zipline is currently used in production as the backte

15.7k Jan 7, 2023

🧪 Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

The purpose of the panel-chemistry project is to make it really easy for you to do DATA ANALYSIS and build powerful DATA AND VIZ APPLICATIONS within the domain of Chemistry using Python and HoloViz Panel.

97 Dec 8, 2022

Finds, downloads, parses, and standardizes public bikeshare data into a standard pandas dataframe format

Hatchet is a Python-based library that allows Pandas dataframes to be indexed by structured tree and graph data.

Toolchest provides APIs for scientific and bioinformatic data analysis.