Convert monolithic Jupyter notebooks into Ploomber pipelines.

Overview

Soorgeon

Join our community | Newsletter | Contact us | Blog | Website | YouTube

Video: soorgeon.mp4 (3-minute video tutorial)

Try the interactive demo:

Open JupyterLab

Note: Soorgeon is in alpha; help us make it better.

Install

pip install soorgeon

Usage

# refactor notebook
soorgeon refactor nb.ipynb

# all variables with the df prefix are stored in csv files
soorgeon refactor nb.ipynb --df-format csv
# all variables with the df prefix are stored in parquet files
soorgeon refactor nb.ipynb --df-format parquet

# store task output in 'some-directory' (if missing, this defaults to 'output')
soorgeon refactor nb.ipynb --product-prefix some-directory

# generate tasks in .py format
soorgeon refactor nb.ipynb --file-format py
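
After refactoring, you should end up with a pipeline.yaml (one task per H2 heading in the notebook) plus the corresponding task notebooks, with products stored under 'output' unless --product-prefix says otherwise. A rough sketch of the layout (file names come from your headings; placement may differ slightly):

pipeline.yaml        # Ploomber pipeline spec, one task per H2 heading
section-one.ipynb    # one task notebook per section
section-two.ipynb
output/              # task products (see --product-prefix)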

To learn more, check out our guide.

Examples

git clone https://github.com/ploomber/soorgeon

Exploratory data analysis notebook:

cd examples/exploratory
soorgeon refactor nb.ipynb

# to run the pipeline
pip install -r requirements.txt
ploomber build

Machine learning notebook:

cd examples/machine-learning
soorgeon refactor nb.ipynb

# to run the pipeline
pip install -r requirements.txt
ploomber build

To learn more, check out our guide.

Community

Issues and pull requests
  • A utility to ensure the notebook runs

    Before refactoring a notebook, the user must ensure that the original notebook runs. We should have a command that checks whether it works and suggests actions if there are errors.

    e.g.,

    soorgeon run nb.ipynb
    

    If the notebook fails because of a ModuleNotFoundError: suggest creating a virtualenv and adding a requirements.txt with the package name.

    If it's another error: show the guide for debugging notebooks.

    If it's a function signature mismatch: recommend downgrading some libraries.

    An alternative approach would be to let the user convert anyway and then help them fix the issues in the Ploomber-generated pipeline; this way they can leverage incremental builds for rapid iterations. We can also suggest adding a sample parameter.
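
    A minimal sketch of what such a check could look like, assuming papermill as the runner (check_notebook is a hypothetical helper, not part of soorgeon):

    # sketch: execute the notebook and map common failures to suggestions
    import papermill as pm

    def check_notebook(path):
        try:
            pm.execute_notebook(path, path)
        except Exception as exc:
            message = str(exc)
            if "ModuleNotFoundError" in message:
                print("Missing package: create a virtualenv and add it to requirements.txt")
            elif "unexpected keyword argument" in message or "positional argument" in message:
                print("Function signature mismatch: consider downgrading some libraries")
            else:
                print("See the guide for debugging notebooks")
            raise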

    opened by edublancas 23
  • Overriding variables

    (I think this is already covered, but I need to make sure and add a few tests.)

    Important: in these examples, comments are H2 headings (e.g., # I'm an H2 heading).

    We have to add two test cases:

    1. Overriding variables

    
    # first
    
    df = something()
    
    # second
    
    df = another()
    
    # third
    
    df_2 = df + 1
    

    When running soorgeon refactor, the output should contain three notebooks (first, second, third). third.ipynb must contain the following:

    # third
    
    df_2 = df + 1
    

    Now, third uses df as input. So we must ensure that soorgeon picks up the df from the second notebook and not the one from the first one. Graphically it should be:

      graph LR;
          first;
          second --> third;
    

    Overriding variables (same cell)

    # first
    df = something()
    
    # second
    df = another()
    df_2 = df + 1
    

    We need to ensure that in this case, the second task generated does not create a dependency with the first task, since it has its own df (i.e., it does not depend on the df = something() line).

    In other words, the output pipeline (produced by ploomber plot), should look like this (two independent tasks):

      graph LR;
          first;
          second;
    

    Instead of this (two dependent tasks):

      graph LR;
          first --> second;
    

    testing

    The simplest way to test it is to call soorgeon refactor and then load the pipeline to ensure we got the right structure:

    from ploomber.spec import DAGSpec
    from soorgeon import export
    
    nb = """
    # notebook content goes here
    """
    
    # refactor the notebook (_read is a test helper that parses the string into a notebook)
    export.from_nb(_read(nb))
    
    dag = DAGSpec('pipeline.yaml').to_dag().render()
    
    # write assert statements for each task in the pipeline
    # to get the task names, you can do:
    list(dag)
    
    # example: check that task_name only has one upstream dependency called 'get'
    assert set(dag['task_name'].upstream) == {'get'}
    
    # another example: check that another_task has no dependencies
    assert set(dag['another_task'].upstream) == set()
    
    good first issue 
    opened by edublancas 19
  • `soorgeon clean` command deletes existing output

    When running soorgeon clean, we convert the notebook to .py and then back to .ipynb; the problem is that cell outputs are lost. What we want instead is to keep the outputs as they were. There might be some challenges, since running isort may cause the cleaned notebook to have extra cells (and potentially running black, too), so we need to find a solution. The new Jupyter notebook format adds unique cell ids, so we can use those (if the notebook does not have cell ids, loading it with nbformat.read may add them). Worst case: we create our own cell ids.
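
    A minimal sketch of the id-based approach (illustrative only; it assumes the cells carry ids):

    # sketch: restore cell outputs from the original notebook by matching cell ids
    import nbformat

    original = nbformat.read("nb.ipynb", as_version=4)        # still has outputs
    cleaned = nbformat.read("nb-clean.ipynb", as_version=4)   # outputs lost during clean

    outputs = {c.get("id"): c.outputs
               for c in original.cells
               if c.cell_type == "code" and c.get("id")}

    for cell in cleaned.cells:
        if cell.cell_type == "code" and cell.get("id") in outputs:
            cell.outputs = outputs[cell.get("id")]

    nbformat.write(cleaned, "nb-clean.ipynb")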

    bug high priority med effort 
    opened by edublancas 18
  • Dealing with non-picklable objects

    Some objects aren't picklable (e.g., Jinja templates), so we should capture these errors at runtime and suggest fixes:

    1. Use another library like cloudpickle (this can be an option), e.g., soorgeon refactor notebook.ipynb --serializer cloudpickle (see the sketch after this list)
    2. Link to a guide for fixing the problem (e.g., create a factory to instantiate the object)
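
    A rough sketch of the fallback idea (serialize is an illustrative helper, not soorgeon's API):

    # sketch: fall back to cloudpickle when the standard pickle fails
    import pickle

    def serialize(obj, path):
        try:
            with open(path, "wb") as f:
                pickle.dump(obj, f)
        except (pickle.PicklingError, TypeError):
            import cloudpickle  # optional dependency
            with open(path, "wb") as f:
                cloudpickle.dump(obj, f)
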
    opened by edublancas 11
  • Display a warning if the notebook saves files outside the output/product folder

    It's possible that the original notebook already creates new files; we should show a warning if that's the case and tell the user to register them as products (and include a short guide explaining how to do it).
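
    A possible detection sketch using parso, the parser soorgeon already uses (the set of write calls is illustrative and the check is a heuristic):

    # sketch: warn when the notebook source appears to write files directly
    import parso

    WRITE_CALLS = {"open", "to_csv", "to_parquet", "savefig"}

    def find_possible_writes(code):
        found = set()
        leaf = parso.parse(code).get_first_leaf()
        while leaf is not None:
            if leaf.type == "name" and leaf.value in WRITE_CALLS:
                found.add(leaf.value)
            leaf = leaf.get_next_leaf()
        return found

    calls = find_possible_writes("df.to_csv('clean.csv')")
    if calls:
        print(f"Warning: the notebook may write files directly ({', '.join(sorted(calls))}); "
              "consider registering them as products")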

    opened by edublancas 10
  • making the CI more reliable

    making the CI more reliable

    We have a bunch of integration tests that fetch data and notebooks using Kaggle's API. However, they fail sometimes (probably because the Kaggle API isn't very reliable). See https://github.com/ploomber/soorgeon/issues/58

    One solution is to use the Kaggle API once and then upload the files to an S3 bucket, then have the CI download them from S3 instead of Kaggle.
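
    For example, the CI could pull the cached assets with boto3 (bucket and key names are illustrative):

    # sketch: download cached test assets from S3 instead of hitting the Kaggle API
    import boto3

    s3 = boto3.client("s3")
    s3.download_file("soorgeon-ci-assets", "notebooks/nb.ipynb", "nb.ipynb")
    s3.download_file("soorgeon-ci-assets", "data/train.csv", "train.csv")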

    opened by edublancas 9
  • apply black before refactoring

    I ran soorgeon refactor on this notebook: https://www.kaggle.com/code/cdeotte/xgboost-starter-0-793

    After fixing the headings and the function that uses global variables, I got this error:

      File "/Users/Edu/dev/soorgeon/src/soorgeon/io.py", line 515, in find_inputs_and_outputs_from_leaf
        (_, candidates_in, candidates_out) = find_for_loop_def_and_io(
      File "/Users/Edu/dev/soorgeon/src/soorgeon/io.py", line 104, in find_for_loop_def_and_io
        raise ValueError(f'Expected a node with type "for_stmt", '
    ValueError: Expected a node with type "for_stmt", got: <ExprStmt: df = df.merge(importances[k], on='feature', how='left')@4,25> with type expr_stmt
    Error: An error occurred when refactoring notebook.
    

    I realized it failed because of this for loop:

    for k in range(1,FOLDS): df = df.merge(importances[k], on='feature', how='left')
    

    I realized that applying black and rerunning it fixed the issue. That's an easy fix for many of these edge cases, so I think we should apply black on the notebook before running soorgeon refactor.
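
    A minimal sketch of that pre-formatting step (not soorgeon's implementation):

    # sketch: run black on every code cell before refactoring
    import black
    import nbformat

    nb = nbformat.read("nb.ipynb", as_version=4)

    for cell in nb.cells:
        if cell.cell_type == "code":
            try:
                cell.source = black.format_str(cell.source, mode=black.Mode())
            except black.InvalidInput:
                pass  # leave cells black can't parse (e.g., magics) unchanged

    nbformat.write(nb, "nb.ipynb")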

    Here's the notebook. Note that I changed the extension from ipynb to txt since GitHub won't let me upload it with an ipynb extension:

    xgboost-starter-0-793.txt

    low priority med effort 
    opened by edublancas 8
  • display warning when there are output files in task itself

    Describe your changes

    Parse the notebook code; if it contains a statement such as open, output a warning.

    Issue ticket number and link

    Closes #16

    Checklist before requesting a review

    • [x] I have performed a self-review of my code
    • [x] I have added thorough tests (when necessary).
    • [x] I have added the right documentation (when needed). Product update? If yes, write one line about this update.
    opened by Wxl19980214 8
  • refactored code, download from GitHub instead of Kaggle

    Describe your changes

    1. Instead of fetching from Kaggle, we fetch from GitHub via the PyGithub API now.
    2. Add _pygithub.py for downloading a directory from a repo (see the sketch after this list).
    3. Add the pygithub package to setup.py (I don't know if this one is necessary).
    4. Removed the notebooks stored under _kaggle since we can store them in the storage repo now.
    5. Removed unused functions under src/_kaggle.py.
    6. Removed the Kaggle username and key from the CI; I don't think we need them. Maybe we can also remove them from the repo secrets.
    7. The storage repo is here, and its README.md has instructions on how to upload.
    8. Contribute.md remains unchanged; we can let users register new notebooks in _kaggle/index.yaml, but the rest (downloading, uploading, updating the CI and test cases) will need to be done manually by one of the team members.
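
    A rough sketch of the download helper in _pygithub.py, assuming the PyGithub API (repo and path names are illustrative):

    # sketch: download a directory from a GitHub repo with PyGithub
    from pathlib import Path
    from github import Github

    def download_directory(repo_name, directory, dest="."):
        repo = Github().get_repo(repo_name)  # pass a token for higher rate limits
        for item in repo.get_contents(directory):
            if item.type == "dir":
                download_directory(repo_name, item.path, dest)
            else:
                target = Path(dest, item.path)
                target.parent.mkdir(parents=True, exist_ok=True)
                target.write_bytes(item.decoded_content)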

    To do

    1. We need to add a personal access token to the repo's secrets.
    2. Right now the CI will fail because test_notebooks[look-at-this-note-feature-engineering-is-easy] fails with: E soorgeon.exceptions.InputError: Only H1 headings are found. At this time, only H2 headings are supported. Check out our guide:

    Issue ticket number and link

    Closes #59 #39

    Checklist before requesting a review

    • [x] I have performed a self-review of my code
    • [x] I have added thorough tests (when necessary).
    • [x] I have added the right documentation (when needed). Product update? If yes, write one line about this update.
    opened by Wxl19980214 7
  • Dealing with non-picklable objects

    The PR tries to solve https://github.com/ploomber/soorgeon/issues/9

    1. Added a CLI option called serializer, which can be one of cloudpickle or dill
    2. Based on the serializer selected, soorgeon will try to pickle the objects with either cloudpickle or dill
    3. Both packages are optional dependencies
    opened by neelasha23 6
  • Refactor test command to use papermill to run notebooks

    Describe your changes

    • Refactor test command to use papermill to run notebooks
    • Add file existence check for output of executed notebook

    Issue ticket number and link

    Closes #6

    Checklist before requesting a review

    • [x] I have performed a self-review of my code
    • [x] I have added thorough tests (when necessary).
    • [x] I have added the right documentation (when needed). Product update? If yes, write one line about this update.
    opened by 94rain 5
  • Support using global variables inside a function's body

    Previously, we did not support using global variables inside a function's body:

            x = 1
            def sum(y):
                return x + y
            sum(y)
    

    The above would break since sum is using a variable that's defined outside the function's body. If a user tried to refactor a notebook with code like this using Soorgeon refactor, they would get an error message asking them to change the code to:

            x = 1
            def sum(y, x):
                return x + y
            sum(y, x)
    

    This PR automates this process and modifies the user's source code on their behalf so they don't have to do it manually.

    https://github.com/ploomber/soorgeon/issues/65

    Closes #65

    Checklist before requesting a review

    • [x] I have performed a self-review of my code
    • [x] I have added thorough tests (when necessary).
    • [ ] I have added the right documentation (when needed). Product update? If yes, write one line about this update.
    opened by rrhg 1
  • idea: have `soorgeon clean` auto-format SQL snippets

    Saw this on Stack Overflow: https://stackoverflow.com/questions/41046955/formatting-sql-query-inside-an-ipython-jupyter-notebook

    It could work both on cells that have:

    %%sql
    

    and markdown snippets
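
    One option is sqlparse; a minimal sketch for %%sql cells (format_sql_cell is illustrative, not part of soorgeon):

    # sketch: reformat the SQL body of a %%sql cell, leaving the magic line intact
    import sqlparse

    def format_sql_cell(source):
        magic, _, body = source.partition("\n")
        formatted = sqlparse.format(body, reindent=True, keyword_case="upper")
        return magic + "\n" + formatted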

    high priority low effort 
    opened by edublancas 0
  • dealing with global variables

    Currently, we do not support using global variables inside a function's body:

    x = 1
    
    def sum(y):
      return x + y
    
    sum(y)
    

    The above will break since sum is using a variable that's defined outside the function's body. If a user tries to refactor a notebook with code like this using soorgeon refactor, they'll get an error message asking them to change the code to:

    x = 1
    
    def sum(y, x):
      return x + y
    
    sum(y, x)
    

    We throw an error message with a link to this document.

    However, we should automate this process and modify the user's source code on their behalf so they don't have to do it manually.

    considerations

    There are a few edge cases to take into account. For example, what if the function already has an argument with that name? Or what if the function's signature uses *args or **kwargs? Since there are many edge cases, we should focus on covering the simple ones to ensure it works, detect a few of the edge ones, and throw an error so the user fixes them manually.

    modifying users' code

    soorgeon refactor parses the code into an AST to detect dependencies among the notebook's sections. Python's standard library has an ast module; however, it's very limited, so we use parso instead.

    Parso offers roundtrip conversion, meaning we can go from source code to AST and to source code again. You can see an example of that here:

    https://github.com/ploomber/soorgeon/blob/03229806b905e7715d1c3ee0413ff5a3bc30b71c/src/soorgeon/io.py#L851

    The above is a function that removes import statements from a string with source code.
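
    For illustration, a minimal round trip with parso (not the actual soorgeon function):

    # sketch: source -> tree -> source with parso, preserving formatting
    import parso

    code = "import math\nx = 1\ny = x + 1\n"
    tree = parso.parse(code)

    # the tree can be inspected node by node, e.g., to find import statements...
    print([imp.get_code() for imp in tree.iter_imports()])

    # ...and converted back to the exact same source code
    assert tree.get_code() == code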

    determining if parso is the best option

    So far, parso has worked well for us; however, we're interested in improving soorgeon's capabilities to automatically refactor code, so we should ensure we're using the right tool for the job. Part of fixing this issue is to see whether there are better alternatives.

    The bottom of the documentation for Python's ast module links to some alternatives:


    See also

    Green Tree Snakes, an external documentation resource, has good details on working with Python ASTs.

    ASTTokens annotates Python ASTs with the positions of tokens and text in the source code that generated them. This is helpful for tools that make source code transformations.

    leoAst.py unifies the token-based and parse-tree-based views of python programs by inserting two-way links between tokens and ast nodes.

    LibCST parses code as a Concrete Syntax Tree that looks like an ast tree and keeps all formatting details. It’s useful for building automated refactoring (codemod) applications and linters.

    Parso is a Python parser that supports error recovery and round-trip parsing for different Python versions (in multiple Python versions). Parso is also able to list multiple syntax errors in your python file.


    I also found these other projects:

    redbaron: https://github.com/PyCQA/redbaron
    baron: https://github.com/PyCQA/baron

    opened by edublancas 2
  • extending soorgeon clean

    We started working on a soorgeon clean command to apply flake8 and isort to notebooks (see https://github.com/ploomber/soorgeon/issues/50).

    We can extend this functionality to do other types of clean-ups. PyCQA hosts a bunch of interesting projects; a few that caught my eye:

    autoflake: https://github.com/PyCQA/autoflake
    bandit: https://github.com/PyCQA/bandit

    A similar package for inspiration: https://github.com/nbQA-dev/nbQA

    opened by edublancas 0
  • automated notebook cleaning

    Notebooks get messy. Suppose a data scientist is working on a notebook that downloads some data, cleans it, generates features, and trains a model. Let's now say we want to deploy this notebook so we run a scheduled job every month to re-train the model with new data. In production, we need to keep some of the notebook logic (but not all).

    During development, data scientists usually add extra cells for exploration and debugging. For example, I might plot a histogram of the input features to check whether the data exhibits certain properties. While useful during development, some of the notebook's cells (e.g., plotting a histogram) aren't needed for production, so cleaning the notebook helps make a smoother transition to production.

    Note that this is related to soorgeon clean (https://github.com/ploomber/soorgeon/issues/50) but not the same. soorgeon clean only reformats the existing code. In this case, we're talking about making deeper changes to the user's code.

    low-hanging fruit: using autoflake

    The low-hanging fruit here is to remove unused imports and variables via autoflake. For example, say a notebook looks like this:

    # cell 1
    import math
    import pandas as pd
    df = pd.read_csv('data.csv')
    
    # cell 2
    df2 = pd.read_csv('another.csv')
    
    # cell 3
    df.plot()
    

    If we run autoflake on that notebook, it'll be able to delete import math (since the module is never used) and also df2 = pd.read_csv('another.csv') (since df2 isn't used), and we'll end up with:

    # cell 1
    import pandas as pd
    df = pd.read_csv('data.csv')
    
    # cell 2
    df.plot()
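
    Programmatically, this could run autoflake's fix_code over the notebook's combined code (a sketch, not soorgeon's implementation; mapping the cleaned code back to individual cells is the harder part):

    # sketch: let autoflake drop unused imports and variables from the combined code
    import autoflake
    import nbformat

    nb = nbformat.read("nb.ipynb", as_version=4)
    source = "\n".join(c.source for c in nb.cells if c.cell_type == "code")

    print(autoflake.fix_code(source,
                             remove_all_unused_imports=True,
                             remove_unused_variables=True))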
    

    more advanced approach: automated variable pruning

    A more advanced approach would be to delete everything that is not required to produce some final result. For example, say we have a notebook whose final result is to produce df_final:

    import pandas as pd
    df = pd.read_csv('data.csv')
    df2 = pd.read_csv('another.csv')
    df3 = pd.read_csv('final.csv')
    
    df4 = do_something(df2, df3)
    
    df_final = do_stuff(df)
    

    We can work backward and eliminate everything that does not affect df_final:

    import pandas as pd
    df = pd.read_csv('data.csv')
    
    df_final = do_stuff(df)
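
    A very rough sketch of that backward pass using the standard-library ast module (it ignores control flow, mutation, and other edge cases):

    # sketch: keep only the top-level statements needed to compute `target`
    # requires Python 3.9+ for ast.unparse
    import ast

    def prune(source, target):
        module = ast.parse(source)
        needed, kept = {target}, []

        for stmt in reversed(module.body):
            if isinstance(stmt, (ast.Import, ast.ImportFrom)):
                defined = {(a.asname or a.name).split(".")[0] for a in stmt.names}
            else:
                defined = {n.id for n in ast.walk(stmt)
                           if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}

            if defined & needed:
                needed |= {n.id for n in ast.walk(stmt)
                           if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}
                kept.append(stmt)

        return ast.unparse(ast.Module(body=list(reversed(kept)), type_ignores=[]))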
    

    considerations

    Pruning code is useful for cleaning up what's not needed for deployment, but in some cases the code might be needed again. For example, if I add some data exploration code (i.e., code to generate a plot), I may want to delete it as part of this automated cleaning process, but once the model is deployed, I might need that code again if the model fails and I need to debug things. I'm unsure how to deal with this scenario.

    opened by edublancas 0
Owner
Ploomber
We develop tools to streamline the development-to-production Data Science workflow.