Convert monolithic Jupyter notebooks into Ploomber pipelines.

Overview

Soorgeon

Join our community | Newsletter | Contact us | Blog | Website | YouTube

Video: soorgeon.mp4 (3-minute video tutorial)

Try the interactive demo:

Open JupyterLab

Note: Soorgeon is in alpha; help us make it better.

Install

pip install soorgeon

Usage

# refactor notebook
soorgeon refactor nb.ipynb

# all variables with the df prefix are stored in csv files
soorgeon refactor nb.ipynb --df-format csv
# all variables with the df prefix are stored in parquet files
soorgeon refactor nb.ipynb --df-format parquet

# store task output in 'some-directory' (if missing, this defaults to 'output')
soorgeon refactor nb.ipynb --product-prefix some-directory

# generate tasks in .py format
soorgeon refactor nb.ipynb --file-format py
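
After refactoring, you should end up with a pipeline.yaml (one task per H2 heading in the notebook) plus the corresponding task notebooks, with products stored under 'output' unless --product-prefix says otherwise. A rough sketch of the layout (file names come from your headings; placement may differ slightly):

pipeline.yaml        # Ploomber pipeline spec, one task per H2 heading
section-one.ipynb    # one task notebook per section
section-two.ipynb
output/              # task products (see --product-prefix)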

To learn more, check out our guide.

Examples

git clone https://github.com/ploomber/soorgeon

Exploratory data analysis notebook:

cd examples/exploratory
soorgeon refactor nb.ipynb

# to run the pipeline
pip install -r requirements.txt
ploomber build

Machine learning notebook:

cd examples/machine-learning
soorgeon refactor nb.ipynb

# to run the pipeline
pip install -r requirements.txt
ploomber build

To learn more, check out our guide.

Community

Issues and pull requests
  • A utility to ensure the notebook runs

    Before refactoring a notebook, the user must ensure that the original notebook runs. We should have a command that checks whether it works and suggests actions if there are errors.

    e.g.,

    soorgeon run nb.ipynb
    

    If the notebook fails because of a ModuleNotFoundError: suggest creating a virtualenv and adding a requirements.txt with the package name.

    If it's another error: show the guide for debugging notebooks.

    If it's a function signature mismatch: recommend downgrading some libraries.

    An alternative approach would be to let the user convert anyway and then help them fix the issues in the Ploomber-generated pipeline; this way they can leverage incremental builds for rapid iterations. We can also suggest adding a sample parameter.
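
    A minimal sketch of what such a check could look like, assuming papermill as the runner (check_notebook is a hypothetical helper, not part of soorgeon):

    # sketch: execute the notebook and map common failures to suggestions
    import papermill as pm

    def check_notebook(path):
        try:
            pm.execute_notebook(path, path)
        except Exception as exc:
            message = str(exc)
            if "ModuleNotFoundError" in message:
                print("Missing package: create a virtualenv and add it to requirements.txt")
            elif "unexpected keyword argument" in message or "positional argument" in message:
                print("Function signature mismatch: consider downgrading some libraries")
            else:
                print("See the guide for debugging notebooks")
            raise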

    opened by edublancas 23
  • Overriding variables

    (I think this is already covered, but I need to make sure and add a few tests.)

    Important: in these examples, comments are H2 headings (e.g., # I'm an H2 heading).

    We have to add two test cases:

    1. Overriding variables

    
    # first
    
    df = something()
    
    # second
    
    df = another()
    
    # third
    
    df_2 = df + 1
    

    When running soorgeon refactor, the output should contain three notebooks (first, second, third). third.ipynb must contain the following:

    # third
    
    df_2 = df + 1
    

    Now, third uses df as input. So we must ensure that soorgeon picks up the df from the second notebook and not the one from the first one. Graphically it should be:

      graph LR;
          first;
          second --> third;
    

    Overriding variables (same cell)

    # first
    df = something()
    
    # second
    df = another()
    df_2 = df + 1
    

    We need to ensure that in this case, the second task generated does not create a dependency with the first task, since it has its own df (i.e., it does not depend on the df = something() line).

    In other words, the output pipeline (produced by ploomber plot), should look like this (two independent tasks):

      graph LR;
          first;
          second;
    

    Instead of this (two dependent tasks):

      graph LR;
          first --> second;
    

    testing

    The simplest way to test it is to call soorgeon refactor and then load the pipeline to ensure we got the right structure:

    from ploomber.spec import DAGSpec
    from soorgeon import export
    
    nb = """
    # notebook content goes here
    """
    
    # refactor the notebook (_read is a test helper that parses the string into a notebook)
    export.from_nb(_read(nb))
    
    dag = DAGSpec('pipeline.yaml').to_dag().render()
    
    # write assert statements for each task in the pipeline
    # to get the task names, you can do:
    list(dag)
    
    # example: check that task_name only has one upstream dependency called 'get'
    assert set(dag['task_name'].upstream) == {'get'}
    
    # another example: check that another_task has no dependencies
    assert set(dag['another_task'].upstream) == set()
    
    good first issue 
    opened by edublancas 19
  • `soorgeon clean` command deletes existing output

    When running soorgeon clean, we convert the notebook to .py and then back to .ipynb; the problem is that cell outputs are lost. What we want instead is to keep the outputs as they were. There might be some challenges, since running isort may cause the cleaned notebook to have extra cells (and potentially running black, too), so we need to find a solution. The new Jupyter notebook format adds unique cell ids, so we can use those (if the notebook does not have cell ids, loading it with nbformat.read may add them). Worst case: we create our own cell ids.
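
    A minimal sketch of the id-based approach (illustrative only; it assumes the cells carry ids):

    # sketch: restore cell outputs from the original notebook by matching cell ids
    import nbformat

    original = nbformat.read("nb.ipynb", as_version=4)        # still has outputs
    cleaned = nbformat.read("nb-clean.ipynb", as_version=4)   # outputs lost during clean

    outputs = {c.get("id"): c.outputs
               for c in original.cells
               if c.cell_type == "code" and c.get("id")}

    for cell in cleaned.cells:
        if cell.cell_type == "code" and cell.get("id") in outputs:
            cell.outputs = outputs[cell.get("id")]

    nbformat.write(cleaned, "nb-clean.ipynb")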

    bug high priority med effort 
    opened by edublancas 18
  • Dealing with non-picklable objects

    Some objects aren't picklable (e.g., Jinja templates), so we should capture these errors at runtime and suggest fixes:

    1. Use another library like cloudpickle (this can be an option), e.g., soorgeon refactor notebook.ipynb --serializer cloudpickle (see the sketch after this list)
    2. Link to a guide for fixing the problem (e.g., create a factory to instantiate the object)
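
    A rough sketch of the fallback idea (serialize is an illustrative helper, not soorgeon's API):

    # sketch: fall back to cloudpickle when the standard pickle fails
    import pickle

    def serialize(obj, path):
        try:
            with open(path, "wb") as f:
                pickle.dump(obj, f)
        except (pickle.PicklingError, TypeError):
            import cloudpickle  # optional dependency
            with open(path, "wb") as f:
                cloudpickle.dump(obj, f)
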
    opened by edublancas 11
  • Display a warning if the notebook saves files outside the output/product folder

    It's possible that the original notebook already creates new files; we should show a warning if that's the case and tell the user to register them as products (and include a short guide explaining how to do it).
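
    A possible detection sketch using parso, the parser soorgeon already uses (the set of write calls is illustrative and the check is a heuristic):

    # sketch: warn when the notebook source appears to write files directly
    import parso

    WRITE_CALLS = {"open", "to_csv", "to_parquet", "savefig"}

    def find_possible_writes(code):
        found = set()
        leaf = parso.parse(code).get_first_leaf()
        while leaf is not None:
            if leaf.type == "name" and leaf.value in WRITE_CALLS:
                found.add(leaf.value)
            leaf = leaf.get_next_leaf()
        return found

    calls = find_possible_writes("df.to_csv('clean.csv')")
    if calls:
        print(f"Warning: the notebook may write files directly ({', '.join(sorted(calls))}); "
              "consider registering them as products")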

    opened by edublancas 10
  • making the CI more reliable

    making the CI more reliable

    We have a bunch of integration tests that fetch data and notebooks using Kaggle's API. However, they fail sometimes (probably because the Kaggle API isn't very reliable). See https://github.com/ploomber/soorgeon/issues/58

    One solution is to use the Kaggle API once and then upload the files to an S3 bucket, then have the CI download them from S3 instead of Kaggle.
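
    For example, the CI could pull the cached assets with boto3 (bucket and key names are illustrative):

    # sketch: download cached test assets from S3 instead of hitting the Kaggle API
    import boto3

    s3 = boto3.client("s3")
    s3.download_file("soorgeon-ci-assets", "notebooks/nb.ipynb", "nb.ipynb")
    s3.download_file("soorgeon-ci-assets", "data/train.csv", "train.csv")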

    opened by edublancas 9
  • apply black before refactoring

    I ran soorgeon refactor on this notebook: https://www.kaggle.com/code/cdeotte/xgboost-starter-0-793

    After fixing the headings and the function that uses global variables, I got this error:

      File "/Users/Edu/dev/soorgeon/src/soorgeon/io.py", line 515, in find_inputs_and_outputs_from_leaf
        (_, candidates_in, candidates_out) = find_for_loop_def_and_io(
      File "/Users/Edu/dev/soorgeon/src/soorgeon/io.py", line 104, in find_for_loop_def_and_io
        raise ValueError(f'Expected a node with type "for_stmt", '
    ValueError: Expected a node with type "for_stmt", got: <ExprStmt: df = df.merge(importances[k], on='feature', how='left')@4,25> with type expr_stmt
    Error: An error occurred when refactoring notebook.
    

    I realized it failed because of this for loop:

    for k in range(1,FOLDS): df = df.merge(importances[k], on='feature', how='left')
    

    I realized that applying black and rerunning it fixed the issue. That's an easy fix for many of these edge cases, so I think we should apply black on the notebook before running soorgeon refactor.
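
    A minimal sketch of that pre-formatting step (not soorgeon's implementation):

    # sketch: run black on every code cell before refactoring
    import black
    import nbformat

    nb = nbformat.read("nb.ipynb", as_version=4)

    for cell in nb.cells:
        if cell.cell_type == "code":
            try:
                cell.source = black.format_str(cell.source, mode=black.Mode())
            except black.InvalidInput:
                pass  # leave cells black can't parse (e.g., magics) unchanged

    nbformat.write(nb, "nb.ipynb")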

    Here's the notebook. Note that I changed the extension from ipynb to txt since GitHub won't let me upload it with an ipynb extension:

    xgboost-starter-0-793.txt

    low priority med effort 
    opened by edublancas 8
  • display warning when there are output files in task itself

    Describe your changes

    Parse the notebook code; if it contains a statement such as open, output a warning.

    Issue ticket number and link

    Closes #16

    Checklist before requesting a review

    • [x] I have performed a self-review of my code
    • [x] I have added thorough tests (when necessary).
    • [x] I have added the right documentation (when needed). Product update? If yes, write one line about this update.
    opened by Wxl19980214 8
  • refactored code, download from GitHub instead of Kaggle

    Describe your changes

    1. Instead of fetching from Kaggle, we fetch from GitHub via the PyGithub API now.
    2. Add _pygithub.py for downloading a directory from a repo (see the sketch after this list).
    3. Add the pygithub package to setup.py (I don't know if this one is necessary).
    4. Removed the notebooks stored under _kaggle since we can store them in the storage repo now.
    5. Removed unused functions under src/_kaggle.py.
    6. Removed the Kaggle username and key from the CI; I don't think we need them. Maybe we can also remove them from the repo secrets.
    7. The storage repo is here, and its README.md has instructions on how to upload.
    8. Contribute.md remains unchanged; we can let users register new notebooks in _kaggle/index.yaml, but the rest (downloading, uploading, updating the CI and test cases) will need to be done manually by one of the team members.
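
    A rough sketch of the download helper in _pygithub.py, assuming the PyGithub API (repo and path names are illustrative):

    # sketch: download a directory from a GitHub repo with PyGithub
    from pathlib import Path
    from github import Github

    def download_directory(repo_name, directory, dest="."):
        repo = Github().get_repo(repo_name)  # pass a token for higher rate limits
        for item in repo.get_contents(directory):
            if item.type == "dir":
                download_directory(repo_name, item.path, dest)
            else:
                target = Path(dest, item.path)
                target.parent.mkdir(parents=True, exist_ok=True)
                target.write_bytes(item.decoded_content)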

    To do

    1. We need to add a personal access token to the repo's secrets.
    2. Right now the CI will fail because test_notebooks[look-at-this-note-feature-engineering-is-easy] fails with: E soorgeon.exceptions.InputError: Only H1 headings are found. At this time, only H2 headings are supported. Check out our guide:

    Issue ticket number and link

    Closes #59 #39

    Checklist before requesting a review

    • [x] I have performed a self-review of my code
    • [x] I have added thorough tests (when necessary).
    • [x] I have added the right documentation (when needed). Product update? If yes, write one line about this update.
    opened by Wxl19980214 7
  • Dealing with non-picklable objects

    The PR tries to solve https://github.com/ploomber/soorgeon/issues/9

    1. Added a CLI option called serializer, which can be one of cloudpickle or dill
    2. Based on the serializer selected, soorgeon will try to pickle the objects with either cloudpickle or dill
    3. Both packages are optional dependencies
    opened by neelasha23 6
  • Refactor test command to use papermill to run notebooks

    Describe your changes

    • Refactor test command to use papermill to run notebooks
    • Add file existence check for output of executed notebook

    Issue ticket number and link

    Closes #6

    Checklist before requesting a review

    • [x] I have performed a self-review of my code
    • [x] I have added thorough tests (when necessary).
    • [x] I have added the right documentation (when needed). Product update? If yes, write one line about this update.
    opened by 94rain 5
  • Support using global variables inside a function's body

    Previously, we did not support using global variables inside a function's body:

            x = 1
            def sum(y):
                return x + y
            sum(y)
    

    The above would break since sum is using a variable that's defined outside the function's body. If a user tried to refactor a notebook with code like this using Soorgeon refactor, they would get an error message asking them to change the code to:

            x = 1
            def sum(y, x):
                return x + y
            sum(y, x)
    

    This PR automates this process and modifies the user's source code on their behalf so they don't have to do it manually.

    https://github.com/ploomber/soorgeon/issues/65

    Closes #65

    Checklist before requesting a review

    • [x] I have performed a self-review of my code
    • [x] I have added thorough tests (when necessary).
    • [ ] I have added the right documentation (when needed). Product update? If yes, write one line about this update.
    opened by rrhg 1
  • idea: have `soorgeon clean` auto-format SQL snippets

    Saw this on Stack Overflow: https://stackoverflow.com/questions/41046955/formatting-sql-query-inside-an-ipython-jupyter-notebook

    It could work both on cells that have:

    %%sql
    

    and markdown snippets
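
    One option is sqlparse; a minimal sketch for %%sql cells (format_sql_cell is illustrative, not part of soorgeon):

    # sketch: reformat the SQL body of a %%sql cell, leaving the magic line intact
    import sqlparse

    def format_sql_cell(source):
        magic, _, body = source.partition("\n")
        formatted = sqlparse.format(body, reindent=True, keyword_case="upper")
        return magic + "\n" + formatted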

    high priority low effort 
    opened by edublancas 0
  • dealing with global variables

    Currently, we do not support using global variables inside a function's body:

    x = 1
    
    def sum(y):
      return x + y
    
    sum(y)
    

    The above will break since sum is using a variable that's defined outside the function's body. If a user tries to refactor a notebook with code like this using soorgeon refactor, they'll get an error message asking them to change the code to:

    x = 1
    
    def sum(y, x):
      return x + y
    
    sum(y, x)
    

    We throw an error message with a link to this document.

    However, we should automate this process and modify the user's source code on their behalf so they don't have to do it manually.

    considerations

    There are a few edge cases to take into account. For example, what if the function already has an argument with that name? Or what if the function's signature uses *args or **kwargs? Since there are many edge cases, we should focus on covering the simple ones to ensure it works, detect a few of the edge ones, and throw an error so the user fixes them manually.

    modifying users' code

    soorgeon refactor parses the code into an AST to detect dependencies among the notebook's sections. Python's standard library has an ast module; however, it's very limited, so we use parso instead.

    Parso offers roundtrip conversion, meaning we can go from source code to AST and to source code again. You can see an example of that here:

    https://github.com/ploomber/soorgeon/blob/03229806b905e7715d1c3ee0413ff5a3bc30b71c/src/soorgeon/io.py#L851

    The above is a function that removes import statements from a string with source code.
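
    For illustration, a minimal round trip with parso (not the actual soorgeon function):

    # sketch: source -> tree -> source with parso, preserving formatting
    import parso

    code = "import math\nx = 1\ny = x + 1\n"
    tree = parso.parse(code)

    # the tree can be inspected node by node, e.g., to find import statements...
    print([imp.get_code() for imp in tree.iter_imports()])

    # ...and converted back to the exact same source code
    assert tree.get_code() == code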

    determining if parso is the best option

    So far, parso has worked well for us; however, we're interested in improving soorgeon's capabilities to automatically refactor code, so we should ensure we're using the right tool for the job. Part of fixing this issue is to see whether there are better alternatives.

    The bottom of the documentation for Python's ast module links to some alternatives:


    See also

    Green Tree Snakes, an external documentation resource, has good details on working with Python ASTs.

    ASTTokens annotates Python ASTs with the positions of tokens and text in the source code that generated them. This is helpful for tools that make source code transformations.

    leoAst.py unifies the token-based and parse-tree-based views of python programs by inserting two-way links between tokens and ast nodes.

    LibCST parses code as a Concrete Syntax Tree that looks like an ast tree and keeps all formatting details. It’s useful for building automated refactoring (codemod) applications and linters.

    Parso is a Python parser that supports error recovery and round-trip parsing for different Python versions (in multiple Python versions). Parso is also able to list multiple syntax errors in your python file.


    I also found these other projects:

    redbaron: https://github.com/PyCQA/redbaron
    baron: https://github.com/PyCQA/baron

    opened by edublancas 2
  • extending soorgeon clean

    We started working on a soorgeon clean command to apply flake8 and isort to notebooks (see https://github.com/ploomber/soorgeon/issues/50).

    We can extend this functionality to do other types of clean-ups. PyCQA hosts a bunch of interesting projects; a few that caught my eye:

    autoflake: https://github.com/PyCQA/autoflake
    bandit: https://github.com/PyCQA/bandit

    A similar package for inspiration: https://github.com/nbQA-dev/nbQA

    opened by edublancas 0
  • automated notebook cleaning

    Notebooks get messy. Suppose a data scientist is working on a notebook that downloads some data, cleans it, generates features, and trains a model. Let's now say we want to deploy this notebook so we run a scheduled job every month to re-train the model with new data. In production, we need to keep some of the notebook logic (but not all).

    During development, data scientists usually add extra cells for exploration and debugging. For example, I might plot a histogram of the input features to check whether the data exhibits certain properties. While useful during development, some of the notebook's cells (e.g., plotting a histogram) aren't needed for production, so cleaning the notebook helps make a smoother transition to production.

    Note that this is related to soorgeon clean (https://github.com/ploomber/soorgeon/issues/50) but not the same. soorgeon clean only reformats the existing code. In this case, we're talking about making deeper changes to the user's code.

    low-hanging fruit: using autoflake

    The low-hanging fruit here is to remove unused imports and variables via autoflake. For example, say a notebook looks like this:

    # cell 1
    import math
    import pandas as pd
    df = pd.read_csv('data.csv')
    
    # cell 2
    df2 = pd.read_csv('another.csv')
    
    # cell 3
    df.plot()
    

    If we run autoflake on that notebook, it'll be able to delete import math (since the module is never used) and also df2 = pd.read_csv('another.csv') (since df2 isn't used), and we'll end up with:

    # cell 1
    import pandas as pd
    df = pd.read_csv('data.csv')
    
    # cell 2
    df.plot()
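
    Programmatically, this could run autoflake's fix_code over the notebook's combined code (a sketch, not soorgeon's implementation; mapping the cleaned code back to individual cells is the harder part):

    # sketch: let autoflake drop unused imports and variables from the combined code
    import autoflake
    import nbformat

    nb = nbformat.read("nb.ipynb", as_version=4)
    source = "\n".join(c.source for c in nb.cells if c.cell_type == "code")

    print(autoflake.fix_code(source,
                             remove_all_unused_imports=True,
                             remove_unused_variables=True))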
    

    more advanced approach: automated variable pruning

    A more advanced approach would be to delete everything that is not required to produce some final result. For example, say we have a notebook whose final result is to produce df_final:

    import pandas as pd
    df = pd.read_csv('data.csv')
    df2 = pd.read_csv('another.csv')
    df3 = pd.read_csv('final.csv')
    
    df4 = do_something(df2, df3)
    
    df_final = do_stuff(df)
    

    We can work backward and eliminate everything that does not affect df_final:

    import pandas as pd
    df = pd.read_csv('data.csv')
    
    df_final = do_stuff(df)
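
    A very rough sketch of that backward pass using the standard-library ast module (it ignores control flow, mutation, and other edge cases):

    # sketch: keep only the top-level statements needed to compute `target`
    # requires Python 3.9+ for ast.unparse
    import ast

    def prune(source, target):
        module = ast.parse(source)
        needed, kept = {target}, []

        for stmt in reversed(module.body):
            if isinstance(stmt, (ast.Import, ast.ImportFrom)):
                defined = {(a.asname or a.name).split(".")[0] for a in stmt.names}
            else:
                defined = {n.id for n in ast.walk(stmt)
                           if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}

            if defined & needed:
                needed |= {n.id for n in ast.walk(stmt)
                           if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}
                kept.append(stmt)

        return ast.unparse(ast.Module(body=list(reversed(kept)), type_ignores=[]))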
    

    considerations

    Pruning code is useful for cleaning up what's not needed for deployment, but in some cases the code might be needed again. For example, if I add some data exploration code (i.e., code to generate a plot), I may want to delete it as part of this automated cleaning process, but once the model is deployed, I might need that code again if the model fails and I need to debug things. I'm unsure how to deal with this scenario.

    opened by edublancas 0
Owner
Ploomber
We develop tools to streamline the development-to-production Data Science workflow.