Currently all notebooks and all images are committed every deployment (every 6 hours).
This has some big advantages:
- Reproducibility.
- Simplicity.
- NB viewer for PRs.
This also has many disadvantages:
- .git folder data is getting very large (900GB).
- Sync times are getting slow.
- Notebook file changes are hard to review (not diff-friendly).
- NB viewer is not really a reliable way to render the notebooks (with all the various JS and HTML used in them).
- Bot commits are adding lots of noise to the commit history.
- Changes need to always be rebased on top of bot commits even if no actual changes besides deployment happened.
Fortunately, there is a tool that makes life with Jupyter notebooks and git much easier (at least for me): https://github.com/mwouts/jupytext. It is an incredibly comfortable way of working in both an IDE and a notebook simultaneously, but most importantly it makes git life much easier:
- Notebook files are "paired" to .py files (with some comments to store structure and markdown) and are synced together locally.
- .py files are committed to git, whereas .ipynb files are not (e.g. ignored in .gitignore).
- During deployments a command generates the .ipynb files (they contain only inputs, and no outputs), and then papermill runs them to fill them with up-to-date outputs. Nothing needs to be committed, because the .py files haven't changed.
- For local development, Jupyter will automatically open the .py files with the metadata as notebooks when jupytext is installed. Otherwise a short Makefile target can create / update all notebooks using the same script that's used during deployment.
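To make the deployment step above concrete, here is a rough sketch of what such a script could look like. It is not the script from my repo: the function names and the directory layout are made up for illustration, and it assumes that every .py file in the source directory is a paired notebook and that the jupytext and papermill command line tools are installed.

```python
from pathlib import Path
from typing import List
import subprocess


def build_commands(src_dir: str, out_dir: str) -> List[List[str]]:
    """Return the jupytext and papermill commands for every .py file in src_dir.

    Hypothetical helper: assumes src_dir holds only paired notebook scripts.
    """
    commands = []
    for script in sorted(Path(src_dir).glob("*.py")):
        notebook = Path(out_dir) / script.with_suffix(".ipynb").name
        # Step 1: turn the text file into a notebook with inputs only.
        commands.append(
            ["jupytext", "--to", "notebook", "--output", str(notebook), str(script)]
        )
        # Step 2: execute the notebook so it is filled with fresh outputs.
        # Input and output paths are the same, so the file is overwritten.
        commands.append(["papermill", str(notebook), str(notebook)])
    return commands


def deploy(src_dir: str, out_dir: str) -> None:
    """Generate and execute all notebooks; nothing is committed to git."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for command in build_commands(src_dir, out_dir):
        # check=True stops the deployment if any conversion or run fails.
        subprocess.run(command, check=True)
```

Calling deploy("notebooks", "build") from the deployment job (or from a Makefile target) would then produce the executed notebooks in build/ while the git working tree stays clean.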
In my own repo (where I work on the various visualisations / calculations) I've switched to this system and am very happy about it:
- Only meaningful commits are in the history and there is no tradeoff between deployment schedule and git noise (although I keep it to once a day, since there is no hurry).
- I can clearly see the changes I'm committing using the diff on the .py files.
- The changes are all very lightweight since they don't contain data or random JSON noise.
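For anyone who hasn't seen one, this is roughly what a paired file looks like in jupytext's percent format. The header is trimmed (real files usually carry a few more metadata keys), and the data and variable names are invented for the example. Cells are separated by `# %%` comments and markdown cells are stored as comments, which is why a change to one line of code shows up as a one-line diff.

```python
# ---
# jupyter:
#   jupytext:
#     formats: ipynb,py:percent
# ---

# %% [markdown]
# # Daily case counts (hypothetical example)
# Markdown cells are stored as comments like this one.

# %%
# Hypothetical data standing in for whatever the real notebook loads.
daily_cases = [12, 18, 25, 31, 40, 38, 45]

# %%
# Average over the seven days listed above.
weekly_average = sum(daily_cases) / len(daily_cases)
print(f"7-day average: {weekly_average:.1f}")
```

The file is also a plain Python script, so it can be run, linted and imported like any other module.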
With COVID-19 not likely to be over any time soon, it might be a good idea to give this a chance. It's pretty simple, and the switch can be gradual (notebooks can be moved to this system and git-removed one at a time).
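As a sketch of what moving a single notebook could involve, the helper below lists the commands I would expect to run. The function name is made up, the percent format is just one possible choice, and adding the notebook to .gitignore is left as a manual step.

```python
from pathlib import Path
from typing import List


def migration_commands(notebook: str) -> List[List[str]]:
    """Commands to pair one notebook with a .py file and stop tracking it.

    Hypothetical helper; it only builds the commands and does not run them.
    """
    script = str(Path(notebook).with_suffix(".py"))
    return [
        # Pair the notebook with a .py file in the percent format.
        ["jupytext", "--set-formats", "ipynb,py:percent", notebook],
        # Make sure both paired files are written and up to date.
        ["jupytext", "--sync", notebook],
        # Stop tracking the .ipynb file but keep it on disk.
        ["git", "rm", "--cached", notebook],
        # Track the text version instead.
        ["git", "add", script],
    ]
```

Each command list can be passed to subprocess.run(..., check=True), or simply typed into a shell, for one notebook at a time.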