We'd like to make it super easy to go from writing code in a notebook to training that model in a distributed fashion.
The experience might be something like:
- User writes code in a notebook and executes it in JupyterLab
- User clicks a button which allows them to fill in various settings (e.g. number of GPUs)
- User clicks train
Under the hood, this would cause:
- A Docker image to be built
- A TFJob/PyTorchJob/K8s Job to be created and fired off (a rough sketch of submitting such a job is included below)
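As a very rough sketch (not a design), the submission could go through the Kubernetes Python client's custom objects API. The group/version, namespace, image, command, and TFJob spec fields below are assumptions and depend on which TFJob API version is deployed:

```python
# Sketch only: fire off a TFJob via the Kubernetes custom objects API.
# The namespace, image, command, and spec fields are illustrative assumptions.
from kubernetes import client, config


def submit_tfjob(name, image, namespace="default", workers=2):
    config.load_kube_config()  # or config.load_incluster_config() in-cluster
    tfjob = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": {"spec": {"containers": [{
                        "name": "tensorflow",
                        "image": image,
                        "command": ["python", "/opt/train.py", "train_model",
                                    "--output=/output/model.h5"],
                    }]}},
                }
            }
        },
    }
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="kubeflow.org", version="v1", namespace=namespace,
        plural="tfjobs", body=tfjob)
```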
I think the biggest challenge is that we probably don't want to execute all the code in the notebook. Typically, there's some amount of refactoring that needs to be done to convert a notebook into a python module suitable for execution as a batch job.
As a concrete example:
Here's the notebook for our GitHub Issue summarization example
Here's the corresponding python module used when training in a K8s job.
The python module only executes a subset of the cells, in particular those that:
- Define the model architecture
- Train the model
Rather than try to auto-convert a notebook like the GitHub issue example, I think we should require users to structure their code to facilitate the conversion.
My suggestion would be to allow any functions defined in the notebook to be used as entry points. So for the GitHub issue summarization example, the user would have a cell like the following:

    import numpy as np
    from keras.callbacks import CSVLogger, ModelCheckpoint

    def train_model(output):
        # seq2seq_Model, encoder_input_data, decoder_input_data and
        # decoder_target_data are defined in earlier cells of the notebook.
        script_name_base = 'tutorial_seq2seq'
        csv_logger = CSVLogger('{:}.log'.format(script_name_base))
        model_checkpoint = ModelCheckpoint(
            '{:}.epoch{{epoch:02d}}-val{{val_loss:.5f}}.hdf5'.format(script_name_base),
            save_best_only=True)

        batch_size = 1200
        epochs = 7
        history = seq2seq_Model.fit([encoder_input_data, decoder_input_data],
                                    np.expand_dims(decoder_target_data, -1),
                                    batch_size=batch_size,
                                    epochs=epochs,
                                    validation_split=0.12,
                                    callbacks=[csv_logger, model_checkpoint])
        seq2seq_Model.save(output)

    train_model('seq2seq_model_tutorial.h5')

If a user structures their code this way, we should be able to automatically create and invoke a suitable container entry point. Something like the following:
- Use nbconvert to convert from ipynb to python code
- Post-process the python code
  - Strip out any statements not inside a function (except imports)
- Create a CLI for the functions using a library like python-fire
- Build a Docker image that is the Notebook image + the generated code (a rough sketch of these steps follows)
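Here's a minimal sketch of the conversion steps, assuming hypothetical file names (`notebook.ipynb`, `train.py`) and using the nbconvert Python API, the `ast` module, and python-fire; the stripping logic is deliberately simplistic:

```python
# Sketch: turn a notebook into a python module that keeps only imports and
# function/class definitions, then expose those functions as a CLI with fire.
# The file names (notebook.ipynb, train.py) are illustrative assumptions.
import ast

from nbconvert import PythonExporter


def notebook_to_module(notebook_path, module_path):
    source, _ = PythonExporter().from_filename(notebook_path)
    tree = ast.parse(source)
    # Keep imports and function/class definitions; drop other top-level statements.
    tree.body = [node for node in tree.body
                 if isinstance(node, (ast.Import, ast.ImportFrom, ast.FunctionDef,
                                      ast.AsyncFunctionDef, ast.ClassDef))]
    code = ast.unparse(tree)  # requires Python 3.9+
    # Append a fire entry point so every top-level function becomes a subcommand.
    code += "\n\nif __name__ == '__main__':\n    import fire\n    fire.Fire()\n"
    with open(module_path, "w") as f:
        f.write(code)


notebook_to_module("notebook.ipynb", "train.py")
```

The container entry point could then call something like `python train.py train_model --output=/output/model.h5` (again, a hypothetical invocation).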
A variant of this idea would be to use metaml (by @wbuchwalter). metaml uses metaparticle to allow people to annotate their python code with the information needed to then run it on K8s (e.g. distributed using TFJob). If we went with this approach, I think the flow would be:
- Run nbconvert to go from ipynb -> py
- Use the metaparticle/metaml tool chain to build the docker image and submit the job (a rough illustration of the annotation idea follows)
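I don't have metaml's exact API in front of me, so the decorator below is a made-up stand-in purely to illustrate the annotation idea (attaching run metadata to a training function); metaml/metaparticle would supply the real decorator and handle the image build and job submission:

```python
# Illustration only: a stand-in for a metaml-style annotation. The decorator
# name, parameters, and behavior here are hypothetical, not metaml's real API.
def train_on_k8s(image=None, worker_count=1, gpu_count=0):
    """Attach K8s/TFJob run metadata to a training function."""
    def wrap(fn):
        fn.k8s_run_config = {
            "image": image,
            "worker_count": worker_count,
            "gpu_count": gpu_count,
        }
        return fn  # a real tool would use this metadata to build and submit the job
    return wrap


@train_on_k8s(image="gcr.io/example/issue-summarization:latest",
              worker_count=3, gpu_count=1)
def train_model(output):
    ...  # same training code as in the notebook cell above
```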
@willingc @yuvipanda Is there existing tooling in the Jupyter community other than nbconvert to convert notebooks to code suitable for asynchronous batch execution?
/cc @wbuchwalter @gaocegege @yuvipanda @willingc
priority/p1 area/jupyter area/0.4.0