BatchFlow helps you conveniently work with random or sequential batches of your data and define data processing and machine learning workflows even for datasets that do not fit into memory.

Data Analysis Center

Last update: Dec 20, 2022

Overview

BatchFlow

For more details see the documentation and tutorials.

Main features:

flexible batch generation
deterministic and stochastic pipelines
dataset and pipeline joins and merges
data processing actions
flexible model configuration
within-batch parallelism
batch prefetching
ready to use ML models and proven NN architectures
convenient layers and helper functions to build custom models
a powerful research engine with parallel model training and extended experiment logging.

Basic usage

# parentheses let the method chain span several lines
my_workflow = (my_dataset.pipeline()
               .load('/some/path')
               .do_something()
               .do_something_else()
               .some_additional_action()
               .save('/to/other/path'))

The trick here is that all the processing actions are lazy. They are not executed until their results are needed, e.g. when you request a preprocessed batch:

my_workflow.run(BATCH_SIZE, shuffle=True, n_epochs=5)

or

for batch in my_workflow.gen_batch(BATCH_SIZE, shuffle=True, n_epochs=5):
    # only now are the actions fired and the data changed by the workflow defined earlier
    # actions are executed one by one, and here you get a fully processed batch
    ...

or

NUM_ITERS = 1000
for i in range(NUM_ITERS):
    processed_batch = my_workflow.next_batch(BATCH_SIZE, shuffle=True, n_epochs=None)
    # only now are the actions fired and the data changed by the workflow defined earlier
    # actions are executed one by one, and here you get a fully processed batch

Train a neural network

BatchFlow includes ready-to-use proven architectures like VGG, Inception, ResNet and many others. To apply them to your data, just choose a model, specify the inputs (like the number of classes or the image shape) and call train_model. Of course, you can also choose a loss function, an optimizer and many other parameters if you want.

from batchflow import B  # named expression that refers to batch components
from batchflow.models.tf import ResNet34

# parentheses let the method chain span several lines
my_workflow = (my_dataset.pipeline()
               .init_model('dynamic', ResNet34, config={
                   'inputs/images/shape': B('image_shape'),
                   'labels/classes': 10,
                   'initial_block/inputs': 'images'})
               .load('/some/path')
               .some_transform()
               .another_transform()
               .train_model('ResNet34', images=B('images'), labels=B('labels'))
               .run(BATCH_SIZE, shuffle=True))
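
The config keys above, such as 'inputs/images/shape', read like paths into a nested configuration. The sketch below shows one way such slash-separated keys could be expanded into nested dictionaries. The expand helper and the example shape value are assumptions made for illustration; this is not BatchFlow's own config code.

```python
def expand(flat_config):
    """Turn {'a/b/c': value} into {'a': {'b': {'c': value}}} (illustrative helper)."""
    nested = {}
    for key, value in flat_config.items():
        *parents, leaf = key.split('/')
        node = nested
        for part in parents:
            node = node.setdefault(part, {})   # create intermediate levels on demand
        node[leaf] = value
    return nested


config = expand({
    'inputs/images/shape': (28, 28, 1),        # hypothetical image shape
    'labels/classes': 10,
    'initial_block/inputs': 'images',
})
print(config['inputs']['images']['shape'])     # (28, 28, 1)
```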

For more advanced cases and detailed API see the documentation.

Installation

BatchFlow module is in the beta stage. Your suggestions and improvements are very welcome.

BatchFlow supports Python 3.5 or higher.

Stable python package

With modern pipenv

pipenv install batchflow

With old-fashioned pip

pip3 install batchflow

Development version

With modern pipenv

pipenv install git+https://github.com/analysiscenter/batchflow.git#egg=batchflow

With old-fashioned pip

pip3 install git+https://github.com/analysiscenter/batchflow.git

After that just import batchflow:

import batchflow as bf

Git submodule

In many cases it might be more convenient to install batchflow as a submodule in your project repository than as a python package.

git submodule add https://github.com/analysiscenter/batchflow.git
git submodule init
git submodule update

If your python file is located in another directory, you might need to add a path to batchflow:

import sys
sys.path.insert(0, "/path/to/batchflow")
import batchflow as bf

What is great about using a submodule is that every commit in your project can be linked to its own commit of the submodule. This is extremely convenient in a fast-paced research environment.

Relative import is also possible:

from .batchflow import Dataset

Projects based on BatchFlow

SeismiQB - ML for seismic interpretation
SeismicPro - ML for seismic processing
PetroFlow - ML for well interpretation
PyDEns - DL Solver for ODE and PDE
RadIO - ML for CT imaging
CardIO - ML for heart signals

Citing BatchFlow

Please cite BatchFlow in your publications if it helps your research.

Roman Khudorozhkov et al. BatchFlow library for fast ML workflows. 2017. doi:10.5281/zenodo.1041203

@misc{roman_kh_2017_1041203,
  author       = {Khudorozhkov, Roman and others},
  title        = {BatchFlow library for fast ML workflows},
  year         = 2017,
  doi          = {10.5281/zenodo.1041203},
  url          = {https://doi.org/10.5281/zenodo.1041203}
}

Comments

0 full project tutorial

This PR contains a new tutorial that builds a project from scratch using BatchFlow. The tutorial is intended as a BatchFlow intro with links to helpful documentation pages and to almost every other tutorial.

Plus, I made some fixes (ResNet configs), upgrades (SEBlock, fontsize in plot_images) and additions (show_confusion_matrix, model weights initialization with kaiming normal, branch arguments parsing, new datasets and their parsing) for it.

Almost all improvements are in https://github.com/analysiscenter/batchflow/pull/624 (except batchflow/models/utils.py and the tutorial).

opened by HollowPrincess 36
Image examples from dataset/examples/simple_but_ugly/ don't work out of the box

Hi! I've tried to run several examples from the directory. For example, trying to run https://github.com/analysiscenter/dataset/tree/master/examples/simple_but_ugly fails with a bunch of errors. It looks like several files with definitions (random_scale, random_rotate, convert_to_pil, etc.) are missing.

opened by mikhailkin 7
Learning rate features
Changelist:

[x] New decay interface;

[x] Ability to fetch learning rate: fetches='lr';

[x] Saving learning rate into iter_info;

[x] Notebook with decay experiments.

Fixes:

Docstrings for optimizer, decay, loss;

The bug of building several models occurring when prefetch is enabled, and model_config does not have an input shape.
opened by Dimonovez 6
Incremental Torch improvements
refactor BasePool: it can be split into multiple classes with clear functionality (done in #469)

improve Encoder/Decoder modules

refactor n_iters and decay configurations: no need to pass n_iters in the root configuration (done)

make sure every block is sent to the device: can be helpful with pre-trained models (done in #461)

refactor pyramid layers so they use common base

make Xception out of XceptionBlocks
opened by SergeyTsimfer 6
Eager Torch
This PR proposes to add EagerTorch model that:

can build off of batch_data during first call of the train method

allows for better usage of native torch modules

does not use redundant tf-like methods (make_inputs, has_classes, etc)
opened by SergeyTsimfer 6
Sampler classes
Main changes

Remove inner functions from Sampler.sample for multiprocessing (pickle, really) to work with Sampler-objects. Now all operations on samplers (&, |,..., +, -, ..., %) are implemented in corresponding subclasses of Sampler.

Add multiprocessing-example into the Sampler-tutorial.
opened by akoryagin 5
Regression metrics

This PR proposes to add regression metrics and tests. No tests provided yet.

Metrics are compatible with multi-output tasks, i.e. when targets have shape (n_samples, n_outputs). In this case, aggregation across outputs is available.

opened by nikita-klsh 5
Fix pylint
This PR fixes pylint output in the latest docker image analysiscenter1/ds-py3.

A lot of missing-function-docstring and abstract-method warnings occur in the batchflow/models/torch folder that we can't fix, because they originate on the torch side. They are fixing this and this.

So we can:

Turn a blind eye to these warnings.

Temporarily set additional pylint restrictions for the torch folder only, as is currently done.

@roman-kh
opened by nikita-klsh 4
Create release gh action
Make release actions:

build docs

calc test coverage

create python package and upload it to pypi.

Actions should fire at various statuses: created/edited, published, etc.
opened by roman-kh 4
Trackers
This PR proposes to add:

monitoring tools (namely, a new ResourceMonitor class, as well as context managers for convenience), for example

with monitor_resource('gpu', frequency=0.5) as monitor:
    ...  # train model

to better understand resource (cpu, gpu, memory) utilization

tracking tools (namely, new Notifier class) to provide better utilization of tqdm progress-bars in Pipeline, as well as to plot graphs of, for example, loss values, dynamically during model training

The Notifier class is also capable of using any of the resource monitoring utilities provided by ResourceMonitors. The old functionality of just passing n/True still works.
opened by SergeyTsimfer 3
TFModel improvements
This PR proposes to:

Make ConvBlock able to chain multiple layers, just like Torch version can

Simplify the logic of letter parsing and add the capability of using the R letter as a separate Branch with complex parameters

Swap all arguments inside calls to conv_block to keyword ones

Add squeeze-and-excitation versions of ResNet

Re-check all the tests of model compilation

Add various attention modules: some of them are available through S (stands for self-attention) letter, some of them are mods of Combine
opened by SergeyTsimfer 3
Refactor `inbatch_parallel`
As inbatch_parallel is no longer supposed to be used on its own, we can refactor it with the following goals in mind:

[ ] remove _use_self args

[ ] remove init/post functions: the container with init should be passed directly from Batch.apply_parallel, and the results should be post-processed in the Batch.apply_parallel as well

[ ] make inbatch_parallel a class: that would allow for easier introspection and parameter changes on the fly, for example, changing target to any other.
opened by SergeyTsimfer 2
Initialize random seed for processes

Each Python process starts with the same random seed initialization, which results in no randomness across processes. A batch or pipeline action is required to provide a random seed.

opened by roman-kh 0
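
One common remedy for the seeding problem described in the issue above is to derive a distinct but reproducible seed for each worker from a single base seed. The sketch below uses only the standard library; the worker_rng helper is hypothetical and is not part of BatchFlow.

```python
import random


def worker_rng(base_seed, worker_index):
    """Return a generator whose seed depends on both the base seed and the worker.

    String seeds are hashed deterministically by random.Random, so the same
    (base_seed, worker_index) pair always yields the same stream.
    """
    return random.Random(f"{base_seed}:{worker_index}")


# Naive approach: every worker reuses the same seed and draws the same numbers.
naive_draws = [random.Random(42).random() for _ in range(3)]

# Derived seeds: each worker gets its own stream, still reproducible across runs.
derived_draws = [worker_rng(42, index).random() for index in range(3)]
repeated_draws = [worker_rng(42, index).random() for index in range(3)]
```

Here naive_draws contains three identical values, while derived_draws contains three different ones that repeat exactly when the same base seed is used again.
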
Minimize requirements

BatchFlow should require only numpy, as it is used everywhere. All other packages should be optional, and modules / functions should provide a clear message if the needed requirements are not installed.

opened by roman-kh 0
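
The "clear message" behaviour requested in the issue above could look roughly like the following sketch. The require helper and the feature names are hypothetical and only illustrate the pattern of importing optional packages on demand.

```python
import importlib


def require(module_name, feature):
    """Import an optional dependency, or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Note: the pip package name may differ from the module name.
        raise ImportError(
            f"{feature} needs the optional package '{module_name}'; "
            f"try: pip install {module_name}"
        ) from exc


json_module = require("json", "Config export")      # stdlib module, always importable

try:
    require("surely_missing_package_xyz", "Plotting")
except ImportError as error:
    message = str(error)                             # names the feature and the package
```

Calling require inside the function that needs the package, rather than at module import time, keeps the base installation usable without the optional dependency.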

Releases(0.8.0)

0.8.0(Dec 30, 2022)
This release fixes crop behavior of TorchModel, as well as adds new blocks and methods:

InternBlock with deformable convolutions

separate BottleneckBlock that extends the functionality of ResBlock

method for getting a reference to the current TorchModel instance inside train/predict contexts

mode parameter for train and predict methods to control nn.Module behavior.

Also, this is the first version released after NumPy deprecated automatic casting of misshapen (ragged) arrays to dtype=object, so the affected places are fixed.
0.7.7(Nov 7, 2022)

This release fixes one small TorchModel bug.
0.7.6(Oct 21, 2022)

This release changes the way Batch.apply_parallel works: it now accepts both init and post functions, and should be the preferable way to decorate batch methods (by marking them with decorators.apply_parallel).

Other than that, there are a few new building blocks for TorchModel, parameter to pad the last microbatches to full microbatch_size, and small bug fixes.
0.7.5(Jul 7, 2022)
Models

added gradient clipping and new layers

Plot

refactored existing plots across the library to rely on plot, introduced in the previous release

Research improvements

modified stored configs to use aliases instead of actual values: that fixes some pickling problems

0.7.0(Jun 20, 2022)
Models

refactored model building procedure: split modules into separate entities like EncoderModule and DecoderModule

introduced new modules that import ready-to-use networks from other libraries: currently, we support TIMM and HuggingFace libraries

better module repr

check #645 for other changes

Plot

introduced plot module with utilities for displaying images and curves

plot has a few tutorials with lots of examples: refer to them to get a more in-depth understanding of plot usages

Research improvements

added a separate Storage class that manages output streams of research results.

various fixes and QoL changes

0.6.0(Feb 17, 2022)

Named expressions

Added BA

Models

Removed tensorflow

Research

Added research module for massive parallel model training / evaluation.

Tutorials and examples

New tutorials.

Also added some tests.
v0.5.0(Jun 10, 2021)

Lots of improvements and bug fixes
v0.5.0beta3(Mar 5, 2021)

v0.5.0beta1(Mar 5, 2021)

v0.5.0beta2(Mar 5, 2021)

0.3.0(Jan 20, 2018)
Bug fixes and a lot of refactoring.

Batch

Components can be added dynamically during execution. Parameters order is changed in apply_transform and apply_transform_all.

Named expressions:

B() returns the batch itself.

F takes args and kwargs.

added R (random) and L (lambda).

Pipeline

Refactored models directory and variables directory. Added print. Removed print_variable.

Tensorflow

Layers

Added:

1d and 3d bilinear resize

3d depth to space

separable transposed convolutions

subpixel convolutions

bilinear additive resize

upsample

alpha dropout

universal pooling and global_pooling

Changed:

conv_block supports residuals (with sum and concat) and upsample layers.

TFModel:

new methods: upsample, Pyramid Pooling module, Atrous Spatial Pyramid Pooling module

model predictions can be an output of predefined operations (sigmoid, softmax, argmax, etc)

Model zoo

Added DenseNetFC, ResNetAttention, VNet, RefineNet, Faster-RCNN, Global Convolution Network, Encoder-decoder, Inception-ResNet v2, MobileNet v2.
0.2.2(Nov 23, 2017)
Changed model structure and configuration (with default_config() and build_config())

Added ready to use TensorFlow models: VGG, Inception v1, v3, v4, ResNet, MobileNet, SqueezeNet, DenseNet, FCN32, FCN16, FCN8, UNet, LinkNet.

Added new layers: fractional_max_pooling.

Dimensionality for all layers is now inferred from the input tensor shape.

Added fake njit decorator for environments without numba installed.
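
A fallback of this kind is typically a decorator that returns the function unchanged when numba cannot be imported. The sketch below shows the general pattern under that assumption; it is not the code shipped in BatchFlow, and the add and scale functions are made up for the example.

```python
try:
    from numba import njit
except ImportError:
    def njit(*args, **kwargs):
        """Fake njit: leave the function as plain Python when numba is missing."""
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]             # used as a bare decorator: @njit
        return lambda func: func       # used with options: @njit(...)


@njit
def add(a, b):
    return a + b


@njit(cache=False)
def scale(x, factor):
    return x * factor


print(add(1, 2), scale(2.0, 3.0))      # 3 6.0
```

With numba installed the functions are compiled; without it they run as ordinary Python, so the calling code does not need to change.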

0.2.0(Nov 3, 2017)

Class-based models