BatchFlow helps you conveniently work with random or sequential batches of your data and define data processing and machine learning workflows even for datasets that do not fit into memory.

Overview

BatchFlow

For more details see the documentation and tutorials.

Main features:

  • flexible batch generation
  • deterministic and stochastic pipelines
  • dataset and pipeline joins and merges
  • data processing actions
  • flexible model configuration
  • within-batch parallelism
  • batch prefetching
  • ready-to-use ML models and proven NN architectures
  • convenient layers and helper functions to build custom models
  • a powerful research engine with parallel model training and extended experiment logging.

Basic usage

# Parentheses let the method chain span several lines
my_workflow = (my_dataset.pipeline()
               .load('/some/path')
               .do_something()
               .do_something_else()
               .some_additional_action()
               .save('/to/other/path'))

The trick here is that all the processing actions are lazy. They are not executed until their results are needed, e.g. when you request a preprocessed batch:

my_workflow.run(BATCH_SIZE, shuffle=True, n_epochs=5)

or

for batch in my_workflow.gen_batch(BATCH_SIZE, shuffle=True, n_epochs=5):
    # only now the actions are fired and data is being changed with the workflow defined earlier
    # actions are executed one by one and here you get a fully processed batch
    pass  # placeholder: use the processed batch here

or

NUM_ITERS = 1000
for i in range(NUM_ITERS):
    processed_batch = my_workflow.next_batch(BATCH_SIZE, shuffle=True, n_epochs=None)
    # only now the actions are fired and data is changed with the workflow defined earlier
    # actions are executed one by one and here you get a fully processed batch
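
To make the lazy behaviour concrete, here is a minimal pure-Python sketch. It is not BatchFlow's implementation: the LazyPipeline class and its add method are hypothetical names invented for this illustration. Only the idea comes from the text above: actions are recorded first and run when a batch is requested.

```python
class LazyPipeline:
    """Toy stand-in for a lazy workflow (hypothetical, not the BatchFlow API)."""

    def __init__(self, data):
        self.data = list(data)
        self.actions = []          # recorded, not yet executed

    def add(self, func):
        # Only remember the action; return self so calls can be chained.
        self.actions.append(func)
        return self

    def gen_batch(self, batch_size):
        # Actions fire here, batch by batch, in the order they were added.
        for start in range(0, len(self.data), batch_size):
            batch = self.data[start:start + batch_size]
            for action in self.actions:
                batch = action(batch)
            yield batch


calls = []                          # log of executed actions


def double(batch):
    calls.append('double')
    return [x * 2 for x in batch]


def increment(batch):
    calls.append('increment')
    return [x + 1 for x in batch]


workflow = LazyPipeline(range(6)).add(double).add(increment)
calls_before = list(calls)          # nothing has run yet
batches = list(workflow.gen_batch(4))
```

Building the chain leaves the call log empty; the log fills only while gen_batch is being iterated.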

Train a neural network

BatchFlow includes ready-to-use proven architectures like VGG, Inception, ResNet and many others. To apply them to your data, just choose a model, specify the inputs (like the number of classes or the image shape) and call train_model. Of course, you can also choose a loss function, an optimizer and many other parameters if you want.

from batchflow import B  # named expression that refers to batch components
from batchflow.models.tf import ResNet34

my_workflow = (my_dataset.pipeline()
               .init_model('dynamic', ResNet34, config={
                   'inputs/images/shape': B('image_shape'),
                   'labels/classes': 10,
                   'initial_block/inputs': 'images'})
               .load('/some/path')
               .some_transform()
               .another_transform()
               .train_model('ResNet34', images=B('images'), labels=B('labels'))
               .run(BATCH_SIZE, shuffle=True))

For more advanced cases and detailed API see the documentation.

Installation

BatchFlow module is in the beta stage. Your suggestions and improvements are very welcome.

BatchFlow supports python 3.5 or higher.

Stable python package

With modern pipenv

pipenv install batchflow

With old-fashioned pip

pip3 install batchflow

Development version

With modern pipenv

pipenv install git+https://github.com/analysiscenter/batchflow.git#egg=batchflow

With old-fashioned pip

pip3 install git+https://github.com/analysiscenter/batchflow.git

After that just import batchflow:

import batchflow as bf

Git submodule

In many cases it might be more convenient to install batchflow as a submodule in your project repository than as a python package.

git submodule add https://github.com/analysiscenter/batchflow.git
git submodule init
git submodule update

If your python file is located in another directory, you might need to add a path to batchflow:

import sys
sys.path.insert(0, "/path/to/batchflow")
import batchflow as bf

What is great about using a submodule is that every commit in your project can be linked to its own commit of the submodule. This is extremely convenient in a fast-paced research environment.

Relative import is also possible:

from .batchflow import Dataset

Projects based on BatchFlow

Citing BatchFlow

Please cite BatchFlow in your publications if it helps your research.

Roman Khudorozhkov et al. BatchFlow library for fast ML workflows. 2017. doi:10.5281/zenodo.1041203

@misc{roman_kh_2017_1041203,
  author       = {Khudorozhkov, Roman and others},
  title        = {BatchFlow library for fast ML workflows},
  year         = 2017,
  doi          = {10.5281/zenodo.1041203},
  url          = {https://doi.org/10.5281/zenodo.1041203}
}
Comments
  • 0 full project tutorial

    This PR contains a new tutorial that builds a project from scratch using BatchFlow. The tutorial is intended as a BatchFlow intro containing links to helpful pages in the documentation and to almost every other tutorial.

    Plus, I made some fixes (ResNet configs), upgrades (SEBlock, fontsize in plot_images) and additions (show_confusion_matrix, model weights initialization with kaiming normal, branch arguments parsing, new datasets and their parsing) for it.

    Almost all improvements are in https://github.com/analysiscenter/batchflow/pull/624 (except batchflow/models/utils.py and the tutorial).

    opened by HollowPrincess 36
  • Image examples from dataset/examples/simple_but_ugly/ don't work out of the box

    Hi! I've tried to run several examples from the directory. For example, trying to run https://github.com/analysiscenter/dataset/tree/master/examples/simple_but_ugly fails with a bunch of errors. It looks like several files with definitions (random_scale, random_rotate, convert_to_pil, etc.) are missing.

    opened by mikhailkin 7
  • Learning rate features

    Changelist:

    • [x] New decay interface;
    • [x] Ability to fetch learning rate: fetches= 'lr';
    • [x] Saving learning rate into iter_info;
    • [x] Notebook with decay experiments.

    Fixes:

    • Docstrings for optimizer, decay, loss;
    • The bug of building several models occurring when prefetch is enabled, and model_config does not have an input shape.
    opened by Dimonovez 6
  • Incremental Torch improvements

    • refactor BasePool: it can be split into multiple classes with clear functionality (done in #469)

    • improve Encoder/Decoder modules

    • refactor n_iters and decay configurations: no need to pass n_iters in the root configuration (done)

    • make sure every block is sent to the device: can be helpful with pre-trained models (done in #461)

    • refactor pyramid layers so they use common base

    • make Xception out of XceptionBlocks

    opened by SergeyTsimfer 6
  • Eager Torch

    This PR proposes to add EagerTorch model that:

    • can build off of batch_data during first call of the train method

    • allows for better usage of native torch modules

    • does not use redundant tf-like methods (make_inputs, has_classes, etc)

    opened by SergeyTsimfer 6
  • Sampler classes

    Main changes

    • Remove inner functions from Sampler.sample for multiprocessing (pickle, really) to work with Sampler-objects. Now all operations on samplers (&, |,..., +, -, ..., %) are implemented in corresponding subclasses of Sampler.
    • Add multiprocessing-example into the Sampler-tutorial.
    opened by akoryagin 5
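
The pickling constraint mentioned in this PR can be reproduced with the standard library alone. The sketch below is a toy under stated assumptions: the classes Sampler, UniformSampler and SumSampler and the function closure_sum are hypothetical and are not BatchFlow's sampler code. It only shows why moving an operation into a subclass makes the combined object picklable, while an inner function does not pickle.

```python
import pickle
import random


class Sampler:
    """Toy base class (hypothetical); `+` returns a picklable object."""

    def sample(self, size):
        raise NotImplementedError

    def __add__(self, other):
        return SumSampler(self, other)


class UniformSampler(Sampler):
    def __init__(self, low, high, seed=None):
        self.low, self.high = low, high
        self.rng = random.Random(seed)

    def sample(self, size):
        return [self.rng.uniform(self.low, self.high) for _ in range(size)]


class SumSampler(Sampler):
    """Result of `left + right`: state lives in attributes, not in a closure."""

    def __init__(self, left, right):
        self.left, self.right = left, right

    def sample(self, size):
        pairs = zip(self.left.sample(size), self.right.sample(size))
        return [a + b for a, b in pairs]


def closure_sum(left, right):
    # The style being replaced: an inner function capturing its operands.
    def sample(size):
        pairs = zip(left.sample(size), right.sample(size))
        return [a + b for a, b in pairs]
    return sample


combined = UniformSampler(0, 1, seed=0) + UniformSampler(10, 11, seed=1)
restored = pickle.loads(pickle.dumps(combined))   # works: plain class instances

try:
    pickle.dumps(closure_sum(UniformSampler(0, 1), UniformSampler(10, 11)))
    closure_picklable = True
except (pickle.PicklingError, AttributeError):
    closure_picklable = False                     # local functions do not pickle
```

Since multiprocessing sends objects to workers by pickling them, only the class-based variant can cross a process boundary.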
  • Regression metrics

    This PR proposes to add regression metrics and tests. No tests provided yet.

    Metrics are compatible with multi output task i.e. when targets have shape (n_samples, n_outputs). In this case aggregation among outputs is available.

    opened by nikita-klsh 5
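
As an illustration of what aggregation among outputs can mean for targets of shape (n_samples, n_outputs), here is a small pure-Python sketch. The function name, the agg parameter and the use of nested lists are assumptions made for this example; they are not the interface proposed in the PR.

```python
def mean_squared_error(targets, predictions, agg=None):
    """MSE for nested lists of shape (n_samples, n_outputs).

    agg=None returns one value per output; agg='mean' averages over outputs.
    """
    n_samples = len(targets)
    n_outputs = len(targets[0])
    per_output = [
        sum((targets[i][j] - predictions[i][j]) ** 2 for i in range(n_samples)) / n_samples
        for j in range(n_outputs)
    ]
    if agg is None:
        return per_output
    if agg == 'mean':
        return sum(per_output) / n_outputs
    raise ValueError(f"Unknown aggregation: {agg!r}")


targets = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
predictions = [[1.0, 12.0], [2.0, 18.0], [3.0, 32.0]]

per_output = mean_squared_error(targets, predictions)           # one MSE per output
overall = mean_squared_error(targets, predictions, agg='mean')  # single number
```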
  • Fix pylint

    This PR fixes pylint output in the latest docker image analysiscenter1/ds-py3.

    A lot of missing-function-docstring and abstract-method warnings occur in the batchflow/models/torch folder that we can't fix, because they originate on the torch side. They are fixing this and this.

    So we can:

    1. Turn a blind eye to these warnings.
    2. Temporarily set additional pylint restrictions for the torch folder only, as is currently done.

    @roman-kh

    opened by nikita-klsh 4
  • Create release gh action

    Make release actions:

    • build docs
    • calc test coverage
    • create python package and upload it to pypi.

    Actions should fire at various statuses: created/edited, published, etc.

    opened by roman-kh 4
  • Trackers

    This PR proposes to add:

    • monitoring tools (namely, new ResourceMonitor class, as well as context managers for convenience), for example
    with monitor_resource('gpu', frequency=0.5) as monitor:
        # train model
    

    to better understand resource (cpu, gpu, memory) utilization

    • tracking tools (namely, new Notifier class) to provide better utilization of tqdm progress-bars in Pipeline, as well as to plot graphs of, for example, loss values, dynamically during model training

    • Notifier class is also capable of using any of the resource monitoring utilities provided by ResourceMonitors. Old functionality of just passing n/True is working too

    opened by SergeyTsimfer 3
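
One way such a monitoring context manager could work is to poll a measurement from a background thread while the body of the with block runs. The sketch below is an assumption-laden toy: PollingMonitor and its probe argument are hypothetical names, and the real ResourceMonitor may be built quite differently.

```python
import threading
import time


class PollingMonitor:
    """Calls `probe` every `frequency` seconds in a background thread (toy sketch)."""

    def __init__(self, probe, frequency=0.5):
        self.probe = probe
        self.frequency = frequency
        self.data = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.data.append(self.probe())
            self._stop.wait(self.frequency)   # sleep, but wake up early on stop

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._stop.set()
        self._thread.join()
        return False                          # do not swallow exceptions


with PollingMonitor(time.perf_counter, frequency=0.01) as monitor:
    time.sleep(0.1)                           # stands in for "train model"

samples = monitor.data                        # one reading per polling step
```

Here the probe is just a clock; a real probe would return CPU, GPU or memory usage.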
  • TFModel improvements

    This PR proposes to:

    • Make ConvBlock able to chain multiple layers, just like Torch version can

    • Simplify logic of letter parsing, as well as adding capability of using R letter as separate Branch with complex parameters

    • Swap all arguments inside calls to conv_block to keyword ones

    • Add squeeze-and-excitation versions of ResNet

    • Re-check all the tests of model compilation

    • Add various attention modules: some of them are available through S (stands for self-attention) letter, some of them are mods of Combine

    opened by SergeyTsimfer 3
  • Refactor `inbatch_parallel`

    As inbatch_parallel is no longer supposed to be used on its own, we can refactor it with the following goals in mind:

    • [ ] remove _use_self args
    • [ ] remove init/post functions: the container with init should be passed directly from Batch.apply_parallel, and the results should be post-processed in the Batch.apply_parallel as well
    • [ ] make inbatch_parallel a class: that would allow for easier introspection and parameter changes on the fly, for example, target to any other.
    opened by SergeyTsimfer 2
  • Initialize random seed for processes

    Each Python process starts with the same random seed initialization, which results in no randomness across processes. A batch or pipeline action is required to provide a random seed.

    opened by roman-kh 0
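
The effect described in this issue, and one possible remedy, can be sketched with the standard library. For simplicity the workers are simulated in a single process, and worker_seed and worker_draw are hypothetical helpers, not BatchFlow functions.

```python
import random


def worker_seed(base_seed, worker_index):
    """Derive a distinct, reproducible seed for each worker (hypothetical helper)."""
    # Seeding with a string mixes both values into the generator state.
    return random.Random(f"{base_seed}-{worker_index}").getrandbits(32)


def worker_draw(seed, n=3):
    # Each worker builds its own generator instead of sharing the global one.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# The problem: every worker starts from the same seed and draws the same numbers.
same = [worker_draw(42) for _ in range(3)]

# A remedy: combine the base seed with the worker index.
distinct = [worker_draw(worker_seed(42, i)) for i in range(3)]
```

Deriving the seed from a base seed keeps runs reproducible while giving every worker its own stream.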
  • Minimize requirements

    BatchFlow should require numpy only as it is used everywhere. All other packages should be optional and modules / functions should provide a clear message if needed reqs are not installed.

    opened by roman-kh 0
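
A common pattern for optional dependencies is to import them on demand and re-raise with an actionable message. The helper below is a minimal sketch; the name require and its purpose argument are invented for this example, and the install hint assumes that the package name matches the module name.

```python
import importlib


def require(module_name, purpose):
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as error:
        raise ImportError(
            f"'{module_name}' is required for {purpose}. "
            f"Install it, e.g. with `pip install {module_name}`."
        ) from error


json = require('json', 'the demo')            # available: returns the module

try:
    require('surely_not_installed_pkg', 'plotting')
    message = None
except ImportError as error:
    message = str(error)                      # tells the user what to install and why
```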
Releases(0.8.0)
  • 0.8.0(Dec 30, 2022)

    This release fixes crop behavior of TorchModel, as well as adds new blocks and methods:

    • InternBlock with deformable convolutions
    • separate BottleneckBlock that extends the functionality of ResBlock
    • method for getting a reference to the current TorchModel instance inside train/predict contexts
    • mode parameter for train and predict methods to control nn.Module behavior.

    Also, this is the first version released after NumPy deprecated automatic casting of misshapen arrays to dtype=object, so this is fixed in some places.

  • 0.7.7(Nov 7, 2022)

  • 0.7.6(Oct 21, 2022)

    This release changes the way Batch.apply_parallel works: now it accepts both init and post functions, and should be the preferable way to decorate batch methods (by marking them with decorators.apply_parallel).

    Other than that, there are a few new building blocks for TorchModel, parameter to pad the last microbatches to full microbatch_size, and small bug fixes.
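
The init / post idea can be sketched independently of the library: init produces the per-item arguments, the decorated method runs on each item in parallel, and post collects the results. Everything below is a hypothetical toy (ToyBatch, the decorator signature, the use of threads) and does not reproduce the signature of Batch.apply_parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import wraps


def apply_parallel(init, post):
    """Toy decorator: `init` yields per-item arguments, `post` collects results."""
    def decorator(method):
        @wraps(method)
        def wrapper(self, *args, **kwargs):
            items = init(self)
            with ThreadPoolExecutor() as executor:
                futures = [executor.submit(method, self, item, *args, **kwargs)
                           for item in items]
                # Reading futures in submission order keeps the item order.
                results = [future.result() for future in futures]
            return post(self, results)
        return wrapper
    return decorator


class ToyBatch:
    def __init__(self, items):
        self.items = list(items)

    def _init(self):
        return self.items

    def _post(self, results):
        self.items = results
        return self

    @apply_parallel(init=_init, post=_post)
    def scale(self, item, factor=1):
        # Written for a single item; the decorator maps it over the batch.
        return item * factor


batch = ToyBatch([1, 2, 3]).scale(factor=10)
```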

  • 0.7.5(Jul 7, 2022)

    Models

    • added gradient clipping and new layers

    Plot

    • refactored existing plots across the library to rely on plot, introduced in the previous release

    Research improvements

    • modified stored configs to use aliases instead of actual values: that fixes some pickling problems
  • 0.7.0(Jun 20, 2022)

    Models

    • refactored model building procedure: split modules into separate entities like EncoderModule and DecoderModule
    • introduced new modules that import ready-to-use networks from other libraries: currently, we support TIMM and HuggingFace libraries
    • better module repr
    • check #645 for other changes

    Plot

    • introduced plot module with utilities for displaying images and curves
    • plot has a few tutorials with lots of examples: refer to them to get a more in-depth understanding of plot usages

    Research improvements

    • added separate Storage class, that manages output streams of research results.
    • various fixes and QoL changes
  • 0.6.0(Feb 17, 2022)

    Named expressions

    Added BA

    Models

    Removed tensorflow

    Research

    Added research module for massive parallel model training / evaluation.

    Tutorials and examples

    New tutorials.

    Also added some tests.

  • v0.5.0(Jun 10, 2021)

  • v0.5.0beta3(Mar 5, 2021)

  • v0.5.0beta1(Mar 5, 2021)

  • v0.5.0beta2(Mar 5, 2021)

  • 0.3.0(Jan 20, 2018)

    Bug fixes and a lot of refactoring.

    Batch

    Components can be added dynamically during execution. Parameters order is changed in apply_transform and apply_transform_all.

    Named expressions:

    • B() returns the batch itself.
    • F takes args and kwargs.
    • added R (random) and L (lambda).

    Pipeline

    Refactored models directory and variables directory. Added print. Removed print_variable.

    Tensorflow

    Layers

    Added:

    • 1d and 3d bilinear resize
    • 3d depth to space
    • separable transposed convolutions
    • subpixel convolutions
    • bilinear additive resize
    • upsample
    • alpha dropout
    • universal pooling and global_pooling

    Changed:

    • conv_block support residuals (with sum and concat) and upsample layers.

    TFModel:

    • new methods: upsample, Pyramid Pooling module, Atrous Spatial Pyramid Pooling module
    • model predictions can be an output of predefined operations (sigmoid, softmax, argmax, etc)

    Model zoo

    Added DenseNetFC, ResNetAttention, VNet, RefineNet, Faster-RCNN, Global Convolution Network, Encoder-decoder, Inception-ResNet v2, MobileNet v2.

  • 0.2.2(Nov 23, 2017)

    • Changed model structure and configuration (with default_config() and build_config())

    • Added ready to use TensorFlow models: VGG, Inception v1, v3, v4, ResNet, MobileNet, SqueezeNet, DenseNet, FCN32, FCN16, FCN8, UNet, LinkNet.

    • Added new layers: fractional_max_pooling.

    • Dimensionality for all layers is now inferred from the input tensor shape.

    • Added fake njit decorator for environments without numba installed.
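
A fallback of this kind is usually a no-op decorator that is defined only when numba cannot be imported. The sketch below shows the general pattern; it is an assumption about how such a fallback can be written, not a copy of the one shipped in this release.

```python
try:
    from numba import njit
except ImportError:
    def njit(*args, **kwargs):
        """No-op replacement supporting both `@njit` and `@njit(...)`."""
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]                 # used as bare @njit
        return lambda func: func           # used as @njit(nogil=True), etc.


@njit
def total(values):
    # Plain loop: compiled when numba is present, ordinary Python otherwise.
    result = 0.0
    for value in values:
        result += value
    return result


@njit(nogil=True)
def square(x):
    return x * x
```

Code decorated this way runs unchanged in both environments, only slower without numba.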

  • 0.2.0(Nov 3, 2017)
