Code to reproduce result:
I tried to train a model on the balloon dataset by following the d2go beginner example (https://github.com/facebookresearch/d2go/blob/master/demo/d2go_beginner.ipynb).
import os
import json
import numpy as np
import cv2

from detectron2.structures import BoxMode
from detectron2.data import MetadataCatalog, DatasetCatalog


def get_balloon_dicts(img_dir):
    json_file = os.path.join(img_dir, "via_region_data.json")
    with open(json_file) as f:
        imgs_anns = json.load(f)

    dataset_dicts = []
    for idx, v in enumerate(imgs_anns.values()):
        record = {}

        filename = os.path.join(img_dir, v["filename"])
        height, width = cv2.imread(filename).shape[:2]

        record["file_name"] = filename
        record["image_id"] = idx
        record["height"] = height
        record["width"] = width

        annos = v["regions"]
        objs = []
        for _, anno in annos.items():
            assert not anno["region_attributes"]
            anno = anno["shape_attributes"]
            px = anno["all_points_x"]
            py = anno["all_points_y"]
            poly = [(x + 0.5, y + 0.5) for x, y in zip(px, py)]
            poly = [p for x in poly for p in x]

            obj = {
                "bbox": [np.min(px), np.min(py), np.max(px), np.max(py)],
                "bbox_mode": BoxMode.XYXY_ABS,
                "segmentation": [poly],
                "category_id": 0,
            }
            objs.append(obj)
        record["annotations"] = objs
        dataset_dicts.append(record)
    return dataset_dicts


for d in ["train", "val"]:
    DatasetCatalog.register("balloon_" + d, lambda d=d: get_balloon_dicts("balloon/" + d))
    MetadataCatalog.get("balloon_" + d).set(thing_classes=["balloon"], evaluator_type="coco")
balloon_metadata = MetadataCatalog.get("balloon_train")
from d2go.runner import Detectron2GoRunner
from d2go.model_zoo import model_zoo


def prepare_for_launch():
    runner = Detectron2GoRunner()
    cfg = runner.get_default_cfg()
    cfg.merge_from_file(model_zoo.get_config_file("faster_rcnn_fbnetv3a_C4.yaml"))
    cfg.MODEL_EMA.ENABLED = False
    cfg.DATASETS.TRAIN = ("balloon_train",)
    cfg.DATASETS.TEST = ("balloon_val",)
    cfg.DATALOADER.NUM_WORKERS = 2
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("faster_rcnn_fbnetv3a_C4.yaml")  # let training initialize from the model zoo
    cfg.SOLVER.IMS_PER_BATCH = 2
    cfg.SOLVER.BASE_LR = 0.00025  # pick a good LR
    cfg.SOLVER.MAX_ITER = 600  # 600 iterations is enough for this toy dataset; a practical dataset needs longer training
    cfg.SOLVER.STEPS = []  # do not decay the learning rate
    cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128  # faster, and good enough for this toy dataset (default: 512)
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # only one class (balloon); see https://detectron2.readthedocs.io/tutorials/datasets.html#update-the-config-for-new-datasets
    # NOTE: this config is the number of classes; a few popular unofficial tutorials incorrectly use num_classes+1 here.
    cfg.OUTPUT_DIR = 'balloon_model'
    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
    return cfg, runner


cfg, runner = prepare_for_launch()
model = runner.build_model(cfg)
runner.do_train(cfg, model, resume=False)

cfg.MODEL.WEIGHTS = 'balloon_model/model_final.pth'
metrics = runner.do_test(cfg, model)
print(metrics)
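For reference, the registered dataset can be spot-checked with detectron2's Visualizer before training (a quick sketch adapted from the detectron2 balloon tutorial; the output filenames are illustrative and this step is not part of the run above):

import random
from detectron2.utils.visualizer import Visualizer

# Draw a few registered training samples to verify the polygon/bbox conversion.
dataset_dicts = get_balloon_dicts("balloon/train")
for d in random.sample(dataset_dicts, 3):
    img = cv2.imread(d["file_name"])
    visualizer = Visualizer(img[:, :, ::-1], metadata=balloon_metadata, scale=0.5)
    vis = visualizer.draw_dataset_dict(d)
    # Write to disk so this also works outside a notebook.
    cv2.imwrite("vis_" + os.path.basename(d["file_name"]), vis.get_image()[:, :, ::-1])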
Result
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.021
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.061
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.012
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.001
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.041
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.004
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.118
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.204
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.012
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.333
And the inference results on test images are terrible.
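For context, a minimal sketch of how such inference can be run, assuming d2go's DemoPredictor as used in the beginner notebook (the image path below is a placeholder):

from d2go.utils.demo_predictor import DemoPredictor
from detectron2.utils.visualizer import Visualizer

# Placeholder path: use any image from balloon/val.
im = cv2.imread("balloon/val/example.jpg")
predictor = DemoPredictor(model)
outputs = predictor(im)

v = Visualizer(im[:, :, ::-1], metadata=balloon_metadata, scale=0.8)
out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
cv2.imwrite("prediction.jpg", out.get_image()[:, :, ::-1])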
Expected Result
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.494
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.651
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.543
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.104
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.757
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.204
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.526
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.526
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.118
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.810