Semi-Supervised Learning, Object Detection, ICCV2021

Overview

End-to-End Semi-Supervised Object Detection with Soft Teacher

By Mengde Xu*, Zheng Zhang*, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, Zicheng Liu.

This repo is the official implementation of the ICCV 2021 paper "End-to-End Semi-Supervised Object Detection with Soft Teacher".

Citation

@article{xu2021end,
  title={End-to-End Semi-Supervised Object Detection with Soft Teacher},
  author={Xu, Mengde and Zhang, Zheng and Hu, Han and Wang, Jianfeng and Wang, Lijuan and Wei, Fangyun and Bai, Xiang and Liu, Zicheng},
  journal={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021}
}

Main Results

Partial Labeled Data

Following STAC [1], we evaluate on 5 different data splits for each setting and report the average performance over the 5 splits. The results are shown below.

1% labeled data

| Method | mAP | Model Weights | Config Files |
| --- | --- | --- | --- |
| Baseline | 10.0 | - | Config |
| Ours (thr=5e-2) | 21.62 | Drive | Config |
| Ours (thr=1e-3) | 22.64 | Drive | Config |

5% labeled data

| Method | mAP | Model Weights | Config Files |
| --- | --- | --- | --- |
| Baseline | 20.92 | - | Config |
| Ours (thr=5e-2) | 30.42 | Drive | Config |
| Ours (thr=1e-3) | 31.7 | Drive | Config |

10% labeled data

| Method | mAP | Model Weights | Config Files |
| --- | --- | --- | --- |
| Baseline | 26.94 | - | Config |
| Ours (thr=5e-2) | 33.78 | Drive | Config |
| Ours (thr=1e-3) | 34.7 | Drive | Config |

Full Labeled Data

Faster R-CNN (ResNet-50)

| Model | mAP | Model Weights | Config Files |
| --- | --- | --- | --- |
| Baseline | 40.9 | - | Config |
| Ours (thr=5e-2) | 44.05 | Drive | Config |
| Ours (thr=1e-3) | 44.6 | Drive | Config |
| Ours* (thr=5e-2) | 44.5 | - | Config |
| Ours* (thr=1e-3) | 44.9 | - | Config |

Faster R-CNN (ResNet-101)

| Model | mAP | Model Weights | Config Files |
| --- | --- | --- | --- |
| Baseline | 43.8 | - | Config |
| Ours* (thr=5e-2) | 46.8 | - | Config |
| Ours* (thr=1e-3) | 47.3 | - | Config |

Notes

  • Ours* means the model was trained with a longer schedule.
  • thr indicates model.test_cfg.rcnn.score_thr in the config files. This inference trick was first introduced by Instant-Teaching [2].
  • All models are trained on 8 V100 GPUs.
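
For orientation, the `thr` setting in the notes above is a nested key in an MMDetection-style Python config. The fragment below is a minimal sketch assuming the usual nested-dict layout; only the key path `model.test_cfg.rcnn.score_thr` comes from the notes, and every other entry of a real config is omitted.

```python
# Illustrative fragment of an MMDetection-style config; all other keys omitted.
model = dict(
    test_cfg=dict(
        rcnn=dict(
            # Score threshold applied to detections at inference time.
            # 1e-3 and 5e-2 are the two values reported in the tables.
            score_thr=1e-3,
        )
    )
)
```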

Usage

Requirements

  • Ubuntu 16.04
  • Anaconda3 with python=3.6
  • Pytorch=1.9.0
  • mmdetection=2.16.0+fe46ffe
  • mmcv=1.3.9
  • wandb=0.10.31

Notes

  • We use wandb for visualization. If you don't want to use it, comment out lines 273-284 in configs/soft_teacher/base.py.
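
As an alternative to editing the file by hand, the logger hook could be filtered out programmatically. The sketch below assumes that the lines mentioned above register an mmcv `WandbLoggerHook` inside a `log_config` dict; that is an assumption about the config layout, and the `example` dict is hypothetical rather than copied from configs/soft_teacher/base.py.

```python
import copy


def drop_wandb_hooks(log_config):
    """Return a copy of log_config without any WandbLoggerHook entries."""
    cleaned = copy.deepcopy(log_config)
    cleaned["hooks"] = [
        hook
        for hook in cleaned.get("hooks", [])
        if hook.get("type") != "WandbLoggerHook"
    ]
    return cleaned


# Hypothetical log_config used only to exercise the helper.
example = dict(
    interval=50,
    hooks=[
        dict(type="TextLoggerHook", by_epoch=False),
        dict(type="WandbLoggerHook", by_epoch=False),
    ],
)
print(drop_wandb_hooks(example)["hooks"])
```

The helper leaves the input untouched, so the original config can still be used when wandb logging is wanted.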

Installation

make install

Data Preparation

  • Download the COCO dataset
  • Execute the following command to generate data set splits:
# YOUR_DATA should be a directory that contains the COCO dataset.
# For example:
# YOUR_DATA/
#  coco/
#     train2017/
#     val2017/
#     unlabeled2017/
#     annotations/
ln -s ${YOUR_DATA} data
bash tools/dataset/prepare_coco_data.sh conduct
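
Before running the preparation script, it may help to confirm that the layout described in the comments above is in place. This is a small sketch, not part of the repo; the helper name `missing_coco_dirs` is hypothetical, and the sub-directory names are taken from the layout listed above.

```python
from pathlib import Path

# Sub-directories listed in the data preparation comments.
EXPECTED_SUBDIRS = ("train2017", "val2017", "unlabeled2017", "annotations")


def missing_coco_dirs(data_root="data"):
    """Return the expected sub-directories that are absent under <data_root>/coco."""
    coco = Path(data_root) / "coco"
    return [name for name in EXPECTED_SUBDIRS if not (coco / name).is_dir()]


if __name__ == "__main__":
    missing = missing_coco_dirs()
    print("missing:", missing if missing else "none")
```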

Training

  • To train a model in the partially labeled data setting:
# JOB_TYPE: 'baseline' or 'semi'; decides which kind of job to run
# PERCENT_LABELED_DATA: 1, 5, or 10. The percentage of labeled COCO data in the whole training dataset.
# GPU_NUM: number of GPUs used to run the job
for FOLD in 1 2 3 4 5;
do
  bash tools/dist_train_partially.sh <JOB_TYPE> ${FOLD} <PERCENT_LABELED_DATA> <GPU_NUM>
done

For example, run the following script to train our model on 10% labeled data with 8 GPUs:

for FOLD in 1 2 3 4 5;
do
  bash tools/dist_train_partially.sh semi ${FOLD} 10 8
done
  • To train a model in the fully labeled data setting:
bash tools/dist_train.sh <CONFIG_FILE_PATH> <NUM_GPUS>

For example, to train our R50 model with 8 GPUs:

bash tools/dist_train.sh configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k.py 8
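
The five-fold loop for the partially labeled setting can also be driven from Python, for instance to validate the arguments before launching. The sketch below only builds the command lines; the helper name `fold_commands` is hypothetical, while the script path and argument order follow the shell loop shown earlier in this section.

```python
def fold_commands(job_type, percent, gpus, folds=range(1, 6)):
    """Build one tools/dist_train_partially.sh command line per fold."""
    if job_type not in ("baseline", "semi"):
        raise ValueError("job_type must be 'baseline' or 'semi'")
    return [
        [
            "bash",
            "tools/dist_train_partially.sh",
            job_type,
            str(fold),
            str(percent),
            str(gpus),
        ]
        for fold in folds
    ]


# Print the commands for 10% labeled data on 8 GPUs.
for cmd in fold_commands("semi", percent=10, gpus=8):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to launch the folds one after another.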

Evaluation

bash tools/dist_test.sh <CONFIG_FILE_PATH> <CHECKPOINT_PATH> <NUM_GPUS> --eval bbox --cfg-options model.test_cfg.rcnn.score_thr=<THR>
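
The `--cfg-options` argument takes `key=value` pairs in which a dotted key addresses a nested config entry. The sketch below illustrates that idea on a plain dict; it is not mmcv's implementation, and the helper `apply_override` and the starting value are hypothetical.

```python
def apply_override(cfg, dotted_key, value):
    """Set cfg[a][b][c] = value for dotted_key 'a.b.c', creating dicts as needed."""
    node = cfg
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return cfg


# Hypothetical starting config; only the key path comes from the command above.
cfg = {"model": {"test_cfg": {"rcnn": {"score_thr": 0.05}}}}
apply_override(cfg, "model.test_cfg.rcnn.score_thr", 1e-3)
print(cfg["model"]["test_cfg"]["rcnn"]["score_thr"])
```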

Inference

To run inference with a trained model and visualize the detection results:

# [IMAGE_FILE_PATH]: the path of your image file in the local file system
# [CONFIG_FILE]: the path of a config file
# [CHECKPOINT_PATH]: the path of a trained model that matches the provided config file
# [OUTPUT_PATH]: the directory in which to save the detection results
python demo/image_demo.py [IMAGE_FILE_PATH] [CONFIG_FILE] [CHECKPOINT_PATH] --output [OUTPUT_PATH]

For example:

  • Inference on a single image with the provided R50 model:
python demo/image_demo.py /tmp/tmp.png configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k.py work_dirs/downloaded.model --output work_dirs/

After the program completes, an image with the same name as the input will be saved to work_dirs.

  • Inference on many images with the provided R50 model:
python demo/image_demo.py '/tmp/*.jpg' configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k.py work_dirs/downloaded.model --output work_dirs/
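
In the command above the pattern is quoted, so the shell passes it through unexpanded and the script presumably expands it itself. The following sketch shows one way such a pattern could be expanded with the standard library; it is not taken from demo/image_demo.py, and the helper name `expand_inputs` is hypothetical.

```python
import glob


def expand_inputs(pattern):
    """Return the sorted list of files matching a shell-style pattern."""
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    for path in expand_inputs("/tmp/*.jpg"):
        print(path)
```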

[1] A Simple Semi-Supervised Learning Framework for Object Detection

[2] Instant-Teaching: An End-to-End Semi-Supervised Object Detection Framework

Comments
  • Model training stops after validation after 4000 iterations

    After 4000 training iterations, validation runs, and then training stops with the following error:

    raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
    ***************************************
             tools/train.py FAILED         
    =======================================
    Root Cause:
    [0]:
      time: 2021-09-22_05:54:53
      rank: 1 (local_rank: 1)
      exitcode: 1 (pid: 2210236)
      error_file: <N/A>
      msg: "Process failed with exitcode 1"
    =======================================
    Other Failures:
      <NO_OTHER_FAILURES>
    ***************************************
    

    I am training with 2 GPUs. Do you have any insight into why this error is being thrown?

    opened by purbayankar 19
  • Getting Error when start training with single GPU. [Error: CHILD PROCESS FAILED WITH NO ERROR_FILE]

    The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
    WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases. Please read local_rank from `os.environ('LOCAL_RANK')` instead.
    INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
      entrypoint : tools/train.py
      min_nodes : 1
      max_nodes : 1
      nproc_per_node : 1
      run_id : none
      rdzv_backend : static
      rdzv_endpoint : 127.0.0.1:29500
      rdzv_configs : {'rank': 0, 'timeout': 900}
      max_restarts : 3
      monitor_interval : 5
      log_dir : None
      metrics_cfg : {}

    INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun
    INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
    INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
    /home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
      "This is an experimental API and will be changed in future.", FutureWarning
    INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

    INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
    INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun/attempt_0/0/error.json
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 14927) of binary: /home/vefak/Documents/anaconda3/envs/torch/bin/python3
    ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
    INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
    INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
    INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
    INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=1 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

    INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
    INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun/attempt_1/0/error.json
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 14955) of binary: /home/vefak/Documents/anaconda3/envs/torch/bin/python3
    ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
    INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
    INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
    INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
    INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=2 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

    INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
    INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun/attempt_2/0/error.json
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 15007) of binary: /home/vefak/Documents/anaconda3/envs/torch/bin/python3
    ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
    INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/3 attempts left; will restart worker group
    INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
    INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
    INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=3 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

    INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
    INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun/attempt_3/0/error.json
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 15048) of binary: /home/vefak/Documents/anaconda3/envs/torch/bin/python3
    ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
    INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
    /home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
      "This is an experimental API and will be changed in future.", FutureWarning
    INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0004889965057373047 seconds
    {"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "15048", "role": "default", "hostname": "vefak", "state": "FAILED", "total_run_time": 25, "rdzv_backend": "static", "raw_error": "{"message": ""}", "metadata": "{"group_world_size": 1, "entry_point": "python3", "local_rank": [0], "role_rank": [0], "role_world_size": [1]}", "agent_restarts": 3}}
    {"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "vefak", "state": "SUCCEEDED", "total_run_time": 25, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python3"}", "agent_restarts": 3}}
    /home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning:


               CHILD PROCESS FAILED WITH NO ERROR_FILE                
    

    CHILD PROCESS FAILED WITH NO ERROR_FILE
    Child process 15048 (local_rank 0) FAILED (exitcode -11)
    Error msg: Signal 11 (SIGSEGV) received by PID 15048
    Without writing an error file to <N/A>.
    While this DOES NOT affect the correctness of your application,
    no trace information about the error will be available for inspection.
    Consider decorating your top level entrypoint function with
    torch.distributed.elastic.multiprocessing.errors.record. Example:

    from torch.distributed.elastic.multiprocessing.errors import record

    @record
    def trainer_main(args):
        # do train


      warnings.warn(_no_error_file_warning_msg(rank, failure))
    Traceback (most recent call last):
      File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/launch.py", line 173, in <module>
        main()
      File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/launch.py", line 169, in main
        run(args)
      File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/run.py", line 624, in run
        )(*cmd_args)
      File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
        return f(*args, **kwargs)
      File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
        failures=result.failures,
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


              tools/train.py FAILED               
    

    ==================================================
    Root Cause:
    [0]:
      time: 2022-02-04_00:19:07
      rank: 0 (local_rank: 0)
      exitcode: -11 (pid: 15048)
      error_file: <N/A>
      msg: "Signal 11 (SIGSEGV) received by PID 15048"
    Other Failures:
      <NO_OTHER_FAILURES>

    opened by vefak 16
  • assert len(indices) == len(self)

    Hello. When I run training, this assertion fails:

    assert len(indices) == len(self), f"{indices} not equal {len(self)} while offset is: {offset}"

    I then printed the length info:

    =====len of indices is 26865 - offset: 0 - len self 36650

    The detailed error info is below. Please help me.

    Traceback (most recent call last):
      File "tools/train.py", line 198, in <module>
        main()
      File "tools/train.py", line 193, in main
        meta=meta,
      File "/data6/ziqiwen/code/softteacher/ssod/apis/train.py", line 206, in train_detector
        runner.run(data_loaders, cfg.workflow)
      File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 117, in run
        iter_loaders = [IterLoader(x) for x in data_loaders]
      File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 117, in <listcomp>
        iter_loaders = [IterLoader(x) for x in data_loaders]
      File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 23, in __init__
        self.iter_loader = iter(self._dataloader)
      File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 291, in __iter__
        return _MultiProcessingDataLoaderIter(self)
      File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 764, in __init__
        self._try_put_index()
      File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 994, in _try_put_index
        index = self._next_index()
      File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 357, in _next_index
        return next(self._sampler_iter)  # may raise StopIteration
      File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 208, in __iter__
        for idx in self.sampler:
      File "/data6/ziqiwen/code/softteacher/ssod/datasets/samplers/semi_sampler.py", line 189, in __iter__
        assert len(indices) == len(self)
    AssertionError
    Traceback (most recent call last):
      File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
        main()
      File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
        cmd=cmd)

    opened by winnerziqi 15
  • Training on a custom dataset

    Thanks for sharing your great code! I am trying to train your semi-supervised model on a custom dataset, yet I always get unsup_loss_rpn_bbox: 0.0000 and unsup_loss_bbox: 0.0000, even after a long training time. My data has only one object class. Any suggestions, please? Thanks.

    This is what I got on the test set. It looks like the network was never trained.

    [>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 3276/3276, 45.6 task/s, elapsed: 72s, ETA: 0s
    Evaluating bbox...
    Loading and preparing results...
    DONE (t=0.00s)
    creating index...
    index created!
    Running per image evaluation...
    Evaluate annotation type *bbox*
    DONE (t=0.36s).
    Accumulating evaluation results...
    DONE (t=0.07s).
    Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.007
    Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.010
    Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.010
    Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.000
    Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.007
    Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.000
    Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.001
    Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.001
    Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.001
    Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.000
    Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.001
    Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.000
    OrderedDict([('bbox_mAP', 0.007), ('bbox_mAP_50', 0.01), ('bbox_mAP_75', 0.01), ('bbox_mAP_s', 0.0), ('bbox_mAP_m', 0.007), ('bbox_mAP_l', 0.0), ('bbox_mAP_copypaste', '0.007 0.010 0.010 0.000 0.007 0.000')])

    opened by Hananali1 14
  • the r_square between iou and bbox variance in the refine(jitter(bbox)) method

    The scatter plot in your paper showing the relationship between IoU and the bbox variance (after jittering) is really interesting and shows a strong correlation. Because of that, I want to try another method for estimating bbox quality on a single-stage detector under your Soft Teacher architecture. I would simply like to know what r_square you achieved with Soft Teacher and Faster R-CNN + FPN on the COCO 1% labeled dataset, so that I can compare against it in my future projects. Of course, if I come up with a method under your architecture, I will gratefully acknowledge your work in my paper or project report. Sincere thanks for your help!

    opened by Jack-Hu-2001 14
  • KeyError: 'loss_cls'

    After 300 training iterations, "KeyError: 'loss_cls'" is raised.

    Below is my training information:

    wandb: (1) Create a W&B account
    wandb: (2) Use an existing W&B account
    wandb: (3) Don't visualize my results
    wandb: Enter your choice:
    wandb: WARNING Invalid choice
    wandb: Enter your choice:
    wandb: WARNING Invalid choice
    wandb: Enter your choice: 3
    wandb: You chose 'Don't visualize my results'

    CondaEnvException: Unable to determine environment

    Please re-run this command with one of the following options:

    • Provide an environment name via --name or -n
    • Re-run this command inside an activated conda environment.

    wandb: W&B syncing is set to offline in this directory. Run wandb online or set WANDB_MODE=online to enable cloud syncing.
    =====group sizes is [1755 7042]
    =====len of indices is 14660 - offset: 0 - len self 14660
    /opt/conda/lib/python3.7/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
      return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
    /home/swap/project/SoftTeacher/thirdparty/mmdetection/mmdet/core/anchor/anchor_generator.py:324: UserWarning: grid_anchors would be deprecated soon. Please use grid_priors
      warnings.warn('grid_anchors would be deprecated soon. '
    /home/swap/project/SoftTeacher/thirdparty/mmdetection/mmdet/core/anchor/anchor_generator.py:361: UserWarning: single_level_grid_anchors would be deprecated soon. Please use single_level_grid_priors
      'single_level_grid_anchors would be deprecated soon. '
    2021-09-13 07:35:50,265 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
    2021-09-13 07:36:05,024 - mmdet.ssod - INFO - Iter [50/14400] lr: 9.890e-04, eta: 1:18:29, time: 0.328, data_time: 0.021, memory: 3390, ema_momentum: 0.9800, unsup_weight: 4, sup_loss_rpn_cls: 0.4607, sup_loss_rpn_bbox: 0.2644, sup_loss_cls: 1.8000, sup_acc: 74.8359, sup_loss_bbox: 0.3120, unsup_loss_rpn_cls: 1.2687, unsup_loss_rpn_bbox: 0.4132, unsup_loss_cls: 3.9225, unsup_acc: 78.9180, unsup_loss_bbox: 2.3666, loss: 10.8082
    2021-09-13 07:36:20,035 - mmdet.ssod - INFO - Iter [100/14400] lr: 1.988e-03, eta: 1:14:52, time: 0.300, data_time: 0.012, memory: 3390, ema_momentum: 0.9900, unsup_weight: 4, sup_loss_rpn_cls: 0.2892, sup_loss_rpn_bbox: 0.2540, sup_loss_cls: 0.4055, sup_acc: 92.8125, sup_loss_bbox: 0.2864, unsup_loss_rpn_cls: 0.1523, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0626, unsup_acc: 99.9648, unsup_loss_bbox: 0.0114, loss: 1.4615
    2021-09-13 07:36:35,185 - mmdet.ssod - INFO - Iter [150/14400] lr: 2.987e-03, eta: 1:13:43, time: 0.303, data_time: 0.012, memory: 3390, ema_momentum: 0.9933, unsup_weight: 4, sup_loss_rpn_cls: 0.2405, sup_loss_rpn_bbox: 0.2624, sup_loss_cls: 0.3531, sup_acc: 92.7383, sup_loss_bbox: 0.2667, unsup_loss_rpn_cls: 0.1803, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0657, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.3686
    2021-09-13 07:36:51,586 - mmdet.ssod - INFO - Iter [200/14400] lr: 3.986e-03, eta: 1:14:30, time: 0.328, data_time: 0.012, memory: 3390, ema_momentum: 0.9950, unsup_weight: 4, sup_loss_rpn_cls: 0.3032, sup_loss_rpn_bbox: 0.3015, sup_loss_cls: 0.4180, sup_acc: 92.7266, sup_loss_bbox: 0.2982, unsup_loss_rpn_cls: 0.1778, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0629, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.5616
    2021-09-13 07:37:07,086 - mmdet.ssod - INFO - Iter [250/14400] lr: 4.985e-03, eta: 1:14:01, time: 0.310, data_time: 0.012, memory: 3390, ema_momentum: 0.9960, unsup_weight: 4, sup_loss_rpn_cls: 0.3210, sup_loss_rpn_bbox: 0.4059, sup_loss_cls: 0.4333, sup_acc: 92.9453, sup_loss_bbox: 0.3436, unsup_loss_rpn_cls: 0.1687, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0740, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.7465
    2021-09-13 07:37:22,497 - mmdet.ssod - INFO - Iter [300/14400] lr: 5.984e-03, eta: 1:13:32, time: 0.308, data_time: 0.011, memory: 3390, ema_momentum: 0.9967, unsup_weight: 4, sup_loss_rpn_cls: 0.3386, sup_loss_rpn_bbox: 0.4950, sup_loss_cls: 0.3686, sup_acc: 93.9102, sup_loss_bbox: 0.2682, unsup_loss_rpn_cls: 0.1938, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0760, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.7401
    Traceback (most recent call last):
      File "tools/train.py", line 198, in <module>
        main()
      File "tools/train.py", line 193, in main
        meta=meta,
      File "/home/swap/project/SoftTeacher/ssod/apis/train.py", line 205, in train_detector
        runner.run(data_loaders, cfg.workflow)
      File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
        iter_runner(iter_loaders[i], **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
        outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 53, in train_step
        output = self.module.train_step(*inputs[0], **kwargs[0])
      File "/home/swap/project/SoftTeacher/thirdparty/mmdetection/mmdet/models/detectors/base.py", line 238, in train_step
        losses = self(**data)
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
        output = old_func(*new_args, **new_kwargs)
      File "/home/swap/project/SoftTeacher/thirdparty/mmdetection/mmdet/models/detectors/base.py", line 172, in forward
        return self.forward_train(img, img_metas, **kwargs)
      File "/home/swap/project/SoftTeacher/ssod/models/soft_teacher.py", line 50, in forward_train
        data_groups["unsup_teacher"], data_groups["unsup_student"]
      File "/home/swap/project/SoftTeacher/ssod/models/soft_teacher.py", line 77, in foward_unsup_train
        return self.compute_pseudo_label_loss(student_info, teacher_info)
      File "/home/swap/project/SoftTeacher/ssod/models/soft_teacher.py", line 120, in compute_pseudo_label_loss
        student_info=student_info,
      File "/home/swap/project/SoftTeacher/ssod/models/soft_teacher.py", line 244, in unsup_rcnn_cls_loss
        loss["loss_cls"] = loss["loss_cls"].sum() / max(bbox_targets[1].sum(), 1.0)
    KeyError: 'loss_cls'

    opened by duany049 14
  • Formal definition of G_cls, G_reg, l_cls, l_reg

    Hello @MendelXu!

    The paper mentions that G_cls is produced from the teacher model by foreground filtering and that G_reg is produced from the teacher model by box variance filtering. Unfortunately, the paper doesn't define l_cls and l_reg. l_cls(student_candidate_box, teacher_pseudo_boxes) and l_reg(student_candidate_box, teacher_pseudo_boxes) still need to perform label assignment. How is this assignment performed? What losses are used?

    The paper mentions: Another important benefit of this end-to-end framework is that it allows for greater leverage of the teacher model to guide the training of the student model, rather than just providing “some generated pseudo boxes with hard category labels” as in previous approaches [27, 36]. A soft teacher approach is proposed to implement this insight. In this approach, the teacher model is used to directly assess all the box candidates that are generated by the student model, rather than providing “pseudo boxes” to assign category labels and regression vectors to these student-generated box candidates.

    It seems that in SoftTeacher, pseudo boxes with hard labels (G_cls, G_reg) are still generated and that some standard IoU-based target-box matching / assignment (l_cls, l_reg) is used. If that's not the case, could you please clarify?

    It's possible to recover this information from the code, but some formal definitions could help reading the code as well.

    Thank you!

    opened by vadimkantorov 13
  • Error while trying to train with 4 gpus

    Congratulations on the great work. I am getting this error while trying to train with 4 GPUs. Can you please help me out?

    File "/data/SoftTeacher/tools/train.py", line 198, in <module>
        main()
      File "/data/SoftTeacher/tools/train.py", line 186, in main
        train_detector(
      File "/data/SoftTeacher/ssod/apis/train.py", line 206, in train_detector
        runner.run(data_loaders, cfg.workflow)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
        iter_runner(iter_loaders[i], **kwargs)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
        outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
        output = self.module.train_step(*inputs[0], **kwargs[0])
      File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/detectors/base.py", line 238, in train_step
        losses = self(**data)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
        output = old_func(*new_args, **new_kwargs)
      File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/detectors/base.py", line 172, in forward
        return self.forward_train(img, img_metas, **kwargs)
      File "/data/SoftTeacher/ssod/models/soft_teacher.py", line 44, in forward_train
        sup_loss = self.student.forward_train(**data_groups["sup"])
      File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/detectors/two_stage.py", line 135, in forward_train
        rpn_losses, proposal_list = self.rpn_head.forward_train(
      File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/dense_heads/base_dense_head.py", line 59, in forward_train
        proposal_list = self.get_bboxes(*outs, img_metas, cfg=proposal_cfg)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 214, in new_func
        output = old_func(*new_args, **new_kwargs)
      File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/dense_heads/rpn_head.py", line 152, in get_bboxes
        proposals = self._get_bboxes_single(cls_score_list, bbox_pred_list,
      File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/dense_heads/rpn_head.py", line 244, in _get_bboxes_single
        dets, keep = batched_nms(proposals, scores, ids, cfg.nms)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/ops/nms.py", line 307, in batched_nms
        dets, keep = nms_op(boxes_for_nms, scores, **nms_cfg_)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/utils/misc.py", line 330, in new_func
        output = old_func(*args, **kwargs)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/ops/nms.py", line 171, in nms
        inds = NMSop.apply(boxes, scores, iou_threshold, offset,
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/ops/nms.py", line 26, in forward
        inds = ext_module.nms(
    RuntimeError: CUDA error: invalid device function
    CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    /data/SoftTeacher/thirdparty/mmdetection/mmdet/core/anchor/anchor_generator.py:324: UserWarning: ``grid_anchors`` would be deprecated soon. Please use ``grid_priors`` 
      warnings.warn('``grid_anchors`` would be deprecated soon. '
    /data/SoftTeacher/thirdparty/mmdetection/mmdet/core/anchor/anchor_generator.py:360: UserWarning: ``single_level_grid_anchors`` would be deprecated soon. Please use ``single_level_grid_priors`` 
      warnings.warn(
    Traceback (most recent call last):
      File "/data/SoftTeacher/tools/train.py", line 198, in <module>
        main()
      File "/data/SoftTeacher/tools/train.py", line 186, in main
        train_detector(
      File "/data/SoftTeacher/ssod/apis/train.py", line 206, in train_detector
        runner.run(data_loaders, cfg.workflow)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
        iter_runner(iter_loaders[i], **kwargs)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
        outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
        output = self.module.train_step(*inputs[0], **kwargs[0])
      File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/detectors/base.py", line 238, in train_step
        losses = self(**data)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
        output = old_func(*new_args, **new_kwargs)
      File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/detectors/base.py", line 172, in forward
        return self.forward_train(img, img_metas, **kwargs)
      File "/data/SoftTeacher/ssod/models/soft_teacher.py", line 44, in forward_train
        sup_loss = self.student.forward_train(**data_groups["sup"])
      File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/detectors/two_stage.py", line 135, in forward_train
        rpn_losses, proposal_list = self.rpn_head.forward_train(
      File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/dense_heads/base_dense_head.py", line 59, in forward_train
        proposal_list = self.get_bboxes(*outs, img_metas, cfg=proposal_cfg)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 214, in new_func
        output = old_func(*new_args, **new_kwargs)
      File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/dense_heads/rpn_head.py", line 152, in get_bboxes
        proposals = self._get_bboxes_single(cls_score_list, bbox_pred_list,
      File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/dense_heads/rpn_head.py", line 244, in _get_bboxes_single
        dets, keep = batched_nms(proposals, scores, ids, cfg.nms)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/ops/nms.py", line 307, in batched_nms
        dets, keep = nms_op(boxes_for_nms, scores, **nms_cfg_)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/utils/misc.py", line 330, in new_func
        output = old_func(*args, **kwargs)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/ops/nms.py", line 171, in nms
        inds = NMSop.apply(boxes, scores, iou_threshold, offset,
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/ops/nms.py", line 26, in forward
        inds = ext_module.nms(
    RuntimeError: CUDA error: invalid device function
    CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    
    wandb: Waiting for W&B process to finish, PID 38162
    wandb: Program failed with code 1. 
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 1 (pid: 37922) of binary: /home/ubuntu/anaconda3/envs/py39/bin/python
    Traceback (most recent call last):
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
        main()
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
        launch(args)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
        run(args)
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 689, in run
        elastic_launch(
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
    **************************************************
                  tools/train.py FAILED               
    ==================================================
    Root Cause:
    [0]:
      time: 2021-09-29_15:26:15
      rank: 1 (local_rank: 1)
      exitcode: -11 (pid: 37922)
      error_file: <N/A>
      msg: "Signal 11 (SIGSEGV) received by PID 37922"
    ==================================================
    Other Failures:
    [1]:
      time: 2021-09-29_15:26:15
      rank: 3 (local_rank: 3)
      exitcode: -11 (pid: 37924)
      error_file: <N/A>
      msg: "Signal 11 (SIGSEGV) received by PID 37924"
    **************************************************
    
    opened by sobujmaroon 12
  • mmcv incompatibility with mmdetection

    I could not find a compatible mmcv-full version that can import MultiScaleDeformableAttention, even with the version you indicated.

    /cta/users/mehmet/.conda/envs/softteacher/lib/python3.6/site-packages/mmcv/cnn/bricks/transformer.py:27: UserWarning: Fail to import ``MultiScaleDeformableAttention`` from ``mmcv.ops.multi_scale_deform_attn``, You should install ``mmcv-full`` if you need this module.
      warnings.warn('Fail to import ``MultiScaleDeformableAttention`` from '
    /cta/users/mehmet/SoftTeacher/thirdparty/mmdetection/mmdet/models/utils/transformer.py:27: UserWarning: `MultiScaleDeformableAttention` in MMCV has been moved to `mmcv.ops.multi_scale_deform_attn`, please update your MMCV
      '`MultiScaleDeformableAttention` in MMCV has been moved to '
    Traceback (most recent call last):
      File "/cta/users/mehmet/SoftTeacher/thirdparty/mmdetection/mmdet/models/utils/transformer.py", line 23, in <module>
        from mmcv.ops.multi_scale_deform_attn import MultiScaleDeformableAttention
      File "/cta/users/mehmet/.conda/envs/softteacher/lib/python3.6/site-packages/mmcv/ops/__init__.py", line 1, in <module>
        from .bbox import bbox_overlaps
      File "/cta/users/mehmet/.conda/envs/softteacher/lib/python3.6/site-packages/mmcv/ops/bbox.py", line 3, in <module>
        ext_module = ext_loader.load_ext('_ext', ['bbox_overlaps'])
      File "/cta/users/mehmet/.conda/envs/softteacher/lib/python3.6/site-packages/mmcv/utils/ext_loader.py", line 12, in load_ext
        ext = importlib.import_module('mmcv.' + name)
      File "/cta/users/mehmet/.conda/envs/softteacher/lib/python3.6/importlib/__init__.py", line 126, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
    ImportError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "tools/train.py", line 15, in <module>
        from mmdet.models import build_detector
      File "/cta/users/mehmet/SoftTeacher/thirdparty/mmdetection/mmdet/models/__init__.py", line 2, in <module>
        from .backbones import *  # noqa: F401,F403
      File "/cta/users/mehmet/SoftTeacher/thirdparty/mmdetection/mmdet/models/backbones/__init__.py", line 2, in <module>
        from .csp_darknet import CSPDarknet
      File "/cta/users/mehmet/SoftTeacher/thirdparty/mmdetection/mmdet/models/backbones/csp_darknet.py", line 11, in <module>
        from ..utils import CSPLayer
      File "/cta/users/mehmet/SoftTeacher/thirdparty/mmdetection/mmdet/models/utils/__init__.py", line 14, in <module>
        from .transformer import (DetrTransformerDecoder, DetrTransformerDecoderLayer,
      File "/cta/users/mehmet/SoftTeacher/thirdparty/mmdetection/mmdet/models/utils/transformer.py", line 29, in <module>
        from mmcv.cnn.bricks.transformer import MultiScaleDeformableAttention
    ImportError: cannot import name 'MultiScaleDeformableAttention'
    
    opened by makifozkanoglu 10
  • After switching the detector to Cascade, a TypeError occurs: _bbox_forward() missing 1 required positional argument: 'rois'

    I did some initial debugging. When using Cascade, self.teacher.roi_head.simple_test_bboxes() ends up calling def _bbox_forward(self, stage, x, rois) in cascade_roi_head.py, but test_mixins.py calls it as bbox_results = self._bbox_forward(x, rois), which lacks the stage argument. How should this be handled? The definition in cascade_roi_head.py is: [image] The call in test_mixins.py is: [image]
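The mismatch described above can be reproduced without mmdetection. The following is a minimal sketch, not the library's code: `BBoxTestMixin`, `StandardHead`, and `CascadeHead` are hypothetical stand-ins that only mirror the two signatures quoted in the issue. It shows why a shared test helper that passes two arguments fails once the subclass hook expects three.

```python
class BBoxTestMixin:
    """Hypothetical stand-in for the shared test mixin: calls the hook with (x, rois)."""

    def simple_test_bboxes(self, x, rois):
        return self._bbox_forward(x, rois)


class StandardHead(BBoxTestMixin):
    """Hook signature that matches the mixin's call."""

    def _bbox_forward(self, x, rois):
        return {"x": x, "rois": rois}


class CascadeHead(BBoxTestMixin):
    """Hook signature with an extra leading `stage` parameter, as quoted in the issue."""

    def _bbox_forward(self, stage, x, rois):
        return {"stage": stage, "x": x, "rois": rois}


def try_call(head):
    """Return 'ok' if the shared helper works for this head, else the error text."""
    try:
        head.simple_test_bboxes("feats", "rois")
        return "ok"
    except TypeError as exc:
        return str(exc)


print(try_call(StandardHead()))
print(try_call(CascadeHead()))
```

With the cascade-style hook, the two positional arguments bind to `stage` and `x`, so Python reports that `rois` is missing, which matches the error text in the issue title. How the repository itself should resolve this is not stated in the issue.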

    opened by zhanghang-cv 9
  • Error in training

    Error in full training:

    tools/train.py FAILED

    Root Cause:
    [0]:
      time: 2021-10-04_17:02:18
      rank: 0 (local_rank: 0)
      exitcode: 1 (pid: 921)
      error_file: <N/A>
      msg: "Process failed with exitcode 1"

    Other Failures:
      <NO_OTHER_FAILURES>

    I am using only one GPU and get an error in full training with my own data converted to COCO format.

    First, I split the data with "bash tools/dataset/prepare_coco_data.sh conduct", then trained with "bash tools/dist_train.sh configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k.py 1".

    I also trained on the COCO data as described in the README and still get errors, in both full and semi-supervised training. It gets stuck at:

    INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
    INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mhoa0vu3/none_y1enyj09/attempt_0/0/error.json

    opened by luisfra19 9
  • Cascade RCNN with Soft Teacher does not run: _bbox_forward() missing 1 required positional argument: 'rois'

    When trying to run Cascade RCNN with Soft Teacher, I get the following exception:

    File "/...../python3.8/site-packages/mmdet/models/roi_heads/test_mixins.py", line 89, in simple_test_bboxes
      bbox_results = self._bbox_forward(x, rois)
    TypeError: _bbox_forward() missing 1 required positional argument: 'rois'

    This problem was raised previously in #123 and #106, but neither was fully answered. Apparently Cascade RCNN's ROI heads will somehow use the incorrect test function.

    There was never a satisfactory answer to how to fix this problem other than a cryptic response: "The dirty way I used before is that the feature and teacher on the teacher side are passed into the head as parameters, and then the teacher is used to make judgments in the head.."

    1. Has anyone successfully run Cascade RCNN with Soft Teacher?
    2. Does anyone know the fix to make the above problem work with Cascade RCNN?

    Thank you in advance! Cheers, Mark

    opened by planaria158 0
  • Config files for evaluating the provided models

    Hi. Is it possible to share the config files used for evaluating the weights available through the Google Drive links?

    I was trying to reproduce the 44.05% mAP of the Faster R-CNN (ResNet-50) -- Ours (thr=5e-2) experiment. However, I only get Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.324.

    The command that I ran was the following:

    bash tools/dist_test.sh \
        /home/ubuntu/project/Detection/SoftTeacher/configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k_eval.py \
        /home/ubuntu/project/Detection/SoftTeacher/work_dirs/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k/coco_iter_720000.pth \
        1 --eval bbox --cfg-options model.test_cfg.rcnn.score_thr=0.90
    

    The config file is the following:

    _base_="base.py"
    
    data = dict(
        samples_per_gpu=8,
        workers_per_gpu=5,
        train=dict(
            sup=dict(
                ann_file="/home/ubuntu/project/data/COCO/annotations/instances_train2017.json",
                img_prefix="/home/ubuntu/project/data/COCO/train2017/",
            ),
        ),
        val=dict(
            ann_file="/home/ubuntu/project/data/COCO/annotations/instances_val2017.json",
            img_prefix="/home/ubuntu/project/data/COCO/val2017/",
        ),
        test=dict(
            ann_file="/home/ubuntu/project/data/COCO/annotations/instances_val2017.json",
            img_prefix="/home/ubuntu/project/data/COCO/val2017/",
        ),
    
        sampler=dict(
            train=dict(
                sample_ratio=[1, 1],
            )
        )
    )
    
    semi_wrapper = dict(
        train_cfg=dict(
            unsup_weight=2.0,
        )
    )
    
    optimizer = dict(lr=0.01, weight_decay=1e-4, momentum=0.9)
    lr_config = dict(step=[300000, 425000])
    runner = dict(_delete_=True, type="IterBasedRunner", max_iters=450000)
    

    Could someone help me out? Thank you. If there is an existing issue about this that I missed, I apologize in advance.
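One difference worth checking: the command in this issue passes model.test_cfg.rcnn.score_thr=0.90, while the README's Notes tie the reported numbers to thr=5e-2 or thr=1e-3. Whether this accounts for the 0.324 result is not confirmed anywhere in the thread. The sketch below only illustrates the general effect of such a threshold; `keep_detections` is a hypothetical stand-in for a confidence filter and the scores are made up.

```python
def keep_detections(scores, score_thr):
    """Return the confidences that survive a simple filter (score > score_thr)."""
    return [s for s in scores if s > score_thr]


# Made-up confidences for one image; real detectors emit many low-score boxes.
scores = [0.97, 0.91, 0.62, 0.30, 0.08, 0.004]

# Number of boxes kept at the threshold from the issue and at the two README values.
kept = {thr: len(keep_detections(scores, thr)) for thr in (0.90, 5e-2, 1e-3)}
print(kept)
```

A high threshold discards most low-confidence boxes before evaluation. Because COCO mAP integrates precision over recall, removing those boxes typically lowers the score, so rerunning with score_thr=5e-2 is a cheap first check.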

    opened by Bai-YT 0
  • lr_config set up according to max iter?

    How should I set the lr_config parameters relative to max_iters? I want to run only 20k iterations in total; what should lr_config be?

    lr_config = dict(step=[120000 * 4, 160000 * 4])
    runner = dict(_delete_=True, type="IterBasedRunner", max_iters=12000)
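One common heuristic, offered here as a sketch and not as the authors' recommendation, is to keep the decay milestones at the same fraction of training when the schedule is shortened. The helper `scale_lr_steps` is hypothetical, and the reference length of 720k iterations is an assumption inferred from the "720k" config name mentioned elsewhere on this page.

```python
def scale_lr_steps(ref_steps, ref_max_iters, new_max_iters):
    """Rescale step milestones so each sits at the same fraction of the new schedule."""
    return [step * new_max_iters // ref_max_iters for step in ref_steps]


ref_steps = [120000 * 4, 160000 * 4]  # milestones quoted in the question
ref_max_iters = 180000 * 4            # ASSUMPTION: 720k iterations in the reference schedule
new_max_iters = 20000                 # the 20k iterations the question asks about

new_steps = scale_lr_steps(ref_steps, ref_max_iters, new_max_iters)
print(new_steps)  # [13333, 17777]
```

Under that assumption the config would use lr_config = dict(step=[13333, 17777]) together with max_iters=20000. Every milestone has to be smaller than max_iters, otherwise the learning rate never decays during the run.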

    opened by vefak 0
  • After training for 4000 iters, the run hangs once validation finishes. What is going on?

    2022-07-07 14:05:26,893 - mmdet.ssod - INFO -
    Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.135
    Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.269
    Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.120
    Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.071
    Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.155
    Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.164
    Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.262
    Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.262
    Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.262
    Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.119
    Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.281
    Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.331

    [>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 5000/5000, 155.5 task/s, elapsed: 32s, ETA: 0s
    2022-07-07 14:06:02,265 - mmdet.ssod - INFO - Evaluating bbox...
    Loading and preparing results...
    DONE (t=0.11s)
    creating index...
    index created!
    Running per image evaluation...
    Evaluate annotation type bbox
    DONE (t=14.59s).
    Accumulating evaluation results...
    DONE (t=3.20s).
    2022-07-07 14:06:20,482 - mmdet.ssod - INFO -
    Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.088
    Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.193
    Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.068
    Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.045
    Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.105
    Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.112
    Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.169
    Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.169
    Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.169
    Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.063
    Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.186
    Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.216

    2022-07-07 14:06:20,850 - mmdet.ssod - INFO - Exp name: soft_teacher_faster_rcnn_r50_caffe_fpn_coco_180k.py
    2022-07-07 14:06:20,855 - mmdet.ssod - INFO - Iter(val) [4000] teacher.bbox_mAP: 0.1350, teacher.bbox_mAP_50: 0.2690, teacher.bbox_mAP_75: 0.1200, teacher.bbox_mAP_s: 0.0710, teacher.bbox_mAP_m: 0.1550, teacher.bbox_mAP_l: 0.1640, teacher.bbox_mAP_copypaste: 0.135 0.269 0.120 0.071 0.155 0.164, student.bbox_mAP: 0.0880, student.bbox_mAP_50: 0.1930, student.bbox_mAP_75: 0.0680, student.bbox_mAP_s: 0.0450, student.bbox_mAP_m: 0.1050, student.bbox_mAP_l: 0.1120, student.bbox_mAP_copypaste: 0.088 0.193 0.068 0.045 0.105 0.112

    After the 4000 iters finish, the program just hangs and makes no further progress. What could be causing this?

    opened by mary-0830 2
Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
Data-Uncertainty Guided Multi-Phase Learning for Semi-supervised Object Detection

An official implementation of paper Data-Uncertainty Guided Multi-Phase Learning for Semi-supervised Object Detection

null 11 Nov 23, 2022
CVPR2022 paper "Dense Learning based Semi-Supervised Object Detection"

[CVPR2022] DSL: Dense Learning based Semi-Supervised Object Detection DSL is the first work on Anchor-Free detector for Semi-Supervised Object Detecti

Bhchen 69 Dec 8, 2022
PyTorch code for ICLR 2021 paper Unbiased Teacher for Semi-Supervised Object Detection

Unbiased Teacher for Semi-Supervised Object Detection This is the PyTorch implementation of our paper: Unbiased Teacher for Semi-Supervised Object Detection

Facebook Research 366 Dec 28, 2022
Instant-Teaching: An End-to-End Semi-Supervised Object Detection Framework

This repo is the official implementation of "Instant-Teaching: An End-to-End Semi-Supervised Object Detection Framework". @inproceedings{zhou2021insta

null 34 Dec 31, 2022
Group R-CNN for Point-based Weakly Semi-supervised Object Detection (CVPR2022)

Group R-CNN for Point-based Weakly Semi-supervised Object Detection (CVPR2022) By Shilong Zhang*, Zhuoran Yu*, Liyang Liu*, Xinjiang Wang, Aojun Zhou,

Shilong Zhang 129 Dec 24, 2022
Project looking into use of autoencoder for semi-supervised learning and comparing data requirements compared to supervised learning.

Project looking into use of autoencoder for semi-supervised learning and comparing data requirements compared to supervised learning.

Tom-R.T.Kvalvaag 2 Dec 17, 2021
UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning

UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning This is the official PyTorch implementation for UniMoCo pape

dddzg 49 Jan 2, 2023
Code and models for ICCV2021 paper "Robust Object Detection via Instance-Level Temporal Cycle Confusion".

Robust Object Detection via Instance-Level Temporal Cycle Confusion This repo contains the implementation of the ICCV 2021 paper, Robust Object Detect

Xin Wang 69 Oct 13, 2022
TOOD: Task-aligned One-stage Object Detection, ICCV2021 Oral

One-stage object detection is commonly implemented by optimizing two sub-tasks: object classification and localization, using heads with two parallel branches, which might lead to a certain level of spatial misalignment in predictions between the two tasks.

null 264 Jan 9, 2023
Exploring Classification Equilibrium in Long-Tailed Object Detection, ICCV2021

Exploring Classification Equilibrium in Long-Tailed Object Detection (LOCE, ICCV 2021) Paper Introduction The conventional detectors tend to make imba

null 52 Nov 21, 2022
ICCV2021 Paper: AutoShape: Real-Time Shape-Aware Monocular 3D Object Detection

ICCV2021 Paper: AutoShape: Real-Time Shape-Aware Monocular 3D Object Detection

Zongdai 107 Dec 20, 2022
[CVPR 2021] MiVOS - Mask Propagation module. Reproduced STM (and better) with training code :star2:. Semi-supervised video object segmentation evaluation.

MiVOS (CVPR 2021) - Mask Propagation Ho Kei Cheng, Yu-Wing Tai, Chi-Keung Tang [arXiv] [Paper PDF] [Project Page] [Papers with Code] This repo impleme

Rex Cheng 106 Jan 3, 2023
Semi-Supervised 3D Hand-Object Poses Estimation with Interactions in Time

Semi Hand-Object Semi-Supervised 3D Hand-Object Poses Estimation with Interactions in Time (CVPR 2021).

null 96 Dec 27, 2022
CoSMA: Convolutional Semi-Regular Mesh Autoencoder. From Paper "Mesh Convolutional Autoencoder for Semi-Regular Meshes of Different Sizes"

Mesh Convolutional Autoencoder for Semi-Regular Meshes of Different Sizes Implementation of CoSMA: Convolutional Semi-Regular Mesh Autoencoder arXiv p

Fraunhofer SCAI 10 Oct 11, 2022
Yolo object detection - Yolo object detection with python

How to run download required files make build_image make download Docker versio

null 3 Jan 26, 2022
code for paper "Not All Unlabeled Data are Equal: Learning to Weight Data in Semi-supervised Learning" by Zhongzheng Ren*, Raymond A. Yeh*, Alexander G. Schwing.

Not All Unlabeled Data are Equal: Learning to Weight Data in Semi-supervised Learning Overview This code is for paper: Not All Unlabeled Data are Equa

Jason Ren 22 Nov 23, 2022
PyTorch implementation of our ICCV2021 paper: StructDepth: Leveraging the structural regularities for self-supervised indoor depth estimation

StructDepth PyTorch implementation of our ICCV2021 paper: StructDepth: Leveraging the structural regularities for self-supervised indoor depth estimat

SJTU-ViSYS 112 Nov 28, 2022
Perturbed Self-Distillation: Weakly Supervised Large-Scale Point Cloud Semantic Segmentation (ICCV2021)

Perturbed Self-Distillation: Weakly Supervised Large-Scale Point Cloud Semantic Segmentation (ICCV2021) This is the implementation of PSD (ICCV 2021),

null 12 Dec 12, 2022