Semi-Supervised Learning, Object Detection, ICCV2021

Microsoft

Last update: Dec 27, 2022

Overview

End-to-End Semi-Supervised Object Detection with Soft Teacher

By Mengde Xu*, Zheng Zhang*, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, Zicheng Liu.

This repo is the official implementation of the ICCV 2021 paper "End-to-End Semi-Supervised Object Detection with Soft Teacher".

Citation

@inproceedings{xu2021end,
  title={End-to-End Semi-Supervised Object Detection with Soft Teacher},
  author={Xu, Mengde and Zhang, Zheng and Hu, Han and Wang, Jianfeng and Wang, Lijuan and Wei, Fangyun and Bai, Xiang and Liu, Zicheng},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021}
}

Main Results

Partial Labeled Data

Following STAC [1], we evaluate on 5 different data splits for each setting and report the average performance over the 5 splits. The results are as follows:

1% labeled data

Method	mAP	Model Weights	Config Files
Baseline	10.0	-	Config
Ours (thr=5e-2)	21.62	Drive	Config
Ours (thr=1e-3)	22.64	Drive	Config

5% labeled data

Method	mAP	Model Weights	Config Files
Baseline	20.92	-	Config
Ours (thr=5e-2)	30.42	Drive	Config
Ours (thr=1e-3)	31.7	Drive	Config

10% labeled data

Method	mAP	Model Weights	Config Files
Baseline	26.94	-	Config
Ours (thr=5e-2)	33.78	Drive	Config
Ours (thr=1e-3)	34.7	Drive	Config

Full Labeled Data

Faster R-CNN (ResNet-50)

Model	mAP	Model Weights	Config Files
Baseline	40.9	-	Config
Ours (thr=5e-2)	44.05	Drive	Config
Ours (thr=1e-3)	44.6	Drive	Config
Ours* (thr=5e-2)	44.5	-	Config
Ours* (thr=1e-3)	44.9	-	Config

Faster R-CNN (ResNet-101)

Model	mAP	Model Weights	Config Files
Baseline	43.8	-	Config
Ours* (thr=5e-2)	46.8	-	Config
Ours* (thr=1e-3)	47.3	-	Config

Notes

Ours* denotes a longer training schedule.
thr refers to model.test_cfg.rcnn.score_thr in the config files. This inference trick was first introduced by Instant-Teaching [2].
All models are trained on 8 V100 GPUs.
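
To illustrate what a score threshold such as thr does at test time, the sketch below filters a list of detections by confidence. It is only an illustration: the helper `filter_by_score` and the sample detections are hypothetical and are not taken from this repository or from mmdetection.

```python
from typing import List, Tuple

# A detection is (label, score, box), with box given as (x1, y1, x2, y2).
Detection = Tuple[str, float, Tuple[float, float, float, float]]


def filter_by_score(dets: List[Detection], score_thr: float) -> List[Detection]:
    """Keep only detections whose confidence exceeds score_thr."""
    return [d for d in dets if d[1] > score_thr]


# Made-up detections, for illustration only.
dets = [
    ("person", 0.92, (10.0, 20.0, 110.0, 220.0)),
    ("dog", 0.04, (30.0, 40.0, 90.0, 120.0)),
    ("cat", 0.002, (5.0, 5.0, 50.0, 50.0)),
]

kept_strict = filter_by_score(dets, 5e-2)  # corresponds to thr=5e-2
kept_loose = filter_by_score(dets, 1e-3)   # corresponds to thr=1e-3
print(len(kept_strict), len(kept_loose))
```

With the lower threshold, more low-confidence boxes survive filtering, which is consistent with the slightly higher mAP reported for thr=1e-3 in the tables above.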

Usage

Requirements

Ubuntu 16.04
Anaconda3 with python=3.6
Pytorch=1.9.0
mmdetection=2.16.0+fe46ffe
mmcv=1.3.9
wandb=0.10.31

Notes

We use wandb for visualization. If you don't want to use it, comment out lines 273-284 in configs/soft_teacher/base.py.

Installation

make install

Data Preparation

Download the COCO dataset
Execute the following command to generate data set splits:

# YOUR_DATA should be a directory that contains the COCO dataset.
# For example:
# YOUR_DATA/
#  coco/
#     train2017/
#     val2017/
#     unlabeled2017/
#     annotations/
ln -s ${YOUR_DATA} data
bash tools/dataset/prepare_coco_data.sh conduct
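
Before running the preparation script, it can help to confirm that the layout described in the comments above is in place. The following is a minimal sketch, assuming the `data` symlink created above; the helper `missing_coco_dirs` is hypothetical and not part of this repository.

```python
from pathlib import Path
from typing import List

# Sub-directories listed in the layout comment above.
EXPECTED_SUBDIRS = ["train2017", "val2017", "unlabeled2017", "annotations"]


def missing_coco_dirs(data_root: str) -> List[str]:
    """Return the expected sub-directories that are absent under <data_root>/coco."""
    coco = Path(data_root) / "coco"
    return [name for name in EXPECTED_SUBDIRS if not (coco / name).is_dir()]


if __name__ == "__main__":
    # "data" is the symlink created with `ln -s ${YOUR_DATA} data`.
    missing = missing_coco_dirs("data")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("COCO layout looks complete.")
```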

Training

To train a model in the partially labeled data setting:

# JOB_TYPE: 'baseline' or 'semi'; decides which kind of job to run
# PERCENT_LABELED_DATA: 1, 5, or 10. The percentage of labeled COCO data in the whole training set.
# GPU_NUM: number of GPUs used to run the job
for FOLD in 1 2 3 4 5;
do
  bash tools/dist_train_partially.sh <JOB_TYPE> ${FOLD} <PERCENT_LABELED_DATA> <GPU_NUM>
done

For example, the following script trains our model on 10% labeled data with 8 GPUs:

for FOLD in 1 2 3 4 5;
do
  bash tools/dist_train_partially.sh semi ${FOLD} 10 8
done
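
The same loop can be generated from Python, for instance to print the commands for review before launching them. This is a sketch only: `partial_training_commands` is a hypothetical helper, and the allowed values simply mirror the parameter comments above.

```python
from typing import Iterable, List


def partial_training_commands(job_type: str, percent: int, gpus: int,
                              folds: Iterable[int] = range(1, 6)) -> List[str]:
    """Build one tools/dist_train_partially.sh command line per fold."""
    if job_type not in ("baseline", "semi"):
        raise ValueError("job_type must be 'baseline' or 'semi'")
    if percent not in (1, 5, 10):
        raise ValueError("percent must be 1, 5 or 10")
    return [
        " ".join(["bash", "tools/dist_train_partially.sh",
                  job_type, str(fold), str(percent), str(gpus)])
        for fold in folds
    ]


# Dry run: print the commands instead of executing them.
for cmd in partial_training_commands("semi", 10, 8):
    print(cmd)
```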

To train a model in the fully labeled data setting:

bash tools/dist_train.sh <CONFIG_FILE_PATH> <NUM_GPUS>

For example, to train our R50 model with 8 GPUs:

bash tools/dist_train.sh configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k.py 8

Evaluation

bash tools/dist_test.sh <CONFIG_FILE_PATH> <CHECKPOINT_PATH> <NUM_GPUS> --eval bbox --cfg-options model.test_cfg.rcnn.score_thr=<THR>

Inference

To run inference with a trained model and visualize the detection results:

# [IMAGE_FILE_PATH]: the path of your image file in the local file system
# [CONFIG_FILE]: the path of a config file
# [CHECKPOINT_PATH]: the path of a trained model that matches the provided config file
# [OUTPUT_PATH]: the directory in which to save the detection result
python demo/image_demo.py [IMAGE_FILE_PATH] [CONFIG_FILE] [CHECKPOINT_PATH] --output [OUTPUT_PATH]

For example:

Inference on a single image with the provided R50 model:

python demo/image_demo.py /tmp/tmp.png configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k.py work_dirs/downloaded.model --output work_dirs/

After the program completes, an image with the same name as the input will be saved to work_dirs.

Inference on multiple images with the provided R50 model:

python demo/image_demo.py '/tmp/*.jpg' configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k.py work_dirs/downloaded.model --output work_dirs/
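
The quotes around '/tmp/*.jpg' stop the shell from expanding the wildcard, so the script receives the pattern as a single argument. How demo/image_demo.py expands it is not shown here; the sketch below assumes a standard-library glob and uses a hypothetical helper, `expand_images`, to show which files such a pattern would match.

```python
import glob
from typing import List


def expand_images(pattern: str) -> List[str]:
    """Expand a path or wildcard pattern into a sorted list of matching files."""
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    for path in expand_images("/tmp/*.jpg"):
        print(path)
```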

[1] A Simple Semi-Supervised Learning Framework for Object Detection

[2] Instant-Teaching: An End-to-End Semi-Supervised Object Detection Framework

Comments

Model training stops after validation after 4000 iterations

After 4000 training iterations, validation runs, and then training stops with the following error:

raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
***************************************
         tools/train.py FAILED         
=======================================
Root Cause:
[0]:
  time: 2021-09-22_05:54:53
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 2210236)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
  <NO_OTHER_FAILURES>
***************************************

I am training with 2 GPUs. Do you have any insight into why this error is thrown?

opened by purbayankar 19

Getting an error when starting training with a single GPU. [Error: CHILD PROCESS FAILED WITH NO ERROR_FILE]

The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases. Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint : tools/train.py
  min_nodes : 1
  max_nodes : 1
  nproc_per_node : 1
  run_id : none
  rdzv_backend : static
  rdzv_endpoint : 127.0.0.1:29500
  rdzv_configs : {'rank': 0, 'timeout': 900}
  max_restarts : 3
  monitor_interval : 5
  log_dir : None
  metrics_cfg : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=0 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun/attempt_0/0/error.json
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 14927) of binary: /home/vefak/Documents/anaconda3/envs/torch/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=1 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun/attempt_1/0/error.json
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 14955) of binary: /home/vefak/Documents/anaconda3/envs/torch/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=2 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun/attempt_2/0/error.json
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 15007) of binary: /home/vefak/Documents/anaconda3/envs/torch/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result: restart_count=3 master_addr=127.0.0.1 master_port=29500 group_rank=0 group_world_size=1 local_ranks=[0] role_ranks=[0] global_ranks=[0] role_world_sizes=[1] global_world_sizes=[1]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mnn5x9jo/none_j57hpvun/attempt_3/0/error.json
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 15048) of binary: /home/vefak/Documents/anaconda3/envs/torch/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0004889965057373047 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "15048", "role": "default", "hostname": "vefak", "state": "FAILED", "total_run_time": 25, "rdzv_backend": "static", "raw_error": "{"message": ""}", "metadata": "{"group_world_size": 1, "entry_point": "python3", "local_rank": [0], "role_rank": [0], "role_world_size": [1]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "vefak", "state": "SUCCEEDED", "total_run_time": 25, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python3"}", "agent_restarts": 3}}
/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning:

CHILD PROCESS FAILED WITH NO ERROR_FILE

CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 15048 (local_rank 0) FAILED (exitcode -11)
Error msg: Signal 11 (SIGSEGV) received by PID 15048
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train

  warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/launch.py", line 173, in <module>
    main()
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/launch.py", line 169, in main
    run(args)
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/run.py", line 624, in run
    )(*cmd_args)
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/vefak/Documents/anaconda3/envs/torch/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

tools/train.py FAILED

==================================================
Root Cause:
[0]:
  time: 2022-02-04_00:19:07
  rank: 0 (local_rank: 0)
  exitcode: -11 (pid: 15048)
  error_file: <N/A>
  msg: "Signal 11 (SIGSEGV) received by PID 15048"

Other Failures:
  <NO_OTHER_FAILURES>

opened by vefak 16

assert len(indices) == len(self)

Hello, when I use it, the following error is raised: "assert len(indices) == len(self), f"{indices} not equal {len(self)} while offset is: {offset}"". I then printed the length info:

=====len of indices is 26865 - offset: 0 - len self 36650

Below is the detailed error info. Please help me.

Traceback (most recent call last):
  File "tools/train.py", line 198, in <module>
    main()
  File "tools/train.py", line 193, in main
    meta=meta,
  File "/data6/ziqiwen/code/softteacher/ssod/apis/train.py", line 206, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 117, in run
    iter_loaders = [IterLoader(x) for x in data_loaders]
  File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 117, in <listcomp>
    iter_loaders = [IterLoader(x) for x in data_loaders]
  File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 23, in __init__
    self.iter_loader = iter(self._dataloader)
  File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 291, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 764, in __init__
    self._try_put_index()
  File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 994, in _try_put_index
    index = self._next_index()
  File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 357, in _next_index
    return next(self._sampler_iter) # may raise StopIteration
  File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 208, in __iter__
    for idx in self.sampler:
  File "/data6/ziqiwen/code/softteacher/ssod/datasets/samplers/semi_sampler.py", line 189, in __iter__
    assert len(indices) == len(self)
AssertionError
Traceback (most recent call last):
  File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/home/ziqiwen/anaconda3/envs/mm/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)

opened by winnerziqi 15

Training on a custom dataset

Thanks for sharing your great code! I was trying to train your semi-supervised model on a custom dataset, yet I always get unsup_loss_rpn_bbox: 0.0000, unsup_loss_bbox: 0.0000, even after a long training time. My data has only one object class. Any suggestions, please? Thanks.

This is what I got on the test set. It looks like the network was never trained.

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 3276/3276, 45.6 task/s, elapsed: 72s, ETA: 0s
Evaluating bbox...
Loading and preparing results...
DONE (t=0.00s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=0.36s).
Accumulating evaluation results...
DONE (t=0.07s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.007
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.010
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.010
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.007
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.001
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.001
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.001
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.001
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.000
OrderedDict([('bbox_mAP', 0.007), ('bbox_mAP_50', 0.01), ('bbox_mAP_75', 0.01), ('bbox_mAP_s', 0.0), ('bbox_mAP_m', 0.007), ('bbox_mAP_l', 0.0), ('bbox_mAP_copypaste', '0.007 0.010 0.010 0.000 0.007 0.000')])

opened by Hananali1 14

the r_square between iou and bbox variance in the refine(jitter(bbox)) method

The scatter plot in your paper showing the relationship between IoU and the bbox variance (after jittering) is really interesting and shows a strong correlation. Given that, I want to try another method for estimating bbox quality on a single-stage detector under your soft teacher architecture. I would simply like to know what r_square you achieved with soft teacher and Faster R-CNN + FPN on the COCO 1% labeled dataset, so that I can make a comparison in my future projects. Of course, if I come up with a method under your architecture, I will gladly acknowledge your work in my paper or project report. Sincere thanks for your help!
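
For reference, r_square here presumably means the coefficient of determination of a least-squares line fit, which for a single predictor equals the squared Pearson correlation. A minimal sketch follows, using made-up numbers rather than any values from the paper; `r_squared` is a hypothetical helper.

```python
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation of x and y (R^2 of a simple linear fit)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("x and y must not be constant")
    return cov * cov / (var_x * var_y)


# Made-up box variances and IoUs, for illustration only.
box_variance = [0.01, 0.02, 0.04, 0.08]
iou = [0.9, 0.8, 0.6, 0.2]
print(round(r_squared(box_variance, iou), 3))
```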

opened by Jack-Hu-2001 14

KeyError: 'loss_cls'

After 300 training iterations, "KeyError: 'loss_cls'" is raised.

Below is my training information:

wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: wandb: WARNING Invalid choice
wandb: Enter your choice: wandb: WARNING Invalid choice
wandb: Enter your choice: 3
wandb: You chose 'Don't visualize my results'

CondaEnvException: Unable to determine environment

Please re-run this command with one of the following options:

Provide an environment name via --name or -n

Re-run this command inside an activated conda environment.

wandb: W&B syncing is set to offline in this directory. Run wandb online or set WANDB_MODE=online to enable cloud syncing.
=====group sizes is [1755 7042]
=====len of indices is 14660 - offset: 0 - len self 14660
/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
/home/swap/project/SoftTeacher/thirdparty/mmdetection/mmdet/core/anchor/anchor_generator.py:324: UserWarning: grid_anchors would be deprecated soon. Please use grid_priors
  warnings.warn('grid_anchors would be deprecated soon. '
/home/swap/project/SoftTeacher/thirdparty/mmdetection/mmdet/core/anchor/anchor_generator.py:361: UserWarning: single_level_grid_anchors would be deprecated soon. Please use single_level_grid_priors
  'single_level_grid_anchors would be deprecated soon. '
2021-09-13 07:35:50,265 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
2021-09-13 07:36:05,024 - mmdet.ssod - INFO - Iter [50/14400] lr: 9.890e-04, eta: 1:18:29, time: 0.328, data_time: 0.021, memory: 3390, ema_momentum: 0.9800, unsup_weight: 4, sup_loss_rpn_cls: 0.4607, sup_loss_rpn_bbox: 0.2644, sup_loss_cls: 1.8000, sup_acc: 74.8359, sup_loss_bbox: 0.3120, unsup_loss_rpn_cls: 1.2687, unsup_loss_rpn_bbox: 0.4132, unsup_loss_cls: 3.9225, unsup_acc: 78.9180, unsup_loss_bbox: 2.3666, loss: 10.8082
2021-09-13 07:36:20,035 - mmdet.ssod - INFO - Iter [100/14400] lr: 1.988e-03, eta: 1:14:52, time: 0.300, data_time: 0.012, memory: 3390, ema_momentum: 0.9900, unsup_weight: 4, sup_loss_rpn_cls: 0.2892, sup_loss_rpn_bbox: 0.2540, sup_loss_cls: 0.4055, sup_acc: 92.8125, sup_loss_bbox: 0.2864, unsup_loss_rpn_cls: 0.1523, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0626, unsup_acc: 99.9648, unsup_loss_bbox: 0.0114, loss: 1.4615
2021-09-13 07:36:35,185 - mmdet.ssod - INFO - Iter [150/14400] lr: 2.987e-03, eta: 1:13:43, time: 0.303, data_time: 0.012, memory: 3390, ema_momentum: 0.9933, unsup_weight: 4, sup_loss_rpn_cls: 0.2405, sup_loss_rpn_bbox: 0.2624, sup_loss_cls: 0.3531, sup_acc: 92.7383, sup_loss_bbox: 0.2667, unsup_loss_rpn_cls: 0.1803, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0657, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.3686
2021-09-13 07:36:51,586 - mmdet.ssod - INFO - Iter [200/14400] lr: 3.986e-03, eta: 1:14:30, time: 0.328, data_time: 0.012, memory: 3390, ema_momentum: 0.9950, unsup_weight: 4, sup_loss_rpn_cls: 0.3032, sup_loss_rpn_bbox: 0.3015, sup_loss_cls: 0.4180, sup_acc: 92.7266, sup_loss_bbox: 0.2982, unsup_loss_rpn_cls: 0.1778, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0629, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.5616
2021-09-13 07:37:07,086 - mmdet.ssod - INFO - Iter [250/14400] lr: 4.985e-03, eta: 1:14:01, time: 0.310, data_time: 0.012, memory: 3390, ema_momentum: 0.9960, unsup_weight: 4, sup_loss_rpn_cls: 0.3210, sup_loss_rpn_bbox: 0.4059, sup_loss_cls: 0.4333, sup_acc: 92.9453, sup_loss_bbox: 0.3436, unsup_loss_rpn_cls: 0.1687, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0740, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.7465
2021-09-13 07:37:22,497 - mmdet.ssod - INFO - Iter [300/14400] lr: 5.984e-03, eta: 1:13:32, time: 0.308, data_time: 0.011, memory: 3390, ema_momentum: 0.9967, unsup_weight: 4, sup_loss_rpn_cls: 0.3386, sup_loss_rpn_bbox: 0.4950, sup_loss_cls: 0.3686, sup_acc: 93.9102, sup_loss_bbox: 0.2682, unsup_loss_rpn_cls: 0.1938, unsup_loss_rpn_bbox: 0.0000, unsup_loss_cls: 0.0760, unsup_acc: 100.0000, unsup_loss_bbox: 0.0000, loss: 1.7401
Traceback (most recent call last):
  File "tools/train.py", line 198, in <module>
    main()
  File "tools/train.py", line 193, in main
    meta=meta,
  File "/home/swap/project/SoftTeacher/ssod/apis/train.py", line 205, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 53, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/swap/project/SoftTeacher/thirdparty/mmdetection/mmdet/models/detectors/base.py", line 238, in train_step
    losses = self(**data)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/home/swap/project/SoftTeacher/thirdparty/mmdetection/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/home/swap/project/SoftTeacher/ssod/models/soft_teacher.py", line 50, in forward_train
    data_groups["unsup_teacher"], data_groups["unsup_student"]
  File "/home/swap/project/SoftTeacher/ssod/models/soft_teacher.py", line 77, in foward_unsup_train
    return self.compute_pseudo_label_loss(student_info, teacher_info)
  File "/home/swap/project/SoftTeacher/ssod/models/soft_teacher.py", line 120, in compute_pseudo_label_loss
    student_info=student_info,
  File "/home/swap/project/SoftTeacher/ssod/models/soft_teacher.py", line 244, in unsup_rcnn_cls_loss
    loss["loss_cls"] = loss["loss_cls"].sum() / max(bbox_targets[1].sum(), 1.0)
KeyError: 'loss_cls'
opened by duany049 14

Formal definition of G_cls, G_reg, l_cls, l_reg

Hello @MendelXu!

The paper mentions that G_cls is produced from the teacher model by foreground filtering and G_reg is produced from the teacher model by box variance filtering; unfortunately, the paper doesn't give the definitions of l_cls and l_reg. l_cls(student_candidate_box, teacher_pseudo_boxes) and l_reg(student_candidate_box, teacher_pseudo_boxes) still need to perform label assignment. How is this assignment performed? What losses are used?

The paper mentions: Another important benefit of this end-to-end framework is that it allows for greater leverage of the teacher model to guide the training of the student model, rather than just providing “some generated pseudo boxes with hard category labels” as in previous approaches [27, 36]. A soft teacher approach is proposed to implement this insight. In this approach, the teacher model is used to directly assess all the box candidates that are generated by the student model, rather than providing “pseudo boxes” to assign category labels and regression vectors to these student-generated box candidates.

It seems that in SoftTeacher, pseudo boxes with hard labels (G_cls, G_reg) are still generated and that some standard IoU-based target-box matching / assignment (l_cls, l_reg) is used. If that's not the case, could you please clarify?

It's possible to recover this information from the code, but some formal definitions could help reading the code as well.

Thank you!

opened by vadimkantorov 13

Error while trying to train with 4 gpus

Congratulations on the great work. I am getting this error while trying to train with 4 GPUs. Can you please help me out?

File "/data/SoftTeacher/tools/train.py", line 198, in <module>
    main()
  File "/data/SoftTeacher/tools/train.py", line 186, in main
    train_detector(
  File "/data/SoftTeacher/ssod/apis/train.py", line 206, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/detectors/base.py", line 238, in train_step
    losses = self(**data)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/data/SoftTeacher/ssod/models/soft_teacher.py", line 44, in forward_train
    sup_loss = self.student.forward_train(**data_groups["sup"])
  File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/detectors/two_stage.py", line 135, in forward_train
    rpn_losses, proposal_list = self.rpn_head.forward_train(
  File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/dense_heads/base_dense_head.py", line 59, in forward_train
    proposal_list = self.get_bboxes(*outs, img_metas, cfg=proposal_cfg)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 214, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/dense_heads/rpn_head.py", line 152, in get_bboxes
    proposals = self._get_bboxes_single(cls_score_list, bbox_pred_list,
  File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/dense_heads/rpn_head.py", line 244, in _get_bboxes_single
    dets, keep = batched_nms(proposals, scores, ids, cfg.nms)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/ops/nms.py", line 307, in batched_nms
    dets, keep = nms_op(boxes_for_nms, scores, **nms_cfg_)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/utils/misc.py", line 330, in new_func
    output = old_func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/ops/nms.py", line 171, in nms
    inds = NMSop.apply(boxes, scores, iou_threshold, offset,
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/ops/nms.py", line 26, in forward
    inds = ext_module.nms(
RuntimeError: CUDA error: invalid device function
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
/data/SoftTeacher/thirdparty/mmdetection/mmdet/core/anchor/anchor_generator.py:324: UserWarning: ``grid_anchors`` would be deprecated soon. Please use ``grid_priors`` 
  warnings.warn('``grid_anchors`` would be deprecated soon. '
/data/SoftTeacher/thirdparty/mmdetection/mmdet/core/anchor/anchor_generator.py:360: UserWarning: ``single_level_grid_anchors`` would be deprecated soon. Please use ``single_level_grid_priors`` 
  warnings.warn(
Traceback (most recent call last):
  File "/data/SoftTeacher/tools/train.py", line 198, in <module>
    main()
  File "/data/SoftTeacher/tools/train.py", line 186, in main
    train_detector(
  File "/data/SoftTeacher/ssod/apis/train.py", line 206, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 133, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/detectors/base.py", line 238, in train_step
    losses = self(**data)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 128, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/detectors/base.py", line 172, in forward
    return self.forward_train(img, img_metas, **kwargs)
  File "/data/SoftTeacher/ssod/models/soft_teacher.py", line 44, in forward_train
    sup_loss = self.student.forward_train(**data_groups["sup"])
  File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/detectors/two_stage.py", line 135, in forward_train
    rpn_losses, proposal_list = self.rpn_head.forward_train(
  File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/dense_heads/base_dense_head.py", line 59, in forward_train
    proposal_list = self.get_bboxes(*outs, img_metas, cfg=proposal_cfg)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 214, in new_func
    output = old_func(*new_args, **new_kwargs)
  File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/dense_heads/rpn_head.py", line 152, in get_bboxes
    proposals = self._get_bboxes_single(cls_score_list, bbox_pred_list,
  File "/data/SoftTeacher/thirdparty/mmdetection/mmdet/models/dense_heads/rpn_head.py", line 244, in _get_bboxes_single
    dets, keep = batched_nms(proposals, scores, ids, cfg.nms)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/ops/nms.py", line 307, in batched_nms
    dets, keep = nms_op(boxes_for_nms, scores, **nms_cfg_)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/utils/misc.py", line 330, in new_func
    output = old_func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/ops/nms.py", line 171, in nms
    inds = NMSop.apply(boxes, scores, iou_threshold, offset,
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/mmcv/ops/nms.py", line 26, in forward
    inds = ext_module.nms(
RuntimeError: CUDA error: invalid device function
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

wandb: Waiting for W&B process to finish, PID 38162
wandb: Program failed with code 1. 
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 1 (pid: 37922) of binary: /home/ubuntu/anaconda3/envs/py39/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
**************************************************
              tools/train.py FAILED               
==================================================
Root Cause:
[0]:
  time: 2021-09-29_15:26:15
  rank: 1 (local_rank: 1)
  exitcode: -11 (pid: 37922)
  error_file: <N/A>
  msg: "Signal 11 (SIGSEGV) received by PID 37922"
==================================================
Other Failures:
[1]:
  time: 2021-09-29_15:26:15
  rank: 3 (local_rank: 3)
  exitcode: -11 (pid: 37924)
  error_file: <N/A>
  msg: "Signal 11 (SIGSEGV) received by PID 37924"
**************************************************

opened by sobujmaroon 12

mmcv incompatibility with mmdetection

I could not find a compatible mmcv-full version that can import MultiScaleDeformableAttention, even though I tried the version you indicated.

/cta/users/mehmet/.conda/envs/softteacher/lib/python3.6/site-packages/mmcv/cnn/bricks/transformer.py:27: UserWarning: Fail to import ``MultiScaleDeformableAttention`` from ``mmcv.ops.multi_scale_deform_attn``, You should install ``mmcv-full`` if you need this module.
  warnings.warn('Fail to import ``MultiScaleDeformableAttention`` from '
/cta/users/mehmet/SoftTeacher/thirdparty/mmdetection/mmdet/models/utils/transformer.py:27: UserWarning: `MultiScaleDeformableAttention` in MMCV has been moved to `mmcv.ops.multi_scale_deform_attn`, please update your MMCV
  '`MultiScaleDeformableAttention` in MMCV has been moved to '
Traceback (most recent call last):
  File "/cta/users/mehmet/SoftTeacher/thirdparty/mmdetection/mmdet/models/utils/transformer.py", line 23, in <module>
    from mmcv.ops.multi_scale_deform_attn import MultiScaleDeformableAttention
  File "/cta/users/mehmet/.conda/envs/softteacher/lib/python3.6/site-packages/mmcv/ops/__init__.py", line 1, in <module>
    from .bbox import bbox_overlaps
  File "/cta/users/mehmet/.conda/envs/softteacher/lib/python3.6/site-packages/mmcv/ops/bbox.py", line 3, in <module>
    ext_module = ext_loader.load_ext('_ext', ['bbox_overlaps'])
  File "/cta/users/mehmet/.conda/envs/softteacher/lib/python3.6/site-packages/mmcv/utils/ext_loader.py", line 12, in load_ext
    ext = importlib.import_module('mmcv.' + name)
  File "/cta/users/mehmet/.conda/envs/softteacher/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 15, in <module>
    from mmdet.models import build_detector
  File "/cta/users/mehmet/SoftTeacher/thirdparty/mmdetection/mmdet/models/__init__.py", line 2, in <module>
    from .backbones import *  # noqa: F401,F403
  File "/cta/users/mehmet/SoftTeacher/thirdparty/mmdetection/mmdet/models/backbones/__init__.py", line 2, in <module>
    from .csp_darknet import CSPDarknet
  File "/cta/users/mehmet/SoftTeacher/thirdparty/mmdetection/mmdet/models/backbones/csp_darknet.py", line 11, in <module>
    from ..utils import CSPLayer
  File "/cta/users/mehmet/SoftTeacher/thirdparty/mmdetection/mmdet/models/utils/__init__.py", line 14, in <module>
    from .transformer import (DetrTransformerDecoder, DetrTransformerDecoderLayer,
  File "/cta/users/mehmet/SoftTeacher/thirdparty/mmdetection/mmdet/models/utils/transformer.py", line 29, in <module>
    from mmcv.cnn.bricks.transformer import MultiScaleDeformableAttention
ImportError: cannot import name 'MultiScaleDeformableAttention'

opened by makifozkanoglu 10

Replacing the detector with Cascade raises TypeError: _bbox_forward() missing 1 required positional argument: 'rois'

I did some preliminary debugging. When using Cascade, self.teacher.roi_head.simple_test_bboxes() ends up calling def _bbox_forward(self, stage, x, rois) defined in cascade_roi_head.py, but the call in test_mixins.py is bbox_results = self._bbox_forward(x, rois), which lacks the stage argument. How should this be handled?
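The mismatch described above can be reproduced without mmdetection. The following is a minimal sketch, assuming two hypothetical stand-in classes (ToyStandardRoIHead and ToyCascadeRoIHead are illustrative names, not mmdet classes) whose _bbox_forward signatures match the ones quoted in the report. It only shows why Python reports 'rois' as the missing argument; it is not a fix.

```python
class ToyStandardRoIHead:
    """Stand-in for a single-stage RoI head (hypothetical, not the mmdet class)."""

    def _bbox_forward(self, x, rois):
        return {"stage": None, "num_rois": len(rois)}


class ToyCascadeRoIHead:
    """Stand-in for a cascade head whose _bbox_forward also takes a stage index."""

    def _bbox_forward(self, stage, x, rois):
        return {"stage": stage, "num_rois": len(rois)}


def simple_test_bboxes(head, x, rois):
    # Mirrors the call quoted from test_mixins.py: no stage argument is passed.
    return head._bbox_forward(x, rois)


features, rois = object(), [(0, 0, 10, 10)]

# Works: the two-argument call matches the single-stage signature.
standard_result = simple_test_bboxes(ToyStandardRoIHead(), features, rois)

try:
    simple_test_bboxes(ToyCascadeRoIHead(), features, rois)
    cascade_error = None
except TypeError as exc:
    # x is bound to `stage` and rois is bound to `x`, so `rois` is reported missing.
    cascade_error = str(exc)

print(standard_result)
print(cascade_error)
```

Because the two positional values shift one slot to the left, the error names 'rois' even though the argument that was actually omitted is stage.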

opened by zhanghang-cv 9
Error in training

Error in full training:

tools/train.py FAILED

Root Cause:
[0]:
  time: 2021-10-04_17:02:18
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 921)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"

Other Failures: <NO_OTHER_FAILURES>

I am using only one GPU and get an error in full training with my own data converted to COCO format.

First, I split the data with "bash tools/dataset/prepare_coco_data.sh conduct", then trained with "bash tools/dist_train.sh configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k.py 1".

I also trained as described in the README with the COCO data and still get errors, in both full and semi-supervised training. It gets stuck at:

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mhoa0vu3/none_y1enyj09/attempt_0/0/error.json

opened by luisfra19 9
Cascade RCNN with Soft Teacher does not run: _bbox_forward() missing 1 required positional argument: 'rois'
When trying to run Cascade RCNN with Soft Teacher, I get the following exception:

  File "/...../python3.8/site-packages/mmdet/models/roi_heads/test_mixins.py", line 89, in simple_test_bboxes
    bbox_results = self._bbox_forward(x, rois)
TypeError: _bbox_forward() missing 1 required positional argument: 'rois'

This problem was raised previously in #123 and #106, but neither was fully answered. Apparently Cascade RCNN's ROI heads will somehow use the incorrect test function.

There was never a satisfactory answer to how to fix this problem other than a cryptic response: "The dirty way I used before is that the feature and teacher on the teacher side are passed into the head as parameters, and then the teacher is used to make judgments in the head.."

Has anyone successfully run Cascade RCNN with Soft Teacher?

Does anyone know the fix to make the above problem work with Cascade RCNN?

Thank you in advance! Cheers, Mark
opened by planaria158 0

Config files for evaluating the provided models

Hi. Is it possible to share the config files used for evaluating the weights available through the Google Drive links?

I was trying to reproduce the 44.05% mAP of the Faster R-CNN (ResNet-50) -- Ours (thr=5e-2) experiment. However, I only get Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.324.

The command that I ran was the following:

bash tools/dist_test.sh \
    /home/ubuntu/project/Detection/SoftTeacher/configs/soft_teacher/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k_eval.py \
    /home/ubuntu/project/Detection/SoftTeacher/work_dirs/soft_teacher_faster_rcnn_r50_caffe_fpn_coco_full_720k/coco_iter_720000.pth \
    1 --eval bbox --cfg-options model.test_cfg.rcnn.score_thr=0.90

The config file is the following:

_base_="base.py"

data = dict(
    samples_per_gpu=8,
    workers_per_gpu=5,
    train=dict(
        sup=dict(
            ann_file="/home/ubuntu/project/data/COCO/annotations/instances_train2017.json",
            img_prefix="/home/ubuntu/project/data/COCO/train2017/",
        ),
    ),
    val=dict(
        ann_file="/home/ubuntu/project/data/COCO/annotations/instances_val2017.json",
        img_prefix="/home/ubuntu/project/data/COCO/val2017/",
    ),
    test=dict(
        ann_file="/home/ubuntu/project/data/COCO/annotations/instances_val2017.json",
        img_prefix="/home/ubuntu/project/data/COCO/val2017/",
    ),

    sampler=dict(
        train=dict(
            sample_ratio=[1, 1],
        )
    )
)

semi_wrapper = dict(
    train_cfg=dict(
        unsup_weight=2.0,
    )
)

optimizer = dict(lr=0.01, weight_decay=1e-4, momentum=0.9)
lr_config = dict(step=[300000, 425000])
runner = dict(_delete_=True, type="IterBasedRunner", max_iters=450000)

Could someone help me out? Thank you. If there is an existing issue about this that I missed, I apologize in advance.
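One setting in the command above that may be worth checking is model.test_cfg.rcnn.score_thr=0.90, since the reported result is labeled thr=5e-2. The sketch below is a toy illustration, not the COCO evaluator: it uses made-up detections and a simplified all-point interpolated AP (the function name average_precision and the data are assumptions) to show how a high score threshold cuts off the low-confidence tail of the precision-recall curve. It is not a confirmed diagnosis of the gap reported here.

```python
def average_precision(detections, num_gt, score_thr):
    """Simplified all-point interpolated AP over detections with score >= score_thr.

    detections: list of (score, is_true_positive) pairs (toy data, one class).
    num_gt: number of ground-truth boxes.
    """
    kept = sorted((d for d in detections if d[0] >= score_thr),
                  key=lambda d: -d[0])
    true_positives = 0
    points = []  # (recall, precision) after each ranked detection
    for rank, (_, is_tp) in enumerate(kept, start=1):
        true_positives += int(is_tp)
        points.append((true_positives / num_gt, true_positives / rank))

    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        # Interpolate: use the best precision reachable at this recall or higher.
        best_precision = max(p for _, p in points[i:])
        ap += (recall - prev_recall) * best_precision
        prev_recall = recall
    return ap


# Made-up detections for illustration only.
toy_detections = [
    (0.95, True), (0.92, True), (0.80, False), (0.60, True),
    (0.30, True), (0.08, False), (0.06, True),
]
num_gt = 5

ap_high_thr = average_precision(toy_detections, num_gt, score_thr=0.90)
ap_low_thr = average_precision(toy_detections, num_gt, score_thr=0.05)

print(f"AP at score_thr=0.90: {ap_high_thr:.3f}")
print(f"AP at score_thr=0.05: {ap_low_thr:.3f}")
```

In this toy case only two detections survive the 0.90 threshold, so recall never exceeds 0.4 and the AP is lower than with the 0.05 threshold, where the low-scoring true positives still contribute.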

opened by Bai-YT 0

lr_config set up according to max iter?

How should I set the lr_config parameters according to max_iters? I now want to run only 20k steps as the maximum number of iterations. What should lr_config be?

lr_config = dict(step=[120000 * 4, 160000 * 4])

runner = dict(_delete_=True, type="IterBasedRunner", max_iters=12000)
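One common heuristic, sketched below, is to keep each learning-rate decay milestone at the same fraction of the schedule when max_iters changes. The helper scale_lr_steps is a hypothetical name, and the reference length of 180000 * 4 iterations is an assumption inferred from the "720k" config name rather than something stated in this question; adjust it to whatever the base config actually uses.

```python
def scale_lr_steps(ref_steps, ref_max_iters, new_max_iters):
    """Rescale decay milestones so each stays at the same fraction of training."""
    return [round(step * new_max_iters / ref_max_iters) for step in ref_steps]


# Milestones quoted in the question.
ref_steps = [120000 * 4, 160000 * 4]
# Assumption: the reference schedule runs for 720k iterations.
ref_max_iters = 180000 * 4
# Target schedule length from the question text (20k steps).
new_max_iters = 20000

new_steps = scale_lr_steps(ref_steps, ref_max_iters, new_max_iters)

# Config fragments in the same style as the snippet above.
lr_config = dict(step=new_steps)
runner = dict(_delete_=True, type="IterBasedRunner", max_iters=new_max_iters)

print(lr_config)
print(runner)
```

With these assumptions the milestones land at roughly two thirds and eight ninths of the run. Whatever values are chosen, every step should be smaller than max_iters, otherwise the decay never triggers.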

opened by vefak 0
After training for 4000 iters, the program hangs once validation finishes. What is going on?

2022-07-07 14:05:26,893 - mmdet.ssod - INFO -
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.135
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.269
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.120
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.071
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.155
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.164
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.262
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.262
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.262
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.119
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.281
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.331

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 5000/5000, 155.5 task/s, elapsed: 32s, ETA: 0s
2022-07-07 14:06:02,265 - mmdet.ssod - INFO - Evaluating bbox...
Loading and preparing results...
DONE (t=0.11s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type bbox
DONE (t=14.59s).
Accumulating evaluation results...
DONE (t=3.20s).
2022-07-07 14:06:20,482 - mmdet.ssod - INFO -
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.088
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=1000 ] = 0.193
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=1000 ] = 0.068
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.045
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.105
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.112
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.169
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=300 ] = 0.169
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=1000 ] = 0.169
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.063
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.186
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.216

2022-07-07 14:06:20,850 - mmdet.ssod - INFO - Exp name: soft_teacher_faster_rcnn_r50_caffe_fpn_coco_180k.py
2022-07-07 14:06:20,855 - mmdet.ssod - INFO - Iter(val) [4000] teacher.bbox_mAP: 0.1350, teacher.bbox_mAP_50: 0.2690, teacher.bbox_mAP_75: 0.1200, teacher.bbox_mAP_s: 0.0710, teacher.bbox_mAP_m: 0.1550, teacher.bbox_mAP_l: 0.1640, teacher.bbox_mAP_copypaste: 0.135 0.269 0.120 0.071 0.155 0.164, student.bbox_mAP: 0.0880, student.bbox_mAP_50: 0.1930, student.bbox_mAP_75: 0.0680, student.bbox_mAP_s: 0.0450, student.bbox_mAP_m: 0.1050, student.bbox_mAP_l: 0.1120, student.bbox_mAP_copypaste: 0.088 0.193 0.068 0.045 0.105 0.112

After finishing 4000 iters, the program just hangs and does not continue. What is causing this?

opened by mary-0830 2

Owner

Microsoft

Data-Uncertainty Guided Multi-Phase Learning for Semi-supervised Object Detection

CVPR2022 paper "Dense Learning based Semi-Supervised Object Detection"

PyTorch code for ICLR 2021 paper Unbiased Teacher for Semi-Supervised Object Detection

Instant-Teaching: An End-to-End Semi-Supervised Object Detection Framework

Group R-CNN for Point-based Weakly Semi-supervised Object Detection (CVPR2022)

Project looking into use of autoencoder for semi-supervised learning and comparing data requirements compared to supervised learning.

UniMoCo: Unsupervised, Semi-Supervised and Full-Supervised Visual Representation Learning

Code and models for ICCV2021 paper "Robust Object Detection via Instance-Level Temporal Cycle Confusion".

TOOD: Task-aligned One-stage Object Detection, ICCV2021 Oral

Exploring Classification Equilibrium in Long-Tailed Object Detection, ICCV2021

ICCV2021 Paper: AutoShape: Real-Time Shape-Aware Monocular 3D Object Detection

[CVPR 2021] MiVOS - Mask Propagation module. Reproduced STM (and better) with training code :star2:. Semi-supervised video object segmentation evaluation.

Semi-Supervised 3D Hand-Object Poses Estimation with Interactions in Time

CoSMA: Convolutional Semi-Regular Mesh Autoencoder. From Paper "Mesh Convolutional Autoencoder for Semi-Regular Meshes of Different Sizes"

Yolo object detection - Yolo object detection with python

code for paper "Not All Unlabeled Data are Equal: Learning to Weight Data in Semi-supervised Learning" by Zhongzheng Ren*, Raymond A. Yeh*, Alexander G. Schwing.

PyTorch implementation of our ICCV2021 paper: StructDepth: Leveraging the structural regularities for self-supervised indoor depth estimation

Perturbed Self-Distillation: Weakly Supervised Large-Scale Point Cloud Semantic Segmentation (ICCV2021)

MOT-Tracking-by-Detection-Pipeline - For Tracking-by-Detection format MOT (Multi Object Tracking), is it a framework that separates Detection and Tracking processes?
