While training on a single machine with two GPUs, I encountered the following error. What is causing it?
(base) lixiang@vs008:~/CWT-for-FSS$ sh scripts/train.sh pascal 0 [0,1] 50 1
0%| | 0/5953 [00:00<?, ?it/s]==> Running process rank 0.
FB_param_noise: 0
adapt_iter: 200
arch: resnet
augmentations: ['hor_flip', 'vert_flip', 'resize']
backbone_dim: 2048
batch_size: 2
batch_size_val: 2
bins: [1, 2, 3, 6]
bottleneck_dim: 512
ckpt_path: checkpoints/
ckpt_used: best
cls_lr: 0.1
data_root: pascal/
debug: False
distributed: True
dropout: 0.1
episodic: True
epochs: 20
gamma: 0.1
gpus: [0, 1]
heads: 4
image_size: 473
iter_per_epoch: 6000
layers: 50
log_freq: 50
lr_stepsize: 30
m_scale: False
main_optim: SGD
manual_seed: 2021
mean: [0.485, 0.456, 0.406]
milestones: [40, 70]
mixup: False
model_dir: model_ckpt
momentum: 0.9
n_runs: 1
nesterov: True
norm_feat: True
num_classes_tr: 2
num_classes_val: 5
padding_label: 255
port: 53765
pretrained: False
random_shot: False
resume_weights: /pretrained_models/
rot_max: 10
rot_min: -10
save_models: True
save_oracle: False
scale_lr: 1.0
scale_max: 2.0
scale_min: 0.5
scheduler: cosine
shot: 1
smoothing: True
std: [0.229, 0.224, 0.225]
test_name: default
test_num: 1000
test_split: default
train_list: lists/pascal/train.txt
train_name: pascal
train_split: 0
trans_lr: 0.001
use_split_coco: False
val_list: lists/pascal/val.txt
weight_decay: 0.0001
workers: 2
=> no weight found at '/pretrained_models/'
Processing data for [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
0%| | 0/5953 [00:00<?, ?it/s]==> Running process rank 1.
... (rank 1 prints the same argument dump as rank 0 above) ...
=> no weight found at '/pretrained_models/'
Processing data for [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
100%|███████████████████████████████████████████████████████████████████████████| 5953/5953 [00:08<00:00, 681.75it/s]
100%|███████████████████████████████████████████████████████████████████████████| 5953/5953 [00:09<00:00, 609.93it/s]
0%| | 0/1449 [00:00<?, ?it/s]INFO: pascal -> pascal
INFO: 0 -> 0
>> Start Filtering classes
>> Removed classes = []
>> Kept classes = ['airplane', 'bicycle', 'bird', 'boat', 'bottle']
Processing data for [1, 2, 3, 4, 5]
0%| | 0/1449 [00:00<?, ?it/s]INFO: pascal -> pascal
INFO: 0 -> 0
>> Start Filtering classes
>> Removed classes = []
>> Kept classes = ['airplane', 'bicycle', 'bird', 'boat', 'bottle']
Processing data for [1, 2, 3, 4, 5]
100%|███████████████████████████████████████████████████████████████████████████| 1449/1449 [00:06<00:00, 229.57it/s]
100%|███████████████████████████████████████████████████████████████████████████| 1449/1449 [00:05<00:00, 241.58it/s]
Traceback (most recent call last):
File "/home/lixiang/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/lixiang/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/lixiang/CWT-for-FSS/src/train.py", line 360, in <module>
mp.spawn(main_worker, args=(world_size, args), nprocs=world_size, join=True)
File "/home/lixiang/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/lixiang/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/lixiang/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/lixiang/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/lixiang/CWT-for-FSS/src/train.py", line 134, in main_worker
_, _ = do_epoch(
File "/home/lixiang/CWT-for-FSS/src/train.py", line 266, in do_epoch
output_support = binary_cls(f_s)
File "/home/lixiang/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/lixiang/anaconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 446, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/lixiang/anaconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 442, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)
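In case it helps, here is a minimal sketch of how I understand each spawned worker is supposed to pin its modules to its own GPU, plus the pattern that would reproduce this exact error on rank 1. This is not the actual CWT-for-FSS code; the Conv2d head and tensor shapes below are placeholders standing in for binary_cls and f_s from the traceback.

import torch
import torch.nn as nn
import torch.multiprocessing as mp

def main_worker(rank: int, world_size: int) -> None:
    # Make cuda:<rank> the default device for this process.
    torch.cuda.set_device(rank)

    # Placeholder stand-in for the binary classifier head (binary_cls).
    binary_cls = nn.Conv2d(512, 2, kernel_size=1)

    # Correct: move the weights to this rank's GPU.
    binary_cls = binary_cls.to(rank)
    # Buggy variant that would reproduce the error on rank 1:
    # binary_cls = binary_cls.cuda(0)   # weights stay on cuda:0

    # Placeholder for the support features f_s, computed on this rank's GPU.
    f_s = torch.randn(2, 512, 60, 60, device=f"cuda:{rank}")

    out = binary_cls(f_s)  # device mismatch here if the weights sit on cuda:0
    print(f"rank {rank}: output on {out.device}")

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # 2 on this machine
    mp.spawn(main_worker, args=(world_size,), nprocs=world_size, join=True)

Since process 0 runs fine and only process 1 crashes, my guess is that binary_cls (or a tensor it is built from) ends up on cuda:0 via a plain .cuda() call instead of .to(rank) somewhere in main_worker / do_epoch, while f_s lives on cuda:1. Is that the cause, or is something else going on?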