Deep learning for image processing, including classification, object detection, etc.




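  • This tutorial organizes and summarizes what I studied during my graduate research, in the hope that it helps others along the way. I will also share new material here as I learn it.
  • The tutorial is delivered as videos, and the teaching flow is as follows:
  • All course slides (PPT) are in the course_ppt folder; download them as needed.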




  • Anaconda3 (recommended)
  • python3.6/3.7/3.8
  • pycharm (IDE)
  • pytorch 1.7.1 (pip package)
  • torchvision 0.8.1 (pip package)
  • tensorflow 2.4.1 (pip package)




  • Requesting your permission


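    I am very sorry to bother you. Since I do not have your contact details, this is the only way I can ask for your consent. The paper I wrote uses your SSD and Faster R-CNN code for its experiments, and I will publish my code and experimental data in my own repository, with a link to your code. Thank you very much for your code and video explanations; they helped me a lot. I hope you will agree (I have also sent you a private message on Bilibili). If you agree, please remember to reply to me. Thank you again.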

  • FasterRCNN training error

    System information

    • Have I written custom code: no
    • OS Platform(e.g., window10 or Linux Ubuntu 16.04): linux
    • Python version: 3.8
    • Deep learning framework and version(e.g., Tensorflow2.1 or Pytorch1.3): torch1.6
    • Use GPU or not: yes
    • CUDA/cuDNN version(if you use GPU):
    • The network you trained(e.g., Resnet34 network): resnet50fpn

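    Describe the current behavior: Hello, I am training faster_rcnn on my own dataset, which has six object classes. I set num_classes=7 in create model, but this error still occurs. I have not changed anything else. How can I fix this?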

    Error info / logs

    Namespace(batch_size=8, data_path='/research/dept8/qdou/zwang/data/robo/final', device='cuda:0', epochs=50, output_dir='./save_weights', resume='', start_epoch=0)
    Using cuda device training.
    Using 8 dataloader workers
    /pytorch/aten/src/ATen/native/cuda/ operator(): block: [3,0,0], thread: [82,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    /pytorch/aten/src/ATen/native/cuda/ operator(): block: [3,0,0], thread: [83,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    /pytorch/aten/src/ATen/native/cuda/ operator(): block: [3,0,0], thread: [84,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
    Traceback (most recent call last):
      File "", line 167, in <module>
      File "", line 99, in main
        utils.train_one_epoch(model, optimizer, train_data_loader,
      File "/research/dept8/qdou/zwang/faster_rcnn/train_utils/", line 34, in train_one_epoch
        loss_dict = model(images, targets)
      File "/research/dept8/qdou/zwang/anaconda3/lib/python3.8/site-packages/torch/nn/modules/", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/research/dept8/qdou/zwang/faster_rcnn/network_files/", line 93, in forward
        detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
      File "/research/dept8/qdou/zwang/anaconda3/lib/python3.8/site-packages/torch/nn/modules/", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/research/dept8/qdou/zwang/faster_rcnn/network_files/", line 367, in forward
        proposals, matched_idxs, labels, regression_targets = self.select_training_samples(proposals, targets)
      File "/research/dept8/qdou/zwang/faster_rcnn/network_files/", line 222, in select_training_samples
        matched_idxs, labels = self.assign_targets_to_proposals(proposals, gt_boxes, gt_labels)
      File "/research/dept8/qdou/zwang/faster_rcnn/network_files/", line 144, in assign_targets_to_proposals
        labels_in_image[bg_inds] = 0
    RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
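
    A device-side "index out of bounds" assert during detection training is often caused by a label id that falls outside [0, num_classes - 1]. That is only one possible cause here, not a confirmed diagnosis. A minimal sketch of a pre-training check follows; the function name and the sample labels are hypothetical, and the labels are plain Python integers rather than tensors:

```python
def find_out_of_range_labels(labels_per_image, num_classes):
    """Return (image_index, label) pairs whose label is outside [0, num_classes - 1]."""
    bad = []
    for image_index, labels in enumerate(labels_per_image):
        for label in labels:
            if not 0 <= label < num_classes:
                bad.append((image_index, label))
    return bad


# Hypothetical annotations: six object classes plus background gives num_classes = 7,
# so the valid label ids are 0 (background) through 6.
labels_per_image = [[1, 3], [6], [2, 7]]
print(find_out_of_range_labels(labels_per_image, num_classes=7))  # [(2, 7)]
```

    Running such a check over the whole dataset on the CPU gives a readable report, whereas the CUDA assert only surfaces later at an unrelated line.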
  • mIoU of 0 for one class in the FCN network


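    Hello, I would like to ask a question. I am using the FCN network for medical tumor segmentation, and in the results file the IoU of the second class is always 0. It looks like this:

    [epoch: 7] train_loss: 0.00193 lr: 0.00780 global correct: 99.8 average row correct: ['100.0', '0.0'] IoU: ['99.8', '0.0'] mean IoU: 49.9

    I have already tried the following changes:
    * not loading the resnet50 pretrained weights
    * changing the initial learning rate to 0.001 or 0.01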
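
    The numbers in that log line are exactly what a confusion matrix produces when every pixel is predicted as background. A small worked sketch follows, assuming a hypothetical image set with 998 background pixels and 2 tumor pixels; the counts are chosen only to reproduce the reported percentages, and the function name is illustrative:

```python
def segmentation_metrics(conf):
    """conf[i][j] = number of pixels whose true class is i and predicted class is j."""
    n = len(conf)
    total = sum(sum(row) for row in conf)
    diag = [conf[i][i] for i in range(n)]
    row_sum = [sum(conf[i]) for i in range(n)]                       # true pixels per class
    col_sum = [sum(conf[i][j] for i in range(n)) for j in range(n)]  # predicted pixels per class
    global_correct = sum(diag) / total
    row_correct = [diag[i] / row_sum[i] for i in range(n)]
    iou = [diag[i] / (row_sum[i] + col_sum[i] - diag[i]) for i in range(n)]
    return global_correct, row_correct, iou, sum(iou) / n


# Hypothetical counts: every pixel, including the 2 tumor pixels, is predicted as class 0.
conf = [[998, 0],
        [2, 0]]
g, rows, iou, miou = segmentation_metrics(conf)
print(f"global correct: {g * 100:.1f}")                   # global correct: 99.8
print("row correct:", [f"{r * 100:.1f}" for r in rows])   # row correct: ['100.0', '0.0']
print("IoU:", [f"{v * 100:.1f}" for v in iou])            # IoU: ['99.8', '0.0']
print(f"mean IoU: {miou * 100:.1f}")                      # mean IoU: 49.9
```

    This suggests the model has collapsed to the majority class, which is common with heavy class imbalance; it does not by itself say which remedy (loss weighting, sampling, and so on) would help.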


  • MobileNetV2 training error

    System information

    • Have I written custom code: NO
    • OS Platform(e.g., window10 or Linux Ubuntu 16.04): MacOS Big Sur
    • Python version: 3.9.5
    • Deep learning framework and version(e.g., Tensorflow2.1 or Pytorch1.3): Pytorch 1.9
    • Use GPU or not: Not
    • CUDA/cuDNN version(if you use GPU):
    • The network you trained(e.g., Resnet34 network): MobileNetV2

    Describe the current behavior

    Error info / logs: (screenshot, 2021-07-04 23:37:17)

  • Error when training retinanet on multiple GPUs


    Hello, teacher! (doge) I modified the retinanet backbone by adding a CBAM module. Training on a single GPU works without errors, but multi-GPU training fails. From my translation of the message, it seems to be a problem with gradients not flowing back to some parameters; I searched online but could not work it out. Could you take a look? I actually hit the same error when running ssd and ignored it, and now it has come up again. The error output is:

    Start training
    /home/lb511/anaconda3/envs/lhaozz/lib/python3.9/site-packages/torch/ UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1640811803361/work/aten/src/ATen/native/TensorShape.cpp:2157.)
      return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
    /home/lb511/anaconda3/envs/lhaozz/lib/python3.9/site-packages/torch/ UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1640811803361/work/aten/src/ATen/native/TensorShape.cpp:2157.)
      return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
    Epoch: [0] [ 0/132] eta: 0:03:01.012762 lr: 0.000173 loss: 1.8379 (1.8379) bbox_regression: 0.6653 (0.6653) classification: 1.1726 (1.1726) time: 1.3713 data: 0.3985 max mem: 8638
    Traceback (most recent call last):
      File "/home/lhaozz/hand_retinanet/", line 260, in <module>
        main(args)
      File "/home/lhaozz/hand_retinanet/", line 141, in main
        mean_loss, lr = utils.train_one_epoch(model, optimizer, data_loader,
      File "/home/lhaozz/hand_retinanet/train_utils/", line 33, in train_one_epoch
        loss_dict = model(images, targets)
      File "/home/lb511/anaconda3/envs/lhaozz/lib/python3.9/site-packages/torch/nn/modules/", line 1102, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/lb511/anaconda3/envs/lhaozz/lib/python3.9/site-packages/torch/nn/parallel/", line 873, in forward
        if torch.is_grad_enabled() and self.reducer._rebuild_buckets():

    RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). Parameter indices which did not receive grad for rank 0: 12 13 14 15 25 26 27 28 38 39 40 41 51 52 53 54 67 68 69 70 80 81 82 83 93 94 95 96 106 107 108 109 119 120 121 122 132 133 134 135 148 149 150 151 161 162 163 164 174 175 176 177 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

    A small side question about mixed-precision training: I am using two 3080 GPUs and an R9 5950X 16-core CPU. Is it generally fine to just leave it turned on?
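
    The parameter indices listed in the DistributedDataParallel error above can be grouped into consecutive runs to see their pattern. A minimal sketch follows; the helper name is hypothetical, and the index string is copied from the error message:

```python
def consecutive_runs(indices):
    """Collapse a sorted list of ints into (first, last) tuples of consecutive runs."""
    runs = []
    for idx in indices:
        if runs and idx == runs[-1][1] + 1:
            runs[-1][1] = idx          # extend the current run
        else:
            runs.append([idx, idx])    # start a new run
    return [tuple(run) for run in runs]


# Indices copied from "Parameter indices which did not receive grad for rank 0".
reported = (
    "12 13 14 15 25 26 27 28 38 39 40 41 51 52 53 54 67 68 69 70 80 81 82 83 "
    "93 94 95 96 106 107 108 109 119 120 121 122 132 133 134 135 148 149 150 151 "
    "161 162 163 164 174 175 176 177"
)
runs = consecutive_runs([int(tok) for tok in reported.split()])
print(len(runs), runs[:2])  # 13 [(12, 15), (25, 28)]
```

    Thirteen runs of four parameters each would be consistent with one repeated sub-module whose parameters never contribute to the loss. Whether that is the added CBAM block is an assumption that would need to be checked against the model's parameter names. The error message itself names the two documented options: passing find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, or setting TORCH_DISTRIBUTED_DEBUG to INFO or DETAIL to get the parameter names.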

  • CUDA version

    System information

    • Have I written custom code: No
    • OS Platform(e.g., window10 or Linux Ubuntu 16.04): Ubuntu 16.04.6 LTS
    • Python version: 3.7.10
    • Deep learning framework and version(e.g., Tensorflow2.1 or Pytorch1.3): Pytorch 1.6.0
    • Use GPU or not: No
    • CUDA/cuDNN version(if you use GPU): CUDA Version 10.1.243
    • The network you trained(e.g., Resnet34 network): pytorch_object_detection/faster_rcnn/

    Describe the current behavior May I ask what version of CUDA is needed for this project? Will CUDA 10.1 not work?

    Error info / logs AssertionError: The NVIDIA driver on your system is too old (found version 10010). Please update your GPU driver by downloading and installing a new version from the URL: Alternatively, go to: to install a PyTorch version that has been compiled with your version of the CUDA driver.

  • mismatch for inception3a.branch3.1.conv.weight: copying a param with shape torch.Size([32, 16, 3, 3]) from checkpoint, the shape in current model is torch.Size([32, 16, 5, 5]).

    Hello, loading the pretrained googlenet weights produces the error in the title. How can I solve it? If I change the kernel size of branch3 to 3, I instead get: RuntimeError: Sizes of tensors must match except in dimension 2. Got 28 and 30 (The offending index is 2)
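
    The second error (Got 28 and 30) is what the standard convolution output-size formula predicts if the kernel size is changed from 5 to 3 while the padding stays at 2. A minimal sketch follows, assuming a 28x28 input feature map, stride 1, and a branch originally built with kernel_size=5 and padding=2; these values are inferred from the error message and are not confirmed by the issue:

```python
def conv_out_size(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution (no dilation): floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


print(conv_out_size(28, kernel=5, padding=2))  # 28: 5x5 kernel with padding 2 keeps the size
print(conv_out_size(28, kernel=3, padding=2))  # 30: 3x3 kernel with the old padding grows it
print(conv_out_size(28, kernel=3, padding=1))  # 28: 3x3 kernel needs padding 1 to keep the size
```

    Under those assumptions, changing the padding to 1 together with the kernel size would keep the branch output at 28, so it can be concatenated with the other branches.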

  • Runtime error

    System information

    • Have I written custom code:
    • OS Platform(e.g., window10 or Linux Ubuntu 16.04):
    • Python version:
    • Deep learning framework and version(e.g., Tensorflow2.1 or Pytorch1.3):
    • Use GPU or not:
    • CUDA/cuDNN version(if you use GPU):
    • The network you trained(e.g., Resnet34 network):

    Describe the current behavior

    Error info / logs

  • RuntimeError: Trying to pass too many CPU scalars to CUDA kernel!

    Thanks for sharing your code. When I run 'python ', I hit the following problem. How can I solve this error?

    Traceback (most recent call last):
      File "", line 157, in <module>
        main()
      File "", line 91, in main
        train_loss=train_loss, train_lr=learning_rate)
      File "/home/dl/桌面/faster_rcnn/train_utils/", line 33, in train_one_epoch
        loss_dict = model(images, targets)
      File "/home/dl/anaconda3/lib/python3.7/site-packages/torch/nn/modules/", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/dl/桌面/faster_rcnn/network_files/", line 87, in forward
        proposals, proposal_losses = self.rpn(images, features, targets)
      File "/home/dl/anaconda3/lib/python3.7/site-packages/torch/nn/modules/", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/dl/桌面/faster_rcnn/network_files/", line 615, in forward
        labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets)
      File "/home/dl/桌面/faster_rcnn/network_files/", line 410, in assign_targets_to_anchors
        matched_idxs = self.proposal_matcher(match_quality_matrix)
      File "/home/dl/桌面/faster_rcnn/network_files/", line 347, in __call__
        matches[below_low_threshold] = torch.tensor(self.BELOW_LOW_THRESHOLD) # -1
    RuntimeError: Trying to pass too many CPU scalars to CUDA kernel!

  • Running the YOLOv3-spp version with pytorch1.4: out of memory after removing the mixed-precision code

    pytorch1.4 does not support mixed precision (there is no from torch.cuda import amp), so I modified the mixed-precision code as follows:

    1. In train_eval_utils.py, comment out lines 29 and 30:
        # enable_amp = True if "cuda" in device.type else False
        # scaler = amp.GradScaler(enabled=enable_amp)
    2. Comment out line 61:
        # with amp.autocast(enabled=enable_amp):
    3. Change the code at line 85 as follows (removing the scaler parts):
        # backward
        # scaler.scale(losses).backward()
        losses.backward()
        # optimize
        if ni % accumulate == 0:
            # scaler.step(optimizer)
            # scaler.update()
            # optimizer.zero_grad()
            optimizer.step()
            optimizer.zero_grad()

    Error: RuntimeError: CUDA error: out of memory
    Workaround: change pin_memory to False.


  • Multi-GPU training error: subprocess.CalledProcessError: Command '['/opt/anaconda3/envs/py37/bin/python', '-u', '']' returned non-zero exit status 1.

    (py37) xiamingyang@AI-02:~/PyTorch/PyTorch_Object_detection/faster_rcnn$ CUDA_VISIBLE_DEVICES=4,6 python -m torch.distributed.launch --nproc_per_node=2 --use_env

    Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

    | distributed init (rank 1): env://
    | distributed init (rank 0): env://
    Traceback (most recent call last):
      File "", line 249, in <module>
        main(args)
      File "", line 40, in main
        init_distributed_mode(args)
      File "/Ai-Data/home/users/xiamingyang/PyTorch/PyTorch_Object_detection/faster_rcnn/train_utils/", line 320, in init_distributed_mode
        world_size=args.world_size, rank=args.rank)
      File "/opt/anaconda3/envs/py37/lib/python3.7/site-packages/torch/distributed/", line 397, in init_process_group
        store, rank, world_size = next(rendezvous_iterator)
      File "/opt/anaconda3/envs/py37/lib/python3.7/site-packages/torch/distributed/", line 168, in _env_rendezvous_handler
        store = TCPStore(master_addr, master_port, world_size, start_daemon)
    RuntimeError: Address already in use
    Traceback (most recent call last):
      File "/opt/anaconda3/envs/py37/lib/python3.7/", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/opt/anaconda3/envs/py37/lib/python3.7/", line 85, in _run_code
        exec(code, run_globals)
      File "/opt/anaconda3/envs/py37/lib/python3.7/site-packages/torch/distributed/", line 263, in <module>
        main()
      File "/opt/anaconda3/envs/py37/lib/python3.7/site-packages/torch/distributed/", line 259, in main
        cmd=cmd)
    subprocess.CalledProcessError: Command '['/opt/anaconda3/envs/py37/bin/python', '-u', '']' returned non-zero exit status 1.

    opened by Taylor-X76 6
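
    "Address already in use" from TCPStore usually means the rendezvous port is already taken, for example by another training job on the same machine. A minimal sketch for picking a free port with the standard library follows; the helper name is hypothetical, and the port could in principle be taken again between this check and the launch:

```python
import socket


def find_free_port():
    """Ask the OS for an unused TCP port by binding to port 0, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


port = find_free_port()
print(port)
```

    The resulting number can then be passed to the launcher through its --master_port option, as in python -m torch.distributed.launch --nproc_per_node=2 --master_port=<port> --use_env followed by the training script.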
  • efficientnet pytorch fails to run

    The error is:

    Traceback (most recent call last):
      File "C:\Users\dell\Desktop\deep-learning-for-image-processing-master\pytorch_classification\Test9_efficientNet\", line 145, in <module>
        main(opt)
      File "C:\Users\dell\Desktop\deep-learning-for-image-processing-master\pytorch_classification\Test9_efficientNet\", line 76, in main
        if args.weights != "":
    AttributeError: 'Namespace' object has no attribute 'weights'
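
    This AttributeError means the argument parser that produced the Namespace never defined a weights option. A minimal sketch of the pattern follows; the option name comes from the traceback (args.weights), while the default value, the help text, and the build_parser helper are assumptions:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Without this add_argument call, accessing args.weights raises AttributeError.
    parser.add_argument("--weights", type=str, default="",
                        help="path to initial weights; empty string means none")
    return parser


args = build_parser().parse_args([])  # empty list: parse no command-line options
if args.weights != "":
    print("loading weights from", args.weights)
else:
    print("no weights given")  # this branch runs, because the default is ""
```

    If the script's parser already defines the option under a different name, aligning the two names would be the smaller change.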

  • FileNotFoundError even though the files exist

    System information

    • Have I written custom code: No
    • OS Platform: window10
    • Python version: 3.8
    • Deep learning framework and version: PyTorch 1.7.1
    • Use GPU or not: use GPU
    • The network you trained: Faster R-CNN

    Describe the current behavior

    I am using a custom Pascal VOC dataset, but my files have string names rather than integer names. With the string-named files I get FileNotFoundError, but when I rename the files in JPEGImages and the 'filename' entries in the annotation files to integers, the code runs smoothly. What should I change in my program?

  • HRNet reports an error at the end of training


    I trained HRNet from scratch, and after 209 epochs it suddenly reported the following error:

    Epoch: [209] Total time: 1:06:31 (0.8526 s / it)
    Test: [ 0/199] eta: 0:18:38 model_time: 0.5503 (0.5503) time: 5.6187 data: 3.6315 max mem: 5210
    Test: [100/199] eta: 0:00:53 model_time: 0.2205 (0.2248) time: 0.3994 data: 0.0001 max mem: 5210
    Test: [198/199] eta: 0:00:00 model_time: 0.1523 (0.2229) time: 0.3832 data: 0.0001 max mem: 5210
    Test: Total time: 0:01:33 (0.4706 s / it)
    Averaged stats: model_time: 0.1523 (0.2229)
    Loading and preparing results...
    DONE (t=0.28s)
    creating index...
    index created!
    Running per image evaluation...
    Evaluate annotation type keypoints
    DONE (t=2.34s).
    Accumulating evaluation results...
    DONE (t=0.07s).
    IoU metric: keypoints
    Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.758
    Average Precision (AP) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.935
    Average Precision (AP) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.835
    Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.729
    Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.804
    Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 20 ] = 0.786
    Average Recall (AR) @[ IoU=0.50 | area= all | maxDets= 20 ] = 0.942
    Average Recall (AR) @[ IoU=0.75 | area= all | maxDets= 20 ] = 0.851
    Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets= 20 ] = 0.753
    Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets= 20 ] = 0.836
    QObject::moveToThread: Current thread (0x561c025907d0) is not the object's thread (0x561c144fefa0).
    Cannot move to target thread (0x561c025907d0)

    qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "/home/ycj/.local/lib/python3.8/site-packages/cv2/qt/plugins" even though it was found. This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.

    Available platform plugins are: xcb, eglfs, linuxfb, minimal, minimalegl, offscreen, vnc, wayland-egl, wayland, wayland-xcomposite-egl, wayland-xcomposite-glx, webgl.

    Aborted (core dumped)

    A screenshot of this output is attached (image). Is this normal? Has training already finished, or is this a bug?

  • Abnormal results when combining the provided fasterRCNN code with CAM for visualization; help wanted



    The experimental results look very strange, as shown here: (image) (image)

  • Error when running on multiple GPUs


    System information

    • Have I written custom code: Yes
    • OS Platform(e.g., window10 or Linux Ubuntu 16.04): Linux
    • Python version: 3.8
    • Deep learning framework and version(e.g., Tensorflow2.1 or Pytorch1.3): pytorch1.7.1
    • Use GPU or not: Use
    • CUDA/cuDNN version(if you use GPU): CUDA11.7
    • The network you trained(e.g., Resnet34 network): faster_res50_rpn

    Describe the current behavior

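    Hello, I am running train_multi_GPU.py on the VG dataset. The dataset is set up to match the output of my_dataset.py and has been converted to tensors, but the step "global_features,loss_dict = model(images, targets)" always fails with "RuntimeError: chunk expects at least a 1-dimensional tensor". I do not know which input fails to meet the requirement. Is there a way to solve this?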


    Error info / logs

    Traceback (most recent call last):
      File "", line 273, in <module>
        main(args)
      File "", line 151, in main
        mean_loss, lr = utils.train_one_epoch(model, optimizer, data_loader,
      File "/home/zzyyxx/Image_Catpion/faster_rcnn/train_utils/", line 46, in train_one_epoch
        global_features,loss_dict = model(images, targets)
      File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/modules/", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/", line 617, in forward
        inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
      File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/", line 643, in scatter
        return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
      File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/", line 36, in scatter_kwargs
        inputs = scatter(inputs, target_gpus, dim) if inputs else []
      File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/", line 28, in scatter
        res = scatter_map(inputs)
      File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/", line 15, in scatter_map
        return list(zip(*map(scatter_map, obj)))
      File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/", line 17, in scatter_map
        return list(map(list, zip(*map(scatter_map, obj))))
      File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/", line 19, in scatter_map
        return list(map(type(obj), zip(*map(scatter_map, obj.items()))))
      File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/", line 15, in scatter_map
        return list(zip(*map(scatter_map, obj)))
      File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/", line 13, in scatter_map
        return Scatter.apply(target_gpus, None, dim, obj)
      File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/", line 92, in forward
        outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
      File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/", line 186, in scatter
        return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
    RuntimeError: chunk expects at least a 1-dimensional tensor
    Traceback (most recent call last):
      File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/", line 87, in _run_code
        exec(code, run_globals)
      File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/distributed/", line 260, in <module>
        main()
      File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/distributed/", line 255, in main
        raise subprocess.CalledProcessError(returncode=process.returncode,
    subprocess.CalledProcessError: Command '['/home/zzyyxx/enter/envs/ZTorch/bin/python', '-u', '']' returned non-zero exit status 1.
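
    The traceback above fails while scattering the nested inputs, and "chunk expects at least a 1-dimensional tensor" suggests that some tensor inside images or targets is 0-dimensional (a scalar tensor). A minimal sketch for locating such an entry follows. It is duck-typed on a dim() method, which torch.Tensor provides, so a stand-in class is used here instead of real tensors; find_zero_dim, FakeTensor, and the sample target keys are all hypothetical:

```python
def find_zero_dim(obj, path="inputs"):
    """Walk nested lists, tuples and dicts; return paths of objects whose dim() is 0."""
    found = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            found += find_zero_dim(value, f"{path}[{key!r}]")
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            found += find_zero_dim(value, f"{path}[{index}]")
    elif callable(getattr(obj, "dim", None)) and obj.dim() == 0:
        found.append(path)
    return found


class FakeTensor:
    """Stand-in for a tensor; only reports its number of dimensions."""

    def __init__(self, ndim):
        self._ndim = ndim

    def dim(self):
        return self._ndim


# Hypothetical target dict in which one entry was stored as a scalar tensor.
targets = [{"boxes": FakeTensor(2), "labels": FakeTensor(1), "image_id": FakeTensor(0)}]
print(find_zero_dim(targets, "targets"))
```

    If a scalar entry is found, creating it as a one-element tensor instead, for example torch.tensor([idx]) rather than torch.tensor(idx), gives it the single dimension that the scatter step requires.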

