Official implementation of UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation

Last update: Jan 1, 2023

Overview

UTNet (Accepted at MICCAI 2021)

Introduction

The Transformer architecture has proven successful in a number of natural language processing tasks. However, its applications to medical vision remain largely unexplored. In this study, we present UTNet, a simple yet powerful hybrid Transformer architecture that integrates self-attention into a convolutional neural network to enhance medical image segmentation. UTNet applies self-attention modules in both the encoder and the decoder to capture long-range dependencies at different scales with minimal overhead. To this end, we propose an efficient self-attention mechanism, along with relative position encoding, that reduces the complexity of the self-attention operation significantly, from O(n^2) to approximately O(n). A new self-attention decoder is also proposed to recover fine-grained details from the skip connections in the encoder. Our approach addresses the dilemma that Transformers require huge amounts of data to learn vision inductive bias. Our hybrid layer design allows the initialization of Transformers within convolutional networks without the need for pre-training. We have evaluated UTNet on a multi-label, multi-vendor cardiac magnetic resonance imaging cohort. UTNet demonstrates superior segmentation performance and robustness compared with state-of-the-art approaches, holding the promise to generalize well to other medical image segmentation tasks.
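To give a feel for the O(n^2) versus approximately O(n) claim, the rough sketch below counts the entries of the attention map for a square feature map. It assumes that the efficient attention downsamples keys and values to reduce_size x reduce_size tokens, so the map is n x k instead of n x n. The function name is hypothetical and is not part of this repo.

```python
def attention_matrix_entries(height, width, reduce_size=None):
    """Count entries in the attention map for a height x width feature map.

    With reduce_size=None every token attends to every token (n x n).
    Otherwise keys and values are ASSUMED to be downsampled to
    reduce_size x reduce_size tokens, giving an n x k map.
    """
    n = height * width
    k = n if reduce_size is None else reduce_size * reduce_size
    return n * k


if __name__ == "__main__":
    for side in (16, 32, 64):
        full = attention_matrix_entries(side, side)
        reduced = attention_matrix_entries(side, side, reduce_size=8)
        print(f"{side}x{side}: full={full:,} reduced={reduced:,}")
```

Under this assumption the reduced map grows linearly with the number of pixels, because k stays fixed at reduce_size squared while n grows with the resolution.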

Supported models

UTNet

TransUNet

ResNet50-UTNet

ResNet50-UNet

SwinUNet

To be continued ...

Getting Started

Currently, we only support the M&Ms dataset.

Prerequisites

Python >= 3.6
pytorch = 1.8.1
SimpleITK = 2.0.2
numpy = 1.19.5
einops = 0.3.2

Preprocess

Resample all data to a spacing of 1.2x1.2 mm in the x-y plane. We don't change the spacing along the z-axis, as UTNet is a 2D network. Then put all data into 'dataset/'.
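The preprocessing script itself is not included here, so the following is only a sketch of the size arithmetic behind in-plane resampling. The helper name and the (x, y, z) tuple layout are assumptions made for illustration. The returned size and spacing are the values you would hand to a resampler such as SimpleITK's ResampleImageFilter.

```python
def resampled_xy_size(size, spacing, target_xy=1.2):
    """Return (new_size, new_spacing) for in-plane resampling.

    size and spacing are (x, y, z) tuples. Only x and y are changed;
    the z size and z spacing are passed through untouched.
    """
    new_spacing = (target_xy, target_xy, spacing[2])
    new_size = (
        int(round(size[0] * spacing[0] / target_xy)),
        int(round(size[1] * spacing[1] / target_xy)),
        size[2],
    )
    return new_size, new_spacing


if __name__ == "__main__":
    # Hypothetical volume: 256 x 256 x 10 voxels at 1.5 x 1.5 x 10.0 mm.
    print(resampled_xy_size((256, 256, 10), (1.5, 1.5, 10.0)))
```

When resampling segmentation masks, nearest-neighbour interpolation is the usual choice so that label values are not blended.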

Training

The M&Ms dataset provides data from 4 vendors: vendors A and B are provided for training, while vendors A, B, C and D are used for testing. The '--domain' option controls which vendors are used for training: '--domain A' uses vendor A only, '--domain B' uses vendor B only, and '--domain AB' uses both vendors A and B. Testing always uses all 4 vendors.
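As an illustration of how a '--domain' value maps to vendors, here is a minimal sketch. The function and constants are hypothetical and are not taken from train_deep.py; they only mirror the rule described above.

```python
TRAIN_VENDORS = ("A", "B")            # vendors with training data
TEST_VENDORS = ("A", "B", "C", "D")   # testing always uses all four


def training_vendors(domain):
    """Turn a --domain string such as 'A', 'B' or 'AB' into a vendor list."""
    # dict.fromkeys removes repeated letters while keeping their order.
    vendors = list(dict.fromkeys(domain.strip().upper()))
    unknown = [v for v in vendors if v not in TRAIN_VENDORS]
    if not vendors or unknown:
        raise ValueError(
            f"--domain must combine {TRAIN_VENDORS}, got {domain!r}"
        )
    return vendors


if __name__ == "__main__":
    print(training_vendors("AB"), "-> test on", list(TEST_VENDORS))
```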

UTNet

For the default UTNet setting, train with:

python train_deep.py -m UTNet -u EXP_NAME --data_path YOUR_OWN_PATH --reduce_size 8 --block_list 1234 --num_blocks 1,1,1,1 --domain AB --gpu 0 --aux_loss

Alternatively, use '-m UTNet_encoder' to apply transformer blocks in the encoder only. This setting is more stable than the default setting in some cases.

To optimize UTNet for your own task, there are several hyperparameters to tune:

'--block_list': the resolution levels at which transformer blocks are applied. Each number is a count of downsampling steps, e.g. 3,4 applies transformer blocks to the features after 3 and 4 downsamplings. Applying transformer blocks to higher-resolution feature maps adds much more computation.

'--num_blocks': the number of transformer blocks applied at each level, e.g. block_list='3,4' with num_blocks=2,4 applies 2 transformer blocks at the 3-times-downsampled level and 4 transformer blocks at the 4-times-downsampled level.

'--reduce_size': the size to which features are downsampled for efficient attention. In our experiments, reduce_size 8 and 16 performed similarly, but 16 adds more computation, so we chose 8 as our default setting. 16 might perform better in other applications.

'--aux_loss': applies deep supervision during training. It adds some computation overhead but gives slightly better performance.
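The pairing between '--block_list' and '--num_blocks' can be sketched as below. This helper is hypothetical and is not the parser used in train_deep.py. It accepts both spellings seen in this README ('1234' and '3,4') and checks that each level has exactly one block count.

```python
def parse_block_config(block_list, num_blocks):
    """Map each downsampling level to its number of transformer blocks.

    block_list: digits such as '1234' (commas are tolerated, e.g. '3,4').
    num_blocks: comma-separated counts such as '1,1,1,1'.
    """
    levels = [int(ch) for ch in block_list.replace(",", "").strip()]
    counts = [int(x) for x in num_blocks.split(",") if x.strip()]
    if len(levels) != len(counts):
        raise ValueError(
            f"{len(levels)} levels in block_list but "
            f"{len(counts)} entries in num_blocks"
        )
    return dict(zip(levels, counts))


if __name__ == "__main__":
    print(parse_block_config("1234", "1,1,1,1"))
    print(parse_block_config("3,4", "2,4"))
```

Checking the lengths up front reports a mismatch between the two options as a clear error message.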

Here are some recommended parameter settings:

--block_list 1234 --num_blocks 1,1,1,1

Our default and most efficient setting. Suitable for tasks with limited training data in which most errors occur at the ROI boundary, where high-resolution information is important.

--block_list 1234 --num_blocks 1,1,4,8

Similar to the previous one. The model capacity is larger because more transformer blocks are included, but it needs a larger dataset for training.

--block_list 234 --num_blocks 2,4,8

Suitable for tasks that have complex contexts and in which errors occur inside the ROI. More transformer blocks can help learn higher-level relationships.

Feel free to try other combinations of hyperparameters, such as base_chan, reduce_size and the num_blocks at each level, to trade off capacity against efficiency for your own tasks and datasets.

TransUNet

We borrow code from the original TransUNet repo and fit it into our training framework. If you want to use pre-trained weights, please download them from the original repo. The configuration is not parsed from the command line, so to change the TransUNet configuration you need to edit it inside train_deep.py.

python train_deep.py -m TransUNet -u EXP_NAME --data_path YOUR_OWN_PATH --gpu 0

ResNet50-UTNet

For a fair comparison with TransUNet, we implement the efficient attention proposed in UTNet on a ResNet50 backbone, which basically appends transformer blocks to the specified levels after the ResNet blocks. ResNet50-UTNet performs slightly better than the default UTNet on the M&Ms dataset.

python train_deep.py -m ResNet_UTNet -u EXP_NAME --data_path YOUR_OWN_PATH --reduce_size 8 --block_list 123 --num_blocks 1,1,1 --gpu 0

As with UTNet, this is the most efficient setting, suitable for tasks with limited training data.

--block_list 23 --num_blocks 2,4

Suitable for tasks that have complex contexts and in which errors occur inside the ROI. More transformer blocks can help learn higher-level relationships.

ResNet50-UNet

If you don't use Transformer blocks in ResNet50-UTNet, it is effectively ResNet50-UNet. You can use it as a baseline to measure the performance improvement from the Transformer blocks, for a fair comparison with TransUNet and our UTNet.

python train_deep.py -m ResNet_UTNet -u EXP_NAME --data_path YOUR_OWN_PATH --block_list '' --gpu 0

SwinUNet

Download the pre-trained model from the original repo. Because Swin-Transformer's input size is tied to the window size and is hard to change after pre-training, we adapt our input size to 224. Without pre-training, SwinUNet's performance is very low.

python train_deep.py -m SwinUNet -u EXP_NAME --data_path YOUR_OWN_PATH --crop_size 224
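The sketch below illustrates why the input size is tied to the window size. It assumes the common Swin-Tiny values (patch size 4, window size 7, 4 stages with 2x downsampling between them). These defaults are assumptions and may differ from the checkpoint you download. The function name is hypothetical.

```python
def swin_stage_resolutions(img_size, patch_size=4, window_size=7, num_stages=4):
    """Return the token-grid side length at each stage, or raise ValueError.

    Every stage's grid must split evenly into window_size x window_size
    windows, and the grid is halved between consecutive stages.
    """
    if img_size % patch_size:
        raise ValueError(f"{img_size} is not divisible by patch size {patch_size}")
    res = img_size // patch_size
    resolutions = []
    for stage in range(num_stages):
        if res % window_size:
            raise ValueError(
                f"stage {stage}: grid {res} is not divisible by window {window_size}"
            )
        resolutions.append(res)
        if stage < num_stages - 1:
            if res % 2:
                raise ValueError(f"stage {stage}: grid {res} cannot be halved")
            res //= 2
    return resolutions


if __name__ == "__main__":
    print(swin_stage_resolutions(224))
```

With these assumed defaults a 224 input passes the check, while a 256 input fails at the first stage because a 64-token grid does not divide into windows of 7.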

Citation

If you find this repo helpful, please kindly cite our paper. Thanks!

@inproceedings{gao2021utnet,
  title={UTNet: a hybrid transformer architecture for medical image segmentation},
  author={Gao, Yunhe and Zhou, Mu and Metaxas, Dimitris N},
  booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
  pages={61--71},
  year={2021},
  organization={Springer}
}

Comments

Running error

Hi, I just downloaded the data needed for training, and I ran into a problem when running "train_deep". It says: IndexError: list index out of range. I checked the dataset files and I think I have configured them correctly, so I can't find what causes this. Can you help me? The dataset file and error are shown as follows:

opened by inhaowu 6

data processing

Hi,

I came here from your comments on nnFormer https://github.com/282857341/nnFormer/issues/18 . I found that there is not a big difference between your method and nnFormer. Therefore, I'd like to learn more from your project.

But you did not provide much about the data processing. Does that mean the preprocessing is very simple? You also said that you tested on the ACDC dataset. Could I ask what kind of data processing you used there? I think the method provided there is very complicated. They also mentioned that their work cannot be applied to 2D input.

Best, Wei

opened by xiaoiker 4

Test and Validation code

Hello, I'm reproducing your code and want to see the evaluation metrics. I found that there are no test or validation files. Would you please share them?

opened by zhuangyi1 1

data preprocessing

Dear all, it would be great if you integrated your preprocessing pipeline into the code. Even if you rely on another implementation, this would help me and others run the code with clarity. Otherwise, please provide a detailed explanation of how you do the preprocessing, so that the contribution to the community is complete. Thank you in advance, and nice job!

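opened by Yussef93 1

A question about HD when reproducing your code

Hello! I have recently been working on a cardiac segmentation project as well, and your work has been very helpful to my research, so thank you in advance. While reproducing your code I ran into some questions that I would like to ask you about.

1: On the M&Ms dataset, the training set has 150 patients and the test set has 170 patients (the remaining thirty patients are not made public by the hospital). I see that you use the test set directly as the validation set to obtain the best result on Vendor A. Is that right?

2: I preprocessed the dataset exactly following your method and trained with your default settings, and got the following results: AVG_dice_list : [0.89811725 0.82059008 0.86868217] AVG_ASD_list : [0.93223253 0.79887591 1.10695938] AVG_HD_list : [0.93223253 0.79887591 1.10695938] Are your HD and ASD computed the way your code does it? Why are the values so small, and why is ASD = HD?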

opened by zeng-su123 0

Instantiation Error with Block list 23 and num_blocks 2,4

Hi all, please note that when I use the above-mentioned arguments I get a list index out of range error due to the line num_blocks[-3], although the list length is only 2. Please advise.

opened by Yussef93 0

Dataset

In your paper, the testing dataset contains 200 cases. However, the total number of patient IDs in Testing_A.csv, Testing_B.csv, Testing_C.csv, and Testing_D.csv is 170. So where are the remaining 30 testing patient IDs?

opened by buaaduke 0


CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation

The codes for the work "Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation"

MISSFormer: An Effective Medical Image Segmentation Transformer

Multi-atlas segmentation (MAS) is a promising framework for medical image segmentation

BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search

Copy Paste positive polyp using poisson image blending for medical image segmentation

This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" on Object Detection and Instance Segmentation.

Official and maintained implementation of the paper "OSS-Net: Memory Efficient High Resolution Semantic Segmentation of 3D Medical Data" [BMVC 2021].

The Medical Detection Toolkit contains 2D + 3D implementations of prevalent object detectors such as Mask R-CNN, Retina Net, Retina U-Net, as well as a training and inference framework focused on dealing with medical images.

Build a medical knowledge graph based on Unified Language Medical System (UMLS)

This is the official PyTorch implementation of the paper "TransFG: A Transformer Architecture for Fine-grained Recognition" (Ju He, Jie-Neng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, Changhu Wang, Alan Yuille).

nnFormer: Interleaved Transformer for Volumetric Segmentation Code for paper "nnFormer: Interleaved Transformer for Volumetric Segmentation "

PyTorch implementation of paper: HPNet: Deep Primitive Segmentation Using Hybrid Representations.

Episodic Transformer (E.T.) is a novel attention-based architecture for vision-and-language navigation. E.T. is based on a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.

A PyTorch implementation for V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation

The official implementation of our CVPR 2021 paper - Hybrid Rotation Averaging: A Fast and Robust Rotation Averaging Approach

The official implementation of the Hybrid Self-Attention NEAT algorithm

Implementation of UNET architecture for Image Segmentation.

Code for the SIGIR 2022 paper "Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion"