PaddleViT: State-of-the-art Visual Transformer and MLP Models for PaddlePaddle 2.0+

Overview

PaddlePaddle Vision Transformers


State-of-the-art Visual Transformer and MLP Models for PaddlePaddle

🤖 PaddlePaddle Visual Transformers (PaddleViT or PPViT) is a collection of vision models beyond convolution. Most of the models are based on Visual Transformers, Visual Attention, and MLPs. PaddleViT also integrates popular layers, utilities, optimizers, schedulers, data augmentations, and training/validation scripts for PaddlePaddle 2.1+. The aim is to reproduce a wide variety of state-of-the-art ViT and MLP models with full training/validation procedures. We are passionate about making cutting-edge CV techniques easier to use for everyone.

🤖 PaddleViT provides models and tools for multiple vision tasks, such as classification, object detection, semantic segmentation, GAN, and more. Each model architecture is defined in a standalone Python module and can be modified to enable quick research experiments. At the same time, pretrained weights can be downloaded and used to fine-tune on your own datasets. PaddleViT also integrates popular tools and modules for customized datasets, data preprocessing, performance metrics, DDP, and more.

🤖 PaddleViT is backed by the popular deep learning framework PaddlePaddle. We also provide tutorials and projects on Paddle AI Studio, so it is intuitive and straightforward for new users to get started.

Quick Links

PaddleViT implements model architectures and tools for multiple vision tasks; go to the following links for detailed information.

We also provide tutorials:

  • Notebooks (coming soon)
  • Online Course (coming soon)

Model architectures

Image Classification (Transformers)

  1. ViT (from Google), released with paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
  2. DeiT (from Facebook and Sorbonne), released with paper Training data-efficient image transformers & distillation through attention, by Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou.
  3. Swin Transformer (from Microsoft), released with paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
  4. VOLO (from Sea AI Lab and NUS), released with paper VOLO: Vision Outlooker for Visual Recognition, by Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan.
  5. CSWin Transformer (from USTC and Microsoft), released with paper CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, by Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo.
  6. CaiT (from Facebook and Sorbonne), released with paper Going deeper with Image Transformers, by Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, Hervé Jégou.
  7. PVTv2 (from NJU/HKU/NJUST/IIAI/SenseTime), released with paper PVTv2: Improved Baselines with Pyramid Vision Transformer, by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
  8. Shuffle Transformer (from Tencent), released with paper Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer, by Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, Bin Fu.
  9. T2T-ViT (from NUS and YITU), released with paper Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, by Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, Shuicheng Yan.

Coming Soon:

  1. CrossViT (from IBM), released with paper CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, by Chun-Fu Chen, Quanfu Fan, Rameswar Panda.
  2. Focal Transformer (from Microsoft), released with paper Focal Self-attention for Local-Global Interactions in Vision Transformers, by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.
  3. HaloNet (from Google), released with paper Scaling Local Self-Attention for Parameter Efficient Visual Backbones, by Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, Jonathon Shlens.

Image Classification (MLPs)

  1. MLP-Mixer (from Google), released with paper MLP-Mixer: An all-MLP Architecture for Vision, by Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy.
  2. ResMLP (from Facebook/Sorbonne/Inria/Valeo), released with paper ResMLP: Feedforward networks for image classification with data-efficient training, by Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou.
  3. gMLP (from Google), released with paper Pay Attention to MLPs, by Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le.

Detection

  1. DETR (from Facebook), released with paper End-to-End Object Detection with Transformers, by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko.

Coming Soon:

  1. Swin Transformer (from Microsoft), released with paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
  2. PVTv2 (from NJU/HKU/NJUST/IIAI/SenseTime), released with paper PVTv2: Improved Baselines with Pyramid Vision Transformer, by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
  3. Focal Transformer (from Microsoft), released with paper Focal Self-attention for Local-Global Interactions in Vision Transformers, by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.
  4. UP-DETR (from Tencent), released with paper UP-DETR: Unsupervised Pre-training for Object Detection with Transformers, by Zhigang Dai, Bolun Cai, Yugeng Lin, Junying Chen.

Semantic Segmentation

Now:

  1. SETR (from Fudan/Oxford/Surrey/Tencent/Facebook), released with paper Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, by Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip H.S. Torr, Li Zhang.
  2. DPT (from Intel), released with paper Vision Transformers for Dense Prediction, by René Ranftl, Alexey Bochkovskiy, Vladlen Koltun.
  3. Swin Transformer (from Microsoft), released with paper Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.
  4. Segmenter (from Inria), released with paper Segmenter: Transformer for Semantic Segmentation, by Robin Strudel, Ricardo Garcia, Ivan Laptev, Cordelia Schmid.
  5. Trans2seg (from HKU/Sensetime/NJU), released with paper Segmenting Transparent Object in the Wild with Transformer, by Enze Xie, Wenjia Wang, Wenhai Wang, Peize Sun, Hang Xu, Ding Liang, Ping Luo.
  6. SegFormer (from HKU/NJU/NVIDIA/Caltech), released with paper SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers, by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.

Coming Soon:

  1. FTN (from Baidu), released with paper Fully Transformer Networks for Semantic Image Segmentation, by Sitong Wu, Tianyi Wu, Fangjian Lin, Shengwei Tian, Guodong Guo.
  2. Shuffle Transformer (from Tencent), released with paper Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer, by Zilong Huang, Youcheng Ben, Guozhong Luo, Pei Cheng, Gang Yu, Bin Fu.
  3. Focal Transformer (from Microsoft), released with paper Focal Self-attention for Local-Global Interactions in Vision Transformers, by Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan and Jianfeng Gao.
  4. CSWin Transformer (from USTC and Microsoft), released with paper CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, by Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo.

GAN

  1. TransGAN (from Seoul National University and NUUA), released with paper TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up, by Yifan Jiang, Shiyu Chang, Zhangyang Wang.
  2. Styleformer (from Facebook and Sorbonne), released with paper Styleformer: Transformer based Generative Adversarial Networks with Style Vector, by Jeeseung Park, Younggeun Kim.

Coming Soon:

  1. ViTGAN (from UCSD/Google), released with paper ViTGAN: Training GANs with Vision Transformers, by Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, Ce Liu.

Installation

Prerequisites

  • Linux/MacOS/Windows
  • Python 3.6/3.7
  • PaddlePaddle 2.1.0+
  • CUDA 10.2+

Installation

  1. Create a conda virtual environment and activate it.

    conda create -n paddlevit python=3.7 -y
    conda activate paddlevit
  2. Install PaddlePaddle following the official instructions, e.g.,

    conda install paddlepaddle-gpu==2.1.2 cudatoolkit=10.2 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/

    Note: please change the PaddlePaddle version and CUDA version according to your environment.

  3. Install dependency packages

    • General dependencies:
      pip install yacs pyyaml
      
    • Packages for Segmentation:
      pip install cityscapesScripts detail
      
    • Packages for GAN:
      pip install lmdb
      
  4. Clone project from GitHub

    git clone https://github.com/BR-IDL/PaddleViT.git 
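
Once the steps above are done, a quick sanity check of the environment can save a failed first run. The following is a minimal sketch using only the Python standard library; the module names `paddle`, `yacs`, and `yaml` are assumed import names for the packages listed above, and `check_environment` is a hypothetical helper, not part of PaddleViT.

```python
import importlib.util
import sys

# Assumed import names for the packages listed above; adjust to your setup.
REQUIRED_MODULES = ["paddle", "yacs", "yaml"]


def check_environment(modules=REQUIRED_MODULES, min_python=(3, 6)):
    """Return a dict with the Python version check and per-module availability."""
    report = {"python_ok": sys.version_info[:2] >= min_python}
    for name in modules:
        # find_spec returns None when a top-level module cannot be found
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'MISSING'}")
```

Any entry reported as MISSING points back to the corresponding install step.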
    

Docker Install

(coming soon)

Results (Ported Weights)

Image Classification

Model Acc@1 Acc@5 Image Size Crop_pct Interpolation Link
vit_base_patch16_224 84.58 97.30 224 0.875 bicubic google/baidu(qv4n)
vit_base_patch16_384 85.99 98.00 384 1.0 bicubic google/baidu(wsum)
vit_large_patch16_224 85.81 97.82 224 0.875 bicubic google/baidu(1bgk)
swin_base_patch4_window7_224 85.27 97.56 224 0.9 bicubic google/baidu(wyck)
swin_base_patch4_window12_384 86.43 98.07 384 1.0 bicubic google/baidu(4a95)
swin_large_patch4_window12_384 87.14 98.23 384 1.0 bicubic google/baidu(j71u)
pvtv2_b0 70.47 90.16 224 0.875 bicubic google/baidu(dxgb)
pvtv2_b1 78.70 94.49 224 0.875 bicubic google/baidu(2e5m)
pvtv2_b2 82.02 95.99 224 0.875 bicubic google/baidu(are2)
pvtv2_b3 83.14 96.47 224 0.875 bicubic google/baidu(nc21)
pvtv2_b4 83.61 96.69 224 0.875 bicubic google/baidu(tthf)
pvtv2_b5 83.77 96.61 224 0.875 bicubic google/baidu(9v6n)
pvtv2_b2_linear 82.06 96.04 224 0.875 bicubic google/baidu(a4c8)
mlp_mixer_b16_224 76.60 92.23 224 0.875 bicubic google/baidu(xh8x)
mlp_mixer_l16_224 72.06 87.67 224 0.875 bicubic google/baidu(8q7r)
resmlp_24_224 79.38 94.55 224 0.875 bicubic google/baidu(jdcx)
resmlp_36_224 79.77 94.89 224 0.875 bicubic google/baidu(33w3)
resmlp_big_24_224 81.04 95.02 224 0.875 bicubic google/baidu(r9kb)
resmlp_big_24_distilled_224 83.59 96.65 224 0.875 bicubic google/baidu(4jk5)
gmlp_s16_224 79.64 94.63 224 0.875 bicubic google/baidu(bcth)
volo_d5_224_86.10 86.08 97.58 224 1.0 bicubic google/baidu(td49)
volo_d5_512_87.07 87.05 97.97 512 1.15 bicubic google/baidu(irik)
cait_xxs24_224 78.38 94.32 224 1.0 bicubic google/baidu(j9m8)
cait_s24_384 85.05 97.34 384 1.0 bicubic google/baidu(qb86)
cait_m48_448 86.49 97.75 448 1.0 bicubic google/baidu(imk5)
deit_base_distilled_patch16_224 83.32 96.49 224 0.875 bicubic google/baidu(5f2g)
deit_base_distilled_patch16_384 85.43 97.33 384 1.0 bicubic google/baidu(qgj2)
shuffle_vit_tiny_patch4_window7 82.39 96.05 224 0.875 bicubic google/baidu(8a1i)
shuffle_vit_small_patch4_window7 83.53 96.57 224 0.875 bicubic google/baidu(xwh3)
shuffle_vit_base_patch4_window7 83.95 96.91 224 0.875 bicubic google/baidu(1gsr)
cswin_tiny_224 82.81 96.30 224 0.9 bicubic google/baidu(4q3h)
cswin_small_224 83.60 96.58 224 0.9 bicubic google/baidu(gt1a)
cswin_base_224 84.23 96.91 224 0.9 bicubic google/baidu(wj8p)
cswin_large_224 86.52 97.99 224 0.9 bicubic google/baidu(b5fs)
cswin_base_384 85.51 97.48 384 1.0 bicubic google/baidu(rkf5)
cswin_large_384 87.49 98.35 384 1.0 bicubic google/baidu(6235)
t2t_vit_7 71.68 90.89 224 0.9 bicubic google/baidu(1hpa)
t2t_vit_10 75.15 92.80 224 0.9 bicubic google/baidu(ixug)
t2t_vit_12 76.48 93.49 224 0.9 bicubic google/baidu(qpbb)
t2t_vit_14 81.50 95.67 224 0.9 bicubic google/baidu(c2u8)
t2t_vit_19 81.93 95.74 224 0.9 bicubic google/baidu(4in3)
t2t_vit_24 82.28 95.89 224 0.9 bicubic google/baidu(4in3)
t2t_vit_t_14 81.69 95.85 224 0.9 bicubic google/baidu(4in3)
t2t_vit_t_19 82.44 96.08 224 0.9 bicubic google/baidu(mier)
t2t_vit_t_24 82.55 96.07 224 0.9 bicubic google/baidu(6vxc)
t2t_vit_14_384 83.34 96.50 384 1.0 bicubic google/baidu(r685)
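
The Crop_pct column is easier to read with a worked example. The sketch below assumes the common evaluation convention (used, for example, by timm) in which the image is first resized so that its shorter side is floor(image_size / crop_pct) and then center-cropped to image_size. Whether every PaddleViT config follows exactly this rule is an assumption, and `eval_resize_and_crop` is a hypothetical helper.

```python
import math


def eval_resize_and_crop(image_size, crop_pct):
    """Return (resize_size, crop_size) for center-crop evaluation.

    Assumes the shorter side is resized to floor(image_size / crop_pct)
    before a center crop of image_size is taken.
    """
    resize_size = int(math.floor(image_size / crop_pct))
    return resize_size, image_size


for size, pct in [(224, 0.875), (224, 0.9), (384, 1.0)]:
    print(size, pct, eval_resize_and_crop(size, pct))
# 224 0.875 (256, 224)
# 224 0.9 (248, 224)
# 384 1.0 (384, 384)
```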

Object Detection

Model Backbone box_mAP Model_checkpoint
DETR ResNet50 42.0 google/baidu(n5gk)
DETR ResNet101 43.5 google/baidu(bxz2)

Semantic Segmentation

Pascal Context

Model Backbone Batch_size mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
SETR_Naive ViT_Large 16 52.06 52.57 google/baidu(owoj) google/baidu(xdb8) config
SETR_PUP ViT_Large 16 53.90 54.53 google/baidu(owoj) google/baidu(6sji) config
SETR_MLA ViT_Large 8 54.39 55.16 google/baidu(owoj) google/baidu(wora) config
SETR_MLA ViT_Large 16 55.01 55.87 google/baidu(owoj) google/baidu(76h2) config

Cityscapes

Model Backbone Batch_size Iteration mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
SETR_Naive ViT_Large 8 40k 76.71 79.03 google/baidu(owoj) google/baidu(g7ro) config
SETR_Naive ViT_Large 8 80k 77.31 79.43 google/baidu(owoj) google/baidu(wn6q) config
SETR_PUP ViT_Large 8 40k 77.92 79.63 google/baidu(owoj) google/baidu(zmoi) config
SETR_PUP ViT_Large 8 80k 78.81 80.43 google/baidu(owoj) baidu(f793) config
SETR_MLA ViT_Large 8 40k 76.70 78.96 google/baidu(owoj) baidu(qaiw) config
SETR_MLA ViT_Large 8 80k 77.26 79.27 google/baidu(owoj) baidu(6bgj) config

ADE20K

Model Backbone Batch_size Iteration mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
SETR_Naive ViT_Large 16 160k 47.57 48.12 google/baidu(owoj) baidu(lugq) config
SETR_PUP ViT_Large 16 160k 49.12 49.51 google/baidu(owoj) baidu(udgs) config
SETR_MLA ViT_Large 8 160k 47.80 49.34 google/baidu(owoj) baidu(mrrv) config
DPT ViT_Large 16 160k 47.21 - google/baidu(owoj) baidu(ts7h) config
Segmenter ViT_Tiny 16 160k 38.45 - TODO baidu(1k97) config
Segmenter ViT_Small 16 160k 46.07 - TODO baidu(i8nv) config
Segmenter ViT_Base 16 160k 49.08 - TODO baidu(hxrl) config
Segmenter ViT_Large 16 160k 51.82 - TODO baidu(wdz6) config
Segmenter_Linear DeiT_Base 16 160k 47.34 - TODO baidu(5dpv) config
Segmenter DeiT_Base 16 160k 49.27 - TODO baidu(3kim) config
Segformer MIT-B0 16 160k 38.37 - TODO baidu(ges9) config
Segformer MIT-B1 16 160k 42.20 - TODO baidu(t4n4) config
Segformer MIT-B2 16 160k 46.38 - TODO baidu(h5ar) config
Segformer MIT-B3 16 160k 48.35 - TODO baidu(g9n4) config
Segformer MIT-B4 16 160k 49.01 - TODO baidu(e4xw) config
Segformer MIT-B5 16 160k 49.73 - TODO baidu(uczo) config
UperNet Swin_Tiny 16 160k 44.90 45.37 - baidu(lkhg) config
UperNet Swin_Small 16 160k 47.88 48.90 - baidu(vvy1) config
UperNet Swin_Base 16 160k 48.59 49.04 - baidu(y040) config

Trans10kV2

Model Backbone Batch_size Iteration mIoU (ss) mIoU (ms+flip) Backbone_checkpoint Model_checkpoint ConfigFile
Trans2seg_Medium Resnet50c 16 80k 72.25 - google/baidu(4dd5) google/baidu(qcb0) config

GAN

Model FID Image Size Crop_pct Interpolation Model_checkpoint
styleformer_cifar10 2.73 32 1.0 lanczos google/baidu(ztky)
styleformer_stl10 15.65 48 1.0 lanczos google/baidu(i973)
styleformer_celeba 3.32 64 1.0 lanczos google/baidu(fh5s)
styleformer_lsun 9.68 128 1.0 lanczos google/baidu(158t)

*The results are evaluated on the CIFAR-10, STL-10, CelebA, and LSUN Church datasets, using the fid50k_full metric.

Quick Demo for Image Classification

To use the model with pretrained weights, go to the specific subfolder, e.g., /image_classification/ViT/, then download the .pdparams weight file and change the related file paths in the following Python scripts. The model config files are located in ./configs/.

Assume the downloaded weight file is stored in ./vit_base_patch16_224.pdparams, to use the vit_base_patch16_224 model in python:

import paddle
from config import get_config
from visual_transformer import build_vit as build_model
# config files in ./configs/
config = get_config('./configs/vit_base_patch16_224.yaml')
# build model
model = build_model(config)
# load pretrained weights (paddle.load needs the full file name, including .pdparams)
model_state_dict = paddle.load('./vit_base_patch16_224.pdparams')
model.set_dict(model_state_dict)
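
The weight file name is easy to get wrong: a user report further down this page notes that loading the file directly with paddle.load needs the full file name, while the -pretrained command-line argument takes the path without the .pdparams suffix. A minimal sketch of two helpers that keep the two forms apart is shown below; both function names are hypothetical and not part of PaddleViT.

```python
WEIGHT_SUFFIX = ".pdparams"


def weight_file_for_load(path):
    """Full file name (suffix included), for loading the file directly."""
    path = str(path)
    return path if path.endswith(WEIGHT_SUFFIX) else path + WEIGHT_SUFFIX


def weight_prefix_for_cli(path):
    """Path without the suffix, as described for the -pretrained argument."""
    path = str(path)
    return path[: -len(WEIGHT_SUFFIX)] if path.endswith(WEIGHT_SUFFIX) else path


print(weight_file_for_load("./vit_base_patch16_224"))
# ./vit_base_patch16_224.pdparams
print(weight_prefix_for_cli("./vit_base_patch16_224.pdparams"))
# ./vit_base_patch16_224
```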

🤖 See the README file in each model folder for detailed usages.

Evaluation

To evaluate ViT model performance on ImageNet2012 with a single GPU, run the following script using command line:

sh run_eval.sh

or

CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
    -cfg='./configs/vit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -batch_size=16 \
    -data_path='/dataset/imagenet' \
    -eval \
    -pretrained='./vit_base_patch16_224'

Run evaluation using multi-GPUs:

sh run_eval_multi.sh

or

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
    -cfg='./configs/vit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -batch_size=16 \
    -data_path='/dataset/imagenet' \
    -eval \
    -pretrained='./vit_base_patch16_224'
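
The results tables report Acc@1 and Acc@5. As an illustration of what these numbers mean (not the evaluation code used by the scripts above), here is a minimal pure-Python sketch of top-k accuracy; `topk_accuracy` and the toy scores are made up for the example.

```python
def topk_accuracy(scores, labels, k=1):
    """Percentage of samples whose true label is among the k highest scores."""
    correct = 0
    for row, label in zip(scores, labels):
        # indices of the k largest scores in this row
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        correct += int(label in topk)
    return 100.0 * correct / len(labels)


scores = [
    [0.1, 0.7, 0.2],    # highest score: class 1
    [0.5, 0.3, 0.2],    # highest score: class 0
    [0.2, 0.3, 0.5],    # highest score: class 2
    [0.4, 0.35, 0.25],  # highest score: class 0
]
labels = [1, 1, 2, 2]
print(topk_accuracy(scores, labels, k=1))  # 50.0
print(topk_accuracy(scores, labels, k=2))  # 75.0
```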

Training

To train the ViT model on ImageNet2012 with a single GPU, run the following script using command line:

sh run_train.sh

or

CUDA_VISIBLE_DEVICES=0 \
python main_single_gpu.py \
  -cfg='./configs/vit_base_patch16_224.yaml' \
  -dataset='imagenet2012' \
  -batch_size=32 \
  -data_path='/dataset/imagenet'

Run training using multi-GPUs:

sh run_train_multi.sh

or

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python main_multi_gpu.py \
    -cfg='./configs/vit_base_patch16_224.yaml' \
    -dataset='imagenet2012' \
    -batch_size=16 \
    -data_path='/dataset/imagenet'

Features

  1. State-of-the-art

    • State-of-the-art transformer models for multiple CV tasks
    • State-of-the-art data processing and training methods
    • We keep pushing it forward.
  2. Easy-to-use tools

    • Easy configs for model variants
    • Modular design for utility functions and tools
    • Low barrier for educators and practitioners
    • Unified framework for all the models
  3. Easily customizable to your needs

    • Examples for each model to reproduce the results
    • Model implementations are exposed for you to customize
    • Model files can be used independently for quick experiments
  4. High Performance

    • DDP with a single GPU per process.
    • Mixed-precision support (coming soon)

Contributing

  • We encourage and appreciate your contributions to the PaddleViT project. Please refer to CONTRIBUTING.md for our workflow and work styles.

Licenses

  • This repo is under the Apache-2.0 license.

Contact

  • Please raise an issue on GitHub.
Comments
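  • Command-line arguments and model loading in the README Usage section

    I found that the README.md of every model in PaddleViT has the same two problems (the examples below all use the README.md of the BEiT model in PaddleViT/image_classification/BEiT/):

    • First, the Usage example code omits the .pdparams suffix when loading the pretrained weights, and the comment saying .pdparams is NOT needed is also wrong. It is the value of the -pretrained command-line argument further down that does not need .pdparams; the two were mixed up.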
    from config import get_config
    from beit import build_beit as build_model
    # config files in ./configs/
    config = get_config('./configs/beit_base_patch16_224.yaml')
    # build model
    model = build_model(config)
    # load pretrained weights, .pdparams is NOT needed
    model_state_dict = paddle.load('./beit_base_patch16_224_ft22kto1k')
    model.set_dict(model_state_dict)
    

    The ", .pdparams is NOT needed" part of the comment should be removed, and the .pdparams suffix should be added when loading the model:

    from config import get_config
    from beit import build_beit as build_model
    # config files in ./configs/
    config = get_config('./configs/beit_base_patch16_224.yaml')
    # build model
    model = build_model(config)
    # load pretrained weights
    model_state_dict = paddle.load('./beit_base_patch16_224_ft22kto1k.pdparams')
    model.set_dict(model_state_dict)
    
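    • Second, the command-line argument values in Evaluation and Training are wrapped in extra single quotes. If the commands are run directly in a terminal, a FileNotFoundError occurs: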
    FileNotFoundError: [Errno 2] No such file or directory: "'./configs/beit_base_patch16_224.yaml'"
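
A small check, independent of PaddleViT, of what argparse receives in the two cases. argparse itself neither adds nor strips quotes; whether the single quotes reach the script depends on the shell (a POSIX shell removes them, while some terminals pass them through unchanged). The `-cfg` option below is declared only for this illustration.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-cfg", type=str)

# What the program sees when the shell has stripped the quotes
stripped = parser.parse_args(["-cfg=./configs/beit_base_patch16_224.yaml"])
# What it sees when the quotes are passed through unchanged
literal = parser.parse_args(["-cfg='./configs/beit_base_patch16_224.yaml'"])

print(repr(stripped.cfg))  # './configs/beit_base_patch16_224.yaml'
print(repr(literal.cfg))   # "'./configs/beit_base_patch16_224.yaml'"
```

The second value matches the doubly quoted path in the error message: the single quotes have become part of the file name, and the outer double quotes are only how Python displays such a string.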
    

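    I ran into this error when running the training and evaluation commands in a terminal, and other people in the group chat hit the same problem. As far as I can tell, the single quotes reach argparse as part of the string value (the outer double quotes in the message are just how that string is displayed), so the argument values should be passed without quotes. The single quotes should therefore be removed from the command-line values in Evaluation and Training.

    Single-GPU evaluation: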

    CUDA_VISIBLE_DEVICES=0 \
    python main_single_gpu.py \
        -cfg='./configs/beit_base_patch16_224.yaml' \
        -dataset='imagenet2012' \
        -batch_size=16 \
        -data_path='/dataset/imagenet' \
        -eval \
        -pretrained='./beit_base_patch16_224_ft22kto1k'
    

    I changed it to:

    CUDA_VISIBLE_DEVICES=0 \
    python main_single_gpu.py \
        -cfg=./configs/beit_base_patch16_224.yaml \
        -dataset=imagenet2012 \
        -batch_size=16 \
        -data_path=/path/to/dataset/imagenet/val \
        -eval \
        -pretrained=/path/to/pretrained/model/beit_base_patch16_224_ft22kto1k  # .pdparams is NOT needed
    

    Multi-GPU evaluation:

    CUDA_VISIBLE_DEVICES=0,1,2,3 \
    python main_multi_gpu.py \
        -cfg='./configs/beit_base_patch16_224.yaml' \
        -dataset='imagenet2012' \
        -batch_size=16 \
        -data_path='/dataset/imagenet' \
        -eval \
        -pretrained='./beit_base_patch16_224_ft22kto1k'
    
    

    I changed it to:

    CUDA_VISIBLE_DEVICES=0,1,2,3 \
    python main_multi_gpu.py \
        -cfg=./configs/beit_base_patch16_224.yaml \
        -dataset=imagenet2012 \
        -batch_size=16 \
        -data_path=/path/to/dataset/imagenet/val \
        -eval \
        -pretrained=/path/to/pretrained/model/beit_base_patch16_224_ft22kto1k  # .pdparams is NOT needed
    
    

    Single-GPU training:

    CUDA_VISIBLE_DEVICES=0 \
    python main_single_gpu.py \
      -cfg='./configs/beit_base_patch16_224.yaml' \
      -dataset='imagenet2012' \
      -batch_size=32 \
      -data_path='/dataset/imagenet' \
    

    I changed it to:

    CUDA_VISIBLE_DEVICES=0 \
    python main_single_gpu.py \
      -cfg=./configs/beit_base_patch16_224.yaml \
      -dataset=imagenet2012 \
      -batch_size=32 \
      -data_path=/path/to/dataset/imagenet/train
    

    Multi-GPU training:

    CUDA_VISIBLE_DEVICES=0,1,2,3 \
    python main_multi_gpu.py \
        -cfg='./configs/beit_base_patch16_224.yaml' \
        -dataset='imagenet2012' \
        -batch_size=16 \
        -data_path='/dataset/imagenet' \ 
    
    

    I changed it to:

    CUDA_VISIBLE_DEVICES=0,1,2,3 \
    python main_multi_gpu.py \
        -cfg=./configs/beit_base_patch16_224.yaml \
        -dataset=imagenet2012 \
        -batch_size=16 \
        -data_path=/path/to/dataset/imagenet/train
    

    I will submit a PR shortly; please review it.

    opened by libertatis 8
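  • Don't understand part of the single-machine multi-GPU parallel code

    For multi-GPU training, the multi-GPU training environment needs to be initialized.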

    if nranks > 1:
        # Initialize parallel environment if not done.
        if not paddle.distributed.parallel.parallel_helper._is_parallel_ctx_initialized():
            logger.info("using dist training")
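            # Initialize the parallel training environment in dynamic-graph mode; currently both the NCCL and GLOO contexts are initialized for communication.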
            paddle.distributed.init_parallel_env()
            ddp_model = paddle.DataParallel(model)
        else:
            ddp_model = paddle.DataParallel(model)
    

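    What I don't understand: what does the check if not paddle.distributed.parallel.parallel_helper._is_parallel_ctx_initialized(): mean? Shouldn't paddle.distributed.init_parallel_env() always have to be called?

    Also, what code needs to change to run on multiple machines with multiple GPUs? Thanks.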
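
One way to read the check in the snippet above: it makes the environment setup happen at most once per process, while the model is wrapped in both branches. The sketch below illustrates that "initialize once, then wrap" pattern with plain Python stand-ins. `ParallelContext` and `wrap_for_training` are hypothetical names and do not call Paddle; this is a reading of the quoted code, not an official explanation.

```python
class ParallelContext:
    """Hypothetical stand-in for a framework's process-wide parallel state."""

    def __init__(self):
        self.initialized = False
        self.init_calls = 0

    def init_parallel_env(self):
        # One-time setup (communication backends, rank discovery, ...)
        self.initialized = True
        self.init_calls += 1


def wrap_for_training(model, ctx, nranks):
    """Mirror of the branch above: initialize at most once, then wrap."""
    if nranks <= 1:
        return model
    if not ctx.initialized:
        ctx.init_parallel_env()
    return ("data_parallel", model)


ctx = ParallelContext()
wrap_for_training("model_a", ctx, nranks=4)
wrap_for_training("model_b", ctx, nranks=4)  # second call skips the setup
print(ctx.init_calls)  # 1
```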

    opened by wstchhwp 6
  • small issue

    Describe the bug

    Resuming training fails with: AttributeError: 'Momentum' object has no attribute 'set_dict'

    To Reproduce

    Steps to reproduce the behavior:

    1. Go to 'PaddleViT/object_detection/Swin/'
    2. Run python main_single_gpu.py -resume='./output/train-20211210-09-50-43/Swin-Epoch-45'

    Restoring the model itself succeeds.

    Screenshots

    Traceback (most recent call last):
      File "C:\Program Files\JetBrains\PyCharm Community Edition 2021.2.2\plugins\python-ce\helpers\pydev\pydevd.py", line 1483, in _exec
        pydev_imports.execfile(file, globals, locals) # execute the script
      File "C:\Program Files\JetBrains\PyCharm Community Edition 2021.2.2\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
        exec(compile(contents+"\n", file, 'exec'), glob, loc)
      File "F:/***/pp_swin/main_single_gpu.py", line 400, in <module>
        main()
      File "F:/***/pp_swin/main_single_gpu.py", line 313, in main
        optimizer.set_dict(opt_state)
    AttributeError: 'Momentum' object has no attribute 'set_dict'

    Version (please complete the following information):

    • Paddle Version: [ 2.2.0]
    • Python Version [3.6]
    • GPU/CPU mode [ Gpu]
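
The traceback points at optimizer.set_dict(opt_state). A possible workaround, assuming the installed Paddle version exposes set_state_dict on its optimizers, is to call whichever setter the object actually provides. `load_optimizer_state` and `FakeMomentum` below are hypothetical names used only for illustration; they are not part of PaddleViT or Paddle.

```python
def load_optimizer_state(optimizer, state):
    """Restore optimizer state using whichever setter the object provides."""
    for method_name in ("set_state_dict", "set_dict"):
        setter = getattr(optimizer, method_name, None)
        if callable(setter):
            setter(state)
            return method_name
    raise AttributeError(
        f"{type(optimizer).__name__} has neither set_state_dict nor set_dict"
    )


class FakeMomentum:
    """Stand-in for an optimizer that only exposes set_state_dict."""

    def __init__(self):
        self.state = None

    def set_state_dict(self, state):
        self.state = state


opt = FakeMomentum()
print(load_optimizer_state(opt, {"lr": 0.1}))  # set_state_dict
```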
    opened by ky0107 5
  • Will Mobile-Former of PaddleViT come soon?

    Describe your feature request: Will PaddleViT add Mobile-Former soon and release pretrained weights on ImageNet?

    Describe the reference code or paper: Mobile-Former: Bridging MobileNet and Transformer

    opened by August0424 5
  • Problems installing the requirements.txt dependencies for PaddleViT-Seg

    Describe the bug: the requirements.txt of PaddleViT-Seg lists the following dependencies:

    • cityscapesScripts==2.2.0
    • detail==4.0
    • numpy==1.20.3
    • opencv-python==4.5.2.52
    • scipy==1.6.3
    • yacs==0.1.8

    Problem 1: there is no detail package. My workaround: remove detail.

    Problem 2: opencv-python has no version 4.5.2.52. My workaround: use a different opencv-python version.

    opened by richarddddd198 5
  • cswin_large_224 pretrained 22k model

    Where can I get the cswin_large_224 model pretrained on ImageNet-22k, and the 22k-to-1k label mapping file? I want to fine-tune the cswin_large_224 model on the ImageNet-1k dataset myself.

    enhancement 
    opened by kaierlong 4
  • Error calculating accuracy while training the model

    When I train the model, an error occurs while calculating the accuracy. Could you suggest a way to solve it?

    Error message: Tensor holds the wrong type, it holds int, but desires to be int64_t.

    Types of pred and label: [screenshots not preserved]

    bug 
    opened by VivizSun 4
  • Warnings and an error when evaluating DETR model performance on COCO2017 with a single GPU

    I used the AI Studio GPU version and tried the DETR project, but got a CUDA error. I set up the config as follows:

    !pip install yacs
    import paddle
    from config import get_config
    from detr import build_detr
    config = get_config('./configs/detr_resnet50.yaml')
    model, critertion, postprocessors = build_detr(config)
    model_state_dict = paddle.load('detr_resnet50.pdparams')
    model.set_dict(model_state_dict)

    This is the output:

    W1216 20:21:43.708824 167 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
    W1216 20:21:43.712436 167 device_context.cc:422] device: 0, cuDNN Version: 7.6.
    100%|██████████| 151272/151272 [00:02<00:00, 69020.62it/s]

    When I then ran sh run_eval.sh, I got the following warnings and error:

    W1216 20:22:31.344815 398 init.cc:141] Compiled with WITH_GPU, but no GPU found in runtime.
    /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/framework.py:301: UserWarning: You are using GPU version Paddle, but your CUDA device is not set properly. CPU device will be used by default.
      "You are using GPU version Paddle, but your CUDA device is not set properly. CPU device will be used by default."

    Traceback (most recent call last):
      File "main_single_gpu.py", line 321, in <module>
        main()
      File "main_single_gpu.py", line 174, in main
        paddle.seed(seed)
      File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/framework/random.py", line 46, in seed
        for i in range(core.get_cuda_device_count()):
    OSError: (External) Cuda error(100), no CUDA-capable device is detected. [Advise: Please search for the error code(100) on website( https://docs.nvidia.com/cuda/archive/10.0/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038 ) to get Nvidia's official solution about CUDA Error.] (at /paddle/paddle/fluid/platform/gpu_info.cc:99)

    opened by Atlantisming 4
  • Problems when training the DETR model on a custom COCO-format dataset

    The error below occurs while training DETR, at no fixed point: sometimes within the first few batches, sometimes after several epochs. I am using a single 3090 GPU, and GPU memory usage stays at roughly 50%.

    Training command:

    CUDA_VISIBLE_DEVICES=1 \
    python main_single_gpu.py \
        -cfg='./configs/detr_resnet50.yaml' \
        -dataset='coco' \
        -batch_size=2 \
        -data_path='/dataset/coco'

    Error:

    Traceback (most recent call last):
      File "main_single_gpu.py", line 321, in <module>
        main()
      File "main_single_gpu.py", line 289, in main
        accum_iter=config.TRAIN.ACCUM_ITER)
      File "main_single_gpu.py", line 91, in train
        loss_dict = criterion(outputs, targets)
      File "/home/cuiyuan/anaconda3/envs/paddle2/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 902, in __call__
        outputs = self.forward(*inputs, **kwargs)
      File "/disk/disk1/quyi/PaddleViT/object_detection/DETR/detr.py", line 285, in forward
        indices = self.matcher(outputs_without_aux, targets) # list of index(tensor) pairs
      File "/home/cuiyuan/anaconda3/envs/paddle2/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 902, in __call__
        outputs = self.forward(*inputs, **kwargs)
      File "/disk/disk1/quyi/PaddleViT/object_detection/DETR/matcher.py", line 129, in forward
        idx = linear_sum_assignment(c[i])
      File "/home/cuiyuan/anaconda3/envs/paddle2/lib/python3.7/site-packages/scipy/optimize/_lsap.py", line 100, in linear_sum_assignment
        return _lsap_module.calculate_assignment(cost_matrix)
    ValueError: matrix contains invalid numeric entries
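
The final ValueError is raised by SciPy's linear_sum_assignment, which rejects cost matrices containing invalid entries such as NaN. A possible debugging step (a diagnostic, not a fix) is to scan the cost matrix for non-finite values before the matcher is called. The sketch below works on plain nested lists; `find_invalid_entries` is a hypothetical helper, and converting the matcher's cost tensor to nested lists first (for example with a `tolist()` call) is an assumption about the surrounding code.

```python
import math


def find_invalid_entries(cost_matrix):
    """Return (row, col, value) for every NaN or infinite entry."""
    bad = []
    for r, row in enumerate(cost_matrix):
        for c, value in enumerate(row):
            if not math.isfinite(value):
                bad.append((r, c, value))
    return bad


cost = [[0.2, 1.5], [float("nan"), 0.7]]
print(find_invalid_entries(cost))  # [(1, 0, nan)]
```

If such entries show up, it may be worth checking the inputs that feed the cost computation (for instance degenerate boxes in the custom annotations or a diverging loss); which of these applies here cannot be determined from the report alone.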

    bug 
    opened by ezekielqu 4
  • Suggestion: add an attn_head_size parameter to ViT Transformer Attention

    In the ViT transformer implementation (ViT Transformer Attention), the attn_head_size of the multi-head attention is computed from the embed_dim and num_heads arguments:

    self.attn_head_size = int(embed_dim / self.num_heads)
    

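    I think this implementation has at least two problems:

    • First, there is no check that embed_dim is divisible by num_heads. When embed_dim is not divisible by num_heads, or when num_heads > embed_dim, the transpose_multihead operation fails: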
        def transpose_multihead(self, x):
            new_shape = x.shape[:-1] + [self.num_heads, self.attn_head_size]
            x = x.reshape(new_shape)
            x = x.transpose([0, 2, 1, 3])
            return x
    
    • Second, attn_head_size is constrained by embed_dim and num_heads. When pretraining a model, attn_head_size cannot be set freely, so the code is not flexible enough.
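
The divisibility concern can be made concrete with a few numbers. The sketch below reproduces only the int(embed_dim / num_heads) arithmetic from the snippet above and checks whether num_heads * attn_head_size still equals embed_dim; `head_size` and `heads_cover_embed_dim` are hypothetical helper names, not part of PaddleViT.

```python
def head_size(embed_dim, num_heads):
    """Per-head feature size as computed by int(embed_dim / num_heads)."""
    return int(embed_dim / num_heads)


def heads_cover_embed_dim(embed_dim, num_heads):
    """True when num_heads * head_size is non-zero and equals embed_dim."""
    size = head_size(embed_dim, num_heads)
    return size > 0 and num_heads * size == embed_dim


print(head_size(768, 12), heads_cover_embed_dim(768, 12))  # 64 True
print(head_size(768, 10), heads_cover_embed_dim(768, 10))  # 76 False
print(head_size(8, 12), heads_cover_embed_dim(8, 12))      # 0 False
```

In the second and third cases the heads together no longer span embed_dim, which is the situation an explicit check or a separate attn_head_size argument would guard against.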

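    The way to solve both problems is to add an attn_head_size parameter to the __init__ method of Attention. This does not affect loading of the existing pretrained models, and it allows attn_head_size to be set freely during pretraining. Because attn_head_size is then independent of the input dimension embed_dim, there is also no need to verify that embed_dim is divisible by num_heads.

    Both implementations exist in the current mainstream frameworks.

    The first approach, computing attn_head_size from the embed_dim and num_heads parameters, is used by:

    • PaddlePaddle: https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/nn/layer/transformer.py#L109
    • PyTorch: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/transformer.py
    • transformers: https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/modeling_bert.py#L226

    The second approach, passing attn_head_size in as a parameter, is used by:

    • TensorFlow: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/layers/multi_head_attention.py#L126
    • TensorFlow Addons: https://github.com/tensorflow/addons/blob/master/tensorflow_addons/layers/multihead_attention.py

    I personally strongly recommend the second approach: the API is more flexible to use, and the code reads more smoothly and is more reasonable. For example, consider the definition of all_head_size in the original implementation: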

    self.all_head_size = self.attn_head_size * self.num_heads
    

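    Here all_head_size == embed_dim (whenever embed_dim is divisible by num_heads), so there is no need to define it at all. The variable is used only in __init__:

            self.qkv = nn.Linear(embed_dim,
                                 self.all_head_size*3,  # weights for q, k, and v
                                 weight_attr=w_attr_1,
                                 bias_attr=b_attr_1 if qkv_bias else False)
    

    and in forward:

    new_shape = z.shape[:-2] + [self.all_head_size]
    

    In __init__, the output dimension self.all_head_size*3 of the qkv projection can be changed to embed_dim*3. In forward, the self.all_head_size used by new_shape can be replaced by reading the dimension of the input x at the start of the method, as follows: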

    embed_dim = x.shape[-1]
    ……
    new_shape = z.shape[:-2] + [embed_dim]
    

    That is my objection to defining self.all_head_size in the source code. I also question whether the extra Linear layer on the final output is necessary:

            self.out = nn.Linear(embed_dim,
                                 embed_dim,
                                 weight_attr=w_attr_2,
                                 bias_attr=b_attr_2)
    

    In forward, the line just above the final linear projection of the output carries the comment reshape:

            z = z.reshape(new_shape)
            # reshape
            z = self.out(z)
    

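    The intent is presumably to map the dimension back to the input dimension embed_dim so that the residual connection that follows is straightforward. But since all_head_size == embed_dim, what is there to reshape? I therefore think the linear projection of the output is unnecessary here. However, if we use the second approach and pass attn_head_size as an argument instead of computing it from embed_dim and num_heads, the code above reads much more smoothly and makes much more sense. The second approach only requires changing a few lines of the source code. The implementation is as follows: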

    from typing import Tuple, Union
    
    import paddle
    import paddle.nn as nn
    from paddle import ParamAttr
    from paddle import Tensor
    
    
    class Attention(nn.Layer):
        """ Attention module
    
        Attention module for ViT, here q, k, v are assumed the same.
        The qkv mappings are stored as one single param.
    
        Attributes:
            num_heads: number of heads
            attn_head_size: feature dim of single head
            all_head_size: feature dim of all heads
            qkv: a nn.Linear for q, k, v mapping
            scales: 1 / sqrt(single_head_feature_dim)
            out: projection of multi-head attention
            attn_dropout: dropout for attention
            proj_dropout: final dropout before output
            softmax: softmax op for attention
        """
        def __init__(self,
                     embed_dim: int,
                     num_heads: int,
                     attn_head_size: int,
                     qkv_bias: Union[bool, ParamAttr],
                     dropout: float = 0.,
                     attention_dropout: float = 0.):
            super().__init__()
            """
            Adds an attn_head_size parameter; attn_head_size and num_heads are no longer constrained by embed_dim, which makes the API more flexible to use.
            """
            self.num_heads = num_heads
            # self.attn_head_size = int(embed_dim / self.num_heads)
            self.attn_head_size = attn_head_size
            self.all_head_size = self.attn_head_size * self.num_heads  # Attention Layer's hidden_size
    
            w_attr_1, b_attr_1 = self._init_weights()
            self.qkv = nn.Linear(embed_dim,
                                 self.all_head_size*3,  # weights for q, k, and v
                                 weight_attr=w_attr_1,
                                 bias_attr=b_attr_1 if qkv_bias else False)
    
            self.scales = self.attn_head_size ** -0.5
    
            w_attr_2, b_attr_2 = self._init_weights()
            # self.out = nn.Linear(embed_dim,
            #                      embed_dim,
            #                      weight_attr=w_attr_2,
            #                      bias_attr=b_attr_2)
            # Merge the multi-head attention outputs and map the dimension back to the input dimension embed_dim for the residual connection
            self.out = nn.Linear(self.all_head_size,
                                 embed_dim,
                                 weight_attr=w_attr_2,
                                 bias_attr=b_attr_2)
    
            self.attn_dropout = nn.Dropout(attention_dropout)
            self.proj_dropout = nn.Dropout(dropout)
            self.softmax = nn.Softmax(axis=-1)
    
        def _init_weights(self) -> Tuple[ParamAttr, ParamAttr]:
            weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform())
            bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform())
            return weight_attr, bias_attr
    
        def transpose_multihead(self, x: Tensor) -> Tensor:
            new_shape = x.shape[:-1] + [self.num_heads, self.attn_head_size]
            x = x.reshape(new_shape)
            x = x.transpose([0, 2, 1, 3])
            return x
    
        def forward(self, x: Tensor) -> Tuple[Tensor, Tensor]:
            qkv = self.qkv(x).chunk(3, axis=-1)
            q, k, v = map(self.transpose_multihead, qkv)
    
            attn = paddle.matmul(q, k, transpose_y=True)
            attn = attn * self.scales
            attn = self.softmax(attn)
            attn_weights = attn
            attn = self.attn_dropout(attn)
    
            z = paddle.matmul(attn, v)
            z = z.transpose([0, 2, 1, 3])
            new_shape = z.shape[:-2] + [self.all_head_size]
            z = z.reshape(new_shape)
            # Merge the multi-head attention outputs and map the dimension back to the input dimension embed_dim for the residual connection
            z = self.out(z)
            z = self.proj_dropout(z)
            return z, attn_weights
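
In the class as written, attn_head_size is a required argument, so existing call sites that pass only embed_dim and num_heads would need to be updated. One possible variation, sketched here in plain Python, treats the argument as optional and falls back to embed_dim // num_heads. The helper resolve_head_size is hypothetical and not part of PaddleViT.

```python
from typing import Optional


def resolve_head_size(embed_dim: int,
                      num_heads: int,
                      attn_head_size: Optional[int] = None) -> int:
    """Return the per-head feature size.

    An explicit attn_head_size wins; otherwise fall back to the
    embed_dim // num_heads rule and insist on exact divisibility.
    """
    if num_heads <= 0:
        raise ValueError("num_heads must be positive")
    if attn_head_size is not None:
        if attn_head_size <= 0:
            raise ValueError("attn_head_size must be positive")
        return attn_head_size
    if embed_dim % num_heads != 0:
        raise ValueError(
            f"embed_dim ({embed_dim}) is not divisible by num_heads ({num_heads})")
    return embed_dim // num_heads


default_size = resolve_head_size(96, 8)        # falls back to embed_dim // num_heads
explicit_size = resolve_head_size(96, 8, 128)  # explicit value is used as given
print(default_size, explicit_size)
```

Inside __init__, the result could then be assigned to self.attn_head_size, which keeps the old behaviour for callers that omit the new argument.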
    

    Test:

    def main():
        t = paddle.randn([4, 16, 96])     # [batch_size, num_patches, embed_dim]
        print('input shape = ', t.shape)
    
        model = Attention(embed_dim=96,
                          num_heads=8,
                          attn_head_size=128,
                          qkv_bias=False,
                          dropout=0.,
                          attention_dropout=0.)
    
        print(model)
    
        out, attn_weights = model(t)
        print(out.shape)
        print(attn_weights.shape)
    
        for name, param in model.named_parameters():
            print(f'param name: {name},\tparam shape: {param.shape} ')
    
    
    if __name__ == "__main__":
        main()
    

    Output:

    input shape =  [4, 16, 96]
    Attention(
      (qkv): Linear(in_features=96, out_features=3072, dtype=float32)
      (out): Linear(in_features=1024, out_features=96, dtype=float32)
      (attn_dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
      (proj_dropout): Dropout(p=0.0, axis=None, mode=upscale_in_train)
      (softmax): Softmax(axis=-1)
    )
    [4, 16, 96]
    [4, 8, 16, 16]
    param name: qkv.weight,	param shape: [96, 3072] 
    param name: out.weight,	param shape: [1024, 96] 
    param name: out.bias,	param shape: [96] 
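
The shapes in this output follow directly from the constructor arguments. The sketch below recomputes them without PaddlePaddle, using a hypothetical helper named attention_shapes, for the same values as the test (batch 4, 16 patches, embed_dim 96, 8 heads, attn_head_size 128).

```python
def attention_shapes(batch, tokens, embed_dim, num_heads, attn_head_size):
    """Expected parameter and output shapes for the Attention class above."""
    all_head_size = num_heads * attn_head_size
    return {
        "qkv.weight": [embed_dim, 3 * all_head_size],   # q, k and v stacked
        "out.weight": [all_head_size, embed_dim],       # back to embed_dim
        "out.bias": [embed_dim],
        "output": [batch, tokens, embed_dim],
        "attn_weights": [batch, num_heads, tokens, tokens],
    }


shapes = attention_shapes(batch=4, tokens=16, embed_dim=96,
                          num_heads=8, attn_head_size=128)
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

With these values all_head_size is 8 * 128 = 1024, so the qkv projection has 3 * 1024 = 3072 output features, matching the printed layer summary.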
    

    These are just my own tentative suggestions. I hope the maintainers will evaluate them and consider adopting them.

    bug invalid 
    opened by libertatis 4
  • Where can I download the ResNet pretrained weights used by PaddleViT models?

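    For the Trans2Seg semantic segmentation model in PaddleViT, the config file requires the pretrained weight file resnet50c.pdparams. Could you share how to download it? If model files for the resnet50, resnet101, resnet50c and resnet101c series are available, could you provide them as well?

    Thanks! @xperzy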

    Segmentation 
    opened by xiaoguoguo2018 3
  • ViT pretrained weight with MAE

    Describe your feature request: Any chance we could get the pretrained ViT model with MAE?

    Describe the reference code or paper: Masked Autoencoders Are Scalable Vision Learners

    Describe the possible solution: Please pretrain the ViT with MAE and upload the weight link to the readme file. Thanks in advance!

    Additional context: None.

    opened by jb892 0
  • MobileFormer

    Describe your feature request: Hello, is there a reproduction of MobileFormer for object detection?

    opened by liuhuan1111 0
  • How do I compute the FLOPs of a ViT model?

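    I want to compute the FLOPs of some Transformer-based segmentation models in order to evaluate my experimental models. I am using paddle.flops, but some layers cannot be counted ("Treat it as zero FLOPs"). Does this mean the computed FLOPs value is inaccurate? How can this be resolved?

    From what I have seen, counting custom layers requires defining the counting functions myself and passing them through the custom_ops parameter. So far I have only found the officially provided method for counting paddle.nn.SyncBatchNorm, and I do not know how to design counters for MaxPool2D, LayerNorm, GroupNorm, GELU and Embedding. If these layers are not counted, the final FLOPs value will presumably differ considerably. Could you help me solve this problem?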
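
As a rough sketch of what such counters could look like: paddle.flops accepts a custom_ops dictionary that maps a layer class to a function taking the layer, its inputs and its output, and that function adds its estimate to the layer's total_ops. The factory make_elementwise_counter and the per-element costs below are assumptions for illustration (FLOP conventions differ between tools), and the stand-in objects replace real PaddlePaddle layers and tensors so the snippet runs without the framework.

```python
from types import SimpleNamespace


def make_elementwise_counter(ops_per_element):
    """Build a counter with the (layer, inputs, output) signature.

    ops_per_element is an assumed cost per input element, not an
    official figure.
    """
    def counter(layer, inputs, output):
        num_elements = int(inputs[0].numel())
        layer.total_ops += ops_per_element * num_elements
    return counter


# Assumed costs: 1 op per element for GELU, 2 per element for LayerNorm.
count_gelu = make_elementwise_counter(1)
count_layer_norm = make_elementwise_counter(2)

# Stand-ins for a layer and an input tensor of shape [4, 16, 96].
layer = SimpleNamespace(total_ops=0)
tensor = SimpleNamespace(numel=lambda: 4 * 16 * 96)
count_layer_norm(layer, (tensor,), None)
print("estimated ops:", layer.total_ops)

# With PaddlePaddle installed, the counters could be passed as, for example:
#   paddle.flops(model, [1, 3, 224, 224],
#                custom_ops={paddle.nn.GELU: count_gelu,
#                            paddle.nn.LayerNorm: count_layer_norm})
```

Layers such as LayerNorm and GELU are elementwise and usually contribute far less than the matrix multiplications in attention and MLP blocks, so treating them as zero tends to shift the total only slightly; whether that holds for a particular model should be checked rather than assumed.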

    opened by xiaoguoguo2018 0
Releases(v0.8)
  • v0.8(Jan 11, 2022)

    This release includes the following changes:

    1. Add more classification, detection and segmentation models.
    2. Add more tools and scripts for model training and validation.
    3. Refactor train/val schemes for single and multiple GPUs.
    4. Fix common bugs and issues.
    5. Add more docs and tutorials.
    6. Refine READMEs.
  • v0.1(Aug 30, 2021)

Unofficial implementation of MLP-Mixer: An all-MLP Architecture for Vision

MLP-Mixer: An all-MLP Architecture for Vision This repo contains PyTorch implementation of MLP-Mixer: An all-MLP Architecture for Vision. Usage : impo

Rishikesh (ऋषिकेश) 175 Dec 23, 2022
Implements MLP-Mixer: An all-MLP Architecture for Vision.

MLP-Mixer-CIFAR10 This repository implements MLP-Mixer as proposed in MLP-Mixer: An all-MLP Architecture for Vision. The paper introduces an all MLP (

Sayak Paul 51 Jan 4, 2023
Implementation for paper MLP-Mixer: An all-MLP Architecture for Vision

MLP Mixer Implementation for paper MLP-Mixer: An all-MLP Architecture for Vision. Give us a star if you like this repo. Author: Github: bangoc123 Emai

Ngoc Nguyen Ba 86 Dec 10, 2022
This is an official implementation for "AS-MLP: An Axial Shifted MLP Architecture for Vision".

AS-MLP architecture for Image Classification Model Zoo Image Classification on ImageNet-1K Network Resolution Top-1 (%) Params FLOPs Throughput (image

SVIP Lab 106 Dec 12, 2022
Quickly comparing your image classification models with the state-of-the-art models (such as DenseNet, ResNet, ...)

Image Classification Project Killer in PyTorch This repo is designed for those who want to start their experiments two days before the deadline and ki

null 349 Dec 8, 2022
Implementation of 🦩 Flamingo, state-of-the-art few-shot visual question answering attention net out of Deepmind, in Pytorch

🦩 Flamingo - Pytorch Implementation of Flamingo, state-of-the-art few-shot visual question answering attention net, in Pytorch. It will include the p

Phil Wang 630 Dec 28, 2022
Deep Text Search is an AI-powered multilingual text search and recommendation engine with state-of-the-art transformer-based multilingual text embedding (50+ languages).

Deep Text Search - AI Based Text Search & Recommendation System Deep Text Search is an AI-powered multilingual text search and recommendation engine w

null 19 Sep 29, 2022
Implementation of ETSformer, state of the art time-series Transformer, in Pytorch

ETSformer - Pytorch Implementation of ETSformer, state of the art time-series Transformer, in Pytorch Install $ pip install etsformer-pytorch Usage im

Phil Wang 121 Dec 30, 2022
MLP-Like Vision Permutator for Visual Recognition (PyTorch)

Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition (arxiv) This is a Pytorch implementation of our paper. We present Vision

Qibin (Andrew) Hou 162 Nov 28, 2022
QuickAI is a Python library that makes it extremely easy to experiment with state-of-the-art Machine Learning models.

QuickAI is a Python library that makes it extremely easy to experiment with state-of-the-art Machine Learning models.

null 152 Jan 2, 2023
LaneDet is an open source lane detection toolbox based on PyTorch that aims to pull together a wide variety of state-of-the-art lane detection models

LaneDet is an open source lane detection toolbox based on PyTorch that aims to pull together a wide variety of state-of-the-art lane detection models. Developers can reproduce these SOTA methods and build their own methods.

TuZheng 405 Jan 4, 2023
LWCC: A LightWeight Crowd Counting library for Python that includes several pretrained state-of-the-art models.

LWCC: A LightWeight Crowd Counting library for Python LWCC is a lightweight crowd counting framework for Python. It wraps four state-of-the-art models

Matija Teršek 39 Dec 28, 2022
PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.

PySlowFast PySlowFast is an open source video understanding codebase from FAIR that provides state-of-the-art video classification models with efficie

Meta Research 5.3k Jan 3, 2023
TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale.

TorchMultimodal (Alpha Release) Introduction TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale.

Meta Research 663 Jan 6, 2023
Vision Transformer and MLP-Mixer Architectures

Vision Transformer and MLP-Mixer Architectures Update (2.7.2021): Added the "When Vision Transformers Outperform ResNets..." paper, and SAM (Sharpness

Google Research 6.4k Jan 4, 2023
Official codebase used to develop Vision Transformer, MLP-Mixer, LiT and more.

Big Vision This codebase is designed for training large-scale vision models on Cloud TPU VMs. It is based on Jax/Flax libraries, and uses tf.data and

Google Research 701 Jan 3, 2023
Official PaddlePaddle implementation of Paint Transformer

Paint Transformer: Feed Forward Neural Painting with Stroke Prediction [Paper] [Paddle Implementation] Update We have optimized the serial inference p

TianweiLin 284 Dec 31, 2022