A PyTorch implementation of "From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network" (ICCV2021)

Overview

From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network

The official code of VisionLAN (ICCV 2021). VisionLAN achieves the transformation from two-step to one-step recognition (from Two to One), adaptively considering both visual and linguistic information in a unified structure without the need for an extra language model.

ToDo List

  • Release code
  • Document for Installation
  • Trained models
  • Document for testing and training
  • Evaluation
  • re-organize and clean the parameters

Updates

2021/10/9 We upload the code, datasets, and trained models.
2021/10/9 Fix a bug in cfs_LF_1.py.

Requirements

Python 2.7
Colour
LMDB
Pillow
opencv-python
torch==1.3.0
torchvision==0.4.1
editdistance
matplotlib==2.2.5

Step-by-step install

pip install -r requirements.txt

Data preparing

Training sets

SynthText: We use a cropping tool to crop word images from the original SynthText dataset and convert them into an LMDB dataset.

MJSynth: We use the same tool to convert the images into an LMDB dataset. (Only the training set is used in this implementation.)

We have uploaded these LMDB datasets to RuiKe (password: x6si).
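
For reference, packing cropped word images and their labels into an LMDB dataset usually follows the pattern below. This is a minimal sketch that assumes the common scene-text key layout (image-%09d / label-%09d / num-samples); the conversion tool used here may differ in detail.

    # Minimal sketch of writing (image, label) pairs into an LMDB dataset.
    import lmdb

    def create_lmdb(output_path, image_paths, labels):
        env = lmdb.open(output_path, map_size=1 << 40)  # generous address-space limit
        with env.begin(write=True) as txn:
            for idx, (img_path, label) in enumerate(zip(image_paths, labels), start=1):
                with open(img_path, 'rb') as f:
                    txn.put(b'image-%09d' % idx, f.read())      # raw encoded image bytes
                txn.put(b'label-%09d' % idx, label.encode())
            txn.put(b'num-samples', str(len(labels)).encode())
        env.close()

    # e.g. create_lmdb('datasets/train/SynthText', ['word_0.jpg'], ['hello'])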

Testing sets

The evaluation LMDB datasets can be downloaded from BaiduYun (password: fjyy) or RuiKe.

IIIT5K Words (IIIT5K)
ICDAR 2013 (IC13)
Street View Text (SVT)
ICDAR 2015 (IC15)
Street View Text-Perspective (SVTP)
CUTE80 (CUTE)

The structure of data directory is

datasets
├── evaluation
│   ├── Sumof6benchmarks
│   ├── CUTE
│   ├── IC13
│   ├── IC15
│   ├── IIIT5K
│   ├── SVT
│   └── SVTP
└── train
    ├── MJSynth
    └── SynthText

Evaluation

Results on 6 benchmarks

Methods              IIIT5K  IC13  SVT   IC15  SVTP  CUTE
Paper                95.8    95.7  91.7  83.7  86.0  88.5
This implementation  95.9    96.3  90.7  84.1  85.3  88.9

Download our trained model from BaiduYun (password: e3kj) or RuiKe (password: cxqi), and put it at output/LA/final.pth.

CUDA_VISIBLE_DEVICES=0 python eval.py

Visualize character-wise mask map

Examples of the visualization of mask_c:

   CUDA_VISIBLE_DEVICES=0 python visualize.py

You can modify the 'mask_id' in cfgs/cfgs_visualize to change the mask position for visualization.
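
For reference, 'mask_id' is simply an integer selecting which character position is visualized; a minimal illustration (the exact layout of cfgs/cfgs_visualize.py may differ):

    # cfgs/cfgs_visualize.py (illustrative): pick the character position whose
    # mask map (mask_c) should be rendered, e.g. 0 for the first character.
    mask_id = 0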

Results on OST datasets

The Occlusion Scene Text (OST) dataset is proposed to measure the ability to recognize text with missing visual cues. It is collected from the 6 benchmarks (IC13, IC15, IIIT5K, SVT, SVTP and CT) and contains 4832 images. Each image is manually occluded to a weak or heavy degree: weak means the character is covered with one line, heavy with two lines. For each image, we randomly choose one degree and cover only one character.
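
As an illustration of this occlusion protocol (a sketch of the idea, not the exact script used to build OST), one character box can be covered with one line for the weak degree or two lines for the heavy degree using OpenCV; the box coordinates below are hypothetical:

    # Sketch of the weak/heavy occlusion described above (illustrative only).
    import random
    import cv2

    def occlude_character(image, char_box, degree=None):
        """image: HxWx3 uint8 array; char_box: (x1, y1, x2, y2) of one character."""
        x1, y1, x2, y2 = char_box
        degree = degree or random.choice(['weak', 'heavy'])
        num_lines = 1 if degree == 'weak' else 2  # weak: one line, heavy: two lines
        for _ in range(num_lines):
            # draw a random thick line across the chosen character region
            p1 = (random.randint(x1, x2), random.randint(y1, y2))
            p2 = (random.randint(x1, x2), random.randint(y1, y2))
            cv2.line(image, p1, p2, (0, 0, 0), max(2, (y2 - y1) // 4))
        return image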

Examples of images in the OST dataset:

Methods              Average  Weak  Heavy
Paper                60.3     70.3  50.3
This implementation  60.3     70.8  49.8

The OST LMDB dataset is available on BaiduYun (password: yrrj) or RuiKe (password: vmzr).

Training

4 2080Ti GPUs are used in this implementation.

Language-free (LF) process

Step 1: We first train the vision model without the MLM. (Our trained LF_1 model: BaiduYun (password: avs5) or RuiKe (password: qwzn))

   CUDA_VISIBLE_DEVICES=0,1,2,3 python train_LF_1.py

Step 2: We then fine-tune the MLM together with the vision model. (Our trained LF_2 model: BaiduYun (password: 04jg) or RuiKe (password: v67q))

   CUDA_VISIBLE_DEVICES=0,1,2,3 python train_LF_2.py
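
In this step, the parameters already trained in LF_1 are usually optimized with a smaller learning rate than the newly added MLM parameters. Below is a hedged sketch of such parameter groups; the module names and learning rates are illustrative, not the exact values used in cfgs:

    # Illustrative two-group optimizer for LF_2: a small lr for LF_1-pretrained
    # parts and a larger lr for the newly introduced MLM. Names/values are placeholders.
    import torch
    import torch.nn as nn

    class ToyVisionLAN(nn.Module):
        def __init__(self):
            super(ToyVisionLAN, self).__init__()
            self.backbone = nn.Conv2d(3, 64, 3, padding=1)  # stands in for the LF_1-pretrained vision part
            self.mlm = nn.Linear(64, 64)                    # stands in for the MLM added in LF_2

    model = ToyVisionLAN()
    optimizer = torch.optim.Adam([
        {'params': model.backbone.parameters(), 'lr': 1e-5},  # fine-tune gently
        {'params': model.mlm.parameters(),      'lr': 1e-4},  # train from scratch
    ])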

Language-aware (LA) process

Use the mask map to guide the linguistic learning in the vision model.

   CUDA_VISIBLE_DEVICES=0,1,2,3 python train_LA.py

Tip: In the LA process, a model whose loss (Loss VisionLAN) is higher than 0.3 and whose training accuracy (Accuracy) is lower than 91.0 after the first 200 training iterations usually obtains better final performance.
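
To apply this tip, one can simply check the two printed metrics at the 200-iteration mark and keep only runs that satisfy the condition; a minimal sketch (the function and argument names are placeholders for the values logged by the training script):

    # Illustrative check of the tip above at the 200-iteration mark.
    def run_looks_promising(loss_visionlan, train_accuracy):
        # keep LA runs whose loss is still above 0.3 and accuracy below 91.0
        return loss_visionlan > 0.3 and train_accuracy < 91.0

    # e.g. run_looks_promising(0.35, 90.2) -> True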

Improvement

  1. A mask id randomly generated according to the maximum length cannot adapt well to the occlusion of long text. Evenly sampled mask ids can therefore further improve the performance of the MLM (see the sketch after this list).
  2. In our later experiments, a heavier vision model is able to capture more robust linguistic information.
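
As a hedged illustration of item 1 (one possible reading, not the exact implementation), the contrast is between sampling the mask position at random over the fixed maximum length and distributing it evenly over the real character positions of each label:

    # Contrast between the two mask-id strategies (illustrative only).
    import random

    MAX_LEN = 25  # maximum label length used by the recognizer

    def random_mask_id():
        # baseline: a position sampled over the fixed maximum length,
        # so positions of long words are hit unevenly in practice
        return random.randint(0, MAX_LEN - 1)

    def even_mask_id(label, step):
        # alternative: cycle evenly over the real character positions of this label
        return step % len(label)

    # e.g. for label = "international", even_mask_id visits every character in turn.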

Citation

If you find our method useful for your research, please cite

 @article{wang2021two,
  title={From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network},
  author={Wang, Yuxin and Xie, Hongtao and Fang, Shancheng and Wang, Jing and Zhu, Shenggao and Zhang, Yongdong},
  journal={ICCV},
  year={2021}
}

Feedback

Suggestions and discussions are very welcome. Please contact the authors by sending an email to [email protected]

Comments
  • Questions about `Training Pipeline` and `Parallel Attention`

    Hello, the idea presented in the paper is very inspiring to me!

    I have two questions:

    1. About the training pipeline

    The training procedure described in the paper only contains the two stages language-free and language-aware, similar to LF_2 and LA in the code. However, the code additionally adds LF_1 to specifically pre-train the backbone + VRM part, and in the optimizer of the LF_2 stage a different lr is used for the params already trained in LF_1. Is there a big difference between the two training schemes LF_1 --> LF_2 --> LA and LF_2 --> LA?

    2. About how attention is computed in the parallel decoding stage

    My problem is that I do not quite understand the implementation logic.

    Here I compare it with the attention process in the visual part (PVAM) of SRN:

    (a) Attention process in SRN-PVAM (pseudocode, assuming q, k, v all have dimension d_model):

    # e.g. d_model = 512, max_seq_len = seq_len_q = 25, vocab_size = 37
    key2att = nn.Linear(d_model, d_model)
    query2att = nn.Linear(d_model, d_model)
    embedding = nn.Embedding(max_seq_len, d_model)
    score = nn.Linear(d_model, 1)
    classifier = nn.Linear(d_model, vocab_size)
    
    # input is encoder_out
    reading_order = torch.arange(max_seq_len, dtype=torch.long)
    Q = embedding(reading_order)  # (max_seq_len, d_model)
    K = V = encoder_out  # encoder features: (batch_size, seq_len_k, d_model)
    
    # Computing att_weight here is easy to understand; it is the same as the attention
    # in classic attention models such as ASTER
    ######
    att_q = query2att(Q).unsqueeze(0).unsqueeze(2)  # (1, seq_len_q, 1, d_model)
    att_k = key2att(K).unsqueeze(1)  # (batch_size, 1, seq_len_k, d_model)
    att_weight = score(torch.tanh(att_q + att_k)).squeeze(3)  # (batch_size, seq_len_q, seq_len_k)
    ######
    
    att_weight = F.softmax(att_weight, dim=-1)
    decoder_out = torch.bmm(att_weight, V)  # (batch_size, seq_len_q, d_model)
    logits = classifier(decoder_out)  # (batch_size, seq_len_q, vocab_size)
    

    (b) Attention process in VisionLAN:

    # e.g. d_model = 512, max_seq_len = seq_len_q = 25, vocab_size = 37, seq_len_k = 256 (number of visual feature positions)
    embedding = nn.Embedding(max_seq_len, d_model)
    w0 = nn.Linear(max_seq_len, seq_len_k)
    wv = nn.Linear(d_model, d_model)
    we = nn.Linear(d_model, max_seq_len)
    classifier = nn.Linear(d_model, vocab_size)
    
    # input is encoder_out
    K = V = encoder_out  # encoder features: (batch_size, seq_len_k, d_model)
    reading_order = torch.arange(max_seq_len, dtype=torch.long)
    
    # How should the following computation of att_weight be understood?
    #####
    reading_order = embedding(reading_order)  # (seq_len_q, d_model)
    reading_order = reading_order.unsqueeze(0).expand(K.size(0), -1, -1)  # (batch_size, seq_len_q, d_model)
    t = w0(reading_order.permute(0, 2, 1))  # (batch_size, d_model, seq_len_q) ==> (batch_size, d_model, seq_len_k)
    t = torch.tanh(t.permute(0, 2, 1) + wv(K))  # (batch_size, seq_len_k, d_model)
    att_weight = we(t)  # (batch_size, seq_len_k, d_model) ==> (batch_size, seq_len_k, seq_len_q)
    att_weight = att_weight.permute(0, 2, 1)
    ######
    
    att_weight = F.softmax(att_weight, dim=-1)
    decoder_out = torch.bmm(att_weight, V)  # (batch_size, seq_len_q, d_model)
    logits = classifier(decoder_out)  # (batch_size, seq_len_q, vocab_size)
    

    Looking forward to your reply, thanks!

    opened by YanShuang17 3
  • Questions about Benchmark Test Datasets

    How many images do you use in the benchmark datasets IIIT5K, SVT, IC13, IC15, SVTP, and CUTE80, respectively? I am confused by the 4832 images you mentioned in your paper. Thank you!

    opened by GaoXinJian-USTC 2
  • Inconsistent results on all test datasets.

    Thanks for your code. I found my evaluation results are different from the results posted in README.md. I ran the provided models and eval.py, and I get zero accuracy on all datasets.

    I used the test datasets and models that you provided on Ruike.

    The command I executed:

    python eval.py
    

    The result I get:

    ------Average on 6 benchmarks--------                                                                                                          
                                                                                                                                                   
    test accuracy:                                                                                                                                 
    Accuracy: 0.000000, AR: 0.815073, CER: 0.184927, WER: 1.000000, best_acc: 0.000000                                                             
    ------IIIT--------                                                                                                                             
                                                                                                                                                   
    test accuracy:                                                                                                                                 
    Accuracy: 0.000000, AR: 0.821556, CER: 0.178444, WER: 1.000000, best_acc: 0.000000                                                             
    ------IC13--------                                                                                                                             
                                                                                                                                                   
    test accuracy:                                                                                                                                 
    Accuracy: 0.000000, AR: 0.846731, CER: 0.153269, WER: 1.000000, best_acc: 0.000000                                                             
    ------IC15--------                                                                                                                             
                                                                                                                                                   
    test accuracy:                                                                                                                                 
    Accuracy: 0.000000, AR: 0.790097, CER: 0.209903, WER: 1.000000, best_acc: 0.000000
    ------SVT--------
    
    test accuracy: 
    Accuracy: 0.000000, AR: 0.831195, CER: 0.168805, WER: 1.000000, best_acc: 0.000000
    ------SVTP--------
    
    test accuracy: 
    Accuracy: 0.000000, AR: 0.797383, CER: 0.202617, WER: 1.000000, best_acc: 0.000000
    ------CUTE--------
    

    Hope for your response~

    opened by PkuDavidGuan 2
  • Is that a typo?

    Hello! It is excellent work that inspires me a lot ;D While reading your code, I found something strange (maybe a typo):

    train_LA.py

    line145: text_pre, test_rem, text_mas, att_mask_sub = model(data, label_id, cfgs.global_cfgs['step'])

    Should "test_rem" be modified to "text_rem"?

    opened by JingyeChen 2
  • I can not download the OST dataset?

    Thanks for sharing the code and data of your amazing work. I tried to download the OST datasets from both Baidu and RuiKe, but I could not; a login is required. Could you please upload them to Google Drive or share a link that I can download directly?

    opened by zobeirraisi 1
  • Issue with eval.py

    Thanks for your code. After running "CUDA_VISIBLE_DEVICES=0 python eval.py", the following error is reported:

    Traceback (most recent call last):
      File "eval.py", line 11, in
        import cfgs.cfgs_eval as cfgs
      File "/workspace/VisionLAN-main/cfgs/cfgs_eval.py", line 5, in
        from data.dataset_scene import *
      File "/workspace/VisionLAN-main/data/dataset_scene.py", line 16, in
        from transforms import CVColorJitter, CVDeterioration, CVGeometry
    ModuleNotFoundError: No module named 'transforms'

    It looks like an environment problem, but my environment was set up strictly following requirements.txt. What could be going on? I am eagerly looking forward to your answer.

    opened by swqsyy 1
  • Question about Visualization character-wise mask map

    Thanks for your excellent contributions! I tried to use your pre-trained LF_2 model to visualize the mask map, picking the same image shown in the character-wise mask map visualization (P=0) of the README, but I got a different result from yours. I resized the mask map to the original height and width, added it to the original image, and got the result shown in my first attached image.

    Then I simply resized the mask map to the original height and width and visualized it directly, getting the result shown in my second attached image.

    It seems that the above two visualizations differ from yours; is there a problem somewhere?

    opened by GaoXinJian-USTC 1
  • Training problems about CTCloss and Chinese training.

    Dear Yuxin, sorry to bother you again. While using your code, I ran into two new questions: 1. When I executed python train_LF_1.py, I got a CUDA error in ClassNLLCriterion.cu. 2. When I modified the code for Chinese training, the model did not converge.

    Question 1: CUDA error in ClassNLLCriterion.cu.

    The error info:

    THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=29 error=710 : device-side assert triggered                          
    /pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
    

    solution:

    This looks like a bug (my PyTorch version is 1.7.1). When I modified nclass from 37 to 38, the problem was gone. I think 38 is reasonable: 36 normal chars, 1 , and 1 . I modified these two lines:

    VisionLAN.py
    71:        self.Prediction = Prediction(n_position=256, N_max_character=26, n_class=37) # N_max_character = 1 eos + 25 characters
    72:        self.nclass = 37
    

    Question 2: Chinese training fails.

    I modified the code for Chinese training, but the model does not converge; the loss drops very slowly. Did you modify the training config when training on TRW15? (image attached)

    opened by PkuDavidGuan 1