A PyTorch implementation of "From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network" (ICCV2021)

Overview

From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network

The official code of VisionLAN (ICCV 2021). VisionLAN achieves the transformation from two-step to one-step recognition (from Two to One), adaptively considering both visual and linguistic information in a unified structure without the need for an extra language model.

ToDo List

  • Release code
  • Document for Installation
  • Trained models
  • Document for testing and training
  • Evaluation
  • re-organize and clean the parameters

Updates

2021/10/9 We upload the code, datasets, and trained models.
2021/10/9 Fix a bug in cfs_LF_1.py.

Requirements

Python 2.7
Colour
LMDB
Pillow
opencv-python
torch==1.3.0
torchvision==0.4.1
editdistance
matplotlib==2.2.5

Step-by-step install

pip install -r requirements.txt

Data preparing

Training sets

SynthText: We use a cropping tool to crop word images from the original SynthText dataset and convert them into an LMDB dataset.

MJSynth: We use the same tool to convert the images into an LMDB dataset. (Only the training set is used in this implementation.)

We have uploaded these LMDB datasets to RuiKe (password: x6si).
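
For reference, packing cropped word images and their labels into an LMDB dataset usually follows the pattern below. This is a minimal sketch that assumes the common scene-text key layout (image-%09d / label-%09d / num-samples); the conversion tool used here may differ in detail.

    # Minimal sketch of writing (image, label) pairs into an LMDB dataset.
    import lmdb

    def create_lmdb(output_path, image_paths, labels):
        env = lmdb.open(output_path, map_size=1 << 40)  # generous address-space limit
        with env.begin(write=True) as txn:
            for idx, (img_path, label) in enumerate(zip(image_paths, labels), start=1):
                with open(img_path, 'rb') as f:
                    txn.put(b'image-%09d' % idx, f.read())      # raw encoded image bytes
                txn.put(b'label-%09d' % idx, label.encode())
            txn.put(b'num-samples', str(len(labels)).encode())
        env.close()

    # e.g. create_lmdb('datasets/train/SynthText', ['word_0.jpg'], ['hello'])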

Testing sets

The evaluation LMDB datasets can be downloaded from BaiduYun (password: fjyy) or RuiKe.

IIIT5K Words (IIIT5K)
ICDAR 2013 (IC13)
Street View Text (SVT)
ICDAR 2015 (IC15)
Street View Text-Perspective (SVTP)
CUTE80 (CUTE)

The structure of data directory is

datasets
├── evaluation
│   ├── Sumof6benchmarks
│   ├── CUTE
│   ├── IC13
│   ├── IC15
│   ├── IIIT5K
│   ├── SVT
│   └── SVTP
└── train
    ├── MJSynth
    └── SynthText

Evaluation

Results on 6 benchmarks

Methods              IIIT5K  IC13  SVT   IC15  SVTP  CUTE
Paper                95.8    95.7  91.7  83.7  86.0  88.5
This implementation  95.9    96.3  90.7  84.1  85.3  88.9

Download our trained model from BaiduYun (password: e3kj) or RuiKe (password: cxqi), and put it at output/LA/final.pth.

CUDA_VISIBLE_DEVICES=0 python eval.py

Visualize character-wise mask map

Examples of the visualization of mask_c:

   CUDA_VISIBLE_DEVICES=0 python visualize.py

You can modify the 'mask_id' in cfgs/cfgs_visualize to change the mask position for visualization.
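
For reference, 'mask_id' is simply an integer selecting which character position is visualized; a minimal illustration (the exact layout of cfgs/cfgs_visualize.py may differ):

    # cfgs/cfgs_visualize.py (illustrative): pick the character position whose
    # mask map (mask_c) should be rendered, e.g. 0 for the first character.
    mask_id = 0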

Results on OST datasets

The Occlusion Scene Text (OST) dataset is proposed to measure the ability to recognize text with missing visual cues. It is collected from the 6 benchmarks (IC13, IC15, IIIT5K, SVT, SVTP and CT) and contains 4832 images. Each image is manually occluded to a weak or heavy degree: weak means the character is covered with one line, heavy with two lines. For each image, we randomly choose one degree and cover only one character.
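
As an illustration of this occlusion protocol (a sketch of the idea, not the exact script used to build OST), one character box can be covered with one line for the weak degree or two lines for the heavy degree using OpenCV; the box coordinates below are hypothetical:

    # Sketch of the weak/heavy occlusion described above (illustrative only).
    import random
    import cv2

    def occlude_character(image, char_box, degree=None):
        """image: HxWx3 uint8 array; char_box: (x1, y1, x2, y2) of one character."""
        x1, y1, x2, y2 = char_box
        degree = degree or random.choice(['weak', 'heavy'])
        num_lines = 1 if degree == 'weak' else 2  # weak: one line, heavy: two lines
        for _ in range(num_lines):
            # draw a random thick line across the chosen character region
            p1 = (random.randint(x1, x2), random.randint(y1, y2))
            p2 = (random.randint(x1, x2), random.randint(y1, y2))
            cv2.line(image, p1, p2, (0, 0, 0), max(2, (y2 - y1) // 4))
        return image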

Examples of images in the OST dataset:

Methods              Average  Weak  Heavy
Paper                60.3     70.3  50.3
This implementation  60.3     70.8  49.8

The OST LMDB dataset is available on BaiduYun (password: yrrj) or RuiKe (password: vmzr).

Training

4 2080Ti GPUs are used in this implementation.

Language-free (LF) process

Step 1: We first train the vision model without the MLM. (Our trained LF_1 model: BaiduYun (password: avs5) or RuiKe (password: qwzn))

   CUDA_VISIBLE_DEVICES=0,1,2,3 python train_LF_1.py

Step 2: We then fine-tune the MLM together with the vision model. (Our trained LF_2 model: BaiduYun (password: 04jg) or RuiKe (password: v67q))

   CUDA_VISIBLE_DEVICES=0,1,2,3 python train_LF_2.py
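
In this step, the parameters already trained in LF_1 are usually optimized with a smaller learning rate than the newly added MLM parameters. Below is a hedged sketch of such parameter groups; the module names and learning rates are illustrative, not the exact values used in cfgs:

    # Illustrative two-group optimizer for LF_2: a small lr for LF_1-pretrained
    # parts and a larger lr for the newly introduced MLM. Names/values are placeholders.
    import torch
    import torch.nn as nn

    class ToyVisionLAN(nn.Module):
        def __init__(self):
            super(ToyVisionLAN, self).__init__()
            self.backbone = nn.Conv2d(3, 64, 3, padding=1)  # stands in for the LF_1-pretrained vision part
            self.mlm = nn.Linear(64, 64)                    # stands in for the MLM added in LF_2

    model = ToyVisionLAN()
    optimizer = torch.optim.Adam([
        {'params': model.backbone.parameters(), 'lr': 1e-5},  # fine-tune gently
        {'params': model.mlm.parameters(),      'lr': 1e-4},  # train from scratch
    ])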

Language-aware (LA) process

Use the mask map to guide the linguistic learning in the vision model.

   CUDA_VISIBLE_DEVICES=0,1,2,3 python train_LA.py

Tip: In the LA process, a model whose loss (Loss VisionLAN) is higher than 0.3 and whose training accuracy (Accuracy) is lower than 91.0 after the first 200 training iterations usually obtains better final performance.
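
To apply this tip, one can simply check the two printed metrics at the 200-iteration mark and keep only runs that satisfy the condition; a minimal sketch (the function and argument names are placeholders for the values logged by the training script):

    # Illustrative check of the tip above at the 200-iteration mark.
    def run_looks_promising(loss_visionlan, train_accuracy):
        # keep LA runs whose loss is still above 0.3 and accuracy below 91.0
        return loss_visionlan > 0.3 and train_accuracy < 91.0

    # e.g. run_looks_promising(0.35, 90.2) -> True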

Improvement

  1. A mask id randomly generated according to the maximum length cannot adapt well to the occlusion of long text. Evenly sampled mask ids can therefore further improve the performance of the MLM (see the sketch after this list).
  2. In our later experiments, a heavier vision model is able to capture more robust linguistic information.
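
As a hedged illustration of item 1 (one possible reading, not the exact implementation), the contrast is between sampling the mask position at random over the fixed maximum length and distributing it evenly over the real character positions of each label:

    # Contrast between the two mask-id strategies (illustrative only).
    import random

    MAX_LEN = 25  # maximum label length used by the recognizer

    def random_mask_id():
        # baseline: a position sampled over the fixed maximum length,
        # so positions of long words are hit unevenly in practice
        return random.randint(0, MAX_LEN - 1)

    def even_mask_id(label, step):
        # alternative: cycle evenly over the real character positions of this label
        return step % len(label)

    # e.g. for label = "international", even_mask_id visits every character in turn.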

Citation

If you find our method useful for your research, please cite

 @article{wang2021two,
  title={From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network},
  author={Wang, Yuxin and Xie, Hongtao and Fang, Shancheng and Wang, Jing and Zhu, Shenggao and Zhang, Yongdong},
  journal={ICCV},
  year={2021}
}

Feedback

Suggestions and discussions are very welcome. Please contact the authors by sending an email to [email protected]

Comments
  • Questions about `Training Pipeline` and `Parallel Attention`

    Hello, the idea presented in the paper is very inspiring to me!

    I have two questions:

    1. About the training pipeline

    The training procedure described in the paper only contains the two stages language-free and language-aware, similar to LF_2 and LA in the code. However, the code additionally adds LF_1 to specifically pre-train the backbone + VRM part, and in the optimizer of the LF_2 stage a different lr is used for the params already trained in LF_1. Is there a big difference between the two training schemes LF_1 --> LF_2 --> LA and LF_2 --> LA?

    2. About how attention is computed in the parallel decoding stage

    My problem is that I do not quite understand the implementation logic.

    Here I compare it with the attention process in the visual part (PVAM) of SRN:

    (a) Attention process in SRN-PVAM (pseudocode, assuming q, k, v all have dimension d_model):

    # e.g. d_model = 512, max_seq_len = seq_len_q = 25, vocab_size = 37
    key2att = nn.Linear(d_model, d_model)
    query2att = nn.Linear(d_model, d_model)
    embedding = nn.Embedding(max_seq_len, d_model)
    score = nn.Linear(d_model, 1)
    classifier = nn.Linear(d_model, vocab_size)
    
    # input is encoder_out
    reading_order = torch.arange(max_seq_len, dtype=torch.long)
    Q = embedding(reading_order)  # (max_seq_len, d_model)
    K = V = encoder_out  # encoder features: (batch_size, seq_len_k, d_model)
    
    # Computing att_weight here is easy to understand; it is the same as the attention
    # in classic attention models such as ASTER
    ######
    att_q = query2att(Q).unsqueeze(0).unsqueeze(2)  # (1, seq_len_q, 1, d_model)
    att_k = key2att(K).unsqueeze(1)  # (batch_size, 1, seq_len_k, d_model)
    att_weight = score(torch.tanh(att_q + att_k)).squeeze(3)  # (batch_size, seq_len_q, seq_len_k)
    ######
    
    att_weight = F.softmax(att_weight, dim=-1)
    decoder_out = torch.bmm(att_weight, V)  # (batch_size, seq_len_q, d_model)
    logits = classifier(decoder_out)  # (batch_size, seq_len_q, vocab_size)
    

    (b) Attention process in VisionLAN:

    # e.g. d_model = 512, max_seq_len = seq_len_q = 25, vocab_size = 37, seq_len_k = 256 (number of visual feature positions)
    embedding = nn.Embedding(max_seq_len, d_model)
    w0 = nn.Linear(max_seq_len, seq_len_k)
    wv = nn.Linear(d_model, d_model)
    we = nn.Linear(d_model, max_seq_len)
    classifier = nn.Linear(d_model, vocab_size)
    
    # input is encoder_out
    K = V = encoder_out  # encoder features: (batch_size, seq_len_k, d_model)
    reading_order = torch.arange(max_seq_len, dtype=torch.long)
    
    # How should the following computation of att_weight be understood?
    #####
    reading_order = embedding(reading_order)  # (seq_len_q, d_model)
    reading_order = reading_order.unsqueeze(0).expand(K.size(0), -1, -1)  # (batch_size, seq_len_q, d_model)
    t = w0(reading_order.permute(0, 2, 1))  # (batch_size, d_model, seq_len_q) ==> (batch_size, d_model, seq_len_k)
    t = torch.tanh(t.permute(0, 2, 1) + wv(K))  # (batch_size, seq_len_k, d_model)
    att_weight = we(t)  # (batch_size, seq_len_k, d_model) ==> (batch_size, seq_len_k, seq_len_q)
    att_weight = att_weight.permute(0, 2, 1)
    ######
    
    att_weight = F.softmax(att_weight, dim=-1)
    decoder_out = torch.bmm(att_weight, V)  # (batch_size, seq_len_q, d_model)
    logits = classifier(decoder_out)  # (batch_size, seq_len_q, vocab_size)
    

    Looking forward to your reply, thanks!

    opened by YanShuang17 3
  • Questions about Benchmark Test Datasets

    How many images do you use in the benchmark datasets IIIT5K, SVT, IC13, IC15, SVTP, and CUTE80, respectively? I am confused by the 4832 images you mentioned in your paper. Thank you!

    opened by GaoXinJian-USTC 2
  • Inconsistent results on all test datasets.

    Thanks for your code. I found my evaluation results are different from the results posted in README.md. I ran the provided models and eval.py, and I get zero accuracy on all datasets.

    I used the test datasets and models that you provided on Ruike.

    The command I executed:

    python eval.py
    

    The result I get:

    ------Average on 6 benchmarks--------                                                                                                          
                                                                                                                                                   
    test accuracy:                                                                                                                                 
    Accuracy: 0.000000, AR: 0.815073, CER: 0.184927, WER: 1.000000, best_acc: 0.000000                                                             
    ------IIIT--------                                                                                                                             
                                                                                                                                                   
    test accuracy:                                                                                                                                 
    Accuracy: 0.000000, AR: 0.821556, CER: 0.178444, WER: 1.000000, best_acc: 0.000000                                                             
    ------IC13--------                                                                                                                             
                                                                                                                                                   
    test accuracy:                                                                                                                                 
    Accuracy: 0.000000, AR: 0.846731, CER: 0.153269, WER: 1.000000, best_acc: 0.000000                                                             
    ------IC15--------                                                                                                                             
                                                                                                                                                   
    test accuracy:                                                                                                                                 
    Accuracy: 0.000000, AR: 0.790097, CER: 0.209903, WER: 1.000000, best_acc: 0.000000
    ------SVT--------
    
    test accuracy: 
    Accuracy: 0.000000, AR: 0.831195, CER: 0.168805, WER: 1.000000, best_acc: 0.000000
    ------SVTP--------
    
    test accuracy: 
    Accuracy: 0.000000, AR: 0.797383, CER: 0.202617, WER: 1.000000, best_acc: 0.000000
    ------CUTE--------
    

    Hope for your response~

    opened by PkuDavidGuan 2
  • Is that a typo?

    Hello! It is excellent work that inspires me a lot ;D While reading your code, I found something strange (maybe a typo):

    train_LA.py

    line145: text_pre, test_rem, text_mas, att_mask_sub = model(data, label_id, cfgs.global_cfgs['step'])

    Should "test_rem" be modified to "text_rem"?

    opened by JingyeChen 2
  • I can not download the OST dataset?

    Thanks for sharing the code and data of your amazing work. I tried to download the OST datasets from both Baidu and RuiKe, but I could not; a login is required. Could you please upload them to Google Drive or share a link that I can download directly?

    opened by zobeirraisi 1
  • Issue with eval.py

    Thanks for your code. After running "CUDA_VISIBLE_DEVICES=0 python eval.py", the following error is reported:

    Traceback (most recent call last):
      File "eval.py", line 11, in
        import cfgs.cfgs_eval as cfgs
      File "/workspace/VisionLAN-main/cfgs/cfgs_eval.py", line 5, in
        from data.dataset_scene import *
      File "/workspace/VisionLAN-main/data/dataset_scene.py", line 16, in
        from transforms import CVColorJitter, CVDeterioration, CVGeometry
    ModuleNotFoundError: No module named 'transforms'

    It looks like an environment problem, but my environment was set up strictly following requirements.txt. What could be going on? I am eagerly looking forward to your answer.

    opened by swqsyy 1
  • Question about Visualization character-wise mask map

    Thanks for your excellent contributions! I tried to use your pre-trained LF_2 model to visualize the mask map, picking the same image shown in the character-wise mask map visualization (P=0) of the README, but I got a different result from yours. I resized the mask map to the original height and width, added it to the original image, and got the result shown in my first attached image.

    Then I simply resized the mask map to the original height and width and visualized it directly, getting the result shown in my second attached image.

    It seems that the above two visualizations differ from yours; is there a problem somewhere?

    opened by GaoXinJian-USTC 1
  • Training problems about CTCloss and Chinese training.

    Dear Yuxin, sorry to bother you again. While using your code, I ran into two new questions: 1. When I executed python train_LF_1.py, I got a CUDA error in ClassNLLCriterion.cu. 2. When I modified the code for Chinese training, the model did not converge.

    Question 1: CUDA error in ClassNLLCriterion.cu.

    The error info:

    THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=29 error=710 : device-side assert triggered                          
    /pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
    

    solution:

    This looks like a bug (my PyTorch version is 1.7.1). When I modified nclass from 37 to 38, the problem was gone. I think 38 is reasonable: 36 normal chars, 1 , and 1 . I modified these two lines:

    VisionLAN.py
    71:        self.Prediction = Prediction(n_position=256, N_max_character=26, n_class=37) # N_max_character = 1 eos + 25 characters
    72:        self.nclass = 37
    

    Question 2: Chinese training fails.

    I modified the code for Chinese training, but the model does not converge; the loss drops very slowly. Did you modify the training config when training on TRW15? (image attached)

    opened by PkuDavidGuan 1