# Scene-Text-Detection-and-Recognition (PyTorch)
- Competition URL: https://tbrain.trendmicro.com.tw/Competitions/Details/19 (6th place on the private leaderboard)
## 1. Proposed Method

### The models

Our model comprises two parts: scene text detection and scene text recognition. The two models are described below:
- Scene Text Detection
We employ YoloV5 [1] to detect ROIs (Regions Of Interest) in an image and ResNet50 [2] to implement the ROI transformation algorithm, which maps the coordinates detected by YoloV5 to a location that fits the text tightly. YoloV5 detects every ROI that might contain a string, while the ROI transformation tightens the bbox around the string region (a sketch follows). The visualization result is illustrated below, where the dark green bbox is the ROI detected by YoloV5 and the red bbox is the ROI after the ROI transformation.
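A minimal sketch of the ROI-transformation idea, assuming it is implemented as a ResNet50 that regresses refined, normalized corner coordinates from a YoloV5 crop; the module name, regression head, and output convention here are illustrative, not the repo's exact implementation:

```python
import torch
import torch.nn as nn
from torchvision import models

class ROITransform(nn.Module):
    """Regress four refined corner points from a YoloV5 crop (sketch only)."""
    def __init__(self):
        super().__init__()
        self.backbone = models.resnet50()
        # Replace the classifier with an 8-dim head: (x1, y1, ..., x4, y4).
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 8)

    def forward(self, crop):
        # crop: (B, 3, 224, 224) ROI resized from the YoloV5 detection.
        # Corners are predicted in [0, 1] relative to the crop, then mapped
        # back to image coordinates by the caller.
        return torch.sigmoid(self.backbone(crop))

model = ROITransform().eval()
with torch.no_grad():
    corners = model(torch.randn(1, 3, 224, 224))  # shape (1, 8)
```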
- Scene Text Recognition
We employ ViT [3] to recognize the string inside each bbox detected by YoloV5, since our task is string recognition rather than single-character recognition. Transformer-based models achieve state-of-the-art performance in Natural Language Processing (NLP), and the attention mechanism lets the model focus on the character it needs to output at each decoding step (a sketch follows). The model architecture is demonstrated below.
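A compact sketch of this recognition setup, assuming a ViT-style patch encoder paired with a transformer decoder that attends to the image tokens at each output step; the vocabulary size, depths, and shapes are assumptions, and the repo's actual model (built on [3]) may differ:

```python
import torch
import torch.nn as nn

class ViTRecognizer(nn.Module):
    """Patch-embedding encoder + attention decoder for string recognition."""
    def __init__(self, vocab_size=100, d_model=256, patch=16, img=224):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, (img // patch) ** 2, d_model))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), 6)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), 2)
        self.char_embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, img, prev_chars):
        # Image -> patch tokens; each decoding step cross-attends to them.
        # (Causal masking on prev_chars is omitted for brevity.)
        tokens = self.patch_embed(img).flatten(2).transpose(1, 2) + self.pos
        memory = self.encoder(tokens)
        out = self.decoder(self.char_embed(prev_chars), memory)
        return self.head(out)  # (B, T, vocab_size) logits

logits = ViTRecognizer()(torch.randn(2, 3, 224, 224),
                         torch.zeros(2, 8, dtype=torch.long))
```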
The whole training process is shown in the figure below.
### Data augmentation

- Random Scale Resize

We found that the image sizes in the public dataset vary widely. If a small image is simply resized up to the model input size, most of its fine detail is lost. To address this, we apply a random scale resize in the training phase: a high-resolution image is downsampled to a random smaller size and then resized back up, so the model learns from realistic low-resolution inputs (a sketch follows the table). The visualization results are demonstrated as follows.
Original image | 72x72 --> 224x224 | 96x96 --> 224x224 | 121x121 --> 224x224 | 146x146 --> 224x224 | 196x196 --> 224x224 |
---|---|---|---|---|---|
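A possible implementation of this augmentation, assuming the target size is drawn from the scales shown above and the image is bilinearly resized down and back up; the repo's exact sampling strategy may differ:

```python
import random
from PIL import Image

def random_scale_resize(img, out_size=224, scales=(72, 96, 121, 146, 196)):
    """Simulate a low-resolution input: shrink to a random small size,
    then upsample back to the model's input size."""
    s = random.choice(scales)
    img = img.resize((s, s), Image.BILINEAR)                # discard fine detail
    return img.resize((out_size, out_size), Image.BILINEAR)

low_res = random_scale_resize(Image.new("RGB", (224, 224)))
```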
- ColorJitter

In the training phase, the model takes RGB images as input. To make the model more robust, we apply the ColorJitter algorithm so that it sees images with varying contrast, brightness, saturation, and hue (a snippet follows the table); this method is also widely used in image classification. The visualization results are demonstrated as follows.
Input image | brightness=0.5 | contrast=0.5 | saturation=0.5 | hue=0.5 | brightness=0.5 contrast=0.5 saturation=0.5 hue=0.5 |
---|---|---|---|---|---|
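This augmentation maps directly onto torchvision's `ColorJitter`; a minimal example with the factor values illustrated above (the values actually used in training are not stated here):

```python
from torchvision import transforms

# Each factor is sampled per image within the given range.
augment = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ColorJitter(brightness=0.5, contrast=0.5,
                           saturation=0.5, hue=0.5),
    transforms.ToTensor(),
])
```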
- Random Rotation

After inspecting the training data, we found that most training images are square and upright (original image), while some of the testing images are slightly skewed. We therefore apply random rotation to improve the model's generalization (a snippet follows the table). The visualization results are demonstrated as follows.
Original image | Random Rotation | Random Horizontal Flip | Both |
---|---|---|---|
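A sketch with torchvision; the rotation range and flip probability are assumptions, since the exact training values are not stated above:

```python
from torchvision import transforms

# Small random rotations plus horizontal flips, as illustrated above.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),       # assumed range
    transforms.RandomHorizontalFlip(p=0.5),      # assumed probability
    transforms.ToTensor(),
])
```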
## 2. Demo

- Predicted results

Before recognizing the bboxes detected by YoloV5, we filter out every bbox smaller than 45x45 pixels, because its resolution is too low for the string to be recognized correctly (a sketch of this step follows).
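A minimal sketch of this filtering step, assuming axis-aligned (x_min, y_min, x_max, y_max) boxes for simplicity; the repo works with four-corner boxes, so the actual check may differ:

```python
def filter_small_bboxes(bboxes, min_size=45):
    """Drop boxes whose width or height is below min_size pixels."""
    return [(x1, y1, x2, y2) for (x1, y1, x2, y2) in bboxes
            if (x2 - x1) >= min_size and (y2 - y1) >= min_size]

kept = filter_small_bboxes([(0, 0, 100, 60), (10, 10, 40, 40)])  # 2nd box dropped
```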
Input image | Scene Text detection | Scene Text recognition
---|---|---
(image) | (image) | 驗車 委託汽車代檢 元力汽車公司 新竹區監理所
(image) | (image) | 3c配件 玻璃貼 專業包膜
(image) | (image) | 台灣大哥大 myfone 新店中正 加盟門市
(image) | (image) | 西門町 楊 排骨酥麵 非常感謝 tvbs食尚玩家 蘋果日報 壹週刊 財訊 錢櫃雜誌 聯合報 飛碟電台 等報導 排骨酥專賣店 西門町 楊 排骨酥麵 排骨酥麵 嘉義店
(image) | (image) | 永晟 電動工具行 492913338
- Attention maps in ViT

We also visualize the attention maps in ViT to check whether the model focuses on the correct locations in the image (a sketch of the overlay step follows the table). The visualization results are demonstrated as follows.
Original image | Attention map |
---|---|
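A sketch of the overlay step for such a visualization, assuming the per-patch attention weights for one query (e.g. one decoded character) have already been extracted from the model; how those weights are obtained depends on the ViT implementation:

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def show_attention(img, attn, grid=14):
    """img: (3, H, W) tensor in [0, 1]; attn: (grid*grid,) patch weights."""
    heat = F.interpolate(attn.reshape(1, 1, grid, grid),
                         size=img.shape[1:], mode="bilinear",
                         align_corners=False)[0, 0]
    plt.imshow(img.permute(1, 2, 0))
    plt.imshow(heat, cmap="jet", alpha=0.5)   # attention heat-map overlay
    plt.axis("off")
    plt.show()

show_attention(torch.rand(3, 224, 224), torch.rand(14 * 14))
```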
## 3. Competition Results

- Public Scores

We conducted extensive experiments; the results are demonstrated below and show the improvement contributed by each added module. At first, we only employed YoloV5 to detect all the ROIs in the images, and detection alone was not accurate enough. We also compare ViT trained with and without data augmentation; the results show that our data augmentation is effective for this task (compare the last row with the sixth row). In addition, we filter out bboxes smaller than 45x45, since their resolution is too low for the strings to be recognized correctly.
Models (Detection / Recognition) | Final score | Precision | Recall
---|---|---|---
YoloV5(L) / ViT(aug) | 0.60926 | 0.7794 | 0.9084
YoloV5(L) + ROI_transformation(Resnet50) / ViT(aug) | 0.73148 | 0.9261 | 0.9017
YoloV5(L) + ROI_transformation(Resnet50) + reduce overlap bbox / ViT(aug) | 0.78254 | 0.9324 | 0.9072
YoloV5(L) + ROI_transformation(SEResnet50) + reduce overlap bbox / ViT(aug) | 0.78527 | 0.9324 | 0.9072
YoloV5(L) + ROI_transformation(SEResnet50) + reduce overlap bbox / ViT(aug) + filter bbox(40x40) | 0.79373 | 0.9333 | 0.9029
YoloV5(L) + ROI_transformation(SEResnet50) + reduce overlap bbox / ViT(aug) + filter bbox(45x45) | 0.79466 | 0.9335 | 0.9011
YoloV5(L) + ROI_transformation(SEResnet50) + reduce overlap bbox / ViT(aug) + filter bbox(50x50) | 0.79431 | 0.9338 | 0.8991
YoloV5(L) + ROI_transformation(SEResnet50) + reduce overlap bbox / ViT(no aug) + filter bbox(45x45) | 0.73802 | 0.9335 | 0.9011
- Private Scores

Models (Detection / Recognition) | Final score | Precision | Recall
---|---|---|---
YoloV5(L) + ROI_transformation(SEResnet50) + reduce overlap bbox / ViT(aug) + filter bbox(40x40) | 0.7828 | 0.9328 | 0.8919
YoloV5(L) + ROI_transformation(SEResnet50) + reduce overlap bbox / ViT(aug) + filter bbox(45x45) | 0.7833 | 0.9323 | 0.8968
YoloV5(L) + ROI_transformation(SEResnet50) + reduce overlap bbox / ViT(aug) + filter bbox(50x50) | 0.7830 | 0.9325 | 0.8944
## 4. Computer Equipment

- System: Windows 10, Ubuntu 20.04
- PyTorch version: PyTorch 1.7 or higher
- Python version: Python 3.6
- Testing:
  - CPU: AMD Ryzen 7 4800H with Radeon Graphics
  - RAM: 32GB
  - GPU: NVIDIA GeForce GTX 1660 Ti 6GB
- Training:
  - CPU: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
  - RAM: 256GB
  - GPU: NVIDIA GeForce RTX 3090 24GB * 2
## 5. Getting Started

- Clone this repo to your local machine:

```bash
git clone https://github.com/come880412/Scene-Text-Detection-and-Recognition.git
cd Scene-Text-Detection-and-Recognition
```

### Download pretrained models

- Scene Text Detection

  Please download the pretrained models from Scene_Text_Detection. There are three folders: "ROI_transformation", "yolo_models", and "yolo_weight". First, put the weights in "ROI_transformation" into `./Scene_Text_Detection/Tranform_card/models/`. Second, put all the models in "yolo_models" into `./Scene_Text_Detection/yolov5-master/`. Finally, put the weight in "yolo_weight" into `./Scene_Text_Detection/yolov5-master/runs/train/expl/weights/`.

- Scene Text Recognition

  Please download the pretrained models from Scene_Text_Recognition. There are two files in this folder, "best_accuracy.pth" and "character.txt". Put both files into `./Scene_Text_Recogtion/saved_models/`.
### Inference

- First download the pretrained models, then change directory to `./Scene_Text_Detection/yolov5-master/` and run:

```bash
$ python Text_detection.py
```

- The results are saved in `../output/`. The folder "example" contains the images detected by YoloV5 and after ROI transformation; the file "example.csv" records the coordinates of each bbox, clockwise from the upper-left corner: (x1, y1), (x2, y2), (x3, y3), and (x4, y4); and the file "example_45.csv" is the predicted result.
- If you would like to visualize the bboxes detected by YoloV5, you can use the function `public_crop()` in the script `../../data_process.py` to extract the bboxes from the images.
### Training

- First download the dataset provided by the competition organizer and put the data in `../dataset/`. After that, use the following script to transform the original data into the training format:

```bash
$ python data_process.py
```

- Scene_Text_Detection

  There are two models for the scene text detection task: ROI transformation and YoloV5. You can train them with the following scripts:

```bash
$ cd ./Scene_Text_Detection/yolov5-master  # YoloV5
$ python train.py
$ cd ../Tranform_card/                     # ROI transformation
$ python Trainer.py
```

- Scene_Text_Recognition

```bash
$ cd ./Scene_Text_Recogtion  # ViT for text recognition
$ python train.py
```
## References

[1] YoloV5: https://github.com/ultralytics/yolov5
[2] ResNet: https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py
[3] ViT for text recognition: https://github.com/roatienza/deep-text-recognition-benchmark
[4] Four-point perspective transform: https://www.pyimagesearch.com/2014/08/25/4-point-opencv-getperspective-transform-example/
[5] Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7132-7141).