textspotter - An End-to-End TextSpotter with Explicit Alignment and Attention

Overview

An End-to-End TextSpotter with Explicit Alignment and Attention

The method is initially described in our CVPR 2018 paper.

Getting Started

Installation

  • Clone the code
git clone https://github.com/tonghe90/textspotter
cd textspotter
  • Install Caffe. You can follow this tutorial. If you have a build problem about std::allocator, please refer to issue #3
# make sure you set WITH_PYTHON_LAYER := 1
# change Makefile.config according to your library path
cp Makefile.config.example Makefile.config
make clean
make -j8
make pycaffe
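
After the build finishes, a quick way to confirm that pycaffe imports correctly is a check like the following (a minimal sketch; it assumes you run it from the repository root, with the bindings built by make pycaffe):

# check_caffe.py - minimal sanity check for the pycaffe build
import sys
sys.path.insert(0, './python')  # pycaffe bindings built by `make pycaffe`

import caffe
caffe.set_mode_gpu()  # the released custom layers target GPU mode
print('pycaffe loaded from: ' + caffe.__file__)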

Training

We provide part of the training code, but it cannot be run directly.
We have added comments in [train.pt](https://github.com/tonghe90/textspotter/models/train.pt).
You have to write your own IoU-loss layer, which we cannot publish for IP reasons.
In particular, note:
[L6902](https://github.com/tonghe90/textspotter/models/train.pt#L6902)
[L6947](https://github.com/tonghe90/textspotter/models/train.pt#L6907)
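
The missing layer implements an IoU loss over the predicted text geometries. As a rough guide, a common formulation is -log(IoU) between the predicted and ground-truth boxes at each positive pixel. Below is a minimal numpy sketch of that formulation in the EAST style, with four per-pixel distances to the box edges; it is an illustrative reconstruction under an assumed map layout, not the authors' layer:

import numpy as np

def iou_loss(pred, gt, mask, eps=1e-6):
    """Hypothetical -log(IoU) loss over per-pixel box geometries.

    pred, gt: arrays of shape (4, H, W) holding each pixel's distances to
              the top/bottom/left/right edges of its text box (an assumed,
              EAST-style layout, not necessarily this repo's format).
    mask:     (H, W) binary map selecting pixels inside text regions.
    """
    pt, pb, pl, pr = pred
    gt_t, gt_b, gt_l, gt_r = gt
    area_pred = (pt + pb) * (pl + pr)
    area_gt = (gt_t + gt_b) * (gt_l + gt_r)
    # overlap of the two boxes, measured at each pixel
    h_inter = np.minimum(pt, gt_t) + np.minimum(pb, gt_b)
    w_inter = np.minimum(pl, gt_l) + np.minimum(pr, gt_r)
    inter = h_inter * w_inter
    union = area_pred + area_gt - inter
    iou = (inter + eps) / (union + eps)
    return (-np.log(iou) * mask).sum() / max(float(mask.sum()), 1.0)

Wrapping such a function in a Caffe Python layer (hence WITH_PYTHON_LAYER := 1 above) would follow the same pattern as the layers in ./pylayer.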

Testing

  • Install editdistance and pyclipper: pip install editdistance pyclipper

  • After Caffe is set up, you need to download the trained model (about 40 MB) from Google Drive. This model is trained with VGG800k and fine-tuned on ICDAR 2015.

  • Run python test.py --img=./imgs/img_105.jpg

  • Hyperparameters:

cfg.py --mean_val ==> mean value subtracted from the input during testing.
       --max_len ==> maximum length of a text string (we use 25, meaning a word can contain at most 25 characters).
       --recog_th ==> threshold used during recognition; the score of a word is the average score of its characters.
       --word_score ==> threshold for words that contain numbers or symbols, since these are not covered by the dictionary.

test.py --weight ==> caffemodel weights file.
        --prototxt-iou ==> prototxt file for detection.
        --prototxt-lstm ==> prototxt file for recognition.
        --img ==> image file or folder for testing; supported formats can be extended in the is_image function in ./pylayer/tool.
        --scales-ms ==> multi-scale input sizes used during testing.
        --thresholds-ms ==> corresponding text-region thresholds for the multi-scale inputs.
        --nms ==> NMS threshold for testing.
        --save-dir ==> directory where results are saved in the ICDAR 2015 submission format.
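
For example, a full invocation over a folder of images might look like this (paths and values are illustrative; check test.py for the exact argument formats):

python test.py --weight models/textspotter.caffemodel --prototxt-iou models/test_iou.pt --prototxt-lstm models/test_lstm.pt --img ./imgs/ --save-dir ./results/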
One thing should be noted: the recognition results are obtained by matching the raw output against a dictionary of about 90k words.
This dictionary contains no numbers or symbols. You can remove the dictionary-matching step and output the raw recognition results directly.
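
The matching step amounts to choosing the dictionary word with the smallest edit distance to the raw recognition output. A hypothetical sketch using the editdistance package installed above (the function name and threshold are illustrative, not the repo's code):

import editdistance

def match_to_dictionary(raw_word, dictionary, max_dist=2):
    """Return the closest dictionary word by edit distance,
    or the raw output if nothing is close enough."""
    best_word, best_dist = raw_word, max_dist + 1
    for cand in dictionary:
        d = editdistance.eval(raw_word.lower(), cand.lower())
        if d < best_dist:
            best_word, best_dist = cand, d
    return best_word if best_dist <= max_dist else raw_word

A larger max_dist makes matching more forgiving but increases the risk of snapping a correctly recognized out-of-dictionary word onto a wrong lexicon entry.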

Citation

If you use this code for your research, please cite our paper.

@inproceedings{tong2018,
  title={An End-to-End TextSpotter with Explicit Alignment and Attention},
  author={T. He and Z. Tian and W. Huang and C. Shen and Y. Qiao and C. Sun},
  booktitle={Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on},
  year={2018}
}

License

This code is for NON-COMMERCIAL purposes only. For commercial purposes, please contact Chunhua Shen [email protected]. This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3. Please refer to http://www.gnu.org/licenses/ for more details.

Comments
  • Out of memory error on test example


    When I try to run the simple example given in the README

    python test.py --img=./imgs/img_105.jpg
    

    I get an out of memory error:

    F0426 09:41:13.545714 20964 syncedmem.cpp:71] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
    *** Check failure stack trace: ***
    

    I am trying to run this on a GTX 1080, which has 8120 MB of global memory (according to deviceQuery).

    When I tabulate the "Memory required for data" lines from the caffe log output, it adds up to 381 GB, though perhaps this isn't all required simultaneously or it is otherwise a double-counting. The same failure occurs when I try a much smaller (140x180 px) crop of the same image.

    Is that right? Do you expect the model to fit and run within roughly 8GB of GPU memory? If not, how much memory is required to run this model?

    EDIT: Same error happens on another host with K40 and K80 GPUs (each with roughly 12GB of GPU memory)

    opened by weinman 21
  • Do I need to write the OHEM layer?


    Hi @tonghe90: I have been trying to implement your code recently and found a layer named "OHEM". I am not sure whether I need to write this layer myself or not.

    opened by fsluckymao 9
  • Is CPU mode supported?


    I kept getting errors while building the project with CPU_ONLY, and I found that the forward and backward passes of at_layer.cpp are not implemented. Have you considered adding a CPU implementation?

    opened by eugene123tw 9
  • Error loading parameters


    Hello @tonghe90,

    Congrats on the good project and paper. I am trying to test your code but I am having problems loading the params. Do you have any idea why this is happening?

    WARNING: Logging before InitGoogleLogging() is written to STDERR
    W0514 12:58:54.210842 2459 _caffe.cpp:139] DEPRECATION WARNING - deprecated use of Python interface
    W0514 12:58:54.210868 2459 _caffe.cpp:140] Use this instead (with the named "weights" parameter):
    W0514 12:58:54.210873 2459 _caffe.cpp:142] Net('./models/test_iou.pt', 1, weights='./models/textspotter.caffemodel')
    [libprotobuf ERROR google/protobuf/text_format.cc:288] Error parsing text-format caffe.NetParameter: 7067:24: Message type "caffe.LayerParameter" has no field named "point_bilinear_param".
    F0514 12:58:54.243449 2459 upgrade_proto.cpp:88] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: ./models/test_iou.pt

    opened by AndresPMD 9
  • Test error


    models/textspotter.caffemodel
    WARNING: Logging before InitGoogleLogging() is written to STDERR
    W0718 15:41:54.913141 23734 _caffe.cpp:140] DEPRECATION WARNING - deprecated use of Python interface
    W0718 15:41:54.913169 23734 _caffe.cpp:141] Use this instead (with the named "weights" parameter):
    W0718 15:41:54.913173 23734 _caffe.cpp:143] Net('models/test_iou.pt', 1, weights='models/textspotter.caffemodel')
    [libprotobuf ERROR google/protobuf/text_format.cc:274] Error parsing text-format caffe.NetParameter: 7067:24: Message type "caffe.LayerParameter" has no field named "point_bilinear_param".
    F0718 15:41:54.915925 23734 upgrade_proto.cpp:88] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: models/test_iou.pt
    *** Check failure stack trace: ***
    Aborted (core dumped)

    opened by zhuliqian 8
  • Unknown layer type


    Hello @tonghe90, when I run python test.py --img=./imgs/img_105.jpg, the following happens:

    I0902 18:58:18.874758 37749 net.cpp:129] Top shape: 1 1 128 128 (16384)
    I0902 18:58:18.874763 37749 net.cpp:137] Memory required for data: 711786500
    I0902 18:58:18.874770 37749 layer_factory.hpp:77] Creating layer iou_maps_angles
    F0902 18:58:18.874795 37749 layer_factory.hpp:81] Check failed: registry.count(type) == 1 (0 vs. 1) Unknown layer type: Python (known types: AbsVal, Accuracy, AffineTransformer, ArgMax, AttLstm, BNLL, BatchNorm, BatchReindex, Bias, Concat, ContrastiveLoss, Convolution, CosinangleLoss, Crop, Data, Deconvolution, Dropout, DummyData, ELU, Eltwise, Embed, EuclideanLoss, Exp, Filter, Flatten, HDF5Data, HDF5Output, HingeLoss, Im2col, ImageData, InfogainLoss, InnerProduct, Input, LRN, LSTMNew, LSTMUnit, Log, Lstm, MVN, MemoryData, MultinomialLogisticLoss, PReLU, Parameter, PointBilinear, Pooling, Power, RNN, ROIPooling, ReLU, Reduction, Reshape, ReverseAxis, SPP, Scale, Sigmoid, SigmoidCrossEntropyLoss, Silence, Slice, SmoothL1Loss, Softmax, SoftmaxWithLoss, Split, Sum, TanH, Threshold, Tile, Transpose, UnitboxLoss, WindowData)
    *** Check failure stack trace: ***
    Aborted (core dumped)

    Can you give me a suggestion about how to deal with it? Thank you very much!

    opened by kelulucaipeixi 5
  • How to train?


    @tonghe90 I would like to ask you some questions about training: 1) How do I build the train_val.prototxt file for training from the two testing prototxt files you have given, test_iou.pt and test_lstm.pt? I am sorry that I have not used this kind of branch network before. 2) In the paper you mentioned three steps of training. I want to know how to keep the detection branch fixed, or how to open it up. Because I am a newbie, I hope that you can give me some guidance; the more detailed the better. Thank you very much.

    opened by chunhui999 4
  • Why does score_map have two channels?


    Hi @tonghe90: I have reviewed your train.pt file and found that the score map generated by the 1x1 convolution layer ("score_4s" in the file) has two channels. Given your answer in this issue: https://github.com/tonghe90/textspotter/issues/16#issuecomment-405921502, this confuses me. Should I prepare a corresponding two-channel score map as supervision information, or just one channel?

    opened by fsluckymao 3
  • There are many errors about "at_layer.cpp" and other layers

    Hi He: I cloned your code, but when I ran make I encountered many errors. Is the code you have released incomplete, or is there some other reason? Some of the errors are as follows:

    src/caffe/layers/at_layer.cpp:18:20: error: request for member ‘output_h’ in ‘param’, which is of non-class type ‘const int’
       output_H_ = param.output_h();
    src/caffe/layers/at_layer.cpp:20:12: error: request for member ‘has_output_w’ in ‘param’, which is of non-class type ‘const int’
       if (param.has_output_w()) {
    src/caffe/layers/at_layer.cpp:21:21: error: request for member ‘output_w’ in ‘param’, which is of non-class type ‘const int’

    /usr/include/c++/5/bits/stl_vector.h:303:7: note: candidate expects 3 arguments, 5 provided
    /usr/include/c++/5/bits/stl_vector.h:264:7: note: candidate: std::vector<_Tp, _Alloc>::vector(const allocator_type&) [with _Tp = int; _Alloc = std::allocator; std::vector<_Tp, _Alloc>::allocator_type = std::allocator]
       vector(const allocator_type& __a) _GLIBCXX_NOEXCEPT
    /usr/include/c++/5/bits/stl_vector.h:264:7: note: candidate expects 1 argument, 5 provided
    /usr/include/c++/5/bits/stl_vector.h:253:7: note: candidate: std::vector<_Tp, _Alloc>::vector() [with _Tp = int; _Alloc = std::allocator]

    opened by liuxi2018 3
  • "gt_label" in tool_layers/gen_gts_layer

    Hi @tonghe90: sorry to bother you again. I can understand almost all of your code, but I am really confused about the custom layer "gen_gts_layer", especially the bottom[0] blob "gt_bbox", whose shape is N * 1 * H * W. I don't know what exactly gt_bbox is or what the values in gt_bbox mean.

    https://github.com/tonghe90/textspotter/blob/0166abdbe68bfe0a416a4a1d35ab8d1e1fcfe262/pylayer/tool_layers.py#L304

    for n in range(batch_size):
        gt_label = bottom[0].data[n, 0]  # gt_label is a matrix, shape = H*W
        tmp = np.sum(gt_label, axis=1)
        gt_num = len(np.where(tmp != 0)[0])
        if gt_num == 0:
            continue
        roi_n = gt_label[:gt_num, :8] * 4  # here I can't understand.
        roi_n = np.hstack((np.ones((gt_num, 1)) * n, roi_n))
        gt_boxes = np.vstack((gt_boxes, roi_n))

    opened by fsluckymao 2
  • How about the time cost?


    I ran test.py on a Tesla P40. If I set the scale to 1000, the detection part takes around 1 s; when I set the scale to 300, the detection part takes around 0.3 s. How about others, is this normal? And if I want to speed up the forward pass, is there any suggestion? Thanks.

    opened by wyhgood 2
  • @tonghe90 Are there any restrictions on the size of the input bbox?

    I commented out both the loss_4s and IoU-loss layers, so only the softmaxwithloss for text recognition remains (neither the mask loss nor the IoU loss takes part in training). I then wrote my own input data layer that outputs images containing text (640*640), together with the coordinates of the four corner points of the ground-truth bboxes and the text labels. But during training I hit a segmentation fault reporting an out-of-bounds memory access. Are there any restrictions on the size of the bbox fed into the point bilinear layer? With 64*8 sampling points, are there any requirements on the input bbox size?

    opened by wenston2006 1
  • SynthText pre-processing and Table 2 accuracies


    Hi,

    Can you please describe the steps taken for pre-processing the SynthText labels?

    Your model uses a fixed max length of 25, but the SynthText dataset has boxes whose ground-truth labels (number of characters per box) can be >= 35 characters long.

    Also, how did you get the accuracies mentioned in Table 2? Is that after all steps of training? It says accuracy on the ICDAR dataset but also says the ground truth is used. Or is it after training on SynthText and then fine-tuning for 80k iterations on ICDAR, i.e. after step 2 of training?

    opened by crazysal 2
  • Issues about gen_gts_layer


    Q1: In train.pt, "gt_bbox" is annotated as "N * 8 ### ground truth boxes for text (for computing loss)", but in the gen_gts_layer class in tool_layers.py it is annotated as "bottom[0]: gt_label [N,1,sz,sz]". What does gt_bbox mean?
    Q2: Could you please provide an intuitive explanation of what the following variables are: 'sample_gt_cont', 'sample_gt_label_input', 'sample_gt_label_output'?

    opened by chunhui999 13