A curated list of promising OCR resources


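Baidu's own offering: basically not worth considering.
The third-party APIs and Alibaba's own APIs concentrate on ID cards, bank cards, driver's licenses, passports, e-commerce product review text, license plates, business cards, Tieba post text, and text in video. Most return the recognized characters with their coordinates, card-type documents can be returned as structured fields, and the price is around 0.01.
Three other providers offer résumé parsing. The output is mostly structured fields, both document and image formats are supported, and the price ranges from 0.1 to 0.3 per call.
No third parties have joined so far; there are only Tencent's own APIs, covering license plates, business cards, ID cards, driver's licenses, bank cards, business licenses, and general printed text, with prices up to about 0.2.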
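Where does OcrKing come from?

OcrKing began in early 2009 as a project that Aven built for his own use in data mining. Driven by dedication to and love of the technology, it has gone through nearly seven years of accumulation and iteration, and has evolved into a cloud-based OCR system that combines multi-layer neural networks with deep learning. In early 2010, a web version of the text OCR service was built to make it easier for more users to use. From the beginning, OcrKing has provided free recognition and a developer API, and it will continue to provide a free cloud OCR service. OcrKing has never been promoted, but it has quietly and genuinely been there, because he believes that people who need it will find it. Feel free to introduce OcrKing online recognition to friends with similar needs. We hope this tool is useful to you, and thank you for your support!

What can OcrKing do?

OcrKing is a free, fast, and easy-to-use online cloud OCR platform that recognizes the content of PDFs and images and produces an editable document. It supports multiple input and output file formats, multiple languages (Simplified Chinese, Traditional Chinese, English, Japanese, Korean, German, French, and others), multiple recognition modes, multiple system platforms, and multiple forms of API calls.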
Connectionist Temporal Classification is a loss function useful for performing supervised learning on sequence data, without needing an alignment between input data and labels. For example, CTC can be used to train end-to-end systems for speech recognition, which is how we have been using it at Baidu's Silicon Valley AI Lab.

Warp-CTC is a CTC library that runs efficiently in parallel on both CPUs and GPUs.
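To make the idea of an alignment-free loss concrete, here is a minimal pure-Python sketch of the CTC forward recursion. It is not Warp-CTC's implementation: the function name `ctc_neg_log_likelihood`, the input layout (a list of per-frame probability lists), and the choice of index 0 for the blank symbol are assumptions made for illustration, and it works in plain probabilities rather than log space, so it is only suitable for short sequences.

```python
import math


def ctc_neg_log_likelihood(probs, labels, blank=0):
    """Negative log-likelihood of `labels` under CTC.

    probs  : list of T frames, each a list of per-class probabilities
    labels : target label indices, without blanks
    blank  : index of the blank symbol (assumed to be 0 here)
    """
    # Extended target: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    n_states, n_frames = len(ext), len(probs)

    # alpha[s] = total probability of all partial alignments ending in state s
    alpha = [0.0] * n_states
    alpha[0] = probs[0][ext[0]]
    if n_states > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, n_frames):
        new_alpha = [0.0] * n_states
        for s in range(n_states):
            total = alpha[s]                      # stay in the same state
            if s >= 1:
                total += alpha[s - 1]             # advance by one state
            # Skip over a blank, allowed only between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end in the last label or the trailing blank
    likelihood = alpha[-1] + (alpha[-2] if n_states > 1 else 0.0)
    return -math.log(likelihood)


# Two frames, two classes (blank and one label), uniform predictions
loss = ctc_neg_log_likelihood([[0.5, 0.5], [0.5, 0.5]], [1])
```

In this toy call, three of the four possible frame-level paths collapse to the target, so the loss sums over all of them without ever being told which frame carries the label.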



Building on recent advances in image caption generation and optical character recognition (OCR), we present a general-purpose, deep learning-based system to decompile an image into presentational markup. While this task is a well-studied problem in OCR, our method takes an inherently different, data-driven approach. Our model does not require any knowledge of the underlying markup language, and is simply trained end-to-end on real-world example data. The model employs a convolutional network for text and layout recognition in tandem with an attention-based neural machine translation system. To train and evaluate the model, we introduce a new dataset of real-world rendered mathematical expressions paired with LaTeX markup, as well as a synthetic dataset of web pages paired with HTML snippets. Experimental results show that the system is surprisingly effective at generating accurate markup for both datasets. While a standard domain-specific LaTeX OCR system achieves around 25% accuracy, our model reproduces the exact rendered image on 75% of examples. 

We present recursive recurrent neural networks with attention modeling (R2AM) for lexicon-free optical character recognition in natural scene images. The primary advantages of the proposed method are: (1) use of recursive convolutional neural networks (CNNs), which allow for parametrically efficient and effective image feature extraction; (2) an implicitly learned character-level language model, embodied in a recurrent neural network which avoids the need to use N-grams; and (3) the use of a soft-attention mechanism, allowing the model to selectively exploit image features in a coordinated way, and allowing for end-to-end training within a standard backpropagation framework. We validate our method with state-of-the-art performance on challenging benchmark datasets: Street View Text, IIIT5k, ICDAR and Synth90k.

Clustering is central to many data-driven application domains and has been studied extensively in terms of distance functions and grouping algorithms. Relatively little work has focused on learning representations for clustering. In this paper, we propose Deep Embedded Clustering (DEC), a method that simultaneously learns feature representations and cluster assignments using deep neural networks. DEC learns a mapping from the data space to a lower-dimensional feature space in which it iteratively optimizes a clustering objective. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods.

In recent years, recognition of text from natural scene image and video frame has got increased attention among the researchers due to its various complexities and challenges. Because of low resolution, blurring effect, complex background, different fonts, color and variant alignment of text within images and video frames, etc., text recognition in such scenario is difficult. Most of the current approaches usually apply a binarization algorithm to convert them into binary images and next OCR is applied to get the recognition result. In this paper, we present a novel approach based on color channel selection for text recognition from scene images and video frames. In the approach, at first, a color channel is automatically selected and then selected color channel is considered for text recognition. Our text recognition framework is based on Hidden Markov Model (HMM) which uses Pyramidal Histogram of Oriented Gradient features extracted from selected color channel. From each sliding window of a color channel our color-channel selection approach analyzes the image properties from the sliding window and then a multi-label Support Vector Machine (SVM) classifier is applied to select the color channel that will provide the best recognition results in the sliding window. This color channel selection for each sliding window has been found to be more fruitful than considering a single color channel for the whole word image. Five different features have been analyzed for multi-label SVM based color channel selection where wavelet transform based feature outperforms others. Our framework has been tested on different publicly available scene/video text image datasets. For Devanagari script, we collected our own data dataset. The performances obtained from experimental results are encouraging and show the advantage of the proposed method.

Recently, scene text detection has become an active research topic in computer vision and document analysis, because of its great importance and significant challenge. However, vast majority of the existing methods detect text within local regions, typically through extracting character, word or line level candidates followed by candidate aggregation and false positive elimination, which potentially exclude the effect of wide-scope and long-range contextual cues in the scene. To take full advantage of the rich information available in the whole natural image, we propose to localize text in a holistic manner, by casting scene text detection as a semantic segmentation problem. The proposed algorithm directly runs on full images and produces global, pixel-wise prediction maps, in which detections are subsequently formed. To better make use of the properties of text, three types of information regarding text region, individual characters and their relationship are estimated, with a single Fully Convolutional Network (FCN) model. With such predictions of text properties, the proposed algorithm can simultaneously handle horizontal, multi-oriented and curved text in real-world natural images. The experiments on standard benchmarks, including ICDAR 2013, ICDAR 2015 and MSRA-TD500, demonstrate that the proposed algorithm substantially outperforms previous state-of-the-art approaches. Moreover, we report the first baseline result on the recently-released, large-scale dataset COCO-Text.


The complete feature description process: http://dataunion.org/wp-content/uploads/2015/05/640.webp_2.jpg



Commercial products


Convert scanned images of documents into rich text with advanced Deep Learning OCR APIs. Free forever plans available.
  • IRIS
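 Only a handful of vendors do Chinese OCR at a truly professional level: two in China and two abroad. The Chinese ones are 文通 (Wintone) and 汉王 (Hanwang); the foreign ones are ABBYY and IRIS. (Taiwan used to have two, 丹青 and 蒙恬, but they have been quiet in recent years.) Products that people often mention, such as 紫光OCR, CAJViewer, MS Office, 清华OCR, and 慧视小灵鼠, are either 文通 products or use the 文通 recognition engine, while 尚书 is a 汉王 product that is bundled with 中晶 scanners. Both vendors achieve very good Chinese recognition rates. The two foreign vendors stand out for their recognition rates on Western languages, their support for many Western European languages, and their high degree of productization, but their speed and accuracy on Chinese still lag behind, although they have kept improving in recent years. Google's open-source project, at least for Chinese, still trails all of these vendors by a fairly wide margin on every performance metric.

The best free API seen so far; a commercial edition is also available.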

OCR Databases


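  • Deep-learning-based OCR, from the Meituan tech team (美團技術團隊)

    http://tech.meituan.com/deeplearning_application.html

    To improve the user experience, O2O products now need OCR in order listing, payment, delivery, and user reviews. In Meituan-Dianping's business, OCR plays two main roles. The first is assisted data entry: for example, recognizing a photographed bank card number during mobile payment so that the card can be bound automatically, or helping BD (business development) staff enter the dishes on a menu. The second is review and verification: for example, extracting and checking information from the photos of ID cards, business licenses, and catering permits that merchants upload during qualification review, to confirm that the merchant is legitimate, or automatically filtering images containing prohibited words that appear in merchant listings and user reviews. Compared with traditional OCR scenarios (printed text, scanned documents), Meituan's OCR scenarios mainly involve extracting and recognizing text in photos taken with mobile phones. Given the diversity of offline users, this mainly poses the following challenges: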





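    1. Text localization based on Faster R-CNN and FCN


    For controlled scenes, we cast text localization as the detection of specific keyword targets, mainly using Faster R-CNN, as shown in the figure below. To maintain the localization accuracy of the regression boxes while improving speed, we fine-tuned the original framework and the training procedure:


    Figure 4: Text localization in controlled scenes based on Faster R-CNN

    For uncontrolled scenes, text orientation and stroke width vary arbitrarily, so the regression boxes used in object detection are too coarse for localization. We therefore use a fully convolutional network (FCN), which is common in semantic segmentation, to label each pixel as text or background, as shown in the figure below. To keep the localization precise and the semantics clear at the same time, we apply deconvolution not only at the last layer but also fuse the deconvolution results of deep and shallow layers.

    Figure 5: Text localization in uncontrolled scenes based on FCN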
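    A pixel-level text/background map still has to be turned into text regions. The article does not say how Meituan does this, so the sketch below is only an assumption: it groups 4-connected text pixels with a breadth-first search and reports one bounding box per group. The function name `text_boxes_from_mask` and the mask layout (rows of 0/1 values) are hypothetical.

```python
from collections import deque


def text_boxes_from_mask(mask):
    """Return (top, left, bottom, right) boxes of 4-connected text regions.

    `mask` is a list of rows of 0/1 values, where 1 marks a text pixel.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if mask[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one region and track its extent
            top, left, bottom, right = y, x, y, x
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


mask = [
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
boxes = text_boxes_from_mask(mask)
```

    Axis-aligned boxes are the simplest choice; for arbitrarily oriented text, a rotated rectangle or polygon per region would fit better.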

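    2. Text recognition based on a sequence learning framework


    Figure 6: End-to-end recognition framework based on sequence learning


    Figure 7: Performance comparison of deep learning OCR and traditional OCR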

  • CTPN test

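    # Run the dc/ctpn image with the current directory mounted as the demo image folder and port 8888 published
    docker run --rm -it -v `pwd`:/opt/ctpn/CTPN/demo_images -p 8888:8888 dc/ctpn
    # Alternatively, open a shell inside the container
    docker run --rm -it -v `pwd`:/opt/ctpn/CTPN/demo_images dc/ctpn /bin/bash
    # Inside the container, run the demo without a GPU
    root@8a1d73be4cbc:/opt/ctpn/CTPN# python tools/demo.py --no-gpu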
  • Adnan Ul-Hasan's PhD thesis, Chapter 4: Training Data

    Benchmark Datasets for OCR

    Numerous character recognition algorithms require sizable ground-truthed real-world data for training and benchmarking. The quantity and quality of training data directly affects the generalization accuracy of a trainable OCR model. However, developing GT data manually is overwhelmingly laborious, as it involves a lot of effort to produce a reasonable database that covers all possible words of a language. Transcribing historical documents is even more gruelling as it requires language expertise in addition to manual labelling efforts. The increased human efforts give rise to financial aspects of developing such datasets and could restrict the development of large-scale annotated databases for the purpose of OCR. It has been pointed out in the previous chapter that scarcity of training data is one of the limiting factors in developing reliable OCR systems for many historical as well as for some modern scripts.

    The challenge of limited training data has been overcome by the following contributions of this thesis:

      • A semi-automated methodology to generate the GT database for cursive scripts at ligature level has been proposed. This methodology can equally be applied to produce character-level GT data. Section 4.2 reports the specifics of this method for cursive Nabataean scripts.
      • Synthetically generated text-line databases have been developed to enhance the OCR research. These datasets include a database for Devanagari script (Deva-DB), a subset of printed Polytonic Greek script (Polytonic-DB), and three datasets for Multilingual OCR (MOCR) tasks. Section 4.3 details this process and describes the fine points about these datasets.

    4.1 Related Work

    There are basically two types of methodologies that have been proposed in the literature. The first is to extract identifiable symbols from the document image and apply some clustering methods to create representative prototypes. These prototypes are then assigned text labels.

    The second approach is to synthesize the document images from the textual data. These images are degraded using various image defect models to reflect the scanning artifacts. These degradation models [Bai92] include resolution, blur, threshold, sensitivity, jitter, skew, size, baseline, and kerning. Some of these artifacts are discussed in Section 4.3 where they are used to generate text-line images from the text.

    The use of synthesized training data is increasing and there are many datasets reported in the literature using this methodology. One dataset that is prominent among these types is the Arabic Printed Text Images (APTI) database, which is proposed by Slimane et al. [SIK+09]. This database is synthetically generated covering ten different Arabic fonts and as many font-sizes (ranging from 6 to 24). It is generated from various Arabic sources and contains over 1 million words. The number increases to over 45 million words when rendered using ten fonts, four styles and ten font-sizes.

    Another example of a synthetic text-line image database is the Urdu Printed Text Images (UPTI) database, published by Sabbour and Shafait [SS13]. This dataset consists of over 10 thousand unique text-lines selected from various sources. Each text-line is rendered synthetically with various degradation parameters. Thus the actual size of the database is quite large. The database contains GT information at both text-line and ligature levels.

    The second approach in automating the process of generating an OCR database from scanned document images is to find the alignment of the transcription of the text lines with the document image. Kanungo et al. [KH99] presented a method for generating character GT automatically for scanned documents. A document is first created electronically using any typesetting system. It is then printed out and scanned. Next, the corresponding feature points from both versions of the same document are found and the parameters of the transformation are estimated. The ideal GT information is transformed accordingly using these estimates. An improvement in this method is proposed by Kim and Kanungo [KK02] by using an attributed branch-and-bound algorithm. Von Beusekom et al. [vBSB08] proposed a robust and pixel-accurate alignment method. In the first step, the global transformation parameters are estimated in a similar manner as in [KK02]. In the second step, the adaptation of the smaller region is carried out.

    Pechwitz et al. [PMM+02] presented the IfN/ENIT database of handwritten Arabic names of cities along with their postal codes. A projection profile method is used to extract words and the postal codes automatically. Mozaffari et al. [MAM+08] developed a similar database (IfN/Farsi-database) for handwritten Farsi (Persian) names of cities. Sagheer et al. [SHNS09] also proposed a similar methodology for generating an Urdu database for handwriting recognition.

    Vamvakas et al. [VGSP08] proposed that a character database for historical documents may be constructed by choosing a small subset of images and then using character segmentation and clustering techniques. This work is similar to our approach; however, the main difference is the use of a different segmentation technique for Urdu ligatures and the utilization of a dissimilar clustering algorithm.
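    The synthesize-then-degrade approach in the excerpt above can be illustrated with a toy pipeline. This is a sketch under stated assumptions, not the defect model of [Bai92]: it covers only three of the listed parameters, using a 3x3 mean filter for blur, a fixed cut-off for threshold, and random pixel flips as a crude stand-in for per-pixel sensitivity noise. The function names and default values are hypothetical.

```python
import random


def box_blur(image):
    """3x3 mean filter over a grayscale image (rows of floats in [0, 1])."""
    height, width = len(image), len(image[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Average the pixel with its in-bounds neighbours
            window = [image[ny][nx]
                      for ny in range(max(0, y - 1), min(height, y + 2))
                      for nx in range(max(0, x - 1), min(width, x + 2))]
            blurred[y][x] = sum(window) / len(window)
    return blurred


def degrade(image, threshold=0.5, flip_prob=0.02, seed=0):
    """Blur, binarize, and add pixel noise to a rendered text-line image.

    Returns rows of 0/1 values. `threshold` and `flip_prob` are
    illustrative defaults, not values from the thesis.
    """
    rng = random.Random(seed)  # seeded so a given setting is reproducible
    degraded = []
    for row in box_blur(image):
        new_row = []
        for value in row:
            pixel = 1 if value >= threshold else 0
            if rng.random() < flip_prob:
                pixel = 1 - pixel  # simulate sensor noise
            new_row.append(pixel)
        degraded.append(new_row)
    return degraded


clean = [[0.0, 1.0, 1.0, 0.0],
         [0.0, 1.0, 1.0, 0.0],
         [0.0, 1.0, 1.0, 0.0]]
noisy = degrade(clean, flip_prob=0.05, seed=42)
```

    Rendering the same text line several times with different parameter settings and seeds is what lets a modest text corpus grow into a large training set.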

  • ocrap test

    (py3.5) ➜  ocrla git:(master) pip install wand imutils regex pymysql boto3 pika reportlab docopt schema pyjarowinkler enum34 google-api-python-client numpy setproctitle scipy sklearn pycrypto requests fluent-logger pypdf2
  • Object detection with deep learning and OpenCV

    Deep Learning with OpenCV: http://www.pyimagesearch.com/2017/08/21/deep-learning-with-opencv/

    Object detection with deep learning and OpenCV: http://www.pyimagesearch.com/2017/09/11/object-detection-with-deep-learning-and-opencv/

  • Adnan Ul-Hasan's PhD thesis: References

  • Adnan Ul-Hasan's PhD thesis, Chapter 5: OCR of Printed Text

    In recent times, Machine Learning (ML) based algorithms have been able to achieve very promising results on many pattern recognition tasks, such as speech, handwriting, activity and gesture recognition. However, they have not been thoroughly evaluated to recognize printed text. Printed OCR is similar to other sequence learning tasks like speech and handwriting recognition and therefore it can also reap the benefits of high performing ML algorithms. Various challenges that are hampering the accomplishment of a robust OCR system have been discussed in Chapter 3. Upon looking at these challenges closely, one can realize that a human reader does not face many of these issues while reading a particular script. Human reading is powerful because of the ability to process the context of any text. Similarly, the internal feedback mechanism of Long Short-Term Memory (LSTM) networks enables them to process the context effectively; thereby rendering them highly suitable for text recognition tasks.

    This chapter discusses the use of LSTM networks for the OCR on three modern scripts. The first part of the chapter, Section 5.1, overviews the complete design of the LSTM-based OCR system. The second part, from Section 5.2 to Section 5.4, reports the experimental evaluations for modern English, Devanagari and Urdu Nastaleeq scripts.

    5.1 Design of LSTM-Based OCR System

    This section provides necessary details about the LSTM-based OCR methodology that has been used for the experiments reported for modern English (Section 5.2), Devanagari (Section 5.3) and Urdu Nastaleeq (Section 5.4). The LSTM networks have been described in detail in Appendix A. For experiments reported in this chapter, only 1D-LSTM networks have been utilized. MDLSTM architecture produced lower results in our preliminary experiments with printed English and Fraktur [BUHAAS13] and hence they are not considered. Some preliminary experiments on Urdu Nastaleeq script using Hierarchical Subsampling LSTM (HSLSTM) networks yield promising results; however, these networks have not yet been tested for other scripts.

    The complete process of using LSTM-based OCR system is shown in Figure 5.1. To use 1D-LSTM networks, the text-line image normalization is the only important preprocessing step. This is due to the fact that these networks are not translation invariant in vertical dimension, so this dimension has to be fixed prior to using these networks. Various normalization methods that have been used in this thesis are described in Appendix B. There are few free parameters that are needed to be tuned in order to use 1D-LSTM networks and they are discussed in the following section. The features used for the LSTM networks are described in Section 5.1.2, while the performance metric is defined in Section 5.1.3.
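    The requirement that the vertical dimension be fixed before a text line reaches a 1D-LSTM can be sketched as a simple rescale. The thesis's own normalization methods are in its Appendix B, which is not reproduced here, so the following is only an assumed baseline: nearest-neighbour scaling to a fixed height with the aspect ratio preserved. The function name `normalize_height` and the target height of 32 pixels are hypothetical.

```python
def normalize_height(line_image, target_height=32):
    """Rescale a text-line image to a fixed height, keeping the aspect ratio.

    `line_image` is a non-empty list of equal-length pixel rows.
    Nearest-neighbour sampling is used for simplicity.
    """
    src_height = len(line_image)
    src_width = len(line_image[0])
    scale = target_height / src_height
    target_width = max(1, round(src_width * scale))

    normalized = []
    for y in range(target_height):
        # Map each output row and column back to the nearest source pixel
        src_y = min(src_height - 1, int(y / scale))
        row = []
        for x in range(target_width):
            src_x = min(src_width - 1, int(x / scale))
            row.append(line_image[src_y][src_x])
        normalized.append(row)
    return normalized


line = [[0, 1, 1, 0],
        [1, 0, 0, 1]]
fixed = normalize_height(line, target_height=4)
# Each column of `fixed` now has 4 values and can be fed to the
# network as one time step of a left-to-right sequence.
```

    After this step every line has the same height, so the width becomes the only variable dimension and plays the role of time.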

  •  Full-Page Text Recognition: Learning Where to Start and When to Stop



    Text line detection and localization is a crucial step for full page document analysis, but still suffers from heterogeneity of real life documents. In this paper, we present a new approach for full page text recognition. Localization of the text lines is based on regressions with Fully Convolutional Neural Networks and Multidimensional Long Short-Term Memory as contextual layers. In order to increase the efficiency of this localization method, only the position of the left side of the text lines are predicted. The text recognizer is then in charge of predicting the end of the text to recognize. This method has shown good results for full page text recognition on the highly heterogeneous Maurdor dataset.

  • Adnan Ul-Hasan's PhD thesis, Chapter 8: A Generalized OCR Framework for Multiscript Documents

    Multilingual documents are common in the computer age of today. Plethora of these documents exist in the form of translations, books, operational manuals, etc. The abundance of these multilingual documents in everyday life is observed today due to two main reasons. Firstly, technological advancements are reaching in each and every corner of the world due to globalization, and there is an increasing need from the international customers to access the technology in their native language. This phenomenon has a two-fold impact: 1) operational manuals of electronic gadgets are required to be in multiple languages, 2) the access to knowledge available in other languages has become very easy; thereby, an increase in bilingual books and dictionaries has been witnessed. Secondly, English has become an international language, and the effect of this internationalization is evident by its impact on many languages. Several languages have adopted words from English and various documents, for instance newspapers, magazines, and articles, use many English words on a daily basis. Therefore, the need to develop reliable Multilingual OCR (MOCR) systems to digitize these documents has inflated manifold.

    Despite the increase in the availability of multilingual documents, automatic recognition of multilingual text remains a challenge. Popat [Pop12] pointed out several challenges in the context of the Google books project. Some of these unique challenges are:

      • Multiple scripts/languages on a single page.
      • Multiple languages in same or similar scripts, like Arabic-Persian, English-German.
      • The same language in multiple scripts, like Urdu in Nastaleeq and Naskh scripts.
      • Archaic and reformed orthographies, for example, 18th Century English, Fraktur (historical German).

    One solution to handle multilingual documents is to develop an OCR methodology that can recognize all characters of all scripts. However, it is commonly believed that such a generic OCR framework would be very difficult to realize [PD14]. The alternate process (as shown in Figure 8.1) is to employ a script identification step before recognizing the text. This step separates various scripts present in a document, so that a unilingual OCR model can be applied to recognize each script. This procedure, however, is unsatisfactory for many reasons, some of which are listed below:

      • The script identification is itself quite a challenging feat. Traditionally, it involves finding suitable features of the given script(s). One has to either fine tune these hand-crafted features or has to look for some other features, if the same script identification methodology has to be used for other scripts.
      • The process of script identification (see chapter 7) is not perfect, thereby the scripts recognized by such process can not be separated reliably. This directly affects the recognition accuracy of the OCR system employed.
      • Moreover, humans do not process the multilingual documents using the script identification step. A person possessing multilingual prowess reads a multilingual document in a similar manner as he/she would read a monolingual document.

    Hence the ultimate aim to OCR multilingual documents is to develop a generalized OCR system that can recognize all scripts. An MOCR system must be able to handle various scripts as well as it should be robust against the intraclass variations, that is, it should be able to recognize the letters despite slight variations in their shapes and sizes. Although the idea of generalized OCR system is not new, it has not been pursued greatly because of lack of computational powers and suitable algorithms to recognize all characters of multiple scripts. However, recent advancement in machine learning and pattern recognition fields have shown great promise on many tasks that were once considered very difficult. Moreover, these learning strategies are claimed to mimic the neural networks employed in the human brain. So they should be able to replicate the human capabilities in a better way than other neural networks.

    The main contribution of this chapter is a Generalized OCR framework that can be used to OCR multilingual and multiscript documents such that there is no need to employ the traditional script identification step. A sub-goal of this work is to highlight the discriminating power and sequence learning capability of LSTM networks for a large number of classes for OCR tasks. The trained LSTM networks can successfully discriminate hundreds of classes when it is trained for multiple scripts/languages simultaneously.

    The rest of this chapter is organized as follows. Section 8.1 reports the work done by other researchers to develop generalized OCR systems for multilingual documents. Our quest for a generalized OCR system starts with the development of a single OCR model that can recognize multilingual text in which all languages belong to a single script. Section 8.2 discusses the cross-language performance of LSTM networks. The next step of our quest is to extend the idea of “single OCR model” from multilingual documents to multiscript documents. A single OCR model that can recognize text in multiple scripts is the first step in realizing a generalized OCR system. Section 8.3 describes the design of LSTM-based generalized OCR framework in detail. Section 8.4 concludes the chapter with a brief summary and outlines some directions in which the present work can be further extended.

