Make a better Chinese character recognition OCR than Tesseract

Overview

deep ocr

See README_en.md for English installation documentation.

Tested only on Ubuntu. Installation requires virtualenv; adjust the installation paths as needed:

git clone https://github.com/JinpengLI/deep_ocr.git ~/deep_ocr
virtualenv ~/deep_ocr_env
source ~/deep_ocr_env/bin/activate
pip install -r ~/deep_ocr/requirements.txt
cd ~/deep_ocr && python setup.py install

Test

source ~/deep_ocr_env/bin/activate && cd ~/deep_ocr && ./bin/deep_ocr_reco data/holiday_notification.jpg -v -d
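
On a machine without a working display (for example over SSH), this command can fail with `_tkinter.TclError: couldn't connect to display`, as one report in the comments shows. A minimal sketch of a possible workaround follows, assuming the script does not select a matplotlib backend itself (not verified here): run it with the `MPLBACKEND` environment variable set to the non-interactive `Agg` backend. The helper names `headless_env` and `run_reco` are hypothetical and not part of deep_ocr.

```python
import os
import subprocess


def headless_env(base_env=None):
    """Return a copy of the environment that asks matplotlib for the Agg backend."""
    env = dict(os.environ if base_env is None else base_env)
    env["MPLBACKEND"] = "Agg"  # non-interactive backend, needs no X display
    return env


def run_reco(image_path, repo_dir):
    """Run deep_ocr_reco from the repository checkout with a headless environment.

    Assumes the virtualenv is already activated in the calling shell.
    """
    cmd = ["./bin/deep_ocr_reco", image_path, "-v", "-d"]
    return subprocess.call(cmd, cwd=repo_dir, env=headless_env())
```

For example, `run_reco("data/holiday_notification.jpg", os.path.expanduser("~/deep_ocr"))`. With `Agg` no window is shown, so any debug images would have to be saved to files instead.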

Legacy documentation

Parts of this still work, so it is kept for now; it will be removed later.

Many developers probably use tesseract for Chinese text recognition, but the results are very poor. Take the image below as an example:

[image]

$ tesseract -l chi_sim data/test_data.png out_test_data
看到恨多公司在招腭大改癫和机器字习胸人 v 我有3个建议 (T) 忧T ' 2个上t较靠遭
胸人就譬了 v不是越多越好 (2) 这T '2个人要能给大蒙上踝'倩邂知L目 (3) 不要招
不宣代四胸人:虹大改癫和机器字习胸v不裹目宣 (或者宣过) 大量代四v基本上就
只会忽悠了

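Text recognition is actually not that hard these days, especially with deep learning. Below is the output of this project's reco_chars.py script, which uses caffe for recognition. Isn't it much better? The code is also much shorter than tesseract's.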

$ python reco_chars.py
看很多公苘在招聘天数据和机器学习人我有个建议找个较靠谱
的人就够了不是越多越好这个人要给大家上课传递知识不要招
不写代码的人做天数据机器学习的不亲写或者写过天且代码基本上就
只会忽悠了

You can train on your own fonts with caffe. Single-character recognition in this system is based on the following paper:

Deep Convolutional Network for Handwritten Chinese Character Recognition

http://yuhao.im/files/Zhang_CNNChar.pdf

Installing via Docker

Install docker first. The following tutorial was tested on Ubuntu 14.04.

https://www.docker.com/

Download deep_ocr_workspace.zip (https://pan.baidu.com/s/1nvz2wrB and https://pan.baidu.com/s/1qYPKH3Y)

The md5sum values of the two files, for verifying that the downloads succeeded:

$ md5sum deep_ocr_workspace.zip
ffeda7ea6604e7b8835c05a33fa0459e  deep_ocr_workspace.zip
$ md5sum deep_ocr_workspace.z01
ea66796c2bbdb2bec9b7ee28eb44012d  deep_ocr_workspace.z01
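
The same check can be scripted. A minimal sketch using only the Python standard library, with the expected digests copied from the md5sum output above; the function names are hypothetical and the files are assumed to be in the current directory.

```python
import hashlib

# Expected digests copied from the md5sum output above.
EXPECTED_MD5 = {
    "deep_ocr_workspace.zip": "ffeda7ea6604e7b8835c05a33fa0459e",
    "deep_ocr_workspace.z01": "ea66796c2bbdb2bec9b7ee28eb44012d",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Return True if the file at `path` has the expected MD5 digest."""
    return md5_of_file(path) == expected_hex.lower()
```

Calling `verify_download(name, EXPECTED_MD5[name])` for each file name returns False for a truncated or corrupted download.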

Extract to the local disk, for example to the following location (~/deep_ocr_workspace):

cat deep_ocr_workspace.z* > unsplit_deep_ocr_workspace.zip
unzip unsplit_deep_ocr_workspace.zip -d ~/
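
Where `cat` is not available, the concatenation step can be sketched in Python. This only mirrors the `cat` command above (parts joined in sorted order, so `.z01` comes before `.zip`); extraction is still left to `unzip`. The function name and defaults are hypothetical.

```python
import glob
import os
import shutil


def join_split_zip(directory, prefix="deep_ocr_workspace",
                   out_name="unsplit_deep_ocr_workspace.zip"):
    """Concatenate <prefix>.z* parts in sorted order into one file."""
    # Sorted order matches the shell glob: ".z01" sorts before ".zip".
    parts = sorted(glob.glob(os.path.join(directory, prefix + ".z*")))
    out_path = os.path.join(directory, out_name)
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return out_path, parts
```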

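This zip contains all the data files deep_ocr needs (it is too large for the repository, so it is hosted on Baidu Cloud). All data is extracted to ~/deep_ocr_workspace; you can also put the data you want to process in this folder.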

CPU-based

docker pull jinpengli/deep_ocr_cpu_docker:latest

Start the docker container

docker run -ti --volume=${HOME}/deep_ocr_workspace:/workspace jinpengli/deep_ocr_cpu_docker:latest /bin/bash
cd /opt/deep_ocr
git pull origin master

The --volume option mounts the workspace into the container, so the recognition results can be retrieved from the host.

python /opt/deep_ocr/reco_chars.py
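
Several reports in the comments show this script failing with `RuntimeError: Could not open file` for files under the mounted workspace. A minimal sketch of a pre-flight check follows; the directory and file names are taken from those error messages, and the function name is hypothetical. The script may need other files as well (the reports also mention a `y_tag_json_path`, whose file name is not given here).

```python
import os

# Paths as they appear in the error reports in the comments.
MODEL_DIR = "/workspace/data/chongdata_caffe_cn_sim_digits_64_64"
REQUIRED_FILES = [
    "deploy_lenet_train_test.prototxt",
    "lenet_iter_50000.caffemodel",
]


def missing_model_files(model_dir=MODEL_DIR, required=REQUIRED_FILES):
    """Return the required file names that do not exist in model_dir."""
    return [
        name for name in required
        if not os.path.isfile(os.path.join(model_dir, name))
    ]
```

A non-empty result usually means the archive was not fully extracted or the volume was mounted from a different host directory.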

Then you can continue with your own development. Good luck!

ID card recognition

Not very stable for now; it needs some semantic models. Stay tuned.
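
One simple semantic check of this kind, which is not part of deep_ocr as far as this README states, is the check character of an 18-character resident ID number. A minimal sketch, assuming the standard GB 11643-1999 format (17 digits plus a mod-11 check character); the function name is hypothetical, and the valid sample number in the usage note is a commonly used illustrative value, not data from this project.

```python
# Weights and check characters defined by GB 11643-1999 (ISO 7064 MOD 11-2).
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"


def is_valid_id_number(text):
    """Return True if text is 17 digits followed by the correct check character."""
    text = text.strip().upper()
    if len(text) != 18:
        return False
    body, check = text[:17], text[17]
    if any(c not in "0123456789" for c in body):
        return False
    total = sum(int(d) * w for d, w in zip(body, WEIGHTS))
    return CHECK_CHARS[total % 11] == check
```

For instance, `is_valid_id_number("11010519491231002X")` passes, while a string with `X` in a non-final position, such as `1X21441114X221243X`, is rejected and could be flagged for re-recognition.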

Image to recognize

Run the command

export WORKSPACE=/workspace
deep_ocr_id_card_reco --img $DEEP_OCR_ROOT/data/id_card_img.jpg \
    --debug_path /tmp/debug \
    --cls_sim ${WORKSPACE}/data/chongdata_caffe_cn_sim_digits_64_64 \
    --cls_ua ${WORKSPACE}/data/chongdata_train_ualpha_digits_64_64

Recognition result:

...
ocr res:
============================================================
name
韦小宝
============================================================
address
北京市东城区累山前街4号
紫禁城敬事房
============================================================
month
12
============================================================
minzu
汉
============================================================
year
1654
============================================================
sex
男
============================================================
id
1X21441114X221243X
============================================================
day
20
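
The printed result can be turned into a dictionary for further processing. A minimal sketch, assuming the output always has the shape shown above (an `ocr res:` line, then blocks separated by lines of `=` in which the first line is the field name and the remaining lines are the value); the function name is hypothetical.

```python
def parse_ocr_result(text):
    """Parse the 'ocr res:' section into {field_name: value}."""
    fields = {}
    key = None
    started = False
    for raw in text.splitlines():
        line = raw.strip()
        if not started:
            # Skip everything up to and including the "ocr res:" line.
            started = line.startswith("ocr res:")
            continue
        if not line:
            continue
        if set(line) == {"="}:
            key = None  # separator: the next non-empty line is a field name
            continue
        if key is None:
            key = line
            fields[key] = []
        else:
            fields[key].append(line)
    # Multi-line values (such as the address) are joined without a separator.
    return {name: "".join(lines) for name, lines in fields.items()}
```

Applied to the output above, this would give entries such as `name`, `address`, `year`, `month`, `day`, `sex`, `minzu` and `id`, with the two address lines joined into one string.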

Comments
  • RuntimeError: Could not open file

    I downloaded deep_ocr_workspace.zip and reco_chars.py, and running the script produces the error below. Your archive also fails to extract on Windows. I suspect the archive itself is broken.

    root@orange-VirtualBox:~/caffe/python# python reco_chars.py
    WARNING: Logging before InitGoogleLogging() is written to STDERR
    W1223 17:24:19.496032 3764 _caffe.cpp:122] DEPRECATION WARNING - deprecated use of Python interface
    W1223 17:24:19.496183 3764 _caffe.cpp:123] Use this instead (with the named "weights" parameter):
    W1223 17:24:19.496206 3764 _caffe.cpp:125] Net('/workspace/data/chongdata_caffe_cn_sim_digits_64_64/deploy_lenet_train_test.prototxt', 1, weights='/workspace/data/chongdata_caffe_cn_sim_digits_64_64/lenet_iter_50000.caffemodel')
    Traceback (most recent call last):
      File "test.py", line 294, in <module>
        caffe_cls = CaffeCls(model_def, model_weights, y_tag_json_path)
      File "test.py", line 20, in __init__
        caffe.TEST)
    RuntimeError: Could not open file /workspace/data/chongdata_caffe_cn_sim_digits_64_64/lenet_iter_50000.caffemodel

    opened by 984958198 9
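  • Found the bug!!

    A Python bug: the characters 㧟 䏝 㤘 䥽 䁖 䦃 㸆 cannot be drawn with PIL. Try it if you don't believe me. I don't normally use Python; where should a bug like this be reported? Simple test code:

    from PIL import Image, ImageDraw, ImageFont

    font = ImageFont.truetype("STXIHEI.TTF", 300)
    img = Image.new("L", (300, 300), "black")
    draw = ImageDraw.Draw(img)
    ch = u'㸆'
    draw.text((0, -75), ch, 255, font=font)
    img.show()

    Give it a try, haha. Find STXIHEI.TTF first; it ships with the system. It is the STXihei (华文细黑) font!

    opened by chibai 2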
  • error: your Baidu Cloud upload has not been updated; y_tag_json only contains ABC

    1. The script fails with a KeyError:

       File "/opt/deep_ocr/reco_chars.py", line 53, in _predict_cv2_imgs_sub
         item = (self.y_tag_json[str(index)],
       KeyError: '4528'

    2. Also, with a jpg that contains only letters such as abc, nothing at all is printed after "ocr res:".

    opened by Jayhello 2
  • Could not open file /workspace/data/chongdata_caffe_cn_sim_digits_64_64/deploy_lenet_train_test.prototxt

    I carefully checked every dependency, the model file paths, the extraction and so on, and I still get this error. So is the archive you uploaded broken?

    Traceback (most recent call last):
      File "reco_chars.py", line 294, in <module>
        caffe_cls = CaffeCls(model_def, model_weights, y_tag_json_path)
      File "reco_chars.py", line 20, in __init__
        caffe.TEST)
    RuntimeError: Could not open file /workspace/data/chongdata_caffe_cn_sim_digits_64_64/deploy_lenet_train_test.prototx

    best regards!

    opened by LuWei6896 1
  • Error when running deep_ocr_make_caffe_dataset

    Hello: when I run the deep_ocr_make_caffe_dataset command from lesson4 of 虫数据 (chongdata), the images folder is created but no images are generated. The error is:

      File "/opt/deep_ocr/bin/deep_ocr_make_caffe_dataset", line 83, in <module>
        lang_chars = lang_chars_gen.do()
      File "build/bdist.linux-x86_64/egg/deep_ocr/lang_aux.py", line 27, in do
    ImportError: No module named langs.lower_eng

    langs has already been added to the Python modules. What could be causing this?

    opened by tonightcode 1
  • ModuleNotFoundError: No module named 'deep_ocr.ocrolib'

    Traceback (most recent call last):
      File "./bin/deep_ocr_reco", line 19, in <module>
        import deep_ocr.ocrolib as ocrolib
    ModuleNotFoundError: No module named 'deep_ocr.ocrolib'

    opened by k1ic 1
  • Failed to parse NetParameter file

    I get the error below:

    python3 reco_chars.py
    [libprotobuf ERROR google/protobuf/text_format.cc:274] Error parsing text-format caffe.NetParameter: 6:15: Message type "caffe.LayerParameter" has no field named "input_param".
    WARNING: Logging before InitGoogleLogging() is written to STDERR
    F0103 17:15:15.282599 7488 upgrade_proto.cpp:928] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: workspace/data/chongdata_caffe_cn_sim_digits_64_64/deploy_lenet_train_test.prototxt
    *** Check failure stack trace: ***
    Aborted (core dumped)

    What is wrong with this file?

    best regards!

    opened by LuWei6896 0
  • Minor image path issue under Docker

    The line at https://github.com/JinpengLI/deep_ocr/blob/450148c0c51b3565a96ac2f3c94ee33022e55307/reco_chars.py#L296 should be changed to

    test_image = "/opt/deep_ocr/data/test_data.png"

    With that change the default test works under docker:

    root@9db2c4c3f5f9:/# python /opt/deep_ocr/reco_chars.py

    opened by zkailinzhang 0
  • Question

    Hello, after setting up the virtual environment by following the steps, I ran the following:

    haiyun@dell-Precision-Tower-5810:~$ source ~/deep_ocr_env/bin/activate && cd ~/deep_ocr && ./bin/deep_ocr_reco data/holiday_notification.jpg -v -d
    ! image to reco: data/holiday_notification.jpg
    Traceback (most recent call last):
      File "./bin/deep_ocr_reco", line 137, in <module>
        show_img(raw, title="raw image")
      File "./bin/deep_ocr_reco", line 27, in show_img
        plt.gray()
      File "/home/haiyun/deep_ocr_env/lib/python2.7/site-packages/matplotlib/pyplot.py", line 3932, in gray
        set_cmap("gray")
      File "/home/haiyun/deep_ocr_env/lib/python2.7/site-packages/matplotlib/pyplot.py", line 2372, in set_cmap
        im = gci()
      File "/home/haiyun/deep_ocr_env/lib/python2.7/site-packages/matplotlib/pyplot.py", line 335, in gci
        return gcf()._gci()
      File "/home/haiyun/deep_ocr_env/lib/python2.7/site-packages/matplotlib/pyplot.py", line 601, in gcf
        return figure()
      File "/home/haiyun/deep_ocr_env/lib/python2.7/site-packages/matplotlib/pyplot.py", line 548, in figure
        **kwargs)
      File "/home/haiyun/deep_ocr_env/lib/python2.7/site-packages/matplotlib/backend_bases.py", line 161, in new_figure_manager
        return cls.new_figure_manager_given_figure(num, fig)
      File "/home/haiyun/deep_ocr_env/lib/python2.7/site-packages/matplotlib/backends/_backend_tk.py", line 1044, in new_figure_manager_given_figure
        window = Tk.Tk(className="matplotlib")
      File "/home/haiyun/install/python-2.7.11/lib/python2.7/lib-tk/Tkinter.py", line 1814, in __init__
        self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use)
    _tkinter.TclError: couldn't connect to display ":0"

    I searched online for the message _tkinter.TclError: couldn't connect to display ":0" and tried to fix it, without success. Any pointers would be appreciated. Many thanks!

    opened by jt387 1
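
The `KeyError: '4528'` report above is raised where reco_chars.py looks up `self.y_tag_json[str(index)]`, which suggests that the label map and the model disagree on the number of classes. A minimal sketch of a tolerant lookup that records the missing indices instead of crashing; `decode_indices` is a hypothetical helper, not a function of deep_ocr, and the label map is assumed to be a dict keyed by the class index as a string.

```python
def decode_indices(indices, y_tag_json, unknown=u"?"):
    """Map predicted class indices to characters, collecting unknown ones."""
    chars = []
    missing = []
    for index in indices:
        key = str(index)  # the label map is keyed by the index as a string
        if key in y_tag_json:
            chars.append(y_tag_json[key])
        else:
            chars.append(unknown)
            missing.append(key)
    return u"".join(chars), missing
```

A non-empty `missing` list points at a label file that does not match the trained model, which is easier to diagnose than a bare KeyError.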
Owner
Jinpeng