Open-source code for extracting transaction information using OCR.

Nguyen Xuan Hung

Last update: Jun 2, 2022

Related tags

Computer Vision python api ocr transaction ocr-python google-ocr

Overview

Transaction OCR

Source code for extracting transaction information from scanned PDF files; here I chose the publicly released bank statements of Thuy Tien. The code can be applied to problems that involve extracting text from images (OCR, Optical Character Recognition) where the content has a defined structure and the information rows can be of arbitrary length, such as transaction records or purchase invoices. The code uses the Cloud Vision API as its OCR model to obtain high accuracy. Alternatively, you can use an existing model such as Vietocr, or build a custom Vietnamese OCR from clovaai's text-detection and text-recognition projects, which I consider quite good.
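
As a rough illustration of what a Cloud Vision OCR request contains, the sketch below builds the JSON body for the public REST endpoint `images:annotate` using only the standard library. It is not taken from this repository: the helper name `build_annotate_request` and the `vi` language hint are assumptions, and sending the request (which needs an API key or OAuth token) is left out.

```python
import base64
import json

# Public REST endpoint of the Cloud Vision API (authentication not shown).
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_annotate_request(image_bytes: bytes, language_hints=("vi",)) -> dict:
    """Build the JSON body for one DOCUMENT_TEXT_DETECTION request.

    Hypothetical helper for illustration only; the repository may build
    its request differently (for example through a client library).
    """
    return {
        "requests": [
            {
                # Image bytes are sent base64-encoded inside the JSON payload.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                # DOCUMENT_TEXT_DETECTION is the feature meant for dense text.
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                # Language hint is an assumption for Vietnamese statements.
                "imageContext": {"languageHints": list(language_hints)},
            }
        ]
    }


if __name__ == "__main__":
    body = build_annotate_request(b"fake image bytes")
    print(VISION_ENDPOINT)
    print(json.dumps(body, indent=2))
```

The response to such a request carries both the detected boxes and the recognized text, which is why a single call can stand in for separate text-detection and text-recognition models.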

Getting Started

Dependency

Google Cloud API | Cloud Vision API
Public document: SAO KÊ MIỀN TRUNG

git clone https://github.com/hungtooc/transaction_ocr.git

pip install -r requirements.txt

1. Prepare input data

1.1 Download raw data

Download the raw PDF files from the Drive link: https://drive.google.com/drive/folders/1SoWOGaAy92tZUgG7mwhJzoeBsDpxVO80?usp=sharing
Extract them and put them in data/input

1.2 Convert PDF files to images

PDF password: Vcbsaoke@2021

python tools/pdf-to-images.py --pdf-password Vcbsaoke@2021

usage: pdf-to-images.py [-h] [--pdf-dir PDF_DIR] [--output-dir OUTPUT_DIR] [--pdf-password PDF_PASSWORD] [--from-page-no FROM_PAGE_NO] [--to-page-no TO_PAGE_NO] [--fix-page-number FIX_PAGE_NUMBER]

optional arguments:
  -h, --help            show this help message and exit
  --pdf-dir PDF_DIR     dir to pdf files
  --output-dir OUTPUT_DIR
                        dir to save images
  --pdf-password PDF_PASSWORD
                        pdf password
  --from-page-no FROM_PAGE_NO
                        extra image from page
  --to-page-no TO_PAGE_NO
                        extra image to page
  --fix-page-number FIX_PAGE_NUMBER
                        fix page number (page_no += fix_page_number)
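
To make the page options concrete, here is a minimal sketch of how `--from-page-no`, `--to-page-no` and `--fix-page-number` could interact. It only mirrors the rule stated in the help text (`page_no += fix_page_number`); the function name, the 1-based inclusive bounds and the `page_<n>` label are assumptions, not the script's actual code.

```python
def output_page_numbers(total_pages, from_page_no=1, to_page_no=None, fix_page_number=0):
    """Return (page_in_file, saved_page_no) pairs for the selected range.

    Hypothetical helper: assumes 1-based, inclusive page bounds.
    """
    if to_page_no is None:
        to_page_no = total_pages
    first = max(1, from_page_no)
    last = min(total_pages, to_page_no)
    # The saved number is the in-file page number shifted by the offset.
    return [(page, page + fix_page_number) for page in range(first, last + 1)]


if __name__ == "__main__":
    # Example: a second PDF file whose first page should be saved as page 1001.
    for page, saved in output_page_numbers(1000, from_page_no=1, to_page_no=3, fix_page_number=1000):
        print(f"page {page} in file -> page_{saved}")
```

An offset like this is useful when the statement is split across several PDF files and the saved images should keep one continuous page numbering.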

2. Extract transaction information

The source code performs the basic steps needed to extract transaction information. You may want to add further processing at the lines marked #todo to optimize it.

python run.py

usage: run.py [-h] [--image-dir IMAGE_DIR] [--output-respone-dir OUTPUT_RESPONE_DIR] [--output-content-dir OUTPUT_CONTENT_DIR] [--processed-log-file PROCESSED_LOG_FILE]

optional arguments:
  -h, --help            show this help message and exit
  --image-dir IMAGE_DIR
                        dir to images
  --output-respone-dir OUTPUT_RESPONE_DIR
                        dir to save api respone
  --output-content-dir OUTPUT_CONTENT_DIR
                        dir to save transaction content
  --processed-log-file PROCESSED_LOG_FILE
                        path to log file

`run.py` performs 7 main steps:

Step 1. Find the header & footer.
Step 2. Re-rotate the image based on the header corners.
Step 3. Clean the image.
Step 4. Call the google-ocr API (includes text detection & text recognition).
Step 5. Detect transaction lines.
Step 6. Classify the transaction content of each line by content type.
Step 7. Save the transaction content to csv.
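
One possible way to approach Step 5 is sketched below; it is an illustration under stated assumptions, not the logic used in `run.py`. OCR words are first grouped into visual rows by their vertical centre, then rows are merged into transactions on the assumption that every transaction starts with a `dd/mm/yyyy` date and any row without one continues the previous transaction. The `Word` class, both function names and the `tolerance` parameter are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float         # left edge of the bounding box
    y_top: float     # top edge
    y_bottom: float  # bottom edge

    @property
    def y_center(self) -> float:
        return (self.y_top + self.y_bottom) / 2


def group_words_into_rows(words, tolerance=0.5):
    """Group words into visual rows by vertical centre.

    A word joins the current row when its centre lies within
    tolerance * (height of the row's first word) of that word's centre.
    """
    rows = []
    for word in sorted(words, key=lambda w: w.y_center):
        if rows:
            anchor = rows[-1][0]
            height = anchor.y_bottom - anchor.y_top
            if abs(word.y_center - anchor.y_center) <= tolerance * height:
                rows[-1].append(word)
                continue
        rows.append([word])
    # Read each row left to right.
    return [" ".join(w.text for w in sorted(row, key=lambda w: w.x)) for row in rows]


# Assumption: a transaction always begins with a dd/mm/yyyy date.
DATE_AT_START = re.compile(r"^\d{2}/\d{2}/\d{4}\b")


def merge_rows_into_transactions(rows):
    """Merge continuation rows into the transaction that precedes them."""
    transactions = []
    for row in rows:
        if DATE_AT_START.match(row) or not transactions:
            transactions.append(row)
        else:
            transactions[-1] += " " + row
    return transactions


if __name__ == "__main__":
    words = [
        Word("13/10/2020", 10, 100, 120), Word("5091.55821", 120, 101, 121),
        Word("100.000", 230, 99, 119), Word("Ung", 330, 100, 120),
        Word("ho", 370, 100, 120), Word("mien", 330, 125, 145),
        Word("trung", 380, 126, 146), Word("13/10/2020", 10, 160, 180),
        Word("5091.56080", 120, 160, 180),
    ]
    for transaction in merge_rows_into_transactions(group_words_into_rows(words)):
        print(transaction)
```

Splitting on the date column rather than on fixed row heights is what lets a transaction span any number of lines.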

TNX Date	Doc No	Credit	Transaction in detail	(note)
13/10/2020	5091.55821	100.000	586062.131020.075756.Ung ho mien trung FT20287151644070	page_1
13/10/2020	5091.56080	1.000.000	586279.131020.075829.Ung ho dong bao mien Trung FT20287592192480	page_1
13/10/2020	5091.56138	200.000	219987.131020.075839.Trinh Thi Thu Thuy chuyen tien ung ho mien Trung	page_1
13/10/2020	5091.56155	100.000	586295.131020.075826.UH mien trung FT20287432289640	page_1
13/10/2020	5078.68388	500.000	MBVCB.807033343.PHAM THUY TRANG chuyen tien ung ho tu thien.CT tu 0561000606153 PHAM THUY TRANG toi 0181003469746 TRAN THI THUY TIEN	page_1
13/10/2020	5091.56261	1.000.000	184997.131020.075853.Em gui giup do ba con vung lu	page_1
13/10/2020	5078.68496	200.000	MBVCB.807033583.Ung ho mien trung.CT tu 0051000531310 HUYNH THI NHU Y toi 0181003469746 TRAN THI THUY TIEN	page_1
13/10/2020	5078.68526	100.000	MBVCB.807033514.ung ho mien trung.CT tu 0481000903279 NGUYEN THI HUONG AN toi 0181003469746 TRAN THI THUY TIEN	page_1
13/10/2020	5091.56381	100.000	479592.131020.075909.ho tro mien trung	page_1
13/10/2020	5078.68537	500.000	MBVCB.807034561.Ung ho Mien trung.CT tu 0721000588146 LE THI HONG DIEM toi 0181003469746 TRAN THI THUY TIEN	page_1
13/10/2020	5091.56405	200.000	292363.131020.075845.Ngan hang TMCP Ngoai Thuong Viet Nam 0181003469746 LUC NGHIEM LE chuyen khoan ung ho mien trung	page_1
13/10/2020	5091.56410	500.000	479627.131020.075913.Ung ho mien trung	page_1
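
If you want to work with the extracted rows programmatically, a row like the ones above can be parsed into typed fields. The sketch below assumes tab-separated columns in the order shown and treats the dots in the Credit column as thousands separators; the delimiter of the real csv files may differ, and the `Transaction` type and function name are made up for this example.

```python
from datetime import date, datetime
from typing import NamedTuple


class Transaction(NamedTuple):
    tnx_date: date
    doc_no: str
    credit: int
    detail: str
    note: str


def parse_transaction_row(line: str, delimiter: str = "\t") -> Transaction:
    """Parse one row: TNX Date, Doc No, Credit, Transaction in detail, note."""
    tnx_date, doc_no, credit, detail, note = line.rstrip("\n").split(delimiter)
    return Transaction(
        tnx_date=datetime.strptime(tnx_date, "%d/%m/%Y").date(),
        doc_no=doc_no,  # kept as text: "5091.56080" is an identifier, not a number
        credit=int(credit.replace(".", "")),  # "1.000.000" -> 1000000
        detail=detail,
        note=note,
    )


if __name__ == "__main__":
    row = (
        "13/10/2020\t5091.56080\t1.000.000\t"
        "586279.131020.075829.Ung ho dong bao mien Trung FT20287592192480\tpage_1"
    )
    print(parse_transaction_row(row))
```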

3. Export Excel

Export each CSV directory to an Excel file. Example:

python tools/export-excel.py --csv-dir "data/content/TÀI KHOẢN XXX746 (Pass_ Vcbsaoke@2021)/TỪ 13.10.20 ĐẾN 23.11.20/1. TRANG 1 -1000.pdf"

usage: export-excel.py [-h] --csv-dir CSV_DIR [--output-dir OUTPUT_DIR] [--transaction-template TRANSACTION_TEMPLATE] [--filename FILENAME]

optional arguments:
  -h, --help            show this help message and exit
  --csv-dir CSV_DIR     csv dir
  --output-dir OUTPUT_DIR
                        output dir
  --transaction-template TRANSACTION_TEMPLATE
                        dir to save transaction content
  --filename FILENAME   output filename, leave blank to set default

4. Extract dataset

From the saved API responses, you can extract a dataset to train a text-recognition model:

python tools/extract-dataset.py

usage: extract-dataset.py [-h] [--respone-dir RESPONE_DIR] [-a OUTPUT_ANNOTATION] [-i OUTPUT_IMAGE_DIR]

optional arguments:
  -h, --help            show this help message and exit
  --respone-dir RESPONE_DIR
                        dir to api respone
  -a OUTPUT_ANNOTATION, --output-annotation OUTPUT_ANNOTATION
                        path to save annotation file
  -i OUTPUT_IMAGE_DIR, --output-image-dir OUTPUT_IMAGE_DIR
                        path to save annotation file

Dataset of the first 1000 pages labeled by google-ocr (~336k): Google Drive
Tip: you may want to balance the data by text type before extracting.
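
One simple way to balance by text type is to cap the number of samples per type, as in the sketch below. The three buckets (date, number, text), the `(image_path, label)` sample shape and both function names are assumptions for illustration; they are not defined by the extraction script.

```python
import random
import re
from collections import defaultdict


def text_type(label: str) -> str:
    """Assign a label to a rough text-type bucket (assumed categories)."""
    if re.fullmatch(r"\d{2}/\d{2}/\d{4}", label):
        return "date"
    if re.fullmatch(r"[\d.,]+", label):
        return "number"
    return "text"


def balance_by_text_type(samples, max_per_type, seed=0):
    """Downsample (image_path, label) pairs so no type exceeds max_per_type."""
    buckets = defaultdict(list)
    for sample in samples:
        buckets[text_type(sample[1])].append(sample)
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    balanced = []
    for kind in sorted(buckets):
        items = buckets[kind]
        if len(items) > max_per_type:
            items = rng.sample(items, max_per_type)
        balanced.extend(items)
    return balanced


if __name__ == "__main__":
    samples = [(f"img_{i}.png", "13/10/2020") for i in range(5)]
    samples += [(f"num_{i}.png", "100.000") for i in range(3)]
    samples += [("txt_0.png", "Ung ho mien trung")]
    for path, label in balance_by_text_type(samples, max_per_type=2):
        print(path, label)
```

Without a cap like this, dates and amounts, which repeat on every row, would likely dominate the training set over free-text transaction details.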

5. Result

18107 transaction statement pages have been extracted from the PDF files: Google Drive. Accuracy: >99%.

Releases (v1.1)

v1.1 (Oct 19, 2021)

v1.0 (Oct 8, 2021)
