296 Python Ocr-pdf Libraries

A Certificate renaming tool made for IEEE CS SBC, SJCE.

PDF Batch Renamer Made for IEEE CS SBC, SJCE How to use? Before using the python script, ensure that pytesseract, pdf2image, opencv and other supporti

2 Nov 14, 2021

Searching keywords in PDF file folders

keyword_searching Steps to use this Python scripts： (1)Paste this script into the file folder containing the PDF files you need to search from; (2)Thi

1 Nov 8, 2021

This is a file deletion program that asks you for an extension of a file (.mp3, .pdf, .docx, etc.) to delete all of the files in a dir that have that extension.

FileBulk This is a file deletion program that asks you for an extension of a file (.mp3, .pdf, .docx, etc.) to delete all of the files in a dir that h

1 Jun 26, 2022

pikepdf is a Python library for reading and writing PDF files.

A Python library for reading and writing PDF, powered by qpdf

1.6k Jan 3, 2023

This is a GUI for scrapping PDFs with the help of optical character recognition making easier than ever to scrape PDFs.

pdf-scraper-with-ocr With this tool I am aiming to facilitate the work of those who need to scrape PDFs either by hand or using tools that doesn't imp

75 Oct 21, 2022

OCR Post Correction for Endangered Language Texts

📌 Coming soon: an update to the software including features from our paper on semi-supervised OCR post-correction, to be published in the Transaction

96 Dec 31, 2022

Small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface

1.8k Dec 29, 2022

Python-based tools for document analysis and OCR

ocropy OCRopus is a collection of document analysis programs, not a turn-key OCR system. In order to apply it to your documents, you may need to do so

3.2k Dec 31, 2022

Tool which allow you to detect and translate text.

Text detection and recognition This repository contains tool which allow to detect region with text and translate it one by one. Description Two pretr

176 Nov 28, 2022

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

hocr-tools About About the code Installation System-wide with pip System-wide from source virtualenv Available Programs hocr-check -- check the hOCR f

285 Dec 8, 2022

Generate a preview image for a PDF.

PDF ➡️ Preview A simple tool to save me time on Illustrator. Generates a preview image for a PDF file. Useful for sneak peeks to academic publications

51 Sep 22, 2022

Program that locks/unlocks pdf files🐍

🐍 📄 PDFtools 📄 🐍 Programa que bloqueia/desbloqueia arquivos pdf Requisitos • Como usar • Capturas de Tela 🚨 Aviso 🚨 Altere os caminhos referente

1 Nov 4, 2021

Classical OCR DCNN reproduction based on PaddlePaddle framework.

Paddle-SVHN Classical OCR DCNN reproduction based on PaddlePaddle framework. This project reproduces Multi-digit Number Recognition from Street View I

1 Nov 12, 2021

Converting Html files to pdf using python script, pdfkit module and wkhtmltopdf.

Html-to-pdf-pdfkit-wkhtml- This repository has code for converting local html files and online html resources into pdf. It is an python script which u

1 Nov 9, 2021

this is simple program, that converts pdf file to png

author: a5892731 last update:2021-11-01 version: 1.1 resources: -https://pypi.org/project/pdf2image/ -https://github.com/oschwartz10612/poppler-window

1 Nov 1, 2021

Convert markdown to HTML using the GitHub API and some additional tweaks with Python.

Convert markdown to HTML using the GitHub API and some additional tweaks with Python. Comes with full formula support and image compression.

70 Dec 23, 2022

This is a implementation of CRAFT OCR method

0 Nov 1, 2021

a simple ehentai downloader with jpg 2 pdf

Simple_Ehentai_DownLoader a simple ehentai downloader with jpg 2 pdf 中文介绍 Environment python3.8 How to use before you start,there are some tips. the q

6 Dec 11, 2022

This is a GUI for scrapping PDFs with the help of optical character recognition making easier than ever to scrape PDFs.

pdf-scraper-with-ocr With this tool I am aiming to facilitate the work of those who need to scrape PDFs either by hand or using tools that doesn't imp

75 Oct 21, 2022

Dirty, ugly, and hopefully useful OCR of Facebook Papers docs released by Gizmodo

Quick and Dirty OCR of Facebook Papers Gizmodo has been working through the Facebook Papers and releasing the docs that they process and review. As lu

2 Oct 28, 2021

Generate custom detailed survey paper with topic clustered sections and proper citations, from just a single query in just under 30 mins !!

Auto-Research A no-code utility to generate a detailed well-cited survey with topic clustered sections (draft paper format) and other interesting arti

20 Dec 14, 2022

A Vietnamese personal card OCR website built with Django.

Django VietCardOCR Installation Creation of virtual environments is done by executing the command venv: python -m venv venv That will create a new fol

4 Sep 4, 2021

Table automatically extraction from PDF Document

PDF Table Extractor Table automatically extraction from PDF Document Our Icon 📌 Name : PDF Table Extractor 📌 Authors : Minku Koo Jiyong Park 📌 Deve

1 Jan 10, 2022

Convolutional Recurrent Neural Network (CRNN) for image-based sequence recognition.

Convolutional Recurrent Neural Network This software implements the Convolutional Recurrent Neural Network (CRNN), a combination of CNN, RNN and CTC l

2k Dec 31, 2022

Key information extraction from invoice document with Graph Convolution Network

Key Information Extraction from Scanned Invoices Key information extraction from invoice document with Graph Convolution Network Related blog post fro

39 Dec 16, 2022

Convert lecture videos to slides in one line. Takes an input of a directory containing your lecture videos and outputs a directory containing .PDF files containing the slides of each lecture.

12 Sep 10, 2022

Machine Leaning applied to denoise images to improve OCR Accuracy

Machine Learning to Denoise Images for Better OCR Accuracy This project is an adaptation of this tutorial and used only for learning purposes: https:/

2 Nov 16, 2022

A bot that extract text from images using the Tesseract OCR.

Text from image (OCR) @ocr_text_bot A simple bot to extract text from images. Usage What do I need? A AWS key configured locally, see here. NodeJS. I

4 Aug 6, 2021

Use Youdao OCR API to covert your clipboard image to text.

Alfred Clipboard OCR 注：本仓库基于 oott123/alfred-clipboard-ocr 的逻辑用 Python 重写，换用了有道 AI 的 API，准确率更高，有效防止百度导致隐私泄露等问题，并且有道 AI 初始提供的 50 元体验金对于其资费而言个人用户基本可以永久使用

6 Sep 19, 2022

A simple pdf size compressing telegram robot witten in python.

Pdf Compressor Telegram Bot ##About : A simple pdf size compressing telegram robot witten in python. Mostly useful for digital documentation. Deploy t

22 Oct 28, 2022

Usando o Amazon Textract como OCR para Extração de Dados no DynamoDB

dio-live-textract2 Repositório de código para o live coding do dia 05/10/2021 sobre extração de dados estruturados e gravação em banco de dados a part

0 Jan 19, 2022

Arxiv2Kindle is a simple script written in python that converts LaTeX source downloaded from Arxiv and recompiles it to better fit a Kindle or other similar reading devices.

Arxiv2Kindle is a simple script written in python that converts LaTeX source downloaded from Arxiv and recompiles it to better fit a read

8 Jul 9, 2022

Convert Lecture Videos to PDF

Convert Lecture Videos to PDF Description Want to go through lecture videos faster without missing any information? Wish you can read the lecture vide

20 Nov 25, 2022

Ground truth data for the Optical Character Recognition of Historical Classical Commentaries.

OCR Ground Truth for Historical Commentaries The dataset OCR ground truth for historical commentaries (GT4HistComment) was created from the public dom

3 Sep 8, 2022

a micro OCR network with 0.07mb params.

MicroOCR a micro OCR network with 0.07mb params. Layer (type) Output Shape Param # Conv2d-1 [-1, 64, 8,

29 Aug 6, 2022

Programa que viabiliza a OCR (Optical Character Reading - leitura óptica de caracteres) de um PDF.

Este programa tem o intuito de ser um modificador de arquivos PDF. Os arquivos PDFs podem ser 3: PDFs verdadeiros - em que podem ser selecionados o ti

2 Oct 11, 2021

Simple python tool created for downloading PDF.

PDFdownloader Usage Open PDF in full-screen mode Run scan.exe Enter how many pages you want to scan Focus PDF After scanning is done, run merge.exe En

5 Oct 27, 2021

It is a image ocr tool using the Tesseract-OCR engine with the pytesseract package and has a GUI.

OCR-Tool It is a image ocr tool made in Python using the Tesseract-OCR engine with the pytesseract package and has a GUI. This is my second ever pytho

4 Jul 11, 2022

This book will take you on an exploratory journey through the PDF format, and the borb Python library.

281 Jan 1, 2023

Docbarcodes extracts 1D and 2D barcodes from scanned PDF documents or images. It can be used to automate extraction and processing of all kind of documents.

Intro Barcodes are being used in many documents or forms to enable machine reading capabilities and reduce manual processing effort. Simple 1D barcode

3 Jun 18, 2022

METS/ALTO OCR enhancing tool by the National Library of Luxembourg (BnL)

Nautilus-OCR The National Library of Luxembourg (BnL) started its first initiative in digitizing newspapers, with layout recognition and OCR on articl

36 Dec 5, 2022

Website desenvolvido em Django para gerenciamento e upload de arquivos (.pdf).

Website para Gerenciamento de Arquivos Features Esta é uma aplicação full stack web construída para desenvolver habilidades com o framework Django. O

8 Sep 22, 2022

distfit - Probability density fitting

Python package for probability density function fitting of univariate distributions of non-censored data

187 Dec 30, 2022

An application which enables the users to perform simple yet intriguing PDF operations

AstutePDF A repository containing the GUI for an application which enables the users to perform simple yet intriguing PDF operations. These include, M

5 Jan 22, 2022

I can help you convert your images to pdf file.

IMAGE TO PDF CONVERTER BOT Configs TOKEN - Get bot token from @BotFather API_ID - From my.telegram.org API_HASH - From my.telegram.org Deploy to Herok

10 Dec 14, 2022

Parser manager for parsing DOC, DOCX, PDF or HTML files

Parser manager Description Parser gets PDF, DOC, DOCX or HTML file via API and saves parsed data to the database. Implemented in Ruby 3.0.1 using Acti

4 Dec 4, 2021

docTR by Mindee (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.

1.5k Jan 1, 2023

Convert emails without attachments to pdf and send as email

Email to PDF to email This script will check an imap folder for unread emails. Any unread email that does not have an attachment will be converted to

21 Nov 22, 2022

Gera um PDF, logo depois de você responder um questionário simples, e envia para o e-mail que você informar.

PDF generator and send it for your email Criador: Francisco Robson de O. Dutra Filho Repositório criado no dia 18/09/2021 Instagram: @robsondutra_ Sob

8 Nov 22, 2021

x-ray is a Python library for finding bad redactions in PDF documents.

A tool to detect whether a PDF has a bad redaction

73 Dec 19, 2022

A Screen Translator/OCR Translator made by using Python and Tesseract, the user interface are made using Tkinter. All code written in python.

About An OCR translator tool. Made by me by utilizing Tesseract, compiled to .exe using pyinstaller. I made this program to learn more about python. I

41 Dec 30, 2022

Python lib for Simple PDF text extraction

651 Jan 1, 2023

WeasyPrint is a smart solution helping web developers to create PDF documents.

WeasyPrint is a smart solution helping web developers to create PDF documents. It turns simple HTML pages into gorgeous statistical reports, invoices, tickets…

5.4k Jan 8, 2023

Generate text line images for training deep learning OCR model (e.g. CRNN)

532 Jan 6, 2023

borb is a library for reading, creating and manipulating PDF files in python.

2.9k Jan 1, 2023

Fetch McDonald invoices from mailbox and merge them to one PDF file.

concatenate Fetch McDonald invoices from mailbox and merge them to one PDF file. Description This script will fetch all McDonald invoice pdfs from a p

3 Oct 6, 2022

Python script that split PDF files.

Automatic PDF Splitter This script can create new single-page PDFs files from multipaged PDFs. Requirements Python 3.0+ # Debian distros sudo apt-get

5 Apr 2, 2022

Merge multiple PDF files into one.

PDF Merger Merge multiple PDF files into one. Usage % python pdf_merger.py -h usage: pdf_merger.py [-h] [-o OUTPUT] [-f [FILES ...]] optional argumen

6 Oct 3, 2022

Images to PDF Telegram Bot

ilovepdf Convert Images to PDF Bot This bot will helps you to create pdf's from your images [without leaving telegram] 😉 By Default: your pdf fil

116 Dec 29, 2022

Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator

Malicious PDF Generator ☠️ Generate ten different malicious pdf files with phone-home functionality. Can be used with Burp Collaborator. Used for pene

1.9k Jan 1, 2023

一个多语言支持、易使用的 OCR 项目。An easy-to-use OCR project with multilingual support.

AgentOCR 简介 AgentOCR 是一个基于 PaddleOCR 和 ONNXRuntime 项目开发的一个使用简单、调用方便的 OCR 项目本项目目前包含 Python Package 【AgentOCR】和 OCR 标注软件【AgentOCRLabeling】使用指南 Pytho

98 Nov 10, 2022

pix2tex: Using a ViT to convert images of equations into LaTeX code.

The goal of this project is to create a learning based system that takes an image of a math formula and returns corresponding LaTeX code.

2.6k Dec 30, 2022

Genalog is an open source, cross-platform python package allowing generation of synthetic document images with custom degradations and text alignment capabilities.

235 Dec 22, 2022

This app converts an pdf file into the audio file.

PDF-to-Audio This app takes an pdf as an input and convert it into audio, and the library text-to-speech starts speaking the preffered page given in t

3 Aug 4, 2021

Official implementation of SynthTIGER (Synthetic Text Image GEneratoR) ICDAR 2021

🐯 SynthTIGER: Synthetic Text Image GEneratoR Official implementation of SynthTIGER | Paper | Datasets Moonbin Yim1, Yoonsik Kim1, Han-cheol Cho1, Sun

256 Jan 5, 2023

Extract Thailand COVID-19 Cluster data from daily briefing pdf.

Thailand COVID-19 Cluster Data Extraction About Extract Clusters from Thailand Daily COVID-19 briefing PDF Download latest data Here. Data will be upd

5 Sep 27, 2021

Performing the following operations using python on PDF.

Python PDF Handling Tutorial Python is a highly versatile language with a huge set of libraries. It is a high level language with simple syntax. Pytho

131 Dec 16, 2022

A Python tool to generate a static HTML file that represents the internal structure of a PDF file

PDFSyntax A Python tool to generate a static HTML file that represents the internal structure of a PDF file At some point the low-level functions deve

394 Dec 30, 2022

pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative markdown file as input

pystitcher pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative input in the form of a mark

387 Dec 10, 2022

A python library for extracting text from PDFs without losing the formatting of the PDF content.

Multilingual PDF to Text Install Package from Pypi Install it using pip. pip install multilingual-pdf2text The library uses Tesseract which can be ins

49 Nov 7, 2022

A dataset for online Arabic calligraphy

Calliar Calliar is a dataset for Arabic calligraphy. The dataset consists of 2500 json files that contain strokes manually annotated for Arabic callig

114 Dec 28, 2022

Some Boring Research About Products Recognition 、Duplicate Img Detection、Img Stitch、OCR

Products Recognition 介绍商品识别，围绕在复杂的商场零售场景中，识别出货架图像中的商品信息。主要组成部分：重复图像检测。【更新进度 4/10】图像拼接。【更新进度 0/10】目标检测。【更新进度 0/10】商品识别。【更新进度 1/10】 OCR。【更新进度 1/10】

18 Jan 27, 2022

A cross platform OCR Library based on PaddleOCR & OnnxRuntime

767 Jan 9, 2023

PyMuPDF is a Python binding with support for MuPDF

PyMuPDF is a Python binding with support for MuPDF (current version 1.18.*), a lightweight PDF, XPS, and E-book viewer, renderer, and toolkit, which is maintained and developed by Artifex Software, Inc.

1.9k Jan 3, 2023

PyTorch code of my ICDAR 2021 paper Vision Transformer for Fast and Efficient Scene Text Recognition (ViTSTR)

Vision Transformer for Fast and Efficient Scene Text Recognition (ICDAR 2021) ViTSTR is a simple single-stage model that uses a pre-trained Vision Tra

198 Dec 27, 2022

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

PDFMiner PDFMiner is a text extraction tool for PDF documents. Warning: As of 2020, PDFMiner is not actively maintained. The code still works, but thi

4.9k Jan 4, 2023

PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files.

PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. It can retrieve text and metadata from PDFs as well as merge entire files together.

5k Jan 4, 2023

Simple HTML and PDF document generator for Python - with built-in support for popular data analysis and plotting libraries.

Esparto is a simple HTML and PDF document generator for Python. Its primary use is for generating shareable single page reports with content from popular analytics and data science libraries.

76 Dec 12, 2022

OpenMMLab Text Detection, Recognition and Understanding Toolbox

Introduction English | 简体中文 MMOCR is an open-source toolbox based on PyTorch and mmdetection for text detection, text recognition, and the correspondi

3k Jan 7, 2023

Tool which allow you to detect and translate text.

Text detection and recognition This repository contains tool which allow to detect region with text and translate it one by one. Description Two pretr

176 Nov 28, 2022

Code for the paper "MASTER: Multi-Aspect Non-local Network for Scene Text Recognition" (Pattern Recognition 2021)

MASTER-PyTorch PyTorch reimplementation of "MASTER: Multi-Aspect Non-local Network for Scene Text Recognition" (Pattern Recognition 2021). This projec

255 Dec 29, 2022

PGPortfolio: Policy Gradient Portfolio, the source code of "A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem"(https://arxiv.org/pdf/1706.10059.pdf).

This is the original implementation of our paper, A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem (arXiv:1706.1

1.5k Dec 29, 2022

Layout Parser is a deep learning based tool for document image layout analysis tasks.

A Python Library for Document Layout Understanding

3.4k Dec 30, 2022

Source Code for DialogBERT: Discourse-Aware Response Generation via Learning to Recover and Rank Utterances (https://arxiv.org/pdf/2012.01775.pdf)

DialogBERT This is a PyTorch implementation of the DialogBERT model described in DialogBERT: Neural Response Generation via Hierarchical BERT with Dis

67 Jan 6, 2023

PyTorch implementation of TabNet paper : https://arxiv.org/pdf/1908.07442.pdf

README TabNet : Attentive Interpretable Tabular Learning This is a pyTorch implementation of Tabnet (Arik, S. O., & Pfister, T. (2019). TabNet: Attent

2k Dec 27, 2022

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted. ocrmypdf # it's a scriptable c

7.9k Jan 3, 2023

Convolutional Recurrent Neural Networks(CRNN) for Scene Text Recognition

CRNN_Tensorflow This is a TensorFlow implementation of a Deep Neural Network for scene text recognition. It is mainly based on the paper "An End-to-En

1000 Dec 27, 2022

An expandable and scalable OCR pipeline

Overview Nidaba is the central controller for the entire OGL OCR pipeline. It oversees and automates the process of converting raw images into citable

81 Jan 4, 2023

A simple OCR API server, seriously easy to be deployed by Docker, on Heroku as well

ocrserver Simple OCR server, as a small working sample for gosseract. Try now here https://ocr-example.herokuapp.com/, and deploy your own now. Deploy

541 Dec 28, 2022

Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)

Open Semantic Search https://opensemanticsearch.org Integrated search server, ETL framework for document processing (crawling, text extraction, text a

684 Jan 6, 2023

Python Ocr-pdf Resources

Python ocr-pdf Libraries

A Certificate renaming tool made for IEEE CS SBC, SJCE.

Searching keywords in PDF file folders

This is a file deletion program that asks you for an extension of a file (.mp3, .pdf, .docx, etc.) to delete all of the files in a dir that have that extension.

pikepdf is a Python library for reading and writing PDF files.

This is a GUI for scrapping PDFs with the help of optical character recognition making easier than ever to scrape PDFs.

OCR Post Correction for Endangered Language Texts

Small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface

Python-based tools for document analysis and OCR

Tool which allow you to detect and translate text.

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

Generate a preview image for a PDF.

Program that locks/unlocks pdf files🐍

Classical OCR DCNN reproduction based on PaddlePaddle framework.

Converting Html files to pdf using python script, pdfkit module and wkhtmltopdf.

this is simple program, that converts pdf file to png

Convert markdown to HTML using the GitHub API and some additional tweaks with Python.

This is a implementation of CRAFT OCR method

a simple ehentai downloader with jpg 2 pdf

This is a GUI for scrapping PDFs with the help of optical character recognition making easier than ever to scrape PDFs.

Dirty, ugly, and hopefully useful OCR of Facebook Papers docs released by Gizmodo

Generate custom detailed survey paper with topic clustered sections and proper citations, from just a single query in just under 30 mins !!

A Vietnamese personal card OCR website built with Django.

Table automatically extraction from PDF Document

Convolutional Recurrent Neural Network (CRNN) for image-based sequence recognition.

Key information extraction from invoice document with Graph Convolution Network

Convert lecture videos to slides in one line. Takes an input of a directory containing your lecture videos and outputs a directory containing .PDF files containing the slides of each lecture.

Machine Leaning applied to denoise images to improve OCR Accuracy

A bot that extract text from images using the Tesseract OCR.

Use Youdao OCR API to covert your clipboard image to text.

A simple pdf size compressing telegram robot witten in python.

Usando o Amazon Textract como OCR para Extração de Dados no DynamoDB

Arxiv2Kindle is a simple script written in python that converts LaTeX source downloaded from Arxiv and recompiles it to better fit a Kindle or other similar reading devices.

Convert Lecture Videos to PDF

Ground truth data for the Optical Character Recognition of Historical Classical Commentaries.

a micro OCR network with 0.07mb params.

Programa que viabiliza a OCR (Optical Character Reading - leitura óptica de caracteres) de um PDF.

Simple python tool created for downloading PDF.

It is a image ocr tool using the Tesseract-OCR engine with the pytesseract package and has a GUI.

This book will take you on an exploratory journey through the PDF format, and the borb Python library.

Docbarcodes extracts 1D and 2D barcodes from scanned PDF documents or images. It can be used to automate extraction and processing of all kind of documents.

METS/ALTO OCR enhancing tool by the National Library of Luxembourg (BnL)

Website desenvolvido em Django para gerenciamento e upload de arquivos (.pdf).

distfit - Probability density fitting

An application which enables the users to perform simple yet intriguing PDF operations

I can help you convert your images to pdf file.

Parser manager for parsing DOC, DOCX, PDF or HTML files

docTR by Mindee (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.

Convert emails without attachments to pdf and send as email

Gera um PDF, logo depois de você responder um questionário simples, e envia para o e-mail que você informar.

x-ray is a Python library for finding bad redactions in PDF documents.

A Screen Translator/OCR Translator made by using Python and Tesseract, the user interface are made using Tkinter. All code written in python.

Python lib for Simple PDF text extraction

WeasyPrint is a smart solution helping web developers to create PDF documents.

Generate text line images for training deep learning OCR model (e.g. CRNN)

borb is a library for reading, creating and manipulating PDF files in python.

Fetch McDonald invoices from mailbox and merge them to one PDF file.

Python script that split PDF files.

Merge multiple PDF files into one.

Images to PDF Telegram Bot

Generate a bunch of malicious pdf files with phone-home functionality. Can be used with Burp Collaborator

一个多语言支持、易使用的 OCR 项目。An easy-to-use OCR project with multilingual support.

pix2tex: Using a ViT to convert images of equations into LaTeX code.

Genalog is an open source, cross-platform python package allowing generation of synthetic document images with custom degradations and text alignment capabilities.

This app converts an pdf file into the audio file.

Official implementation of SynthTIGER (Synthetic Text Image GEneratoR) ICDAR 2021

Extract Thailand COVID-19 Cluster data from daily briefing pdf.

Performing the following operations using python on PDF.

A Python tool to generate a static HTML file that represents the internal structure of a PDF file

pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative markdown file as input

A python library for extracting text from PDFs without losing the formatting of the PDF content.

A dataset for online Arabic calligraphy

Some Boring Research About Products Recognition 、Duplicate Img Detection、Img Stitch、OCR

A cross platform OCR Library based on PaddleOCR & OnnxRuntime

PyMuPDF is a Python binding with support for MuPDF

PyTorch code of my ICDAR 2021 paper Vision Transformer for Fast and Efficient Scene Text Recognition (ViTSTR)

Python PDF Parser (Not actively maintained). Check out pdfminer.six.

PyPDF2 is a pure-python PDF library capable of splitting, merging together, cropping, and transforming the pages of PDF files.

Simple HTML and PDF document generator for Python - with built-in support for popular data analysis and plotting libraries.