A Python library for extracting text from PDFs without losing the formatting of the PDF content.

Shahrukh Khan

Last update: Nov 7, 2022

Overview

Multilingual PDF to Text

Install Package from PyPI

Install it using pip.

pip install multilingual-pdf2text

The library uses Tesseract for OCR. Install it by following these instructions:

Tesseract Installation

Example Usage

Use it in your code

from multilingual_pdf2text.pdf2text import PDF2Text
from multilingual_pdf2text.models.document_model.document import Document
import logging
logging.basicConfig(level=logging.INFO)

def main():
    # Create the document to extract from, with its configuration
    pdf_document = Document(
        document_path='/Users/shahrukh/Desktop/multilingual-pdf2text/example/example.pdf',
        language='spa'  # Tesseract language code (see the table below)
    )
    pdf2text = PDF2Text(document=pdf_document)
    content = pdf2text.extract()
    print(content)

if __name__ == "__main__":
    main()
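
In the reports under Comments, users iterate over the result of extract() and read page['text'] from each item. Assuming that shape (a list of dicts with a "text" key, which this README does not itself confirm), a small helper can join the pages into one string. The name pages_to_text is hypothetical and not part of the library; the sample data stands in for a real extract() result.

```python
def pages_to_text(content, separator="\n\n"):
    """Join per-page OCR results into a single string.

    Assumes `content` is an iterable of dicts with a "text" key,
    as seen in user reports; this shape is an assumption.
    """
    return separator.join(page["text"].strip() for page in content)


# Stand-in data shaped like the assumed extract() result
sample = [{"text": "Hola mundo\n"}, {"text": "Segunda página\n"}]
print(pages_to_text(sample))
```

With a real document, the same helper would be called as pages_to_text(pdf2text.extract()).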

Tesseract supports the following languages:
Code Language

afr Afrikaans
amh Amharic
ara Arabic
asm Assamese
aze Azerbaijani
aze_cyrl Azerbaijani - Cyrillic
bel Belarusian
ben Bengali
bod Tibetan
bos Bosnian
bul Bulgarian
cat Catalan; Valencian
ceb Cebuano
ces Czech
chi_sim Chinese - Simplified
chi_tra Chinese - Traditional
chr Cherokee
cym Welsh
dan Danish
deu German
dzo Dzongkha
ell Greek, Modern (1453-)
eng English
enm English, Middle (1100-1500)
epo Esperanto
est Estonian
eus Basque
fas Persian
fin Finnish
fra French
frk German Fraktur
frm French, Middle (ca. 1400-1600)
gle Irish
glg Galician
grc Greek, Ancient (-1453)
guj Gujarati
hat Haitian; Haitian Creole
heb Hebrew
hin Hindi
hrv Croatian
hun Hungarian
iku Inuktitut
ind Indonesian
isl Icelandic
ita Italian
ita_old Italian - Old
jav Javanese
jpn Japanese
kan Kannada
kat Georgian
kat_old Georgian - Old
kaz Kazakh
khm Central Khmer
kir Kirghiz; Kyrgyz
kor Korean
kur Kurdish
lao Lao
lat Latin
lav Latvian
lit Lithuanian
mal Malayalam
mar Marathi
mkd Macedonian
mlt Maltese
msa Malay
mya Burmese
nep Nepali
nld Dutch; Flemish
nor Norwegian
ori Oriya
pan Panjabi; Punjabi
pol Polish
por Portuguese
pus Pushto; Pashto
ron Romanian; Moldavian; Moldovan
rus Russian
san Sanskrit
sin Sinhala; Sinhalese
slk Slovak
slv Slovenian
spa Spanish; Castilian
spa_old Spanish; Castilian - Old
sqi Albanian
srp Serbian
srp_latn Serbian - Latin
swa Swahili
swe Swedish
syr Syriac
tam Tamil
tel Telugu
tgk Tajik
tgl Tagalog
tha Thai
tir Tigrinya
tur Turkish
uig Uighur; Uyghur
ukr Ukrainian
urd Urdu
uzb Uzbek
uzb_cyrl Uzbek - Cyrillic
vie Vietnamese
yid Yiddish
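
A language code from the table only works if the matching Tesseract language data is installed on the machine. As a rough sketch, the installed codes can be read from the tesseract command-line tool with its --list-langs option. The helper names parse_list_langs and installed_languages are hypothetical, and the parsing simply treats every line other than the header as a code, so any warnings Tesseract prints would need extra filtering.

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header
        if not line or line.lower().startswith("list of"):
            continue
        codes.append(line)
    return codes


def installed_languages():
    """Return installed Tesseract language codes, or [] if tesseract is not on PATH."""
    if shutil.which("tesseract") is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=False,
    )
    # Some Tesseract versions print the list to stderr instead of stdout
    return parse_list_langs(result.stdout + "\n" + result.stderr)


# True only if the Spanish language data is installed
print("spa" in installed_languages())
```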

Comments

import error

Hi there. On Colab, running from multilingual_pdf2text.pdf2text import PDF2Text gives:

TypeError Traceback (most recent call last)

in ()
----> 1 from multilingual_pdf2text.pdf2text import PDF2Text
      2 from multilingual_pdf2text.models.document_model.document import Document
      3 import logging
      4 logging.basicConfig(level=logging.INFO)
      5

2 frames

/usr/local/lib/python3.7/dist-packages/multilingual_pdf2text/doc2img/parse_document.py in PDF2Images()
     12 self.logger = logging.getLogger(name)
     13
---> 14 def convert_document_to_images(self, document: Document) -> list[PpmImageFile]:
     15 """
     16 Converts the Document object to

TypeError: 'type' object is not subscriptable

opened by dstoekl 2
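
A likely cause of the TypeError above, offered as a reading of the traceback rather than a confirmed diagnosis: the annotation list[PpmImageFile] subscripts the built-in list type, which Python supports only from version 3.9 onward, while the path in the traceback shows Python 3.7. A minimal check before importing the library could look like this; supports_builtin_generics is a hypothetical helper name.

```python
import sys


def supports_builtin_generics(version_info=sys.version_info):
    """Return True if this Python accepts annotations such as list[int] at runtime."""
    return tuple(version_info[:2]) >= (3, 9)


if not supports_builtin_generics():
    print("Python 3.9+ is needed for list[...] annotations; found",
          sys.version.split()[0])
```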

content = pdf2text.extract() taking a lot of time before crashing colab

Thank you for making this awesome library. I am trying to build a Bengali tafsir reader using your repository. Here is the code that I tried in Colab:

!pip install gTTS
#!pip install PyPDF2
!pip install playsound
!pip install multilingual-pdf2text==1.1.0
!apt install tesseract-ocr
!apt install libtesseract-dev
!apt-get install poppler-utils
!apt-get install tesseract-ocr-ara
!apt-get install tesseract-ocr-ben

from multilingual_pdf2text.pdf2text import PDF2Text
from multilingual_pdf2text.models.document_model.document import Document
import logging
logging.basicConfig(level=logging.INFO)

def main():
    ## create document for extraction with configurations
    pdf_document = Document(
        document_path='/content/tafsir.pdf',
        language='ben'
        )
    pdf2text = PDF2Text(document=pdf_document)
    content = pdf2text.extract()
    for page in content:
        print(page['text'])

if __name__ == "__main__":
    main()

It takes a long time and is basically stuck after printing this:

INFO:multilingual_pdf2text.doc2img.parse_document:Parsing document from pdf to image
INFO:multilingual_pdf2text.ocr.image_to_text:Extracting text from images via OCR

After a few minutes Colab crashes, apparently after exhausting all of its available RAM. The PDF book that I am trying to read with this library is written in Bangla and Arabic. Here is the link to the PDF: https://i-onlinemedia.net/downloads/books/quran-tafsir/tafsir_ibn_kasir/Tafsir_Ibn_Kasir_Part-1-2-3.pdf
opened by mobassir94 1
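
The report above is consistent with every page being rendered to an image before OCR starts, which could exhaust memory on a long book, though the report does not confirm that. One workaround, sketched here, bypasses this library and calls the pdf2image and pytesseract packages directly so that only a few pages are held in memory at a time. This is a swapped-in approach, not the library's own API; ocr_in_batches, page_ranges and the batch size are hypothetical, and both packages must be installed separately along with Poppler and Tesseract.

```python
def page_ranges(total_pages, batch_size):
    """Yield 1-based, inclusive (first, last) page ranges covering the document."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_in_batches(pdf_path, lang, batch_size=5, dpi=200):
    """Yield OCR text page by page, rendering only `batch_size` pages at once."""
    # Imported lazily so page_ranges() can be used without these packages
    from pdf2image import convert_from_path, pdfinfo_from_path
    import pytesseract

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_ranges(total_pages, batch_size):
        # Render just this slice of the PDF to images
        images = convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
        for image in images:
            yield pytesseract.image_to_string(image, lang=lang)


print(list(page_ranges(12, 5)))  # [(1, 5), (6, 10), (11, 12)]
```

For a document that mixes Bangla and Arabic, Tesseract accepts several codes joined with a plus sign, for example lang="ben+ara", provided both language packs are installed.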

Poppler dependency?

convert_to_text.py
INFO:multilingual_pdf2text.doc2img.parse_document:Parsing document from pdf to image
INFO:multilingual_pdf2text.doc2img.parse_document:Unable to get page count. Is poppler installed and in PATH?
INFO:multilingual_pdf2text.ocr.image_to_text:Extracting text from images via OCR
[]

opened by zzj0402 1
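
The log line above asks whether Poppler is installed and on PATH, and the empty result [] suggests that no pages were rendered. A quick check, as a sketch: look up the required command-line tools on PATH before running the extraction. The names pdftoppm and pdfinfo are the Poppler utilities that the pdf2image package calls; that this library relies on pdf2image is an assumption based on the wording of the log message. missing_tools is a hypothetical helper.

```python
import shutil


def missing_tools(tools=("pdftoppm", "pdfinfo", "tesseract"), which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
```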

Does not have support for windows?

Hi, first of all the library is really good.

I tried to run this library on Windows 10 and it doesn't work. I believe I did everything right: I installed Tesseract and ran the following code:

from multilingual_pdf2text.pdf2text import PDF2Text
from multilingual_pdf2text.models.document_model.document import Document
import logging

from utils import write_txt

logging.basicConfig(level=logging.INFO)


def main():
    ## create document for extraction with configurations
    pdf_document = Document(document_path="./pdfs_samples/page1.pdf", language="por")
    pdf2text = PDF2Text(document=pdf_document)
    content = pdf2text.extract()
    for page in content:
        print(page["text"])
        write_txt(page["text"], filename="output_multilingual_pdf2text1.txt")


if __name__ == "__main__":
    main()

I ran this same code on Linux (Ubuntu 20.04) and it worked perfectly. So I was wondering whether the library doesn't support Windows.

opened by ghost 1
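
The report does not include an error message, so the cause is unknown. One common difference on Windows is that the Tesseract and Poppler executables are not on PATH after installation. Assuming that is the problem, their directories can be added for the current process before the library is used. The directory names below are placeholders, and prepend_to_path is a hypothetical helper.

```python
import os


def prepend_to_path(directories, env=os.environ):
    """Put `directories` at the front of PATH in `env`, skipping ones already listed."""
    current = env.get("PATH", "")
    parts = [p for p in current.split(os.pathsep) if p]
    new_dirs = [d for d in directories if d not in parts]
    env["PATH"] = os.pathsep.join(new_dirs + parts)
    return env["PATH"]


# Demo on a throwaway mapping; pass no `env` to change the real process PATH.
# "tesseract_dir" and "poppler_dir" stand in for the actual install directories.
demo_env = {"PATH": os.pathsep.join(["bin_a", "bin_b"])}
prepend_to_path(["tesseract_dir", "poppler_dir"], env=demo_env)
print(demo_env["PATH"])
```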

Owner

Shahrukh Khan

CS Grad Student @ Saarland University
