A Python library for extracting text from PDFs without losing the formatting of the PDF content.

Overview

Multilingual PDF to Text

Install Package from PyPI

  1. Install it using pip.
pip install multilingual-pdf2text

The library uses Tesseract for OCR. Install it by following the Tesseract Installation instructions.
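
Before running the library, it can help to confirm that the external tools are reachable. The sketch below is a simple check under stated assumptions: it looks for the `tesseract` executable and for Poppler's `pdfinfo` and `pdftoppm` tools. Poppler is not listed in the installation steps above; it is included here only because a log message reported by a user further down mentions it. The helper name `missing_tools` is hypothetical and not part of the library.

```python
import shutil


def missing_tools(tools=("tesseract", "pdfinfo", "pdftoppm")):
    """Return the names of the executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools were found.")
```

An empty result only shows that the executables exist on PATH, not that the required language data is installed.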

Example Usage

  1. Use it in your code
from multilingual_pdf2text.pdf2text import PDF2Text
from multilingual_pdf2text.models.document_model.document import Document
import logging
logging.basicConfig(level=logging.INFO)

def main():
    ## create document for extraction with configurations
    pdf_document = Document(
        document_path='/Users/shahrukh/Desktop/multilingual-pdf2text/example/example.pdf',
        language='spa'
        )
    pdf2text = PDF2Text(document=pdf_document)
    content = pdf2text.extract()
    print(content)

if __name__ == "__main__":
    main()
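
The example above prints the raw return value. As a possible follow-up, the sketch below saves the extracted text to a file. It assumes that `extract()` returns a list of per-page dictionaries with a `"text"` key, which is how users access the result in the reports further down; no other keys are assumed. The helper names `join_pages` and `extract_to_file` are hypothetical.

```python
def join_pages(content, separator="\n\n"):
    """Join the 'text' field of each page dict into one string."""
    return separator.join(page["text"] for page in content)


def extract_to_file(pdf_path, language, output_path):
    """Run the extraction and write all page texts to output_path."""
    # Imported here so join_pages can be used without the library installed.
    from multilingual_pdf2text.pdf2text import PDF2Text
    from multilingual_pdf2text.models.document_model.document import Document

    document = Document(document_path=pdf_path, language=language)
    content = PDF2Text(document=document).extract()
    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write(join_pages(content))
```

Writing with an explicit UTF-8 encoding avoids platform-dependent defaults when the text contains non-Latin scripts.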

Tesseract supports the following languages:
Code Language

  • afr Afrikaans
  • amh Amharic
  • ara Arabic
  • asm Assamese
  • aze Azerbaijani
  • aze_cyrl Azerbaijani - Cyrillic
  • bel Belarusian
  • ben Bengali
  • bod Tibetan
  • bos Bosnian
  • bul Bulgarian
  • cat Catalan; Valencian
  • ceb Cebuano
  • ces Czech
  • chi_sim Chinese - Simplified
  • chi_tra Chinese - Traditional
  • chr Cherokee
  • cym Welsh
  • dan Danish
  • deu German
  • dzo Dzongkha
  • ell Greek, Modern (1453-)
  • eng English
  • enm English, Middle (1100-1500)
  • epo Esperanto
  • est Estonian
  • eus Basque
  • fas Persian
  • fin Finnish
  • fra French
  • frk German Fraktur
  • frm French, Middle (ca. 1400-1600)
  • gle Irish
  • glg Galician
  • grc Greek, Ancient (-1453)
  • guj Gujarati
  • hat Haitian; Haitian Creole
  • heb Hebrew
  • hin Hindi
  • hrv Croatian
  • hun Hungarian
  • iku Inuktitut
  • ind Indonesian
  • isl Icelandic
  • ita Italian
  • ita_old Italian - Old
  • jav Javanese
  • jpn Japanese
  • kan Kannada
  • kat Georgian
  • kat_old Georgian - Old
  • kaz Kazakh
  • khm Central Khmer
  • kir Kirghiz; Kyrgyz
  • kor Korean
  • kur Kurdish
  • lao Lao
  • lat Latin
  • lav Latvian
  • lit Lithuanian
  • mal Malayalam
  • mar Marathi
  • mkd Macedonian
  • mlt Maltese
  • msa Malay
  • mya Burmese
  • nep Nepali
  • nld Dutch; Flemish
  • nor Norwegian
  • ori Oriya
  • pan Panjabi; Punjabi
  • pol Polish
  • por Portuguese
  • pus Pushto; Pashto
  • ron Romanian; Moldavian; Moldovan
  • rus Russian
  • san Sanskrit
  • sin Sinhala; Sinhalese
  • slk Slovak
  • slv Slovenian
  • spa Spanish; Castilian
  • spa_old Spanish; Castilian - Old
  • sqi Albanian
  • srp Serbian
  • srp_latn Serbian - Latin
  • swa Swahili
  • swe Swedish
  • syr Syriac
  • tam Tamil
  • tel Telugu
  • tgk Tajik
  • tgl Tagalog
  • tha Thai
  • tir Tigrinya
  • tur Turkish
  • uig Uighur; Uyghur
  • ukr Ukrainian
  • urd Urdu
  • uzb Uzbek
  • uzb_cyrl Uzbek - Cyrillic
  • vie Vietnamese
  • yid Yiddish
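
A language code from the list only works if the matching Tesseract language data is installed on the machine (one user report further down installs it on Ubuntu through packages such as `tesseract-ocr-ben`). The sketch below checks what is installed by parsing the output of `tesseract --list-langs`. The helper names are hypothetical, and the header line format is assumed to start with "List of".

```python
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip blank lines and the "List of available languages ..." header.
    return [line for line in lines if line and not line.lower().startswith("list of")]


def installed_langs():
    """Ask the local tesseract executable which languages it can use."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    # Some older Tesseract versions print the list to stderr instead of stdout.
    return parse_langs(result.stdout or result.stderr)
```

For example, `"spa" in installed_langs()` can be checked before creating a `Document` with `language='spa'`.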

Comments
  • import error

    Hi there. On Colab, running: from multilingual_pdf2text.pdf2text import PDF2Text gives:

    TypeError Traceback (most recent call last)

    in ()
    ----> 1 from multilingual_pdf2text.pdf2text import PDF2Text
          2 from multilingual_pdf2text.models.document_model.document import Document
          3 import logging
          4 logging.basicConfig(level=logging.INFO)
          5

    2 frames

    /usr/local/lib/python3.7/dist-packages/multilingual_pdf2text/doc2img/parse_document.py in PDF2Images()
         12 self.logger = logging.getLogger(name)
         13
    ---> 14 def convert_document_to_images(self, document: Document) -> list[PpmImageFile]:
         15 """
         16 Converts the Document object to

    TypeError: 'type' object is not subscriptable

    opened by dstoekl 2
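
A likely explanation for this report: the annotation `list[PpmImageFile]` in the traceback subscripts the built-in `list`, which Python only supports from version 3.9 onward (PEP 585). On the Python 3.7 runtime shown in the path, evaluating it raises this `TypeError`. The sketch below shows a 3.7-compatible way to write such an annotation with `typing.List`. It is an illustration, not the library's code, and `PageImage` is a hypothetical stand-in for the image type.

```python
from typing import List


class PageImage:
    """Hypothetical stand-in for the image type used in the library."""


def convert_document_to_images(paths: List[str]) -> List[PageImage]:
    # typing.List is subscriptable on Python 3.7, unlike the built-in list.
    return [PageImage() for _ in paths]
```

Running on Python 3.9 or newer is the other way to avoid the error without changing any annotations.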
  • content = pdf2text.extract() taking a lot of time before crashing colab

    Thank you for making this awesome library. I am trying to make a Bengali tafsir reader using your repository. Here is the code that I tried in Colab:

    !pip install gTTS
    #!pip install PyPDF2
    !pip install playsound
    !pip install multilingual-pdf2text==1.1.0
    !apt install tesseract-ocr
    !apt install libtesseract-dev
    !apt-get install poppler-utils 
    
    !apt-get install tesseract-ocr-ara
    !apt-get install tesseract-ocr-ben
    
    from multilingual_pdf2text.pdf2text import PDF2Text
    from multilingual_pdf2text.models.document_model.document import Document
    import logging
    logging.basicConfig(level=logging.INFO)
    
    
    def main():
        ## create document for extraction with configurations
        pdf_document = Document(
            document_path='/content/tafsir.pdf',
            language='ben'
            )
        pdf2text = PDF2Text(document=pdf_document)
        content = pdf2text.extract()
        for page in content:
            print(page['text'])
    
    if __name__ == "__main__":
        main()
    
    

    It takes a long time and is basically stuck after printing this:

    INFO:multilingual_pdf2text.doc2img.parse_document:Parsing document from pdf to image
    INFO:multilingual_pdf2text.ocr.image_to_text:Extracting text from images via OCR

    After a few minutes Colab crashes. It seems the notebook crashes after exhausting all of Colab's available RAM. The PDF book I am trying to read with this library is written in Bangla and Arabic. Here is the link to the PDF: https://i-onlinemedia.net/downloads/books/quran-tafsir/tafsir_ibn_kasir/Tafsir_Ibn_Kasir_Part-1-2-3.pdf

    opened by mobassir94 1
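
If the crash comes from holding every page image of a large PDF in memory at once (an assumption; the report does not confirm the cause), one workaround is to convert and OCR a few pages at a time. The sketch below does not use multilingual-pdf2text at all. It calls the separate `pdf2image` and `pytesseract` packages directly, and the helper names `page_ranges` and `ocr_in_chunks` are hypothetical.

```python
def page_ranges(total_pages, chunk_size):
    """Yield 1-based inclusive (first, last) page ranges covering the document."""
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def ocr_in_chunks(pdf_path, language, chunk_size=5):
    """Yield (page_number, text) pairs while keeping only one chunk of images in memory."""
    # Imported lazily; both packages must be installed separately.
    from pdf2image import convert_from_path, pdfinfo_from_path
    import pytesseract

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_ranges(total_pages, chunk_size):
        # Only the pages in this range are rendered to images.
        images = convert_from_path(pdf_path, first_page=first, last_page=last)
        for offset, image in enumerate(images):
            yield first + offset, pytesseract.image_to_string(image, lang=language)
```

Because `ocr_in_chunks` is a generator, each page's text can be written out as soon as it is produced. Tesseract itself accepts combined languages such as `ben+ara`, which may suit a document that mixes Bangla and Arabic.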
  • Poppler dependency?

    convert_to_text.py
    INFO:multilingual_pdf2text.doc2img.parse_document:Parsing document from pdf to image
    INFO:multilingual_pdf2text.doc2img.parse_document:Unable to get page count. Is poppler installed and in PATH?
    INFO:multilingual_pdf2text.ocr.image_to_text:Extracting text from images via OCR
    []
    
    opened by zzj0402 1
  • Does not have support for windows?

    Hi, first of all the library is really good.

    I tried to run this library on Windows 10 and it doesn't work. I believe I did everything right: I installed Tesseract and ran the following code:

    from multilingual_pdf2text.pdf2text import PDF2Text
    from multilingual_pdf2text.models.document_model.document import Document
    import logging
    
    from utils import write_txt
    
    logging.basicConfig(level=logging.INFO)
    
    
    def main():
        ## create document for extraction with configurations
        pdf_document = Document(document_path="./pdfs_samples/page1.pdf", language="por")
        pdf2text = PDF2Text(document=pdf_document)
        content = pdf2text.extract()
        for page in content:
            print(page["text"])
            write_txt(page["text"], filename="output_multilingual_pdf2text1.txt")
    
    
    if __name__ == "__main__":
        main()
    

    I ran this same code on Linux (Ubuntu 20.04) and it worked perfectly. So I was wondering whether the library supports Windows.

    opened by ghost 1
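
The report does not include an error message, so the cause is unknown. One thing worth ruling out on Windows is that the Tesseract and Poppler executables are installed but not on `PATH`. The sketch below prepends directories to `PATH` for the current process before the library is imported. The helper name `prepend_to_path` and the example directories are hypothetical and must be adjusted to the actual install locations.

```python
import os


def prepend_to_path(directories, env=None):
    """Put the given directories at the front of PATH in env (default: os.environ)."""
    env = os.environ if env is None else env
    current = env.get("PATH", "")
    parts = list(directories) + ([current] if current else [])
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]


# Hypothetical install locations; adjust to where the tools actually live.
# prepend_to_path([r"C:\Program Files\Tesseract-OCR", r"C:\poppler\Library\bin"])
```

The change only affects the running Python process and any programs it launches; it does not modify the system-wide setting.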
Owner
Shahrukh Khan
CS Grad Student @ Saarland University