Dirty, ugly, and hopefully useful OCR of Facebook Papers docs released by Gizmodo

Bill Fitzgerald

Last update: Oct 28, 2021


Overview

Quick and Dirty OCR of Facebook Papers

Gizmodo has been working through the Facebook Papers and releasing the docs that they process and review.

As luck would have it, I had some ugly but functional code lying around that does a first pass of OCR on these docs. That code is in the pdf_to_image.py script. I'd welcome improvements to the code, especially to the image cleanup prior to OCR (lines 92-97, approx). I experimented with cleaning up the images via PIL and cv2, but the results were less accurate, almost certainly due to my lack of familiarity with both libraries.

These Facebook Papers are especially challenging from an OCR perspective because many of them are pictures taken of a screen, so the base image quality isn't especially good. Because of this, not every document can be processed cleanly, and the documents that do get processed have some cruft in them.
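For anyone who wants to experiment with image cleanup, here is a minimal sketch of one common pre-OCR step: binarizing a grayscale image with Otsu's threshold. This is not what pdf_to_image.py does, and it is not guaranteed to help with photos of screens. The function names `otsu_threshold` and `binarize` are made up for illustration, and the input is assumed to be a flat list of 0-255 grayscale values that you have already pulled out of the image with PIL or cv2.

```python
def otsu_threshold(pixels):
    """Return the 0-255 level that maximizes between-class variance (Otsu's method)."""
    # Count how many pixels fall on each gray level.
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold = 0
    best_variance = 0.0
    background_count = 0
    background_sum = 0

    for level in range(256):
        background_count += histogram[level]
        if background_count == 0:
            continue  # nothing at or below this level yet
        foreground_count = total - background_count
        if foreground_count == 0:
            break  # every pixel is already in the background class
        background_sum += level * histogram[level]
        mean_background = background_sum / background_count
        mean_foreground = (total_sum - background_sum) / foreground_count
        # Between-class variance, up to a constant factor.
        variance = (
            background_count
            * foreground_count
            * (mean_background - mean_foreground) ** 2
        )
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


def binarize(pixels, threshold):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [255 if value > threshold else 0 for value in pixels]
```

Whether this improves accuracy depends heavily on the source image, so compare the OCR output with and without it before adopting it.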

With that said, the text pulled from these files makes it much easier to search a large set of documents for keywords.
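As a rough illustration of that kind of keyword search, here is a small sketch that scans a folder of OCR output. It assumes the output is a directory of .txt files, which may not match how this repo is actually laid out, and the function name `find_keyword_hits` is hypothetical. Matching is case-insensitive and substring-based, since OCR cruft often breaks word boundaries.

```python
import re
from pathlib import Path


def find_keyword_hits(text_dir, keywords):
    """Return (file name, line number, line) for every line containing a keyword."""
    if not keywords:
        return []
    # One case-insensitive pattern that matches any of the keywords literally.
    pattern = re.compile(
        "|".join(re.escape(keyword) for keyword in keywords), re.IGNORECASE
    )
    hits = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        # OCR output can contain odd bytes, so replace anything undecodable.
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((path.name, line_number, line.strip()))
    return hits
```

Calling `find_keyword_hits("ocr_output", ["integrity", "ranking"])` would list every matching line along with its file and line number, where "ocr_output" is a placeholder directory name.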

Other (Better) Options

This OCR should be seen as a first step. Text files are generally a decent starting point because they allow for a wide range of follow-on analysis.

Other, better options also exist. For a comprehensive, self-contained analysis, those options will almost certainly be a better choice.

Want to help?

If you want to collaborate on this project, let me know!

