P2PaLA
Page to PAGE Layout Analysis (P2PaLA) is a toolkit for Document Layout Analysis based on Neural Networks.
If you find this toolkit useful in your research, please cite:
@misc{p2pala2017,
  author = {Lorenzo Quirós},
  title = {P2PaLA: Page to PAGE Layout Analysis toolkit},
  year = {2017},
  publisher = {GitHub},
  note = {GitHub repository},
  howpublished = {\url{https://github.com/lquirosd/P2PaLA}},
}
Check the paper on arXiv for more details.
Requirements
- Linux (OS X may work, but is untested).
- Python (2.7 or 3.6; a conda virtual environment is recommended)
- Numpy
- PyTorch (1.0). PyTorch 0.3.1 compatibility is available on this branch.
- OpenCV (3.4.5.20).
- NVIDIA GPU + CUDA CuDNN (CPU mode and CUDA without CuDNN work, but are not recommended for training).
- tensorboard-pytorch (v0.9) [Optional].
pip install tensorboardX
> A different conda env is recommended to keep TensorFlow separated from PyTorch
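Before training, it can be useful to confirm that PyTorch actually sees CUDA and CuDNN. This is a generic PyTorch sanity check, not part of P2PaLA:
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.backends.cudnn.is_available())"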
Install
python setup.py install
To install the Python dependencies alone, use the requirements file:
conda env create --file conda_requirements.yml
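Putting both steps together, one possible workflow is the following sketch. The environment name is an assumption; use whatever name is set in conda_requirements.yml:
conda env create --file conda_requirements.yml
conda activate P2PaLA   # hypothetical env name, check conda_requirements.yml
python setup.py install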
Usage
- Input data must follow the folder structure data_tag/page, where the images are placed in the data_tag folder and the corresponding PAGE-XML files in the page subfolder (a sketch for copying your own files into this layout follows the tree below). For example:
mkdir -p data/{train,val,test,prod}/page;
tree data;
data
├── prod
│   ├── page
│   │   ├── prod_0.xml
│   │   └── prod_1.xml
│   ├── prod_0.jpg
│   └── prod_1.jpg
├── test
│   ├── page
│   │   ├── test_0.xml
│   │   └── test_1.xml
│   ├── test_0.jpg
│   └── test_1.jpg
├── train
│   ├── page
│   │   ├── train_0.xml
│   │   └── train_1.xml
│   ├── train_0.jpg
│   └── train_1.jpg
└── val
    ├── page
    │   ├── val_0.xml
    │   └── val_1.xml
    ├── val_0.jpg
    └── val_1.jpg
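Note that each image and its PAGE-XML file share the same base name (e.g. train_0.jpg pairs with page/train_0.xml). A hypothetical example of copying your own data into this layout, with placeholder paths and file names:
cp /path/to/my/images/img_0.jpg data/train/
cp /path/to/my/page_xml/img_0.xml data/train/page/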
- Run the tool.
python P2PaLA.py --config config.txt --tr_data ./data/train --te_data ./data/test --log_comment "_foo"
- Pre-trained models are available here.
- Use TensorBoard to visualize train status:
tensorboard --logdir ./work/runs
- The resulting PAGE-XML files are stored at "./work/results/test/".
We recommend Transkribus or nw-page-editor to visualize and edit PAGE-XML files.
- For details about the arguments and the config file, see the docs or run:
python P2PaLA.py -h
- For more detailed examples, see egs.
License
GNU General Public License v3.0. See LICENSE for the full text.
Acknowledgments
Code is inspired by pix2pix and pytorch-CycleGAN-and-pix2pix.