BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents


BROS

Introduction

BROS (BERT Relying On Spatiality) is a pre-trained language model focusing on text and layout for better key information extraction from documents. Given the OCR results of a document image (pairs of text and bounding boxes), it can perform various key information extraction tasks, such as extracting an ordered item list from receipts. For more details, please refer to our paper:

BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents
Teakgyu Hong, Donghyun Kim, Mingi Ji, Wonseok Hwang, Daehyun Nam, Sungrae Park
AAAI 2022 (to appear)

Pre-trained models

name               | # params | Hugging Face - Models
bros-base-uncased  | < 110M   | naver-clova-ocr/bros-base-uncased
bros-large-uncased | < 340M   | naver-clova-ocr/bros-large-uncased

Model usage

The example code below is written with reference to LayoutLM.

import torch
from bros import BrosTokenizer, BrosModel


tokenizer = BrosTokenizer.from_pretrained("naver-clova-ocr/bros-base-uncased")
model = BrosModel.from_pretrained("naver-clova-ocr/bros-base-uncased")


width, height = 1280, 720

words = ["to", "the", "moon!"]
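# one quadrilateral per word: [x1, y1, x2, y2, x3, y3, x4, y4] corner coordinates
# in pixels (in this example: top-left, top-right, bottom-right, bottom-left)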
quads = [
    [638, 451, 863, 451, 863, 569, 638, 569],
    [877, 453, 1190, 455, 1190, 568, 876, 567],
    [632, 566, 1107, 566, 1107, 691, 632, 691],
]

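# repeat each word's quad once per subword token so bbox lines up with the tokenized sequence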
bbox = []
for word, quad in zip(words, quads):
    n_word_tokens = len(tokenizer.tokenize(word))
    bbox.extend([quad] * n_word_tokens)

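# special-token quads: all zeros for [CLS], the page's (width, height) corner repeated for [SEP]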
cls_quad = [0.0] * 8
sep_quad = [width, height] * 4
bbox = [cls_quad] + bbox + [sep_quad]

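# tokenize the joined words; the tokenizer adds [CLS] and [SEP], so the sequence length matches len(bbox)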
encoding = tokenizer(" ".join(words), return_tensors="pt")
input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]

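# normalize x coordinates by the page width and y coordinates by the page height so values fall in [0, 1]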
bbox = torch.tensor([bbox])
bbox[:, :, [0, 2, 4, 6]] = bbox[:, :, [0, 2, 4, 6]] / width
bbox[:, :, [1, 3, 5, 7]] = bbox[:, :, [1, 3, 5, 7]] / height

outputs = model(input_ids=input_ids, bbox=bbox, attention_mask=attention_mask)
last_hidden_state = outputs.last_hidden_state

print("- last_hidden_state")
print(last_hidden_state)
print()
print("- last_hidden_state.shape")
print(last_hidden_state.shape)

Result

- last_hidden_state
tensor([[[-0.0342,  0.2487, -0.2819,  ...,  0.1495,  0.0218,  0.0484],
         [ 0.0792, -0.0040, -0.0127,  ..., -0.0918,  0.0810,  0.0419],
         [ 0.0808, -0.0918,  0.0199,  ..., -0.0566,  0.0869, -0.1859],
         [ 0.0862,  0.0901,  0.0473,  ..., -0.1328,  0.0300, -0.1613],
         [-0.2925,  0.2539,  0.1348,  ...,  0.1988, -0.0148, -0.0982],
         [-0.4160,  0.2135, -0.0390,  ...,  0.6908, -0.2985,  0.1847]]],
       grad_fn=<NativeLayerNormBackward>)

- last_hidden_state.shape
torch.Size([1, 6, 768])
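
For a key information extraction task such as entity tagging, a token-classification head can sit on top of these hidden states. The sketch below is illustrative only: the linear classifier, num_labels, and the reuse of input_ids, bbox, and attention_mask from the example above are assumptions, not part of the BROS API; see docs/finetuning_examples.md for the heads actually used in fine-tuning.

import torch
from bros import BrosModel

# Hypothetical BIO-tagging head on top of BROS; num_labels and the linear
# classifier below are illustrative, not part of the BROS repository API.
model = BrosModel.from_pretrained("naver-clova-ocr/bros-base-uncased")
num_labels = 5  # e.g. B/I tags for two field types plus "O"
classifier = torch.nn.Linear(768, num_labels)  # 768 = hidden size of bros-base-uncased

# input_ids, bbox, and attention_mask are assumed to be prepared as in the
# usage example above.
outputs = model(input_ids=input_ids, bbox=bbox, attention_mask=attention_mask)
logits = classifier(outputs.last_hidden_state)  # (batch, seq_len, num_labels)
predictions = logits.argmax(dim=-1)             # predicted tag id per token

During fine-tuning, such logits would be compared against per-token labels with a cross-entropy loss.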


Fine-tuning examples

Please refer to docs/finetuning_examples.md.

Acknowledgements

We referenced the code of LayoutLM when implementing BROS in the form of a Hugging Face Transformers model.
In this repository, we used two public benchmark datasets, FUNSD and SROIE.

License

Copyright 2022-present NAVER Corp.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Comments
  • End2end EE and EL

    Hi, first of all thanks for the code, that's a great contribution to the community! From the paper I understood that the model could be fine-tuned end2end for EE and EL at the same time; however, looking at the code I think it does not do it like that, right? Is combined end2end EE and EL supported somehow?

    Thanks,

    opened by ealmazanm 2
  • How to solve lr = 0 after training 5 epochs

    Thank you for the amazing work! I am training the model with a customized dataset. However, I noticed that after 5 epochs of training the learning rate dropped to 0, which makes it hard for the model to keep learning. Could you please point me to the learning rate schedule of BROS, and how can I change it for my case? Thanks!

    TRAIN [epoch: 0/50] || train_loss: 460.69653 || lr: 4e-05 || time: 193.6 secs. precision: 0.9080, recall: 0.9023, f1: 0.9052
    TRAIN [epoch: 1/50] || train_loss: 129.8502 || lr: 3e-05 || time: 198.6 secs. precision: 0.9374, recall: 0.9184, f1: 0.9278
    TRAIN [epoch: 2/50] || train_loss: 75.951 || lr: 3e-05 || time: 198.0 secs. precision: 0.9293, recall: 0.9183, f1: 0.9237
    TRAIN [epoch: 3/50] || train_loss: 46.87292 || lr: 2e-05 || time: 197.8 secs. precision: 0.9442, recall: 0.9391, f1: 0.9416
    TRAIN [epoch: 4/50] || train_loss: 28.64673 || lr: 1e-05 || time: 197.7 secs. precision: 0.9444, recall: 0.9392, f1: 0.9418
    TRAIN [epoch: 5/50] || train_loss: 16.82515 || lr: 0.0 || time: 197.6 secs.

    opened by kkkris7 1
  • Code release date

    Hi. I've seen that the model has been uploaded to the Hugging Face hub (https://huggingface.co/naver-clova-ocr/bros-base-uncased), but without any information in the model card. Just wondering when you are planning to upload the code to this repo. Thanks in advance,

    opened by davidjimenezphd 1
  • RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED

    RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

    This error occurs while training the model with CUDA_VISIBLE_DEVICES=0 python3 train.py --config=configs/custom.yaml

    opened by deepanshudashora 0
  • F-score on CORD dataset

    Thanks for the excellent work! I am trying to reproduce the results on the CORD dataset. However, I find the f-score results in your paper are somewhat different from those in the LayoutLMv2 paper. Specifically, LayoutLMv2*-base achieves 96.05 and LayoutLMv2*-large achieves 97.24 in your paper, while in the LayoutLMv2 paper, LayoutLMv2-base achieves 94.95 and LayoutLMv2-large achieves 96.01. Could you give an example of BROS fine-tuning on the CORD dataset? Thanks!

    opened by taosong2019 0
  • TorchText Issue on Google Colab

    Hello,

    I am trying to run the fine tuning scripts for FUNSD on Google Colab; I have installed all the required dependencies in requirements.txt, but when running

    !CUDA_VISIBLE_DEVICES=0 python train.py --config=configs/finetune_funsd_ee_bies.yaml

    I am getting

    OSError: /usr/local/lib/python3.8/dist-packages/torchtext/lib/libtorchtext.so: undefined symbol: _ZN5torch6detail10class_baseC2ERKSsS3_SsRKSt9type_infoS6_

    I have tried installing torchtext and upgraded pytorch lightning correspondingly, to no avail.

    Any ideas on what could be going on?

    Thanks!

    opened by harshkaria 0
  • Bounding box clarification

    Thanks for contributing this awesome piece of research!

    Quick question about the input boxes.

    1. For bros, is the expected format [x1, y1, x2, y2, x3, y3, x4, y4], where each x,y pair is the corners of the bounding box, starting from the top left and clockwise?

    2. Each bounding box should be normalized by dividing x values by width and y values by height?

    I'm training on DocVQA, but the results are not that great. Just trying to make sure I'm doing everything right :)

    opened by logan-markewich 0
  • The dataset for CORD linking task

    Hello, I am interested in this great work. However, I am a little bit confused about the linking task in CORD. Is the entity with category "menu.nm" linked to all the other entities within the same group? Besides, do you use "is_key" to split a valid line (often at the bottom of an image) into 2 entities and then generate a link between them?

    opened by ccx1997 0
  • Clarification on table 5

    Hi there, first of all thanks for sharing your excellent work. I have a question about how you get the results in Table 5. In the paper you mention that you don't use the order information, but how do you implement that exactly? Do you remove the 1D absolute positional embeddings from the model? If so, does that require new pre-training? And finally, I guess you still train with the dataset order of the words and only shuffle the words at test time, is that right?

    Thanks in advance!

    opened by ealmazanm 0
  • Clarification regarding `num_samples_per_epoch`

    Could you guys please clarify whether num_samples_per_epoch in the config files refers to the total number of documents in the training set or does it mean something else?

    I set the num_samples_per_epoch to the number of docs in my training set, however, the LRScheduler warmup is not working as expected.

    opened by suyogdahal 1
Owner
Clova AI Research
Open source repository of Clova AI Research, NAVER & LINE
Python implementation of TextRank for phrase extraction and summarization of text documents

PyTextRank PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension, used to: extract the top-ranked phrases from text document

derwen.ai 1.4k Feb 17, 2021
BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.

Benjamin Heinzerling 1.1k Jan 3, 2023
Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT-Implementation In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages. We are interest

Tanuj Sur 4 Jul 1, 2022
Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Alexander Veysov 3.2k Dec 31, 2022
DziriBERT: a Pre-trained Language Model for the Algerian Dialect

DziriBERT is the first Transformer-based Language Model that has been pre-trained specifically for the Algerian Dialect.

null 117 Jan 7, 2023
Use PaddlePaddle to reproduce the paper: mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

MT5_paddle Use PaddlePaddle to reproduce the paper: mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer English | Simplified Chinese mT5: A Massively

null 2 Oct 17, 2021
PyTorch Implementation of "Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging" (Findings of ACL 2022)

Feature_CRF_AE Feature_CRF_AE provides an implementation of Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging

Jacob Zhou 6 Apr 29, 2022
Blue Brain text mining toolbox for semantic search and structured information extraction

Blue Brain Search Source Code DOI Data & Models DOI Documentation Latest Release Python Versions License Build Status Static Typing Code Style Securit

The Blue Brain Project 29 Dec 1, 2022
Must-read papers on improving efficiency for pre-trained language models.

Must-read papers on improving efficiency for pre-trained language models.

Tobias Lee 89 Jan 3, 2023
The repository for the paper: Multilingual Translation via Grafting Pre-trained Language Models

Graformer The repository for the paper: Multilingual Translation via Grafting Pre-trained Language Models Graformer (also named BridgeTransformer in t

null 22 Dec 14, 2022
Prompt-learning is the latest paradigm to adapt pre-trained language models (PLMs) to downstream NLP tasks

Prompt-learning is the latest paradigm to adapt pre-trained language models (PLMs) to downstream NLP tasks, which modifies the input text with a textual template and directly uses PLMs to conduct pre-trained tasks. This library provides a standard, flexible and extensible framework to deploy the prompt-learning pipeline. OpenPrompt supports loading PLMs directly from huggingface transformers. In the future, we will also support PLMs implemented by other libraries.

THUNLP 2.3k Jan 8, 2023
Chinese Pre-Trained Language Models (CPM-LM) Version-I

CPM-Generate: To promote the development of Chinese natural language processing research, this project provides the text generation code for the CPM-LM (2.6B) model, which can be used for local text-generation testing and as a basis for further research on scenarios such as zero-shot/few-shot learning. [Project homepage] [Model download] [Technical report] If you want to run inference with CPM-1, we recommend the efficient inference tool BMI

Tsinghua AI 1.4k Jan 3, 2023
Guide to using pre-trained large language models of source code

Large Models of Source Code I occasionally train and publicly release large neural language models on programs, including PolyCoder. Here, I describe

Vincent Hellendoorn 947 Dec 28, 2022
Google and Stanford University released a new pre-trained model called ELECTRA

Google and Stanford University released a new pre-trained model called ELECTRA, which has a much more compact model size and relatively competitive performance compared to BERT and its variants. To further accelerate research on Chinese pre-trained models, the Joint Laboratory of HIT and iFLYTEK Research (HFL) has released Chinese ELECTRA models based on the official code of ELECTRA. ELECTRA-small can reach similar or even higher scores on several NLP tasks with only 1/10 of the parameters compared to BERT and its variants.

Yiming Cui 1.2k Dec 30, 2022
TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset.

TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset. TunBERT was applied to three NLP downstream tasks: Sentiment Analysis (SA), Tunisian Dialect Identification (TDI) and Reading Comprehension Question-Answering (RCQA)

InstaDeep Ltd 72 Dec 9, 2022
Code for CodeT5: a new code-aware pre-trained encoder-decoder model.

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation This is the official PyTorch implementation

Salesforce 564 Jan 8, 2023
ElasticBERT: A pre-trained model with multi-exit transformer architecture.

This repository contains finetuning code and checkpoints for ElasticBERT. Towards Efficient NLP: A Standard Evaluation and A Strong Baseli

fastNLP 48 Dec 14, 2022