Using Amazon Textract as OCR for Data Extraction into DynamoDB

hugoportela

Last update: Jan 19, 2022


dio-live-textract2

Code repository for the live coding session of 5 October 2021 on extracting structured data with Amazon Textract and saving it to a database.

Services used

Amazon Textract
AWS Lambda
Amazon S3
Amazon DynamoDB

Development

Creating a bucket in Amazon S3

S3 Console -> Create bucket -> Bucket name "dio-live-input-data" -> Keep the default settings -> Create bucket
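
The same bucket can also be created from a script. The following is a minimal sketch, not part of the original repository, assuming boto3 is installed and AWS credentials are configured. The helper names bucket_request and create_input_bucket are hypothetical; create_bucket and its LocationConstraint setting are the standard boto3 S3 client API.

```python
def bucket_request(bucket_name, region):
    """Build the keyword arguments for the S3 create_bucket call."""
    params = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a constraint.
    if region != "us-east-1":
        params["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return params


def create_input_bucket(s3_client, bucket_name, region):
    """Create the bucket with default settings using the given S3 client."""
    return s3_client.create_bucket(**bucket_request(bucket_name, region))


# Example call (needs boto3 and credentials; the region is an assumption):
#   import boto3
#   s3 = boto3.client("s3", region_name="us-east-1")
#   create_input_bucket(s3, "dio-live-input-data", "us-east-1")
```

Bucket names are unique across all AWS accounts, so "dio-live-input-data" may already be taken; in that case pick another name and use it in the later steps.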

Processing images in Amazon Textract

Textract Console -> Select Document -> Analyze Document -> Tables
Download results -> Save the .zip file
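
The console steps above have an API counterpart: analyze_document with FeatureTypes=["TABLES"] returns a list of blocks in which TABLE blocks point to CELL blocks and CELL blocks point to WORD blocks. The sketch below is an illustration and not code from this repository; it rebuilds the rows of each table from such a response. The function name tables_from_blocks is hypothetical.

```python
def tables_from_blocks(blocks):
    """Rebuild each TABLE block as a list of rows of cell text."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # CHILD relationships link a table to its cells and a cell to its words.
        return [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]

    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = {}
        for cell_id in child_ids(table):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[word_id]["Text"]
                for word_id in child_ids(cell)
                if by_id[word_id]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based positions inside the table.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((row for row, _ in cells), default=0)
        n_cols = max((col for _, col in cells), default=0)
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
             for r in range(1, n_rows + 1)]
        )
    return tables


# Example call (needs boto3 and credentials; "invoice.png" is a placeholder):
#   import boto3
#   with open("invoice.png", "rb") as image:
#       response = boto3.client("textract").analyze_document(
#           Document={"Bytes": image.read()}, FeatureTypes=["TABLES"])
#   rows = tables_from_blocks(response["Blocks"])[0]
```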

Creating a table in DynamoDB

DynamoDB Console -> Tables -> Create Table -> Partition key "cod" -> Create table
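
For reference, a table with the same partition key can be described in code. This is a sketch under assumptions the source does not state: the key "cod" is taken to be a string, on-demand billing is chosen for simplicity, and the table name in the example call is a placeholder. The helper table_request is hypothetical; the resulting arguments follow the boto3 create_table API.

```python
def table_request(table_name, partition_key="cod"):
    """Build the keyword arguments for the DynamoDB create_table call."""
    return {
        "TableName": table_name,
        # Assumption: the partition key holds strings ("S").
        "AttributeDefinitions": [
            {"AttributeName": partition_key, "AttributeType": "S"}
        ],
        # HASH marks the attribute as the partition key.
        "KeySchema": [{"AttributeName": partition_key, "KeyType": "HASH"}],
        # Assumption: on-demand capacity, so no throughput values are needed.
        "BillingMode": "PAY_PER_REQUEST",
    }


# Example call (needs boto3 and credentials; the table name is a placeholder):
#   import boto3
#   boto3.client("dynamodb").create_table(**table_request("your-table-name"))
```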

Implementing the Lambda function

Lambda Console -> Functions -> Create function
Use a blueprint -> "s3-get-object-python"
Function name "dio-live-csv-to-db"
Execution role -> "Create a new role from AWS policy templates" -> Role name "S3ToDynamoDBRole"
S3 Trigger -> Bucket created earlier
Create function
Replace the generated code with the code in the /src folder of this repository (Note: the table name in the code must be replaced with the name of your own table)
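
The actual function code lives in the /src folder and is not reproduced here. To show the general shape of such a function, here is an independent sketch that reads the uploaded CSV from S3 and writes one DynamoDB item per row. It assumes the first CSV row holds the column names and that one of them is "cod"; a file exported by Textract may need cleaning before it matches that shape. TABLE_NAME, rows_to_items and process_event are hypothetical names.

```python
import csv
import io
import urllib.parse

TABLE_NAME = "your-table-name"  # assumption: replace with your table's name


def rows_to_items(csv_text):
    """Turn CSV text with a header row into one dict per data row."""
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key  # skips the None key DictReader uses for surplus columns
        }
        if item.get("cod"):  # the partition key must be present and non-empty
            items.append(item)
    return items


def process_event(event, s3_client, table):
    """Load the object named in the S3 event and store its rows in the table."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(record["object"]["key"])
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    items = rows_to_items(body.decode("utf-8"))
    for item in items:
        table.put_item(Item=item)
    return len(items)


def lambda_handler(event, context):
    import boto3  # provided by the Lambda runtime or by an attached layer

    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    return process_event(event, boto3.client("s3"), table)
```

Keeping the parsing in rows_to_items, separate from the AWS calls, makes it possible to check the CSV handling locally before deploying.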

Additional step: Creating a layer with the Python boto3 library

Lambda Console -> Additional Resources -> Layers
Name "boto3_layer" -> Upload a .zip file -> download and upload the .zip file found in the /src folder of this repository
Compatible architecture "x86_64"
Compatible runtimes "Python 3.7" (Python 3.7 is required for compatibility with the version of the blueprint used)
Create
In the Lambda function you created -> Select Layers in the diagram -> Add layer -> Custom layers "boto3_layer" -> Version 1 -> Add
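
The repository already ships a ready-made .zip for the layer. If you prefer to build your own, Lambda expects Python packages in a layer to sit under a top-level python/ folder. The sketch below assumes boto3 was first installed into a local folder (for example with pip install boto3 -t followed by that folder's path) and packs it with the expected layout. build_layer_zip is a hypothetical helper.

```python
import os
import zipfile


def build_layer_zip(package_dir, zip_path):
    """Zip package_dir so every file sits under the top-level python/ folder."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                full_path = os.path.join(root, name)
                relative = os.path.relpath(full_path, package_dir)
                # Prefix each entry with python/ so Lambda adds it to sys.path.
                archive.write(full_path, os.path.join("python", relative))
    return zip_path


# Example call (both paths are placeholders; keep the .zip outside package_dir):
#   build_layer_zip("layer_packages", "boto3_layer.zip")
```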

Configuring Lambda permissions for DynamoDB

Lambda Console -> Functions -> Select the function you created -> Configuration -> Permissions -> Execution role -> Open the role created in Amazon IAM
In IAM -> Permissions -> Add inline policy -> Choose a service "DynamoDB" -> Write "PutItem"
Resources -> Select the ARN of your table -> Select your region -> Add -> Review policy -> Name "LambdaDynamoDBPolicy" -> Create policy
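
The inline policy built in the console corresponds to a small JSON document that allows only PutItem on one table. A sketch of that document is shown below; the region, account ID and table name passed in are placeholders, and put_item_policy is a hypothetical helper.

```python
import json


def put_item_policy(region, account_id, table_name):
    """Return an IAM policy document allowing PutItem on a single table."""
    table_arn = f"arn:aws:dynamodb:{region}:{account_id}:table/{table_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "dynamodb:PutItem",
                "Resource": table_arn,
            }
        ],
    }


# Example call (needs boto3 and credentials; all argument values are placeholders):
#   import boto3
#   policy = put_item_policy("us-east-1", "123456789012", "your-table-name")
#   boto3.client("iam").put_role_policy(
#       RoleName="S3ToDynamoDBRole",
#       PolicyName="LambdaDynamoDBPolicy",
#       PolicyDocument=json.dumps(policy))
```

Restricting the resource to the table's ARN, rather than "*", keeps the function from writing to any other table in the account.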

Using the application

In Amazon Textract

Amazon Textract Console -> Select Document -> Choose file -> Browse to the file to be analyzed
Download results

In Amazon S3

Extract the table_1.csv file from the archive downloaded from Amazon Textract
Open the bucket created earlier -> Upload -> Select the table_1.csv file -> Upload
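
The upload can also be done from a script, which triggers the Lambda function in the same way as a console upload. This is a minimal sketch assuming boto3 and configured credentials; upload_csv is a hypothetical helper around the boto3 upload_file call.

```python
import os


def upload_csv(s3_client, path, bucket_name):
    """Upload a local file to the bucket, using its file name as the key."""
    key = os.path.basename(path)
    s3_client.upload_file(path, bucket_name, key)
    return key


# Example call (needs boto3 and credentials):
#   import boto3
#   upload_csv(boto3.client("s3"), "table_1.csv", "dio-live-input-data")
```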

In DynamoDB

Tables -> Open the table you created -> View Items
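
To check the result without the console, the table can be read back with a scan. The sketch below follows pagination through LastEvaluatedKey, since a single scan call returns at most one page of results. scan_all is a hypothetical helper and the table name in the example call is a placeholder.

```python
def scan_all(table):
    """Return every item in the table, following scan pagination."""
    response = table.scan()
    items = list(response.get("Items", []))
    # LastEvaluatedKey is present while more pages remain.
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response.get("Items", []))
    return items


# Example call (needs boto3 and credentials):
#   import boto3
#   table = boto3.resource("dynamodb").Table("your-table-name")
#   for item in scan_all(table):
#       print(item)
```

A full scan reads the whole table, which is fine for a demo of this size but costly on large tables.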
