Let's create a tool to convert Thailand budget from PDF to CSV.

Kao.Geek

Last update: Dec 19, 2022

Overview

thailand-budget-pdf2csv

Let's create a tool to convert Thailand Government Budgeting from PDF to CSV!

Devs join forces to convert the budget from PDF to a machine-readable format,

so that the national budget becomes easier to scrutinize.

Usage

PDF -> TXT

The results and source code of each approach are in the ./txt-extraction folder. Alternatively, download the output files directly from the shortcut links below:

tee4cute-gcloud-vision: Google Drive folder.

TXT -> CSV

The results and source code of each approach are in the ./csv-extraction folder. Alternatively, download the output files directly from the shortcut links below:

napatswift-coordintes: Google Drive folder.

Translations

English version

napatswift-coordintes (partially translated using Google Translation API): Google Sheet, see @asiripanich's repo for code.

Let's Code!

Download the source budget PDF files from budget-pdf (the white books with a red band, "เล่มขาวคาดแดง") and work your magic to generate output CSV files in the expected format below:

Expected Output Format (V2)

| Field Name | Formal Thai Name | Data Type / Format | Description | Since Version |
|---|---|---|---|---|
| `ITEM_ID` | - | str / [`REF_DOC`].[RUNNING_NO] | Unique ID of each row. `REF_DOC`: see the `REF_DOC` field. RUNNING_NO: the running number of the row within that budget volume (PDF) file. | v1 |
| `REF_DOC` | - | str / [FY].[ISSUE].[VOLUME] | Document number of the budget volume (PDF). [FY] = fiscal year of the budget volume, [ISSUE] = issue number (ฉบับที่), [VOLUME] = volume number (เล่มที่; some volumes also carry a parenthesized suffix). | v1 |
| `REF_PAGE_NO` | - | int | Document page number printed in the page header for that row. (Caution: in almost every case the document page number is not the PDF page number.) | v1 |
| `MINISTRY` | กระทรวง/หน่วยงานเทียบเท่ากระทรวง (ministry or ministry-equivalent agency) | str | | v1 |
| `BUDGETARY_UNIT` | หน่วยรับงบประมาณ (budget-receiving unit) | str | Mostly departments or department-equivalent agencies. | v1 |
| `CROSS_FUNC?` | | bool | Whether the row (budget) falls under an integrated plan. An integrated plan is a plan whose name begins with "แผนงานบูรณาการ". See: `BUDGET_PLAN` | v1 |
| `BUDGET_PLAN` | แผนงาน (plan) | str | Plan name as defined under the Budget Procedures Act (พ.ร.บ.วิธีการงบประมาณฯ). | v1 |
| `OUTPUT` | ผลผลิต (output) | str | A plan contains `0-n` outputs/projects. Each row belongs to either 1 output `XOR` 1 project. | v1 |
| `PROJECT` | โครงการ (project) | str | A plan contains `0-n` outputs/projects. Each row belongs to either 1 output `XOR` 1 project. | v1 |
| `CATEGORY_LV1` | งบรายจ่าย (expenditure category) | str | `level-1` expenditure category. Only งบบุคลากร (personnel), งบดำเนินงาน (operations), งบลงทุน (investment), งบเงินอุดหนุน (subsidies), and งบรายจ่ายอื่น (other expenditure) are allowed, except under "งบกลาง" (central fund), which may contain other items. | v1 |
| `CATEGORY_LV2` | งบรายจ่าย (expenditure category) | str | `level-2` expenditure category. In the PDF it appears on line items prefixed with an ordered-list number in the format `x.y.z`. | v1 |
| `CATEGORY_LV3` | งบรายจ่าย (expenditure category) | str | `level-3` expenditure category. In the PDF it appears on line items prefixed with an ordered-list number in the format `x.y.z`. | v1 |
| `CATEGORY_LV4` | งบรายจ่าย (expenditure category) | str | `level-4` expenditure category. In the PDF it appears on line items prefixed with an ordered-list number in the format `x.y.z`. | v1 |
| `CATEGORY_LV5` | งบรายจ่าย (expenditure category) | str | `level-5` expenditure category. In the PDF it appears on line items prefixed with an ordered-list number in the format `x.y.z`. | v1 |
| `CATEGORY_LV6` | งบรายจ่าย (expenditure category) | str | `level-6` expenditure category. In the PDF it appears on line items prefixed with an ordered-list number in the format `x.y.z`. | v1 |
| `ITEM_DESCRIPTION` | - | str | Item name. In the PDF it appears on line items prefixed with an ordered-list number in the format `(x)`. Some rows may have no `ITEM_DESCRIPTION`. | v1 |
| `FISCAL_YEAR` | ปีงบประมาณ (fiscal year) | str / Gregorian year (C.E.) | A single line item may produce several rows if it is a multi-year obligated budget item (งบผูกพัน). | v1 |
| `AMOUNT` | - | float | Budget amount. | v1 |
| `OBLIGED?` | - | bool | TRUE only when the line item has rows for more than one `FISCAL_YEAR`. | v1 |
| `DEBUG_LOG` | - | str | Log message reporting errors that occurred while extracting that row. | v2 |
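
The `ITEM_ID` and `REF_DOC` formats above can be taken apart with plain string splitting. The following is only a minimal sketch: the helper name `split_item_id` and the sample ID `2022.3.1.42` are hypothetical, and it assumes the parenthesized volume suffix (if present) contains no dots.

```python
def split_item_id(item_id: str) -> dict:
    """Split an ITEM_ID of the form [FY].[ISSUE].[VOLUME].[RUNNING_NO]."""
    # The last dot separates REF_DOC from RUNNING_NO.
    ref_doc, running_no = item_id.rsplit(".", 1)
    # REF_DOC itself is [FY].[ISSUE].[VOLUME]; the volume may carry a suffix.
    fiscal_year, issue, volume = ref_doc.split(".", 2)
    return {
        "REF_DOC": ref_doc,
        "FY": fiscal_year,
        "ISSUE": issue,
        "VOLUME": volume,
        "RUNNING_NO": int(running_no),
    }


# Hypothetical ID, used for illustration only.
parts = split_item_id("2022.3.1.42")
print(parts["REF_DOC"], parts["RUNNING_NO"])
```

Splitting from the right keeps the sketch independent of how the volume part is written, as long as the running number is always the last dot-separated section.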

Note: Please see output example in output_example_vx.xlsx and output_example_vx.csv at repository root.

Release Notes

29 Jul 2021

Send messages to DEBUG_LOG to clearly inform users where an error originated: Syntactic Error or OCR Error.
- Invalid CATEGORY_LV1 values will be reported in DEBUG_LOG as follows: "CATEGORY_LV1 is not as described". issue#15-comment
- Invalid AMOUNT values will be reported in DEBUG_LOG as follows: "AMOUNT FORMAT IS WRONG".
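
As a rough illustration of the two checks above, a row validator might look like the sketch below. The function name `debug_messages` and the dict-based row are assumptions, not the project's actual code, and the "งบกลาง" exception for `CATEGORY_LV1` is deliberately not handled here.

```python
# Level-1 categories listed in the expected output format.
ALLOWED_LV1 = {
    "งบบุคลากร",
    "งบดำเนินงาน",
    "งบลงทุน",
    "งบเงินอุดหนุน",
    "งบรายจ่ายอื่น",
}


def debug_messages(row: dict) -> list:
    """Return the DEBUG_LOG messages that apply to one extracted row."""
    messages = []
    if row.get("CATEGORY_LV1") not in ALLOWED_LV1:
        messages.append("CATEGORY_LV1 is not as described")
    try:
        # Amounts may carry thousands separators, e.g. "918,500".
        float(str(row.get("AMOUNT", "")).replace(",", ""))
    except ValueError:
        messages.append("AMOUNT FORMAT IS WRONG")
    return messages


print(debug_messages({"CATEGORY_LV1": "งบลงทุน", "AMOUNT": "918,500"}))
```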

25 Jul 2021

Fix some of the Syntactic Errors reported in issue#15.
Fix the Compiler Error that produced a wrong AMOUNT for obliged items written in the "XXXX - YYYY ZZZZ บาท" format.
- For example, if the obliged entry is written as "2562 - 2564 30,000,000 บาท", the output will be:
```
  2562    10,000,000
  2563    10,000,000
  2564    10,000,000
```
  instead of
```
  2562    30,000,000
  2563    30,000,000
  2564    30,000,000
```
Send the OCR Errors reported in issue#11 to DEBUG_LOG to make it clear that they originate from the OCR tool and need to be cleaned by hand.
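
The obliged-item fix from 25 Jul 2021 can be sketched as follows. This is an assumption-laden illustration, not the project's parser: the regular expression, the function name `split_obliged`, and the even split with no remainder handling are all hypothetical, chosen to reproduce the example in the release note.

```python
import re

# Matches entries such as "2562 - 2564 30,000,000 บาท".
OBLIGED_RE = re.compile(r"(\d{4})\s*-\s*(\d{4})\s+([\d,]+)\s*บาท")


def split_obliged(entry: str) -> list:
    """Spread the total of an obliged entry evenly over its fiscal years."""
    match = OBLIGED_RE.search(entry)
    if match is None:
        raise ValueError(f"not an obliged entry: {entry!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if end < start:
        raise ValueError(f"year range is reversed: {entry!r}")
    total = float(match.group(3).replace(",", ""))
    years = range(start, end + 1)
    share = total / len(years)
    return [(year, share) for year in years]


for year, amount in split_obliged("2562 - 2564 30,000,000 บาท"):
    print(year, f"{amount:,.0f}")
```

The years are kept exactly as written in the entry; converting them to the Gregorian years used by `FISCAL_YEAR` would be a separate step.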

21 Jul 2021

First version release
You can download the first version in CSV format here.

Powered by This Dataset

- Budget Overview by korlan rayong: https://public.tableau.com/app/profile/korlan.rayong2953/viz/OverviewBudget65/Dashboard1
- 2022 Thai Budget Structure by Thanawit Prasongpongchai. Visualization: https://taepras.github.io/thaibudget65 Repository: https://github.com/taepras/thaibudget65

Talk

"ก้าวGeek Community", Line Group: http://line.me/ti/g/STUxfMX87U

Comments

Load CSV data via COPY command
Proposed changes:

Use the init.sql script when spinning the container up via docker compose up -d.

Use the SQL COPY command to load CSV data into the declared table. This serves the "Import CSV data with a single command" use case.
opened by talerngpong 5
Add a module to setup local database
This module allows you to start a local database, import the CSV data into it, and run queries to look for more insights.

Feature added:

[x] Starting local DB

[x] Connect to local db with a single command

[x] Import CSV data with a single command

I will convert this to a real PR when the import script is done.
opened by vtno 2
Adding coordinates approach result

This is an update on the coordinates approach and its results. I edited the original OCR JSON because it had many misplaced coordinates and missing text.

Still in the middle of fixing the result in 3.15 - 3.17 files but I wanted to share the results as soon as possible so others can review it.

opened by napatswift 2
Clarification on meta data for "ITEM_ID"

As the format of ITEM_ID is xxxx.x.x.xx, can you make it clear in the metadata which section refers to REF_DOC and which one refers to RUNNING_NO? Or, if possible, clarify what each section means. Thx.

opened by korlan-rayong 1
Add an English version

I translated most of the character columns to English using Google Translation API and uploaded the translated data to Google Sheet. Feel free to merge if you think this might be useful to others as well. :)

opened by asiripanich 1
Translate the budget data to english

MANY thanks to the team for all the hard work!

Do you think we can programmatically translate the data to English too? :)

@nitikornbunya Maybe with LONGDO Dict? :)

opened by asiripanich 1
Parsing PDF into HTML

PDF is not really a programming-friendly format. HTML would make the data easier to parse and would let more people, with expertise in different programming languages, take part.

I'm using this library to do the parsing: https://github.com/pdf2htmlEX/pdf2htmlEX

It is quite simple to use as they provide the docker image. See example here

There is a minor unicode issue that shifted า to ำ in some places but it is deterministic and should be easily fixed.

opened by vtno 1
Use `init.sql` script for CSV ETL
Proposed changes:

Use the init.sql script when spinning the container up via docker compose up -d. That script performs the ETL from the CSV file into the containerized Postgres database.

Make output_example_v2.csv more machine-friendly by removing unrealistic lines.

Note

Even though this PR is similar to #20, no real changes were applied there, so this PR targets the main branch instead.
opened by talerngpong 0
[Update] Add references to visualizations powered by this dataset in README.md
Add references to visualizations powered by this dataset in README.md including:

Thanawit Prasongpongchai's https://taepras.github.io/thaibudget65/#/

korlan rayong's https://public.tableau.com/app/profile/korlan.rayong2953/viz/OverviewBudget65/Dashboard1
opened by taepras 0
[Update] Adding log message and fixing some bug

Removed stray single Thai characters introduced by OCR mistakes.

โครงการพัฒนาครูผู้สอนวิทยาศาสตร์ ข ข คณิตศาสตร์ และ เทคโนโลยี to โครงการพัฒนาครูผู้สอนวิทยาศาสตร์ คณิตศาสตร์ และ เทคโนโลยี.

โครงการการพัฒนาอุตสาหกรรมการท่องเที่ยวอย่างยั่งยืน ญ 4 to โครงการการพัฒนาอุตสาหกรรมการท่องเที่ยวอย่างยั่งยืน 4

Fixed bug in first document. Added log message as #14 reported.
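
A cleanup like the one described above could be sketched as below. This is a guess at the idea, not the code in this PR: the function name is hypothetical, and it assumes that any whitespace-separated token consisting of a single Thai consonant (U+0E01 to U+0E2E) is OCR noise, which would also drop legitimate one-letter words such as "ณ".

```python
def drop_stray_thai_consonants(text: str) -> str:
    """Remove tokens that are a single Thai consonant, then rejoin with spaces."""
    kept = [
        token
        for token in text.split()
        # U+0E01 (ก) .. U+0E2E (ฮ) covers the Thai consonants.
        if not (len(token) == 1 and "\u0e01" <= token <= "\u0e2e")
    ]
    return " ".join(kept)


print(drop_stray_thai_consonants(
    "โครงการพัฒนาครูผู้สอนวิทยาศาสตร์ ข ข คณิตศาสตร์ และ เทคโนโลยี"
))
```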

opened by napatswift 0
Coordinates results update
The leaf node of each entry becomes its item description. For example, given:

1. งบเงินอุดหนุน 47,034,100 บาท
1.1 เงินอุดหนุนทั่วไป 38,887,100 บาท
1) ค่าใช้จ่ายบุคลากร 918,500 บาท
2) ค่าใช้จ่ายดำเนินงาน 23,870,700 บาท
3) เงินอุดหนุนดำเนินการตามอำนาจหน้าที่และภารกิจถ่ายโอน 14,097,900 บาท
1.2 เงินอุดหนุนเฉพาะกิจ 8,147,000 บาท

the result will be:

|CATEGORY_LV1|CATEGORY_LV2|CATEGORY_LV3|...|ITEM_DESCRIPTION|FISCAL_YEAR|AMOUNT|
|---|---|---|---|---|---|---|
|งบเงินอุดหนุน|เงินอุดหนุนทั่วไป|| |ค่าใช้จ่ายบุคลากร|2022|918,500|
|งบเงินอุดหนุน|เงินอุดหนุนทั่วไป|| |ค่าใช้จ่ายดำเนินงาน|2022|23,870,700|
|งบเงินอุดหนุน|เงินอุดหนุนทั่วไป|| |เงินอุดหนุนดำเนินการตามอำนาจหน้าที่และภารกิจถ่ายโอน|2022|14,097,900|
|งบเงินอุดหนุน||||เงินอุดหนุนเฉพาะกิจ|2022|8,147,000|
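
The leaf-node rule can be illustrated with a small tree walk. This is a sketch under assumptions, not the coordinates approach itself: the `(name, amount, children)` tuple shape and the function name `leaf_rows` are hypothetical, and the tree is typed in by hand from the example rather than parsed from OCR output.

```python
def leaf_rows(node, path=()):
    """Yield one row per leaf; ancestors become the category levels."""
    name, amount, children = node
    if not children:
        yield {
            "categories": list(path),
            "ITEM_DESCRIPTION": name,
            "AMOUNT": amount,
        }
        return
    for child in children:
        yield from leaf_rows(child, path + (name,))


# Hand-built tree for the example entry.
tree = ("งบเงินอุดหนุน", 47034100, [
    ("เงินอุดหนุนทั่วไป", 38887100, [
        ("ค่าใช้จ่ายบุคลากร", 918500, []),
        ("ค่าใช้จ่ายดำเนินงาน", 23870700, []),
        ("เงินอุดหนุนดำเนินการตามอำนาจหน้าที่และภารกิจถ่ายโอน", 14097900, []),
    ]),
    ("เงินอุดหนุนเฉพาะกิจ", 8147000, []),
])

rows = list(leaf_rows(tree))
for row in rows:
    print(row["categories"], row["ITEM_DESCRIPTION"], row["AMOUNT"])
```

Because only leaves carry amounts into the output, the leaf amounts should add up to the total of the root entry, which gives a cheap consistency check on the extracted rows.
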
opened by napatswift 0
CATEGORY_LV1 is not as described in README.md
Hello there. According to README.md, the level-1 expenditure category (CATEGORY_LV1) "consists only of งบบุคลากร, งบดำเนินงาน, งบลงทุน, งบเงินอุดหนุน, งบรายจ่ายอื่น". But this is what I got when using Python pandas:

df.groupby(["CATEGORY_LV1"])[["ITEM_ID"]].agg(["count"])

So, I would like to state the issues here and hope that it will be fixed soon
opened by JanYanisa 4
Let's geocode the data!

While the team and the kaogeek community are working together to make the data more accurate, I would like to continue to add more value to the data.

Would be great if we can extract all addresses from the data, and MAYBE even geocode them. For geocoding, we can use Google Geocoding API (again) to do this. Happy to contribute, but I can't promise when. :)

opened by asiripanich 3
Parsing issue in "AMOUNT" field

Hi,

Thank you very much for your effort!

I've played with the data (on Google Spreadsheet) a bit and found some issues in how AMOUNT is parsed. Many values there don't look correct (in a numerical sense). You can try to reproduce the screenshot by sorting by AMOUNT.

Because the information in the file is quite important, we really need to make sure these issues get fixed asap, before anyone tries to use it to compute statistics from the data.

opened by p16i 4

Owner

Kao.Geek

We're [O]pen Community, [K]een to move Thailand forward, [A]gile, and willing to contribute the code daily to make Thailand more progressive continuously.

