LegalQA using SentenceKoBART

Overview

Implementation of a legal question-answering (QA) system based on SentenceKoBART.

Setup

# Install Git LFS: https://github.com/git-lfs/git-lfs/wiki/Installation
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt install git-lfs
git clone https://github.com/haven-jeon/LegalQA.git
cd LegalQA
git lfs pull
pip install -r requirements.txt

Index

python app.py -t index

GPU-based indexing is available as an option:

  • pods/encoder.yml - on_gpu: true
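The README only names the on_gpu key; the snippet below is an illustrative sketch of where that switch would sit in a Jina pod config (the executor name KoBARTRegEncoder comes from pods/sentencekobart, but the surrounding layout is an assumption, so check the actual file):

```yaml
# pods/encoder.yml -- illustrative excerpt; everything except the
# `on_gpu` key is an assumption about the file's layout
!KoBARTRegEncoder
with:
  on_gpu: true  # set to false for CPU-only indexing
```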

Search

With REST API

To start the Jina server for the REST API:

python app.py -t query_restful

Then use a client to query:

curl --request POST -d '{"top_k": 1, "mode": "search",  "data": ["상속 관련 문의"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:1234/api/search'

Or use Jinabox with the endpoint http://127.0.0.1:1234/api/search
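The same request can be issued from Python using only the standard library. A minimal client sketch, with the endpoint and payload shape taken from the curl example above:

```python
import json
import urllib.request

# Default endpoint of the Jina REST gateway started by `python app.py -t query_restful`
ENDPOINT = "http://127.0.0.1:1234/api/search"

def build_payload(query: str, top_k: int = 1) -> bytes:
    """Build the JSON body expected by the /api/search endpoint."""
    return json.dumps(
        {"top_k": top_k, "mode": "search", "data": [query]}
    ).encode("utf-8")

def search(query: str, top_k: int = 1) -> dict:
    """POST a search query to the running query_restful server."""
    req = urllib.request.Request(
        ENDPOINT,
        data=build_payload(query, top_k),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `search("상속 관련 문의")` returns the parsed JSON response from the gateway.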

From the terminal

python app.py -t query

Demo

Citation

Model training, data crawling, and demo system were all supported by the AWS Hero program.

@misc{heewon2021,
  author = {Heewon Jeon},
  title = {LegalQA using SentenceKoBART},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/haven-jeon/LegalQA}}
}

License

  • The QA data (data/legalqa.jsonlines) was crawled from www.freelawfirm.co.kr in accordance with its robots.txt. Commercial use is prohibited; only academic use is permitted.
  • We are not responsible for any legal decisions made based on the resources provided here.

Comments
  • correct model ckpt path

    Hello! I'm interested in this repository and ran into a few difficulties while testing it by following the README, so I'm opening this PR.

    • git lfs no longer works. Fortunately, I was able to fetch the data directly from the dataset source mentioned and train the model myself, but it would be good to mention this.
    • It would help if the note that the model must be trained on the provided dataset appeared in the main README as well, not only in the README inside the pods folder.
    • In class KoBARTRegEncoder(BaseTorchEncoder) in pods/sentencekobart (run via encode.yml), the model path must be set to the final model produced by training for indexing and querying to work correctly.

    Thank you for publishing this great open-source project.

    opened by dleunji 1
  • Bump pytorch-lightning from 1.3.4 to 1.6.0 in /SentenceKoBART

    Bumps pytorch-lightning from 1.3.4 to 1.6.0.

    Release notes

    Sourced from pytorch-lightning's releases.

    PyTorch Lightning 1.6: Support Intel's Habana Accelerator, New efficient DDP strategy (Bagua), Manual Fault-tolerance, Stability and Reliability.

    The core team is excited to announce the PyTorch Lightning 1.6 release ⚡

    Highlights

    PyTorch Lightning 1.6 is the work of 99 contributors who have worked on features, bug-fixes, and documentation for a total of over 750 commits since 1.5. This is our most active release yet. Here are some highlights:

    Introducing Intel's Habana Accelerator

    Lightning 1.6 now supports the Habana® framework, which includes Gaudi® AI training processors. Their heterogeneous architecture includes a cluster of fully programmable Tensor Processing Cores (TPC) along with its associated development tools and libraries and a configurable Matrix Math engine.

    You can leverage the Habana hardware to accelerate your Deep Learning training workloads simply by passing:

    trainer = pl.Trainer(accelerator="hpu")

    # single Gaudi training
    trainer = pl.Trainer(accelerator="hpu", devices=1)

    # distributed training with 8 Gaudi
    trainer = pl.Trainer(accelerator="hpu", devices=8)

    The Bagua Strategy

    The Bagua Strategy is a deep learning acceleration framework that supports multiple, advanced distributed training algorithms with state-of-the-art system relaxation techniques. Enabling Bagua, which can be considerably faster than vanilla PyTorch DDP, is as simple as:

    trainer = pl.Trainer(strategy="bagua")

    # or to choose a custom algorithm
    trainer = pl.Trainer(strategy=BaguaStrategy(algorithm="gradient_allreduce"))  # default

    Towards stable Accelerator, Strategy, and Plugin APIs

    The Accelerator, Strategy, and Plugin APIs are a core part of PyTorch Lightning. They're where all the distributed boilerplate lives, and we're constantly working to improve both them and the overall PyTorch Lightning platform experience.

    In this release, we've made some large changes to achieve that goal. Not to worry, though! The only users affected by these changes are those who use custom implementations of Accelerator and Strategy (TrainingTypePlugin) as well as certain Plugins. In particular, we want to highlight the following changes:

    • All TrainingTypePlugins have been renamed to Strategy (#11120). Strategy is a more appropriate name because it encompasses more than simply training communication. This change is now aligned with the changes we implemented in 1.5, which introduced the new strategy and devices flags to the Trainer.

    ... (truncated)

    Changelog

    Sourced from pytorch-lightning's changelog.

    [1.6.0] - 2022-03-29

    Added

    • Allow logging to an existing run ID in MLflow with MLFlowLogger (#12290)
    • Enable gradient accumulation using Horovod's backward_passes_per_step (#11911)
    • Add new DETAIL log level to provide useful logs for improving monitoring and debugging of batch jobs (#11008)
    • Added a flag SLURMEnvironment(auto_requeue=True|False) to control whether Lightning handles the requeuing (#10601)
    • Fault Tolerant Manual
      • Add _Stateful protocol to detect if classes are stateful (#10646)
      • Add _FaultTolerantMode enum used to track different supported fault tolerant modes (#10645)
      • Add a _rotate_worker_indices utility to reload the state according to the latest worker (#10647)
      • Add stateful workers (#10674)
      • Add a utility to collect the states across processes (#10639)
      • Add logic to reload the states across data loading components (#10699)
      • Cleanup some fault tolerant utilities (#10703)
      • Enable Fault Tolerant Manual Training (#10707)
      • Broadcast the _terminate_gracefully to all processes and add support for DDP (#10638)
    • Added support for re-instantiation of custom (subclasses of) DataLoaders returned in the *_dataloader() methods, i.e., automatic replacement of samplers now works with custom types of DataLoader (#10680)
    • Added a function to validate if fault tolerant training is supported. (#10465)
    • Added a private callback to manage the creation and deletion of fault-tolerance checkpoints (#11862)
    • Show a better error message when a custom DataLoader implementation is not well implemented and we need to reconstruct it (#10719)
    • Show a better error message when frozen dataclass is used as a batch (#10927)
    • Save the Loop's state by default in the checkpoint (#10784)
    • Added Loop.replace to easily switch one loop for another (#10324)
    • Added support for --lr_scheduler=ReduceLROnPlateau to the LightningCLI (#10860)
    • Added LightningCLI.configure_optimizers to override the configure_optimizers return value (#10860)
    • Added LightningCLI(auto_registry) flag to register all subclasses of the registerable components automatically (#12108)
    • Added a warning that shows when max_epochs in the Trainer is not set (#10700)
    • Added support for returning a single Callback from LightningModule.configure_callbacks without wrapping it into a list (#11060)
    • Added console_kwargs for RichProgressBar to initialize inner Console (#10875)
    • Added support for shorthand notation to instantiate loggers with the LightningCLI (#11533)
    • Added a LOGGER_REGISTRY instance to register custom loggers to the LightningCLI (#11533)
    • Added info message when the Trainer arguments limit_*_batches, overfit_batches, or val_check_interval are set to 1 or 1.0 (#11950)
    • Added a PrecisionPlugin.teardown method (#10990)
    • Added LightningModule.lr_scheduler_step (#10249)
    • Added support for no pre-fetching to DataFetcher (#11606)
    • Added support for optimizer step progress tracking with manual optimization (#11848)
    • Return the output of the optimizer.step. This can be useful for LightningLite users, manual optimization users, or users overriding LightningModule.optimizer_step (#11711)
    • Teardown the active loop and strategy on exception (#11620)
    • Added a MisconfigurationException if user provided opt_idx in scheduler config doesn't match with actual optimizer index of its respective optimizer (#11247)
    • Added a loggers property to Trainer which returns a list of loggers provided by the user (#11683)
    • Added a loggers property to LightningModule which retrieves the loggers property from Trainer (#11683)
    • Added support for DDP when using a CombinedLoader for the training data (#11648)
    • Added a warning when using DistributedSampler during validation/testing (#11479)
    • Added support for Bagua training strategy (#11146)
    • Added support for manually returning a poptorch.DataLoader in a *_dataloader hook (#12116)
    • Added rank_zero module to centralize utilities (#11747)
    • Added a _Stateful support for LightningDataModule (#11637)
    • Added _Stateful support for PrecisionPlugin (#11638)

    ... (truncated)


    dependencies 
    opened by dependabot[bot] 0
  • Add doc2query for zero-shot ranker training.

    Add a doc2query model to improve ranking performance, based on ideas from https://arxiv.org/abs/2004.14503.

    KoBART is used instead of T5 due to Korean performance issues.

    Open training data:

    • https://huggingface.co/datasets/squad_kor_v2
    good first issue 
    opened by haven-jeon 0
Owner

Heewon Jeon (gogamza)
Democratization of NLP technology, [email protected]