LegalQA using SentenceKoBART

Overview

Implementation of a legal question-answering (QA) system based on SentenceKoBART.

Setup

# Install Git LFS: https://github.com/git-lfs/git-lfs/wiki/Installation
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt install git-lfs
git clone https://github.com/haven-jeon/LegalQA.git
cd LegalQA
git lfs pull
pip install -r requirements.txt

Index

python app.py -t index

GPU-based indexing is available as an option:

  • pods/encoder.yml - on_gpu: true
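The README only names the on_gpu key; the snippet below is an illustrative sketch of where that switch would sit in a Jina pod config (the executor name KoBARTRegEncoder comes from pods/sentencekobart, but the surrounding layout is an assumption, so check the actual file):

```yaml
# pods/encoder.yml -- illustrative excerpt; everything except the
# `on_gpu` key is an assumption about the file's layout
!KoBARTRegEncoder
with:
  on_gpu: true  # set to false for CPU-only indexing
```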

Search

With REST API

To start the Jina server for the REST API:

python app.py -t query_restful

Then use a client to query:

curl --request POST -d '{"top_k": 1, "mode": "search",  "data": ["상속 관련 문의"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:1234/api/search'

Or use Jinabox with the endpoint http://127.0.0.1:1234/api/search
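The same request can be issued from Python using only the standard library. A minimal client sketch, with the endpoint and payload shape taken from the curl example above:

```python
import json
import urllib.request

# Default endpoint of the Jina REST gateway started by `python app.py -t query_restful`
ENDPOINT = "http://127.0.0.1:1234/api/search"

def build_payload(query: str, top_k: int = 1) -> bytes:
    """Build the JSON body expected by the /api/search endpoint."""
    return json.dumps(
        {"top_k": top_k, "mode": "search", "data": [query]}
    ).encode("utf-8")

def search(query: str, top_k: int = 1) -> dict:
    """POST a search query to the running query_restful server."""
    req = urllib.request.Request(
        ENDPOINT,
        data=build_payload(query, top_k),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `search("상속 관련 문의")` returns the parsed JSON response from the gateway.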

From the terminal

python app.py -t query

Demo

Citation

Model training, data crawling, and demo system were all supported by the AWS Hero program.

@misc{heewon2021,
  author = {Heewon Jeon},
  title = {LegalQA using SentenceKoBART},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/haven-jeon/LegalQA}}
}

License

  • The QA data (data/legalqa.jsonlines) was crawled from www.freelawfirm.co.kr in accordance with its robots.txt. Commercial use is prohibited; only academic use is permitted.
  • We are not responsible for any legal decisions made based on the resources provided here.

Comments
  • correct model ckpt path

    Hello! I'm interested in this repository and ran into a few difficulties while testing it by following the README, so I'm opening this PR.

    • git lfs no longer works. Fortunately, I was able to fetch the data directly from the dataset source mentioned and train the model myself, but it would be good to mention this.
    • It would help if the note that the model must be trained on the provided dataset appeared in the main README as well, not only in the README inside the pods folder.
    • In class KoBARTRegEncoder(BaseTorchEncoder) in pods/sentencekobart (run via encode.yml), the model path must be set to the final model produced by training for indexing and querying to work correctly.

    Thank you for publishing this great open-source project.

    opened by dleunji 1
  • Bump pytorch-lightning from 1.3.4 to 1.6.0 in /SentenceKoBART

    Bumps pytorch-lightning from 1.3.4 to 1.6.0.

    Release notes

    Sourced from pytorch-lightning's releases.

    PyTorch Lightning 1.6: Support Intel's Habana Accelerator, New efficient DDP strategy (Bagua), Manual Fault-tolerance, Stability and Reliability.

    The core team is excited to announce the PyTorch Lightning 1.6 release ⚡

    Highlights

    PyTorch Lightning 1.6 is the work of 99 contributors who have worked on features, bug-fixes, and documentation for a total of over 750 commits since 1.5. This is our most active release yet. Here are some highlights:

    Introducing Intel's Habana Accelerator

    Lightning 1.6 now supports the Habana® framework, which includes Gaudi® AI training processors. Their heterogeneous architecture includes a cluster of fully programmable Tensor Processing Cores (TPC) along with its associated development tools and libraries and a configurable Matrix Math engine.

    You can leverage the Habana hardware to accelerate your Deep Learning training workloads simply by passing:

    trainer = pl.Trainer(accelerator="hpu")

    # single Gaudi training
    trainer = pl.Trainer(accelerator="hpu", devices=1)

    # distributed training with 8 Gaudi
    trainer = pl.Trainer(accelerator="hpu", devices=8)

    The Bagua Strategy

    The Bagua Strategy is a deep learning acceleration framework that supports multiple, advanced distributed training algorithms with state-of-the-art system relaxation techniques. Enabling Bagua, which can be considerably faster than vanilla PyTorch DDP, is as simple as:

    trainer = pl.Trainer(strategy="bagua")

    # or to choose a custom algorithm
    trainer = pl.Trainer(strategy=BaguaStrategy(algorithm="gradient_allreduce"))  # default

    Towards stable Accelerator, Strategy, and Plugin APIs

    The Accelerator, Strategy, and Plugin APIs are a core part of PyTorch Lightning. They're where all the distributed boilerplate lives, and we're constantly working to improve both them and the overall PyTorch Lightning platform experience.

    In this release, we've made some large changes to achieve that goal. Not to worry, though! The only users affected by these changes are those who use custom implementations of Accelerator and Strategy (TrainingTypePlugin) as well as certain Plugins. In particular, we want to highlight the following changes:

    • All TrainingTypePlugins have been renamed to Strategy (#11120). Strategy is a more appropriate name because it encompasses more than simply training communication. This change is now aligned with the changes we implemented in 1.5, which introduced the new strategy and devices flags to the Trainer.

    ... (truncated)

    Changelog

    Sourced from pytorch-lightning's changelog.

    [1.6.0] - 2022-03-29

    Added

    • Allow logging to an existing run ID in MLflow with MLFlowLogger (#12290)
    • Enable gradient accumulation using Horovod's backward_passes_per_step (#11911)
    • Add new DETAIL log level to provide useful logs for improving monitoring and debugging of batch jobs (#11008)
    • Added a flag SLURMEnvironment(auto_requeue=True|False) to control whether Lightning handles the requeuing (#10601)
    • Fault Tolerant Manual
      • Add _Stateful protocol to detect if classes are stateful (#10646)
      • Add _FaultTolerantMode enum used to track different supported fault tolerant modes (#10645)
      • Add a _rotate_worker_indices utility to reload the state according to the latest worker (#10647)
      • Add stateful workers (#10674)
      • Add a utility to collect the states across processes (#10639)
      • Add logic to reload the states across data loading components (#10699)
      • Cleanup some fault tolerant utilities (#10703)
      • Enable Fault Tolerant Manual Training (#10707)
      • Broadcast the _terminate_gracefully to all processes and add support for DDP (#10638)
    • Added support for re-instantiation of custom (subclasses of) DataLoaders returned in the *_dataloader() methods, i.e., automatic replacement of samplers now works with custom types of DataLoader (#10680)
    • Added a function to validate if fault tolerant training is supported. (#10465)
    • Added a private callback to manage the creation and deletion of fault-tolerance checkpoints (#11862)
    • Show a better error message when a custom DataLoader implementation is not well implemented and we need to reconstruct it (#10719)
    • Show a better error message when frozen dataclass is used as a batch (#10927)
    • Save the Loop's state by default in the checkpoint (#10784)
    • Added Loop.replace to easily switch one loop for another (#10324)
    • Added support for --lr_scheduler=ReduceLROnPlateau to the LightningCLI (#10860)
    • Added LightningCLI.configure_optimizers to override the configure_optimizers return value (#10860)
    • Added LightningCLI(auto_registry) flag to register all subclasses of the registerable components automatically (#12108)
    • Added a warning that shows when max_epochs in the Trainer is not set (#10700)
    • Added support for returning a single Callback from LightningModule.configure_callbacks without wrapping it into a list (#11060)
    • Added console_kwargs for RichProgressBar to initialize inner Console (#10875)
    • Added support for shorthand notation to instantiate loggers with the LightningCLI (#11533)
    • Added a LOGGER_REGISTRY instance to register custom loggers to the LightningCLI (#11533)
    • Added info message when the Trainer arguments limit_*_batches, overfit_batches, or val_check_interval are set to 1 or 1.0 (#11950)
    • Added a PrecisionPlugin.teardown method (#10990)
    • Added LightningModule.lr_scheduler_step (#10249)
    • Added support for no pre-fetching to DataFetcher (#11606)
    • Added support for optimizer step progress tracking with manual optimization (#11848)
    • Return the output of the optimizer.step. This can be useful for LightningLite users, manual optimization users, or users overriding LightningModule.optimizer_step (#11711)
    • Teardown the active loop and strategy on exception (#11620)
    • Added a MisconfigurationException if user provided opt_idx in scheduler config doesn't match with actual optimizer index of its respective optimizer (#11247)
    • Added a loggers property to Trainer which returns a list of loggers provided by the user (#11683)
    • Added a loggers property to LightningModule which retrieves the loggers property from Trainer (#11683)
    • Added support for DDP when using a CombinedLoader for the training data (#11648)
    • Added a warning when using DistributedSampler during validation/testing (#11479)
    • Added support for Bagua training strategy (#11146)
    • Added support for manually returning a poptorch.DataLoader in a *_dataloader hook (#12116)
    • Added rank_zero module to centralize utilities (#11747)
    • Added a _Stateful support for LightningDataModule (#11637)
    • Added _Stateful support for PrecisionPlugin (#11638)

    ... (truncated)


    dependencies 
    opened by dependabot[bot] 0
  • Add doc2query for zero-shot ranker training.

    Add a doc2query model to improve ranking performance, based on ideas from https://arxiv.org/abs/2004.14503.

    KoBART is used instead of T5 due to Korean performance issues.

    Open training data:

    • https://huggingface.co/datasets/squad_kor_v2
    good first issue 
    opened by haven-jeon 0
Owner

Heewon Jeon (gogamza)
Democratization of NLP technology, [email protected]