BERT-based Financial Question Answering System

Overview


In this example, we use Jina, PyTorch, and Hugging Face transformers to build a production-ready BERT-based Financial Question Answering system. We adapt a passage-reranking approach: we first retrieve the top-50 candidate answers, then rerank them with FinBERT-QA, a BERT-based model fine-tuned on the FiQA dataset that achieves state-of-the-art results on this task.
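The reranking step treats FinBERT-QA as a cross-encoder: the question and each candidate answer are scored together, and the candidates are re-ordered by that score. The snippet below is a minimal sketch of this idea using Hugging Face transformers; the local model path, the two-label convention (index 1 = relevant), and the rerank helper are illustrative assumptions, not the project's actual code (the real scoring happens inside the FinBertQARanker executor built later in this README).

# Minimal cross-encoder reranking sketch (illustrative; path and label convention are assumptions).
from typing import List, Tuple

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_PATH = "models/finbert-qa"  # hypothetical path to the downloaded FinBERT-QA checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()


def rerank(question: str, candidates: List[str], top_k: int = 10) -> List[Tuple[str, float]]:
    """Score each (question, candidate) pair and return the top_k candidates by relevance."""
    scores = []
    for passage in candidates:
        inputs = tokenizer(question, passage, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        # Assumes two labels where index 1 means "relevant".
        scores.append(torch.softmax(logits, dim=1)[0, 1].item())
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]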

🦉 Please refer to this tutorial for a step-by-step guide and detailed explanations.

Motivation

The financial industry has an emerging demand for the automatic analysis of structured and unstructured data at scale. QA systems can give companies a lucrative competitive advantage by facilitating the decision making of financial advisers. The goal of our system is to return a list of relevant answer passages for a given question. Here is an example of a question and a ground-truth answer from the FiQA dataset:

[Image: example question and ground-truth answer from the FiQA dataset]

Set up

Clone:

git clone https://github.com/yuanbit/jina-financial-qa-search.git

We will use jina-financial-qa-search/ as our working directory.

Install:

pip install -r requirements.txt

Download data and model:

bash get_data.sh

Index Answers

We want to index a subset of the answer passages from the FiQA dataset, dataset/test_answers.csv:

398960	From http://financial-dictionary.thefreedictionary.com/Business+Fundamentals The facts that affect a company's underlying value. Examples of business fundamentals include debt, cash flow, supply of and demand for the company's products, and so forth. For instance, if a company does not have a sufficient supply of products, it will fail. Likewise, demand for the product must remain at a certain level in order for it to be successful. Strong business fundamentals are considered essential for long-term success and stability. See also: Value Investing, Fundamental Analysis. For a stock the basic fundamentals are the second column of numbers you see on the google finance summary page, P/E ratio, div/yeild, EPS, shares, beta. For the company itself it's generally the stuff on the 'financials' link (e.g. things in the quarterly and annual report, debt, liabilities, assets, earnings, profit etc.
19183	If your sole proprietorship losses exceed all other sources of taxable income, then you have what's called a Net Operating Loss (NOL). You will have the option to "carry back" and amend a return you filed in the last 2 years where you owed tax, or you can "carry forward" the losses and decrease your taxes in a future year, up to 20 years in the future. For more information see the IRS links for NOL. Note: it's important to make sure you file the NOL correctly so I'd advise speaking with an accountant. (Especially if the loss is greater than the cost of the accountant...)
327002	To be deductible, a business expense must be both ordinary and necessary. An ordinary expense is one that is common and accepted in your trade or business. A necessary expense is one that is helpful and appropriate for your trade or business. An expense does not have to be indispensable to be considered necessary. (IRS, Deducting Business Expenses) It seems to me you'd have a hard time convincing an auditor that this is the case. Since business don't commonly own cars for the sole purpose of housing $25 computers, you'd have trouble with the "ordinary" test. And since there are lots of other ways to house a computer other than a car, "necessary" seems problematic also.

You can change the path to answer_collection.tsv to index the full dataset.
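For reference, each line of the answers file pairs a numeric document id with an answer passage, separated by a tab. Below is a minimal sketch of loading it into a Python dict; the file path and helper name are only illustrative.

# Illustrative loader for the tab-separated answers file; path and helper name are assumptions.
def load_answers(path: str = "dataset/test_answers.csv") -> dict:
    answers = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Each line is "<doc_id>\t<answer passage>".
            doc_id, _, text = line.rstrip("\n").partition("\t")
            if text:
                answers[doc_id] = text
    return answers

if __name__ == "__main__":
    answers = load_answers()
    print(f"loaded {len(answers)} answer passages")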

Run

python app.py index
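Under the hood, app.py index defines a Jina Flow and streams the answer passages into it. The sketch below shows roughly what that looks like with the Jina 1.x Flow API (Flow.load_config and index_lines); the flow YAML path, batch size, and document count are assumptions, so check app.py for the actual values.

# Rough sketch of the indexing entry point, assuming the Jina 1.x Flow API.
# The YAML path and parameters are assumptions; see app.py for the real values.
from jina.flow import Flow

def index(num_docs: int = 100):  # arbitrary example count, not the project default
    flow = Flow().load_config("flows/index.yml")  # encoder + indexer pods defined in YAML
    with flow:
        flow.index_lines(
            filepath="dataset/test_answers.csv",
            batch_size=16,
            read_mode="r",
            size=num_docs,
        )

if __name__ == "__main__":
    index()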


At the end you will see the following:

✅ done in ⏱ 1 minute and 54 seconds 🐎 7.7/s
        gateway@18904[S]:terminated
    doc_indexer@18903[I]:recv ControlRequest from ctl▸doc_indexer▸⚐
    doc_indexer@18903[I]:Terminating loop requested by terminate signal RequestLoopEnd()
    doc_indexer@18903[I]:#sent: 56 #recv: 56 sent_size: 1.7 MB recv_size: 1.7 MB
    doc_indexer@18903[I]:request loop ended, tearing down ...
    doc_indexer@18903[I]:indexer size: 865 physical size: 3.1 MB
    doc_indexer@18903[S]:artifacts of this executor (vecidx) is persisted to ./workspace/doc_compound_indexer-0/vecidx.bin
    doc_indexer@18903[I]:indexer size: 865 physical size: 3.2 MB
    doc_indexer@18903[S]:artifacts of this executor (docidx) is persisted to ./workspace/doc_compound_indexer-0/docidx.bin

Search Answers

We need to build a custom Executor to rerank the top-50 candidate answers. We can do this with the Jina Hub API. Let's make sure that the Jina Hub extension is installed:

pip install "jina[hub]"

We can build the custom Ranker, FinBertQARanker, by running:

jina hub build FinBertQARanker/ --pull --test-uses --timeout-ready 60000
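For orientation, the custom Ranker plugs into Jina's Match2DocRanker interface: it receives the query text together with the metadata and old scores of the candidate matches, and returns new scores computed by FinBERT-QA. Below is a condensed sketch of that hook; the _get_score stub stands in for the executor's actual model call, and the base-class column constants can differ across Jina versions.

# Condensed sketch of the custom Ranker's scoring hook, assuming the Jina 1.x
# Match2DocRanker interface. In the real FinBertQARanker, _get_score runs the
# fine-tuned FinBERT-QA model on a (question, answer) pair; here it is a stub.
from typing import Dict

import numpy as np
from jina.executors.rankers import Match2DocRanker


class FinBertQARankerSketch(Match2DocRanker):

    def _get_score(self, query: str, match: str) -> float:
        # Placeholder: the actual executor feeds (query, match) through FinBERT-QA
        # and returns the probability of the "relevant" label.
        return 0.0

    def score(
        self, query_meta: Dict, old_match_scores: Dict, match_meta: Dict
    ) -> "np.ndarray":
        # Re-score every candidate match with the fine-tuned model.
        new_scores = [
            (match_id, self._get_score(query_meta["text"], match_meta[match_id]["text"]))
            for match_id, _ in old_match_scores.items()
        ]
        return np.array(
            new_scores,
            # Column constants are provided by the Ranker base class in recent Jina 1.x releases.
            dtype=[(self.COL_MATCH_HASH, np.int64), (self.COL_SCORE, np.float64)],
        )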

Run

We can now use our Financial QA search engine by running:

python app.py search

The Ranker might take some time to compute the relevance scores, since it uses a BERT-based model. You can try out this list of questions from the FiQA dataset (a small reranking demo follows the list):

• What does it mean that stocks are “memoryless”?
• What would a stock be worth if dividends did not exist?
• What are the risks of Dividend-yielding stocks?
• Why do financial institutions charge so much to convert currency?
• Is there a candlestick pattern that guarantees any kind of future profit?
• 15 year mortgage vs 30 year paid off in 15
• Why is it rational to pay out a dividend?
• Why do companies have a fiscal year different from the calendar year?
• What should I look at before investing in a start-up?
• Where do large corporations store their massive amounts of cash?
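To get a feel for the reranking step outside of Jina, you could combine the load_answers and rerank sketches from earlier in this README on one of these questions. This is illustrative only; the real pipeline does the same work inside the Flow.

# Illustrative only: rerank a handful of loaded answers for one sample question,
# reusing the load_answers() and rerank() sketches defined above.
question = 'What does it mean that stocks are "memoryless"?'
candidates = list(load_answers().values())[:50]
for passage, score in rerank(question, candidates, top_k=5):
    print(f"{score:.4f}  {passage[:80]}...")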

Community

  • Slack channel - a communication platform for developers to discuss Jina
  • Community newsletter - subscribe to the latest updates, releases, and event news of Jina
  • LinkedIn - get to know Jina AI as a company and find job opportunities
  • Twitter - follow Jina AI and interact using the hashtag #JinaSearch
  • Company - learn more about Jina AI; the company is fully committed to open source!

License

Copyright (c) 2021 Jina's friend. All rights reserved.
