FinQA

Overview

The FinQA dataset, from the paper "FinQA: A Dataset of Numerical Reasoning over Financial Data".

Format

"pre_text": the texts before the table;
"post_text": the text after the table;
"table": the table;
"id": unique example id. composed by the original report name plus example index for this report. 

"qa": {
  "question": the question;
  "program": the reasoning program;
  "gold_inds": the gold supporting facts;
  "exe_ans": the gold execution result;
  "program_re": the reasoning program in nested format;
}
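A minimal sketch of reading these fields, assuming the released `train.json` is a JSON list of records following this schema (the sample record below is illustrative, not an actual dataset entry):

```python
import json

# Illustrative record following the schema above (not a real dataset entry).
sample = json.loads("""
{
  "pre_text": ["revenue increased in 2009 ."],
  "post_text": ["see note 5 for details ."],
  "table": [["", "2009", "2008"], ["revenue", "100", "90"]],
  "id": "ADI/2009/page_49.pdf-1",
  "qa": {
    "question": "what was the growth rate of revenue?",
    "program": "subtract(100, 90), divide(#0, 90)",
    "gold_inds": {"table_1": "revenue of 2009 is 100 ; revenue of 2008 is 90 ."},
    "exe_ans": 0.11111,
    "program_re": "divide(subtract(100, 90), 90)"
  }
}
""")

# With the real file you would instead do: examples = json.load(open("train.json"))
question = sample["qa"]["question"]
program = sample["qa"]["program"]
print(question)
print(program)
```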

Comments
  • Labelling examples with 1,2,3-step hops

    Hi - the paper mentions the proportions of 1-, 2-, and 3-step questions in the dataset. However, there is no explicit label indicating whether a question is a 1-step, 2-step, or 3-step question. Could you add that information, or provide a way to infer the number of hops a question takes given a passage?

    Thanks!

    opened by imranq 1
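    One possible workaround, sketched here under the assumption that each reasoning step appears as one `op(...)` call in the comma-separated `program` field (the example programs below are illustrative):

    ```python
    import re

    def count_hops(program: str) -> int:
        """Count reasoning steps by counting op(...) calls in a FinQA-style program."""
        return len(re.findall(r"\w+\(", program))

    print(count_hops("subtract(5829, 5735)"))                    # 1-step
    print(count_hops("subtract(5829, 5735), divide(#0, 5735)"))  # 2-step
    ```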
  • How can we fill the ‘model_input’ field in the private test dataset?

    hi,

    As we know, there is a field named "model_input" in test.json. It is very important for the generator model, but it is missing from the private test dataset. Can you give me some suggestions on how to fill the ‘model_input’ field? Thanks.

    opened by qinhui99 1
  • Description for the attributes in the data file

    Hi @czyssrs, thanks for your excellent work

    I only found these descriptions available in the README file:

    "pre_text": the texts before the table;
    "post_text": the text after the table;
    "table": the table;
    "id": unique example id. composed by the original report name plus example index for this report. 
    
    "qa": {
      "question": the question;
      "program": the reasoning program;
      "gold_inds": the gold supporting facts;
      "exe_ans": the gold execution result;
      "program_re": the reasoning program in nested format;
    }
    

    However, after walking through the actual train.json file, I found many other attributes like table_ori, table_retrieved, text_retrieved, etc. Would it be possible to add a description for each attribute in your dataset?

    Also:

    1. Aren't text_retrieved and table_retrieved (with their corresponding scores) generated by a model?
    2. What are model_input and tfidftopn under qa?
    3. answer and exe_ans differ for Sample 1 (filename: "ADI/2009/page_49.pdf") in train.json:
      • answer: 380, exe_ans: 3.8

    Thanks for your help and support

    opened by icedpanda 0
  • Input_mask and number_indices don't match because of the [cls] at the beginning

    def convert_single_mathqa_example(example, is_training, tokenizer, max_seq_length,
                                      max_program_length, op_list, op_list_size,
                                      const_list, const_list_size,
                                      cls_token, sep_token):
        """Converts a single MathQAExample into an InputFeature."""
        features = []
        question_tokens = example.question_tokens
        if len(question_tokens) > max_seq_length - 2:
            print("too long")
            question_tokens = question_tokens[:max_seq_length - 2]
        tokens = [cls_token] + question_tokens + [sep_token]         # 1. This line add [cls_token] at beginning.
        segment_ids = [0] * len(tokens)
    
        input_ids = tokenizer.convert_tokens_to_ids(tokens)
    
        input_mask = [1] * len(input_ids)
        for ind, offset in enumerate(example.number_indices):          # 2. Why don't number_indices offset by 1 ?
            if offset < len(input_mask):
                input_mask[offset] = 2
            else:
                if is_training == True:
    
                    # invalid example, drop for training
                    return features
    
                # assert is_training == False
    

    Hello, thanks for the great work! However, I am confused by this code.

    At comment 1, [cls_token] is prepended to the tokens, so the index of every token shifts to the right by 1. At comment 2, example.number_indices is used directly to assign 2 in input_mask, which is confusing, since input_mask is built from tokens, which include the [cls] at the beginning.

    For example, with tokens [[cls], a, b, 1, c, d], example.number_indices will be [2] (because number_indices is computed without the [cls] at the beginning, so the 2 refers to the index of the number "1"), and the corresponding input_mask will be [1, 1, 1, 1, 1, 1]. Assigning 2 at the positions in example.number_indices yields [1, 1, 2, 1, 1, 1], but index 2 now refers to "b" in tokens, not to "1". Could you please explain this? Thanks!

    opened by KnightZhang625 0
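    The off-by-one the commenter describes could be avoided by shifting each offset by one before indexing; a minimal sketch (hypothetical helper, not code from the repo):

    ```python
    CLS_OFFSET = 1  # [cls] is prepended, shifting every token index right by one

    def mark_numbers(input_mask, number_indices, cls_offset=CLS_OFFSET):
        """Set positions of number tokens to 2, accounting for the prepended [cls]."""
        mask = list(input_mask)
        for offset in number_indices:
            shifted = offset + cls_offset
            if shifted < len(mask):
                mask[shifted] = 2
        return mask

    # tokens: [[cls], a, b, 1, c, d]; number_indices computed without [cls] -> [2]
    print(mark_numbers([1, 1, 1, 1, 1, 1], [2]))  # [1, 1, 1, 2, 1, 1]
    ```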
Owner
Zhiyu Chen
Ph.D. student in ML/NLP