FinQA

Overview

The FinQA dataset, from the paper "FinQA: A Dataset of Numerical Reasoning over Financial Data".

Format

"pre_text": the texts before the table;
"post_text": the text after the table;
"table": the table;
"id": unique example id. composed by the original report name plus example index for this report. 

"qa": {
  "question": the question;
  "program": the reasoning program;
  "gold_inds": the gold supporting facts;
  "exe_ans": the gold execution result;
  "program_re": the reasoning program in nested format;
}
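A minimal sketch of reading these fields, assuming the released `train.json` is a JSON list of records following this schema (the sample record below is illustrative, not an actual dataset entry):

```python
import json

# Illustrative record following the schema above (not a real dataset entry).
sample = json.loads("""
{
  "pre_text": ["revenue increased in 2009 ."],
  "post_text": ["see note 5 for details ."],
  "table": [["", "2009", "2008"], ["revenue", "100", "90"]],
  "id": "ADI/2009/page_49.pdf-1",
  "qa": {
    "question": "what was the growth rate of revenue?",
    "program": "subtract(100, 90), divide(#0, 90)",
    "gold_inds": {"table_1": "revenue of 2009 is 100 ; revenue of 2008 is 90 ."},
    "exe_ans": 0.11111,
    "program_re": "divide(subtract(100, 90), 90)"
  }
}
""")

# With the real file you would instead do: examples = json.load(open("train.json"))
question = sample["qa"]["question"]
program = sample["qa"]["program"]
print(question)
print(program)
```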

Comments
  • Labelling examples with 1,2,3-step hops

    Hi - the paper mentions the proportions of 1-, 2-, and 3-step questions in the dataset. However, there is no explicit label indicating whether a question is a 1-step, 2-step, or 3-step question. Could you add that information, or provide a way to infer the number of hops a question takes given a passage?

    Thanks!

    opened by imranq 1
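    One possible workaround, sketched here under the assumption that each reasoning step appears as one `op(...)` call in the comma-separated `program` field (the example programs below are illustrative):

    ```python
    import re

    def count_hops(program: str) -> int:
        """Count reasoning steps by counting op(...) calls in a FinQA-style program."""
        return len(re.findall(r"\w+\(", program))

    print(count_hops("subtract(5829, 5735)"))                    # 1-step
    print(count_hops("subtract(5829, 5735), divide(#0, 5735)"))  # 2-step
    ```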
  • How can we fill the ‘model_input’ field in the private test dataset?

    hi,

    As we know, there is a field named "model_input" in test.json. It is very important for the generator model, but it is missing from the private test dataset. Can you give me some suggestions on how to fill the ‘model_input’ field? Thanks.

    opened by qinhui99 1
  • Description for the attributes in the data file

    Hi @czyssrs, thanks for your excellent work

    I only found these descriptions available in the README file:

    "pre_text": the texts before the table;
    "post_text": the text after the table;
    "table": the table;
    "id": unique example id. composed by the original report name plus example index for this report. 
    
    "qa": {
      "question": the question;
      "program": the reasoning program;
      "gold_inds": the gold supporting facts;
      "exe_ans": the gold execution result;
      "program_re": the reasoning program in nested format;
    }
    

    However, after walking through the actual train.json file, I found many other attributes like table_ori, table_retrieved, text_retrieved, etc. Would it be possible to add a description for each attribute in your dataset?

    Also:

    1. Aren't text_retrieved and table_retrieved (with their corresponding scores) generated by a model?
    2. What are model_input and tfidftopn under qa?
    3. answer and exe_ans differ for Sample 1 (filename: "ADI/2009/page_49.pdf") in train.json:
      • answer: 380, exe_ans: 3.8

    Thanks for your help and support

    opened by icedpanda 0
  • Input_mask and number_indices don't match because of the [cls] at the beginning

    def convert_single_mathqa_example(example, is_training, tokenizer, max_seq_length,
                                      max_program_length, op_list, op_list_size,
                                      const_list, const_list_size,
                                      cls_token, sep_token):
        """Converts a single MathQAExample into an InputFeature."""
        features = []
        question_tokens = example.question_tokens
        if len(question_tokens) > max_seq_length - 2:
            print("too long")
            question_tokens = question_tokens[:max_seq_length - 2]
        tokens = [cls_token] + question_tokens + [sep_token]         # 1. This line add [cls_token] at beginning.
        segment_ids = [0] * len(tokens)
    
        input_ids = tokenizer.convert_tokens_to_ids(tokens)
    
        input_mask = [1] * len(input_ids)
        for ind, offset in enumerate(example.number_indices):          # 2. Why don't number_indices offset by 1 ?
            if offset < len(input_mask):
                input_mask[offset] = 2
            else:
                if is_training == True:
    
                    # invalid example, drop for training
                    return features
    
                # assert is_training == False
    

    Hello, thanks for the great work! However, I am confused by this code.

    At comment 1, [cls_token] is prepended to the tokens, so the index of every token shifts to the right by 1. At comment 2, example.number_indices is used directly to assign 2 in input_mask, which is confusing, since input_mask is built from tokens, which include the [cls] at the beginning.

    For example, with tokens [[cls], a, b, 1, c, d], example.number_indices will be [2] (because number_indices is computed without the [cls] at the beginning, so the 2 refers to the index of the number "1"), and the corresponding input_mask will be [1, 1, 1, 1, 1, 1]. Assigning 2 at the positions in example.number_indices yields [1, 1, 2, 1, 1, 1], but index 2 now refers to "b" in tokens, not to "1". Could you please explain this? Thanks!

    opened by KnightZhang625 0
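    The off-by-one the commenter describes could be avoided by shifting each offset by one before indexing; a minimal sketch (hypothetical helper, not code from the repo):

    ```python
    CLS_OFFSET = 1  # [cls] is prepended, shifting every token index right by one

    def mark_numbers(input_mask, number_indices, cls_offset=CLS_OFFSET):
        """Set positions of number tokens to 2, accounting for the prepended [cls]."""
        mask = list(input_mask)
        for offset in number_indices:
            shifted = offset + cls_offset
            if shifted < len(mask):
                mask[shifted] = 2
        return mask

    # tokens: [[cls], a, b, 1, c, d]; number_indices computed without [cls] -> [2]
    print(mark_numbers([1, 1, 1, 1, 1, 1], [2]))  # [1, 1, 1, 2, 1, 1]
    ```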
Owner
Zhiyu Chen
Ph.D. student in ML/NLP