Contract Understanding Atticus Dataset

Overview

This repository contains code for the Contract Understanding Atticus Dataset (CUAD), a dataset for legal contract review curated by the Atticus Project. It accompanies the paper CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review by Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball.

Contract review is a task about "finding needles in a haystack." We find that Transformer models have nascent performance on CUAD, but that this performance is strongly influenced by model design and training dataset size. Despite some promising results, there is still substantial room for improvement. As one of the only large, specialized NLP benchmarks annotated by experts, CUAD can serve as a challenging research benchmark for the broader NLP community.

For more details about CUAD and legal contract review, see the Atticus Project website.

Trained Models

We provide checkpoints for three of the best models fine-tuned on CUAD: RoBERTa-base (~100M parameters), RoBERTa-large (~300M parameters), and DeBERTa-xlarge (~900M parameters).
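
For reference, a checkpoint can be loaded with the standard Transformers question-answering classes. The sketch below is illustrative rather than part of the repository; the local path is a placeholder for wherever the downloaded checkpoint is extracted, and the question/context strings are toy inputs:

    import torch
    from transformers import AutoModelForQuestionAnswering, AutoTokenizer

    path = "./cuad-models/roberta-base"  # placeholder: downloaded checkpoint directory
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForQuestionAnswering.from_pretrained(path)

    question = 'Highlight the parts (if any) of this contract related to "Governing Law".'
    context = "This Agreement shall be governed by the laws of the State of New York."
    inputs = tokenizer(question, context, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # Decode the highest-scoring answer span.
    start = outputs.start_logits.argmax()
    end = outputs.end_logits.argmax()
    print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))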

Requirements

This repository requires the HuggingFace Transformers library. It was tested with Python 3.8, PyTorch 1.7, and Transformers 4.3/4.4.
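
A quick sanity check of the environment (the versions above are what the authors tested, not hard requirements; this snippet is a convenience, not part of the repository):

    import torch
    import transformers

    print(torch.__version__)         # tested with 1.7
    print(transformers.__version__)  # tested with 4.3 / 4.4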

Citation

If you find this useful in your research, please consider citing:

@article{hendrycks2021cuad,
  title={CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review},
  author={Dan Hendrycks and Collin Burns and Anya Chen and Spencer Ball},
  journal={arXiv preprint arXiv:2103.06268},
  year={2021}
}

Comments
  • convert squad examples to features very slow

    convert squad examples to features very slow

    Hello, the step to convert squad examples to features is very slow on my machine (48 cores + a GPU). The tqdm estimates 24 hours to finish. Is that normal? Thanks!

    convert squad examples to features: 4%|██ | 865/22450 [20:16<24:52:42, 4.15s/it]

    opened by Saibo-creator 2
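
    A possible mitigation, offered as an editor's sketch rather than a maintainer fix: squad_convert_examples_to_features in Transformers accepts a threads argument (defaulting to 1), so the conversion can be parallelized across CPU cores. The example/tokenizer setup below is a toy stand-in for what train.py builds, and the length parameters are assumptions:

        import os
        from transformers import AutoTokenizer
        from transformers.data.processors.squad import (
            SquadExample,
            squad_convert_examples_to_features,
        )

        tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=False)

        # Toy example standing in for the CUAD examples loaded in train.py.
        examples = [
            SquadExample(
                qas_id="demo-0",
                question_text="Who are the parties to this agreement?",
                context_text="This Agreement is between Acme Corp. and Beta LLC.",
                answer_text=None,
                start_position_character=None,
                title="demo",
                is_impossible=True,
                answers=[],
            )
        ]

        features, dataset = squad_convert_examples_to_features(
            examples=examples,
            tokenizer=tokenizer,
            max_seq_length=512,      # assumed; match the training configuration
            doc_stride=256,          # assumed
            max_query_length=64,     # assumed
            is_training=False,
            return_dataset="pt",
            threads=os.cpu_count(),  # the key change: parallelize across cores
        )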
  • Predictions take a lot of time.

    Predictions take a lot of time.

    Hi

    When using the DeBERTa model for predictions, it takes more than an hour for a single document (an 85-page document). Is there any way to reduce the time taken? Please advise. Thanks in advance.

    opened by deepakumarln 2
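
    Two generic speedups worth trying, sketched here as assumptions rather than maintainer guidance: run inference under no_grad in half precision, and batch more of the document's windows together per forward pass (if the script exposes the usual run_squad-style eval batch size flag, raising it has the same effect). Assuming model and inputs are prepared as in the loading sketch near the top of this page:

        import torch

        model.cuda().eval()
        inputs = {k: v.cuda() for k, v in inputs.items()}

        # fp16 autocasting roughly halves compute and memory traffic on recent GPUs.
        with torch.no_grad(), torch.cuda.amp.autocast():
            outputs = model(**inputs)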
  • Consume too much memory

    Consume too much memory

    When "Creating features from dataset file at .", this code consumes too much memory (I have a 48G machine).

    This makes me can not run this code. (I guess this needs a 128G machine)

    Is it possible to fix this problem? Thx.

    opened by wangdsh 2
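
    One generic way to bound peak memory, again an editor's sketch and not the repository's code: convert and cache the features in chunks, then reload the cached tensors for training or evaluation. The chunk size and length parameters are assumptions:

        import torch

        # examples, tokenizer and squad_convert_examples_to_features
        # as in the parallelization sketch above.
        chunk_size = 1000
        for i in range(0, len(examples), chunk_size):
            feats = squad_convert_examples_to_features(
                examples=examples[i : i + chunk_size],
                tokenizer=tokenizer,
                max_seq_length=512,   # assumed
                doc_stride=256,       # assumed
                max_query_length=64,  # assumed
                is_training=False,
            )
            torch.save(feats, f"features_{i}.pt")  # reload later with torch.load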
  • Checkpoints location

    Checkpoints location

    I couldn't locate them in the provided documentation; would you mind pointing or linking to them in the README?

    We provide checkpoints for three of the best models fine-tuned on CUAD: RoBERTa-base (~100M parameters), RoBERTa-large (~300M parameters), and DeBERTa-xlarge (~900M parameters).
    
    opened by gitcarbs 2
  • Provide the precision numbers for best CUAD model from paper

    Provide the precision numbers for best CUAD model from paper

    In the CUAD paper you show the chart below. Can you share the precision numbers that correspond to each clause? That would be useful for comparing against other models.

    [image: per-clause precision chart from the CUAD paper]

    opened by ahegel 1
  • [Feature Request] Publish Dataset to Hugging Face Datasets?

    [Feature Request] Publish Dataset to Hugging Face Datasets?

    Thanks so much for open-sourcing this dataset; I'm looking forward to using it! I would love it if you added it to Hugging Face's Datasets to make it even more accessible and discoverable for folks!

    opened by morganmcg1 1
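
    For what it's worth, CUAD did later appear on the Hugging Face Hub; assuming the cuad identifier still resolves, loading it is a one-liner:

        from datasets import load_dataset

        cuad = load_dataset("cuad")  # assumed dataset id
        print(cuad["train"][0]["question"])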
  • Could you push your models to Huggingface hub?

    Could you push your models to Huggingface hub?

    First of all, thanks for this amazing dataset and for the pretrained models. Would it be possible to push your models to the Hugging Face model hub (https://huggingface.co/models)? I can do it too, but I think it would give the models more legitimacy if they came from your account.

    opened by dfioravanti 0
  • Why is test dataset (test.json) labeled?

    Why is test dataset (test.json) labeled?

    The "--predict_file ./data/test.json" file is labeled with questions and answers, and it's passed directly into predictions = compute_predictions_logits() for predictions in train.py.

    If I want to use your model to do predictions on my own dataset, do I also need to label it in the same json format? Doesn't that defeat the purpose? Let me know if I am misunderstanding, but shouldn't the model predict on unlabeled, raw text file?

    Thanks!

    opened by ShuJackson 1
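
    If the processing mirrors the standard SQuAD pipeline, the gold answers are only consumed when computing metrics, so for prediction-only runs a file in the same SQuAD-style layout with empty answers should be enough. A sketch of building one for a new contract (paths and IDs are placeholders, and the question strings should be copied verbatim from CUAD so they match what the model was trained on):

        import json

        data = {
            "version": "v1.0",
            "data": [{
                "title": "my_contract",
                "paragraphs": [{
                    "context": open("my_contract.txt").read(),
                    "qas": [{
                        "id": "my_contract__governing-law",
                        "question": 'Highlight the parts (if any) of this contract related to "Governing Law".',
                        "answers": [],
                        "is_impossible": True,
                    }],
                }],
            }],
        }

        with open("my_test.json", "w") as f:
            json.dump(data, f)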
  • NCCL Error 1: unhandled cuda error

    NCCL Error 1: unhandled cuda error

    When I run the training script (./run.sh), I ran into an instance of 'std::runtime_error' with what(): NCCL Error 1: unhandled cuda error.

    This happens every time in the Evaluation step of the train.py script - after the 'convert squad examples to features' step completes successfully and right after 'Evaluating: 0%' is printed.

    I have made sure torch can pick up the cuda info:

    >>> print(torch.cuda.is_available())
    True

    opened by ShuJackson 3
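
    A common workaround for NCCL errors of this kind, offered as an assumption rather than a verified fix for this repository: NCCL is only exercised for multi-GPU communication, so pinning the run to a single GPU often sidesteps it:

        import os

        # Must run before torch initializes CUDA (e.g. at the top of train.py).
        os.environ["CUDA_VISIBLE_DEVICES"] = "0"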
  • Use Python's default split function

    Use Python's default split function

    The no-argument str.split() is better than splitting just on spaces: it splits on any run of whitespace and drops empty strings. Consider the following two examples.

    "The    quick brown\n fox".split()
    ['The', 'quick', 'brown', 'fox']
    

    vs

    "The    quick brown\n fox".split(" ")
    ['The', '', '', '', 'quick', 'brown\n', 'fox']
    
    opened by brian8128 0
Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering

Disfl-QA is a targeted dataset for contextual disfluencies in an information seeking setting, namely question answering over Wikipedia passages. Disfl-QA builds upon the SQuAD-v2 (Rajpurkar et al., 2018) dataset, where each question in the dev set is annotated to add a contextual disfluency using the paragraph as a source of distractors.

Google Research Datasets 52 Jun 21, 2022
The FinQA dataset from paper: FinQA: A Dataset of Numerical Reasoning over Financial Data

Data and code for EMNLP 2021 paper "FinQA: A Dataset of Numerical Reasoning over Financial Data"

Zhiyu Chen 114 Dec 29, 2022
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Google Research Datasets 740 Dec 24, 2022
A 30000+ Chinese MRC dataset - Delta Reading Comprehension Dataset

Delta Reading Comprehension Dataset (DRCD) is an open-domain Traditional Chinese machine reading comprehension dataset, intended to serve as a standard Traditional Chinese reading-comprehension dataset for transfer learning. It was built from 2,108 articles.

null 272 Dec 15, 2022
Natural language Understanding Toolkit

nut: Natural Language Understanding Toolkit. To install nut you need Python 2.

Peter Prettenhofer 119 Oct 8, 2022
Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

PLBART: code pre-release of our work Unified Pre-training for Program Understanding and Generation, accepted at NAACL 2021.

Wasi Ahmad 138 Dec 30, 2022
KLUE-baseline contains the baseline code for the Korean Language Understanding Evaluation (KLUE) benchmark.

KLUE-baseline contains the baseline code for the Korean Language Understanding Evaluation (KLUE) benchmark.

null 74 Dec 13, 2022
Indobenchmark are collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG)

Indobenchmark Toolkit: collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG) resources.

Samuel Cahyawijaya 11 Aug 26, 2022
Watson Natural Language Understanding and Knowledge Studio

Demonstration material for the Watson Natural Language Understanding and Knowledge Studio services. Overview: https://www.ibm.com/br-pt/cloud/watson-na

Vanderlei Munhoz 4 Oct 24, 2021
PyTorch implementation of the paper: Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding

This repository contains the official PyTorch implementation of the paper Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding.

Xiao Xu 26 Dec 14, 2022
Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.

textgenrnn: easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.

Max Woolf 4.8k Dec 30, 2022
Common Voice Dataset explorer

Common Voice Dataset Explorer. The Common Voice dataset is by Mozilla; this explorer was made during the Hugging Face fine-tuning week. Usage: pip install -r requirements.txt, then launch the Streamlit app.

Ceyda Cinarel 22 Nov 16, 2022
ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset.

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset. Through its Python API, the pretrained model can be fine-tuned on any protein-related task in a matter of minutes. Based on our experiments with a wide range of benchmarks, ProteinBERT usually achieves state-of-the-art performance. ProteinBERT is built on TensorFlow/Keras.

null 241 Jan 4, 2023
TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset.

TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset. TunBERT was applied to three NLP downstream tasks: Sentiment Analysis (SA), Tunisian Dialect Identification (TDI) and Reading Comprehension Question-Answering (RCQA)

InstaDeep Ltd 72 Dec 9, 2022
SDL: Synthetic Document Layout dataset

SDL is a project that synthesizes document images. It facilitates multi-level labeling of document images and can generate documents in multiple languages.

Sơn Nguyễn 0 Oct 7, 2021
PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset

A large-scale PyTorch language model trained on the 1-Billion Word (LM1B / GBW) dataset. Latest results: 39.98 perplexity.

Ryan Spring 114 Nov 4, 2022
Korean Simple Contrastive Learning of Sentence Embeddings using SKT KoBERT and kakaobrain KorNLU dataset

KoSimCSE: Korean Simple Contrastive Learning of Sentence Embeddings, implemented in PyTorch (based on SimCSE). Installation: git clone https://github.com/BM-K/

null 34 Nov 24, 2022