Contract Understanding Atticus Dataset

Overview

This repository contains code for the Contract Understanding Atticus Dataset (CUAD), a dataset for legal contract review curated by the Atticus Project. It accompanies the paper CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review by Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball.

Contract review is a task about "finding needles in a haystack." We find that Transformer models have nascent performance on CUAD, but that this performance is strongly influenced by model design and training dataset size. Despite some promising results, there is still substantial room for improvement. As one of the only large, specialized NLP benchmarks annotated by experts, CUAD can serve as a challenging research benchmark for the broader NLP community.

For more details about CUAD and legal contract review, see the Atticus Project website.

Trained Models

We provide checkpoints for three of the best models fine-tuned on CUAD: RoBERTa-base (~100M parameters), RoBERTa-large (~300M parameters), and DeBERTa-xlarge (~900M parameters).
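
For reference, a checkpoint can be loaded with the standard Transformers question-answering classes. The sketch below is illustrative rather than part of the repository; the local path is a placeholder for wherever the downloaded checkpoint is extracted, and the question/context strings are toy inputs:

    import torch
    from transformers import AutoModelForQuestionAnswering, AutoTokenizer

    path = "./cuad-models/roberta-base"  # placeholder: downloaded checkpoint directory
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForQuestionAnswering.from_pretrained(path)

    question = 'Highlight the parts (if any) of this contract related to "Governing Law".'
    context = "This Agreement shall be governed by the laws of the State of New York."
    inputs = tokenizer(question, context, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # Decode the highest-scoring answer span.
    start = outputs.start_logits.argmax()
    end = outputs.end_logits.argmax()
    print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))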

Requirements

This repository requires the HuggingFace Transformers library. It was tested with Python 3.8, PyTorch 1.7, and Transformers 4.3/4.4.
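
A quick sanity check of the environment (the versions above are what the authors tested, not hard requirements; this snippet is a convenience, not part of the repository):

    import torch
    import transformers

    print(torch.__version__)         # tested with 1.7
    print(transformers.__version__)  # tested with 4.3 / 4.4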

Citation

If you find this useful in your research, please consider citing:

@article{hendrycks2021cuad,
  title={CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review},
  author={Dan Hendrycks and Collin Burns and Anya Chen and Spencer Ball},
  journal={arXiv preprint arXiv:2103.06268},
  year={2021}
}

Comments
  • convert squad examples to features very slow

    convert squad examples to features very slow

    Hello, the step to convert squad examples to features is very slow on my machine (48 cores + a GPU). The tqdm estimates 24 hours to finish. Is that normal? Thanks!

    convert squad examples to features: 4%|██ | 865/22450 [20:16<24:52:42, 4.15s/it]

    opened by Saibo-creator 2
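
    A possible mitigation, offered as an editor's sketch rather than a maintainer fix: squad_convert_examples_to_features in Transformers accepts a threads argument (defaulting to 1), so the conversion can be parallelized across CPU cores. The example/tokenizer setup below is a toy stand-in for what train.py builds, and the length parameters are assumptions:

        import os
        from transformers import AutoTokenizer
        from transformers.data.processors.squad import (
            SquadExample,
            squad_convert_examples_to_features,
        )

        tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=False)

        # Toy example standing in for the CUAD examples loaded in train.py.
        examples = [
            SquadExample(
                qas_id="demo-0",
                question_text="Who are the parties to this agreement?",
                context_text="This Agreement is between Acme Corp. and Beta LLC.",
                answer_text=None,
                start_position_character=None,
                title="demo",
                is_impossible=True,
                answers=[],
            )
        ]

        features, dataset = squad_convert_examples_to_features(
            examples=examples,
            tokenizer=tokenizer,
            max_seq_length=512,      # assumed; match the training configuration
            doc_stride=256,          # assumed
            max_query_length=64,     # assumed
            is_training=False,
            return_dataset="pt",
            threads=os.cpu_count(),  # the key change: parallelize across cores
        )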
  • Predictions take a lot of time.

    Predictions take a lot of time.

    Hi

    When using the DeBERTa model for predictions, it takes more than an hour for a single document (an 85-page document). Is there any way to reduce the time taken? Please advise. Thanks in advance.

    opened by deepakumarln 2
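
    Two generic speedups worth trying, sketched here as assumptions rather than maintainer guidance: run inference under no_grad in half precision, and batch more of the document's windows together per forward pass (if the script exposes the usual run_squad-style eval batch size flag, raising it has the same effect). Assuming model and inputs are prepared as in the loading sketch near the top of this page:

        import torch

        model.cuda().eval()
        inputs = {k: v.cuda() for k, v in inputs.items()}

        # fp16 autocasting roughly halves compute and memory traffic on recent GPUs.
        with torch.no_grad(), torch.cuda.amp.autocast():
            outputs = model(**inputs)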
  • Consume too much memory

    Consume too much memory

    When "Creating features from dataset file at .", this code consumes too much memory (I have a 48G machine).

    This makes me can not run this code. (I guess this needs a 128G machine)

    Is it possible to fix this problem? Thx.

    opened by wangdsh 2
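
    One generic way to bound peak memory, again an editor's sketch and not the repository's code: convert and cache the features in chunks, then reload the cached tensors for training or evaluation. The chunk size and length parameters are assumptions:

        import torch

        # examples, tokenizer and squad_convert_examples_to_features
        # as in the parallelization sketch above.
        chunk_size = 1000
        for i in range(0, len(examples), chunk_size):
            feats = squad_convert_examples_to_features(
                examples=examples[i : i + chunk_size],
                tokenizer=tokenizer,
                max_seq_length=512,   # assumed
                doc_stride=256,       # assumed
                max_query_length=64,  # assumed
                is_training=False,
            )
            torch.save(feats, f"features_{i}.pt")  # reload later with torch.load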
  • Checkpoints location

    Checkpoints location

    I couldn't locate them in the provided documentation; would you mind pointing or linking to them in the README?

    We provide checkpoints for three of the best models fine-tuned on CUAD: RoBERTa-base (~100M parameters), RoBERTa-large (~300M parameters), and DeBERTa-xlarge (~900M parameters).
    
    opened by gitcarbs 2
  • Provide the precision numbers for best CUAD model from paper

    Provide the precision numbers for best CUAD model from paper

    In the CUAD paper you show the chart below. Can you share the precision numbers that correspond to each clause? That would be useful for comparing against other models.

    [image: per-clause precision chart from the CUAD paper]

    opened by ahegel 1
  • [Feature Request] Publish Dataset to Hugging Face Datasets?

    [Feature Request] Publish Dataset to Hugging Face Datasets?

    Thanks so much for open-sourcing this dataset; I'm looking forward to using it! I would love it if you added it to Hugging Face's Datasets to make it even more accessible and discoverable for folks!

    opened by morganmcg1 1
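
    For what it's worth, CUAD did later appear on the Hugging Face Hub; assuming the cuad identifier still resolves, loading it is a one-liner:

        from datasets import load_dataset

        cuad = load_dataset("cuad")  # assumed dataset id
        print(cuad["train"][0]["question"])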
  • Could you push your models to Huggingface hub?

    Could you push your models to Huggingface hub?

    First of all, thanks for this amazing dataset and for the pretrained models. Would it be possible to push your models to the Hugging Face model hub (https://huggingface.co/models)? I can do it too, but I think it would give the models more legitimacy if they came from your account.

    opened by dfioravanti 0
  • Why is test dataset (test.json) labeled?

    Why is test dataset (test.json) labeled?

    The "--predict_file ./data/test.json" file is labeled with questions and answers, and it's passed directly into predictions = compute_predictions_logits() for predictions in train.py.

    If I want to use your model to do predictions on my own dataset, do I also need to label it in the same json format? Doesn't that defeat the purpose? Let me know if I am misunderstanding, but shouldn't the model predict on unlabeled, raw text file?

    Thanks!

    opened by ShuJackson 1
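
    If the processing mirrors the standard SQuAD pipeline, the gold answers are only consumed when computing metrics, so for prediction-only runs a file in the same SQuAD-style layout with empty answers should be enough. A sketch of building one for a new contract (paths and IDs are placeholders, and the question strings should be copied verbatim from CUAD so they match what the model was trained on):

        import json

        data = {
            "version": "v1.0",
            "data": [{
                "title": "my_contract",
                "paragraphs": [{
                    "context": open("my_contract.txt").read(),
                    "qas": [{
                        "id": "my_contract__governing-law",
                        "question": 'Highlight the parts (if any) of this contract related to "Governing Law".',
                        "answers": [],
                        "is_impossible": True,
                    }],
                }],
            }],
        }

        with open("my_test.json", "w") as f:
            json.dump(data, f)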
  • NCCL Error 1: unhandled cuda error

    NCCL Error 1: unhandled cuda error

    When I run the training script (./run.sh), I ran into an instance of 'std::runtime_error' with what(): NCCL Error 1: unhandled cuda error.

    This happens every time in the Evaluation step of the train.py script - after the 'convert squad examples to features' step completes successfully and right after 'Evaluating: 0%' is printed.

    I have made sure torch can pick up the cuda info:

    >>> print(torch.cuda.is_available())
    True

    opened by ShuJackson 3
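
    A common workaround for NCCL errors of this kind, offered as an assumption rather than a verified fix for this repository: NCCL is only exercised for multi-GPU communication, so pinning the run to a single GPU often sidesteps it:

        import os

        # Must run before torch initializes CUDA (e.g. at the top of train.py).
        os.environ["CUDA_VISIBLE_DEVICES"] = "0"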
  • Use Python's default split function

    Use Python's default split function

    The no-argument str.split() is better than splitting just on spaces: it splits on any run of whitespace and drops empty strings. Consider the following two examples.

    "The    quick brown\n fox".split()
    ['The', 'quick', 'brown', 'fox']
    

    vs

    "The    quick brown\n fox".split(" ")
    ['The', '', '', '', 'quick', 'brown\n', 'fox']
    
    opened by brian8128 0
Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering

Disfl-QA is a targeted dataset for contextual disfluencies in an information seeking setting, namely question answering over Wikipedia passages. Disfl-QA builds upon the SQuAD-v2 (Rajpurkar et al., 2018) dataset, where each question in the dev set is annotated to add a contextual disfluency using the paragraph as a source of distractors.

Google Research Datasets 52 Jun 21, 2022
The FinQA dataset from paper: FinQA: A Dataset of Numerical Reasoning over Financial Data

Data and code for EMNLP 2021 paper "FinQA: A Dataset of Numerical Reasoning over Financial Data"

Zhiyu Chen 114 Dec 29, 2022
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Google Research Datasets 740 Dec 24, 2022
A 30000+ Chinese MRC dataset - Delta Reading Comprehension Dataset

Delta Reading Comprehension Dataset (DRCD) is an open-domain Traditional Chinese machine reading comprehension dataset, intended to serve as a standard Traditional Chinese reading-comprehension dataset for transfer learning. It was built from 2,108 articles.

null 272 Dec 15, 2022
Natural language Understanding Toolkit

nut: Natural Language Understanding Toolkit. To install nut you need Python 2.

Peter Prettenhofer 119 Oct 8, 2022
Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

PLBART: code pre-release of our work Unified Pre-training for Program Understanding and Generation, accepted at NAACL 2021.

Wasi Ahmad 138 Dec 30, 2022
KLUE-baseline contains the baseline code for the Korean Language Understanding Evaluation (KLUE) benchmark.

KLUE-baseline contains the baseline code for the Korean Language Understanding Evaluation (KLUE) benchmark.

null 74 Dec 13, 2022
Indobenchmark are collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG)

Indobenchmark Toolkit: collections of Natural Language Understanding (IndoNLU) and Natural Language Generation (IndoNLG) resources.

Samuel Cahyawijaya 11 Aug 26, 2022
Watson Natural Language Understanding and Knowledge Studio

Demonstration material for the Watson Natural Language Understanding and Knowledge Studio services. Overview: https://www.ibm.com/br-pt/cloud/watson-na

Vanderlei Munhoz 4 Oct 24, 2021
PyTorch implementation of the paper: Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding

This repository contains the official PyTorch implementation of the paper Text is no more Enough! A Benchmark for Profile-based Spoken Language Understanding.

Xiao Xu 26 Dec 14, 2022
Easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.

textgenrnn: easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code.

Max Woolf 4.8k Dec 30, 2022
Common Voice Dataset explorer

Common Voice Dataset Explorer. The Common Voice dataset is by Mozilla; this explorer was made during the Hugging Face fine-tuning week. Usage: pip install -r requirements.txt, then launch the Streamlit app.

Ceyda Cinarel 22 Nov 16, 2022
ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset.

ProteinBERT is a universal protein language model pretrained on ~106M proteins from the UniRef90 dataset. Through its Python API, the pretrained model can be fine-tuned on any protein-related task in a matter of minutes. Based on our experiments with a wide range of benchmarks, ProteinBERT usually achieves state-of-the-art performance. ProteinBERT is built on TensorFlow/Keras.

null 241 Jan 4, 2023
TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset.

TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset. TunBERT was applied to three NLP downstream tasks: Sentiment Analysis (SA), Tunisian Dialect Identification (TDI) and Reading Comprehension Question-Answering (RCQA)

InstaDeep Ltd 72 Dec 9, 2022
SDL: Synthetic Document Layout dataset

SDL is a project that synthesizes document images. It facilitates multi-level labeling of document images and can generate documents in multiple languages.

Sơn Nguyễn 0 Oct 7, 2021
PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset

A large-scale PyTorch language model trained on the 1-Billion Word (LM1B / GBW) dataset. Latest results: 39.98 perplexity.

Ryan Spring 114 Nov 4, 2022
Korean Simple Contrastive Learning of Sentence Embeddings using SKT KoBERT and kakaobrain KorNLU dataset

KoSimCSE: Korean Simple Contrastive Learning of Sentence Embeddings, implemented in PyTorch (based on SimCSE). Installation: git clone https://github.com/BM-K/

null 34 Nov 24, 2022