CUAD

Overview

Contract Understanding Atticus Dataset

This repository contains code for the Contract Understanding Atticus Dataset (CUAD), a dataset for legal contract review curated by the Atticus Project. It accompanies the paper CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review by Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball.

Contract review is a task about "finding needles in a haystack." We find that Transformer models have nascent performance on CUAD, but that this performance is strongly influenced by model design and training dataset size. Despite some promising results, there is still substantial room for improvement. As one of the only large, specialized NLP benchmarks annotated by experts, CUAD can serve as a challenging research benchmark for the broader NLP community.

For more details about CUAD and legal contract review, see the Atticus Project website.

Trained Models

We provide checkpoints for three of the best models fine-tuned on CUAD: RoBERTa-base (~100M parameters), RoBERTa-large (~300M parameters), and DeBERTa-xlarge (~900M parameters).
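
A fine-tuned checkpoint can then be loaded with the standard Transformers question-answering classes. A minimal sketch; the local path is a placeholder for wherever you unpack a checkpoint:

    from transformers import AutoModelForQuestionAnswering, AutoTokenizer

    # Placeholder path: point this at an unpacked CUAD checkpoint directory.
    checkpoint = "./cuad-models/roberta-base"

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)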

Requirements

This repository requires the HuggingFace Transformers library. It was tested with Python 3.8, PyTorch 1.7, and Transformers 4.3/4.4.
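
As a quick environment sanity check (a sketch, not a script shipped with this repository):

    import sys
    import torch
    import transformers

    # Tested versions per this README: Python 3.8, PyTorch 1.7, Transformers 4.3/4.4.
    print(sys.version.split()[0])
    print(torch.__version__)
    print(transformers.__version__)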

Citation

If you find this useful in your research, please consider citing:

@article{hendrycks2021cuad,
      title={CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review}, 
      author={Dan Hendrycks and Collin Burns and Anya Chen and Spencer Ball},
      journal={arXiv preprint arXiv:2103.06268},
      year={2021}
}

Comments
  • convert squad examples to features very slow

    Hello, the step to convert squad examples to features is very slow on my machine (48 cores + GPU); tqdm estimates 24 hours to finish. Is that normal? Thanks!

    convert squad examples to features: 4%|██ | 865/22450 [20:16<24:52:42, 4.15s/it]
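
    A likely speedup, assuming the conversion step goes through HuggingFace's squad_convert_examples_to_features: its threads argument parallelizes feature conversion across processes and defaults to 1. A minimal sketch with illustrative values (the file name and hyperparameters are placeholders, not the repository's settings):

        from transformers import AutoTokenizer, squad_convert_examples_to_features
        from transformers.data.processors.squad import SquadV2Processor

        # Slow tokenizer: squad_convert_examples_to_features targets the slow API.
        tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=False)
        examples = SquadV2Processor().get_dev_examples("./data", filename="test.json")

        # threads > 1 spreads tokenization/feature conversion over worker processes.
        features, dataset = squad_convert_examples_to_features(
            examples=examples,
            tokenizer=tokenizer,
            max_seq_length=512,
            doc_stride=256,
            max_query_length=64,
            is_training=False,
            return_dataset="pt",
            threads=8,            # e.g. one per physical core
        )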

    opened by Saibo-creator 2
  • Predictions take a lot of time.

    Hi,

    When using the DeBERTa model for predictions, it takes more than an hour for a single document (85 pages). Is there any way to reduce the time taken? Please advise. Thanks in advance.
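
    Two generic PyTorch levers that usually cut inference time, sketched here as assumptions rather than options documented by this repository: run in half precision on the GPU and disable gradient tracking (the checkpoint path and example inputs are placeholders):

        import torch
        from transformers import AutoModelForQuestionAnswering, AutoTokenizer

        checkpoint = "./cuad-models/deberta-xlarge"   # placeholder path
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForQuestionAnswering.from_pretrained(checkpoint).eval()

        if torch.cuda.is_available():
            model = model.cuda().half()   # fp16 roughly halves GPU inference time

        inputs = tokenizer(
            "Highlight the parts related to Governing Law.",
            "This Agreement shall be governed by the laws of Delaware.",
            return_tensors="pt",
        )
        if torch.cuda.is_available():
            inputs = {k: v.cuda() for k, v in inputs.items()}

        with torch.no_grad():             # skip autograd bookkeeping during prediction
            outputs = model(**inputs)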

    opened by deepakumarln 2
  • Consume too much memory

    When "Creating features from dataset file at .", this code consumes too much memory (I have a 48 GB machine).

    As a result I cannot run the code. (I guess it needs a 128 GB machine.)

    Is it possible to fix this problem? Thanks.
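
    One workaround, sketched here rather than taken from the repository: split the SQuAD-style JSON into smaller files and run feature creation on each one separately, so only a fraction of the features is in memory at a time (file names are placeholders):

        import json

        with open("./data/train.json") as f:    # placeholder input file
            data = json.load(f)["data"]

        # Write n_chunks smaller SQuAD-style files and process each independently.
        n_chunks = 8
        for i in range(n_chunks):
            chunk = {"version": "v2.0", "data": data[i::n_chunks]}
            with open(f"./data/train_chunk_{i}.json", "w") as out:
                json.dump(chunk, out)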

    opened by wangdsh 2
  • Checkpoints location

    I couldn't locate them in the provided documentation; do you mind pointing or linking to them in the README?

    We provide checkpoints for three of the best models fine-tuned on CUAD: RoBERTa-base (~100M parameters), RoBERTa-large (~300M parameters), and DeBERTa-xlarge (~900M parameters).
    
    opened by gitcarbs 2
  • Provide the precision numbers for best CUAD model from paper

    In the CUAD paper you show the chart below. Can you share the precision numbers that correspond to each clause? That would be useful for comparing against other models.

    [chart of per-clause performance from the CUAD paper]

    opened by ahegel 1
  • [Feature Request] Publish Dataset to Hugging Face Datasets?

    Thanks so much for open-sourcing this dataset; looking forward to using it! I would love it if you added it to Hugging Face's Datasets to make it even more accessible and discoverable!
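
    (If it does get published there, loading would look like the sketch below; the cuad identifier is an assumption, not something this repository documents.)

        from datasets import load_dataset

        # Hypothetical hub identifier; check the Hugging Face hub for the real name.
        dataset = load_dataset("cuad")
        print(dataset["train"][0])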

    opened by morganmcg1 1
  • Could you push your models to Huggingface hub?

    First of all, thanks for this amazing dataset and for the pretrained models. Would it be possible to push your models to the Hugging Face model hub (https://huggingface.co/models)? I can do it too, but I think the models would carry more legitimacy if they came from your account.
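
    For reference, uploading would look roughly like the sketch below. Note that push_to_hub requires a newer Transformers release than the 4.3/4.4 versions listed above, plus huggingface-cli login beforehand; the repo name is illustrative:

        from transformers import AutoModelForQuestionAnswering, AutoTokenizer

        checkpoint = "./cuad-models/roberta-base"   # placeholder local path
        model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)

        # Creates/updates a model repo under the logged-in account.
        model.push_to_hub("cuad-roberta-base")
        tokenizer.push_to_hub("cuad-roberta-base")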

    opened by dfioravanti 0
  • Why is test dataset (test.json) labeled?

    The "--predict_file ./data/test.json" file is labeled with questions and answers, and it is passed directly into predictions = compute_predictions_logits() in train.py.

    If I want to use your model to make predictions on my own dataset, do I also need to label it in the same JSON format? Doesn't that defeat the purpose? Let me know if I am misunderstanding, but shouldn't the model predict on unlabeled, raw text?

    Thanks!
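
    A common pattern with SQuAD-style pipelines, offered as an assumption about this repo's loader rather than documented behavior: keep the same JSON structure but leave the answer lists empty, since only the questions and context are needed at prediction time. A minimal sketch:

        import json

        # Minimal SQuAD-v2-style record with no gold answers (prediction only).
        predict_data = {
            "version": "v2.0",
            "data": [{
                "title": "my_contract",
                "paragraphs": [{
                    "context": "Full text of the contract goes here...",
                    "qas": [{
                        "id": "my_contract__governing_law",
                        "question": "Highlight the parts related to Governing Law.",
                        "answers": [],
                        "is_impossible": False,
                    }],
                }],
            }],
        }

        with open("./data/my_predict.json", "w") as f:
            json.dump(predict_data, f)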

    opened by ShuJackson 1
  • NCCL Error 1: unhandled cuda error

    When I run the training script (./run.sh), I hit an instance of 'std::runtime_error', what(): NCCL Error 1: unhandled cuda error.

    This happens every time in the evaluation step of train.py: after the 'convert squad examples to features' step completes successfully, right after 'Evaluating: 0%' is printed.

    I have made sure torch can pick up the CUDA info:

        >>> print(torch.cuda.is_available())
        True

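    NCCL only comes into play with multi-GPU data parallelism, so a common workaround (an assumption about the cause, not a confirmed fix) is to pin the run to a single GPU before CUDA initializes:

        import os

        # Must run before torch touches CUDA, e.g. at the very top of train.py;
        # the shell equivalent is `CUDA_VISIBLE_DEVICES=0 ./run.sh`.
        os.environ["CUDA_VISIBLE_DEVICES"] = "0"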

    opened by ShuJackson 3
  • Use Python's default split function

    Python's default split function (str.split() with no arguments) is better than splitting on a single space: it collapses runs of whitespace, including newlines. Consider the following two examples.

        >>> "The    quick brown\n fox".split()
        ['The', 'quick', 'brown', 'fox']

    vs

        >>> "The    quick brown\n fox".split(" ")
        ['The', '', '', '', 'quick', 'brown\n', 'fox']
    opened by brian8128 0