TANL: Structured Prediction as Translation between Augmented Natural Languages


Code for the paper "Structured Prediction as Translation between Augmented Natural Languages" (ICLR 2021).

If you use this code, please cite the paper using the bibtex reference below.

@inproceedings{tanl,
    title={Structured Prediction as Translation between Augmented Natural Languages},
    author={Giovanni Paolini and Ben Athiwaratkun and Jason Krone and Jie Ma and Alessandro Achille and Rishita Anubhai and Cicero Nogueira dos Santos and Bing Xiang and Stefano Soatto},
    booktitle={9th International Conference on Learning Representations, {ICLR} 2021},
    year={2021},
}

Requirements

  • Python 3.6+
  • PyTorch (tested with version 1.7.1)
  • Transformers (tested with version 4.0.0)
  • NetworkX (tested with version 2.5, only used in coreference resolution)

You can install all required Python packages with pip install -r requirements.txt
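
The versions above correspond to a requirements.txt along these lines (the exact pins are an assumption; the repository's requirements.txt is authoritative):

torch==1.7.1
transformers==4.0.0
networkx==2.5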

Datasets

By default, datasets are expected to be in data/DATASET_NAME. Dataset-specific code is in datasets.py.

For example, the CoNLL04 and ADE datasets (joint entity and relation extraction) in the correct format can be downloaded using the script at https://github.com/markus-eberts/spert/blob/master/scripts/fetch_datasets.sh. For other datasets, pre-processing steps and links are documented in the code.
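
After downloading, each dataset lives in its own folder under data/ (the folder names below follow the data/DATASET_NAME convention; the exact files inside each folder are dataset-specific and documented in datasets.py):

data/
  conll04/   # joint entity and relation extraction
  ade/       # joint entity and relation extraction
  ...        # one subdirectory per dataset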

Running the code

Use the following command: python run.py JOB

The JOB argument refers to a section of the config file, which by default is config.ini. A sample config file is provided, with settings that allow for faster training and lower memory usage than the settings used to obtain the final results in the paper.

For example, to replicate the paper's results on CoNLL04, have the following section in the config file:

[conll04_final]
datasets = conll04
model_name_or_path = t5-base
num_train_epochs = 200
max_seq_length = 256
max_seq_length_eval = 512
train_split = train,dev
per_device_train_batch_size = 8
per_device_eval_batch_size = 16
do_train = True
do_eval = False
do_predict = True
episodes = 1-10
num_beams = 8

Then run python run.py conll04_final. Note that the final results will differ slightly from the ones reported in the paper, due to small code changes and randomness.

Config arguments can be overridden by command-line arguments. For example: python run.py conll04_final --num_train_epochs 50.

Additional details

If do_train = True, the model is trained on the given train split (e.g., 'train') of the given datasets. The final weights and intermediate checkpoints are written to a directory such as experiments/conll04_final-t5-base-ep200-len256-b8-train, with one subdirectory per episode. Results in JSON format are also saved there.

In every episode, the model is trained on a different (random) permutation of the training set. The random seed is given by the episode number, so that each episode always produces exactly the same model.
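
Conceptually, the episode-as-seed behavior can be sketched as follows (this is not the repository's actual code; the function and variable names are made up for illustration):

import random
import torch

def set_episode_seed(episode: int) -> None:
    # Seeding with the episode number makes the data shuffling (and training) reproducible.
    random.seed(episode)
    torch.manual_seed(episode)

set_episode_seed(3)
example_order = list(range(10))
random.shuffle(example_order)  # same permutation every time episode 3 is run
print(example_order)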

Once a model is trained, it is possible to evaluate it without training again. For this, set do_train = False or (more easily) provide the -e command-line argument: python run.py conll04_final -e.

If do_eval = True, the model is evaluated on the 'dev' split. If do_predict = True, the model is evaluated on the 'test' split.

Arguments

The following are the most important command-line arguments for the run.py script. Run python run.py -h for the full list; an example combining several flags is shown after the list.

  • -c CONFIG_FILE: specify config file to use (default is config.ini)
  • -e: only run evaluation (overrides the do_train setting in the config file)
  • -a: evaluate also intermediate checkpoints, in addition to the final model
  • -v: print results for each evaluation run
  • -g GPU: specify which GPU to use for evaluation
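
For example, to re-evaluate an already trained job on GPU 0, including intermediate checkpoints and printing per-run results (this flag combination is just an illustration):

python run.py conll04_final -e -a -v -g 0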

The following are the most important arguments for the config file. See the sample config file to understand the format; an illustrative section combining several of these options is shown after the list.

  • datasets (str): comma-separated list of datasets for training
  • eval_datasets (str): comma-separated list of datasets for evaluation (default is the same as for training)
  • model_name_or_path (str): path to pretrained model or model identifier from huggingface.co/models (e.g. t5-base)
  • do_train (bool): whether to run training (default is False)
  • do_eval (bool): whether to run evaluation on the dev set (default is False)
  • do_predict (bool): whether to run evaluation on the test set (default is False)
  • train_split (str): comma-separated list of data splits for training (default is train)
  • num_train_epochs (int): number of train epochs
  • learning_rate (float): initial learning rate (default is 5e-4)
  • train_subset (float > 0 and <=1): portion of training data to effectively use during training (default is 1, i.e., use all training data)
  • per_device_train_batch_size (int): batch size per GPU during training (default is 8)
  • per_device_eval_batch_size (int): batch size during evaluation (default is 8; only one GPU is used for evaluation)
  • max_seq_length (int): maximum input sequence length after tokenization; longer sequences are truncated
  • max_output_seq_length (int): maximum output sequence length (default is max_seq_length)
  • max_seq_length_eval (int): maximum input sequence length for evaluation (default is max_seq_length)
  • max_output_seq_length_eval (int): maximum output sequence length for evaluation (default is max_output_seq_length or max_seq_length_eval or max_seq_length)
  • episodes (str): episodes to run (default is 0; an interval can be specified, such as 1-4; the episode number is used as the random seed)
  • num_beams (int): number of beams for beam search during generation (default is 1)
  • multitask (bool): if True, the name of the dataset is prepended to each input sentence (default is False)
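
As an illustration of how these options combine, a hypothetical config section for multi-dataset training could look like the following (the section name and values are made up for illustration):

[multi_example]
datasets = conll04,ade
eval_datasets = conll04
model_name_or_path = t5-base
multitask = True
num_train_epochs = 100
per_device_train_batch_size = 8
do_train = True
do_predict = True
episodes = 1-3
num_beams = 8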

See arguments.py and transformers.TrainingArguments for additional config arguments.

Comments
  • The format of Multiwoz dataset

    Hi Giovanni,

    Nice work, and thanks for sharing. I am reproducing the results of the DST task. However, I found that the MultiWOZ 2.1 data produced by the preprocessing script from https://github.com/jasonwu0731/trade-dst does not match the format your code expects. May I ask whether you apply an additional preprocessing step? If so, would you mind sharing the script?

    Sincerely, Yan

    opened by yanzhangnlp 7
  • ATIS and SNIPS Dataset Source

    Hi,

    Would you mind leaving some instructions for where you found/preprocessed the ATIS and SNIPS datasets?

    I found some .tsv files here for train/dev/test, but the format is not exactly what tanl/datasets.py expects.

    Thanks,

    opened by MerrickWang1 3
  • Bug in augment_sentence function

    Hi, there seems to be a small bug in the augment_sentence function in utils.py. When the root of the entity tree is itself an entity with tags, those tags are not added to the output. For example, when I run the code below:

    from utils import augment_sentence

    # First example (kept for reference but commented out; the CoNLL03 example
    # below is the one that reproduces the issue):
    # tokens = ['Tolkien', 'was', 'born', 'here']
    # augmentations = [
    #     ([('person',), ('born in', 'here')], 0, 1),
    #     ([('location',)], 3, 4),
    # ]

    # Example from the test set of CoNLL03 NER: a single entity spanning the whole sentence
    tokens = ['Premier', 'league']
    augmentations = [([('miscellaneous',)], 0, 2)]

    begin_entity_token = "["
    sep_token = "|"
    relation_sep_token = "="
    end_entity_token = "]"

    augmented_output = augment_sentence(tokens, augmentations, begin_entity_token,
                                        sep_token, relation_sep_token, end_entity_token)
    print(augmented_output)

    It prints out Premier league instead of [ Premier league | miscellaneous ]. This happens because, at line 124 of utils.py, the value of the root of the entity tree is reset to an empty list. My quick fix is to initialize the start index of the root to -1, i.e., to change line 103 of utils.py to

    root = (None, -1, len(tokens))   # this node represents the entire sentence
    

    It would be great if someone could let me know if I am correct on this. Thanks!

    opened by xiangc2 3
  • Episode numbers in few-shot experiment

    Hi, thank you for sharing the code! I'm trying to reproduce the results on FewRel 1.0, and I'm wondering how many episodes and how many queries are used in the 1-shot and 5-shot cases, respectively?

    Thanks.

    opened by mtt1998 3
  • About performance on tacred

    Hi,

    Thanks for sharing the code. I am trying to reproduce the results on TACRED, but the F1 score on the test set is only 67.67.

    The config I used is listed below.

    [tacred]
    datasets = tacred
    multitask = False
    model_name_or_path = t5-base
    num_train_epochs = 10
    max_seq_length = 256
    train_split = train
    per_device_train_batch_size = 16
    do_train = True
    do_eval = True
    do_predict = True

    I run the code with

    CUDA_VISIBLE_DEVICES=0,1 nohup python3 -m torch.distributed.launch --nproc_per_node=2 run.py tacred > result.log 2>&1 &

    May I ask what might be going wrong? Thank you.

    Regards, Yiming

    opened by MatthewCYM 3
  • Joint IC/SL Integration

    Description of changes:

    Added ATIS and Snips support to datasets.py and a joint IC/SL format to output_formats.py. Also added configs to config.ini and updated the evaluation logic to include intent classification accuracy.

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by aaronmueller 2
  • About the training batch size used in the low resource experiments

    Hi! Thanks a lot for sharing the code! I'm trying to reproduce the low-resource experiments on the CoNLL04 dataset. Could you please provide the training batch size per GPU and the number of GPUs used? I would also appreciate a full training config!

    opened by wangpf3 2
  • CoNLL2012 Datasets in .json format

    datasets.py expects .json files for the CoNLL2012 dataset. However, after searching online, I cannot find any preprocessing tools that yield .json files for CoNLL2012.

    Would the authors be able to provide a way to preprocess the CoNLL2012 dataset so that it can be used for training?

    Thanks,

    opened by MerrickWang1 1
  • About data files used for the FewRel dataset

    Hi! I'm wondering how to prepare the data files for the FewRel dataset. Do we use the full train_wiki.json from https://github.com/thunlp/FewRel/tree/master/data as the training split for meta-training, and the full val_wiki.json for evaluation (support&query)? I'm confused because I notice that the fewrel_meta config also specifies do_eval=True. Then what dev split would the code use? Would appreciate any guidance on this!

    opened by wangpf3 1
  • Ace2005EventExtraction Dataset

    Hi,

    I've followed the instructions in section A.5 of the paper using this GitHub repo: https://github.com/nlpcl-lab/ace2005-preprocessing/tree/96c8fd4b5a8c87dd6a265d5c14f4d8b8eb9b7fbe

    which gives me train/dev/test.json files for ace2005.

    However, inside tanl/datasets.py (https://github.com/amazon-research/tanl/blob/2bd8052f0ff6df3b8fd04d7da1469d73f8639099/datasets.py#L1165), I cannot find a way to run ACE2005. I am currently receiving the following error when attempting to train with ace2005:

    FileNotFoundError: [Errno 2] No such file or directory: 'data/ace2005event/ace2005event_types.json'

    Does anyone have advice on how to obtain the additional files, beyond train/dev/test.json, that are needed to train ACE2005 event extraction?

    Thanks,

    opened by MerrickWang1 0
  • add multi-woz 2.1 preprocessing scripts

    Issue #, if available: #2

    Description of changes: Add pre-processing instructions and script for the MultiWOZ 2.1 dataset.

    By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

    opened by jasonkrone 0
  • reproduce on other datasets

    Since you mentioned "For other datasets, we provide sample processing code which does not necessarily match the format of publicly available versions (we do not plan to adapt the code to load datasets in other formats)", I'd like to know how I can reproduce the results on the other datasets in the paper.

    opened by Mizar77 0