Code and data for the paper BERT Got a Date: Introducing Transformers to Temporal Tagging

Last update: Dec 4, 2022

Related tags

Deep Learning pytorch transformer seq2seq encoder-decoder temporal-tagger bert-model huggingface token-classification

Overview

BERT Got a Date: Introducing Transformers to Temporal Tagging

Satya Almasian*, Dennis Aumiller*, and Michael Gertz
Heidelberg University
Contact us via: <lastname>@informatik.uni-heidelberg.de

Code and data for the paper BERT Got a Date: Introducing Transformers to Temporal Tagging. Temporal tagging is the task of identification of temporal mentions in text; these expressions can be further divided into different type categories, which is what we refer to as expression (type) classification. This repository describes two different types of transformer-based temporal taggers, which are both additionally capable of expression classification. We follow the TIMEX3 schema definitions in their styling and expression classes (notably, the latter are one of TIME, DATE, SET, DURATION). The available data sources for temporal tagging are in the TimeML format, which is essentially a form of XML with tags encapsulating temporal expressions.
An example can be seen below:

Due to lockdown restrictions, 2020 might go down as the worst economic year in over <TIMEX3 tid="t2" type="DURATION" value="P1DE">a decade</TIMEX3>.
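
To make the inline annotation format concrete, here is a minimal sketch that pulls the TIMEX3 elements out of a sentence like the one above using only the standard library. The function name extract_timex3 and the regular expressions are illustrative assumptions, not part of this repository; a real TimeML file would normally be read with an XML parser.

```python
import re

# Matches one inline TIMEX3 element and captures its attribute string and text.
TIMEX3_PATTERN = re.compile(r"<TIMEX3\s+([^>]*)>(.*?)</TIMEX3>", re.DOTALL)
ATTRIBUTE_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def extract_timex3(text):
    """Return a list of dicts, one per TIMEX3 element found in `text`.

    Hypothetical helper for illustration only; not part of the repository.
    """
    expressions = []
    for match in TIMEX3_PATTERN.finditer(text):
        attributes = dict(ATTRIBUTE_PATTERN.findall(match.group(1)))
        attributes["text"] = match.group(2)
        expressions.append(attributes)
    return expressions


sentence = (
    "Due to lockdown restrictions, 2020 might go down as the worst economic "
    'year in over <TIMEX3 tid="t2" type="DURATION" value="P1DE">a decade</TIMEX3>.'
)
print(extract_timex3(sentence))
# [{'tid': 't2', 'type': 'DURATION', 'value': 'P1DE', 'text': 'a decade'}]
```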

For more data instances, look at the content of data.zip. Refer to the README file in the respective unzipped folder for more information.
This repository contains code for data preparation and training of a seq2seq model (encoder-decoder architectures initialized from encoder-only architectures, specifically BERT or RoBERTa), as well as three token classification encoders (BERT-based).
The output of the models discussed in the paper is in the results folder. Refer to the README file in the folder for more information.

Data Preparation

The scripts to generate training data are in the subfolder data_preparation. For more usage information, refer to the README file in the subfolder. The data used for training and evaluation are provided in zipped form in data.zip.

Evaluation

For evaluation, we use a slightly modified version of the TempEval-3 evaluation toolkit (original source here). We refactored the code to be compatible with Python3, and incorporated additional evaluation metrics, such as a confusion matrix for type classification. We cross-referenced results to ensure full backward-compatibility and all runs result in the exact same results for both versions. Our adjusted code, as well as scripts to convert the output of transformer-based tagging models are in the evaluation subfolder. For more usage information, refer to the README file in the respective subfolder.

Temporal models

We train and evaluate two types of setups for joint temporal tagging and classification:

Token Classification: We define three variants of simple token classifiers; all of them are based on Huggingface's BertForTokenClassification. We adapt their "token classification for named entity recognition script" to train these models. All the models are trained using bert-base-uncased as their pre-trained checkpoint.
Text-to-Text Generation (Seq2Seq): These models are encoder-decoder architectures using BERT or RoBERTa for initial weights. We use Huggingface's EncoderDecoder class for initialization of weights, starting from bert-base-uncased and roberta-base, respectively.

Seq2seq

To train the seq2seq models, use run_seq2seq_bert_roberta.py. Example usage is as follows:

python3 run_seq2seq_bert_roberta.py --model_name roberta-base --pre_train True \
--model_dir ./test --train_data ./data/seq2seq/train/tempeval_train.json \
--eval_data ./data/seq2seq/test/tempeval_test.json --num_gpu 2 --num_train_epochs 1 \
--warmup_steps 100 --seed 0 --eval_steps 200

This trains a roberta2roberta model defined by model_name for num_train_epochs epochs on the GPU with ID num_gpu. The random seed is set by seed and the number of warmup steps by warmup_steps. The training data is specified in train_data, and model_dir defines where the model is saved. Set eval_data if you want intermediate evaluation at the interval defined by eval_steps. If the pre_train flag is set to true, the script loads the checkpoints from the Huggingface hub and fine-tunes on the given dataset. If pre_train is false, we are in fine-tuning mode and you can provide the path to the pre-trained model with pretrain_path. We used the pre_train mode to train on weakly labeled data provided by the rule-based system HeidelTime, and set pre_train to false for fine-tuning on the benchmark datasets. If you simply wish to fine-tune on the benchmark datasets using the Huggingface checkpoints, you can set pre_train to true, as displayed in the example above. For additional arguments such as length penalty, the number of beams, early stopping, and other model specifications, please refer to the script.

Token Classifiers

As mentioned above, all token classifiers are trained using an adaptation of the NER script from Huggingface. To train these models, use run_token_classifier.py as in the following example:

python3 run_token_classifier.py --data_dir /data/temporal/BIO/wikiwars \
--labels ./data/temporal/BIO/train_staging/labels.txt \
--model_name_or_path bert-base-uncased \
--output_dir ./fine_tune_wikiwars/bert_tagging_with_date_no_pretrain_8epochs/bert_tagging_with_date_layer_seed_19 --max_seq_length 512 \
--num_train_epochs 8 --per_device_train_batch_size 34 --save_steps 3000 --logging_steps 300 --eval_steps 3000 \
--do_train --do_eval --overwrite_output_dir --seed 19 --model_date_extra_layer

We used bert-base-uncased as the base of all our token classification models for pre-training, as defined by model_name_or_path. For fine-tuning on the datasets, model_name_or_path should point to the path of the pre-trained model. The labels file is created during data preparation; for more information, refer to the data_preparation subfolder. data_dir points to a folder that contains train.txt, test.txt and dev.txt, and output_dir points to the saving location. You can define the number of epochs with num_train_epochs, set the seed with seed, and set the batch size on each GPU with per_device_train_batch_size. For more information on the parameters, refer to the Huggingface script. In our paper, we introduce 3 variants of token classification, which are selected by flags in the script. If no flag is set, the script trains the vanilla BERT for token classification. The flag model_date_extra_layer trains the model with an extra date layer, and model_crf adds an extra CRF layer. To train the extra date embedding, you need to download the vocabulary file and specify its path in the date_vocab argument. The description and model definition of the BERT variants are in the folder temporal_models. Please refer to the README file there for further information. When training different model types on the same data, make sure to remove the cached dataset, since the feature generation is different for each model type.
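
As a rough illustration of what BIO-tagged data looks like, the sketch below converts a sentence with inline TIMEX3 markup into (token, label) pairs. It is only a sketch under stated assumptions: the helper timex3_to_bio, the whitespace tokenization, and the label names of the form B-DATE / I-DATE / O are all hypothetical. The actual label set is the one in the labels file produced by the data preparation scripts, and those scripts remain the reference for the real file layout.

```python
import re

# Captures the type attribute and the annotated text of each TIMEX3 element.
TIMEX3_PATTERN = re.compile(
    r'<TIMEX3\s+[^>]*?type="([^"]+)"[^>]*>(.*?)</TIMEX3>', re.DOTALL
)


def timex3_to_bio(text):
    """Convert inline TIMEX3 markup to (token, label) pairs.

    Hypothetical helper: tokens are split on whitespace, and label names
    such as "B-DATE" are an assumption, not the repository's label file.
    """
    pairs = []
    position = 0
    for match in TIMEX3_PATTERN.finditer(text):
        # Text before the tagged expression is outside any temporal mention.
        for token in text[position:match.start()].split():
            pairs.append((token, "O"))
        expression_type = match.group(1)
        for index, token in enumerate(match.group(2).split()):
            prefix = "B" if index == 0 else "I"
            pairs.append((token, f"{prefix}-{expression_type}"))
        position = match.end()
    # Remaining text after the last tagged expression.
    for token in text[position:].split():
        pairs.append((token, "O"))
    return pairs


example = 'I left <TIMEX3 tid="t1" type="DATE" value="">last week</TIMEX3> .'
for token, label in timex3_to_bio(example):
    print(token, label)
```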

Load directly from the Huggingface Model Hub

We uploaded our best-performing version of each architecture to the Huggingface Model Hub. The weights for the other four seeding runs are available upon request. We upload the variants that were fine-tuned on the concatenation of all three evaluation sets for better generalization to various domains. Token classification models are variants without pre-training. Both seq2seq models are pre-trained on the weakly labeled corpus and fine-tuned on the mixed data.

Overall we upload the following five models. For other model configurations and checkpoints please get in contact with us:

satyaalmasian/temporal_tagger_roberta2roberta: Our best-performing model from the paper, an encoder-decoder architecture using RoBERTa. The model is pre-trained on weakly labeled news articles, tagged with HeidelTime, and fine-tuned on the train set of TempEval-3, Tweets, and Wikiwars.
satyaalmasian/temporal_tagger_bert2bert: Our second seq2seq model, an encoder-decoder architecture using BERT. The model is pre-trained on weakly labeled news articles, tagged with HeidelTime, and fine-tuned on the train set of TempEval-3, Tweets, and Wikiwars.
satyaalmasian/temporal_tagger_BERT_tokenclassifier: BERT for token classification, the vanilla BERT model from the paper. This model is only trained on the train set of TempEval-3, Tweets, and Wikiwars.
satyaalmasian/temporal_tagger_DATEBERT_tokenclassifier: BERT for token classification with an extra date embedding that encodes the reference date of the document. If the document does not have a reference date, it is best to avoid this model. Moreover, since the architecture is a modification of a default Huggingface model, the usage is not as straightforward and requires the classes defined in the temporal_model module. This model is only trained on the train set of TempEval-3, Tweets, and Wikiwars.
satyaalmasian/temporal_tagger_BERTCRF_tokenclassifier: BERT for token classification with a CRF layer on the output. Since the architecture is a modification of a default Huggingface model, the usage is not as straightforward and requires the classes defined in the temporal_model module. This model is only trained on the train set of TempEval-3, Tweets, and Wikiwars.

In the examples module, you will find two scripts, model_hub_seq2seq_examples.py and model_hub_tokenclassifiers_examples.py, with seq2seq and token classification examples using the Huggingface model hub. The examples load the models and use them to tag example sentences. The seq2seq example uses the pre-defined post-processing from the TempEval evaluation and contains rules for the cases we came across in the benchmark datasets. If you plan to use these models on new data, it is best to inspect the raw output of the first few samples to detect possible format problems that are easily fixable. Further fine-tuning of the models is also possible. For seq2seq models, you can simply load the models with

from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("satyaalmasian/temporal_tagger_roberta2roberta")
model = EncoderDecoderModel.from_pretrained("satyaalmasian/temporal_tagger_roberta2roberta")

and use the DataProcessor from temporal_models.seq2seq_utils to preprocess the json dataset. The model can be fine-tuned using Seq2SeqTrainer (same as in run_seq2seq_bert_roberta.py). For token classifiers the model and the tokenizers are loaded as follows:

from transformers import AutoTokenizer, BertForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("satyaalmasian/temporal_tagger_BERT_tokenclassifier", use_fast=False)
model = BertForTokenClassification.from_pretrained("satyaalmasian/temporal_tagger_BERT_tokenclassifier")

Classifiers need a BIO-tagged file that can be loaded using TokenClassificationDataset and fine-tuned with the Huggingface Trainer. For more information on the usage of these models, refer to their model hub pages.
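
Since the token classifiers work on BIO labels, their per-token predictions have to be merged back into tagged text. The following is a minimal sketch of that merging step, assuming word-level tokens and labels of the form B-TYPE / I-TYPE / O. The function bio_to_timex3 is hypothetical and is not the conversion code shipped in the evaluation subfolder; it only emits the tid and type attributes and does not attempt to produce a normalized value.

```python
def bio_to_timex3(tokens, labels):
    """Wrap contiguous B-/I- labelled tokens in TIMEX3 tags.

    Hypothetical helper for illustration; label names are an assumption.
    Tokens are re-joined with single spaces.
    """
    pieces = []
    current_tokens, current_type = [], None
    counter = 0

    def flush():
        # Close the currently open temporal expression, if any.
        nonlocal current_tokens, current_type, counter
        if current_tokens:
            counter += 1
            pieces.append(
                f'<TIMEX3 tid="t{counter}" type="{current_type}">'
                + " ".join(current_tokens)
                + "</TIMEX3>"
            )
            current_tokens, current_type = [], None

    for token, label in zip(tokens, labels):
        if label == "O":
            flush()
            pieces.append(token)
            continue
        prefix, expression_type = label.split("-", 1)
        # A B- label, or an I- label of a different type, starts a new span.
        if prefix == "B" or expression_type != current_type:
            flush()
            current_type = expression_type
        current_tokens.append(token)
    flush()
    return " ".join(pieces)


tokens = ["I", "lived", "there", "for", "10", "years", "."]
labels = ["O", "O", "O", "O", "B-DURATION", "I-DURATION", "O"]
print(bio_to_timex3(tokens, labels))
# I lived there for <TIMEX3 tid="t1" type="DURATION">10 years</TIMEX3> .
```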

Citation

If you use our models in your work, we would appreciate attribution with the following citation:

@article{almasian2021bert,
  title={{BERT got a Date: Introducing Transformers to Temporal Tagging}},
  author={Almasian, Satya and Aumiller, Dennis and Gertz, Michael},
  journal={arXiv},
  year={2021}
}

Comments

Using reference/create date

Hi, thank you for sharing the nice project and the paper. Is it possible to use reference dates for the examples in model_hub_seq2seq_examples.py?

I also tried to use the tokenclassifier approach. But it seems that it does not provide a value in the tag. Am I getting the expected output?

https://github.com/satya77/Transformer_Temporal_Tagger/blob/81b92d0f0a958a37f8e4ebcb2de13430cfa73277/evaluation/classifier_generate_tempeval_data.py#L158


I lived in New York for <TIMEX3 tid="t28" type="DURATION" value="">10 years</TIMEX3>.
Cumbre Vieja last erupted in <TIMEX3 tid="t29" type="DATE" value="">1971</TIMEX3> and in <TIMEX3 tid="t30" type="DATE" value="">1949</TIMEX3>.
The club's founding date, <TIMEX3 tid="t31" type="DATE" value="">15 January</TIMEX3>, was intentional.
Officers were called to the house at 07:25 BST on <TIMEX3 tid="t32" type="DATE" value="">Sunday</TIMEX3> after concerns were raised about the people living there.
Police were first called to the scene just after <TIMEX3 tid="t33" type="TIME" value="">7</TIMEX3>.<TIMEX3 tid="t34" type="TIME" value="">25am this morning</TIMEX3>, <TIMEX3 tid="t35" type="DATE" value="">Sunday, September 19</TIMEX3>, and have confirmed they will continue to remain in the area for <TIMEX3 tid="t36" type="DATE" value="">some time</TIMEX3>.

Thanks again!

question

opened by maifeng 4

Add license

We should add a license to the project, to make it clear which use cases are legally allowed. Depending on other dependencies, we might have to go with a specific license (e.g., transformers has parts licensed under Apache, I don't know how they affect each other).
help wanted question

opened by dennlinger 3
examples/run_model_hub_taggers.py torch.cat() error when run with model_type = "date"

The problem is caused by line 21: processed_date = torch.LongTensor(date_tokenizer(["2020 2 28" for _ in range(len(input_texts))], add_special_tokens=False)["input_ids"])

It should be: processed_date = torch.LongTensor(date_tokenizer(["2020 2 28"], add_special_tokens=False)["input_ids"])

You are processing each text individually in a loop, so this tensor should just contain the one item.

opened by joshua-rubin 1
Major refactor and consistency fixes
This PR addresses several issues that have come up in the discussions:

Addresses #2 , by simply removing outdated code snippets that caused an issue with specific (newer) transformers versions. Note that this can be further improved in the future by revisiting how much of our actual code is required, and whether we can substitute these classes by inheriting from transformers directly, thus avoiding code duplication.

Adds the option to install the repository as a pip package. This solves several (related) issues, mostly centered around correct imports from parent-level folders, which is non-trivial with Python. For now, the option is to install it with python3 -m pip install . (in the root folder), however, I am actively working to get it on PyPI.org as well. This way, I have also verified that the examples are finally running correctly, which was a previous issue.

I've started to be more serious about "separation of concerns", and hopefully improved readability of the actual module by moving our scripts into subfolders.

Renaming some of the classes to be more consistent (e.g., BERT_CRF_NER is now BertWithCRF), as well as more expressive (NumbertTokenizer is now DateTokenizer).

I've kept the refactor of actual code to a minimum for now, since I was first trying to get the right code structure, which would then allow us to improve other aspects in separate PRs, without further crowding changes into this one. There are some obvious fixes, but no deeper changes to the "business logic".

Also includes re-zipped versions of the LFS files (data and results), since the MacOS-zipped folders contain "ghost files" which show up on Linux systems, and cause the folder structure to become invalid.

Disclaimer: I have only tested functionality on the two model_hub scripts, which is not ideal. If you have any suggestions for which tests/scripts to run to see whether something is broken, please let me know.
This is also relevant for paths referenced in files. While I believe to have taken care of most paths, the package change might have invalidated some paths that were not found by PyCharm, so no guarantees here.
documentation enhancement help wanted
opened by dennlinger 1
`TypeError` with DateBERT in latest transformers version
We've encountered an issue when loading the DateBERT model, which seems to be caused by these lines:

@add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
@add_code_sample_docstrings(
    tokenizer_class=_TOKENIZER_FOR_DOC,
    checkpoint=_CHECKPOINT_FOR_DOC,
    output_type=TokenClassifierOutput,
    config_class=_CONFIG_FOR_DOC,
)

Since the docstrings aren't currently used, we will remove them shortly and make sure that everything is still runnable with the latest version of transformers.
bug
opened by dennlinger 1
How to tag dates using the model temporal_tagger_DATEBERT_tokenclassifier
I am using this code to load the model and the tokenizer:

tokenizer = AutoTokenizer.from_pretrained("satyaalmasian/temporal_tagger_DATEBERT_tokenclassifier", use_fast=False)
model = tr.BertForTokenClassification.from_pretrained("satyaalmasian/temporal_tagger_DATEBERT_tokenclassifier")

I have a list of text:

examples=['Texas DRIVER LICENSE Class AM 04/10/2014', 'N. Joy Driving DOB 09/21/1990']

How do I now pass this into the model to get proper tagging of dates? Sorry little confused.

@dennlinger
opened by pratikchhapolika 2
Issue Loading Data

Hello,

I have been unable to unzip this data.zip file. Is it possibly broken?

Additionally, is there pre-processed Seq2Seq available for download? (uploading as a Huggingface dataset would be awesome)

Thanks!

opened by JosephGatto 1
Plain text tagging

Hi, I am trying to use this model for inference but the plain text tagging comes out to be very weird.

Input: "Hello, today is Monday but not Tuesday, maybe tomorrow"

Output: 'Hello, today </time x3> is </ timex 3>, maybe this AFTER now type + - today like $ ',

Is this expected or am I missing something ?

opened by nikhilranjan7 1
The latest version (v3) on arXiv is not available

First, thank you for the great work!

Just want to bring to your attention that the latest arXiv version (v3) is not available. Any possibility of seeing the v3 published again?

opened by XiaomoWu 2
Cannot directly write out annotation_group due to potentially different casing

Hey there,

I'm getting the following error and I'm not sure what is causing it.

/opt/conda/lib/python3.9/site-packages/temporal_taggers/evaluation/tagger_evaluation.py(152)place_timex_tag()
    150 import pdb
    151 pdb.set_trace()
--> 152 print(f"Remaining raw text: {raw_text}")
    153 raise ValueError(f"Could not find current annotation group "{annotation_group}" in text.")
    154 # Cannot directly write out annotation_group due to potentially different casing

Any ideas on how I can troubleshoot?
bug

opened by linguist89 1
Automatically batch texts when too long

For samples that exceed the 512 subword token limit, we currently do not have a strategy in place to deal with this. This is both unwanted and relatively easy to improve. There are a few considerations with respect to the exact strategy to be used, but it seems like a good starting point to approximate sentences with something like a lightweight spacy model, and then chunk based on approximate max length.
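
One possible shape for the chunking strategy described above is sketched here. It is an assumption-laden illustration, not an implemented feature: sentences are approximated with a simple regular expression instead of a spaCy model, and length is counted in whitespace-separated words rather than subword tokens, so max_tokens should be set well below 512 in practice. The function name chunk_text is hypothetical.

```python
import re


def chunk_text(text, max_tokens=512):
    """Greedily pack sentences into chunks of at most `max_tokens` words.

    Hypothetical helper. A single sentence longer than `max_tokens` is
    kept as its own (oversized) chunk rather than being split further.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_length = [], [], 0
    for sentence in sentences:
        length = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and current_length + length > max_tokens:
            chunks.append(" ".join(current))
            current, current_length = [], 0
        current.append(sentence)
        current_length += length
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five. Six seven eight nine."
print(chunk_text(text, max_tokens=5))
# ['One two three. Four five.', 'Six seven eight nine.']
```

Each chunk could then be tagged separately and the outputs concatenated in order.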
bug enhancement

opened by dennlinger 0
