
DDAMS

This is the PyTorch code for our IJCAI 2021 paper: Dialogue Discourse-Aware Graph Model and Data Augmentation for Meeting Summarization [arXiv preprint].

Requirements

  • We use Conda with Python 3.7 and strongly recommend creating a fresh environment: conda create -n ddams python=3.7.
  • Activate the environment, then run: pip install -r requirements.txt (a consolidated sketch follows this list).
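Putting the steps together, a minimal setup could look like the following (a sketch; it assumes you clone the repository from GitHub and already have Conda installed):

# clone the repository and enter it
git clone https://github.com/xcfcode/DDAMS.git
cd DDAMS

# create and activate an isolated Python 3.7 environment
conda create -n ddams python=3.7
conda activate ddams

# install the pinned dependencies
pip install -r requirements.txt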

Data

You can download the data here and place it under the project directory DDAMS/data/xxx.

  • data/ami
    • data/ami/ami: preprocessed meeting data.
    • data/ami/ami_qg: pseudo summarization data.
    • data/ami/ami_reference: gold references for the test file.
  • data/icsi
    • data/icsi/icsi: preprocessed meeting data.
    • data/icsi/icsi_qg: pseudo summarization data.
    • data/icsi/icsi_reference: gold references for the test file.
  • data/glove: the pre-trained word embeddings glove.6B.300d.txt.

Reproduce Results

Follow the steps below to reproduce the best results reported in our paper.

download checkpoints

Download the checkpoints here. Put them, AMI.pt and ICSI.pt, under the project directory as DDAMS/models/xx.pt.
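For example (a sketch; it assumes both checkpoint files sit in the current directory after downloading):

mkdir -p models
mv AMI.pt ICSI.pt models/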

translate

Produce the final summaries.

For AMI, the following command writes summaries/ami_summary.txt.

CUDA_VISIBLE_DEVICES=X python translate.py -batch_size 1 \
               -src data/ami/ami/test.src \
               -tgt data/ami/ami/test.tgt \
               -seg data/ami/ami/test.seg \
               -speaker data/ami/ami/test.speaker \
               -relation data/ami/ami/test.relation \
               -beam_size 10 \
               -share_vocab \
               -dynamic_dict \
               -replace_unk \
               -model models/AMI.pt \
               -output summaries/ami_summary.txt \
               -block_ngram_repeat 3 \
               -gpu 0 \
               -min_length 280 \
               -max_length 450

For ICSI, the following command writes summaries/icsi_summary.txt.

CUDA_VISIBLE_DEVICES=X python translate.py -batch_size 1 \
               -src data/icsi/icsi/test.src \
               -seg data/icsi/icsi/test.seg \
               -speaker data/icsi/icsi/test.speaker \
               -relation data/icsi/icsi/test.relation \
               -beam_size 10 \
               -share_vocab \
               -dynamic_dict \
               -replace_unk \
               -model models/ICSI.pt \
               -output summaries/icsi_summary.txt \
               -block_ngram_repeat 3 \
               -gpu 0 \
               -min_length 400 \
               -max_length 550

remove tags

The sentence tags <t> and </t> will raise errors in the ROUGE test, so we first remove them (following OpenNMT).

sed -i 's/ <\/t>//g' summaries/ami_summary.txt
sed -i 's/<t> //g' summaries/ami_summary.txt
sed -i 's/ <\/t>//g' summaries/icsi_summary.txt
sed -i 's/<t> //g' summaries/icsi_summary.txt
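Equivalently, one loop strips both tags from every summary at once (a sketch; it assumes GNU sed and that all summaries live under summaries/):

for f in summaries/*_summary.txt; do
    sed -i -e 's/ <\/t>//g' -e 's/<t> //g' "$f"   # drop </t> and <t> with their padding spaces
done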

test rouge score

  • In test_rouge.py and test_rouge_icsi.py, change the path passed to pyrouge.Rouge155() to your local ROUGE installation (or use the helper shown below).

The output format is >> ROUGE(1/2/L): xx.xx-xx.xx-xx.xx

python test_rouge.py -c summaries/ami_summary.txt
python test_rouge_icsi.py -c summaries/icsi_summary.txt
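Alternatively, pyrouge provides a helper command that records the ROUGE path once, so the scripts need no editing (a sketch; the path below is a placeholder for your local ROUGE-1.5.5 installation, and it assumes your pyrouge version installs this console script):

pyrouge_set_rouge_path /absolute/path/to/ROUGE-1.5.5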

ROUGE score

You will get the following ROUGE scores.

        ROUGE-1   ROUGE-2   ROUGE-L
AMI       53.15     22.32     25.67
ICSI      40.41     11.02     19.18

From Scratch

For AMI

Preprocess

(1) Preprocess the AMI dataset.

python preprocess.py -train_src data/ami/ami/train.src \
                     -train_tgt data/ami/ami/train.tgt \
                     -train_seg data/ami/ami/train.seg \
                     -train_speaker data/ami/ami/train.speaker \
                     -train_relation data/ami/ami/train.relation \
                     -valid_src data/ami/ami/valid.src \
                     -valid_tgt data/ami/ami/valid.tgt \
                     -valid_seg data/ami/ami/valid.seg \
                     -valid_speaker data/ami/ami/valid.speaker \
                     -valid_relation data/ami/ami/valid.relation \
                     -save_data data/ami/AMI \
                     -dynamic_dict \
                     -share_vocab \
                     -lower \
                     -overwrite

(2) Create pre-trained word embeddings from GloVe, using the vocabulary file produced in step (1).

python embeddings_to_torch.py -emb_file_both data/glove/glove.6B.300d.txt \
                              -dict_file data/ami/AMI.vocab.pt \
                              -output_file data/ami/ami_embeddings
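This writes data/ami/ami_embeddings.enc.pt and data/ami/ami_embeddings.dec.pt, which the training commands below load via -pre_word_vecs_enc and -pre_word_vecs_dec. A quick sanity check (a sketch; it assumes the file stores a plain embedding tensor, as in OpenNMT's embeddings_to_torch.py):

# print the vocab-size x 300 shape of the encoder embedding matrix
python -c "import torch; print(torch.load('data/ami/ami_embeddings.enc.pt').shape)"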

(3) Preprocess the pseudo summarization dataset.

python preprocess.py -train_src data/ami/ami_qg/train.src \
                     -train_tgt data/ami/ami_qg/train.tgt \
                     -train_seg data/ami/ami_qg/train.seg \
                     -train_speaker data/ami/ami_qg/train.speaker \
                     -train_relation data/ami/ami_qg/train.relation \
                     -save_data data/ami/AMIQG \
                     -lower \
                     -overwrite \
                     -shard_size 500 \
                     -dynamic_dict \
                     -share_vocab

Train

(1) We first pre-train DDAMS on the pseudo summarization dataset.

  • Run the command below once to save the config file (-save_config).
  • Then remove -save_config and rerun the command to start training.

CUDA_VISIBLE_DEVICES=X python train.py -save_model ami_qg_pretrain/AMI_qg \
           -data data/ami/AMIQG \
           -speaker_type ami \
           -batch_size 64 \
           -learning_rate 0.001 \
           -share_embeddings \
           -share_decoder_embeddings \
           -copy_attn \
           -reuse_copy_attn \
           -report_every 30 \
           -encoder_type hier3 \
           -global_attention general \
           -save_checkpoint_steps 500 \
           -start_decay_steps 1500 \
           -pre_word_vecs_enc data/ami/ami_embeddings.enc.pt \
           -pre_word_vecs_dec data/ami/ami_embeddings.dec.pt \
           -log_file logs/ami_qg_pretrain.txt \
           -save_config logs/ami_qg_pretrain.txt

(2) Fine-tune on AMI.

CUDA_VISIBLE_DEVICES=X python train.py -save_model ami_final/AMI \
           -data data/ami/AMI \
           -speaker_type ami \
           -train_from ami_qg_pretrain/xxx.pt  \
           -reset_optim all \
           -batch_size 1 \
           -learning_rate 0.0005 \
           -share_embeddings \
           -share_decoder_embeddings \
           -copy_attn \
           -reuse_copy_attn \
           -encoder_type hier3 \
           -global_attention general \
           -dropout 0.5 \
           -attention_dropout 0.5 \
           -start_decay_steps 500 \
           -decay_steps 500 \
           -log_file logs/ami_final.txt \
           -save_config logs/ami_final.txt

Translate

CUDA_VISIBLE_DEVICES=X python translate.py -batch_size 1 \
               -src data/ami/ami/test.src \
               -tgt data/ami/ami/test.tgt \
               -seg data/ami/ami/test.seg \
               -speaker data/ami/ami/test.speaker \
               -relation data/ami/ami/test.relation \
               -beam_size 10 \
               -share_vocab \
               -dynamic_dict \
               -replace_unk \
               -model xxx.pt \
               -output xxx.txt \
               -block_ngram_repeat 3 \
               -gpu 0 \
               -min_length 280 \
               -max_length 450

For ICSI

Preprocess

(1) Preprocess the ICSI dataset.

python preprocess.py -train_src data/icsi/icsi/train.src \
                     -train_tgt data/icsi/icsi/train.tgt \
                     -train_seg data/icsi/icsi/train.seg \
                     -train_speaker data/icsi/icsi/train.speaker \
                     -train_relation data/icsi/icsi/train.relation \
                     -valid_src data/icsi/icsi/valid.src \
                     -valid_tgt data/icsi/icsi/valid.tgt \
                     -valid_seg data/icsi/icsi/valid.seg \
                     -valid_speaker data/icsi/icsi/valid.speaker \
                     -valid_relation data/icsi/icsi/valid.relation \
                     -save_data data/icsi/ICSI \
                     -src_seq_length 20000 \
                     -src_seq_length_trunc 20000 \
                     -tgt_seq_length 700 \
                     -tgt_seq_length_trunc 700 \
                     -dynamic_dict \
                     -share_vocab \
                     -lower \
                     -overwrite

(2) Create pre-trained word embeddings.

python embeddings_to_torch.py -emb_file_both data/glove/glove.6B.300d.txt \
                              -dict_file data/icsi/ICSI.vocab.pt \
                              -output_file data/icsi/icsi_embeddings

(3) Preprocess the pseudo summarization dataset.

python preprocess.py -train_src data/icsi/icsi_qg/train.src \
                     -train_tgt data/icsi/icsi_qg/train.tgt \
                     -train_seg data/icsi/icsi_qg/train.seg \
                     -train_speaker data/icsi/icsi_qg/train.speaker \
                     -train_relation data/icsi/icsi_qg/train.relation \
                     -save_data data/icsi/ICSIQG \
                     -lower \
                     -overwrite \
                     -shard_size 500 \
                     -dynamic_dict \
                     -share_vocab

Train

(1) Pre-train on the pseudo summarization dataset. As with AMI, first run the command with -save_config to save the config file, then remove the flag and rerun to start training.

CUDA_VISIBLE_DEVICES=X python train.py -save_model icsi_qg_pretrain/ICSI \
           -data data/icsi/ICSIQG \
           -speaker_type icsi \
           -batch_size 64 \
           -learning_rate 0.001 \
           -share_embeddings \
           -share_decoder_embeddings \
           -copy_attn \
           -reuse_copy_attn \
           -report_every 30 \
           -encoder_type hier3 \
           -global_attention general \
           -save_checkpoint_steps 500 \
           -start_decay_steps 1500 \
           -pre_word_vecs_enc data/icsi/icsi_embeddings.enc.pt \
           -pre_word_vecs_dec data/icsi/icsi_embeddings.dec.pt \
           -log_file logs/icsi_qg_pretrain.txt \
           -save_config logs/icsi_qg_pretrain.txt

(2) Fine-tune on ICSI.

CUDA_VISIBLE_DEVICES=X python train.py -save_model icsi_final/ICSI \
           -data data/icsi/ICSI \
           -speaker_type icsi \
           -train_from icsi_qg_pretrain/xxx.pt  \
           -reset_optim all \
           -batch_size 1 \
           -learning_rate 0.0005 \
           -share_embeddings \
           -share_decoder_embeddings \
           -copy_attn \
           -reuse_copy_attn \
           -encoder_type hier3 \
           -global_attention general \
           -dropout 0.5 \
           -attention_dropout 0.5 \
           -start_decay_steps 1000 \
           -decay_steps 100 \
           -save_checkpoint_steps 50 \
           -valid_steps 50 \
           -log_file logs/icsi_final.txt \
           -save_config logs/icsi_final.txt

Translate

CUDA_VISIBLE_DEVICES=X python translate.py -batch_size 1 \
               -src data/icsi/icsi/test.src \
               -seg data/icsi/icsi/test.seg \
               -speaker data/icsi/icsi/test.speaker \
               -relation data/icsi/icsi/test.relation \
               -beam_size 10 \
               -share_vocab \
               -dynamic_dict \
               -replace_unk \
               -model xxx.pt \
               -output xxx.txt \
               -block_ngram_repeat 3 \
               -gpu 0 \
               -min_length 400 \
               -max_length 550

Test Rouge

(1) Before the ROUGE test, we should first remove the special tags <t> and </t>:

sed -i 's/ <\/t>//g' xxx.txt
sed -i 's/<t> //g' xxx.txt

(2) Test ROUGE.

python test_rouge.py -c summaries/xxx.txt
python test_rouge_icsi.py -c summaries/xxx.txt