Extracting Summary Knowledge Graphs from Long Documents

Zeqiu (Ellen) Wu

Last update: Oct 21, 2022

Related tags

Text Data & NLP GraphSum

Overview

GraphSum

This repo contains the data and code for the G2G model in the paper: Extracting Summary Knowledge Graphs from Long Documents. The other baseline TTG is simply based on BertSumExt.

Environment Setup

This code is tested on python 3.6.9, transformer 3.0.2 and pytorch 1.7.0. You would also need numpy and scipy packages.

Data

Download and unzip the data from this link. Put the unzipped folder named as ./data parallel with ./src. You should see four subfolders under ./data/json, corresponding to four data splits as described in the paper.

Under each subfolder, the json file contains all document full texts, abstracts as well as the summarized graphs obtained from the abstract, organized by the document keys. Each full text consists of a list of sections. Each summarized graph contains a list of entity and relation mentions. Except for the test split, other three data splits have their summarized graphs obtained by running DyGIE++ on the abstract. The test set have manually annotated summarized graphs from SciERC dataset. The format of the graph follows the output of DyGIE++, where each entity mention in a section is represented by (start token id, end token id, entity type) and each relation mention is represented by (start token id of entity 1, end token id of entity 1, start token id of entity 2, end token id of entity 2, relation type). The graph also contains a list of coreferential entity mentions.

You should also see two subfolders under the processed folder of each data split: merged_entities and aligned_entities. merged_entities contains the full and summarized graphs for each document, where the graph vertices are cluster of entity mentions. Entity clusters in each summarized graph are coreferential entity mentions predicted by DyGIE++ or annotated (in test set). Entity clusters in each full graph contains entity mentions that are coreferences or share the same non-generic string names (as described in our paper). Under merged_entities, we provide entity clusters and relations between entity clusters, as well as corresponding entity and relation mentions in the full paper or abstract. Each relation is represented by "[entity cluster id 1]_[entity cluster id 2]_[relation type]". The original full graphs with all entity and relation mentions are obtained by running DyGIE++ on the document full text. You don't need them to run the code, but you can find them here. For some entity names, you may see a trailing string "<GENERIC_ID> [number]". It means these entity names are classified by DyGIE++ as "generic" and the trailing string is used to differentiate the same entity name strings in different clusters in such cases.

aligned_entities contains the pre-calculated alignment between entity clusters (see Section 5.1 in the paper) in the summarized and full graphs for each document. In each entity alignment file, under each entity cluster of the summarized graph, there is a list of entity clusters from the full graph if the list is not empty. They are used to facilitate data preprocessing of G2G and evaluation.

Training and Evaluation

The model is based on GAT. Go to ./src and run bash run.sh. You can also find the pretrained model here. Put it under ./src/output and run the inference and evaluation parts in ./src/run.sh.

Quantifiers and Negations in RE Documents

Quantifiers-and-Negations-in-RE-Documents This project was part of my work for a

1 Feb 1, 2022

Python utility library for compositing PDF documents with reportlab.

pdfdoc-py Python utility library for compositing PDF documents with reportlab. Installation The pdfdoc-py package can be installed directly from the s

1 Jan 6, 2022

Auto-researching tool generating word documents.

About ResearchTE automates researching by generating document with answers to given questions. Supports getting results from: Google DuckDuckGo (with

1 Feb 14, 2022

[WWW 2021 GLB] New Benchmarks for Learning on Non-Homophilous Graphs

New Benchmarks for Learning on Non-Homophilous Graphs Here are the codes and datasets accompanying the paper: New Benchmarks for Learning on Non-Homop

94 Dec 21, 2022

This repo is to provide a list of literature regarding Deep Learning on Graphs for NLP

230 Nov 22, 2022

When doing audio and video sentiment recognition, I found that a lot of code is duplicated, often a function in different time debugging for a long time, based on this problem, I want to manage all the previous work, organized into an open source library can be iterative. For their own use and others.

FastAudioVisual Our project is developed here. The goal finish time is March 01, 2021 What is FastAudioVisual? FastAudioVisual is a tool that allows u

39 Oct 27, 2022

ThinkTwice: A Two-Stage Method for Long-Text Machine Reading Comprehension

ThinkTwice ThinkTwice is a retriever-reader architecture for solving long-text machine reading comprehension. It is based on the paper: ThinkTwice: A

4 Aug 6, 2021

Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Pytorch-NLU，一个中文文本分类、序列标注工具包，支持中文长文本、短文本的多类、多标签分类任务，支持中文命名实体识别、词性标注、分词等序列标注任务。 Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

186 Dec 24, 2022

Official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

This repository is the official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

101 Dec 30, 2022

Comments

IndexError: list index out of range

hello. An error was caught while executing the code, so I wrote the article. In generate_graph_sum_input.py "corpus_type = sys.argv[1]" This part seems to mean the data path, but I keep getting an error at this part. I would be very grateful if you could explain this part. And you said to put the data file in parallel with src, but I do not understand this part, so I ask a question.

opened by 01JDJ 4

Extracting Summary Knowledge Graphs from Long Documents

Related tags

Overview

GraphSum

Environment Setup

Data

Training and Evaluation

You might also like...

Quantifiers and Negations in RE Documents

Python utility library for compositing PDF documents with reportlab.

Auto-researching tool generating word documents.

[WWW 2021 GLB] New Benchmarks for Learning on Non-Homophilous Graphs

This repo is to provide a list of literature regarding Deep Learning on Graphs for NLP

ThinkTwice: A Two-Stage Method for Long-Text Machine Reading Comprehension

Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Official Pytorch implementation of Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision.

Comments

IndexError: list index out of range

Owner

Zeqiu (Ellen) Wu

This repository contains Python scripts for extracting linguistic features from Filipino texts.

Search for documents in a domain through Google. The objective is to extract metadata

texlive expressions for documents

Module for automatic summarization of text documents and HTML pages.

Python implementation of TextRank for phrase extraction and summarization of text documents

A full spaCy pipeline and models for scientific/biomedical documents.

Module for automatic summarization of text documents and HTML pages.

Python implementation of TextRank for phrase extraction and summarization of text documents

A full spaCy pipeline and models for scientific/biomedical documents.

Implementation of TF-IDF algorithm to find documents similarity with cosine similarity