Graph parsing approach to structured sentiment analysis.

Overview

Fine-grained Sentiment Analysis as Dependency Graph Parsing

This repository contains the code and datasets described in the following paper: Fine-grained Sentiment Analysis as Dependency Graph Parsing.

Problem description

Fine-grained sentiment analysis can be theoretically cast as an information extraction problem in which one attempts to find all of the opinion tuples $O = O_1,\ldots,O_n$ in a text. Each opinion $O_i$ is a tuple $(h, t, e, p)$, where $h$ is a holder who expresses a polarity $p$ towards a target $t$ through a sentiment expression $e$, implicitly defining the relationships between these elements.
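As a rough illustration of this tuple structure, an opinion could be represented in Python as follows. The class name `Opinion`, its field names, the example sentence, and the choice to allow a missing holder or target are assumptions made for this sketch; they do not come from the repository.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Opinion:
    """One opinion tuple (h, t, e, p); all names here are illustrative only."""
    holder: Optional[str]   # h: who expresses the opinion (assumed optional)
    target: Optional[str]   # t: what the opinion is about (assumed optional)
    expression: str         # e: the sentiment expression
    polarity: str           # p: for example "positive" or "negative"


def opinions_with_polarity(opinions: List[Opinion], polarity: str) -> List[Opinion]:
    """Return only the opinions that carry the given polarity."""
    return [o for o in opinions if o.polarity == polarity]


# Hypothetical sentence: "I loved the rooms but the staff was rude ."
opinions = [
    Opinion(holder="I", target="the rooms", expression="loved", polarity="positive"),
    Opinion(holder=None, target="the staff", expression="rude", polarity="negative"),
]
positive = opinions_with_polarity(opinions, "positive")
```

A text then corresponds to a list of such objects, one per opinion $O_i$.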

The two examples below (first in English, then in Basque) illustrate this conception of sentiment graphs.

multilingual example

Rather than treating this as a sequence-labeling task, we can treat it as a bilexical dependency graph prediction task, although some decisions must be made. We create two versions, (a) head-first and (b) head-final, shown below:

bilexical
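One possible way to turn an opinion tuple into labelled arcs under the two conventions is sketched here. The arc labels (`exp-<polarity>`, `holder`, `targ`), the function names, and the attachment scheme (expression head to the root, holder and target heads to the expression head, remaining span tokens to their span head) are assumptions made for illustration and may differ from the paper and from the provided scripts.

```python
from typing import List, Optional, Sequence, Tuple

Arc = Tuple[int, int, str]  # (head, dependent, label); index 0 is an artificial root


def span_head(span: Sequence[int], head_first: bool) -> int:
    """Pick the first token of a span (head-first) or the last one (head-final)."""
    return span[0] if head_first else span[-1]


def internal_arcs(span: Sequence[int], label: str, head_first: bool) -> List[Arc]:
    """Attach every non-head token of a span to the head of that span."""
    head = span_head(span, head_first)
    return [(head, tok, label) for tok in span if tok != head]


def opinion_to_arcs(
    expression: Sequence[int],
    holder: Optional[Sequence[int]],
    target: Optional[Sequence[int]],
    polarity: str,
    head_first: bool = True,
) -> List[Arc]:
    """Build the arcs for one opinion; spans are lists of 1-based token indices."""
    exp_label = f"exp-{polarity}"
    exp_head = span_head(expression, head_first)
    arcs: List[Arc] = [(0, exp_head, exp_label)]
    arcs += internal_arcs(expression, exp_label, head_first)
    for span, label in ((holder, "holder"), (target, "targ")):
        if span:  # skip holders or targets that are absent
            arcs.append((exp_head, span_head(span, head_first), label))
            arcs += internal_arcs(span, label, head_first)
    return arcs


# Hypothetical sentence "I loved the rooms": I=1, loved=2, the=3, rooms=4.
head_first_arcs = opinion_to_arcs([2], [1], [3, 4], "positive", head_first=True)
head_final_arcs = opinion_to_arcs([2], [1], [3, 4], "positive", head_first=False)
```

In this sketch the two versions differ only in which token of a multi-token span acts as its head: the target "the rooms" is headed by "the" in the head-first version and by "rooms" in the head-final one.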

Requirements

  1. python3
  2. pytorch
  3. matplotlib
  4. sklearn
  5. gensim
  6. numpy
  7. h5py
  8. transformers
  9. tqdm
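A quick way to check whether these packages are importable is sketched below. It relies only on the standard library; the function name `missing_modules` is made up for this example, and note that two import names differ from the list above (`torch` for pytorch, while `sklearn` is the import name of the scikit-learn package).

```python
import importlib.util

# Import names for the requirements listed above.
# "pytorch" is imported as `torch`; `sklearn` is provided by scikit-learn.
REQUIRED_MODULES = [
    "torch", "matplotlib", "sklearn", "gensim",
    "numpy", "h5py", "transformers", "tqdm",
]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```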

Data collection and preprocessing

We provide the preprocessed bilexical sentiment graph data as conllu files in 'data/sent_graphs'. If you want to run the experiments, you can use this data directly. If, however, you are interested in how we create the data, you can use the following steps.
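To inspect the provided files, a minimal reader for the CoNLL-U layout (tab-separated columns, `#` comment lines, blank lines between sentences) might look like the sketch below. The function name `read_conllu` and the returned dictionary layout are assumptions for illustration, and no assumption is made about which column carries the sentiment arcs.

```python
from typing import Dict, Iterable, Iterator, List


def read_conllu(lines: Iterable[str]) -> Iterator[Dict[str, list]]:
    """Yield one dict per sentence with its comment lines and token rows."""
    comments: List[str] = []
    tokens: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():            # a blank line ends the current sentence
            if tokens:
                yield {"comments": comments, "tokens": tokens}
            comments, tokens = [], []
        elif line.startswith("#"):      # sentence-level metadata
            comments.append(line)
        else:                           # one token per line, columns split on tabs
            tokens.append(line.split("\t"))
    if tokens:                          # file without a trailing blank line
        yield {"comments": comments, "tokens": tokens}


def row(idx: int, form: str) -> str:
    """Build a 10-column CoNLL-U row with placeholder values (example data only)."""
    return "\t".join([str(idx), form] + ["_"] * 8)


sample_lines = ["# text = Bona ubicació .", row(1, "Bona"), row(2, "ubicació"), row(3, "."), ""]
sentences = list(read_conllu(sample_lines))
forms = [tok[1] for tok in sentences[0]["tokens"]]
```

For a real file, the same function can be applied to an open file handle, for example one opened with `open(path, encoding="utf-8")` on a file under 'data/sent_graphs'.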

The first step is to download and preprocess the data, and then create the sentiment dependency graphs. The original data can be downloaded and converted to json files using the scripts found at https://github.com/jerbarnes/finegrained_data. After creating the json files for the fine-grained datasets by following the instructions there, place the directories (renamed to 'mpqa', 'ds_unis', 'norec_fine', 'eu', 'ca') in the 'data' directory.

After that, you can use the available scripts to create the bilexical dependency graphs, as mentioned in the paper.

cd data
./create_english_sent_graphs.sh
./create_euca_sent_graphs.sh
./create_norec_sent_graphs.sh
cd ..

Experimental results

To reproduce the results, first you will need to download the word vectors used:

mkdir vectors
cd vectors
wget http://vectors.nlpl.eu/repository/20/58.zip
wget http://vectors.nlpl.eu/repository/20/32.zip
wget http://vectors.nlpl.eu/repository/20/34.zip
wget http://vectors.nlpl.eu/repository/20/18.zip
cd ..

You will similarly need to extract mBERT token representations for all datasets.

./do_bert.sh
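Because mBERT splits words into subword pieces, obtaining one representation per whitespace token requires pooling the pieces of each word. The sketch below shows one common way to do this by averaging. It is not the repository's own extraction code: the function name, the `word_ids` alignment argument, the toy two-dimensional vectors, and the subword split in the example are all hypothetical, and special tokens such as [CLS] and [SEP] are assumed to have been removed beforehand.

```python
from typing import List, Sequence


def average_subword_vectors(
    vectors: Sequence[Sequence[float]], word_ids: Sequence[int]
) -> List[List[float]]:
    """Average the vectors of consecutive subwords that belong to the same word.

    word_ids[i] is the index of the whitespace token that subword i came from.
    """
    if len(vectors) != len(word_ids):
        raise ValueError("need exactly one word id per subword vector")
    sums: List[List[float]] = []
    counts: List[int] = []
    for vec, wid in zip(vectors, word_ids):
        if wid == len(sums):            # first subword of a new word
            sums.append(list(vec))
            counts.append(1)
        elif wid == len(sums) - 1:      # continuation of the current word
            sums[-1] = [a + b for a, b in zip(sums[-1], vec)]
            counts[-1] += 1
        else:
            raise ValueError("word ids must be non-decreasing without gaps")
    return [[x / n for x in total] for total, n in zip(sums, counts)]


sentence = "Bona ubicació ."
# Hypothetical split into five subwords: two for each word, one for the period.
word_ids = [0, 0, 1, 1, 2]
subword_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [5.0, 5.0]]
token_reps = average_subword_vectors(subword_vectors, word_ids)
```

The result has one vector per whitespace token, so `len(token_reps) == len(sentence.split())` holds for this example.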

Finally, you can run the SLURM scripts to reproduce the experimental results.

./scripts/run_base.sh
./scripts/run_bert.sh

Comments
  • Fix broken bert embeddings extractor

    Currently, bert_embed.py fails with

    File "bert_embed.py", line 104, in ee
        token_reps = token_reps.squeeze(0)
    AttributeError: 'str' object has no attribute 'squeeze'
    

    (HuggingFace Transformers 4.14.1, bert-base-multilingual-cased)

    That's because token_reps contains only a string, one of the keys of the BaseModelOutputWithPoolingAndCrossAttentions object.

    I guess the idea here is to get the final representations from the model, so I changed token_reps, _ = model(**tokenized) to token_reps = model(**tokenized)["last_hidden_state"]. Or maybe it should be hidden_states, to get representations from all layers?

    opened by akutuzov 1
  • Move lengths to CPU when packing sequences

    Otherwise, the code fails when run on GPU with PyTorch 1.7.1: RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:0 Long tensor

    opened by akutuzov 0
  • Unable to open file: name = '../data/sent_graphs/ca/test_bert.hdf5'

    While executing the run_bert.sh script, I encountered this error: unable to open file: name = '../data/sent_graphs/ca/test_bert.hdf5'. Is there anything I need to do on my end?

    opened by shamgane 0
  • Predictions as json files

    I tried running the experiments using the SLURM scripts mentioned in the repo to produce predictions for the test.conllu files in all the datasets. I see that the outputs are produced in the .conllu format. Is there any script that can produce outputs in json format?

    opened by shamgane 0
  • do_bert.sh error in bert_embed.py

    I have been trying to run the mBERT extraction script for the dataset ca/head_first with bert-base-multilingual-cased. I get the following error trace:

    Using bert-base-multilingual-cased for data/sent_graphs/ca/head_first/train.conllu
    Saving to data/sent_graphs/ca/head_first/train_bert.hdf5
    Embedding...
    0%| | 0/1173 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "bert_embed.py", line 135, in <module>
        reps, sids = ee(model, args.indata)
      File "bert_embed.py", line 108, in ee
        assert len(sent.split()) == len(ave_reps)
    AssertionError
    Using bert-base-multilingual-cased for data/sent_graphs/ca/head_first/dev.conllu
    Saving to data/sent_graphs/ca/head_first/dev_bert.hdf5
    Embedding...
    0%| | 0/168 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "bert_embed.py", line 135, in <module>
        reps, sids = ee(model, args.indata)
      File "bert_embed.py", line 108, in ee
        assert len(sent.split()) == len(ave_reps)
    AssertionError
    Using bert-base-multilingual-cased for data/sent_graphs/ca/head_first/test.conllu
    Saving to data/sent_graphs/ca/head_first/test_bert.hdf5
    Embedding...
    0%| | 0/336 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "bert_embed.py", line 135, in <module>
        reps, sids = ee(model, args.indata)
      File "bert_embed.py", line 108, in ee
        assert len(sent.split()) == len(ave_reps)
    AssertionError

    It seems that the average_reps function in bert_embed.py returns an empty list [] for the sentence 'Bona ubicació .', so when execution reaches the assert statement, the length of this output does not match the number of tokens in the sentence. This is just one example that illustrates the problem. I would really appreciate guidance on how to fix this.

    opened by shamgane 2
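The AttributeError reported in the first comment above can be reproduced without any model. As that comment notes, unpacking the model output yields its keys rather than its tensors. The sketch below uses a plain `OrderedDict` as a stand-in for the dict-like output object; it is not the actual transformers class, and the nested lists are placeholder values rather than real representations.

```python
from collections import OrderedDict

# Stand-in for a dict-like model output; NOT the actual transformers class.
output = OrderedDict(
    last_hidden_state=[[0.1, 0.2], [0.3, 0.4]],  # placeholder values
    pooler_output=[0.5, 0.6],                    # placeholder values
)

# Tuple unpacking iterates over the mapping, which yields its keys,
# so `first` is a string and has no tensor methods such as squeeze().
first, second = output

# Indexing by key returns the value itself, as in the fix from the comment.
token_reps = output["last_hidden_state"]
```

Here `first` is the string "last_hidden_state", which explains the reported message about a 'str' object, while key-based indexing returns the stored value.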
Owner
Jeremy Barnes
I'm a professor of Natural Language Processing. My interests are in multi-linguality and incorporating diverse sources of information into neural networks.