Convolutional Neural Networks for Sentence Classification

Yoon Kim

Last update: Jan 2, 2023

Code for the paper Convolutional Neural Networks for Sentence Classification (EMNLP 2014).

Runs the model on Pang and Lee's movie review dataset (MR in the paper). Please cite the original paper when using the data.

Requirements

Code is written in Python (2.7) and requires Theano (0.7).

Using the pre-trained word2vec vectors will also require downloading the binary file from https://code.google.com/p/word2vec/

Data Preprocessing

To process the raw data, run

python process_data.py path

where path points to the word2vec binary file (i.e. the GoogleNews-vectors-negative300.bin file). This will create a pickle object called mr.p in the same folder, which contains the dataset in the right format.

Note: This will create the dataset with different fold assignments than were used in the paper. You should still get a CV score of >81% with the CNN-nonstatic model, though.
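A minimal sketch of why fold assignments can differ between runs, assuming each review is given a random fold index when the data is processed. This is an assumption about the preprocessing, not a copy of process_data.py; assign_folds and its seed parameter are hypothetical names used only for illustration.

```python
import random


def assign_folds(num_examples, cv=10, seed=None):
    """Give every example a random fold index in [0, cv)."""
    rng = random.Random(seed)
    return [rng.randrange(cv) for _ in range(num_examples)]


# With 10 folds, one fold is held out for testing and the rest are used
# for training; repeating this for every fold gives the CV score.
folds = assign_folds(num_examples=20, cv=10, seed=0)
held_out = 0
test_idx = [i for i, f in enumerate(folds) if f == held_out]
train_idx = [i for i, f in enumerate(folds) if f != held_out]
```

Without a fixed seed, every preprocessing run would produce a different split, so small differences from the scores reported in the paper are to be expected.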

Running the models (CPU)

Example commands:

THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python conv_net_sentence.py -nonstatic -rand
THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python conv_net_sentence.py -static -word2vec
THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python conv_net_sentence.py -nonstatic -word2vec

This will run the CNN-rand, CNN-static, and CNN-nonstatic models respectively in the paper.
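The flag combinations above can be read as two independent choices: whether the word vectors are updated during training, and whether they start from word2vec or from random values. The sketch below only restates that mapping; describe_run and VARIANTS are hypothetical names and are not part of conv_net_sentence.py.

```python
# Hypothetical helper: maps the two flags from the example commands to the
# model variant they select. Not part of the repository's code.
VARIANTS = {
    ("-nonstatic", "-rand"): "CNN-rand",
    ("-static", "-word2vec"): "CNN-static",
    ("-nonstatic", "-word2vec"): "CNN-nonstatic",
}


def describe_run(mode_flag, init_flag):
    """Describe the model variant selected by a pair of flags."""
    if mode_flag not in ("-static", "-nonstatic"):
        raise ValueError("unknown mode flag: %s" % mode_flag)
    if init_flag not in ("-rand", "-word2vec"):
        raise ValueError("unknown initialization flag: %s" % init_flag)
    return {
        # None if the pair is not one of the three example commands.
        "name": VARIANTS.get((mode_flag, init_flag)),
        # -nonstatic: word vectors are fine-tuned during training.
        "update_embeddings": mode_flag == "-nonstatic",
        # -word2vec: word vectors start from the pre-trained binary file.
        "pretrained_vectors": init_flag == "-word2vec",
    }
```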

Using the GPU

Using the GPU gives roughly a 10x to 20x speed-up, so it is highly recommended. To use the GPU, simply change device=cpu to device=gpu (or whichever GPU you are using). For example:

THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python conv_net_sentence.py -nonstatic -word2vec

Example output

CPU output:

epoch: 1, training time: 219.72 secs, train perf: 81.79 %, val perf: 79.26 %
epoch: 2, training time: 219.55 secs, train perf: 82.64 %, val perf: 76.84 %
epoch: 3, training time: 219.54 secs, train perf: 92.06 %, val perf: 80.95 %

GPU output:

epoch: 1, training time: 16.49 secs, train perf: 81.80 %, val perf: 78.32 %
epoch: 2, training time: 16.12 secs, train perf: 82.53 %, val perf: 76.74 %
epoch: 3, training time: 16.16 secs, train perf: 91.87 %, val perf: 81.37 %
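To check the speed-up against the two logs above, the epoch lines can be parsed and the mean training times compared. This is a small sketch that assumes the log format is exactly as shown; parse_log and mean_epoch_time are hypothetical helper names.

```python
import re

# Matches one epoch line of the format shown in the example output.
LINE = re.compile(
    r"epoch: (\d+), training time: ([\d.]+) secs, "
    r"train perf: ([\d.]+) %, val perf: ([\d.]+) %"
)


def parse_log(text):
    """Return one dict per epoch line found in the log text."""
    rows = []
    for m in LINE.finditer(text):
        rows.append({
            "epoch": int(m.group(1)),
            "secs": float(m.group(2)),
            "train": float(m.group(3)),
            "val": float(m.group(4)),
        })
    return rows


def mean_epoch_time(rows):
    """Average training time per epoch, in seconds."""
    return sum(r["secs"] for r in rows) / len(rows)


cpu_log = """
epoch: 1, training time: 219.72 secs, train perf: 81.79 %, val perf: 79.26 %
epoch: 2, training time: 219.55 secs, train perf: 82.64 %, val perf: 76.84 %
epoch: 3, training time: 219.54 secs, train perf: 92.06 %, val perf: 80.95 %
"""
gpu_log = """
epoch: 1, training time: 16.49 secs, train perf: 81.80 %, val perf: 78.32 %
epoch: 2, training time: 16.12 secs, train perf: 82.53 %, val perf: 76.74 %
epoch: 3, training time: 16.16 secs, train perf: 91.87 %, val perf: 81.37 %
"""

# Ratio of mean CPU epoch time to mean GPU epoch time.
speedup = mean_epoch_time(parse_log(cpu_log)) / mean_epoch_time(parse_log(gpu_log))
```

For these two logs the ratio falls between 13x and 14x, which is within the 10x to 20x range quoted above.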

Other Implementations

TensorFlow

Denny Britz has an implementation of the model in TensorFlow:

https://github.com/dennybritz/cnn-text-classification-tf

He also wrote a nice tutorial on it, as well as a general tutorial on CNNs for NLP.

Torch

The HarvardNLP group has an implementation in Torch:

https://github.com/harvardnlp/sent-conv-torch

Hyperparameters

At the time of my original experiments I did not have access to a GPU so I could not run a lot of different experiments. Hence the paper is missing a lot of things like ablation studies and variance in performance, and some of the conclusions were premature (e.g. regularization does not always seem to help).

Ye Zhang has written a very nice paper doing an extensive analysis of model variants (e.g. filter widths, k-max pooling, word2vec vs GloVe, etc.) and their effect on performance.

Comments

other datasets

When the code is used with a dataset other than the provided one, it crashes if the longest sentence in the new dataset is longer than the longest sentence in the provided dataset. This is easy to fix: the maximum length is set manually in the code.
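One way to avoid a hard-coded maximum length is to compute it from the data before padding. The sketch below is an illustration under stated assumptions, not the repository's code: it assumes sentences are whitespace-tokenized and zero-padded on both sides by filter_h - 1 positions, and max_sentence_length and pad_indices are hypothetical names.

```python
def max_sentence_length(sentences):
    """Length, in tokens, of the longest sentence in the dataset."""
    return max(len(s.split()) for s in sentences)


def pad_indices(sentence, word_idx_map, max_l, filter_h=5):
    """Turn a sentence into a fixed-length list of word indices.

    Index 0 is used as padding. Words missing from word_idx_map are skipped.
    """
    pad = filter_h - 1
    x = [0] * pad
    for word in sentence.split():
        if word in word_idx_map:
            x.append(word_idx_map[word])
    # Right-pad so every sentence has length max_l + 2 * pad.
    x.extend([0] * (max_l + 2 * pad - len(x)))
    return x


sentences = ["a good movie", "bad"]
word_idx_map = {"a": 1, "good": 2, "movie": 3, "bad": 4}
max_l = max_sentence_length(sentences)  # computed from the data, not fixed
padded = [pad_indices(s, word_idx_map, max_l) for s in sentences]
```

Because max_l is derived from the dataset actually being used, a longer sentence in a new dataset simply yields longer padded vectors.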

opened by alex-j-j 2
NotImplementedError: The image and the kernel must have the same type.inputs(float32), kerns(float64)

Hi, I'm facing this error while trying to reproduce the experiments. Does anyone know how to solve this? I have changed nothing and I installed the requirements. Thanks for your help :)

opened by moses9591 0
AttributeError: 'module' object has no attribute 'LeNetConvPoolLayer'

Hello, I'm sorry to bother you. When I import this project into my workspace, two names are underlined in red, LeNetConvPoolLayer and MLPDropout, on lines 88 and 95 of the conv_net_classes.py file. After I added the theano. prefix, the red lines disappeared, but when I run the conv_net_classes.py file it fails with AttributeError: 'module' object has no attribute 'LeNetConvPoolLayer'. How can I fix this?

opened by 1394125422 0
Getting class probability vectors from the intermediate layers

Hi, I am trying to use CNN_sentence to classify tweets into one of K predefined topics. Although I am able to get the final output class for each tweet, I am more interested in the probability vector in the layer right before the output layer, which decides the class the tweet should belong to. E.g., if I have 3 classes and the output class for a given tweet is [2], then I assume the previous layer deals with class-wise probabilities, which could be something like [0.3, 0.78, 0.1] for classes 1, 2, 3 respectively (just an example).

In [test_loss,y_pred] = test_model_all(test_set_x,test_set_y), the variable "y_pred" gives me the final output but not the class probabilities. Can you suggest a way to get these probabilities?
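The relationship between a predicted class and the class probabilities can be sketched as follows, assuming the output layer is a softmax classifier as described in the paper. This is plain Python for illustration, not the Theano code from the repository; softmax and predict are hypothetical names, and the scores are made up.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores):
    """Return (predicted class index, class probability vector)."""
    probs = softmax(scores)
    y_pred = max(range(len(probs)), key=probs.__getitem__)
    return y_pred, probs


# Made-up scores for a 3-class problem (class indices start at 0 here).
y_pred, probs = predict([0.5, 2.0, -1.0])
```

The predicted class is just the index of the largest probability, so exposing the probability vector means returning the softmax output itself rather than only its argmax. Note that softmax probabilities always sum to 1.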

opened by rohitiyer91 0
a pickle file problem

Hi @yoonkim, I am a beginner in natural language processing and machine learning. Since the 'GoogleNews-vectors-negative300.bin' file is quite large, all of my attempts to make the pickle file ('mr.p') have failed. Could you give me some advice on making 'mr.p' with 16GB~32GB of RAM, if you don't mind?

I also wonder whether 'mr.p' needs to be processed in chunks to solve the memory problem. (I know little about pickle files.)

Thank you

opened by ghost 3
How much memory do I need to process the bin file (i.e. GoogleNews-vectors-negative300.bin)

Hi everyone,

I tried to run the "process_data.py" file with the same word2vec binary file (i.e. GoogleNews-vectors-negative300.bin) but it didn't work. The process got killed after approximately 30 minutes.

At first I thought it might be a memory problem, but I also tried it on a server (256GB RAM and 16GB GPU) and unfortunately got the same result (i.e. the program got killed after running for approximately 30 minutes).

What could be the possible reasons?

Any response would be highly appreciated.

opened by usama6832 3
multilabel classification

Hi, what changes have to be made to this code to allow multi-label classification? Is there any resource that I can refer to in order to extend your code to allow multi-label classification?

opened by shaikarshad 1
Confused by vocab in process_data.py, need help

I'm a newbie in sentiment analysis and have recently been trying to apply CNNs to it. Yoon's paper has helped me a lot and I really appreciate that.

I want to understand every piece of code in this repo, but I ran into trouble when reading process_data.py. The variable vocab is a dictionary and it should store the frequency of each word that occurs in the MR data, i.e. {word: word_frequency}. But in the function build_data_cv, Yoon used a set to store the words in each line, which means duplicate words are removed. In that case, how can we count the number of times each word occurs?

    vocab = defaultdict(float)  # dict to store words with their frequencies
    with open(pos_file, "rb") as f:
        for line in f:
            rev = []
            rev.append(line.strip())
            if clean_string:
                orig_rev = clean_str(" ".join(rev))
            else:
                orig_rev = " ".join(rev).lower()
            words = set(orig_rev.split())  # use set to store words, so duplicate words in the current line are removed

Can anybody help me? Thanks a lot!
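The effect of the set can be illustrated with a small sketch. It assumes that the loop following the quoted lines increments vocab[word] once for each word in the set; that part is not shown in the excerpt, so treat it as an assumption. count_words and its per_review parameter are hypothetical names.

```python
from collections import defaultdict


def count_words(lines, per_review=True):
    """Count words over a list of reviews.

    per_review=True  -> each word counts at most once per review
                        (the number of reviews containing the word).
    per_review=False -> every occurrence counts (total term frequency).
    """
    vocab = defaultdict(float)
    for line in lines:
        tokens = line.lower().split()
        words = set(tokens) if per_review else tokens
        for word in words:
            vocab[word] += 1
    return vocab


lines = ["good good movie", "good plot"]
by_review = count_words(lines)                    # "good" -> 2.0
by_token = count_words(lines, per_review=False)   # "good" -> 3.0
```

Under that assumption, vocab holds the number of reviews in which each word appears, not the total number of occurrences.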
opened by Larry955 2
Confused about dropout_cost_p and cost_p

I am not familiar with Theano, but it seems that the train_model function outputs cost_p, while the sgd_updates_adadelta function optimizes over dropout_cost_p. I am confused by this. Could you please explain it to me if you have time?

Thanks in advance.

opened by Chunpai 0

