Convolutional Neural Networks for Sentence Classification

Overview

Code for the paper Convolutional Neural Networks for Sentence Classification (EMNLP 2014).

Runs the model on Pang and Lee's movie review dataset (MR in the paper). Please cite the original paper when using the data.

Requirements

Code is written in Python (2.7) and requires Theano (0.7).

Using the pre-trained word2vec vectors will also require downloading the binary file from https://code.google.com/p/word2vec/
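
One possible setup for the dependency (an assumption; any Theano 0.7-compatible environment is fine, and NumPy is pulled in as a Theano dependency):

pip install theano==0.7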

Data Preprocessing

To process the raw data, run

python process_data.py path

where path points to the word2vec binary file (i.e. the GoogleNews-vectors-negative300.bin file). This will create a pickle object called mr.p in the same folder, which contains the dataset in the right format.

Note: This will create the dataset with different fold-assignments than were used in the paper. You should still be getting a CV score of >81% with the CNN-nonstatic model, though.
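
To sanity-check the result, you can load mr.p and inspect it. A minimal sketch, assuming the pickle layout that process_data.py writes and conv_net_sentence.py unpacks (a list [revs, W, W2, word_idx_map, vocab]):

    import cPickle

    # revs: one dict per review (text, label, CV fold assignment)
    # W: word2vec embedding matrix; W2: random embedding matrix
    # word_idx_map: word -> row index into W; vocab: word -> frequency
    x = cPickle.load(open("mr.p", "rb"))
    revs, W, W2, word_idx_map, vocab = x[0], x[1], x[2], x[3], x[4]
    print "number of reviews: %d, vocab size: %d" % (len(revs), len(vocab))
    print "embedding matrix shape:", W.shape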

Running the models (CPU)

Example commands:

THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python conv_net_sentence.py -nonstatic -rand
THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python conv_net_sentence.py -static -word2vec
THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python conv_net_sentence.py -nonstatic -word2vec

These commands run the CNN-rand, CNN-static, and CNN-nonstatic models from the paper, respectively.

Using the GPU

Using a GPU will give a good 10x to 20x speed-up, so it is highly recommended. To use the GPU, simply change device=cpu to device=gpu (or whichever GPU you are using). For example:

THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python conv_net_sentence.py -nonstatic -word2vec
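
If you would rather not pass THEANO_FLAGS on every invocation, the same settings can go in a ~/.theanorc file (standard Theano configuration, equivalent to the flags above):

    [global]
    mode = FAST_RUN
    device = gpu
    floatX = float32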

Example output

CPU output:

epoch: 1, training time: 219.72 secs, train perf: 81.79 %, val perf: 79.26 %
epoch: 2, training time: 219.55 secs, train perf: 82.64 %, val perf: 76.84 %
epoch: 3, training time: 219.54 secs, train perf: 92.06 %, val perf: 80.95 %

GPU output:

epoch: 1, training time: 16.49 secs, train perf: 81.80 %, val perf: 78.32 %
epoch: 2, training time: 16.12 secs, train perf: 82.53 %, val perf: 76.74 %
epoch: 3, training time: 16.16 secs, train perf: 91.87 %, val perf: 81.37 %

Other Implementations

TensorFlow

Denny Britz has an implementation of the model in TensorFlow:

https://github.com/dennybritz/cnn-text-classification-tf

He also wrote a nice tutorial on it, as well as a general tutorial on CNNs for NLP.

Torch

The HarvardNLP group has an implementation in Torch:

https://github.com/harvardnlp/sent-conv-torch

Hyperparameters

At the time of my original experiments I did not have access to a GPU, so I could not run many experiments. Hence the paper is missing a lot of things like ablation studies and variance in performance, and some of the conclusions were premature (e.g. regularization does not always seem to help).

Ye Zhang has written a very nice paper with an extensive analysis of model variants (e.g. filter widths, k-max pooling, word2vec vs. GloVe, etc.) and their effect on performance.

Comments
  • other datasets

    When the code is used with a dataset other than the provided one, it crashes if the longest sentence in the new dataset is longer than the longest sentence in the provided dataset. This can be fixed easily: the maximum sentence length is set manually in the code (see the sketch after this comments list).

    opened by alex-j-j 2
  • NotImplementedError: The image and the kernel must have the same type.inputs(float32), kerns(float64)

    Hi, I'm facing this error while trying to reproduce the experiments. Does anyone know how to solve this? I have changed nothing and I have installed all the requirements. Thanks for your help :)

    opened by moses9591 0
  • AttributeError: 'module' object has no attribute 'LeNetConvPoolLayer'

    Hello, I'm sorry to write this. When I import this project into my workspace, two places are underlined in red, LeNetConvPoolLayer and MLPDropout, at lines 88 and 95 of the conv_net_classes.py file. When I add the theano. prefix, the underlines disappear, but when I run the conv_net_classes.py file it fails with AttributeError: 'module' object has no attribute 'LeNetConvPoolLayer'. How can I fix this?

    opened by 1394125422 0
  • Getting class probability vectors from the intermediate layers

    Hi, I am trying to use CNN_sentence to classify tweets into one of K predefined topics. Although I am able to get the final output class for each tweet, I am more interested in the probability vector in the layer right before the output layer, based on which it is decided which class the tweet belongs to. E.g., if I have 3 classes and the output class for a given tweet is [2], then I assume the previous layer holds a class-wise probability vector, which could be something like [0.3, 0.78, 0.1] for classes 1, 2, 3 respectively (just an example).

    In [test_loss, y_pred] = test_model_all(test_set_x, test_set_y), the variable "y_pred" gives me the final output but not the class probabilities. Can you suggest a way to get these probabilities? (See the sketch after this comments list.)

    opened by rohitiyer91 0
  • a pickle file problem

    Hi, @yoonkim. I am a beginner in natural language processing and machine learning. Since the 'GoogleNews-vectors-negative300.bin' file is quite large, all of my attempts at making the pickle file ('mr.p') have failed. Could you give me some advice on making 'mr.p' with 16GB~32GB of RAM, if you don't mind?

    Also, I wonder whether making 'mr.p' needs to be done in chunks to solve the memory problem. (I know little about pickle files; a workaround sketch follows this comments list.)

    Thank you

    opened by ghost 3
  • How much memory do I need to process bin file (i.e. GoogleNews-vectors-negative300.bin)

    Hi everyone,

    I tried to run the process_data.py file with the same word2vec binary file (i.e. GoogleNews-vectors-negative300.bin), but it didn't work. The process got killed after approximately 30 minutes.

    At first I thought it might be a memory problem, but I tried on a server (256GB RAM and a 16GB GPU) too and unfortunately got the same result (the program was killed after running approximately 30 minutes).

    What could be the possible reasons? (A memory-capped loading sketch follows this comments list.)

    Your response will be highly appreciated.

    opened by usama6832 3
  • multi-label classification

    Hi, what changes have to be made to this code to allow multi-label classification? Is there any resource I can refer to in order to extend your code for multi-label classification?

    opened by shaikarshad 1
  • Confused with vocab in process_data.py, need heeeeeeelp

    I'm a newbie in sentiment analysis, and recently I've been trying to apply CNNs to sentiment analysis. Yoon's paper helps me a lot and I really appreciate that.

    I want to understand every piece of code in this repo, but I ran into trouble reading process_data.py. The variable vocab is a dictionary that should store the frequency of each word occurring in the MR data, i.e. {word: word_frequency}. But in the function build_data_cv, Yoon uses a set to store the words in each line, which means duplicate words are removed; in that case, how can we count the number of times each word occurs?

        from collections import defaultdict

        vocab = defaultdict(float)  # dict mapping each word to its frequency
        with open(pos_file, "rb") as f:
            for line in f:
                rev = []
                rev.append(line.strip())
                if clean_string:
                    orig_rev = clean_str(" ".join(rev))
                else:
                    orig_rev = " ".join(rev).lower()
                # a set is used here, so duplicate words within the
                # current line are counted only once
                words = set(orig_rev.split())

    CAN ANYBODY HELP ME? THANKS A LOT!!!

    opened by Larry955 2
  • confused on the dropout_cost_p and cost_p ??

    I am not familiar with Theano, but it seems that the train_model function outputs cost_p, while the sgd_updates_adadelta function optimizes over dropout_cost_p. I am confused by this. Could you please explain it to me if you have time?

    Thanks in advance.

    opened by Chunpai 0
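
A note on the "other datasets" issue above: the maximum sentence length is hard-coded when the index data is built, so longer sentences in a new dataset cause a crash. A minimal sketch of the fix, deriving the maximum from the loaded data instead (this assumes the revs dicts carry the num_words field created by process_data.py; the call mirrors the make_idx_data_cv call in conv_net_sentence.py):

    # derive the pad length from the data instead of hard-coding it
    max_l = max(rev["num_words"] for rev in revs)
    datasets = make_idx_data_cv(revs, word_idx_map, i, max_l=max_l,
                                k=300, filter_h=5)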
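
On the class-probability question above: y_pred is just the argmax of a softmax distribution, so the probabilities are computed one step earlier; in this code that would mean compiling the test function on the output layer's softmax expression (p_y_given_x in the Theano-tutorial-style classes) rather than on the predicted label. A self-contained numpy sketch of the relationship (not tied to this repo's classes):

    import numpy as np

    def softmax(z):
        # subtract the row max for numerical stability
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    # scores: (n_examples, n_classes) pre-softmax activations of the final layer
    scores = np.array([[1.0, 2.5, 0.3]])
    probs = softmax(scores)        # approx. [[0.17, 0.75, 0.08]]
    y_pred = probs.argmax(axis=1)  # the label test_model_all reports, here [1]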
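
On the two memory questions above: process_data.py parses the whole ~3.4 GB GoogleNews binary (about 3M vectors) in pure Python, which is slow and memory-hungry. One workaround sketch uses gensim (not a dependency of this repo) to cap how many vectors are read; the MR vocabulary is small, so the most frequent words cover most of it. The pickling step would still need to be adapted to consume these vectors:

    from gensim.models import KeyedVectors

    # read only the first 500k of ~3M vectors; cuts memory use drastically
    w2v = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True, limit=500000)
    print(w2v["good"].shape)  # (300,)
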
Owner

Yoon Kim