NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences.

NeuralTalk

Warning: Deprecated. Hi there, this code is now quite old and inefficient, and deprecated. I am leaving it on GitHub for educational purposes, but if you would like to run or train image captioning I warmly recommend my new code release, NeuralTalk2. NeuralTalk2 is written in Torch and is SIGNIFICANTLY (I mean ~100x+) faster because it is batched and runs on the GPU. It also supports CNN finetuning, which helps a lot with performance.

This project contains Python+numpy source code for learning Multimodal Recurrent Neural Networks that describe images with sentences.

This line of work was recently featured in a New York Times article and has been the subject of multiple academic papers from the research community over the last few months. This code currently implements the models proposed by Vinyals et al. from Google (CNN + LSTM) and by Karpathy and Fei-Fei from Stanford (CNN + RNN). Both models take an image and predict its sentence description with a Recurrent Neural Network (either an LSTM or an RNN).

Overview

The pipeline for the project looks as follows:

  • The input is a dataset of images, each paired with 5 sentence descriptions collected with Amazon Mechanical Turk. In particular, this code base is set up for the Flickr8K, Flickr30K, and MSCOCO datasets.
  • In the training stage, the images are fed as input to the RNN, and the RNN is asked to predict the words of the sentence, conditioned on the current word and the previous context as mediated by the hidden layers of the neural network. In this stage, the parameters of the network are trained with backpropagation.
  • In the prediction stage, a withheld set of images is passed to the RNN, and the RNN generates the sentence one word at a time. The results are evaluated with the BLEU score. The code also includes utilities for visualizing the results in HTML.

Dependencies

Python 2.7, a modern version of numpy/scipy, perl (if you want to do BLEU score evaluation), and the argparse module. Most of these are okay to install with pip. To install all dependencies at once, run the command pip install -r requirements.txt

I only tested this code with Ubuntu 12.04, but I tried to make it as generic as possible (e.g. use of the os module for file system interactions), so it might work on Windows and Mac relatively easily.

Protip: you really want to link your numpy to a BLAS implementation for its matrix operations. I use virtualenv and link numpy against a system installation of OpenBLAS. Doing this will make this code almost an order of magnitude faster because it relies very heavily on large matrix multiplies.
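
A quick sanity check of what your numpy build is linked against (np.show_config is part of numpy itself):

    # Prints the BLAS/LAPACK libraries numpy was built against; look
    # for an optimized implementation such as OpenBLAS in the output.
    import numpy as np
    np.show_config()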

Getting started

  1. Get the code. $ git clone the repo and install the Python dependencies
  2. Get the data. I don't distribute the data in the Git repo; instead, download the data/ folder from here. Also, this download does not include the raw image files, so if you want to visualize the annotations on raw images, you have to obtain the images from Flickr8K / Flickr30K / COCO directly and dump them into the appropriate data folder.
  3. Train the model. Run the training $ python driver.py (see many additional argument settings inside the file) and wait. You'll see that the learning code writes checkpoints into cv/ and periodically reports its status in the status/ folder.
  4. Monitor the training. The status can be inspected manually by reading the JSON and printing whatever you wish in a second process (see the sketch after this list). In practice I run cross-validations on a cluster, so my cv/ folder fills up with a lot of checkpoints that I further filter and inspect with other scripts. I am also including my cluster training status visualization utility, in case you find it useful. Run a local webserver (e.g. $ python -m SimpleHTTPServer 8123) and then open monitorcv.html in your browser at http://localhost:8123/monitorcv.html, or whatever path the web server tells you. You will have to edit the file to set up the paths properly and point it at the right JSON files.
  5. Evaluate model checkpoints. To evaluate a checkpoint from cv/, run the eval_sentence_predictions.py script and pass it the path to a checkpoint.
  6. Visualize the predictions. Use the included html file visualize_result_struct.html to visualize the JSON struct produced by the evaluation code. This will visualize the images and their predictions. Note that you'll have to download the raw images from the individual dataset pages and place them into the corresponding data/ folder.
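
As referenced in step 4, here is a minimal sketch of inspecting the training status from a second process. It assumes the files in status/ are JSON, as described above; the exact fields depend on the training code, so inspect the dump before relying on specific keys:

    import glob
    import json
    import os

    # Find the most recently modified status file and pretty-print it.
    status_files = sorted(glob.glob('status/*.json'), key=os.path.getmtime)
    if status_files:
        with open(status_files[-1]) as f:
            status = json.load(f)
        print(json.dumps(status, indent=2))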

Lastly, note that this is currently research code, so a lot of the documentation is inside individual Python files. If you wish to work with this code, you'll have to get familiar with it and be comfortable reading Python code.

Pretrained model

Some pretrained models can be found in the NeuralTalk Model Zoo. The slightly hairy part is that if you wish to apply these models to some arbitrary new image (one not from Flickr8k/30k/COCO), you have to first extract the CNN features. I use the 16-layer VGG network from Simonyan and Zisserman, because the model is beautiful, powerful, and available with Caffe. There is an opportunity to put the preprocessing and inference into a single nice function that uses the Python wrapper to get the features and then runs the pretrained sentence model. I might add this in the future.

Using the model to predict on new images

The code allows you to easily predict and visualize the results of running the model on COCO/Flickr8K/Flickr30K images. If you want to run the code on an arbitrary image (e.g. on your file system), things get a little more complicated because we first need to pipe your image through the VGG CNN to get the 4096-D activations on top.

Have a look inside the folder example_images for instructions on how to do this. Currently, the code for extracting the raw features from each image is in Matlab, so you will need it installed on your system. Caffe also has a Python wrapper, but I wasn't yet able to use it to exactly reproduce the features I get from Matlab. The example_images folder will walk you through the process, and you will eventually use predict_on_images.py to run the prediction.
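
For the adventurous, here is a rough sketch of what the Python-wrapper route could look like, with the caveat above that it may not exactly reproduce the Matlab features. The model file paths, the 'fc7' blob name, the mean values, and the image path are assumptions:

    import numpy as np
    import scipy.io
    import caffe

    # Sketch of extracting 4096-D VGG features with the Caffe Python
    # wrapper; resizing/preprocessing details differ from the Matlab code.
    caffe.set_mode_cpu()
    net = caffe.Net('VGG_ILSVRC_16_layers_deploy.prototxt',
                    'VGG_ILSVRC_16_layers.caffemodel', caffe.TEST)
    net.blobs['data'].reshape(1, 3, 224, 224)

    transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
    transformer.set_transpose('data', (2, 0, 1))     # HWC -> CHW
    transformer.set_channel_swap('data', (2, 1, 0))  # RGB -> BGR
    transformer.set_raw_scale('data', 255.0)         # [0,1] -> [0,255]
    transformer.set_mean('data', np.array([103.939, 116.779, 123.68]))

    img = caffe.io.load_image('example_images/some_image.jpg')
    net.blobs['data'].data[...] = transformer.preprocess('data', img)
    net.forward()
    feats = net.blobs['fc7'].data[0].copy()          # 4096-D activations

    # The 'feats' variable name and its orientation in vgg_feats.mat
    # should be checked against the loading code before relying on this.
    scipy.io.savemat('vgg_feats.mat', {'feats': feats.reshape(4096, 1)})

With the features saved, prediction then runs through predict_on_images.py, e.g. (paths are placeholders):

    $ python predict_on_images.py <path_to_checkpoint.p> -r <root_folder>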

Using your own data

The input to the system is the data folder, which contains the Flickr8K, Flickr30K, and MSCOCO datasets. In particular, each folder (e.g. data/flickr8k) contains a dataset.json file that stores the image paths and sentences in the dataset (all images, sentences, raw preprocessed tokens, splits, and the mappings between images and sentences). Each folder additionally contains vgg_feats.mat, a .mat file that stores the CNN features of all images, one per row, using the VGG Net from ILSVRC 2014. Finally, there is the imgs/ folder that holds the raw images. I also provide the Matlab script that I used to extract the features, which you may find helpful if you wish to use a different dataset. It is inside the matlab_features_reference/ folder; see the Readme file in that folder for more information.
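
For orientation, a short sketch of poking at these files from Python; the key names ('images', 'sentences', 'tokens', 'feats') follow the description above but should be verified against the actual files:

    import json
    import scipy.io

    # Inspect the annotations. Key names are assumptions based on the
    # description above; verify against your own dataset.json.
    with open('data/flickr8k/dataset.json') as f:
        dataset = json.load(f)
    img = dataset['images'][0]
    print(img['filename'] + ' (' + img['split'] + ')')
    for sent in img['sentences']:
        print(' '.join(sent['tokens']))

    # Load the precomputed CNN features; 'feats' is the assumed variable
    # name inside vgg_feats.mat.
    feats = scipy.io.loadmat('data/flickr8k/vgg_feats.mat')['feats']
    print(feats.shape)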

License

BSD license.

Comments
  • CAFFE API error

    When I tried to run the Python script python_features/extract_features.py today, I ran into the following problem:

    Traceback (most recent call last):
      File "./extract_features.py", line 102, in <module>
        net = caffe.Net(args.model_def, args.model)
    Boost.Python.ArgumentError: Python argument types in
        Net.__init__(Net, str, str)
    did not match C++ signature:
        __init__(boost::python::api::object, std::string, std::string, int)
        __init__(boost::python::api::object, std::string, int)
    

    Then I searched for this error on the Internet, and found the same issue on Caffe's issue page: Caffe#1905. I think it's an error caused by an update to Caffe's API, so I changed the code at extract_features.py#101 to: net = caffe.Net(args.model_def, args.model, caffe.TEST). That worked, but a new problem came up:

    Traceback (most recent call last):
      File "./extract_features.py", line 102, in <module>
        caffe.set_phase_test()
    AttributeError: 'module' object has no attribute 'set_phase_test'
    

    I think the reason is that some of the API calls in python_features/extract_features.py are outdated.
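
    For reference, the constructor-based initialization that works with the newer API (replacing the removed caffe.set_phase_test() call, as described above) is:

        # Newer Caffe Python API: the phase is passed to the constructor
        # instead of calling caffe.set_phase_test(), which no longer exists.
        net = caffe.Net(args.model_def, args.model, caffe.TEST)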

    opened by Beanocean 7
  • Question about usage of RCNN

    Hello, I recently read your paper, and very much appreciate you sharing your code here.

    By the way, your paper indicates that you first extract the top regions obtained by RCNN and then get the CNN features; however, I do not see that object detection part in your implementation. In both the training and test phases, it seems no object detection functionality is used. Is it because it still works fine using the holistic image?

    Thank you.

    opened by jazzsaxmafia 5
  • Caffe python wrapper

    Hi Andrej,

    I wrote a function to generate image features using the Python Caffe wrapper. The only source of discrepancy I came across between your Matlab code and the Python code has to do with image resizing. Matlab's imresize does cubic interpolation by default and by default applies antialiasing correction, while caffe.io.resize now does linear interpolation. When I dumped the Matlab-preprocessed images to disk and loaded them into Python, I got the exact same Caffe predictions.

    Currently the code contains hardcoded parameters (image mean, cropping dimensions) that you used in the Matlab code.
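
    For illustration, scipy's bicubic option is one way to get closer to Matlab's default interpolation (just a sketch; Matlab's antialiasing correction is still not replicated, and the file name is a placeholder):

        # Bicubic resize via scipy (requires PIL); closer to Matlab's
        # default cubic interpolation, but without antialiasing.
        import scipy.misc
        img = scipy.misc.imread('example.jpg')
        resized = scipy.misc.imresize(img, (224, 224), interp='bicubic')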

    Thanks for sharing your code.

    Ahmed

    opened by ahmedosman 5
  • Have you implemented Visual-Semantic Alignments ?

    Thanks for your kindness in releasing this code! It helps me a lot! I am interested in your CVPR paper: Deep Visual-Semantic Alignments for Generating Image Descriptions. But I did not find anything about Visual-Semantic Alignments in this released code; have I missed something? Thanks!

    opened by qmiwang 3
  • question about dropout implementation

    Hi Andrej,

    I have been learning a ton about RNNs and their implementation from looking through your code. I have a (perhaps silly) question about your dropout implementation. You claim that your code creates a mask that drops a fraction, drop_prob, of the units and then scales the remaining units by 1/(1-drop_prob). This doesn't seem correct to me, since you are sampling using np.random.randn, which samples from a normal distribution with mean 0 and variance 1.

    For example, if you set drop_prob=1 (and ignore the fact that this makes your scale factor infinite) then you should be dropping all the units, but in reality you will be testing the boolean condition np.random.randn(some_shape)<(1-drop_prob). Since np.random.randn gives you negative values half the time (on average), you will only drop half the units (on average).

    It seems like you want to be sampling from a uniform distribution on [0, 1) in order for this to work properly.
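
    For example, something like the following (just a sketch) would drop the intended fraction:

        import numpy as np

        def dropout_forward(x, drop_prob):
            # Uniform samples in [0, 1): each unit is kept with probability
            # 1 - drop_prob, then scaled so expected activations are unchanged.
            mask = (np.random.rand(*x.shape) < (1.0 - drop_prob)) / (1.0 - drop_prob)
            return x * mask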

    Best, Sam

    opened by sballas8 3
  • list index out of range error

    I created coco_sample directory containing the following files.

    • COCO_val2014_000000463825.jpg
    • model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p (from here)
    • tasks.txt (containing one line COCO_val2014_000000463825.jpg)
    • vgg_feats.mat (from here)

    I ran the following command.

    python predict_on_images.py coco_sample/model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p -r coco_sample
    

    I got an error message as below.

    parsed parameters:
    {
      "beam_size": 1,
      "checkpoint_path": "coco_sample/model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p",
      "root_path": "coco_sample"
    }
    loading checkpoint coco_sample/model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p
    image 0/123287:
    /home/ec2-user/neuraltalk/imagernn/lstm_generator.py:227: RuntimeWarning: overflow encountered in exp
      IFOGf[t,:3*d] = 1.0/(1.0+np.exp(-IFOG[t,:3*d]))
    PRED: (-14.587771) a man and a woman sitting on a bench in the middle of a park
    image 1/123287:
    Traceback (most recent call last):
      File "predict_on_images.py", line 109, in <module>
        main(params)
      File "predict_on_images.py", line 66, in main
        img['local_file_path'] = img_names[n]
    IndexError: list index out of range

    Isn't it possible to run predict_on_images.py on a few images?

    opened by pecorarista 3
  • init_model_from argument has no effect on where driver starts

    It seems like, even when a checkpoint file is passed to the --init_model_from argument, training starts from epoch 0.00 and acts like the initial model was never even passed in.

    opened by EricZeiberg 2
  • MRFs for text segment alignments

    Hi Andrej, thank you very much for open sourcing the code! Your paper talks about MRFs for decoding text segment alignments to images, but I couldn't find any code related to that. Am I missing something?

    Thanks Pradeep.

    opened by pradeepkaruturi 2
  • Python Caffe Features using Matlab like imresize

    Hi Andrej,

    I am done with the Matlab-like imresize implementation (imresize function below). The output from prepare_image_batch matches the output from my Python preprocess_image to the 4th decimal place (because Python can store decimals at a larger precision than Matlab). Attached is a histogram of the error between Matlab's prepare_image_batch output image and Python's preprocess_image output (attachment: matlab_python_imresize).

    Moreover, I compared the final predictions from the new Python script py_caffe_feat_extract.py, using both the new Python imresize and Caffe's resize, against Matlab's predictions. Attached is a side-by-side histogram of the error: the maximum discrepancy with the new Python imresize is 0.3, compared to +/-1.5 for the Caffe image resize. Again, I think that if I had limited Python precision to 4 decimal places from the start, the residual error of 0.3 would go down to 0 (attachment: prediction_discrepency).

    All these results are based on the Caltech 101 dataset.

    If you are OK with that script, I'll submit another pull request making the changes we agreed on earlier to the py_predict_images.py script.

    opened by ahmedosman 2
  • extract_features.py fix for caffe update

    Fixed extract_features.py based on this issue: https://github.com/karpathy/neuraltalk/issues/31

    This modification allowed me to run extract_features.py successfully with the latest Caffe.

    opened by alyxb 1
  • extract_features python script updated. Some input flags (caffe and out) are now optional

    extract_features python script updated. Some input flags (caffe and out) are now optional. Added bicubic interpolation to imresize. The vgg_feats.mat is now generated at the end of the script.

    opened by SimoV8 1
  • docs: fix simple typo, witheld -> withheld

    There is a small typo in Readme.md.

    Should read withheld rather than witheld.

    Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

    opened by timgates42 0
  • predict_on_images.py error

    usage: predict_on_images.py [-h] [-r ROOT_PATH] [-b BEAM_SIZE] checkpoint_path
    predict_on_images.py: error: the following arguments are required: checkpoint_path
    An exception has occurred, use %tb to see the full traceback.
    

    This error happened. What should I do?
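
    Judging from the usage string, checkpoint_path is a required positional argument, so the script needs to be invoked with a checkpoint (paths below are placeholders):

        $ python predict_on_images.py <path_to_checkpoint.p> -r <root_folder>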

    opened by brightbsit 0
  • eval_sentence_predictions.py: error: too few arguments

    ~/tf/neuraltalk-master$ python eval_sentence_predictions.py
    usage: eval_sentence_predictions.py [-h] [-b BEAM_SIZE]
                                        [--result_struct_filename RESULT_STRUCT_FILENAME]
                                        [-m MAX_IMAGES] [-d DUMP_FOLDER]
                                        checkpoint_path
    eval_sentence_predictions.py: error: too few arguments

    When I run this script I get this error. Is the checkpoint path wrong, or is it something else? Thank you.
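
    The usage string indicates that checkpoint_path is a required positional argument, so an invocation would look like this (the path is a placeholder):

        $ python eval_sentence_predictions.py cv/<checkpoint.p>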

    opened by NEUdeep 3
  • Fix broken headings in Markdown files

    GitHub changed the way Markdown headings are parsed, so this change fixes it.

    See bryant1410/readmesfix for more information.

    Tackles bryant1410/readmesfix#1

    opened by bryant1410 0
  • multi-bleu.perl

    Hi,

    Is this the same script as Moses's multi-bleu.perl? I've seen that there are some modifications relative to the original version. I've been investigating why my baseline model's (Google NIC with VGG-E) BLEU-2/3/4 performance is really low, and what I've found is that we are not using the same evaluation scripts. I know that this task is different from the machine translation task, though. So, my questions are:

    • What's the intention behind the BLEU evaluation script modifications?
    • Do all captioning people evaluate their models with this approach?

    Thanks in advance.

    opened by ilkerkesen 0