NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences.

Andrej

Last update: Jan 7, 2023


#NeuralTalk

Warning: Deprecated. Hi there, this code is now quite old and inefficient, and now deprecated. I am leaving it on Github for educational purposes, but if you would like to run or train image captioning I warmly recommend my new code release NeuralTalk2. NeuralTalk2 is written in Torch and is SIGNIFICANTLY (I mean, ~100x+) faster because it is batched and runs on the GPU. It also supports CNN finetuning, which helps a lot with performance.

This project contains Python+numpy source code for learning Multimodal Recurrent Neural Networks that describe images with sentences.

This line of work was recently featured in a New York Times article and has been the subject of multiple academic papers from the research community over the last few months. This code currently implements the models proposed by Vinyals et al. from Google (CNN + LSTM) and by Karpathy and Fei-Fei from Stanford (CNN + RNN). Both models take an image and predict its sentence description with a Recurrent Neural Network (either an LSTM or an RNN).

Overview

The pipeline for the project looks as follows:

The input is a dataset of images, each with 5 sentence descriptions collected with Amazon Mechanical Turk. In particular, this code base is set up for the Flickr8K, Flickr30K, and MSCOCO datasets.
In the training stage, the images are fed as input to the RNN, and the RNN is asked to predict the words of the sentence, conditioned on the current word and previous context as mediated by the hidden layers of the neural network. In this stage, the parameters of the networks are trained with backpropagation.
In the prediction stage, a withheld set of images is passed to the RNN and the RNN generates the sentence one word at a time. The results are evaluated with BLEU score. The code also includes utilities for visualizing the results in HTML.
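To make the prediction stage concrete, here is a minimal sketch of word-by-word greedy generation with a vanilla RNN. It is not the code from this repository: the parameter names (Wimg, Wxh, Whh, Why), the choice to initialize the hidden state from the image feature, and the use of index 0 as both start and end token are assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

def generate_greedy(image_feat, params, start_idx=0, end_idx=0, max_len=20):
    """Greedy decoding sketch: emit the highest-scoring word at every step."""
    Wimg, Wxh = params["Wimg"], params["Wxh"]
    Whh, Why = params["Whh"], params["Why"]
    # Assumption of this sketch: the image only sets the initial hidden state.
    h = np.tanh(Wimg.dot(image_feat))
    word = start_idx
    sentence = []
    for _ in range(max_len):
        x = np.zeros(Wxh.shape[1])
        x[word] = 1.0                          # one-hot encoding of the current word
        h = np.tanh(Wxh.dot(x) + Whh.dot(h))   # recurrent update
        word = int(np.argmax(Why.dot(h)))      # most likely next word
        if word == end_idx:                    # stop at the end token
            break
        sentence.append(word)
    return sentence

# Toy usage with random (untrained) weights; all sizes are hypothetical.
rng = np.random.RandomState(0)
vocab_size, hidden_size, feat_size = 6, 4, 8
toy_params = {
    "Wimg": rng.randn(hidden_size, feat_size) * 0.1,
    "Wxh": rng.randn(hidden_size, vocab_size) * 0.1,
    "Whh": rng.randn(hidden_size, hidden_size) * 0.1,
    "Why": rng.randn(vocab_size, hidden_size) * 0.1,
}
word_ids = generate_greedy(rng.randn(feat_size), toy_params, max_len=5)
print(word_ids)
```

The returned list holds word indices; mapping them back to strings would need the vocabulary built during training.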

Dependencies

Python 2.7, modern version of numpy/scipy, perl (if you want to do BLEU score evaluation), argparse module. Most of these are okay to install with pip. To install all dependencies at once, run the command pip install -r requirements.txt

I only tested this code with Ubuntu 12.04, but I tried to make it as generic as possible (e.g. use of the os module for file system interactions), so it might work on Windows and Mac relatively easily.

Protip: you really want to link your numpy to use a BLAS implementation for its matrix operations. I use virtualenv and link numpy against a system installation of OpenBLAS. Doing this will make this code almost an order of magnitude faster because it relies very heavily on large matrix multiplies.
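One rough way to check whether numpy is linked against an optimized BLAS is to print its build configuration and time a large matrix multiply. The helper name time_matmul and the matrix size below are arbitrary choices for this sketch, and what counts as "fast" depends on your machine.

```python
import time
import numpy as np

def time_matmul(n=500, repeats=3):
    """Return the best wall-clock time (seconds) of an n x n matrix multiply."""
    rng = np.random.RandomState(0)
    a = rng.rand(n, n)
    b = rng.rand(n, n)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.time()
        a.dot(b)
        best = min(best, time.time() - t0)
    return best

# Prints the BLAS/LAPACK libraries numpy was built against.
np.show_config()
print("best 500x500 matmul time: %.4f s" % time_matmul())
```

Running this before and after relinking numpy gives a simple before/after comparison.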

Getting started

Get the code. $ git clone the repo and install the Python dependencies
Get the data. I don't distribute the data in the Git repo; instead, download the data/ folder from here. Also, this download does not include the raw image files, so if you want to visualize the annotations on raw images, you have to obtain the images from Flickr8K / Flickr30K / COCO directly and dump them into the appropriate data folder.
Train the model. Run the training $ python driver.py (see many additional argument settings inside the file) and wait. You'll see that the learning code writes checkpoints into cv/ and periodically reports its status in status/ folder.
Monitor the training. The status can be inspected manually by reading the JSON and printing whatever you wish in a second process. In practice I run cross-validations on a cluster, so my cv/ folder fills up with a lot of checkpoints that I further filter and inspect with other scripts. I am including my cluster training status visualization utility as well if you like. Run a local webserver (e.g. $ python -m SimpleHTTPServer 8123) and then open monitorcv.html in your browser on http://localhost:8123/monitorcv.html, or whatever the web server tells you the path is. You will have to edit the file to set up the paths properly and point it at the right JSON files.
Evaluate model checkpoints. To evaluate a checkpoint from cv/, run the eval_sentence_predictions.py script and pass it the path to a checkpoint.
Visualize the predictions. Use the included html file visualize_result_struct.html to visualize the JSON struct produced by the evaluation code. This will visualize the images and their predictions. Note that you'll have to download the raw images from the individual dataset pages and place them into the corresponding data/ folder.
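For the monitoring step above, reading the status JSON by hand could look roughly like the following sketch. The helper name load_status_files is made up for illustration, and the sketch assumes only that the status/ folder holds .json files; the fields inside each report are not documented here, so it prints whatever top-level keys it finds.

```python
import glob
import json
import os

def load_status_files(status_dir="status"):
    """Load every *.json file in status_dir into a dict keyed by file name."""
    reports = {}
    for path in sorted(glob.glob(os.path.join(status_dir, "*.json"))):
        with open(path) as f:
            reports[os.path.basename(path)] = json.load(f)
    return reports

for name, report in load_status_files().items():
    # The structure of each report is an unknown here, so only list its keys.
    keys = sorted(report.keys()) if isinstance(report, dict) else type(report).__name__
    print(name, keys)
```

If the folder does not exist yet, the loop simply prints nothing.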

Lastly, note that this is currently research code, so a lot of the documentation is inside individual Python files. If you wish to work with this code, you'll have to get familiar with it and be comfortable reading Python code.

Pretrained model

Some pretrained models can be found in the NeuralTalk Model Zoo. The slightly hairy part is that if you wish to apply these models to some arbitrary new image (one not from Flickr8k/30k/COCO) you have to first extract the CNN features. I use the 16-layer VGG network from Simonyan and Zisserman, because the model is beautiful, powerful and available with Caffe. There is opportunity for putting the preprocessing and inference into a single nice function that uses the Python wrapper to get the features and then runs the pretrained sentence model. I might add this in the future.

Using the model to predict on new images

The code allows you to easily predict and visualize results of running the model on COCO/Flickr8K/Flickr30K images. If you want to run the code on an arbitrary image (e.g. one on your file system), things get a little more complicated because we first need to pipe your image through the VGG CNN to get the 4096-D activations on top.

Have a look inside the folder example_images for instructions on how to do this. Currently, the code for extracting the raw features from each image is in Matlab, so you will need it installed on your system. Caffe also has a wrapper for Python, but I wasn't yet able to use the Python wrapper to exactly reproduce the features I get from Matlab. The example_images folder will walk you through the process, and you will eventually use predict_on_images.py to run the prediction.

Using your own data

The input to the system is the data folder, which contains the Flickr8K, Flickr30K and MSCOCO datasets. In particular, each folder (e.g. data/flickr8k) contains a dataset.json file that stores the image paths and sentences in the dataset (all images, sentences, raw preprocessed tokens, splits, and the mappings between images and sentences). Each folder additionally contains vgg_feats.mat, which is a .mat file that stores the CNN features from all images, one per row, using the VGG Net from ILSVRC 2014. Finally, there is the imgs/ folder that holds the raw images. I also provide the Matlab script that I used to extract the features, which you may find helpful if you wish to use a different dataset. This is inside the matlab_features_reference/ folder; see the Readme file in that folder for more information.
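As a rough illustration of working with dataset.json, the sketch below counts images per split and the total number of sentences. The key names used here ("images", "split", "sentences", "tokens", "filename") are assumptions about the file layout rather than something stated above, so check them against an actual dataset.json before relying on them.

```python
import json
from collections import Counter

def load_dataset(path):
    """Parse a dataset.json-style file from disk."""
    with open(path) as f:
        return json.load(f)

def summarize_dataset(dataset):
    """Return (images per split, total number of sentences)."""
    # Assumed layout: {"images": [{"split": ..., "sentences": [...]}, ...]}
    split_counts = Counter(img["split"] for img in dataset["images"])
    num_sentences = sum(len(img["sentences"]) for img in dataset["images"])
    return dict(split_counts), num_sentences

# Toy in-memory example using the assumed layout.
toy = {
    "images": [
        {"filename": "a.jpg", "split": "train",
         "sentences": [{"tokens": ["a", "dog", "runs"]}] * 5},
        {"filename": "b.jpg", "split": "val",
         "sentences": [{"tokens": ["a", "cat"]}]},
    ]
}
print(summarize_dataset(toy))
```

The features file could be read in a similar spirit with scipy.io.loadmat, but the variable name stored inside vgg_feats.mat is not given here and would need to be looked up.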

License

BSD license.

Comments

CAFFE API error

When I tried to run the Python script python_features/extract_features.py today, I ran into the following problem:

Traceback (most recent call last):
  File "./extract_features.py", line 102, in <module>
    net = caffe.Net(args.model_def, args.model)
Boost.Python.ArgumentError: Python argument types in
    Net.__init__(Net, str, str)
did not match C++ signature:
    __init__(boost::python::api::object, std::string, std::string, int)
    __init__(boost::python::api::object, std::string, int)

Then I searched for this error on the Internet and found the same issue on Caffe's issue page: Caffe#1905. I think the error is caused by an update to Caffe's API. So I changed the code in extract_features.py#101 to: net = caffe.Net(args.model_def, args.model, caffe.TEST). It worked, but a new problem came up:

Traceback (most recent call last):
  File "./extract_features.py", line 102, in <module>
    caffe.set_phase_test()
AttributeError: 'module' object has no attribute 'set_phase_test'

I think the reason is that some APIs in python_features/extract_features.py are too old.

opened by Beanocean 7

Question about usage of RCNN

Hello, I recently read your paper, and I very much appreciate you sharing your code here.

By the way, your paper indicates that you first extract the top regions obtained by RCNN and then get the CNN features; however, I do not see that object detection part in your implementation. Neither the training nor the test phase seems to use object detection functionality. Is it because it still works fine using the holistic image?

Thank you.

opened by jazzsaxmafia 5
Caffe python wrapper

Hi Andrej,

I wrote a function to generate image features using the Python Caffe wrapper. The only source of discrepancy I came across between your Matlab code and the Python code has to do with image resizing. Matlab's imresize does cubic interpolation and antialiasing correction by default, while caffe.io.resize now does linear interpolation. When I dumped the Matlab-preprocessed images to disk and loaded them into Python, I got the exact same Caffe predictions.

Currently the code contains hardcoded parameters (image mean, cropping dimensions) that you used in the Matlab code.

Thanks for sharing your code.

Ahmed

opened by ahmedosman 5
Have you implemented Visual-Semantic Alignments ?

Thanks for your kindness in releasing this code! It helps me a lot! I am interested in your CVPR paper: Deep Visual-Semantic Alignments for Generating Image Descriptions. But I did not find anything about Visual-Semantic Alignments in this released code. Have I missed something? Thanks!

opened by qmiwang 3
question about dropout implementation

Hi Andrej,

I have been learning a ton about RNNs and their implementation from looking through your code. I have a (perhaps silly) question about your dropout implementation. You claim that your code creates a mask that drops a fraction, drop_prob, of the units and then scales the remaining units by 1/(1-drop_prob). This doesn't seem correct to me since you are sampling using np.random.randn, which seems to sample from a normal distribution of mean 0 and variance 1.

For example, if you set drop_prob=1 (and ignore the fact that this makes your scale factor infinite) then you should be dropping all the units, but in reality you will be testing the boolean condition np.random.randn(some_shape)<(1-drop_prob). Since np.random.randn gives you negative values half the time (on average), you will only drop half the units (on average).

It seems like you want to be sampling from a uniform distribution from 0 to 1 in order for this to work properly.
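The difference is easy to check numerically. The sketch below is not the repository's code; the helper dropout_mask is written only for this comparison. It builds the same mask once from uniform samples and once from standard-normal samples and measures the fraction of units kept. With drop_prob=0.9 the uniform version keeps roughly 10% of the units, while the normal version keeps roughly 54%, since that is the probability of a standard normal sample falling below 0.1.

```python
import numpy as np

def dropout_mask(shape, drop_prob, sampler):
    """Keep a unit when sample < 1 - drop_prob; scale kept units by 1/(1 - drop_prob)."""
    scale = 1.0 / (1.0 - drop_prob)
    return (sampler(*shape) < (1.0 - drop_prob)) * scale

rng = np.random.RandomState(0)
drop_prob = 0.9
shape = (1000, 100)

# Uniform samples on [0, 1): the kept fraction matches 1 - drop_prob.
kept_uniform = np.mean(dropout_mask(shape, drop_prob, rng.rand) > 0)
# Standard-normal samples: far more units survive than intended.
kept_normal = np.mean(dropout_mask(shape, drop_prob, rng.randn) > 0)

print("fraction kept with uniform samples:", kept_uniform)
print("fraction kept with normal samples: ", kept_normal)
```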

Best, Sam

opened by sballas8 3
list index out of range error
I created coco_sample directory containing the following files.

COCO_val2014_000000463825.jpg

model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p (from here)

tasks.txt (containing one line COCO_val2014_000000463825.jpg)

vgg_feats.mat (from here)

I ran the following command.

python predict_on_images.py coco_sample/model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p -r coco_sample

I got an error message as below.

parsed parameters:
{
  "beam_size": 1,
  "checkpoint_path": "coco_sample/model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p",
  "root_path": "coco_sample"
}
loading checkpoint coco_sample/model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p
image 0/123287:
/home/ec2-user/neuraltalk/imagernn/lstm_generator.py:227: RuntimeWarning: overflow encountered in exp
  IFOGf[t,:3*d] = 1.0/(1.0+np.exp(-IFOG[t,:3*d]))
PRED: (-14.587771) a man and a woman sitting on a bench in the middle of a park
image 1/123287:
Traceback (most recent call last):
  File "predict_on_images.py", line 109, in <module>
    main(params)
  File "predict_on_images.py", line 66, in main
    img['local_file_path'] =img_names[n]
IndexError: list index out of range

Isn't it possible to run predict_on_images.py on a few images?
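One plausible reading of the log above is that the script loops over all 123287 feature vectors in vgg_feats.mat while tasks.txt lists a single image, so the second iteration runs past the end of the name list. Under that assumption, a small sanity check before running prediction could look like the sketch below; the helper names read_task_list and check_alignment are hypothetical and not part of the repository.

```python
def read_task_list(path):
    """Return the non-empty lines of a tasks.txt-style file, one image name per line."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def check_alignment(img_names, num_feature_vectors):
    """Raise if the image list and the feature matrix disagree in length."""
    if len(img_names) != num_feature_vectors:
        raise ValueError(
            "%d image names but %d feature vectors; the features file should "
            "contain exactly one vector per listed image"
            % (len(img_names), num_feature_vectors)
        )
    return True

# Mirrors the situation in the log: one listed image, 123287 feature vectors.
try:
    check_alignment(["COCO_val2014_000000463825.jpg"], 123287)
except ValueError as err:
    print("mismatch:", err)
```

If this is indeed the cause, extracting features for just the listed images, rather than reusing the full-dataset features file, should make the counts match.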
opened by pecorarista 3
init_model_from argument has no effect on where driver starts

It seems like, even when a checkpoint file is passed into --init_model_from argument, it starts from epoch 0.00 and acts like the initial model was never even passed in.

opened by EricZeiberg 2
MRFs for text segment alignments

Hi Andrej, Thank you very much for open sourcing the code! Your paper talks about MRFs for decoding text segment alignments to images, but I couldn't find any code related to that. Am I missing something?

Thanks Pradeep.

opened by pradeepkaruturi 2
Python Caffe Features using Matlab like imresize

Hi Andrej,

I am done with the Matlab-like imresize implementation (imresize function below). The output of prepare_image_batch matches the output of my Python preprocess_image to the 4th decimal place (because Python can store decimals to a larger precision than Matlab). Attached is a histogram of the error between Matlab's prepare_image_batch output image and Python's preprocess_image output.

Moreover, I compared the final predictions from the new Python script py_caffe_feat_extract.py, using both the new Python imresize and Caffe's imresize, against Matlab's predictions. Attached is a side-by-side histogram of the error: the maximum discrepancy with the new Python imresize is 0.3, compared to +/-1.5 for the Caffe image resize. Again, I think if I limited Python precision to 4 decimal places from the start, the residual error of 0.3 would go down to 0.

All these results are based on the Caltech 101 dataset.

If you are ok with that script, then I'll submit another pull request making the changes we agreed on earlier to the py_predict_images.py script.

opened by ahmedosman 2
extract_features.py fix for caffe update

Fixed extract_features.py based on this issue: https://github.com/karpathy/neuraltalk/issues/31

This modification allowed me to run extract_features.py successfully with latest caffe

opened by alyxb 1
extract_features python script updated. Some input flags (caffe and o…

extract_features python script updated. Some input flags (caffe and out) are now optional. Added bicubic interpolation to imresize. The vgg_feats.mat is now generated at the end of the script.

opened by SimoV8 1
docs: fix simple typo, witheld -> withheld

There is a small typo in Readme.md.

Should read withheld rather than witheld.

Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

opened by timgates42 0

predict_on_images.py error

usage: predict_on_images.py [-h] [-r ROOT_PATH] [-b BEAM_SIZE] checkpoint_path
predict_on_images.py: error: the following arguments are required: checkpoint_path
An exception has occurred, use %tb to see the full traceback.

This error happened. What should I do?
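The usage message says that checkpoint_path is a required positional argument, so the script has to be given a path to a checkpoint file. The sketch below rebuilds a parser with the same shape as the usage line to show what a valid invocation looks like. It is not the script's actual parser: the long option names, the defaults other than beam_size, and the example paths are assumptions for illustration.

```python
import argparse

def build_parser():
    """Parser mirroring: predict_on_images.py [-h] [-r ROOT_PATH] [-b BEAM_SIZE] checkpoint_path"""
    parser = argparse.ArgumentParser(prog="predict_on_images.py")
    # Positional argument: omitting it triggers the "arguments are required" error.
    parser.add_argument("checkpoint_path", type=str,
                        help="path to a trained model checkpoint")
    # Long option names and the root_path default are hypothetical.
    parser.add_argument("-r", "--root_path", type=str, default=".",
                        help="folder holding the images and their features")
    parser.add_argument("-b", "--beam_size", type=int, default=1,
                        help="beam size used during sentence generation")
    return parser

# Equivalent to: python predict_on_images.py cv/some_checkpoint.p -r example_images
args = build_parser().parse_args(["cv/some_checkpoint.p", "-r", "example_images"])
print(vars(args))
```

When running from a notebook or IPython session, the arguments still need to be supplied, for example through the %run magic followed by the script name and the checkpoint path.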

opened by brightbsit 0

eval_sentence_predictions.py: error: too few arguments

~/tf/neuraltalk-master$ python eval_sentence_predictions.py
usage: eval_sentence_predictions.py [-h] [-b BEAM_SIZE] [--result_struct_filename RESULT_STRUCT_FILENAME] [-m MAX_IMAGES] [-d DUMP_FOLDER] checkpoint_path
eval_sentence_predictions.py: error: too few arguments

When I run this script I get this error. Is it a problem with the checkpoint path or something else? Thank you.

opened by NEUdeep 3
Fix broken headings in Markdown files

GitHub changed the way Markdown headings are parsed, so this change fixes it.

See bryant1410/readmesfix for more information.

Tackles bryant1410/readmesfix#1

opened by bryant1410 0
multi-bleu.perl
Hi,

Is this the same script as Moses's multi-bleu.perl? I've seen that there are some modifications to the original version. I've been investigating why my baseline model's (Google NIC with VGG-E) BLEU-2-3-4 performance is really low, and what I've found is that we are not using the same evaluation scripts. I know that this task is different from the machine translation task, though. So, my questions are:

What's the intention behind the BLEU evaluation script modification?

Do all captioning researchers evaluate their models with this approach?
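For readers comparing evaluation scripts, the core quantity in BLEU is the clipped (modified) n-gram precision against multiple references. The sketch below shows only that component, under the standard definition; it omits the brevity penalty and the geometric mean over n, and it is neither the multi-bleu.perl script nor the modified version shipped with this repository.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / float(sum(cand_counts.values()))

refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
cand = "the the the the the the the".split()
# "the" occurs at most twice in a single reference, so 2 of 7 unigrams count.
print(modified_precision(cand, refs, 1))
```

Differences in tokenization, reference handling, or smoothing between scripts can shift the final score noticeably, which is one reason to compare numbers only when they come from the same script.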

Thanks in advance.
opened by ilkerkesen 0


Code and datasets for the paper "Combining Events and Frames using Recurrent Asynchronous Multimodal Networks for Monocular Depth Prediction" (RA-L, 2021)

An image base contains 490 images for learning (400 cars and 90 boats), and another 21 images for testing

This repository contains the official implementation code of the paper Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis, accepted at EMNLP 2021.

Composable transformations of Python+NumPy programs

PyKale is a PyTorch library for multimodal learning and transfer learning as well as deep learning and dimensionality reduction on graphs, images, texts, and videos

MLP-Numpy - A simple modular implementation of Multi Layer Perceptron in pure Numpy.

Official implementation for NIPS'17 paper: PredRNN: Recurrent Neural Networks for Predictive Learning Using Spatiotemporal LSTMs.

An implementation of DeepMind's Relational Recurrent Neural Networks in PyTorch.

Code for the paper "Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks"

Physics-informed convolutional-recurrent neural networks for solving spatiotemporal PDEs

Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision. ICCV 2021.

A python-image-classification web application project, written in Python and served through the Flask Microframework. This Project implements the VGG16 convolutional neural network, through Keras and Tensorflow wrappers, to make predictions on uploaded images.

MAGMA - a GPT-style multimodal model that can understand any combination of images and language

Official repository for the paper "Can You Learn an Algorithm? Generalizing from Easy to Hard Problems with Recurrent Networks"

Differentiable architecture search for convolutional and recurrent networks

A CNN implementation using only numpy. Supports multidimensional images, stride, etc.

Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

Implementing Graph Convolutional Networks and Information Retrieval Mechanisms using pure Python and NumPy

Deep Multimodal Neural Architecture Search