Image captioning - Tensorflow implementation of Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Guoming Wang

Last update: Dec 28, 2022


Overview

Introduction

This neural system for image captioning is roughly based on the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" by Xu et al. (ICML 2015). The input is an image, and the output is a sentence describing the content of the image. A convolutional neural network (CNN) extracts visual features from the image, and an LSTM recurrent neural network (RNN) decodes these features into a sentence. A soft attention mechanism is incorporated to improve the quality of the caption. The project is implemented with the TensorFlow library and allows end-to-end training of both the CNN and RNN parts.
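As a rough illustration of the soft attention step mentioned above, the following NumPy sketch scores each image region against the current LSTM state, normalizes the scores with a softmax, and returns the weighted average of the region features. It is not the code used in this project: the function name, the weight matrices `W_f`, `W_h`, `v`, and the sizes (196 regions of 512 dimensions) are assumptions chosen only for the example.

```python
import numpy as np

def soft_attention(features, hidden, W_f, W_h, v):
    """Hypothetical soft attention step.

    features: (L, D) array, one D-dimensional feature per image region.
    hidden:   (H,) array, the current LSTM hidden state.
    Returns the context vector (D,) and the attention weights (L,).
    """
    # Score each region: e_i = v . tanh(W_f a_i + W_h h)
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # shape (L,)
    # Softmax over regions (shifted by the max for numerical stability)
    exp_scores = np.exp(scores - scores.max())
    alpha = exp_scores / exp_scores.sum()
    # Context vector: attention-weighted average of the region features
    context = alpha @ features                            # shape (D,)
    return context, alpha

# Illustrative sizes only (assumed, not taken from config.py)
rng = np.random.default_rng(0)
L, D, H, K = 196, 512, 256, 128
features = rng.standard_normal((L, D))
hidden = rng.standard_normal(H)
W_f = rng.standard_normal((D, K)) * 0.01
W_h = rng.standard_normal((H, K)) * 0.01
v = rng.standard_normal(K)

context, alpha = soft_attention(features, hidden, W_f, W_h, v)
print(context.shape, alpha.shape)
```

In a full decoder the context vector would be fed to the LSTM together with the previous word at every time step, so the weights `alpha` change as the sentence is generated.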

Prerequisites

TensorFlow
NumPy
OpenCV
Natural Language Toolkit (NLTK)
Pandas
Matplotlib
tqdm

Usage

Preparation: Download the COCO train2014 and val2014 data here. Put the COCO train2014 images in the folder train/images, and put the file captions_train2014.json in the folder train. Similarly, put the COCO val2014 images in the folder val/images, and put the file captions_val2014.json in the folder val. Furthermore, download the pretrained VGG16 net here or ResNet50 net here if you want to use it to initialize the CNN part.
Training: To train a model using the COCO train2014 data, first set up the parameters in the file config.py and then run a command like this:

python main.py --phase=train \
    --load_cnn \
    --cnn_model_file='./vgg16_no_fc.npy' \
    [--train_cnn]

Turn on --train_cnn if you want to jointly train the CNN and RNN parts. Otherwise, only the RNN part is trained. The checkpoints will be saved in the folder models. If you want to resume the training from a checkpoint, run a command like this:

python main.py --phase=train \
    --load \
    --model_file='./models/xxxxxx.npy' \
    [--train_cnn]

To monitor the progress of training, run the following command:

tensorboard --logdir='./summary/'

Evaluation: To evaluate a trained model using the COCO val2014 data, run a command like this:

python main.py --phase=eval \
    --model_file='./models/xxxxxx.npy' \
    --beam_size=3

The result will be shown in stdout. Furthermore, the generated captions will be saved in the file val/results.json.
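The --beam_size flag controls beam search decoding. The sketch below shows the general idea on a toy next-word table; it is a simplified illustration, not the decoder in this repository, and `beam_search`, `step_fn`, and the probability table are hypothetical names and values made up for the example.

```python
import math

def beam_search(step_fn, start, end, beam_size=3, max_len=20):
    """Return the most probable token sequence found by beam search.

    step_fn(seq) must return a dict {token: probability} for the next token.
    """
    beams = [(0.0, [start])]      # (log probability, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            for tok, p in step_fn(seq).items():
                if p > 0:
                    candidates.append((logp + math.log(p), seq + [tok]))
        # Keep only the beam_size most probable partial captions
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, seq in candidates[:beam_size]:
            if seq[-1] == end:
                finished.append((logp, seq))
            else:
                beams.append((logp, seq))
        if not beams:
            break
    finished.extend(beams)        # unfinished captions if max_len was reached
    return max(finished, key=lambda c: c[0])[1]

# Toy model: the next word depends only on the previous word (assumed values)
table = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
}
step = lambda seq: table[seq[-1]]

print(beam_search(step, "<s>", "</s>", beam_size=1))  # greedy decoding
print(beam_search(step, "<s>", "</s>", beam_size=3))
```

With a beam size of 1 the search is greedy and commits to "a" because it is the most probable first word; with a beam size of 3 it keeps "the" alive long enough to find the more probable caption overall.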

Inference: You can use the trained model to generate captions for any JPEG images! Put such images in the folder test/images, and run a command like this:

python main.py --phase=test \
    --model_file='./models/xxxxxx.npy' \
    --beam_size=3

The generated captions will be saved in the folder test/results.
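Several of the steps above depend on files being in specific folders. A small check like the following can confirm the layout before a long run; it is a convenience sketch rather than part of the project, and `missing_paths` and `EXPECTED` are hypothetical names. The paths themselves are the ones listed under Preparation and Inference.

```python
import os

# Folders and files named in the Preparation and Inference steps
EXPECTED = [
    os.path.join("train", "images"),
    os.path.join("train", "captions_train2014.json"),
    os.path.join("val", "images"),
    os.path.join("val", "captions_val2014.json"),
    os.path.join("test", "images"),
]

def missing_paths(root="."):
    """Return the expected paths that do not exist under root."""
    return [p for p in EXPECTED if not os.path.exists(os.path.join(root, p))]

if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected folders and files are present.")
```

Run it from the project root; an empty result means the folders and annotation files are where main.py expects them.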

Results

A pretrained model with default configuration can be downloaded here. This model was trained solely on the COCO train2014 data. It achieves the following BLEU scores on the COCO val2014 data (with beam size=3):

BLEU-1 = 70.3%
BLEU-2 = 53.6%
BLEU-3 = 39.8%
BLEU-4 = 29.5%
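For readers unfamiliar with the metric, the sketch below computes a sentence-level BLEU-n score: clipped n-gram precisions combined by a geometric mean and multiplied by a brevity penalty. It is a simplified illustration with hypothetical function names. The scores above are corpus-level figures from the COCO evaluation code, which aggregates counts over all images, so this sketch will not reproduce them exactly.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        # For each n-gram, the largest count seen in any single reference
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip candidate counts so repeated words are not over-rewarded
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty, using the reference length closest to the candidate
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "a man riding a horse on a beach".split()
candidate = "a man riding a horse".split()
print(bleu(candidate, [reference], max_n=1))
print(bleu(candidate, [reference], max_n=4))
```

BLEU-1 uses only unigrams (max_n=1), while BLEU-4 averages the log precisions of 1-grams through 4-grams, which is why the reported score drops as n increases.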

Here are some captions generated by this model:

References

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio. ICML 2015.
The original implementation in Theano
An earlier implementation in Tensorflow
Microsoft COCO dataset

Comments

Well trained model on COCO train 2014?

The link cannot be opened now.

Could somebody share the trained model the author provided? Or has somebody already trained the model on their own computer and saved it?

Thanks very much!

opened by ivy94419 4
subprocess.py", line 1024, in _execute_child raise child_exception OSError: [Errno 2] No such file or directory

Loading the model from ./models/289999.npy...
100%|██████████| 47/47 [00:01<00:00, 23.69it/s]
47 tensors loaded.
Evaluating the model ...
batch: 100%|██████████| 1266/1266 [2:02:42<00:00, 5.83s/it]
Loading and preparing results...
DONE (t=0.13s)
creating index...
index created!
tokenization...
Traceback (most recent call last):
  File "main.py", line 69, in <module>
    tf.app.run()
  File "/home/viktor/anaconda2/envs/captureimage4/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "main.py", line 58, in main
    model.eval(sess, coco, data, vocabulary)
  File "/home/viktor/anaconda2/envs/captureimage4/scr/image_captioning-master/base_model.py", line 108, in eval
    scorer.evaluate()
  File "/home/viktor/anaconda2/envs/captureimage4/scr/image_captioning-master/utils/coco/pycocoevalcap/eval.py", line 31, in evaluate
    gts = tokenizer.tokenize(gts)
  File "/home/viktor/anaconda2/envs/captureimage4/scr/image_captioning-master/utils/coco/pycocoevalcap/tokenizer/ptbtokenizer.py", line 52, in tokenize
    stdout=subprocess.PIPE)
  File "/home/viktor/anaconda2/envs/captureimage4/lib/python2.7/subprocess.py", line 390, in __init__
    errread, errwrite)
  File "/home/viktor/anaconda2/envs/captureimage4/lib/python2.7/subprocess.py", line 1024, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
(captureimage4) viktor@viktor-System-Product-Name:

opened by MikhailovSergei 1
Which layer of Google's NASNet should I use for extracting features for attention?

I am implementing a network similar to this one, but I want to use the pre-trained CNN with the highest accuracy on the ILSVRC 2012 dataset, i.e., NASNet-large. Usually, people extract the features of the last convolution layer. But NASNet's architecture is relatively complex, and I couldn't find a direct Conv layer. Below is the TensorBoard visualization of the "final_layer" cell of NASNet:

And below is the second last cell:

To me, the relu node I've selected in the first image (shape [1, 11, 11, 4032]) seems close to what's needed for attention, but I am not sure. Any help will be highly appreciated.

opened by aayushARM 1
FileNotFoundError: [Errno 2] No such file or directory: "'./models/289999.npy'"
I'm using the pretrained model provided by the author to generate captions for my JPG images. The following lines are printed to the console:

loading annotations into memory...
Done (t=0.58s)
creating index...
index created!
Building the vocabulary...
Vocabulary built.
Number of words = 5000
Building the dataset...
Dataset built.
Building the CNN...
CNN built.
Building the RNN...
RNN built.
Loading the model from './models/289999.npy'...

After this I get the error FileNotFoundError: [Errno 2] No such file or directory: "'./models/289999.npy'", even though the model is already in the models directory. Can anyone help me with this?
opened by abhisingh192 0
Pretrained model efficiency

I used the pretrained model that you shared and got poor captioning results. Which CNN did you use for it? Is it the default value in config.py, which is vgg16? Can you share a model that was trained with ResNet and with more advanced configurations that can help get better results?

In general, which configuration settings did you change to achieve that?

Thanks in advance.

opened by zbeedatm 0
python3.5 ImportError: No module named '_sqlite3' nltk

Hi, guys! When I tried to run train.py on Ubuntu 16.04 with Python 3.5, I got this error. Could anyone tell me how to fix this problem? I don't have root privileges and I don't want to reinstall Python from source. Thank you!

opened by AngelaDevHao 0
SystemError: returned NULL without setting an error
def func(images):
    images = np.array(images)
    newSize = [300,300]
    print('image_A->',images[0])
    #print('boxes->',boxes)
    image = cv2.imread(images[0])
    print(image)
    scale_x = newSize[0] / image.shape[1]
    scale_y = newSize[1] / image.shape[0]
    image = cv2.resize(image, (newSize[0], newSize[1]))
    return image

def train():
    print(tf.__version__)
    images,boxes,labels,difficulties= PascalVOCDataset()
    boxes = tf.ragged.constant(boxes)
    dataset = tf.data.Dataset.from_tensor_slices((images,boxes)).shuffle(100).batch(1)
    dataset = dataset.map(lambda image,box: tf.py_function(func=func, inp = [image],Tout=tf.string))
    #print(dataset)
    #dataset = dataset.map(func)
    for image,box in dataset:
        break

def main():
    train()

if __name__ =='__main__':
    main()

image_A-> b'/media/jake/mark-4tb3/input/datasets/pascal/VOCtrainval_11-May-2012/VOCdevkit/VOC2012/JPEGImages/2008_000259.jpg'
2020-05-28 22:28:27.374154: W tensorflow/core/framework/op_kernel.cc:1610] Unknown: SystemError: returned NULL without setting an error
Traceback (most recent call last):

  File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 219, in __call__
    return func(device, token, args)

  File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 113, in __call__
    ret = self._func(*args)

  File "/home/jake/Gits/ssd_tensorflow/train.py", line 21, in func
    image = cv2.imread(images[0])

SystemError: returned NULL without setting an error

2020-05-28 22:28:27.374302: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at iterator_ops.cc:929 : Unknown: {{function_node _inference_Dataset_map_24}} SystemError: returned NULL without setting an error
Traceback (most recent call last):

  File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 219, in __call__
    return func(device, token, args)

  File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 113, in __call__
    ret = self._func(*args)

  File "/home/jake/Gits/ssd_tensorflow/train.py", line 21, in func
    image = cv2.imread(images[0])

SystemError: returned NULL without setting an error

	 [[{{node EagerPyFunc}}]]

Traceback (most recent call last):
  File "/home/jake/Gits/ssd_tensorflow/train.py", line 49, in <module>
    main()
  File "/home/jake/Gits/ssd_tensorflow/train.py", line 47, in main
    train()
  File "/home/jake/Gits/ssd_tensorflow/train.py", line 42, in train
    for image,box in dataset:
  File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 622, in __next__
    return self.next()
  File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 666, in next
    return self._next_internal()
  File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 651, in _next_internal
    output_shapes=self._flat_output_shapes)
  File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_dataset_ops.py", line 2673, in iterator_get_next_sync
    _six.raise_from(_core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: {{function_node _inference_Dataset_map_24}} SystemError: returned NULL without setting an error
Traceback (most recent call last):

  File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 219, in __call__
    return func(device, token, args)

  File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 113, in __call__
    ret = self._func(*args)

  File "/home/jake/Gits/ssd_tensorflow/train.py", line 21, in func
    image = cv2.imread(images[0])

SystemError: returned NULL without setting an error

	 [[{{node EagerPyFunc}}]] [Op:IteratorGetNextSync]

Process finished with exit code 1
opened by SlowMonk 0
No captions generated while testing
I'm testing the model on my own images. I run the code like this: python main.py --phase=test --model_file='./models/289999.npy' --beam_size=3. The code runs, and finally this shows up in the terminal:

Testing the model ...

path: 0it [00:00, ?it/s]

Testing complete.

But when I check the test/results folder, it is empty; no captions are generated. Can anyone help me with this?
opened by abhisingh192 0
Chinese image captioning: multiple words of the same type appear in the result

Hello, I am using the COCO dataset with a two-layer LSTM model: one layer for top-down attention and one layer for the language model.

I extract words with jieba and use every word that occurs more than 3 times in the picture descriptions as the dictionary, for a total of 14,226 words: words = [w for w in word_freq.keys() if word_freq[w] > 3]

After training the model, when using it, multiple words of the same type appear in the result, such as:

Note notebook laptop computer on bed
A little girl little girl girl standing together

How can I solve this problem?

opened by cylvzj 1
why can't I get any result

(py_36) E:\Users\pythoncx\workspace\image_captioning-master>python main.py --phase=train --load_cnn --cnn_model_file='./vgg16_no_fc.npy' --train_cnn

After I enter the command above, nothing happens. Please tell me why.

opened by HIHIHAHEI 1

Owner

Guoming Wang

Personal account focusing on Artificial Intelligence

GitHub

PyTorch Implementation of Fully Convolutional Networks. (Training code to reproduce the original result is available.)

pytorch-fcn PyTorch implementation of Fully Convolutional Networks. Requirements pytorch >= 0.2.0 torchvision >= 0.1.8 fcn >= 6.1.5 Pillow scipy tqdm

1.6k Jan 4, 2023

Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 200 universities.

D2L.ai: Interactive Deep Learning Book with Multi-Framework Code, Math, and Discussions Book website | STAT 157 Course at UC Berkeley | Latest version

16k Jan 3, 2023

Open source guides/codes for mastering deep learning to deploying deep learning in production in PyTorch, Python, C++ and more.

Deep Learning Materials by Deep Learning Wizard Start Learning Now Please head to www.deeplearningwizard.com to start learning! It is mobile/tablet fr

572 Dec 28, 2022

A collection of various deep learning architectures, models, and tips

Deep Learning Models A collection of various deep learning architectures, models, and tips for TensorFlow and PyTorch in Jupyter Notebooks. Traditiona

15.5k Jan 7, 2023

PyTorch tutorials and best practices.

Effective PyTorch Table of Contents Part I: PyTorch Fundamentals PyTorch basics Encapsulate your model with Modules Broadcasting the good and the ugly

1.5k Jan 4, 2023

This is a gentle introduction on how to start using an awesome library called Weights and Biases.

W&B Minimal PyTorch Tutorial This tutorial is also accompanied with a PyTorch source code, it can be found in src folder. Furthermore, all plots an

8 Aug 20, 2022

Fully Automated YouTube Channel ▶️with Added Extra Features.

Fully Automated Youtube Channel

249 Jan 2, 2023

A Telegram Bot for adding Footer caption beside main caption of Telegram Channel Messages.

Footer-Bot A Telegram Bot for adding Footer caption beside main caption of Telegram Channel Messages. Best for Telegram Movie Channels. Made by @AbirH

35 Jan 2, 2023

improvement of CLIP features over the traditional resnet features on the visual question answering, image captioning, navigation and visual entailment tasks.

CLIP-ViL In our paper "How Much Can CLIP Benefit Vision-and-Language Tasks?", we show the improvement of CLIP features over the traditional resnet fea

310 Dec 28, 2022

Simple image captioning model - CLIP prefix captioning.

688 Jan 4, 2023

use tensorflow 2.0 to tell a dog and cat from a specified picture

dog_or_cat use tensorflow 2.0 to tell a dog and cat from a specified picture This is one of the classic experiments for the introduction of deep learn

1 Oct 22, 2021

Fine-grained Control of Image Caption Generation with Abstract Scene Graphs

Faster R-CNN pretrained on VisualGenome This repository modifies maskrcnn-benchmark for object detection and attribute prediction on VisualGenome data

7 Apr 20, 2021

Show Data: Show your dataset in web browser!

Show Data generates HTML tables for large-scale image datasets, especially datasets on a remote server. It provides some useful command-line tools and a fully customizable API to generate HTML tables for different tasks.

83 Nov 26, 2022

🐸 Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️


5.6k Jan 3, 2023

Implementation of the 😇 Attention layer from the paper, Scaling Local Self-Attention For Parameter Efficient Visual Backbones

HaloNet - Pytorch Implementation of the Attention layer from the paper, Scaling Local Self-Attention For Parameter Efficient Visual Backbones. This re

189 Nov 22, 2022

Image Captioning using CNN ,LSTM and Attention

Image Captioning using CNN ,LSTM and Attention This is a deeplearning model which tries to summarize an image into a text . Installation Install this

1 Dec 16, 2021

This is an API/website to see the attendance recorded on your college website, along with how many days you can take off or need to attend class.

Bunker-Website This is a GUI version of the Bunker-API along with some visualization charts to see your attendance progress. Website Link Check out th

11 Dec 27, 2022


End-to-end image captioning with EfficientNet-b3 + LSTM with Attention

a static website generator to make beautiful customizable pictures galleries that tell a story

A Deep learning based streamlit web app which can tell with which bollywood celebrity your face resembles.