Image captioning - Tensorflow implementation of Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Overview

Introduction

This neural system for image captioning is roughly based on the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" by Xu et al. (ICML 2015). The input is an image, and the output is a sentence describing the content of the image. It uses a convolutional neural network (CNN) to extract visual features from the image, and an LSTM recurrent neural network (RNN) to decode these features into a sentence. A soft attention mechanism is incorporated to improve the quality of the caption. This project is implemented using the TensorFlow library, and allows end-to-end training of both the CNN and RNN parts.
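The soft attention step mentioned above can be illustrated with a small standalone sketch. This is not the code used in this project: the function names are hypothetical, the relevance scores are simply passed in (in the paper they come from a small network applied to each region feature and the previous LSTM state), and plain Python lists stand in for tensors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(features, scores):
    """Return (weights, context) for one decoding step.

    features: list of L region feature vectors (each a list of D floats),
              e.g. the flattened spatial grid of a CNN feature map.
    scores:   list of L unnormalised relevance scores, one per region
              (assumed to be given; the scoring network is not modelled here).
    """
    weights = softmax(scores)
    dim = len(features[0])
    # The context vector is the attention-weighted average of the regions.
    context = [sum(w * f[d] for w, f in zip(weights, features))
               for d in range(dim)]
    return weights, context


# Toy example: 3 regions with 2-dimensional features and equal scores,
# so every region receives the same attention weight.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]
weights, context = soft_attention(features, scores)
print(weights, context)
```

At each decoding step the LSTM would receive such a context vector alongside the previous word, so the decoder can focus on different image regions for different words.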

Prerequisites

Usage

  • Preparation: Download the COCO train2014 and val2014 data here. Put the COCO train2014 images in the folder train/images, and put the file captions_train2014.json in the folder train. Similarly, put the COCO val2014 images in the folder val/images, and put the file captions_val2014.json in the folder val. Furthermore, download the pretrained VGG16 net here or ResNet50 net here if you want to use it to initialize the CNN part.

  • Training: To train a model using the COCO train2014 data, first set up the parameters in the file config.py and then run a command like this:

python main.py --phase=train \
    --load_cnn \
    --cnn_model_file='./vgg16_no_fc.npy' \
    [--train_cnn]

Turn on --train_cnn if you want to jointly train the CNN and RNN parts. Otherwise, only the RNN part is trained. The checkpoints will be saved in the folder models. If you want to resume the training from a checkpoint, run a command like this:

python main.py --phase=train \
    --load \
    --model_file='./models/xxxxxx.npy' \
    [--train_cnn]

To monitor the progress of training, run the following command:

tensorboard --logdir='./summary/'

  • Evaluation: To evaluate a trained model using the COCO val2014 data, run a command like this:

python main.py --phase=eval \
    --model_file='./models/xxxxxx.npy' \
    --beam_size=3

The result will be shown in stdout. Furthermore, the generated captions will be saved in the file val/results.json.

  • Inference: You can use the trained model to generate captions for any JPEG images! Put such images in the folder test/images, and run a command like this:

python main.py --phase=test \
    --model_file='./models/xxxxxx.npy' \
    --beam_size=3

The generated captions will be saved in the folder test/results.
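The --beam_size option used in the evaluation and inference commands controls beam search decoding. The sketch below shows the general technique on a toy next-word table. It is not this project's decoder: `beam_search`, `toy_step`, `TABLE` and the tokens are hypothetical names chosen for illustration.

```python
import math


def beam_search(step_fn, start, end, beam_size=3, max_len=20):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(prefix) must return a dict mapping each candidate next token
    to its probability given the prefix (a list of tokens).
    """
    beams = [(0.0, [start])]  # (log-probability, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, tokens in beams:
            for token, prob in step_fn(tokens).items():
                if prob > 0.0:
                    candidates.append((logp + math.log(prob), tokens + [token]))
        if not candidates:
            break
        # Keep only the beam_size most probable extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, tokens in candidates[:beam_size]:
            if tokens[-1] == end:
                finished.append((logp, tokens))
            else:
                beams.append((logp, tokens))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses
    return max(finished, key=lambda c: c[0])[1]


# Toy next-word distribution in which the greedy first choice ("a")
# leads to a less probable sentence than the alternative ("the").
TABLE = {
    ("<s>",): {"a": 0.6, "the": 0.4},
    ("<s>", "a"): {"dog": 0.5, "cat": 0.5},
    ("<s>", "the"): {"dog": 0.9, "cat": 0.1},
}


def toy_step(tokens):
    # Any prefix not in the table is followed by the end token.
    return TABLE.get(tuple(tokens), {"</s>": 1.0})


greedy = beam_search(toy_step, "<s>", "</s>", beam_size=1)
best = beam_search(toy_step, "<s>", "</s>", beam_size=3)
print(greedy, best)
```

With a beam size of 1 the search reduces to greedy decoding. A larger beam keeps several partial captions alive, which can recover a more probable sentence at the cost of more computation per step.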

Results

A pretrained model with default configuration can be downloaded here. This model was trained solely on the COCO train2014 data. It achieves the following BLEU scores on the COCO val2014 data (with beam size=3):

  • BLEU-1 = 70.3%
  • BLEU-2 = 53.6%
  • BLEU-3 = 39.8%
  • BLEU-4 = 29.5%
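For readers unfamiliar with the metric, the following is a simplified, sentence-level sketch of how a BLEU-n score is computed: clipped n-gram precision combined with a brevity penalty. It is only an illustration and is not the COCO evaluation code that produced the numbers above; the function names and example sentences are made up.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU-max_n for a tokenised candidate and references."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        # For each n-gram, the largest count seen in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference length closest to the candidate.
    ref_len = min((len(r) for r in references),
                  key=lambda n: (abs(n - len(candidate)), n))
    if len(candidate) >= ref_len:
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - ref_len / len(candidate))
    return penalty * math.exp(sum(log_precisions) / max_n)


candidate = "a dog runs on the grass".split()
references = ["a dog is running on the grass".split(),
              "a dog runs across the grass".split()]
for n in range(1, 5):
    print("BLEU-%d = %.3f" % (n, bleu(candidate, references, max_n=n)))
```

Corpus-level scorers aggregate the n-gram counts over all images before taking the ratio, so their results differ from an average of per-sentence scores.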

Here are some captions generated by this model: examples

References

Comments
  • Well trained model on COCO train 2014?

    This link cannot be opened now.

    Could somebody share the well-trained model the author provided? Or has somebody already trained the model on their own computer and saved it?

    Thanks very much!

    opened by ivy94419 4
  • subprocess.py

    subprocess.py", line 1024, in _execute_child raise child_exception OSError: [Errno 2] No such file or directory

    Loading the model from ./models/289999.npy...
    100%|██████████| 47/47 [00:01<00:00, 23.69it/s]
    47 tensors loaded.
    Evaluating the model ...
    batch: 100%|██████████| 1266/1266 [2:02:42<00:00, 5.83s/it]
    Loading and preparing results...
    DONE (t=0.13s)
    creating index...
    index created!
    tokenization...
    Traceback (most recent call last):
      File "main.py", line 69, in <module>
        tf.app.run()
      File "/home/viktor/anaconda2/envs/captureimage4/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
        _sys.exit(main(_sys.argv[:1] + flags_passthrough))
      File "main.py", line 58, in main
        model.eval(sess, coco, data, vocabulary)
      File "/home/viktor/anaconda2/envs/captureimage4/scr/image_captioning-master/base_model.py", line 108, in eval
        scorer.evaluate()
      File "/home/viktor/anaconda2/envs/captureimage4/scr/image_captioning-master/utils/coco/pycocoevalcap/eval.py", line 31, in evaluate
        gts = tokenizer.tokenize(gts)
      File "/home/viktor/anaconda2/envs/captureimage4/scr/image_captioning-master/utils/coco/pycocoevalcap/tokenizer/ptbtokenizer.py", line 52, in tokenize
        stdout=subprocess.PIPE)
      File "/home/viktor/anaconda2/envs/captureimage4/lib/python2.7/subprocess.py", line 390, in __init__
        errread, errwrite)
      File "/home/viktor/anaconda2/envs/captureimage4/lib/python2.7/subprocess.py", line 1024, in _execute_child
        raise child_exception
    OSError: [Errno 2] No such file or directory
    (captureimage4) viktor@viktor-System-Product-Name:

    opened by MikhailovSergei 1
  • Which layer of Google's NASNet should I use for extracting features for attention?

    I am implementing a network similar to this one, but want to use the pre-trained CNN with the highest accuracy on the ILSVRC 2012 dataset, i.e., NASNet-large. Usually, people extract the features of the last convolution layer. But NASNet's architecture is relatively complex and I couldn't find a direct Conv layer. Below is the TensorBoard visualization of the "final_layer" cell of NASNet:

    [Image: tb1]

    And below is the second last cell:

    [Image: tb2]

    To me, the relu node I've selected in the first image ([1,11,11,4032]) seems close to what's needed for attention, but I am not sure. Any help will be highly appreciated.

    opened by aayushARM 1
  • FileNotFoundError: [Errno 2] No such file or directory:

    FileNotFoundError: [Errno 2] No such file or directory: "'./models/289999.npy'"

    I'm using the pretrained model provided by the author to generate some captions for my jpg images. All the following lines get printed in the console.

    • loading annotations into memory...
    • Done (t=0.58s)
    • creating index...
    • index created!
    • Building the vocabulary...
    • Vocabulary built.
    • Number of words = 5000
    • Building the dataset...
    • Dataset built.
    • Building the CNN...
    • CNN built.
    • Building the RNN...
    • RNN built.
    • Loading the model from './models/289999.npy'...

    But after this I get the error FileNotFoundError: [Errno 2] No such file or directory: "'./models/289999.npy'". I already have the model in the models directory. Can anyone help me with this?

    opened by abhisingh192 0
  • Pretrained model efficiency

    I used the pretrained model that you shared and got bad captioning results. Which CNN did you use for it? The default value from config.py, which is vgg16? Can you share a model that was trained with ResNet, and with more advanced configurations that can help get better results?

    In general, which configuration options did you change to achieve that?

    Thanks in advance.

    opened by zbeedatm 0
  • python3.5 ImportError: No module named '_sqlite3' nltk

    Hi guys! When I tried to run train.py on Ubuntu 16.04 with Python 3.5, I got this error. Could anyone tell me how to fix it? I don't have root access and I don't want to reinstall Python from source. Thank you!

    opened by AngelaDevHao 0
  • SystemError: <built-in function imread> returned NULL without setting an error

    def func(images):
        images = np.array(images)
        newSize = [300,300]
        print('image_A->',images[0])
        #print('boxes->',boxes)
    
    
        image = cv2.imread(images[0])
        print(image)
    
        scale_x = newSize[0] / image.shape[1]
        scale_y = newSize[1] / image.shape[0]
        image = cv2.resize(image, (newSize[0], newSize[1]))
        return image
    
    def train():
        print(tf.__version__)
    
        images,boxes,labels,difficulties= PascalVOCDataset()
        boxes = tf.ragged.constant(boxes)
        dataset = tf.data.Dataset.from_tensor_slices((images,boxes)).shuffle(100).batch(1)
        dataset = dataset.map(lambda image,box: tf.py_function(func=func, inp = [image],Tout=tf.string))
    
    
    
        #print(dataset)
        #dataset = dataset.map(func)
    
        for image,box in dataset:
            break
    
    
    def main():
        train()
    if __name__ =='__main__':
        main()
    

    image_A-> b'/media/jake/mark-4tb3/input/datasets/pascal/VOCtrainval_11-May-2012/VOCdevkit/VOC2012/JPEGImages/2008_000259.jpg'
    2020-05-28 22:28:27.374154: W tensorflow/core/framework/op_kernel.cc:1610] Unknown: SystemError: <built-in function imread> returned NULL without setting an error
    Traceback (most recent call last):

      File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 219, in __call__
        return func(device, token, args)

      File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 113, in __call__
        ret = self._func(*args)

      File "/home/jake/Gits/ssd_tensorflow/train.py", line 21, in func
        image = cv2.imread(images[0])

    SystemError: <built-in function imread> returned NULL without setting an error

    2020-05-28 22:28:27.374302: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at iterator_ops.cc:929 : Unknown: {{function_node _inference_Dataset_map_24}} SystemError: <built-in function imread> returned NULL without setting an error
    Traceback (most recent call last):

      File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 219, in __call__
        return func(device, token, args)

      File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 113, in __call__
        ret = self._func(*args)

      File "/home/jake/Gits/ssd_tensorflow/train.py", line 21, in func
        image = cv2.imread(images[0])

    SystemError: <built-in function imread> returned NULL without setting an error

     [[{{node EagerPyFunc}}]]

    Traceback (most recent call last):
      File "/home/jake/Gits/ssd_tensorflow/train.py", line 49, in <module>
        main()
      File "/home/jake/Gits/ssd_tensorflow/train.py", line 47, in main
        train()
      File "/home/jake/Gits/ssd_tensorflow/train.py", line 42, in train
        for image,box in dataset:
      File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 622, in __next__
        return self.next()
      File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 666, in next
        return self._next_internal()
      File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 651, in _next_internal
        output_shapes=self._flat_output_shapes)
      File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_dataset_ops.py", line 2673, in iterator_get_next_sync
        _six.raise_from(_core._status_to_exception(e.code, message), None)
      File "<string>", line 3, in raise_from
    tensorflow.python.framework.errors_impl.UnknownError: {{function_node _inference_Dataset_map_24}} SystemError: <built-in function imread> returned NULL without setting an error
    Traceback (most recent call last):

      File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 219, in __call__
        return func(device, token, args)

      File "/home/jake/venv/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 113, in __call__
        ret = self._func(*args)

      File "/home/jake/Gits/ssd_tensorflow/train.py", line 21, in func
        image = cv2.imread(images[0])

    SystemError: <built-in function imread> returned NULL without setting an error

     [[{{node EagerPyFunc}}]] [Op:IteratorGetNextSync]
    

    Process finished with exit code 1

    opened by SlowMonk 0
  • No captions generated while testing

    I'm testing the model on my own images. I run the code like this: python main.py --phase=test --model_file='./models/289999.npy' --beam_size=3. The code runs and finally this shows up in the terminal:

    • Testing the model ...
    • path: 0it [00:00, ?it/s]
    • Testing complete.

    But when I check the test/results folder, it is empty; no captions are generated. Can anyone help me with this?

    opened by abhisingh192 0
  • Chinese image caption, In the result, multiple words of the same type appear

    Hello, I am using the COCO dataset with a two-layer LSTM model: one layer for top-down attention and one layer for the language model.

    I extract words with jieba and use all the words that occur more than 3 times in the picture descriptions as the dictionary file, 14,226 words in total: words = [w for w in word_freq.keys() if word_freq[w] > 3]

    After training the model, when using it, multiple words of the same type appear in the result, such as:

    Note notebook laptop computer on bed

    A little girl little girl girl standing together

    How can I solve this problem?

    opened by cylvzj 1
  • why can't I get any result

    (py_36) E:\Users\pythoncx\workspace\image_captioning-master>python main.py --phase=train --load_cnn --cnn_model_file='./vgg16_no_fc.npy' --train_cnn

    After I enter the command above, nothing happens. Please tell me why.

    opened by HIHIHAHEI 1
Owner
Guoming Wang
Personal account focusing on Artificial Intelligence