ECCV2020 paper: Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. Code and Data.

Overview

This repo contains part of the code and data for the paper Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards (ECCV 2020).

Special Notes:

  1. This dataset is much bigger than the one used in the ECCV 2020 paper. This release has almost 1M images, while the one used in the paper contains only about half of that (even though you might find 993K reported in the paper).
  2. The evaluation code is now adapted from self-critical.pytorch.
  3. Because of the two changes above, CIDEr scores should now be higher, while the other metrics might be lower. We will try to update the reported scores soon.

Codes:

This repo currently contains only the code for SAT, BUTD, and CNN-C, as described in the paper.

The evalcap folder can be downloaded from here.

To train, run sh train.sh. To test, run sh test.sh.

I kept getting bad results with the CNN-C model, with all generations on the val set being identical. I had the same issue when I tried to adapt the code from self-critical.pytorch. This never happened when I ran the experiments for the ECCV paper. I would really appreciate it if anyone could figure out why this happens.

Dataset:

To get the preprocessed data, use this, or email Xuewen Yang at [email protected] if you need the raw data.

For other issues, please create an issue on this repo.

If you want to download the original dataset (some data might be missing), you can:

  1. First download the json file from here.
  2. Then use wget or another download script, for example: wget https://n.nordstrommedia.com/id/sr3/58d1a13f-b6b6-4e68-b2ff-3a3af47c422e.jpeg. Remember to drop everything after .jpeg in the URL to get high-resolution images; otherwise very low-resolution images are downloaded. A download sketch is given after this list.
  3. Sometimes the description is no longer available; in that case it can be retrieved from the 'detail_info' part.
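
A minimal download sketch following the steps above (not the official script; the json filename and the 'images' key are assumptions, so adjust them to the actual file layout):

import json
import os
import subprocess

# Rough sketch only: 'facad_meta.json' and the 'images' key are placeholders;
# 'detail_info' is the fallback for missing descriptions mentioned in step 3.
with open('facad_meta.json') as f:
    items = json.load(f)

os.makedirs('images', exist_ok=True)
for item in items:
    # Fall back to 'detail_info' when the description is no longer available.
    desc = item.get('description') or item.get('detail_info')
    for url in item.get('images', []):           # assumed key holding the image URLs
        if '.jpeg' not in url:
            continue
        url = url.split('.jpeg')[0] + '.jpeg'    # keep nothing after .jpeg for high resolution
        subprocess.run(['wget', '-q', '-P', 'images', url])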

License:

  1. The dataset is released under the license in the LICENSE file.
  2. No commercial use.

Citation:

If you use this data, please cite:

@inproceedings{XuewenECCV20Fashion,
  author    = {Xuewen Yang and Heming Zhang and Di Jin and Yingru Liu and Chi-Hao Wu and Jianchao Tan and Dongliang Xie and Jue Wang and Xin Wang},
  title     = {Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards},
  booktitle = {ECCV},
  year      = {2020}
}
Comments
  • About data process

    Sir, thank you very much for publishing the code.

    Recently I downloaded all the items with the first color from FACAD. The json file of the dataset is named "meta_all_129927.json".

    When I checked the downloaded data, I got 126753 items, which correspond to the first-color data in FACAD. There are about 1200 images that lack a link (basically, one product lacks one link) and 3 items that do not have any links. Data processing is very troublesome.

    I used the data of 100000 images; the results are too low, so the data processing must have some problems.

    Results on val (using the train data to evaluate): B4: 0.09, M: 0.07, R: 0.11, C: 0.23

    So I downloaded the data you provide, named 'TEST_IMAGES_5.hdf5, VAL_IMAGES_5.hdf5, TRAIN_IMAGES_5.hdf5'.

    For example, the number of images in TEST_IMAGES_5.hdf5 and TEST_IMAGEPATH_5.json is 99981; however, the other json files such as 'TEST_CAPLENS_5.json', 'TEST_CAPTIONS_5.json' ... have 99946 entries. I only checked the test data.

    Could you check the uploaded data? I use TEST_IMAGEPATH_5.json to find the image id and then get the description, attributes, etc. I do not know whether this way is right.
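
    A quick way to reproduce the count check above (just a sketch; the 'images' dataset name inside the HDF5 file is a guess):

    import json
    import h5py

    # Compare the number of entries in the HDF5 image file with the caption json files
    # to show the 99981 vs. 99946 mismatch described above.
    with h5py.File('TEST_IMAGES_5.hdf5', 'r') as h:
        n_images = h['images'].shape[0]      # 'images' dataset name is an assumption
    with open('TEST_IMAGEPATH_5.json') as f:
        n_paths = len(json.load(f))
    with open('TEST_CAPTIONS_5.json') as f:
        n_captions = len(json.load(f))
    print('images:', n_images, 'paths:', n_paths, 'captions:', n_captions)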

    Today I downloaded the code and found that some files referenced in the yml configs, such as sat.yml, are missing:
    data_folder: /home/xuewyang/Xuewen/Research/data/FACAD/jsons
    model_folder: /home/xuewyang/Xuewen/Research/model/fashion/captioning/SAT2
    checkpoint: /home/xuewyang/Xuewen/Research/model/fashion/captioning/SAT/vanilla/BEST_checkpoint_2.pth.tar

    I sent a few emails but got no reply. I am very, very interested in fashion captioning and hope to get your help.

    opened by tangyuhao2016 25
  • About Experiment Settings and Performances

    Thanks for sharing your dataset. It seems to be a really useful and fantastic work! But I'm running into trouble when I try to replicate some results.

    I used the code in ruotian's repo to try some baselines. I trained the 'att2in' and 'adaatt' models using XE loss on FACAD, but got really bad performance on BLEU, METEOR, ROUGE-L, and CIDEr. Even when I use the training split to evaluate the trained model, the scores are still much lower than reported in the paper, except CIDEr.

    I also find that the training loss can drop to 1.8 after some epochs, while the loss on the val split stops at about 3.1. It seems I've run into overfitting, but I have no idea why, as I think the amount of data is big enough to avoid overfitting. Note that these models all behave well on the COCO dataset, and I think I've preprocessed FACAD into the COCO format.

    The only difference is that in COCO each image is paired with 5 captions, while in FACAD each image is paired with only one caption, and sometimes different images share the same caption. I don't know if this difference causes the terrible performance.

    Do you have any ideas on these problems? Are there any significant details for data preprocessing or training?

    opened by LONGRYUU 22
  • Scores of the 3 released baselines.

    Thanks for your released code. The code is well structured; I replaced the dataloader with my own implementation and it still works well. But I still have some issues.

    I've trained the SAT and BUTD models for about 15 epochs now. They both achieve high scores, but the differences are quite large, especially on CIDEr, which is about 192.6 and 144.6 respectively. Are these results alright? What scores did you get with these models?

    Detailed results are as follows:
    SAT: Bleu_1: 0.495, Bleu_2: 0.348, Bleu_3: 0.267, Bleu_4: 0.219, METEOR: 0.215, ROUGE_L: 0.465, CIDEr: 1.928

    BUTD: Bleu_1: 0.462, Bleu_2: 0.302, Bleu_3: 0.213, Bleu_4: 0.161, METEOR: 0.193, ROUGE_L: 0.432, CIDEr: 1.446

    Besides, I've also trained CNN-C for 4 epochs. I find it quite slow to evaluate, and it achieves really low scores: Bleu_1: 0.158, Bleu_2: 0.057, Bleu_3: 0.020, Bleu_4: 0.009, METEOR: 0.060, ROUGE_L: 0.131, CIDEr: 0.094.

    opened by LONGRYUU 15
  • Problems about the proposed approach in the paper

    I have some questions after reading your paper.

    1. How do you get the attribute vector z? More specifically, how do you transform the image features into the vector z? In the paper, z is obtained from a feed-forward layer; which PyTorch functions did you use to build this layer, linear layers or convolutional layers? There could be several strategies to compress the 3-dimensional image features into a vector.

    2. In Equation 8, the 1/n is placed outside the brackets; is that a typo? Does it mean β * P(1) * √(P(2)) or β * √(P(1) * P(2))?

    opened by LONGRYUU 5
  • Dataset Details Mismatch

    Is the dataset used in the paper different from the preprocessed dataset provided on Google Drive? Or am I missing something?
    Preprocessed data from Google Drive: TRAIN: 888293, VAL: 19915, TEST: 101225

    From paper Section 5.1: It contains 993K images and 130K descriptions, and we split the whole dataset, with approximately 794K image-description pairs for training, 99K for validation, and the remaining 100K for test.

    opened by gourango01 0
  • could you please provide a pre-trained model?

    Hi, very cool work, thanks a lot for making your code public! It would be great if you could share a pre-trained model plus a sample script to create captions for new images. That would be super cool and helpful.

    Would that be possible? Thanks! Z.

    opened by zoharbarzelay 0
  • About file preparation step and Training procedure

    Thank you for your work. It looks interesting and useful, but I ran into some problems when preparing this new dataset and in the training step. Could you explain what I should do, such as:

    1. What files should I download for training the model?
    2. Which library I should install?
    3. etc.

    Could you explain it for me, please? Looking forward to your reply @xuewyang. Thank you.

    opened by donnaphat-ut 0
  • About structure details and attribute learning

    Thank you. I met some problems while reproducing the model.

    1. Regarding "the encoder is a pre-trained CNN, which takes an image as the input and extracts B image features, X={x0, x1,...,xB}": does X mean the feature map (batchsize * 2048 * 14 * 14) output by the last convolutional layer of ResNet-101?

    2. In Figure 3, the average pooling of the feature map (batchsize * 2048) is fed into the feed-forward network. How many layers does the FF consist of, only one layer (2048 * 990) followed by a sigmoid, or more? Is Z taken as the output of the FF before the sigmoid or after the sigmoid?

    3. Is the attribute learning pretrained separately first and then added to fine-tune the caption model, or are attribute learning and the caption model trained together from the beginning?

    4. When we get z, is z concatenated with y (the word embedding) as input to the caption model, or is z concatenated with the output of the image features after the attention module?

    Looking forward to your reply.

    opened by tangyuhao2016 23