ECCV2020 paper: Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. Code and Data.

Overview

This repo contains part of the code and data for the paper Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards (ECCV 2020).

Special Notes:

  1. This dataset is much bigger than the one used in the ECCV 2020 paper. This release has almost 1M images, while the one used in the paper contains only about half of that (even though you might find 993K reported in the paper).
  2. The evaluation code is now adapted from self-critical.pytorch.
  3. Because of the two changes above, CIDEr scores should now be higher, while the other metrics might be lower. We will try to update the reported scores soon.

Codes:

This repo currently contains only the code for SAT, BUTD, and CNN-C, as described in the paper.

The evalcap folder can be downloaded from here.

To train, run sh train.sh. To test, run sh test.sh.

I kept getting bad results with the CNN-C model, with all generations on the val set being identical. I had the same issue when I tried to adapt the code from self-critical.pytorch. This never happened when I ran the experiments for the ECCV paper. I would really appreciate it if anyone could figure out why this happens.

Dataset:

To get the preprocessed data, use this, or email Xuewen Yang at [email protected] if you need the raw data.

For other issues, please create an issue on this repo.

If you want to download the original dataset (some data might be missing), you can:

  1. First download the json file from here.
  2. Then use wget or another download script, for example: wget https://n.nordstrommedia.com/id/sr3/58d1a13f-b6b6-4e68-b2ff-3a3af47c422e.jpeg. Remember to drop everything after .jpeg in the URL to get high-resolution images; otherwise very low-resolution images are downloaded. A download sketch is given after this list.
  3. Sometimes the description is no longer available; in that case it can be retrieved from the 'detail_info' part.
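
A minimal download sketch following the steps above (not the official script; the json filename and the 'images' key are assumptions, so adjust them to the actual file layout):

import json
import os
import subprocess

# Rough sketch only: 'facad_meta.json' and the 'images' key are placeholders;
# 'detail_info' is the fallback for missing descriptions mentioned in step 3.
with open('facad_meta.json') as f:
    items = json.load(f)

os.makedirs('images', exist_ok=True)
for item in items:
    # Fall back to 'detail_info' when the description is no longer available.
    desc = item.get('description') or item.get('detail_info')
    for url in item.get('images', []):           # assumed key holding the image URLs
        if '.jpeg' not in url:
            continue
        url = url.split('.jpeg')[0] + '.jpeg'    # keep nothing after .jpeg for high resolution
        subprocess.run(['wget', '-q', '-P', 'images', url])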

License:

  1. The dataset is released under the license in the LICENSE file.
  2. No commercial use.

Citation:

If you use this data, please cite:

@inproceedings{XuewenECCV20Fashion,
  author    = {Xuewen Yang and Heming Zhang and Di Jin and Yingru Liu and Chi-Hao Wu and Jianchao Tan and Dongliang Xie and Jue Wang and Xin Wang},
  title     = {Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards},
  booktitle = {ECCV},
  year      = {2020}
}
Comments
  • About data process

    Sir, thank you very much for publishing the code.

    Recently I downloaded all the items with the first color from FACAD. The json file of the dataset is named "meta_all_129927.json".

    When I checked the downloaded data, I got 126753 items, which correspond to the first-color data in FACAD. There are about 1200 images that lack a link (basically, one product lacks one link) and 3 items that do not have any links. Data processing is very troublesome.

    I used the data of 100000 images; the results are too low, so the data processing must have some problems.

    Results on val (using the train data to evaluate): B4: 0.09, M: 0.07, R: 0.11, C: 0.23

    So I downloaded the data you provide, named 'TEST_IMAGES_5.hdf5, VAL_IMAGES_5.hdf5, TRAIN_IMAGES_5.hdf5'.

    For example, the number of images in TEST_IMAGES_5.hdf5 and TEST_IMAGEPATH_5.json is 99981; however, the other json files such as 'TEST_CAPLENS_5.json', 'TEST_CAPTIONS_5.json' ... have 99946 entries. I only checked the test data.

    Could you check the uploaded data? I use TEST_IMAGEPATH_5.json to find the image id and then get the description, attributes, etc. I do not know whether this way is right.
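
    A quick way to reproduce the count check above (just a sketch; the 'images' dataset name inside the HDF5 file is a guess):

    import json
    import h5py

    # Compare the number of entries in the HDF5 image file with the caption json files
    # to show the 99981 vs. 99946 mismatch described above.
    with h5py.File('TEST_IMAGES_5.hdf5', 'r') as h:
        n_images = h['images'].shape[0]      # 'images' dataset name is an assumption
    with open('TEST_IMAGEPATH_5.json') as f:
        n_paths = len(json.load(f))
    with open('TEST_CAPTIONS_5.json') as f:
        n_captions = len(json.load(f))
    print('images:', n_images, 'paths:', n_paths, 'captions:', n_captions)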

    Today I downloaded the code and found that some files referenced in the yml configs, such as sat.yml, are missing:
    data_folder: /home/xuewyang/Xuewen/Research/data/FACAD/jsons
    model_folder: /home/xuewyang/Xuewen/Research/model/fashion/captioning/SAT2
    checkpoint: /home/xuewyang/Xuewen/Research/model/fashion/captioning/SAT/vanilla/BEST_checkpoint_2.pth.tar

    I sent a few emails but got no reply. I am very, very interested in fashion captioning and hope to get your help.

    opened by tangyuhao2016 25
  • About Experiment Settings and Performances

    Thanks for sharing your dataset. It seems to be a really useful and fantastic work! But I'm running into trouble when I try to replicate some results.

    I used the code in ruotian's repo to try some baselines. I trained the 'att2in' and 'adaatt' models using XE loss on FACAD, but got really bad performance on BLEU, METEOR, ROUGE-L, and CIDEr. Even when I use the training split to evaluate the trained model, the scores are still much lower than reported in the paper, except CIDEr.

    I also find that the training loss can drop to 1.8 after some epochs, while the loss on the val split stops at about 3.1. It seems I've run into overfitting, but I have no idea why, as I think the amount of data is big enough to avoid overfitting. Note that these models all behave well on the COCO dataset, and I think I've preprocessed FACAD into the COCO format.

    The only difference is that in COCO each image is paired with 5 captions, while in FACAD each image is paired with only one caption, and sometimes different images share the same caption. I don't know if this difference causes the terrible performance.

    Do you have any ideas on these problems? Are there any significant details for data preprocessing or training?

    opened by LONGRYUU 22
  • Scores of the 3 released baselines.

    Thanks for your released code. The code is well structured; I replaced the dataloader with my own implementation and it still works well. But I still have some issues.

    I've trained the SAT and BUTD models for about 15 epochs now. They both achieve high scores, but the differences are quite large, especially on CIDEr, which is about 192.6 and 144.6 respectively. Are these results alright? What scores did you get with these models?

    Detailed results are as follows:
    SAT: Bleu_1: 0.495, Bleu_2: 0.348, Bleu_3: 0.267, Bleu_4: 0.219, METEOR: 0.215, ROUGE_L: 0.465, CIDEr: 1.928

    BUTD: Bleu_1: 0.462, Bleu_2: 0.302, Bleu_3: 0.213, Bleu_4: 0.161, METEOR: 0.193, ROUGE_L: 0.432, CIDEr: 1.446

    Besides, I've also trained CNN-C for 4 epochs. I find it quite slow to evaluate, and it achieves really low scores: Bleu_1: 0.158, Bleu_2: 0.057, Bleu_3: 0.020, Bleu_4: 0.009, METEOR: 0.060, ROUGE_L: 0.131, CIDEr: 0.094.

    opened by LONGRYUU 15
  • Problems about the proposed approach in the paper

    I have some questions after reading your paper.

    1. How do you get the attribute vector z? More specifically, how do you transform the image features into the vector z? In the paper, z is obtained from a feed-forward layer; which PyTorch functions did you use to build this layer, linear layers or convolutional layers? There could be several strategies to compress the 3-dimensional image features into a vector.

    2. In Equation 8, the 1/n is placed outside the brackets; is that a typo? Does it mean β * P(1) * √(P(2)) or β * √(P(1) * P(2))?

    opened by LONGRYUU 5
  • Dataset Details Mismatch

    Is the dataset used in the paper different from the preprocessed dataset provided on Google Drive? Or am I missing something?
    Preprocessed data from Google Drive: TRAIN: 888293, VAL: 19915, TEST: 101225

    From paper Section 5.1: It contains 993K images and 130K descriptions, and we split the whole dataset, with approximately 794K image-description pairs for training, 99K for validation, and the remaining 100K for test.

    opened by gourango01 0
  • could you please provide a pre-trained model?

    Hi, very cool work, thanks a lot for making your code public! It would be great if you could share a pre-trained model plus a sample script to create captions for new images. That would be super cool and helpful.

    Would that be possible? Thanks! Z.

    opened by zoharbarzelay 0
  • About file preparation step and Training procedure

    Thank you for your work. It looks interesting and useful, but I ran into some problems when preparing this new dataset and in the training step. Could you explain what I should do, such as:

    1. What files should I download for training the model?
    2. Which library I should install?
    3. etc.

    Could you explain it for me, please? Looking forward to your reply @xuewyang. Thank you.

    opened by donnaphat-ut 0
  • About structure details and attribute learning

    Thank you. I met some problems while reproducing the model.

    1. Regarding "the encoder is a pre-trained CNN, which takes an image as the input and extracts B image features, X={x0, x1,...,xB}": does X mean the feature map (batchsize * 2048 * 14 * 14) output by the last convolutional layer of ResNet-101?

    2. In Figure 3, the average pooling of the feature map (batchsize * 2048) is fed into the feed-forward network. How many layers does the FF consist of, only one layer (2048 * 990) followed by a sigmoid, or more? Is Z taken as the output of the FF before the sigmoid or after the sigmoid?

    3. Is the attribute learning pretrained separately first and then added to fine-tune the caption model, or are attribute learning and the caption model trained together from the beginning?

    4. When we get z, is z concatenated with y (the word embedding) as input to the caption model, or is z concatenated with the output of the image features after the attention module?

    Looking forward to your reply.

    opened by tangyuhao2016 23