WACV 2022 Paper - Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching

Andres

Last update: Dec 17, 2022

Related tags

Computer Vision semantic_adaptive_margin

Overview

Is An Image Worth Five Sentences? A New Look into Semantics for Image-Text Matching

Code for our WACV 2022 accepted paper: https://arxiv.org/pdf/2110.02623.pdf

This project is built on top of CVSE (https://github.com/BruceW91/CVSE) in PyTorch, but it is easy to adapt to other image-text matching models (SCAN, VSRN, SGRAF). For the proposed metric code and evaluation, see https://github.com/furkanbiten/ncs_metric.

Introduction

The task of image-text matching aims to map representations from different modalities into a common joint visual-textual embedding. However, the most widely used datasets for this task, MSCOCO and Flickr30K, are actually image captioning datasets that offer a very limited set of relationships between images and sentences in their ground-truth annotations. This limited ground-truth information forces us to use evaluation metrics based on binary relevance: given a sentence query, we consider only one image as relevant. However, many other relevant images or captions may be present in the dataset. In this work, we propose two metrics that evaluate the degree of semantic relevance of retrieved items, independently of their annotated binary relevance. Additionally, we incorporate a novel strategy that uses an image captioning metric, CIDEr, to define a Semantic Adaptive Margin (SAM) to be optimized in a standard triplet loss. By incorporating our formulation into existing models, a large improvement is obtained in scenarios where available training data is limited. We also demonstrate that the performance on the annotated image-caption pairs is maintained while improving on other non-annotated relevant items when employing the full training set. The code for our new metric can be found at https://github.com/furkanbiten/ncs_metric and the code for the model at https://github.com/andrespmd/semantic_adaptive_margin
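As a rough illustration of the Semantic Adaptive Margin idea, the sketch below computes a hinge-style triplet loss for a single triplet whose margin depends on a caption-metric similarity score. The margin rule used here (the margin shrinks linearly as the negative gets semantically closer to the positive) is an assumption made for illustration and is not taken from the paper. The function name, the `max_margin` parameter, and the requirement that the CIDEr-like score be normalized to [0, 1] are likewise hypothetical.

```python
def adaptive_margin_triplet_loss(sim_pos, sim_neg, semantic_score, max_margin=0.2):
    """Hinge triplet loss for one (anchor, positive, negative) triplet.

    sim_pos        -- similarity between the anchor and its annotated positive
    sim_neg        -- similarity between the anchor and the sampled negative
    semantic_score -- caption-metric similarity (e.g. a normalized CIDEr score)
                      between the positive and the negative, assumed in [0, 1]
    max_margin     -- margin applied to a semantically unrelated negative
    """
    # Clamp the score so that the margin stays within [0, max_margin].
    score = min(max(semantic_score, 0.0), 1.0)

    # Hypothetical margin rule: a negative that is semantically close to the
    # positive receives a smaller margin and is therefore pushed away less.
    margin = max_margin * (1.0 - score)

    # Standard hinge formulation of the triplet loss.
    return max(0.0, margin + sim_neg - sim_pos)


if __name__ == "__main__":
    # Same embedding similarities, different semantic relatedness.
    print(adaptive_margin_triplet_loss(0.5, 0.45, semantic_score=0.0))
    print(adaptive_margin_triplet_loss(0.5, 0.45, semantic_score=1.0))
```

With identical embedding similarities, the unrelated negative (score 0.0) yields a loss of roughly 0.15, while the negative that is semantically equivalent to the positive (score 1.0) yields no loss at all. A fixed margin of 0.2 would penalize both equally.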

Install Environment

Git clone the project.

Create Conda environment:

$ conda env create -f env.yml

Activate the environment:

$ conda activate pytorch12

Download Metric Data

Please download the following compressed file from:

Uncompress the downloaded file under the main project folder. The uncompressed folder name should be "cider".
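A quick way to confirm the layout described above is to check for the folder before training. This is a minimal sketch; the helper name `metric_data_ready` is hypothetical and not part of the repository, and it checks only that a "cider" directory exists, not what it contains.

```python
from pathlib import Path


def metric_data_ready(project_root="."):
    """Return True if a 'cider' directory sits directly under project_root."""
    return (Path(project_root) / "cider").is_dir()


if __name__ == "__main__":
    # Run from the main project folder.
    if metric_data_ready():
        print("cider folder found")
    else:
        print("cider folder missing: uncompress the metric data here")
```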


Comments

A question about sampling

Thanks for your great work.

I have a question about sampling.

In your paper, you use hard negative (HN) and soft negative (SN) sampling to select negative items. "In HN, the negative item in each triplet is selected as the closest to the anchor in a batch." "SN refers to picking the furthest negative item to the anchor within the batch."

The sampling methods described in the paper and those implemented in the code appear to be opposite. Is there something wrong in the code? https://github.com/AndresPMD/semantic_adaptive_margin/blob/1e8bf2f1836498c48df030cb0a967b72b52e8460/model_CVSE.py#L727 https://github.com/AndresPMD/semantic_adaptive_margin/blob/1e8bf2f1836498c48df030cb0a967b72b52e8460/model_CVSE.py#L730
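For reference, the two definitions quoted from the paper can be sketched as follows for a single anchor. This is an illustrative reading of the quoted sentences only, not the repository's implementation; the function name and the plain-list input are hypothetical. It assumes the scores are similarities (higher means closer); if the scores were distances, the roles of `max` and `min` would swap, which is one way the two descriptions could appear reversed.

```python
def pick_negatives(sim_row, pos_index):
    """Return (hard_negative_index, soft_negative_index) for one anchor.

    sim_row   -- similarities between the anchor and every item in the batch
                 (higher means closer to the anchor)
    pos_index -- position of the annotated positive item in sim_row
    """
    # Every batch item other than the annotated positive is a candidate negative.
    candidates = [j for j in range(len(sim_row)) if j != pos_index]
    if not candidates:
        raise ValueError("the batch must contain at least one negative item")

    # Hard negative (HN): the negative closest to the anchor.
    hard = max(candidates, key=lambda j: sim_row[j])
    # Soft negative (SN): the negative furthest from the anchor.
    soft = min(candidates, key=lambda j: sim_row[j])
    return hard, soft


if __name__ == "__main__":
    # Item 0 is the positive; item 1 is the closest negative, item 2 the furthest.
    print(pick_negatives([0.9, 0.7, 0.1, 0.4], pos_index=0))
```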

Opened by Li-Zheng-94
