WSSGG
0 Overview
Our model uses the image's paired caption as weak supervision to learn the entities in the image and the relations among them. At inference time, it generates scene graphs without help from texts. To learn our model, we first allow context information to propagate on the text graph to enrich the entity word embeddings (Sec. 3.1). We found this enrichment provides better localization of the visual objects. Then, we optimize a text-query-guided attention model (Sec. 3.2) to provide the image-level entity prediction and associate the text entities with visual regions best describing them. We use the joint probability to choose boxes associated with both subject and object (Sec. 3.3), then use the top scoring boxes to learn better grounding (Sec. 3.4). Finally, we use an RNN (Sec. 3.5) to capture the vision-language common-sense and refine our predictions.
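To make the box-selection idea in Sec. 3.3 concrete, here is a minimal numpy sketch (not the repo's actual implementation): given per-proposal attention scores for a subject entity and an object entity, pick the box pair with the highest joint probability. The scores below are made up purely for illustration.
import numpy as np

# Attention scores of each proposal box for the subject ("man") and the
# object ("horse") of a relation; the numbers are illustrative only.
subject_probs = np.array([0.1, 0.7, 0.2])
object_probs = np.array([0.6, 0.1, 0.3])

# joint[i, j] = p_subject(box i) * p_object(box j)
joint = np.outer(subject_probs, object_probs)
subj_box, obj_box = np.unravel_index(joint.argmax(), joint.shape)
print(subj_box, obj_box)  # -> 1 0, the top-scoring (subject, object) box pair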
1 Installation
git clone "https://github.com/yekeren/WSSGG.git" && cd "WSSGG"
We use Tensorflow 1.5 and Python 3.6.4. Before continuing, please ensure that at least the correct Python version is installed. requirements.txt lists the Python packages we installed; simply run pip install -r requirements.txt
to install these packages after setting up Python. Next, run protoc protos/*.proto --python_out=.
to compile the required protocol buffer files, which are used for storing configurations.
pip install -r requirements.txt
protoc protos/*.proto --python_out=.
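As an optional sanity check (not part of the original setup steps), the following Python snippet prints the installed versions so you can confirm they roughly match the ones listed above.
# Optional sanity check after installation.
import sys
import tensorflow as tf

print(sys.version_info[:3])  # expect something close to (3, 6, 4)
print(tf.__version__)        # expect a 1.x release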
1.1 Faster-RCNN
Our Faster-RCNN implementation relies on the Tensorflow object detection API. Users can run git clone "https://github.com/tensorflow/models.git" "tensorflow_models" && ln -s "tensorflow_models/research/object_detection"
to set it up. Also, don't forget to use protoc
to compile the protos used by the detection API.
The specific Faster-RCNN model we use is faster_rcnn_inception_resnet_v2_atrous_lowproposals_oidv2, the same one used by VSPNet. More information can be found in the Tensorflow object detection model zoo.
git clone "https://github.com/tensorflow/models.git" "tensorflow_models"
ln -s "tensorflow_models/research/object_detection"
cd tensorflow_models/research/; protoc object_detection/protos/*.proto --python_out=.; cd -
mkdir -p "zoo"
wget -P "zoo" "http://download.tensorflow.org/models/object_detection/faster_rcnn_inception_resnet_v2_atrous_lowproposals_oid_2018_01_28.tar.gz"
tar xzvf zoo/faster_rcnn_inception_resnet_v2_atrous_lowproposals_oid_2018_01_28.tar.gz -C "zoo"
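For reference, below is a rough sketch (with a dummy image; this is not the repo's actual extraction script) of running the downloaded frozen graph to obtain proposal boxes. The tensor names follow the object detection API's standard export convention.
import numpy as np
import tensorflow as tf

MODEL_DIR = "zoo/faster_rcnn_inception_resnet_v2_atrous_lowproposals_oid_2018_01_28"

# Load the frozen detection graph shipped with the zoo model.
graph_def = tf.GraphDef()
with tf.gfile.GFile(MODEL_DIR + "/frozen_inference_graph.pb", "rb") as f:
  graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
  tf.import_graph_def(graph_def, name="")
  with tf.Session(graph=graph) as sess:
    image = np.zeros((1, 600, 600, 3), dtype=np.uint8)  # dummy image batch
    boxes, scores = sess.run(
        ["detection_boxes:0", "detection_scores:0"],
        feed_dict={"image_tensor:0": image})
print(boxes.shape, scores.shape)  # proposal boxes and their confidence scores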
1.2 Language Parser
Though spacy is listed as a dependency in requirements.txt, we still need to run python -m spacy download en
to fetch the English model. Then, we check out the SceneGraphParser tool by running git clone "https://github.com/vacancy/SceneGraphParser.git" && ln -s "SceneGraphParser/sng_parser"
python -m spacy download en
git clone "https://github.com/vacancy/SceneGraphParser.git"
ln -s "SceneGraphParser/sng_parser"
1.3 GloVe Embeddings
We use the pre-trained 300-D GloVe embeddings.
wget -P "zoo" "http://nlp.stanford.edu/data/glove.6B.zip"
unzip "zoo/glove.6B.zip" -d "zoo"
python "dataset-tools/export_glove_words_and_embeddings.py" \
--glove_file "zoo/glove.6B.300d.txt" \
--output_vocabulary_file "zoo/glove_word_tokens.txt" \
--output_vocabulary_word_embedding_file "zoo/glove_word_vectors.npy"
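A small sanity check of the exported files, assuming the exporter writes one token per line aligned with the rows of the .npy embedding matrix:
import numpy as np

with open("zoo/glove_word_tokens.txt") as f:
  tokens = f.read().splitlines()
vectors = np.load("zoo/glove_word_vectors.npy")
print(len(tokens), vectors.shape)  # token count should match the number of 300-D rows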
2 Settings
To avoid the time-consuming Faster-RCNN processes in 2.1 and 2.2, users can directly download the features we provide at the following URLs. The scripts create_vg_settings.sh and create_coco_settings.sh will then check for the existence of the Faster-RCNN features and skip the process if they are present. Please note that in the following table, we assume the directories holding the VG and COCO data to be vg-gt-cap and coco-cap, respectively.
Name | URL | Extract to directory |
---|---|---|
VG Faster-RCNN features | https://storage.googleapis.com/weakly-supervised-scene-graphs-generation/vg_frcnn_proposals.zip | vg-gt-cap/frcnn_proposals/ |
COCO Faster-RCNN features | https://storage.googleapis.com/weakly-supervised-scene-graphs-generation/coco_frcnn_proposals.zip | coco-cap/frcnn_proposals/ |
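For convenience, here is a Python sketch equivalent to wget + unzip; it assumes each archive unpacks directly into the target directory listed in the table. Since the archives are large, plain wget and unzip may be more practical in practice.
import io
import os
import urllib.request
import zipfile

def fetch_and_extract(url, target_dir):
  # Download a zip archive into memory and unpack it under target_dir.
  os.makedirs(target_dir, exist_ok=True)
  with urllib.request.urlopen(url) as resp:
    zipfile.ZipFile(io.BytesIO(resp.read())).extractall(target_dir)

fetch_and_extract(
    "https://storage.googleapis.com/weakly-supervised-scene-graphs-generation/vg_frcnn_proposals.zip",
    "vg-gt-cap/frcnn_proposals/")
fetch_and_extract(
    "https://storage.googleapis.com/weakly-supervised-scene-graphs-generation/coco_frcnn_proposals.zip",
    "coco-cap/frcnn_proposals/")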
2.1 VG-GT-Graph and VG-Cap-Graph
Running sh dataset-tools/create_vg_settings.sh "vg-gt-cap"
will generate the VG-related files under the folder "vg-gt-cap" (for both the VG-GT-Graph and VG-Cap-Graph settings). Basically, it downloads the datasets and launches the following programs under the dataset-tools directory.
Name | Desc. |
---|---|
create_vg_frcnn_proposals.py | Extract VG visual proposals using Faster-RCNN |
create_vg_text_graphs.py | Extract VG text graphs using Language Parser |
create_vg_vocabulary | Gather the VG vocabulary |
create_vg_gt_graph_tf_record.py | Generate TF record files for the VG-GT-Graph setting |
create_vg_cap_graph_tf_record.py | Generate TF record files for the VG-Cap-Graph setting |
2.2 COCO-Cap-Graph
Running sh dataset-tools/create_coco_settings.sh "coco-cap" "vg-gt-cap"
will generate the COCO-related files under the folder "coco-cap" (for the COCO-Cap-Graph setting). Basically, it downloads the datasets and launches the following programs under the dataset-tools directory. Please note that the "vg-gt-cap" directory should already be created, since we need it to get the split information (from either Zareian et al. or Xu et al.).
Name | Desc. |
---|---|
create_coco_frcnn_proposals.py | Extract COCO visual proposals using Faster-RCNN |
create_coco_text_graphs.py | Extract COCO text graphs using Language Parser |
create_coco_vocabulary | Gather the COCO vocabulary |
create_coco_cap_graph_tf_record.py | Generate TF record files for the COCO-Cap-Graph setting |
3 Training and Evaluation
Multi-GPU training (5 GPUs in our case) takes less than 2.5 hours to train a single model, while the single-GPU strategy requires more than 8 hours.
3.1 Multi-GPU training
We use TF distributed training to train the models shown in our paper. For example, the following command creates and trains a model specified by the proto config file configs/GT-Graph-Zareian/base_phr_ite_seq.pbtxt, and saves the trained model to a directory named "logs/base_phr_ite_seq". In train.sh, we create 1 ps, 1 chief, 3 workers, and 1 evaluator. These 6 instances are distributed across 5 GPUs (4 for training and 1 for evaluation).
sh train.sh \
"configs/GT-Graph-Zareian/base_phr_ite_seq.pbtxt" \
"logs/base_phr_ite_seq"
3.2 Single-GPU training
Our model can also be trained using the single-GPU strategy as follows. However, we suggest halving the learning rate or exploring other hyper-parameters for better results.
python "modeling/trainer_main.py" \
--pipeline_proto "configs/GT-Graph-Zareian/base_phr_ite_seq.pbtxt" \
--model_dir ""logs/base_phr_ite_seq""
3.3 Performance on test set
During the training process, an evaluator measures the model's performance on the validation set and saves the best model checkpoint. Finally, we use the following command to evaluate the saved model's performance on the test set. This evaluation process lasts 2-3 hours, depending on the post-processing parameters (e.g., see here). Currently, much of the post-processing is written in pure Python; we plan to optimize it later to better utilize the GPU and reduce the final evaluation time.
python "modeling/trainer_main.py" \
--pipeline_proto "configs/GT-Graph-Zareian/base_phr_ite_seq.pbtxt" \
--model_dir ""logs/base_phr_ite_seq"" \
--job test
3.4 Primary configs and implementations
Taking configs/GT-Graph-Zareian/base_phr_ite_seq.pbtxt as an example, the following configs control the model's behavior.
Name | Desc. | Impl. |
---|---|---|
linguistic_options | Specify the phrasal context modeling, remove the section to disable it. | models/cap2sg_linguistic.py |
grounding_options | Specify the grounding options. | models/cap2sg_grounding.py |
detection_options | Specify the WSOD model; num_iterations controls the iterative process. | models/cap2sg_detection.py |
relation_options | Specify the relation detection modeling. | models/cap2sg_relation.py |
common_sense_options | Specify the sequential context modeling, remove the section to disable it. | models/cap2sg_common_sense.py |
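As a rough orientation, the sections from the table appear in the pbtxt along the lines of the hypothetical outline below; this is not copied from the actual file, so consult the real config for the exact nesting, field names, and values.
# Hypothetical outline only; see the actual pbtxt for real fields and values.
linguistic_options {
  # phrasal context modeling; remove this whole section to disable it
}
grounding_options {
  # grounding options
}
detection_options {
  num_iterations: 2  # illustrative value; controls the iterative WSOD process
}
relation_options {
  # relation detection modeling
}
common_sense_options {
  # sequential context modeling; remove this whole section to disable it
}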
4 Visualization
Please see cap2sg.ipynb.
5 Reference
If you find this project helpful, please cite our CVPR 2021 paper :)
@InProceedings{Ye_2021_CVPR,
author = {Ye, Keren and Kovashka, Adriana},
title = {Linguistic Structures as Weak Supervision for Visual Scene Graph Generation},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2021}
}
Also, please take a look at our earlier work from ICCV 2019.
@InProceedings{Ye_2019_ICCV,
author = {Ye, Keren and Zhang, Mingda and Kovashka, Adriana and Li, Wei and Qin, Danfeng and Berent, Jesse},
title = {Cap2Det: Learning to Amplify Weak Caption Supervision for Object Detection},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2019}
}