OpenViDial
This repo contains download instructions for the OpenViDial dataset introduced in *OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts*, along with the code to reproduce the results in the paper (see the Models section below).
Introduction
When humans converse, what a speaker will say next depends significantly on what they see. OpenViDial is a large-scale multi-modal dialogue dataset built for this purpose. The dialogue turns and visual contexts are extracted from movies and TV series, where each dialogue turn is paired with the corresponding visual context in which it takes place. OpenViDial contains a total of 1.1 million dialogue turns, and thus 1.1 million visual contexts stored as images.
The following are two short conversations where visual contexts are crucial.
Detailed statistics for OpenViDial
Attribute | Value |
---|---|
Number of turns | 1.1M |
Number of images | 1.1M |
Vocab size before BPE | 70K |
Vocab size after BPE | 30K |
Average number of turns per episode | 14 |
Average number of words per turn | 7.6 |
Download the Dataset
The main folder origin_dir contains the training/valid/test sets, each of which consists of the following files:
├── origin_dir
└── train.dialogue.jsonl // each line is an episode of dialogue, given as a list of turn IDs
└── train.origin.txt // each line is a dialogue text utterance, with its ID being the line number (starting with 0)
└── train_images // contains the images (visual contexts) in which the text utterances take place, with the ID being the image filename (0, 1, 2, etc.)
    └── 0.jpg
    └── 1.jpg
    └── ...
└── valid.* (i.e., valid.dialogue.jsonl, valid.origin.txt, valid_images)
└── test.* (i.e., test.dialogue.jsonl, test.origin.txt, test_images)
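To make the layout concrete, here is a minimal sketch of how one might reconstruct a single episode from these files. It only assumes the layout described above (IDs index lines of *.origin.txt and image filenames); the function name and episode index are arbitrary.

```python
import json

def load_episode(prefix="origin_dir/train", episode_idx=0):
    """Reconstruct one episode as a list of (utterance, image_path) pairs.

    Assumes the layout above: each line of *.dialogue.jsonl is a JSON list of
    turn IDs, and turn ID i corresponds to line i of *.origin.txt and to
    train_images/i.jpg.
    """
    # Read all utterances; the line number is the turn ID (starting with 0).
    with open(f"{prefix}.origin.txt", encoding="utf-8") as f:
        utterances = [line.rstrip("\n") for line in f]

    # Each line of the .jsonl file is one episode, i.e. a list of turn IDs.
    with open(f"{prefix}.dialogue.jsonl", encoding="utf-8") as f:
        episodes = [json.loads(line) for line in f]

    turn_ids = episodes[episode_idx]
    return [(utterances[i], f"{prefix}_images/{i}.jpg") for i in turn_ids]

# Example: print the first episode of the training set.
for text, image in load_episode():
    print(image, text)
```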
If you'd like to take a glance at a sample of the dataset instead of downloading the full dataset, we provide a data sample here.
Data download:
- Download [train|valid|test].origin.txt and [train|valid|test].dialogue.jsonl here.
- Download test_images (~20G) here.
- Download valid_images (~20G) here.
- Download train_images: since train_images is too large (~170G), we split it into 11 zip files (each about 17G). Download the separate zip_train files here, then download and run cat.sh here to merge all the pieces into the same directory.
- Move all files to origin_dir.
Models
We proposed three models for this dataset. Please refer to the paper for details:
- Model #1 - NoVisual: uses only dialogue texts, without visual information
- Model #2 - CoarseVisual: uses texts and a ResNet50 pretrained on ImageNet to compute a 1000-d feature for each image (a rough extraction sketch follows below)
- Model #3 - FineVisual: uses texts and a Faster R-CNN pretrained on Visual Genome to compute 2048-d features for each of K detected objects per image
Faster R-CNN is an object detection framework. A detection sample and the attention over objects during text decoding are shown below.
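For reference, a 1000-d ImageNet feature of the kind used by CoarseVisual can be obtained roughly as follows with torchvision. This is only an illustrative sketch (the exact preprocessing and layer used in the paper may differ), and the image path is a placeholder.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# ResNet-50 pretrained on ImageNet; its final layer outputs 1000 values.
model = models.resnet50(pretrained=True).eval()

img = Image.open("origin_dir/train_images/0.jpg").convert("RGB")
batch = preprocess(img).unsqueeze(0)   # [1, 3, 224, 224]

with torch.no_grad():
    feature = model(batch)             # [1, 1000]

print(feature.shape)  # torch.Size([1, 1000])
```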
Requirements
- python >= 3.6
- Install dependencies: pip install -r requirements.txt
Preprocess directory structure
preprocessed_data_dir is a directory that contains all the preprocessed files (text, image feature mmaps, offsets, etc.) generated from origin_dir; these files are used for training the models. The directory structure is shown below.
Note: every train* file or directory has a 'valid' and a 'test' counterpart; we omit them below for simplicity.
├── preprocessed_data_dir
└── train.features.mmap // numpy mmap array of shape [num_sents, 1000]; each row is a 1000-d ResNet-50 feature
└── train.objects.mmap // numpy mmap array of shape [num_sents, 20, 2048]; Faster R-CNN object features, each row holds 20 object features of 2048-d each
└── train.objects_mask.mmap // numpy mmap array of shape [num_sents, 20]; Faster R-CNN object masks, 1 for a valid object, 0 for padding
└── train.offsets.npy // numpy array of shape [num_episodes]; each item is the sentence offset of one episode
└── train.sent_num.npy // numpy array of shape [num_episodes]; each item is the number of sentences in one episode
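These mmap files can be opened with numpy.memmap. The snippet below is a sketch assuming float32 storage (check the preprocessing scripts for the actual dtype); num_sents is a placeholder for the number of utterances (e.g. the line count of train.origin.txt), and the offsets/sent_num interpretation follows the layout above.

```python
import numpy as np

num_sents = 1000000  # placeholder: number of lines in train.origin.txt

# ResNet-50 features, one 1000-d row per utterance (dtype assumed float32).
features = np.memmap("preprocessed_data_dir/train.features.mmap",
                     dtype=np.float32, mode="r", shape=(num_sents, 1000))

# Faster R-CNN object features and masks, 20 objects per utterance.
objects = np.memmap("preprocessed_data_dir/train.objects.mmap",
                    dtype=np.float32, mode="r", shape=(num_sents, 20, 2048))
objects_mask = np.memmap("preprocessed_data_dir/train.objects_mask.mmap",
                         dtype=np.float32, mode="r", shape=(num_sents, 20))

# Episode boundaries: offsets[i] is the first sentence of episode i,
# sent_num[i] is how many sentences episode i contains.
offsets = np.load("preprocessed_data_dir/train.offsets.npy")
sent_num = np.load("preprocessed_data_dir/train.sent_num.npy")

first_episode_feats = features[offsets[0]: offsets[0] + sent_num[0]]
print(first_episode_feats.shape)
```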
Preprocess text data
We use the Moses tokenizer to tokenize the texts and generate the offsets arrays: bash ./scripts/preprocess_video_data.sh
followed by byte-pair encoding and fairseq-preprocess binarization: bash ./scripts/preprocess_text_data.sh
Note: you need to change DATA_DIR, ORIGIN_DIR, and OUTPUT_DIR to your own paths.
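If you are curious what the Moses tokenization step does to an utterance, a rough Python equivalent using the sacremoses package looks like this. The real pipeline is driven by the shell scripts above, with BPE and fairseq binarization applied afterwards; this snippet is purely illustrative.

```python
from sacremoses import MosesTokenizer

# Illustration of Moses-style tokenization applied to one utterance.
mt = MosesTokenizer(lang="en")

line = "I can't believe you're here!"
tokens = mt.tokenize(line, return_str=True, escape=False)
print(tokens)  # space-separated tokens, e.g. punctuation and clitics split off
```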
Prepare pre-computed CNN features and Faster-RCNN features
Download CNN-pooling features (used for Model #2 - CoarseVisual)
Preprocessed ResNet50 features (*.features.mmap, ~4G) can be downloaded from here; move them under preprocessed_data_dir/.
Download Faster R-CNN features (used for Model #3 - FineVisual)
Preprocessed Faster R-CNN object features (*objects.mmap, *objects_mask.mmap, ~160G) can be downloaded from here; then move them under preprocessed_data_dir/.
Since the file train.objects.mmap is too large (100G+), we split it into many smaller pieces named train.objects.mmap.split*, and you need an extra step to merge those files back together: cat train.objects.mmap.split* > train.objects.mmap
(Optional) Extract features on your own
If you want to extract features on your own, or you'd like to know the details of extracting visual features, see video_dialogue_model/extract_features/extract_features.md
Train and Evaluate Model #1 - NoVisual
bash scripts/reproduce_baselines/text_only.sh will train and evaluate NoVisual. Remember to change MODEL_DIR and DATA_DIR for your setup.
Train and Evaluate Model #2 - CoarseVisual
bash scripts/reproduce_baselines/text_and_img_feature.sh will train and evaluate CoarseVisual. Remember to change MODEL_DIR and DATA_DIR for your setup.
Train and Evaluate Model #3 - FineVisual
bash scripts/reproduce_baselines/text_and_img_objects.sh will train and evaluate FineVisual. Remember to change MODEL_DIR and DATA_DIR for your setup.
Other Statistics
- Get length/diversity/stopword% statistics of system output: train/stats.py (a sketch of how such statistics can be computed is shown below)
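For reference, the diversity (Dis-n) and stopword statistics reported below are typically computed along these lines. This is an illustrative sketch, not necessarily identical to train/stats.py; it assumes whitespace-tokenized system outputs and NLTK's English stopword list.

```python
from collections import Counter
from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

def distinct_n(sentences, n):
    """Dis-n: distinct n-grams over total n-grams.

    Note: some papers normalize by total tokens instead of total n-grams.
    """
    ngrams = Counter()
    for sent in sentences:
        tokens = sent.split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

def stopword_ratio(sentences, stop_set=None):
    """Fraction of generated tokens that are stopwords."""
    stop_set = stop_set or set(stopwords.words("english"))
    tokens = [tok for sent in sentences for tok in sent.lower().split()]
    return sum(tok in stop_set for tok in tokens) / len(tokens) if tokens else 0.0

# Example with two placeholder system outputs (one generated turn per line).
outputs = ["i do n't know .", "where are we going ?"]
for n in range(1, 5):
    print(f"Dis-{n}: {distinct_n(outputs, n):.4f}")
print(f"Stopword%: {stopword_ratio(outputs):.1%}")
```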
Model benchmark
Model | BLEU-1 | BLEU-2 | BLEU-4 | Stopword% | Dis-1 | Dis-2 | Dis-3 | Dis-4 |
---|---|---|---|---|---|---|---|---|
1-NV | 14.01 | 3.98 | 1.07 | 58.1% | 0.0091 | 0.0355 | 0.0682 | 0.1018 |
2-CV | 14.58 | 4.35 | 1.14 | 54.2% | 0.0108 | 0.0448 | 0.0915 | 0.1465 |
3-FV | 15.61 | 4.71 | 1.22 | 52.9% | 0.0118 | 0.0502 | 0.1082 | 0.1778 |