Code, Models and Datasets for OpenViDial Dataset

Overview

OpenViDial

This repo contains download instructions for the OpenViDial dataset introduced in "OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts", along with the code to reproduce the results in the paper (see Models below).

Introduction

When humans converse, what a speaker says next depends significantly on what they see. OpenViDial is a large-scale multi-modal dialogue dataset built to study this dependence. The dialogue turns and visual contexts are extracted from movies and TV series, and each dialogue turn is paired with the visual context in which it takes place. OpenViDial contains a total of 1.1 million dialogue turns, and thus 1.1 million visual contexts stored as images.

The following are two short conversations where visual contexts are crucial.

Detailed statistics for OpenViDial

Attribute                        Value
Number of turns                  1.1M
Number of images                 1.1M
Vocab size before BPE            70K
Vocab size after BPE             30K
Average length of each episode   14
Average length of each turn      7.6

Download the Dataset

The main folder origin_dir contains the training/valid/test sets, each of which is made up of the following files:

├──origin_dir
      └── train.dialogue.jsonl // each line is an episode of dialogue, which is a list of IDs.
      └── train.origin.txt // each line corresponds to a dialogue text utterance, with the ID being its line number (starting with 0).
      └── train_images // containing images (visual contexts) in which the text utterances take place, with the ID being the image filename (0,1,2, etc)
            └── 0.jpg
            └── 1.jpg
            └── ...
      └── valid.* (i.e., valid.dialogue.jsonl, valid.origin.txt, valid_images)
      └── test.*  (i.e., test.dialogue.jsonl, test.origin.txt, test_images)
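
The layout above can be read with a few lines of Python. The following is a minimal sketch, not code from this repo: the helper name load_episodes is hypothetical, and it assumes each line of *.dialogue.jsonl is a JSON list of integer IDs, as the comments in the tree describe.

```python
import json
from pathlib import Path


def load_episodes(origin_dir, split="train"):
    """Yield each episode as a list of (utterance, image_path) pairs."""
    origin_dir = Path(origin_dir)
    # Line i of <split>.origin.txt is the utterance with ID i (0-based).
    with open(origin_dir / f"{split}.origin.txt", encoding="utf-8") as f:
        utterances = [line.rstrip("\n") for line in f]
    with open(origin_dir / f"{split}.dialogue.jsonl", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            ids = json.loads(line)  # assumed: a JSON list of integer IDs
            # The image for utterance i is <split>_images/<i>.jpg.
            yield [
                (utterances[i], origin_dir / f"{split}_images" / f"{i}.jpg")
                for i in ids
            ]
```

For example, next(load_episodes("origin_dir")) would return the first training episode as a list of utterance and image-path pairs.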

If you'd like to glance at a sample of the dataset instead of downloading the full dataset, we provide a data sample here.

Data download:

  1. Download [train|valid|test].origin.txt and [train|valid|test].dialogue.jsonl here
  2. Download test_images (~ 20G) here
  3. Download valid_images (~ 20G) here
  4. Download train_images: since train_images is too big (~ 170G), we split it into 11 zip files (each of which is 17G). Download the separate zip_train files here. Then download and run cat.sh here to include all files in the same directory.
  5. Move all files to origin_dir.

Models

We propose three models for this dataset. Please refer to the paper for details:

  • Model #1 - NoVisual: uses only dialogue texts, without visual information
  • Model #2 - CoarseVisual: uses texts and a ResNet50 pretrained on ImageNet to compute a 1000-d feature from each picture
  • Model #3 - FineVisual: uses texts and a Faster R-CNN pretrained on Visual Genome to compute K object features (2048-d each) from each picture

Faster R-CNN is an object detection framework. A detection sample and the attention over objects during text decoding are shown below.

Requirements

  • python >= 3.6
  • pip install -r requirements.txt

Preprocess directory structure

preprocessed_data_dir is a directory that contains all the preprocessed files (text, image feature mmap, offsets, etc.) generated from origin_dir; these files are used to train the models. The directory structure is shown below.

Note: every train* file or directory has a 'valid' and a 'test' counterpart; we omit them below for simplicity.

├──preprocessed_data_dir
      └── train.features.mmap  // numpy mmap array file of shape [num_sents, 1000], each row is a 1000-d ResNet-50 feature
      └── train.objects.mmap  // numpy mmap array file of shape [num_sents, 20, 2048], Faster R-CNN object feature file, each row contains the features of 20 objects, each of which is 2048-d
      └── train.objects_mask.mmap  // numpy mmap array file of shape [num_sents, 20], Faster R-CNN mask file, each row contains 20 object masks, 1 for valid, 0 for masked
      └── train.offsets.npy  // numpy array file of shape [num_episodes], each item is the offset of one episode
      └── train.sent_num.npy // numpy array file of shape [num_episodes], each item is the number of sentences in one episode
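
As a rough illustration of how the files listed above fit together, the sketch below memory-maps a feature file and slices out the rows of one episode. It is not taken from this repo: the function names are hypothetical, the float32 dtype is an assumption that must match whatever the preprocessing wrote, and it assumes each offset is the index of the first sentence of its episode.

```python
import numpy as np


def open_features(path, feature_dim=1000, dtype=np.float32):
    """Memory-map a flat feature file as a [num_sents, feature_dim] array."""
    # dtype is an assumption; it must match what the preprocessing wrote.
    flat = np.memmap(path, dtype=dtype, mode="r")
    # reshape raises ValueError if the file is not a whole number of rows.
    return flat.reshape(-1, feature_dim)


def episode_features(features, offsets, sent_num, episode_idx):
    """Return the feature rows of one episode.

    Assumes offsets[i] is the index of the first sentence of episode i
    and sent_num[i] is the number of sentences in that episode.
    """
    start = int(offsets[episode_idx])
    return features[start:start + int(sent_num[episode_idx])]
```

With offsets and sent_num loaded via np.load from train.offsets.npy and train.sent_num.npy, episode_features(features, offsets, sent_num, 0) would give the features of the first episode without reading the whole file into memory.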

Preprocess text data

We use the Moses tokenizer to tokenize texts and generate the offsets arrays: bash ./scripts/preprocess_video_data.sh. This is followed by byte-pair encoding and fairseq-preprocess binarization: bash ./scripts/preprocess_text_data.sh

Note: You need to change DATA_DIR, ORIGIN_DIR, and OUTPUT_DIR to your own paths.

Prepare pre-computed CNN features and Faster-RCNN features

Download CNN-pooling features (used for Model #2 - CoarseVisual)

Preprocessed ResNet50 features (*.features.mmap) (~4G) can be downloaded from here. Move them under preprocessed_data_dir/

Download Faster R-CNN features (used for Model #3 - FineVisual)

Preprocessed Faster R-CNN object features (*objects.mmap, *objects_mask.mmap) (~160G) can be downloaded from here. Move them under preprocessed_data_dir/

Since the file train.objects.mmap is too large (100G+), we split it into many small pieces named train.objects.mmap.split*. After downloading, you need one more step to merge these pieces into a single file: cat train.objects.mmap.split* > train.objects.mmap
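
After merging, it can be worth checking that the file size matches the documented shape [num_sents, 20, 2048] before training, since a truncated or over-long file is easy to produce by accident. The sketch below is only an illustration under assumptions: the function names are hypothetical, float32 storage is assumed, and num_sents has to come from your own data (for instance the number of lines in train.origin.txt).

```python
import os

import numpy as np


def expected_mmap_bytes(num_sents, max_objects=20, feature_dim=2048,
                        dtype=np.float32):
    """Size in bytes of a [num_sents, max_objects, feature_dim] array."""
    # dtype is an assumption; adjust it if the features use another type.
    return num_sents * max_objects * feature_dim * np.dtype(dtype).itemsize


def check_merged_file(path, num_sents, **kwargs):
    """Return (sizes_match, actual_bytes, expected_bytes) for a merged file."""
    actual = os.path.getsize(path)
    expected = expected_mmap_bytes(num_sents, **kwargs)
    return actual == expected, actual, expected
```

If the sizes differ, the merge most likely picked up extra files or missed a piece, and the pieces should be concatenated again.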

(Optional) Extract features on your own

If you want to extract features yourself, or would like to know the details of how the visual features are extracted, see video_dialogue_model/extract_features/extract_features.md

Train and Evaluate Model #1 - NoVisual

bash scripts/reproduce_baselines/text_only.sh will train and evaluate NoVisual. Remember to change MODEL_DIR and DATA_DIR for your setup.

Train and Evaluate Model #2 - CoarseVisual

bash scripts/reproduce_baselines/text_and_img_feature.sh will train and evaluate CoarseVisual. Remember to change MODEL_DIR and DATA_DIR for your setup.

Train and Evaluate Model #3 - FineVisual

bash scripts/reproduce_baselines/text_and_img_objects.sh will train and evaluate FineVisual. Remember to change MODEL_DIR and DATA_DIR for your setup.

Other Statistics

  • get length/diversity/stopwords% statistics of system output: train/stats.py
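
For readers unfamiliar with the diversity statistic, the sketch below shows one common definition of distinct-n: the number of unique n-grams divided by the total number of n-grams in the system output. It is an illustration only and is not claimed to match what train/stats.py computes; the function name and the whitespace tokenization are assumptions.

```python
from collections import Counter


def distinct_n(sentences, n):
    """Ratio of unique n-grams to total n-grams over whitespace tokens."""
    ngrams = Counter()
    for sent in sentences:
        tokens = sent.split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    # An empty output has no n-grams; report 0.0 rather than dividing by zero.
    return len(ngrams) / total if total else 0.0
```

Calling distinct_n(lines, 1) through distinct_n(lines, 4) on the lines of a system output file gives four diversity values under this definition.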

Model benchmark

Model  BLEU-1  BLEU-2  BLEU-4  Stopword%  Dis-1   Dis-2   Dis-3   Dis-4
1-NV   14.01   3.98    1.07    58.1%      0.0091  0.0355  0.0682  0.1018
2-CV   14.58   4.35    1.14    54.2%      0.0108  0.0448  0.0915  0.1465
3-FV   15.61   4.71    1.22    52.9%      0.0118  0.0502  0.1082  0.1778