System-oriented IR evaluations are limited to rather abstract understandings of real user behavior

Overview

Validating Simulations of User Query Variants

This repository contains the scripts of the experiments and evaluations, simulated queries, as well as the figures of:

Timo Breuer, Norbert Fuhr, and Philipp Schaer. 2022. Validating Simulations of User Query Variants. In Proceedings of the 44th European Conference on IR Research, ECIR 2022.

System-oriented IR evaluations are limited to rather abstract understandings of real user behavior. As a solution, simulating user interactions provides a cost-efficient way to support system-oriented experiments with more realistic directives when no interaction logs are available. While there are several user models for simulated clicks or result list interactions, very few attempts have been made towards query simulations, and it has not been investigated if these can reproduce properties of real queries. In this work, we validate simulated user query variants with the help of TREC test collections in reference to real user queries that were made for the corresponding topics. Besides, we introduce a simple yet effective method that gives better reproductions of real queries than the established methods. Our evaluation framework validates the simulations regarding the retrieval performance, reproducibility of topic score distributions, shared task utility, effort and effect, and query term similarity when compared with real user query variants. While the retrieval effectiveness and statistical properties of the topic score distributions as well as economic aspects are close to that of real queries, it is still challenging to simulate exact term matches and later query reformulations.

Directory overview

Directory Description
config/ Contains configuration files for the query simulations, experiments, and evaluations.
data/ Contains (intermediate) output data of the simulations and experiments as well as the figures of the paper.
eval/ Contains scripts of the experiments and evaluations.
sim/ Contains scripts of the query simulations.

Setup

  1. Install Anserini and index Core17 (The New York Times Annotated Corpus) according to the regression guide:
anserini/target/appassembler/bin/IndexCollection \
    -collection NewYorkTimesCollection \
    -input /path/to/core17/ \
    -index anserini/indexes/lucene-index.core17 \
    -generator DefaultLuceneDocumentGenerator \
    -threads 4 \
    -storePositions \
    -storeDocvectors \
    -storeRaw \
    -storeContents \
    > anserini/logs/log.core17 &
  1. Install the required Python packages:
pip install -r requirements.txt

Query simulation

In order to prepare the language models and simulate the queries, the scripts have to executed in the order shown in the following table. All of the outputs can be found in the data/ directory. For the sake of better code readability the names of the query reformulation strategies have been mapped: S1S1; S2S2; S2'S3; S3S4; S3'S5; S4S6; S4'S7; S4''S8. The names of the scripts and output files comply with this name mapping.

Script Description Output files
sim/make_background.py Make the background language model form all index terms of Core17. The background model is required for Controlled Query Generation (CQG) by Jordan et al. data/lm/background.csv
sim/make_cqg.py Make the CQG language models with different parameters of lambda from 0.0 to 1.0. data/lm/cqg.json
sim/simulate_queries_s12345.py Simulate TTS and KIS queries with strategies S1 to S3' data/queries/s12345.csv
sim/simulate_queries_s678.py Simulate TTS and KIS queries with strategies S4 to S4'' data/queries/s678.csv

Experimental evaluation and results

In order to reproduce the experiments of the study, the scripts have to executed in the order shown in the following table.

Script Description Output files Reproduction of ...
eval/arp.py, eval/arp_first.py, eval/arp_max.py Retrieval performance: Evaluate the Average Retrieval Performance (ARP). data/experimental_results/arp.csv, data/experimental_results/arp_first.csv, data/experimental_results/arp_max.csv Tab. A.1
eval/rmse_s12345.py, eval/rmse_s678.py Retrieval performance: Evaluate the Root-Mean-Square-Error (RMSE). data/experimental_results/rmse_map.csv, data/experimental_results/rmse_ndcg.csv, data/experimental_results/rmse_p1000.csv, data/experimental_results/rmse_uqv_vs_s12345_kis_ndcg.csv, data/experimental_results/rmse_uqv_vs_s12345_tts_ndcg.csv, data/figures/rmse_map.pdf, data/figures/rmse_ndcg.pdf, data/figures/rmse_p1000.pdf, data/figures/rmse_uqv_vs_s12345_kis_ndcg.pdf, data/figures/rmse_uqv_vs_s12345_tts_ndcg.pdf Fig. A.1, Fig. 1
eval/t-test.py Retrieval performance: Evaluate the p-values of paired t-tests. data/experimental_results/ttest.csv, data/figures/ttest.pdf Fig. A.2
eval/system_orderings.py Shared task utility: Evaluate Kendall's tau between relative system orderings. data/experimental_results/system_orderings.csv, data/figures/system_orderings.pdf Fig. 2 (left)
eval/sdcg.py Effort and effect: Evaluate the Session Discounted Cumulative Gain (sDCG). data/experimental_results/sdcg_3queries.csv, data/experimental_results/sdcg_5queries.csv, data/experimental_results/sdcg_10queries.csv, data/figures/sdcg_3queries.pdf, data/figures/sdcg_5queries.pdf, data/figures/sdcg_10queries.pdf Fig. 3 (top)
eval/economic.py Effort and effect: Evaluate tradeoffs between number of queries and browsing depth by isoquants. data/experimental_results/economic0.3.csv, data/experimental_results/economic0.4.csv, data/experimental_results/economic0.5.csv, data/figures/economic0.3.pdf, data/figures/economic0.4.pdf, data/figures/economic0.5.pdf Fig. 3 (bottom)
eval/jaccard_similarity.py Query term similarity: Evaluate query term similarities. data/experimental_results/jacc.csv, data/figures/jacc.pdf Fig. 2 (right)
You might also like...
PyTorch code for EMNLP 2021 paper: Don't be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System
PyTorch code for EMNLP 2021 paper: Don't be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System

Don’t be Contradicted with Anything!CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System This repository contains the PyTorch im

PyTorch code for EMNLP 2021 paper: Don't be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System
PyTorch code for EMNLP 2021 paper: Don't be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System

PyTorch code for EMNLP 2021 paper: Don't be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System

Code for Transformers Solve Limited Receptive Field for Monocular Depth Prediction

Official PyTorch code for Transformers Solve Limited Receptive Field for Monocular Depth Prediction. Guanglei Yang, Hao Tang, Mingli Ding, Nicu Sebe,

Regularizing Generative Adversarial Networks under Limited Data (CVPR 2021)
Regularizing Generative Adversarial Networks under Limited Data (CVPR 2021)

Regularizing Generative Adversarial Networks under Limited Data [Project Page][Paper] Implementation for our GAN regularization method. The proposed r

Code and data for ACL2021 paper Cross-Lingual Abstractive Summarization with Limited Parallel Resources.

Multi-Task Framework for Cross-Lingual Abstractive Summarization (MCLAS) The code for ACL2021 paper Cross-Lingual Abstractive Summarization with Limit

The source codes for ACL 2021 paper 'BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data'
The source codes for ACL 2021 paper 'BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data'

BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data This repository provides the implementation details for

Run Effective Large Batch Contrastive Learning on Limited Memory GPU

Gradient Cache Gradient Cache is a simple technique for unlimitedly scaling contrastive learning batch far beyond GPU memory constraint. This means tr

Code and data for ACL2021 paper Cross-Lingual Abstractive Summarization with Limited Parallel Resources.

Multi-Task Framework for Cross-Lingual Abstractive Summarization (MCLAS) The code for ACL2021 paper Cross-Lingual Abstractive Summarization with Limit

[NeurIPS 2021] Deceive D: Adaptive Pseudo Augmentation for GAN Training with Limited Data
[NeurIPS 2021] Deceive D: Adaptive Pseudo Augmentation for GAN Training with Limited Data

Deceive D: Adaptive Pseudo Augmentation for GAN Training with Limited Data (NeurIPS 2021) This repository provides the official PyTorch implementation

Owner
IR Group at Technische Hochschule Köln
IR Group at Technische Hochschule Köln
Just playing with getting CLIP Guided Diffusion running locally, rather than having to use colab.

CLIP-Guided-Diffusion Just playing with getting CLIP Guided Diffusion running locally, rather than having to use colab. Original colab notebooks by Ka

Nerdy Rodent 336 Dec 9, 2022
Point Cloud Denoising input segmentation output raw point-cloud valid/clear fog rain de-noised Abstract Lidar sensors are frequently used in environme

Point Cloud Denoising input segmentation output raw point-cloud valid/clear fog rain de-noised Abstract Lidar sensors are frequently used in environme

null 75 Nov 24, 2022
Fine-grained Control of Image Caption Generation with Abstract Scene Graphs

Faster R-CNN pretrained on VisualGenome This repository modifies maskrcnn-benchmark for object detection and attribute prediction on VisualGenome data

Shizhe Chen 7 Apr 20, 2021
Edge-oriented Convolution Block for Real-time Super Resolution on Mobile Devices, ACM Multimedia 2021

Codes for ECBSR Edge-oriented Convolution Block for Real-time Super Resolution on Mobile Devices Xindong Zhang, Hui Zeng, Lei Zhang ACM Multimedia 202

xindong zhang 236 Dec 26, 2022
Code for "ShineOn: Illuminating Design Choices for Practical Video-based Virtual Clothing Try-on", accepted at WACV 2021 Generation of Human Behavior Workshop.

ShineOn: Illuminating Design Choices for Practical Video-based Virtual Clothing Try-on [ Paper ] [ Project Page ] This repository contains the code fo

Andrew Jong 97 Dec 13, 2022
[CVPR2021] UAV-Human: A Large Benchmark for Human Behavior Understanding with Unmanned Aerial Vehicles

UAV-Human Official repository for CVPR2021: UAV-Human: A Large Benchmark for Human Behavior Understanding with Unmanned Aerial Vehicle Paper arXiv Res

null 129 Jan 4, 2023
BABEL: Bodies, Action and Behavior with English Labels [CVPR 2021]

BABEL is a large dataset with language labels describing the actions being performed in mocap sequences. BABEL labels about 43 hours of mocap sequences from AMASS [1] with action labels.

null 113 Dec 28, 2022
Our CIKM21 Paper "Incorporating Query Reformulating Behavior into Web Search Evaluation"

Reformulation-Aware-Metrics Introduction This codebase contains source-code of the Python-based implementation of our CIKM 2021 paper. Chen, Jia, et a

xuanyuan14 5 Mar 5, 2022
1st ranked 'driver careless behavior detection' for AI Online Competition 2021, hosted by MSIT Korea.

2021AICompetition-03 본 repo 는 mAy-I Inc. 팀으로 참가한 2021 인공지능 온라인 경진대회 중 [이미지] 운전 사고 예방을 위한 운전자 부주의 행동 검출 모델] 태스크 수행을 위한 레포지토리입니다. mAy-I 는 과학기술정보통신부가 주최하

Junhyuk Park 9 Dec 1, 2022
This was initially the repo for the project of PSYC626@USC of Asaf Mazar, Millad Kassaie and Georgios Chochlakis named "Powered by the Will? Exploring Lay Theories of Behavior Change through Social Media"

Subreddit Analysis This repo includes tools for Subreddit analysis, originally developed for our class project of PSYC 626 in USC, titled "Powered by

Georgios Chochlakis 1 Dec 17, 2021