# Validating Simulations of User Query Variants
This repository contains the scripts of the experiments and evaluations, the simulated queries, and the figures of:
Timo Breuer, Norbert Fuhr, and Philipp Schaer. 2022. Validating Simulations of User Query Variants. In Proceedings of the 44th European Conference on IR Research, ECIR 2022.
System-oriented IR evaluations are limited to rather abstract understandings of real user behavior. As a solution, simulating user interactions provides a cost-efficient way to support system-oriented experiments with more realistic directives when no interaction logs are available. While there are several user models for simulated clicks or result list interactions, very few attempts have been made towards query simulations, and it has not been investigated whether these can reproduce properties of real queries. In this work, we validate simulated user query variants with the help of TREC test collections in reference to real user queries that were made for the corresponding topics. In addition, we introduce a simple yet effective method that gives better reproductions of real queries than the established methods. Our evaluation framework validates the simulations with regard to retrieval performance, reproducibility of topic score distributions, shared task utility, effort and effect, and query term similarity when compared with real user query variants. While the retrieval effectiveness and statistical properties of the topic score distributions as well as economic aspects are close to those of real queries, it is still challenging to simulate exact term matches and later query reformulations.
## Directory overview

Directory | Description |
---|---|
`config/` | Contains configuration files for the query simulations, experiments, and evaluations. |
`data/` | Contains (intermediate) output data of the simulations and experiments as well as the figures of the paper. |
`eval/` | Contains scripts of the experiments and evaluations. |
`sim/` | Contains scripts of the query simulations. |
## Setup

- Install Anserini and index Core17 (The New York Times Annotated Corpus) according to the regression guide:

  ```
  anserini/target/appassembler/bin/IndexCollection \
    -collection NewYorkTimesCollection \
    -input /path/to/core17/ \
    -index anserini/indexes/lucene-index.core17 \
    -generator DefaultLuceneDocumentGenerator \
    -threads 4 \
    -storePositions \
    -storeDocvectors \
    -storeRaw \
    -storeContents \
    > anserini/logs/log.core17 &
  ```

- Install the required Python packages:

  ```
  pip install -r requirements.txt
  ```
## Query simulation

In order to prepare the language models and simulate the queries, the scripts have to be executed in the order shown in the following table. All of the outputs can be found in the `data/` directory. For the sake of better code readability, the names of the query reformulation strategies have been mapped as follows: `S1` → `S1`; `S2` → `S2`; `S2'` → `S3`; `S3` → `S4`; `S3'` → `S5`; `S4` → `S6`; `S4'` → `S7`; `S4''` → `S8`. The names of the scripts and output files comply with this name mapping.
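For convenience, the mapping can also be written out as a small Python dictionary (paper names → script/output names). This is purely illustrative and not part of the repository; the dictionary name is hypothetical.

```python
# Strategy names used in the paper -> names used in the scripts and output files.
# Illustrative only; this dictionary is not part of the repository.
PAPER_TO_SCRIPT_NAME = {
    "S1": "S1",
    "S2": "S2",
    "S2'": "S3",
    "S3": "S4",
    "S3'": "S5",
    "S4": "S6",
    "S4'": "S7",
    "S4''": "S8",
}
```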
Script | Description | Output files |
---|---|---|
`sim/make_background.py` | Make the background language model from all index terms of Core17. The background model is required for Controlled Query Generation (CQG) by Jordan et al. | `data/lm/background.csv` |
`sim/make_cqg.py` | Make the CQG language models with different parameters of lambda from 0.0 to 1.0 (see the sketch after this table). | `data/lm/cqg.json` |
`sim/simulate_queries_s12345.py` | Simulate TTS and KIS queries with strategies S1 to S3'. | `data/queries/s12345.csv` |
`sim/simulate_queries_s678.py` | Simulate TTS and KIS queries with strategies S4 to S4''. | `data/queries/s678.csv` |
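The background model from `make_background.py` and the lambda-parameterized models from `make_cqg.py` can be thought of as mixtures of a topic language model and a collection-wide background model. The sketch below is a minimal, hypothetical reading of such an interpolation and is not the repository's implementation; the function names, the direction in which lambda weights the two components, and the toy data are assumptions, while the actual term scoring follows Jordan et al.'s CQG as implemented in `sim/make_cqg.py`.

```python
from collections import Counter


def language_model(tokens):
    """Maximum-likelihood unigram model: term -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def cqg_model(topic_tokens, background_model, lam):
    """Interpolate a topic model with the background model (0 <= lam <= 1).

    Hypothetical sketch: here lam weights the topic model and (1 - lam) the
    background model; the repository's scripts define the actual scoring.
    """
    topic_model = language_model(topic_tokens)
    vocabulary = set(topic_model) | set(background_model)
    return {
        term: lam * topic_model.get(term, 0.0)
        + (1.0 - lam) * background_model.get(term, 0.0)
        for term in vocabulary
    }


# Toy example: rank candidate query terms for one topic under lambda = 0.6.
background = language_model("the city reports the mayor and the council".split())
topic_text = "new york city subway expansion city council vote".split()
ranked = sorted(cqg_model(topic_text, background, lam=0.6).items(),
                key=lambda item: item[1], reverse=True)
print(ranked[:5])
```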
## Experimental evaluation and results

In order to reproduce the experiments of the study, the scripts have to be executed in the order shown in the following table.
Script | Description | Output files | Reproduction of ... |
---|---|---|---|
`eval/arp.py`, `eval/arp_first.py`, `eval/arp_max.py` | Retrieval performance: Evaluate the Average Retrieval Performance (ARP). | `data/experimental_results/arp.csv`, `data/experimental_results/arp_first.csv`, `data/experimental_results/arp_max.csv` | Tab. A.1 |
`eval/rmse_s12345.py`, `eval/rmse_s678.py` | Retrieval performance: Evaluate the Root Mean Square Error (RMSE) (sketch below). | `data/experimental_results/rmse_map.csv`, `data/experimental_results/rmse_ndcg.csv`, `data/experimental_results/rmse_p1000.csv`, `data/experimental_results/rmse_uqv_vs_s12345_kis_ndcg.csv`, `data/experimental_results/rmse_uqv_vs_s12345_tts_ndcg.csv`, `data/figures/rmse_map.pdf`, `data/figures/rmse_ndcg.pdf`, `data/figures/rmse_p1000.pdf`, `data/figures/rmse_uqv_vs_s12345_kis_ndcg.pdf`, `data/figures/rmse_uqv_vs_s12345_tts_ndcg.pdf` | Fig. A.1, Fig. 1 |
`eval/t-test.py` | Retrieval performance: Evaluate the p-values of paired t-tests. | `data/experimental_results/ttest.csv`, `data/figures/ttest.pdf` | Fig. A.2 |
`eval/system_orderings.py` | Shared task utility: Evaluate Kendall's tau between relative system orderings (sketch below). | `data/experimental_results/system_orderings.csv`, `data/figures/system_orderings.pdf` | Fig. 2 (left) |
`eval/sdcg.py` | Effort and effect: Evaluate the Session Discounted Cumulative Gain (sDCG) (sketch below). | `data/experimental_results/sdcg_3queries.csv`, `data/experimental_results/sdcg_5queries.csv`, `data/experimental_results/sdcg_10queries.csv`, `data/figures/sdcg_3queries.pdf`, `data/figures/sdcg_5queries.pdf`, `data/figures/sdcg_10queries.pdf` | Fig. 3 (top) |
`eval/economic.py` | Effort and effect: Evaluate trade-offs between the number of queries and the browsing depth with isoquants. | `data/experimental_results/economic0.3.csv`, `data/experimental_results/economic0.4.csv`, `data/experimental_results/economic0.5.csv`, `data/figures/economic0.3.pdf`, `data/figures/economic0.4.pdf`, `data/figures/economic0.5.pdf` | Fig. 3 (bottom) |
`eval/jaccard_similarity.py` | Query term similarity: Evaluate query term similarities with the Jaccard coefficient (sketch below). | `data/experimental_results/jacc.csv`, `data/figures/jacc.pdf` | Fig. 2 (right) |
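For the RMSE-based comparison, the per-topic scores (e.g., nDCG) achieved with simulated queries are compared against those achieved with the real user query variants. Below is a minimal sketch of such a root mean square error over paired topic scores; the function name and the example values are hypothetical, and the actual pairing of runs and metrics is defined in `eval/rmse_s12345.py` and `eval/rmse_s678.py`.

```python
import math


def rmse(real_scores, simulated_scores):
    """Root mean square error between paired per-topic scores."""
    assert len(real_scores) == len(simulated_scores)
    squared_errors = [(r - s) ** 2 for r, s in zip(real_scores, simulated_scores)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical per-topic nDCG values for real and simulated query variants.
print(rmse([0.42, 0.31, 0.58], [0.40, 0.35, 0.50]))
```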
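For the shared task utility, `eval/system_orderings.py` compares how the simulated and the real query variants order a set of retrieval systems. A common way to quantify the agreement of two such orderings is Kendall's tau over the systems' mean scores, for instance with `scipy.stats.kendalltau`; the system names and scores below are made up.

```python
from scipy.stats import kendalltau

# Hypothetical mean scores of five systems under real and simulated queries.
real = {"bm25": 0.31, "qld": 0.29, "rm3": 0.35, "bm25+rm3": 0.37, "qld+rm3": 0.33}
simulated = {"bm25": 0.27, "qld": 0.26, "rm3": 0.30, "bm25+rm3": 0.34, "qld+rm3": 0.28}

systems = sorted(real)
tau, p_value = kendalltau([real[s] for s in systems], [simulated[s] for s in systems])
print(f"Kendall's tau: {tau:.2f} (p = {p_value:.3f})")
```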
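The Session Discounted Cumulative Gain discounts the gain of a relevant document both by its rank in the result list and by the position of the query within the session. The sketch below is one common reading of the measure with logarithmic rank and query discounts; the log bases `b=2` and `bq=4`, the function name, and the toy session are assumptions rather than the parameterization used in `eval/sdcg.py`.

```python
import math


def sdcg(session, b=2, bq=4):
    """Session DCG over one search session.

    `session` is a list of result lists (one per query reformulation), each
    holding graded relevance values by rank. Gains are discounted by
    (1 + log_b(rank)) within a result list and by (1 + log_bq(query position))
    within the session. Illustrative sketch only; b and bq are assumptions.
    """
    total = 0.0
    for query_position, gains in enumerate(session, start=1):
        query_discount = 1.0 + math.log(query_position, bq)
        for rank, gain in enumerate(gains, start=1):
            rank_discount = 1.0 + math.log(rank, b)
            total += gain / (rank_discount * query_discount)
    return total


# Hypothetical session with three reformulations and graded relevance per rank.
print(sdcg([[2, 0, 1], [1, 1, 0], [0, 0, 2]]))
```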
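The query term similarity is based on the Jaccard coefficient, i.e., the size of the intersection of two term sets divided by the size of their union. A minimal sketch with two hypothetical query variants is shown below; tokenization and normalization in `eval/jaccard_similarity.py` may differ.

```python
def jaccard(query_a, query_b):
    """Jaccard coefficient between the term sets of two queries."""
    terms_a, terms_b = set(query_a.lower().split()), set(query_b.lower().split())
    if not terms_a and not terms_b:
        return 0.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)


# Two hypothetical query variants for the same topic.
print(jaccard("new york city subway expansion", "subway expansion plans new york"))
```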