Note:
- A curated list of awesome papers for Semantic Retrieval, including some early methods and recent neural models for information retrieval tasks (e.g., ad-hoc retrieval, open-domain QA, community-based QA, and automatic conversation).
- For researchers who want to acquire semantic models for re-ranking stages, we refer readers to the awesome NeuIR survey by Guo et.al.
- Any feedback and contribution are welcome, please open an issue or contact me.
Contents
- Survey Paper
- Chapter 1: Classical Term-based Retrieval
- Chapter 2: Early Methods for Semantic Retrieval
- Chapter 3: Neural Methods for Semantic Retrieval
- Chapter 4: Other Resources
Survey Paper
- Semantic Matching in Search(Li et al., 2014)
- Pretrained Transformers for Text Ranking: BERT and Beyond(Lin et al., 2021, arXiv)
- Semantic Models for the First-stage Retrieval: A Comprehensive Review (Guo et al., 2021, TOIS)
- A Proposed Conceptual Framework for a Representational Approach to Information Retrieval(Lin et al., 2021, arXiv)
Classical Term-based Retrieval
- A Vector Space Model for Automatic Indexing(1975, VSM)
- Developments in Automatic Text Retrieval(1991, TFIDF)
- Term-weighting Approaches in Automatic Text Retrieval(1988, TFIDF)
- Relevance Weighting of Search Terms(1976, BIM)
- A Theoretical Basis for the Use of Co-occurrence Data in Information Retrieval(1997, Tree Dependence Model)
- The Probabilistic Relevance Framework: BM25 and Beyond(2010, BM25)
- A Language Modeling Approach to Information Retrieval(1998, QL)
- Statistical Language Models for Information Retrieval(2007, LM for IR)
- Hypergeometric Language Model and Zipf-Like Scoring Function for Web Document Similarity Retrieval(2010, LM for IR)
- Probabilistic Models of Information Retrieval Based on Measuring the Divergence from Randomness(2002, DFR)
Early Methods for Semantic Retrieval
Query Expansion
- Global Model
- Local Model
- Relevance Feedback in Information Retrieval(1971, Rocchio PRF)
- Model-based Feedback in the Language Modeling Approach to Information Retrieval(2001, Divergence Minimization Model)
- UMass at TREC 2004: Novelty and HARD(2004, RM3 for PRF)
- Selecting Good Expansion Terms for Pseudo-Relevance Feedback(2008, PRF)
- A Comparative Study of Methods for Estimating Query Language Models with Pseudo Feedback(2009)
- Pseudo-Relevance Feedback Based on Matrix Factorization(2016)
- Reducing the Risk of Query Expansion via Robust Constrained Optimization(2009,query drift problem)
- Query Expansion using Local and Global Document Analysis(2017)
Document Expansion
- Corpus Structure, Language Models, and Ad Hoc Information Retrieval(2004)
- Cluster-Based Retrieval Using Language Models(2004)
- Language Model Information Retrieval with Document Expansion(2006)
- Document Expansion Based on WordNet for Robust IR(2010)
- Improving Retrieval of Short Texts Through Document Expansion(2012)
- Document Expansion Using External Collections(2017, WordNet-based)
- Document Expansion versus Query Expansion for Ad-hoc Retrieval(2005)
Term Dependency Model
- Experiments in Automatic Phrase Indexing For Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods(1987, VSM + term dependency)
- Term-weighting Approaches in Automatic Text Retrieval(1988, VSM + term dependency)
- An Analysis of Statistical and Syntactic Phrases(1997, VSM + term dependency)
- A Probabilistic Model of Information Retrieval: Development and Comparative Experiments(2000, VSM + term dependency)
- Relevance Ranking Using Kernels(2010, BM25 + term dependency)
- A General Language Model for Information Retrieval (1999, LM + term dependency)
- Biterm Language Models for Document Retrieval(2002, LM + term dependency)
- Capturing Term Dependencies using a Language Model based on Sentence Trees(2002, LM + term dependency)
- Dependence Language Model for Information Retrieval(2004, LM + term dependency)
- A Generative Theory of Relevance(2008)
- A Markov Random Field Model for Term Dependencies(2005, SDM)
Topic Model
- Generalized Vector Space Model In Information Retrieval(1985, GVSM)
- Indexing by Latent Semantic Analysis(1990, LSI for IR)
- Probabilistic Latent Semantic Indexing(2017, PLSA, linearly combine)
- Corpus Structure, Language Models, and Ad Hoc Information Retrieval(2004, smoothing)
- Regularizing Ad Hoc Retrieval Scores(2005, smoothing)
- LDA-Based Document Models for Ad-Hoc Retrieval(2006, LDA for IR and LDA for LM smoothing)
- A Comparative Study of Utilizing Topic Models for Information Retrieval(2009, smoothing)
- Investigating Task Performance of Probabilistic Topic Models: An Empirical Study of PLSA and LDA(2010)
- Latent Semantic Indexing (LSI) Fails for TREC Collections(2011)
Translation Model
- Information Retrieval as Statistical Translation(1999)
- Estimation of Statistical Translation Models Based on Mutual Information for Ad Hoc Information Retrieval(2010)
- Clickthrough-Based Translation Models for Web Search: From Word Models to Phrase Models(2010)
- Axiomatic Analysis of Translation Language Model for Information Retrieval(2012)
- Query Rewriting Using Monolingual Statistical Machine Translation(2010, for query expansion)
- Towards Concept-based Translation Models using Search Logs for Query Expansion(2012, for query expansion)
Neural Methods for Semantic Retrieval
Sparse Retrieval Methods
-
Term Re-weighting
- Learning to Reweight Terms with Distributed Representations(Zheng et al., 2015, SIGIR, DeepTR)
- Integrating and Evaluating Neural Word Embeddings in Information Retrieval(Zuccon et al., 2015, ADCS, NTLM)
- Learning Term Discrimination(Frej et al, 2020, SIGIR, TVD)
- Context-Aware Sentence/Passage Term Importance Estimation for First Stage Retrieval(Dai et al., 2019, arXiv, DeepCT)
- Context-Aware Term Weighting For First-Stage Passage Retrieval(Dai et al., 2020, SIGIR, DeepCT)
- Efficiency Implications of Term Weighting for Passage Retrieval(Mackenzie et al., 2020, SIGIR, DeepCT)
- Context-Aware Document Term Weighting for Ad-Hoc Search(Dai et al., 2020, WWW, HDCT)
- A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques(Lin et al., 2021, arXiv, uniCOIL)
-
Expansion
- Document Expansion by query Prediction(Nogueira et al., 2019, arXiv, Doc2Query)
- From doc2query to docTTTTTquery(Nogueira et al., 2019, arXiv, DocTTTTTQuery)
- A Unified Pretraining Framework for Passage Ranking and Expansion(Yan et al., 2021, AAAI, UED)
- Generation-augmented Retrieval for Open-domain Question Answering(Mao et al., 2020, ACL, GAR, query expansion)
-
Expansion + Term Re-weighting
- Expansion via Prediction of Importance with Contextualization(MacAvaney et al., 2020, SIGIR, EPIC)
- SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval(Bai et al., 2020, arXiv, SparTerm)
- SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking(Formal et al., 2021, SIGIR, SPLADE)
- SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval(Formal et al., 2021, arXiv, SPLADEv2)
- Learning Passage Impacts for Inverted Indexes(Mallia et al., 2021, SIGIR, DeepImapct)
- TILDE: Term Independent Likelihood moDEl for Passage Re-ranking(Zhuang et al., 2021, SIGIR, TILDE)
- Fast Passage Re-ranking with Contextualized Exact Term Matching and Efficient Passage Expansion(Zhuang et al., 2021, arXiv, TILDEv2)
-
Sparse Representation Learning
- Semantic Hashing(Salakhutdinov et al., 2009)
- From Neural Re-Ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing(Zamani et al., 2018, CIKM, SNRM)
- UHD-BERT: Bucketed Ultra-High Dimensional Sparse Representations for Full Ranking(Jang et al., 2021, arXiv, UHD-BERT)
- Efficient Passage Retrieval with Hashing for Open-domain Question Answering(Yamada et al., 2021, ACL, BPR)
- Composite Code Sparse Autoencoders for First Stage Retrieval(Lassance et al., 2021, SIGIR, CCSA)
Dense Retrieval Methods
- Word-Embedding-based
- Aggregating Continuous Word Embeddings for Information Retrieval(Clinchant et al., 2013, ACL, FV)
- Monolingual and Cross-Lingual Information Retrieval Models Based on (Bilingual) Word Embeddings(Vulic et al., 2015, SIGIR)
- Short Text Similarity with Word Embeddings(Kenter et al., 2015, CIKM, OoB)
- A Dual Embedding Space Model for Document Ranking(Mitra et al., 2016, arXiv, DESM)
- Efficient Natural Language Response Suggestion for Smart Reply(Henderson et al., 2017, arXiv)
- End-to-End Retrieval in Continuous Space(Gillick et al., 2018, arXiv)
- Phrase-Indexed Question Answering: A New Challenge for Scalable Document Comprehension(Seo et al., 2018, EMNLP, PIQA)
- Dense Passage Retrieval for Open-Domain Question Answering(Karpukhin et al., 2020, EMNLP, DPR)
- Retrieval-augmented generation for knowledge-intensive NLP tasks(Lewis et al., 2020, NIPS, RAG)
- RepBERT: Contextualized Text Embeddings for First-Stage Retrieval(Zhan et al., 2020, arXiv, RepBERT)
- CoRT: Complementary Rankings from Transformers(Wrzalik et al., 2020, NAACL, CoRT)
- DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding(Nie et al., 2020, SIGIR, DC-BERT)
- Neural Retrieval for Question Answering with Cross-Attention Supervised Data Augmentation(Yang et al., 2021, ACL, data augmentation)
- Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval(Xiong et al., 2020, arXiv, ANCE)
- Learning To Retrieve: How to Train a Dense Retrieval Model Effectively and Efficiently(Zhan et al., 2020, arXiv, LTRe)
- GLOW : Global Weighted Self-Attention Network for Web(Shan et al, 2020, arXiv, GLOW)
- An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering(Qu et al., 2021, ACL, RocketQA)
- Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling(Hofstätter et al., 2021, SIGIR, TAS-Balanced)
- Optimizing Dense Retrieval Model Training with Hard Negatives(Zhan et al., 2021, SIGIR, STAR/ADORE)
- Few-Shot Conversational Dense Retrieval(Yu et al., 2021, SIGIR)
- Learning Dense Representations of Phrases at Scale(Lee et al., 2021, ACL, DensePhrases)
- More Robust Dense Retrieval with Contrastive Dual Learning(Lee et al., 2021, ICTIR, DANCE)
- PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval(Ren et al., 2021, ACL, PAIR)
- Relevance-guided Supervision for OpenQA with ColBERT(Khattab et al., 2021, TACL, ColBERT-QA)
- End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering(Sachan et al., 2021, arXiv, EMDR^2)
- Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback(Yu et al, 2021, CIKM, ANCE-PRF)
- Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval(Wang et al., 2021, ICTIR, ColBERT-PRF)
- A Discriminative Semantic Ranker for Question Retrieval(Cai et al., 2021, ICTIR, DenseTrans)
- Representation Decoupling for Open-Domain Passage Retrieval(Wu et al., 2021, arXiv)
- RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking(Ren et al., 2021, EMNLP, RocketQAv2)
- Knowledge Distillation
- Distilling Dense Representations for Ranking using Tightly-Coupled Teachers(Lin et al., 2020, arXiv, TCT-ColBERT)
- Distilling Knowledge for Fast Retrieval-based Chat-bots(Tahami et al., 2020, SIGIR)
- Distilling Knowledge from Reader to Retriever for Question Answering(Izacard et al., 2020, arXiv)
- Is Retriever Merely an Approximator of Reader?(Yang et al., 2020, arXiv)
- Twinbert: Distilling knowledge to twin-structured compressed bert models for large-scale retrieval.
- Improving Bi-encoder Document Ranking Models with Two Rankers and Multi-teacher Distillation(Choi et al., 2021, SIGIR, TRMD)
- Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation(Hofstätter et al., 2021, arXiv, Margin-MSE loss)
- Multi-vector Representation
- Multi-Hop Paragraph Retrieval for Open-Domain Question Answering(Feldman et al., 2019, ACL, MUPPET)
- Sparse, Dense, and Attentional Representations for Text Retrieval(Luan et al., 2020, TACL, ME-BERT)
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT(Khattab et al., 2020, SIGIR, ColBERT)
- COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List(Gao et al., 2021, NACL, COIL)
- Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval(Tang et al., 2021, ACL)
- Phrase Retrieval Learns Passage Retrieval, Too(Lee et al., 2021, EMNLP, DensePhrases)
- Query Embedding Pruning for Dense Retrieval(Tonellotto et al., 2021, CIKM)
- Accelerate Interaction-based Models
- Incorporating Query Term Independence Assumption for Efficient Retrieval and Ranking using Deep Neural Networks(Mitra et al., 2019, arXiv)
- Efficient Interaction-based Neural Ranking with Locality Sensitive Hashing(Ji et al., 2019, WWW)
- Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring(Humeau et al., 2020, ICLR, Poly-encoders)
- Modularized Transfomer-based Ranking Framework(Gao et al., 2020, EMNLP, MORES)
- Efficient Document Re-Ranking for Transformers by Precomputing Term Representations(MacAvaney et al., 2020, SIGIR, PreTTR)
- DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering(Cao et al., 2020, ACL, DeFormer)
- SPARTA: Efficient Open-Domain Question Answering via Sparse Transformer Matching Retrieval(Zhao et al., 2020, arXiv, SPARTA)
- Conformer-Kernel with Query Term Independence for Document Retrieval(Mitra et al., 2020, arXiv)
- Pre-training
- Latent Retrieval for Weakly Supervised Open Domain Question Answering(Lee et al., 2019, ACL, ORQA)
- Retrieval-Augmented Language Model Pre-Training(Guu et al., 2020, ICML, REALM)
- Pre-training Tasks for Embedding-based Large-scale Retrieval(Chang et al., 2020, ICLR, BFS+WLP+MLM)
- Embedding-based Zero-shot Retrieval through Query Generation(Liang et al., 2020, arXiv, query generation)
- Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation(Ma et al., 2020, arXiv, query generation)
- Towards Robust Neural Retrieval Models with Synthetic Pre-Training(Reddy et al., 2021, arXiv, query generation)
- Is Your Language Model Ready for Dense Representation Fine-tuning?(Gao et al., 2021, EMNLP, Condenser)
- Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval(Gao et al., 2021, arXiv, coCondenser)
- Less is More: Pre-training a Strong Siamese Encoder Using a Weak Decoder(Lu et al., 2021, EMNLP, SEED-Encoder)
- Pre-trained Language Model for Web-scale Retrieval in Baidu Search(Liu et al., 2021, KDD)
- Pre-training for Ad-hoc Retrieval: Hyperlink is Also You Need(Ma et al., 2021, CIKM, HARP)
- Joint Learning with Index
- Joint Learning of Deep Retrieval Model and Product Quantization based Embedding Index(Zhang et al., 2021, SIGIR, Poeem)
- Jointly Optimizing Query Encoder and Product Quantization to Improve Retrieval Performance(Zhan et al., 2021, CIKM, JPQ)
- Matching-oriented Product Quantization For Ad-hoc Retrieval(Xiao et al., 2021, EMNLP, MoPQ)
- Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval(Zhan et al, 2022, WSDM, RepCONC)
- Debias
- Learning Robust Dense Retrieval Models from Incomplete Relevance Labels(Prakash et al., 2021, SIGIR, RANCE)
- Zero-shot
- Zero-Shot Dense Retrieval with Momentum Adversarial Domain Invariant Representations(Xin et al., 2021, arXiv, MoDIR)
- Probing Analysis
- The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes(Reimers et al., 2021, ACL)
- Simple and Effective Unsupervised Redundancy Elimination to Compress Dense Vectors for Passage Retrieval(Ma et al., EMNLP, 2021, redundancy)
- BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models( Thakur et al., 2021, NeurIPS, transferability)
- Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One?(Chen et al., 2021, arXiv)
- Simple Entity-Centric Questions Challenge Dense Retrievers(Sciavolino et al., 2021, EMNLP)
Hybrid Retrieval Methods
- Word-Embedding-based
- Monolingual and Cross-Lingual Information Retrieval Models Based on (Bilingual) Word Embeddings(Vulic et al., 2015, SIGIR, linearly combine)
- Word Embedding based Generalized Language Model for Information Retrieval(Ganguly et al., 2015, SIGIR, GLM)
- Representing Documents and Queries as Sets of Word Embedded Vectors for Information Retrieval(Roy et al., 2016, SIGIR, linearly combine)
- A Dual Embedding Space Model for Document Ranking(Mitra et al., 2016, WWW, DESM_mixture, linearly combine)
- Off the Beaten Path: Let’s Replace Term-Based Retrieval with k-NN Search(Boytsov et al., 2016, CIKM, BM25+translation model)
- Learning Hybrid Representations to Retrieve Semantically Equivalent Questions(Santos et al., 2015, ACL, BOW-CNN)
- Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index (Seo et al., 2019, ACL, DenSPI)
- Contextualized Sparse Representations for Real-Time Open-Domain Question Answering(Lee et al., 2020, ACL, SPARC)
- CoRT: Complementary Rankings from Transformers(Wrzalik et al., 2020, NAACL, CoRT_BM25)
- Sparse, Dense, and Attentional Representations for Text Retrieval(Luan et al., 2020, TACL, ME-Hybrid)
- Complement Lexical Retrieval Model with Semantic Residual Embeddings(Gao et al., 2020, ECIR, CLEAR)
- Leveraging Semantic and Lexical Matching to Improve the Recall of Document Retrieval Systems: A Hybrid Approach(Kuzi et al., 2020, arXiv, Hybrid)
- A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques(Lin et al., 2021, arXiv, uniCOIL)
- Contextualized Offline Relevance Weighting for Efficient and Effective Neural Retrieval(Chen et al., 2021, SIGIR)
- Predicting Efficiency/Effectiveness Trade-offs for Dense vs. Sparse Retrieval Strategy Selection(Arabzadeh et al., 2021, CIKM)
- Fast Forward Indexes for Efficient Document Ranking(Leonhardt et al., 2021, arXiv)
Other Resources
Other Tasks
- E-commerce Search
- Deep Interest Network for Click-Through Rate Prediction(Zhou et al., 2018, KDD, DIN)
- From Semantic Retrieval to Pairwise Ranking: Applying Deep Learning in E-commerce Search(Li et al., 2019, SIGIR, Jingdong)
- Multi-Interest Network with Dynamic Routing for Recommendation at Tmall(Li et al., 2019, CIKM, MIND, Tmall)
- Towards Personalized and Semantic Retrieval: An End-to-End Solution for E-commerce Search via Embedding Learning(Zhang et al., 2020, SIGIR, DPSR, Jingdong)
- Deep Multi-Interest Network for Click-through Rate Prediction(Xiao et al., 2020, CIKM, DMIN)
- Deep Retrieval: An End-to-End Learnable Structure Model for Large-Scale Recommendations(Gao et al., 2020, arXiv)
- Embedding-based Product Retrieval in Taobao Search(Li et al., 2021, KDD, taobao)
- Embracing Structure in Data for Billion-Scale Semantic Product Search(Lakshman et al., 2021, arXiv, Amazon)
- Sponsored Search
- MOBIUS: Towards the Next Generation of Query-Ad Matching in Baidu’s Sponsored Search(Fan et al., 2019, KDD, Baidu)
- Image Retrieval
- Binary Neural Network Hashing for Image Retrieval(Zhang et al., 2021, SIGIR, BNNH)
- Deep Self-Adaptive Hashing for Image Retrieval(Lin et al., 2021, CIKM, DSAH)
- Report on the First HIPstIR Workshop on the Future of Information Retrieval(Dietz et al., 2019, SIGIR, workshop)
- Let’s measure run time! Extending the IR replicability infrastructure to include performance aspects(Hofstätter et al., 2019, SIGIR)
- Embedding-based Retrieval in Facebook Search(Huang et al., 2020, KDD, EBR)
- Learning K-way D-dimensional Discrete Codes for Compact Embedding Representations(Chen et al., 2018, ICML)
Datasets
- 【MS MARCO】MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
- 【TREC CAR】TREC Complex Answer Retrieval Overview
- 【TREC DL】Overview of the TREC 2019 deep learning track
- 【TREC COVID】TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection
Indexing Methods
- Tree-based
- Hashing-based
- Quantization-based
- Product Quantization for Nearest Neighbor Search(2010, PQ)
- Optimized Product Quantization(2013, OPQ)
- Graph-based
- Toolkits
- Faiss: a library for efficient similarity search and clustering of dense vectors
- SPTAG: A library for fast approximate nearest neighbor search
- OpenMatch: An Open-Source Package for Information Retrieval
- Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations
- ElasticSearch