Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.

Castorini

Last update: Dec 29, 2022

Overview

Pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. Retrieval using sparse representations is provided via integration with our group's Anserini IR toolkit, which is built on Lucene. Retrieval using dense representations is provided via integration with Facebook's Faiss library.

Pyserini is primarily designed to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections.

With Pyserini, it's easy to reproduce runs on a number of standard IR test collections! A low-effort way to try things out is to look at our online notebooks, which will allow you to get started with just a few clicks.

Package Installation

Install via PyPI (requires Python 3.6+):

pip install pyserini

Sparse retrieval depends on Anserini, which is itself built on Lucene, and thus requires Java 11.

Dense retrieval depends on neural networks and requires a more complex set of dependencies. A pip installation will automatically pull in the 🤗 Transformers library to satisfy the package requirements. Pyserini also depends on PyTorch and Faiss, but since these packages may require platform-specific custom configuration, they are not explicitly listed in the package requirements. We leave the installation of these packages to you.

The software ecosystem is rapidly evolving and a potential source of frustration is incompatibility among different versions of underlying dependencies. We provide additional detailed installation instructions here.

Development Installation

If you're planning on just using Pyserini, then the pip instructions above are fine. However, if you're planning on contributing to the codebase or want to work with the latest not-yet-released features, you'll need a development installation. For this, clone our repo with the --recurse-submodules option to make sure the tools/ submodule also gets cloned.

The tools/ directory, which contains evaluation tools and scripts, is actually this repo, integrated as a Git submodule (so that it can be shared across related projects). Build as follows (you might get warnings, but okay to ignore):

cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
cd tools/eval/ndeval && make && cd ../../..

Next, you'll need to clone and build Anserini. It makes sense to put both pyserini/ and anserini/ in a common folder. After you've successfully built Anserini, copy the fatjar, which will be target/anserini-X.Y.Z-SNAPSHOT-fatjar.jar, into pyserini/resources/jars/. As with the pip installation, a potential source of frustration is incompatibility among different versions of underlying dependencies. For these and other issues, we provide additional detailed installation instructions here.

You can confirm everything is working by running the unit tests:

python -m unittest

Assuming all tests pass, you should be ready to go!

Quick Links

How do I search?
How do I fetch a document?
How do I index and search my own documents?
How do I reproduce results on Robust04, MS MARCO...?
How do I configure search? (Guide to Interactive Search)
How do I manually download indexes? (Guide to Interactive Search)
How do I perform dense and hybrid retrieval? (Guide to Interactive Search)
How do I iterate over index terms and access term statistics? (Index Reader API)
How do I traverse postings? (Index Reader API)
How do I access and manipulate term vectors? (Index Reader API)
How do I compute the tf-idf or BM25 score of a document? (Index Reader API)
How do I access basic index statistics? (Index Reader API)
How do I access underlying Lucene analyzers? (Analyzer API)
How do I build custom Lucene queries? (Query Builder API)
How do I iterate over raw collections? (Collection API)

How do I search?

Pyserini supports sparse retrieval (e.g., BM25 ranking using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches.

Sparse Retrieval

The SimpleSearcher class provides the entry point for sparse retrieval using bag-of-words representations. Anserini supports a number of pre-built indexes for common collections that it'll automatically download for you and store in ~/.cache/pyserini/indexes/. Here's how to use a pre-built index for the MS MARCO passage ranking task and issue a query interactively:

from pyserini.search import SimpleSearcher

searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')
hits = searcher.search('what is a lobster roll?')

for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.5f}')

The results should be as follows:

 1 7157707 11.00830
 2 6034357 10.94310
 3 5837606 10.81740
 4 7157715 10.59820
 5 6034350 10.48360
 6 2900045 10.31190
 7 7157713 10.12300
 8 1584344 10.05290
 9 533614  9.96350
10 6234461 9.92200

To further examine the results:

# Grab the raw text:
hits[0].raw

# Grab the raw Lucene Document:
hits[0].lucene_document

Pre-built indexes are hosted on University of Waterloo servers. The following method will list available pre-built indexes:

SimpleSearcher.list_prebuilt_indexes()

A description of what's available can be found here. Alternatively, see this answer for how to download an index manually.

Dense Retrieval

The SimpleDenseSearcher class provides the entry point for dense retrieval, and its usage is quite similar to SimpleSearcher. The only additional thing we need to specify for dense retrieval is the query encoder.

from pyserini.dsearch import SimpleDenseSearcher, TctColBertQueryEncoder

encoder = TctColBertQueryEncoder('castorini/tct_colbert-msmarco')
searcher = SimpleDenseSearcher.from_prebuilt_index(
    'msmarco-passage-tct_colbert-hnsw',
    encoder
)
hits = searcher.search('what is a lobster roll')

for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.5f}')

If you encounter an error (on macOS), you'll need the following:

import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

The results should be as follows:

 1 7157710 70.53742
 2 7157715 70.50040
 3 7157707 70.13804
 4 6034350 69.93666
 5 6321969 69.62683
 6 4112862 69.34587
 7 5515474 69.21354
 8 7157708 69.08416
 9 6321974 69.06841
10 2920399 69.01737

Hybrid Sparse-Dense Retrieval

The HybridSearcher class provides the entry point to perform hybrid sparse-dense retrieval:

from pyserini.search import SimpleSearcher
from pyserini.dsearch import SimpleDenseSearcher, TctColBertQueryEncoder
from pyserini.hsearch import HybridSearcher

ssearcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')
encoder = TctColBertQueryEncoder('castorini/tct_colbert-msmarco')
dsearcher = SimpleDenseSearcher.from_prebuilt_index(
    'msmarco-passage-tct_colbert-hnsw',
    encoder
)
hsearcher = HybridSearcher(dsearcher, ssearcher)
hits = hsearcher.search('what is a lobster roll')

for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.5f}')

The results should be as follows:

 1 7157715 71.56022
 2 7157710 71.52962
 3 7157707 71.23887
 4 6034350 70.98502
 5 6321969 70.61903
 6 4112862 70.33807
 7 5515474 70.20574
 8 6034357 70.11168
 9 5837606 70.09911
10 7157708 70.07636

In general, hybrid retrieval will be more effective than dense retrieval, which will be more effective than sparse retrieval.
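The three result lists above are consistent with a simple interpolation: each hybrid score equals the dense score plus 0.1 times the sparse score, and a document missing from one top-10 list is given that list's lowest score. The sketch below reproduces the hybrid ranking from the sparse and dense scores shown earlier. The helper name fuse, the weight alpha=0.1, and the minimum-score fill are assumptions inferred from these numbers; this is not a statement of how HybridSearcher is implemented.

```python
# Top-10 scores copied from the sparse and dense result lists above.
sparse = {'7157707': 11.00830, '6034357': 10.94310, '5837606': 10.81740,
          '7157715': 10.59820, '6034350': 10.48360, '2900045': 10.31190,
          '7157713': 10.12300, '1584344': 10.05290, '533614': 9.96350,
          '6234461': 9.92200}
dense = {'7157710': 70.53742, '7157715': 70.50040, '7157707': 70.13804,
         '6034350': 69.93666, '6321969': 69.62683, '4112862': 69.34587,
         '5515474': 69.21354, '7157708': 69.08416, '6321974': 69.06841,
         '2920399': 69.01737}


def fuse(dense_scores, sparse_scores, alpha=0.1, k=10):
    """Hypothetical helper: dense + alpha * sparse, filling gaps with list minima."""
    min_dense = min(dense_scores.values())
    min_sparse = min(sparse_scores.values())
    fused = {}
    for docid in set(dense_scores) | set(sparse_scores):
        # A document absent from one list gets that list's lowest score (assumption).
        d = dense_scores.get(docid, min_dense)
        s = sparse_scores.get(docid, min_sparse)
        fused[docid] = d + alpha * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:k]


hybrid = fuse(dense, sparse)
for rank, (docid, score) in enumerate(hybrid, start=1):
    print(f'{rank:2} {docid:7} {score:.5f}')
```

Under these assumptions, a passage that appears in only one list can still reach the hybrid top 10, which is how 6034357 and 5837606 (sparse only) end up ranked alongside dense hits.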

How do I fetch a document?

Another commonly used feature in Pyserini is to fetch a document (i.e., its text) given its docid. This is easy to do:

from pyserini.search import SimpleSearcher

searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')
doc = searcher.doc('7157715')

From doc, you can access its contents as well as its raw representation. The contents hold the representation of what's actually indexed; the raw representation is usually the original "raw document". A simple example can illustrate this distinction: for an article from CORD-19, raw holds the complete JSON of the article, which obviously includes the article contents, but has metadata and other information as well. The contents contain extracts from the article that are actually indexed (for example, the title and abstract). In most cases, contents can be deterministically reconstructed from raw. When building the index, we specify flags to store contents and/or raw; it is rarely the case that we store both, since that would be a waste of space. In the case of the pre-built msmarco-passage index, we only store raw. Thus:

# Document contents: what's actually indexed.
# Note, this is not stored in the pre-built msmarco-passage index.
doc.contents()
                                                                                                   
# Raw document
doc.raw()

As you'd expect, doc.id() returns the docid, which is 7157715 in this case. Finally, doc.lucene_document() returns the underlying Lucene Document (i.e., a Java object). With that, you get direct access to the complete Lucene API for manipulating documents.

Since each text in the MS MARCO passage corpus is a JSON object, we can read the document into Python and manipulate:

import json
json_doc = json.loads(doc.raw())

json_doc['contents']
# 'contents' of the document:
# A Lobster Roll is a bread roll filled with bite-sized chunks of lobster meat...

Every document has a docid, of type string, assigned by the collection it is part of. In addition, Lucene assigns each document a unique internal id (confusingly, Lucene also calls this the docid), which is an integer numbered sequentially starting from zero to one less than the number of documents in the index. This can be a source of confusion but the meaning is usually clear from context. Where there may be ambiguity, we refer to the external collection docid and Lucene's internal docid to be explicit. Programmatically, the two are distinguished by type: the first is a string and the second is an integer.

As an important side note, Lucene's internal docids are not stable across different index instances. That is, in two different index instances of the same collection, Lucene is likely to have assigned different internal docids for the same document. This is because the internal docids are assigned based on document ingestion order; this will vary due to thread interleaving during indexing (which is usually performed on multiple threads).

The doc method in searcher takes either a string (interpreted as an external collection docid) or an integer (interpreted as Lucene's internal docid) and returns the corresponding document. Thus, a simple way to iterate through all documents in the collection (and for example, print out its external collection docid) is as follows:

for i in range(searcher.num_docs):
    print(searcher.doc(i).docid())
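To make the distinction between the two kinds of docid concrete, here is a toy, pure-Python stand-in for an index. It is only a sketch: the ToyIndex class is hypothetical and does not reflect how Pyserini or Lucene store documents. It shows lookup dispatching on the argument type, and why internal docids differ between two index instances built in different ingestion orders.

```python
class ToyIndex:
    """Toy stand-in for an index; not Pyserini's implementation."""

    def __init__(self, docs):
        # docs: (external docid, text) pairs in ingestion order.
        self._docs = list(docs)
        # Internal ids are simply positions in ingestion order.
        self._internal = {docid: i for i, (docid, _) in enumerate(self._docs)}

    @property
    def num_docs(self):
        return len(self._docs)

    def doc(self, key):
        """Look up by external docid (str) or internal docid (int)."""
        if isinstance(key, str):
            key = self._internal[key]
        elif not isinstance(key, int):
            raise TypeError('docid must be a str or an int')
        return self._docs[key]


docs = [('a', 'first text'), ('b', 'second text'), ('c', 'third text')]
index_1 = ToyIndex(docs)
index_2 = ToyIndex(reversed(docs))  # same collection, different ingestion order

# The external docid identifies the same document in both instances.
print(index_1.doc('b') == index_2.doc('b'))
# The internal docid 0 points at different documents in each instance.
print(index_1.doc(0), index_2.doc(0))
```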

How do I index and search my own documents?

To build sparse indexes (i.e., Lucene inverted indexes) on your own document collections, follow the instructions below. To build dense indexes (e.g., the output of transformer encoders) on your own document collections, see instructions here. The following covers English documents; if you want to index and search multilingual documents, check out this answer.

Pyserini (via Anserini) provides ingestors for document collections in many different formats. The simplest, however, is the following JSON format:

{
  "id": "doc1",
  "contents": "this is the contents."
}

A document simply consists of two fields, a docid and contents. Pyserini accepts collections composed of these documents organized in three different ways:

Folder with each JSON in its own file, like this.
Folder with files, each of which contains an array of JSON documents, like this.
Folder with files, each of which contains a JSON on an individual line, like this (often called JSONL format).
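As a minimal sketch of producing the third layout (JSONL), the following writes one JSON object per line into a file inside a folder. The helper name write_jsonl, the file name docs.jsonl, and the toy documents are hypothetical; only the "id" and "contents" field names come from the format described above.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(docs, out_dir, filename='docs.jsonl'):
    """Write (docid, text) pairs as one JSON object per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / filename
    with path.open('w', encoding='utf-8') as f:
        for docid, text in docs:
            f.write(json.dumps({'id': docid, 'contents': text}) + '\n')
    return path


# Hypothetical toy documents, for illustration only.
docs = [('doc1', 'this is the contents.'), ('doc2', 'another short document.')]
jsonl_path = write_jsonl(docs, tempfile.mkdtemp())
print(jsonl_path.read_text(encoding='utf-8'))
```

The folder containing the resulting file is what would be passed as the input collection.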

So, the quickest way to get started is to write a script that converts your documents into the above format. Then, you can invoke the indexer (here, we're indexing JSONL, but any of the other formats work as well):

python -m pyserini.index -collection JsonCollection \
                         -generator DefaultLuceneDocumentGenerator \
                         -threads 1 \
                         -input integrations/resources/sample_collection_jsonl \
                         -index indexes/sample_collection_jsonl \
                         -storePositions -storeDocvectors -storeRaw

Three options control the type of index that is built:

-storePositions: builds a standard positional index
-storeDocvectors: stores doc vectors (required for relevance feedback)
-storeRaw: stores raw documents

If you don't specify any of the three options above, Pyserini builds an index that only stores term frequencies. This is sufficient for simple "bag of words" querying (and yields the smallest index size).

Once indexing is done, you can use SimpleSearcher to search the index:

from pyserini.search import SimpleSearcher

searcher = SimpleSearcher('indexes/sample_collection_jsonl')
hits = searcher.search('document')

for i in range(len(hits)):
    print(f'{i+1:2} {hits[i].docid:4} {hits[i].score:.5f}')

You should get something like the following:

 1 doc2 0.25620
 2 doc3 0.23140

If you want to perform a batch retrieval run (e.g., directly from the command line), organize all your queries in a tsv file, like here. The format is simple: the first field is a query id, and the second field is the query itself. Note that the file extension must end in .tsv so that Pyserini knows what format the queries are in.
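A query file in this format can be produced with a few lines of Python. This is only a sketch: the queries and the file name my_queries.tsv are hypothetical, and the only properties taken from the description above are the two tab-separated fields and the .tsv extension.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical queries, for illustration only: query id -> query text.
queries = {'1': 'document', '2': 'contents'}

tsv_path = Path(tempfile.mkdtemp()) / 'my_queries.tsv'
with tsv_path.open('w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, delimiter='\t', lineterminator='\n')
    for qid, text in queries.items():
        # First field: query id. Second field: the query itself.
        writer.writerow([qid, text])

print(tsv_path.read_text(encoding='utf-8'))
```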

Then, you can run:

$ python -m pyserini.search --topics integrations/resources/sample_queries.tsv \
                            --index indexes/sample_collection_jsonl \
                            --output run.sample.txt \
                            --bm25

$ cat run.sample.txt 
1 Q0 doc2 1 0.256200 Anserini
1 Q0 doc3 2 0.231400 Anserini
2 Q0 doc1 1 0.534600 Anserini
3 Q0 doc1 1 0.256200 Anserini
3 Q0 doc2 2 0.256199 Anserini
4 Q0 doc3 1 0.483000 Anserini

Note that the output run file is in standard TREC format.
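Each line of a TREC run file has six whitespace-separated columns: query id, the literal Q0, docid, rank, score, and a run tag. As a minimal sketch, the sample run above can be read back into Python like this (the helper name parse_trec_run is hypothetical and not part of Pyserini):

```python
from collections import defaultdict

# The sample run shown above.
run_text = """\
1 Q0 doc2 1 0.256200 Anserini
1 Q0 doc3 2 0.231400 Anserini
2 Q0 doc1 1 0.534600 Anserini
3 Q0 doc1 1 0.256200 Anserini
3 Q0 doc2 2 0.256199 Anserini
4 Q0 doc3 1 0.483000 Anserini
"""


def parse_trec_run(text):
    """Map each query id to its ranked list of (docid, rank, score)."""
    run = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        qid, _q0, docid, rank, score, _tag = line.split()
        run[qid].append((docid, int(rank), float(score)))
    for hits in run.values():
        hits.sort(key=lambda hit: hit[1])  # order by rank
    return dict(run)


run = parse_trec_run(run_text)
print(run['3'])
```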

You can also add extra fields in your documents when needed, e.g. text features. For example, the SpaCy Named Entity Recognition (NER) result of contents could be stored as an additional field NER.

{
  "id": "doc1",
  "contents": "The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.",
  "NER": {
            "ORG": ["The Manhattan Project"],
            "MONEY": ["World War II"]
         }
}

Reproduction Guides

With Pyserini, it's easy to reproduce runs on a number of standard IR test collections!

Sparse Retrieval

Reproducing runs directly from the Python package
Reproducing Robust04 baselines for ad hoc retrieval
Reproducing the BM25 baseline for MS MARCO V1 Passage Ranking
Reproducing the BM25 baseline for MS MARCO V1 Document Ranking
Reproducing the multi-field BM25 baseline for MS MARCO V1 Document Ranking from Elasticsearch
Reproducing BM25 baselines on the MS MARCO V2 Collections
Reproducing DeepImpact experiments for MS MARCO V1 Passage Ranking
Reproducing uniCOIL experiments with doc2query-T5 expansions for MS MARCO V1
Reproducing uniCOIL experiments with TILDE expansions for MS MARCO V1 Passage Ranking
Reproducing uniCOIL experiments with TILDE expansions for MS MARCO V2 Passage Ranking
Reproducing uniCOIL experiments on the MS MARCO V2 Collections
Reproducing SPLADEv2 experiments for MS MARCO V1 Passage Ranking

Dense Retrieval

Reproducing TCT-ColBERTv1 experiments on the MS MARCO V1 Collections
Reproducing TCT-ColBERTv2 experiments on the MS MARCO V1 Collections
Reproducing TCT-ColBERTv2 experiments on the MS MARCO V2 Collections
Reproducing DPR experiments
Reproducing BPR experiments
Reproducing ANCE experiments
Reproducing DistilBERT KD experiments
Reproducing DistilBERT Balanced Topic Aware Sampling experiments
Reproducing SBERT dense retrieval experiments
Reproducing ADORE dense retrieval experiments
Reproducing Vector PRF experiments
Reproducing ANCE-PRF experiments

Baselines

Pyserini provides baselines for a number of datasets.

Baselines for KILT: a benchmark for Knowledge Intensive Language Tasks
Baselines for TripClick: a large-scale dataset of click logs in the health domain
Baselines (in Anserini) for the FEVER (Fact Extraction and VERification) dataset

Additional Documentation

Known Issues

Anserini is designed to work with JDK 11. There was a JRE path change above JDK 9 that breaks pyjnius 1.2.0, as documented in this issue, also reported in Anserini here and here. This issue was fixed with pyjnius 1.2.1 (released December 2019). The previous error was documented in this notebook and this notebook documents the fix.

Release History

v0.14.0: November 8, 2021 [Release Notes]
v0.13.0: July 3, 2021 [Release Notes]
v0.12.0: May 5, 2021 [Release Notes]
v0.11.0.0: February 18, 2021 [Release Notes]
v0.10.1.0: January 8, 2021 [Release Notes]
v0.10.0.1: December 2, 2020 [Release Notes]
v0.10.0.0: November 26, 2020 [Release Notes]
v0.9.4.0: June 26, 2020 [Release Notes]
v0.9.3.1: June 11, 2020 [Release Notes]
v0.9.3.0: May 27, 2020 [Release Notes]
v0.9.2.0: May 15, 2020 [Release Notes]
v0.9.1.0: May 6, 2020 [Release Notes]
v0.9.0.0: April 18, 2020 [Release Notes]
v0.8.1.0: March 22, 2020 [Release Notes]
v0.8.0.0: March 12, 2020 [Release Notes]
v0.7.2.0: January 25, 2020 [Release Notes]
v0.7.1.0: January 9, 2020 [Release Notes]
v0.7.0.0: December 13, 2019 [Release Notes]
v0.6.0.0: November 2, 2019

With v0.11.0.0 and before, Pyserini versions adopted the convention of X.Y.Z.W, where X.Y.Z tracks the version of Anserini, and W is used to distinguish different releases on the Python end. Starting with Anserini v0.12.0, Anserini and Pyserini versions have become decoupled.
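The convention can be expressed as a short sketch: a four-part version encodes the Anserini version in its first three components, while a three-part version carries no such information. The helper name anserini_version is hypothetical.

```python
def anserini_version(pyserini_version):
    """Return the Anserini version encoded in an X.Y.Z.W version, else None."""
    parts = pyserini_version.lstrip('v').split('.')
    if len(parts) == 4:
        # X.Y.Z tracks Anserini; W distinguishes releases on the Python end.
        return '.'.join(parts[:3])
    # Three-part versions are decoupled from Anserini's version.
    return None


for version in ['v0.10.0.1', 'v0.11.0.0', 'v0.12.0']:
    print(version, anserini_version(version))
```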

Comments

Dense search replication, starting from hgf model
Here's what I think our end target is: start with the hgf model from the model hub, and assume that's fixed.

1. Be able to encode corpus and queries - scripts for doing so should be in https://github.com/castorini/pyserini/tree/master/scripts

2. Scripts for building hnsw index, also in scripts/

(1) and (2) are what we store as "pre-built".

This will allow replication and bring every part of the pipeline in sync - other than training the encoder model.

@MXueguang @justram @jacklin64 thoughts?
opened by lintool 18
Multiple language support?

Hi,

Does pyserini currently support languages other than English? Specifically, I am asking about using features such as creating an index by python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator ... and using searcher.search. If yes, how do I integrate it in a Python script?

Thank you!

opened by velocityCavalry 16
SimpleSearcher.search memory leak

When calling search method of SimpleSearcher I noticed RAM usage increase with every new iteration. Could you tell me please how to decrease memory leak?

opened by dmitrijeuseew 16
Fold qrels into pyserini directly
Follow up to #310 - there, we folded the eval scripts directly into pyserini. Now let's do the same with the qrels.

In actuality, the qrels are already in the anserini jar, since this entire directory is included in the fatjar: https://github.com/castorini/anserini/tree/master/src/main/resources/topics-and-qrels

Trick is how to get the qrels out...

This is, in fact, how we can access the topics in anserini: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/search/topicreader/Topics.java#L22 https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/search/topicreader/TopicReader.java#L143

And pyserini just wraps the Java methods above.

With that background, I propose to apply the same treatment to qrels.

Add a method in Anserini (on the Java end) to read qrels from resources/topics-and-qrels/ into a String. We can use the same "ids" as the topics. Build around here: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/util/Qrels.java

On the Python end, we call the Java method, which reads the qrels as a string. Then we write back the string into ~/.cache/pyserini.

Our eval scripts can then reference ~/.cache/pyserini.

And at the end of the day, we'll be able to do this directly:

$ python -m pyserini.search --topics robust04 --index robust04 --output run.robust04.txt --bm25
$ python -m pyserini.eval.trec_eval --qrels robust04 -m map -m P.30 run.robust04.txt

(With no need to download any intermediate data... everything is self contained!)

@MXueguang thoughts? Do you like it? Any better way?
opened by lintool 16
Add automate downloading of indexes
Currently, this change supports 'ms-marco-passage', 'ms-marco-doc' and 'TREC Disks 4 & 5'.

If the index exists, skip the download and use the index under '(pyserini)/indexes'.

If not, download the index to cache(~/.cache/pyserini/indexes) and extract the index to (pyserini)/indexes. Finally, delete the gz file in cache. Should we keep the gz file in cache?
opened by qguo96 16
Resolve tiny differences between Anserini and Pyserini on MS MARCO: query iteration order
If we look at the Python replications: https://github.com/castorini/pyserini/blob/master/docs/pypi-replication.md Compared against Anserini replications: e.g., https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-doc-leaderboard.md

We'll note tiny differences - e.g., for MS MARCO doc, baselines - pyserini:

#####################
MRR @100: 0.2770296928568709
QueriesRanked: 5193
#####################

Compared to anserini:

#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
#####################

Previously, we tracked it down in issue #257

I'd like to fix it so we get identical results moving forward - my proposed fix is a bit janky, but it'll work: Let's just store, in Python code, an array of integers corresponding to ids of the queries in the original queries file. When we're iterating over the dataset in pyserini.search, we just follow the order of the integers.

Slightly better, we introduce a new query iterator abstraction and hide this implementation detail in there. So the query iterator would take in the current dictionary, and an optional array holding the iteration order.
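A rough sketch of what such an abstraction could look like follows. The class name QueryIterator, its constructor arguments, and the toy topics are hypothetical and serve only to illustrate the proposal; this is not the code that was eventually merged.

```python
class QueryIterator:
    """Iterate over (query id, query) pairs in a fixed, explicit order."""

    def __init__(self, topics, order=None):
        self.topics = topics
        # Without an explicit order, fall back to the dictionary's own order.
        self.order = list(order) if order is not None else list(topics)

    def __iter__(self):
        for qid in self.order:
            yield qid, self.topics[qid]


# Toy topics stored in an arbitrary order, plus the order of the original file.
topics = {3: 'third query', 1: 'first query', 2: 'second query'}
for qid, query in QueryIterator(topics, order=[1, 2, 3]):
    print(qid, query)
```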

Thoughts @MXueguang? I was thinking you could work on this?
opened by lintool 15
DPR replication docs

Hi @MXueguang - when everything is implemented DPR should probably get its own separate replication page, like for MS MARCO: https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md

Containing sparse, hybrid, and dense retrieval.

Then we can add a replication log also - starting point for people interested in working more on it.

opened by lintool 14
Incorrect encoding on Windows

When using pyserini under Windows, it seems that the encoding of strings is breaking when passed to the JNI via the pyjnius package.

It happens when a string is encoded as UTF-8 like this JString(my_str.encode('utf-8')) (e.g., https://github.com/castorini/pyserini/blob/master/pyserini/search/_searcher.py#L114). It only occurs under Windows as it must collide with the default Windows encoding CP-1252.

I discussed this issue with the maintainers of pyjnius and it seems that to make it work independently from the platform, the .encode('utf-8') could simply be dropped.

Was there a reason why this manual encoding was used in pyserini?

I created a branch with the changes, I could do a PR if you wish.

opened by stekiri 13

Dense retrieval draft

An example of usage: since the dense index doesn't contain raw data, I loaded the corpus separately.

import numpy as np
from pyserini.search import SimpleDenseSearcher

searcher = SimpleDenseSearcher.from_prebuilt_index('msmarco_passage_0', 'collection.tsv')

query_emb = np.random.random(768).astype('float32')
result = searcher.search(query_emb)

result[0].raw
>> 'Lander, WY Sales Tax Rate. The current total local sales tax rate in Lander, WY is 5.000%. The December 2015 total local sales tax rate was also 5.000%. Lander, WY is in Fremont County. Lander is in the following zip codes: 82520.'

result[0].docid
>> '350921'

result[0].score
>> 0.42547345

searcher.doc('123')
>> Document(docid='123', raw='With a number of condo developments springing up in the city, it can be difficult to narrow down your choices for the perfect Montreal condo for sale. Our skilled agents organize your steps towards meeting your goals with our condo projects located in popular and trendy neighbourhoods.')

opened by MXueguang 13

IndexOutOfBoundsException calling get_term_counts

This is code to print the top tf.idf-weighted terms from documents in a run:

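import heapq

from pyserini.index import IndexReader

# `run` maps each topic id to the list of docids retrieved for it.
reader = IndexReader.from_prebuilt_index('robust04')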
for topic, docs in run.items():
    print('---', topic)
    for doc in docs:
        print('---', doc)
        vec = reader.get_document_vector(doc)
        weighted = []
        for term, tf in vec.items():
            print('---', term, tf)
            df, cf = reader.get_term_counts(term)
            tfidf = tf / df
            heapq.heappush(weighted, (tfidf, term))
        for weight, term in heapq.nlargest(10, weighted):
            print(topic, doc, term, weight)

The run I am iterating is a BM25 retrieval run on robust04 from Pyserini. On topic 301, document FBIS4-40260, term 'it' (tf=2), I get the following error:

Traceback (most recent call last):
  File "/Users/soboroff/pyserini-fire/./top-terms.py", line 33, in <module>
    df, cf = reader.get_term_counts(term)
  File "/Users/soboroff/pyserini-fire/venv/lib/python3.10/site-packages/pyserini/index/_base.py", line 259, in get_term_counts
    term_map = self.object.getTermCountsWithAnalyzer(self.reader, JString(term.encode('utf-8')), analyzer)
  File "jnius/jnius_export_class.pxi", line 884, in jnius.JavaMethod.__call__
  File "jnius/jnius_export_class.pxi", line 1056, in jnius.JavaMethod.call_staticmethod
  File "jnius/jnius_utils.pxi", line 91, in jnius.check_exception
jnius.JavaException: JVM exception occurred: Index 0 out of bounds for length 0 java.lang.IndexOutOfBoundsException

opened by isoboroff 12

Unable to do Dense search against own index
My environment:

OS - Ubuntu 18.04

Java 11.0.11

Python 3.8.8

Python Package versions:

torch 1.8.1

faiss-cpu 1.7.0

pyserini 0.12.0

Problem 1

I followed instructions to create my own minimal index and was able to run the Sparse Retrieval example successfully. However, when I tried to run the Dense retrieval example using the TctColBertQueryEncoder, I encountered the following issues that seem to be caused by me having a newer version of the transformers library, where the requires_faiss and requires_pytorch methods have been replaced with a more general requires_backends method in transformers.file_utils. The following files were affected.

pyserini/dsearch/_dsearcher.py
pyserini/dsearch/_model.py

Problem 2

Replacing them in place in the Pyserini code in my site-packages allowed me to move forward, but now I get the error message:

RuntimeError: Error in faiss::FileIOReader::FileIOReader(const char*) at /__w/faiss-wheels/faiss-wheels/faiss/faiss/impl/io.cpp:81: Error: 'f' failed: could not open /path/to/lucene_index/index for reading: No such file or directory

The /path/to/lucene_index above is a folder where my lucene index was built using pyserini.index. I am guessing that an additional ANN index might be required to be built from the data to allow Dense searching to happen? I looked in the help for pyserini.index but there did not seem to be anything that indicated creation of ANN index.

I can live with the first problem (since I have a local solution) but obviously some fix to that would be nice. For the second problem, some documentation or help with building a local index for dense searching will be very much appreciated.

Thanks!
opened by sujitpal 12
Broken links in prebuilt READMEs

From here: https://github.com/castorini/pyserini/blob/master/docs/prebuilt-indexes.md

Link to robust04 README is broken. Might want to go through and make sure they all work...

opened by lintool 0
Fill in missing conditions in MS MARCO V1 repro matrix

Here: https://castorini.github.io/pyserini/2cr/msmarco-v1-passage.html

We're missing a bunch of conditions that we should add.

@MXueguang this is probably pretty easy to do right?

opened by lintool 0
Refactor Dependencies

Initial PR Based on https://github.com/castorini/pyserini/issues/1375

Modularize imports so that LuceneSearcher does not rely on Faiss, torch, and transformers

opened by ToluClassics 1
Importing LuceneSearcher relies on FAISS and Torch

Currently, importing LuceneSearcher fails if faiss and torch aren't installed. (They aren't installed by design because they're platform-specific, see: https://github.com/castorini/pyserini#installation)

This is likely caused by the imports in the following __init__ file: https://github.com/castorini/pyserini/blob/master/pyserini/search/__init__.py#L23-L26

A fix would need to modularize those imports.

If no one gets to it before me, I will attempt to send a PR to fix this.

opened by cakiki 1

Releases(pyserini-0.19.2)

pyserini-0.19.2(Dec 17, 2022)

https://pypi.org/project/pyserini/0.19.2/
Source code(tar.gz)
Source code(zip)
pyserini-0.19.1(Nov 12, 2022)

https://pypi.org/project/pyserini/0.19.1/
Source code(tar.gz)
Source code(zip)
pyserini-0.19.0(Nov 2, 2022)

https://pypi.org/project/pyserini/0.19.0/
Source code(tar.gz)
Source code(zip)
pyserini-0.18.0(Sep 26, 2022)

https://pypi.org/project/pyserini/0.18.0/
Source code(tar.gz)
Source code(zip)
pyserini-0.17.1(Aug 13, 2022)

https://pypi.org/project/pyserini/0.17.1/
Source code(tar.gz)
Source code(zip)
pyserini-0.17.0(May 28, 2022)

https://pypi.org/project/pyserini/0.17.0/
Source code(tar.gz)
Source code(zip)
pyserini-0.16.1(May 12, 2022)

https://pypi.org/project/pyserini/0.16.1/
Source code(tar.gz)
Source code(zip)
pyserini-0.16.0(Mar 2, 2022)

https://pypi.org/project/pyserini/0.16.0/
Source code(tar.gz)
Source code(zip)
pyserini-0.15.0(Jan 21, 2022)

https://pypi.org/project/pyserini/0.15.0/
Source code(tar.gz)
Source code(zip)
pyserini-0.14.0(Nov 8, 2021)

https://pypi.org/project/pyserini/0.14.0/
Source code(tar.gz)
Source code(zip)
pyserini-0.13.0(Jul 3, 2021)

https://pypi.org/project/pyserini/0.13.0/
Source code(tar.gz)
Source code(zip)
pyserini-0.12.0(May 5, 2021)

https://pypi.org/project/pyserini/0.12.0/
Source code(tar.gz)
Source code(zip)
pyserini-0.11.0.0(Feb 18, 2021)

https://pypi.org/project/pyserini/0.11.0.0/
Source code(tar.gz)
Source code(zip)
pyserini-0.10.1.0(Jan 8, 2021)

https://pypi.org/project/pyserini/0.10.1.0/
Source code(tar.gz)
Source code(zip)
pyserini-0.10.0.1(Dec 2, 2020)

https://pypi.org/project/pyserini/0.10.0.1/
Source code(tar.gz)
Source code(zip)
pyserini-0.10.0.0(Nov 26, 2020)

https://pypi.org/project/pyserini/0.10.0.0/
Source code(tar.gz)
Source code(zip)
pyserini-0.9.4.0(Jun 26, 2020)

https://pypi.org/project/pyserini/0.9.4.0/
Source code(tar.gz)
Source code(zip)
pyserini-0.9.3.1(Jun 11, 2020)

https://pypi.org/project/pyserini/0.9.3.1/
Source code(tar.gz)
Source code(zip)
pyserini-0.9.3.0(May 27, 2020)

https://pypi.org/project/pyserini/0.9.3.0/
Source code(tar.gz)
Source code(zip)
pyserini-0.9.2.0(May 15, 2020)

https://pypi.org/project/pyserini/0.9.2.0/
Source code(tar.gz)
Source code(zip)
pyserini-0.9.1.0(May 6, 2020)

https://pypi.org/project/pyserini/0.9.1.0/
Source code(tar.gz)
Source code(zip)
pyserini-0.9.0.0(Apr 18, 2020)

https://pypi.org/project/pyserini/
Source code(tar.gz)
Source code(zip)
pyserini-0.8.1.0(Mar 22, 2020)

https://pypi.org/project/pyserini/
pyserini-0.8.0.0(Mar 12, 2020)

https://pypi.org/project/pyserini/
pyserini-0.7.2.0(Jan 25, 2020)

https://pypi.org/project/pyserini/
pyserini-0.7.1.0(Jan 10, 2020)

https://pypi.org/project/pyserini/
pyserini-0.7.0.0(Dec 13, 2019)

https://pypi.org/project/pyserini/
pyserini-0.6.0.0(Nov 2, 2019)

https://pypi.org/project/pyserini/

Owner

Castorini
