ColBERT: Contextualized Late Interaction over BERT (SIGIR'20)

Update: if you're looking for the ColBERTv2 code, you can find it, alongside a new and simpler API, in the new_api branch.

ColBERT

ColBERT is a fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds.

Figure 1: ColBERT's late interaction, efficiently scoring the fine-grained similarity between a query and a passage.

As Figure 1 illustrates, ColBERT relies on fine-grained contextual late interaction: it encodes each passage into a matrix of token-level embeddings (shown above in blue). Then at search time, it embeds every query into another matrix (shown in green) and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
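To make the late interaction concrete, below is a minimal PyTorch sketch of MaxSim scoring for a single query-passage pair. It illustrates the operator described above, not the repo's batched implementation:

import torch

def late_interaction_score(Q, D):
    # Q: (num_query_tokens, dim) query token embeddings.
    # D: (num_doc_tokens, dim) passage token embeddings.
    # L2-normalize so dot products are cosine similarities.
    Q = torch.nn.functional.normalize(Q, dim=-1)
    D = torch.nn.functional.normalize(D, dim=-1)
    sim = Q @ D.T  # (num_query_tokens, num_doc_tokens) similarity matrix
    # MaxSim: each query token keeps its best-matching passage token;
    # the passage score is the sum of these maxima over query tokens.
    return sim.max(dim=1).values.sum()

score = late_interaction_score(torch.randn(32, 128), torch.randn(180, 128))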

These rich interactions allow ColBERT to surpass the quality of single-vector representation models, while scaling efficiently to large corpora. You can read more in our papers:

  • ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (SIGIR 2020)
  • Relevance-guided Supervision for OpenQA with ColBERT (TACL 2021)
  • ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction (NAACL 2022)

Installation

ColBERT (currently: v0.2.0) requires Python 3.7+ and PyTorch 1.6+, and uses the HuggingFace Transformers library.

We strongly recommend creating a conda environment using:

conda env create -f conda_env.yml
conda activate colbert-v0.2

If you face any problems, please open a new issue and we'll help you promptly!

Overview

Using ColBERT on a dataset typically involves the following steps.

Step 0: Preprocess your collection. At its simplest, ColBERT works with tab-separated (TSV) files: a file (e.g., collection.tsv) will contain all passages and another (e.g., queries.tsv) will contain a set of queries for searching the collection.

Step 1: Train a ColBERT model. You can train your own ColBERT model and validate performance on a suitable development set.

Step 2: Index your collection. Once you're happy with your ColBERT model, you need to index your collection to permit fast retrieval. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.

Step 3: Search the collection with your queries. Given your model and index, you can issue queries over the collection to retrieve the top-k passages for each query.

Below, we illustrate these steps via an example run on the MS MARCO Passage Ranking task.

Data

This repository works directly with a simple tab-separated file format to store queries, passages, and top-k ranked lists.

  • Queries: each line is qid \t query text.
  • Collection: each line is pid \t passage text.
  • Top-k Ranking: each line is qid \t pid \t rank.

This works directly with the data format of the MS MARCO Passage Ranking dataset. You will need the training triples (triples.train.small.tar.gz), the official top-1000 ranked lists for the dev set queries (top1000.dev), and the dev set relevant passages (qrels.dev.small.tsv). For indexing the full collection, you will also need the list of passages (collection.tar.gz).
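For illustration, here is a minimal Python sketch of reading these files (the file names follow the examples above; this is not the repo's own loader):

def read_tsv(path):
    # Each line: id \t text (qid for queries, pid for passages).
    data = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            idx, text = line.rstrip("\n").split("\t", 1)
            data[int(idx)] = text
    return data

collection = read_tsv("collection.tsv")  # pid -> passage text
queries = read_tsv("queries.tsv")        # qid -> query text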

Training

Training requires a list of <query, positive passage, negative passage> tab-separated triples.

You can supply full-text triples, where each line is query text \t positive passage text \t negative passage text. Alternatively, you can supply query and passage IDs as a JSONL file with one [qid, pid+, pid-] triple per line, in which case you should specify --collection path/to/collection.tsv and --queries path/to/queries.train.tsv.
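As a sketch, reading one line of each accepted triples format in Python (the JSONL file name here is hypothetical; only the formats are from the repo):

import json

# Full-text format: query \t positive passage \t negative passage.
with open("triples.train.small.tsv", encoding="utf-8") as f:
    query, positive, negative = next(f).rstrip("\n").split("\t")

# ID format: one [qid, pid+, pid-] JSON array per line, to be resolved
# against --queries and --collection.
with open("triples.train.ids.jsonl", encoding="utf-8") as f:  # hypothetical name
    qid, positive_pid, negative_pid = json.loads(next(f))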

CUDA_VISIBLE_DEVICES="0,1,2,3" \
python -m torch.distributed.launch --nproc_per_node=4 -m \
colbert.train --amp --doc_maxlen 180 --mask-punctuation --bsize 32 --accum 1 \
--triples /path/to/MSMARCO/triples.train.small.tsv \
--root /root/to/experiments/ --experiment MSMARCO-psg --similarity l2 --run msmarco.psg.l2

You can use one or more GPUs by modifying CUDA_VISIBLE_DEVICES and --nproc_per_node.

Validation

Before indexing into ColBERT, you can compare a few checkpoints by re-ranking a top-k set of documents per query. This will use ColBERT on-the-fly: it will compute document representations during query evaluation.

This script requires the top-k list per query, provided as a tab-separated file whose every line contains a tuple queryID \t passageID \t rank, where rank is {1, 2, 3, ...} for each query. The script also accepts the format of MS MARCO's top1000.dev and top1000.eval, and you can optionally supply relevance judgments (qrels) for evaluation. This is a tab-separated file whose every line has a quadruple <query ID, 0, passage ID, 1>, like qrels.dev.small.tsv.
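For intuition, here is a simplified Python sketch of computing MRR@10 from files in these two formats; it is a stand-in for the script's own evaluation, not a copy of it:

from collections import defaultdict

def mrr_at_10(ranking_path, qrels_path):
    relevant = defaultdict(set)
    with open(qrels_path) as f:
        for qid, _, pid, _ in (line.split() for line in f):  # qid 0 pid 1
            relevant[qid].add(pid)

    best = {}  # qid -> rank of the first relevant passage within the top 10
    with open(ranking_path) as f:
        for qid, pid, rank in (line.split()[:3] for line in f):  # qid pid rank
            if pid in relevant.get(qid, ()) and int(rank) <= 10:
                best[qid] = min(best.get(qid, 10), int(rank))

    return sum(1.0 / r for r in best.values()) / max(len(relevant), 1)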

Example command:

python -m colbert.test --amp --doc_maxlen 180 --mask-punctuation \
--collection /path/to/MSMARCO/collection.tsv \
--queries /path/to/MSMARCO/queries.dev.small.tsv \
--topk /path/to/MSMARCO/top1000.dev  \
--checkpoint /root/to/experiments/MSMARCO-psg/train.py/msmarco.psg.l2/checkpoints/colbert-200000.dnn \
--root /root/to/experiments/ --experiment MSMARCO-psg  [--qrels path/to/qrels.dev.small.tsv]

Indexing

For fast retrieval, indexing precomputes the ColBERT representations of passages.
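In spirit, indexing encodes every passage once and saves its token-embedding matrix. A simplified sketch follows; encode_fn is a hypothetical wrapper around the trained document encoder, and the flat list of tensors stands in for the repo's actual storage layout:

import torch

def index_collection(encode_fn, passages, out_path, bsize=256):
    # encode_fn: maps a list of passage strings to a list of
    # (num_tokens, dim) tensors -- hypothetical, not a repo API.
    matrices = []
    for start in range(0, len(passages), bsize):  # like --bsize 256
        matrices.extend(encode_fn(passages[start:start + bsize]))
    torch.save(matrices, out_path)  # one matrix per passage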

Example command:

CUDA_VISIBLE_DEVICES="0,1,2,3" OMP_NUM_THREADS=6 \
python -m torch.distributed.launch --nproc_per_node=4 -m \
colbert.index --amp --doc_maxlen 180 --mask-punctuation --bsize 256 \
--checkpoint /root/to/experiments/MSMARCO-psg/train.py/msmarco.psg.l2/checkpoints/colbert-200000.dnn \
--collection /path/to/MSMARCO/collection.tsv \
--index_root /root/to/indexes/ --index_name MSMARCO.L2.32x200k \
--root /root/to/experiments/ --experiment MSMARCO-psg

The index created here allows you to re-rank the top-k passages retrieved by another method (e.g., BM25).

We typically recommend that you use ColBERT for end-to-end retrieval, where it directly finds its top-k passages from the full collection. For this, you need FAISS indexing.

FAISS Indexing for end-to-end retrieval

For end-to-end retrieval, you should index the document representations into FAISS.

python -m colbert.index_faiss \
--index_root /root/to/indexes/ --index_name MSMARCO.L2.32x200k \
--partitions 32768 --sample 0.3 \
--root /root/to/experiments/ --experiment MSMARCO-psg
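Conceptually, this step clusters the stored token embeddings into --partitions cells so that search only probes a few cells per query. Below is a rough faiss sketch of that idea, with made-up sizes (the repo's index_faiss handles the real embeddings, and the MS MARCO run above uses 32768 partitions):

import faiss
import numpy as np

dim, partitions = 128, 4096
embeddings = np.random.randn(100_000, dim).astype("float32")  # stand-in for ColBERT token embeddings

quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFFlat(quantizer, dim, partitions)
index.train(embeddings[: int(0.3 * len(embeddings))])  # like --sample 0.3
index.add(embeddings)

index.nprobe = 32                                    # like --nprobe at search time
distances, ids = index.search(embeddings[:4], 1024)  # like --faiss_depth 1024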

Retrieval

In the simplest case, you want to retrieve from the full collection:

python -m colbert.retrieve \
--amp --doc_maxlen 180 --mask-punctuation --bsize 256 \
--queries /path/to/MSMARCO/queries.dev.small.tsv \
--nprobe 32 --partitions 32768 --faiss_depth 1024 \
--index_root /root/to/indexes/ --index_name MSMARCO.L2.32x200k \
--checkpoint /root/to/experiments/MSMARCO-psg/train.py/msmarco.psg.l2/checkpoints/colbert-200000.dnn \
--root /root/to/experiments/ --experiment MSMARCO-psg

You may also want to re-rank a top-k set that you've retrieved before with ColBERT or with another model. For this, use colbert.rerank similarly and additionally pass --topk.

If you have a large set of queries (or want to reduce memory usage), use batch-mode retrieval and/or re-ranking. This can be done by passing --batch --retrieve_only to colbert.retrieve and passing --batch --log-scores to colbert.rerank alongside --topk with the unordered.tsv output of this retrieval run.

Some use cases (e.g., building a user-facing search engine) require more control over retrieval. For those, you typically don't want to use the command line for retrieval. Instead, you want to import our retrieval API from Python and work with it directly (e.g., to build a simple REST API). Instructions for this are coming soon, but you will just need to adapt/modify the retrieval loop in colbert/ranking/retrieval.py#L33.
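Until then, here is a hedged sketch of such a wrapper, assuming you have adapted that retrieval loop into a search(query, k) function; the function, route, and choice of Flask are illustrative, not part of the repo:

from flask import Flask, jsonify, request

app = Flask(__name__)

def search(query, k=10):
    # Hypothetical: adapt the loop in colbert/ranking/retrieval.py to
    # return a list of (pid, score) pairs for a single query string.
    raise NotImplementedError

@app.route("/search")
def handle_search():
    query = request.args.get("query", "")
    k = int(request.args.get("k", 10))
    return jsonify([{"pid": pid, "score": score} for pid, score in search(query, k)])

if __name__ == "__main__":
    app.run(port=8080)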

Releases

  • v0.2.0: Sep 2020
  • v0.1.0: June 2020

Comments
  • question about unordered.tsv

    @okhat

    Because I have lots of queries that I want to process, I wanted to train in batches so I used the following command for retrieval:

    !python -m colbert.retrieve --amp --doc_maxlen 512 --query_maxlen 512 --bsize 1 \
    --queries small_test_queries.tsv --partitions 65536 --index_root ./experiments/indexes --index_name large_train_index \
    --checkpoint ./experiments/dirty/train.py/2021-12-06_08.01.48/checkpoints/colbert-32000.dnn \
    --depth 10000 --batch --retrieve_only
    

    In doing so it creates a file "unordered.tsv", but the results in the file look weird (screenshot omitted).

    From my understanding, the columns are (query id, document id, rank), but the rank column is filled with -1.

    When I run validation on a single query using ColBERT on the fly, it produces pretty good results, though of course it is slow because I have not done the necessary preprocessing. (However, I believe this suggests that my model has been trained properly, so the issue probably does not have to do with BERT itself.)

    opened by puzzlecollector 33
  • Can't build faiss index

    Thanks for the great repo! I'm trying to build a faiss index for retrieval, but can't get the script to run. I was originally using python3.8 and torch 1.8 in a docker container, but also downgraded to torch 1.6 to see if that would work.

    I'm running

    CUDA_VISIBLE_DEVICES="0,1" OMP_NUM_THREADS=1 \
    python3.8 -m torch.distributed.launch --nproc_per_node=1 -m \
            index --root $PWD/experiments/ --amp --doc_maxlen 180 --mask-punctuation --bsize 256 \
            --checkpoint out-of-the-box-model.pt \
            --collection passages.tsv \
            --index_root /faiss --index_name INDEX \
            --root $PWD/experiments/ --experiment out_of_the_box \
            --local_rank 0
    

    But I get the error "Default process group is not initialized". That also happens even if I manually put dist.init_process_group('nccl') in the script.

    Do you know why this is happening?

    Thanks!

    opened by JamesDeAntonis 23
  • Instructions on using ColBERT

    @okhat I am trying to use ColBERT for a document retrieval project I am working on and I'd like to ask if I have understood the procedure correctly. I am trying to perform a ranking task based on the similarity of passages. So if a query passage comes in, then among the document passages I have, the system has to retrieve the top-K most similar documents to the query.

    • Because both the queries and documents are long, I guess I would have to first preprocess them using utility/preprocess/docs2passages. If my understanding is correct, this method simply chunks the long text in a sliding-window manner, right?

    • Afterwards I need to prepare the dataset following the format query \t positive passage \t negative passage in .tsv format. Then I type in the following command in the command line to train my custom ColBERT model:

    python -m colbert.train --amp --doc_maxlen 180 --mask-punctuation --bsize 32 --accum 1 --triples train_dataset.tsv
    
    • Once the training is complete, a model checkpoint file will be saved. From this step onwards, I am planning to use the saved checkpoint and the pyterrier framework.

    • I am aware that first I need to index all my test documents using FAISS, and I am planning to use the following code to do so:

    import pyterrier_colbert.indexing
    indexer = pyterrier_colbert.indexing.ColBERTIndexer(checkpoint, "/content", "colbertindex", chunksize=3)
    indexer.index(test_df)
    

    Here the test_df is a corpus of test documents (pandas dataframe) where it will have two columns: docno and text.

    • Once the indexing is done, I will proceed to passing in the test queries to rank the top-K documents that were indexed in the previous step. To achieve this I will use the following code (again using pyterrier):
    pyterrier_colbert_factory = indexer.ranking_factory()
    
    colbert_e2e = pyterrier_colbert_factory.end_to_end()
    (colbert_e2e % 10).search(<query text>)
    

    Does this step look about right? I will proceed as I have written above and if I encounter any problems I will ask again in this thread. Thank you :)

    opened by puzzlecollector 22
  • Errors when trying to interface directly with the underlying API for re-ranking

    I keep getting the pictured error (screenshot omitted).

    Upon investigation, I see that stride is referenced here but isn't defined earlier in the method. Can you please explain whether this is intended or a bug?

    Thanks

    opened by JamesDeAntonis 18
  • Make possible to pip install

    Hello, thanks for your repository and SIGIR paper. We would like to develop wrappers on top of ColBERT. Would it be possible to make the repo compatible with pip? This would need:

    • make a setup.py
    • rename the src directory to colbert
    opened by cmacdonald 14
  • Performance Issues with RoBERTa Models

    I am currently training a multilingual model with your approach, and with bert-base-multilingual-uncased it works great. Now I have tried switching to xlm-roberta-base (which in general is better pre-trained than mBERT), but performance is far off. Both are trained on the same system with the same batch size.

    Here is a plot of the loss over training steps (plot omitted).

    Evaluation performance is very different as well: mBERT @ 32k steps reaches MRR@10 = 0.22, while XLM-RoBERTa @ 32k steps reaches MRR@10 = 0.07.

    As RoBERTa uses a BPE vocab, I had to add the unused tokens by hand and initialize their embeddings randomly (transformers does that with mean=0 and std=0.02):

    self.tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
    self.tokenizer.add_tokens(["[unused0]"])
    self.tokenizer.add_tokens(["[unused1]"])

    self.skiplist = {w: True for w in string.punctuation}

    self.bert = XLMRobertaModel(config)
    self.bert.resize_token_embeddings(len(self.tokenizer))


    https://github.com/Phil1108/ColBERT/blob/b786e2e7ef1c13bac97a5f8a35aa9e5ff9f3425f/src/model.py#L24

    Is the bad performance caused by RoBERTa not including next-sentence prediction in its pre-training? Or is the ColBERT approach not transferable to BPE vocabs? Or is my way of adding unused tokens to the vocab causing undetected problems?

    opened by Phil1108 13
  • Error when indexing with ColBERTv2

    Hi, I'm using the new_api branch and the provided ColBERTv2 checkpoint to index a 2M dataset, but the indexer doesn't seem to be able to create an ivf.pt, which I assume is related to faiss (screenshot omitted). It seems like something is getting lost. I'll debug, but I just thought of dropping this here first.

    opened by vjeronymo2 10
  • about ColBERT(BertPreTrainedModel)

    Hello, I am reading your code to replicate the experiment. I have some questions about the model in "model.py".

    1. In the query() function, "queries" are word lists, so they cannot be passed to the self.tokenizer.encode() function. The standard input for tokenizer.encode() should be text.
    2. In the doc() function,
    docs = [["[unused1]"] + self._tokenize(d)[:self.doc_maxlen-3] for d in docs]
    

    the result of "self._tokenize()" is a word list, not a word-piece list, so it is improper to cut it with doc_maxlen, which limits the number of word-piece tokens.
    3. Although the paper says "Unlike queries, we do not append [mask] tokens to documents.", in the code the encoding function is "_encode()" for both queries and docs, with the same [mask] padding.

    opened by KaishuaiXu 9
  • UnboundLocalError: local variable 'batch_idx' referenced before assignment

    I am trying to train a ColBERT model on a new dataset based on the code snippet from the README. I get stuck here:

    #> Starting...
    nranks = 2 	 num_gpus = 2 	 device=0
    #> Starting...
    nranks = 2 	 num_gpus = 2 	 device=1
    Using config.bsize = 16 (per process) and config.accumsteps = 1
    {
        "ncells": null,
        "centroid_score_threshold": null,
        "ndocs": null,
        "index_path": null,
        "nbits": 1,
        "kmeans_niters": 4,
        "resume": false,
        "similarity": "cosine",
        "bsize": 32,
        "accumsteps": 1,
        "lr": 3e-6,
        "maxsteps": 500000,
        "save_every": null,
        "warmup": null,
        "warmup_bert": null,
        "relu": false,
        "nway": 2,
        "use_ib_negatives": false,
        "reranker": false,
    ...
    [Jul 18, 19:30:32] #> Got 98380 queries. All QIDs are unique.
    
    [Jul 18, 19:30:32] #> Got 98380 queries. All QIDs are unique.
    
    Some weights of HF_ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    Some weights of HF_ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    [Jul 18, 19:30:42] #> Done with all triples!
    Process Process-3:
    Traceback (most recent call last):
      File "/home/IAIS/ebritochac/anaconda3/envs/colbert/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
        self.run()
      File "/home/IAIS/ebritochac/anaconda3/envs/colbert/lib/python3.7/multiprocessing/process.py", line 99, in run
        self._target(*self._args, **self._kwargs)
      File "../colbert/infra/launcher.py", line 115, in setup_new_process
        return_val = callee(config, *args)
      File "../colbert/training/training.py", line 146, in train
        ckpt_path = manage_checkpoints(config, colbert, optimizer, batch_idx+1, savepath=None, consumed_all_triples=True)
    UnboundLocalError: local variable 'batch_idx' referenced before assignment
    
    opened by ebritoc 8
  • Training stopping after few hundred steps

    Hello Omar, thank you very much for sharing your awesome work. I am currently trying to test ColBERT on the BioASQ 8 dataset (3,200 queries with approximately 10 relevant documents per query in the training set).

    I have two questions about ColBERT.

    1. The main problem I face is that when I train ColBERT on a triples.p file, the training sometimes stops after a couple hundred or thousand steps: i.e., each step takes less than a second, and then it stops printing and saving checkpoints, but the process is still running and doesn't exit. Is there somewhere in the code where training stops when the average loss doesn't improve anymore? I don't see it in the code. And my GPU still has available memory.

    2. Also, I was wondering if there is a reason why you don't consider epochs in your training code. I guess that for the MS MARCO dataset, the triples.p file is long enough. But in general, there is no problem iterating over the training set multiple times, right? In particular, I want to sample the negatives not randomly but from BM25 negatives.

    Again thank you for this great repository. Alexandre

    opened by alexjout 8
  • ModuleNotFoundError: No module named 'utility'

    Hi, I wanted to use your model in my Python project and therefore added it (in a virtual pip env) using the following command:

    pip install git+https://github.com/stanford-futuredata/ColBERT
    

    However, when I start my Python project and try to use your demo code from the demo notebook, I always get the following error:

    Traceback (most recent call last):
      File "C:\backend\webservice.py", line 8, in <module>
        from controllers.colbert_controller import ColBertController
      File "C:\backend\controllers\colbert_controller.py", line 3, in <module>
        from colbert.infra import Run, RunConfig, ColBERTConfig
      File "C:\backend\venv\lib\site-packages\colbert\__init__.py", line 1, in <module>
        from .trainer import Trainer
      File "C:\backend\venv\lib\site-packages\colbert\trainer.py", line 1, in <module>
        from colbert.infra.run import Run
      File "C:\backend\venv\lib\site-packages\colbert\infra\__init__.py", line 1, in <module>
        from .run import *
      File "C:\backend\venv\lib\site-packages\colbert\infra\run.py", line 7, in <module>
        from colbert.infra.config import RunConfig
      File "C:\backend\venv\lib\site-packages\colbert\infra\config\__init__.py", line 1, in <module>
        from .config import *
      File "C:\backend\venv\lib\site-packages\colbert\infra\config\config.py", line 3, in <module>
        from .base_config import BaseConfig
      File "C:\backend\venv\lib\site-packages\colbert\infra\config\base_config.py", line 11, in <module>
        from utility.utils.save_metadata import get_metadata_only
    ModuleNotFoundError: No module named 'utility'
    

    Any idea how to solve this error? Thanks.

    opened by lalenzos 7
  • Possible to train from checkpointed model with new triples?

    I have successfully trained a new model from scratch using triples generated from the ground truth of queries and relevant docs in my collection. However, when trying to train with these triples starting from the provided ColBERTv2 checkpoint, I run into an assertion error:

    AssertionError: Q.size(0)=1024, D_padded.size(0)=32
    Traceback (most recent call last):
      <skipped previous calls>
      File "<repo_base>/ColBERT/colbert/modeling/colbert.py", line 173, in colbert_score
        assert Q.size(0) in [1, D_padded.size(0)], f"Q.size(0)={Q.size(0)}, D_padded.size(0)={D_padded.size(0)}"
    AssertionError: Q.size(0)=1024, D_padded.size(0)=32
    

    (I added the assertion message.)

    I tried to do this by following the code snippet in the training section of the README, but making the following change:

    checkpoint_path = trainer.train(checkpoint=initial_checkpoint_path) 
    

    where initial_checkpoint_path points to the dir containing the downloaded checkpoint.

    Is this the correct way? Thanks.

    opened by bagchisu 0
  • indices should be either on cpu or on the same device as the indexed tensor (cpu)

    It happens when I run intro.ipynb:

    results = searcher.search(query, k=3)
    

    The output is:

    RuntimeError                              Traceback (most recent call last)
    Cell In[19], line 6
          3 print(f"#> {query}")
          5 # Find the top-3 passages for this query
    ----> 6 results = searcher.search(query, k=3)
          8 # Print out the top-k retrieved passages
          9 for passage_id, passage_rank, passage_score in zip(*results):
    
    File ~/ColBERT/docs/../colbert/searcher.py:61, in Searcher.search(self, text, k, filter_fn)
         59 def search(self, text: str, k=10, filter_fn=None):
         60     Q = self.encode(text)
    ---> 61     return self.dense_search(Q, k, filter_fn=filter_fn)
    
    File ~/ColBERT/docs/../colbert/searcher.py:108, in Searcher.dense_search(self, Q, k, filter_fn)
        105     if self.config.ndocs is None:
        106         self.configure(ndocs=max(k * 4, 4096))
    --> 108 pids, scores = self.ranker.rank(self.config, Q, filter_fn=filter_fn)
        110 return pids[:k], list(range(1, k+1)), scores[:k]
    
    File ~/ColBERT/docs/../colbert/search/index_storage.py:79, in IndexScorer.rank(self, config, Q, filter_fn)
         77 def rank(self, config, Q, filter_fn=None):
         78     with torch.inference_mode():
    ---> 79         pids, centroid_scores = self.retrieve(config, Q)
         81         if filter_fn is not None:
         82             pids = filter_fn(pids)
    
    File ~/ColBERT/docs/../colbert/search/index_storage.py:69, in IndexScorer.retrieve(self, config, Q)
         67 def retrieve(self, config, Q):
         68     Q = Q[:, :config.query_maxlen]   # NOTE: Candidate generation uses only the query tokens
    ---> 69     embedding_ids, centroid_scores = self.generate_candidates(config, Q)
         71     return embedding_ids, centroid_scores
    File ~/ColBERT/docs/../colbert/search/candidate_generation.py:55, in CandidateGeneration.generate_candidates(self, config, Q)
         52     Q = Q.cuda().half()
         53 assert Q.dim() == 2
    ---> 55 pids, centroid_scores = self.generate_candidate_pids(Q, ncells)
         57 sorter = pids.sort()
         58 pids = sorter.values
    
    File ~/ColBERT/docs/../colbert/search/candidate_generation.py:34, in CandidateGeneration.generate_candidate_pids(self, Q, ncells)
         31 def generate_candidate_pids(self, Q, ncells):
         32     cells, scores = self.get_cells(Q, ncells)
    ---> 34     pids, cell_lengths = self.ivf.lookup(cells)
         35     if self.use_gpu:
         36         pids = pids.cuda()
    
    File ~/ColBERT/docs/../colbert/search/strided_tensor.py:75, in StridedTensor.lookup(self, pids, output)
         74 def lookup(self, pids, output='packed'):
    ---> 75     pids, lengths, offsets = self._prepare_lookup(pids)
         77     if self.use_gpu:
         78         stride = lengths.max().item()
    
    File ~/ColBERT/docs/../colbert/search/strided_tensor.py:67, in StridedTensor._prepare_lookup(self, pids)
         65     pids = pids.cuda()
         66 pids = pids.long()
    ---> 67 lengths = self.lengths[pids]
         68 if self.use_gpu:
         69     lengths = lengths.cuda()
    
    RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
    

    It seems like the device setting got mismatched, so I overrode the _prepare_lookup function:

    def _prepare_lookup(self, pids):
        if isinstance(pids, list):
            pids = torch.tensor(pids)

        assert pids.dim() == 1

        if self.use_gpu:
            pids = pids.cuda()
        pids = pids.long()
        lengths = self.lengths[pids]
        if self.use_gpu:
            lengths = lengths.cuda()
        offsets = self.offsets[pids]

        return pids, lengths, offsets
    

    But another device mismatch happened.

    opened by lyj201002 4
  • Subclass from colbert.data.Collection fails cast assertion

    The cast() classmethod in colbert.data.Collection uses the following to test the type of the obj argument:

            if type(obj) is cls:
                return obj
    

    This fails when the collection object passed to colbert.Searcher is a subclass of Collection. Perhaps a better implementation for the type check in cast would be:

            if isinstance(obj, cls):
                return obj
    
    opened by bagchisu 3
  • intro.ipynb notebook not running first cell on CPU environment: missing modules

    If you try to run the first cell, you get at the end of the trace the error ModuleNotFoundError: No module named 'git'. I fixed it with

    pip install GitPython
    

    Then you get an error about missing transformers, so I ran this:

     pip install transformers[torch]
    

    Then it runs fine. These two packages should be added to the CPU environment YAML file, or added to the README.md as part of the install instructions.

    opened by jramirezpr 2
  • Decompression returns zero vectors

    Hi, when reconstructing vectors from codes and residuals I always get zero vectors. The relevant code is related to the torch extensions. See the minimal example:

    import torch

    from colbert.indexing.codecs.residual import ResidualCodec, ResidualEmbeddings
    
    codec = ResidualCodec.load("/path/to/index/")
    
    a = ResidualEmbeddings(torch.Tensor([1,2,3]),
                           torch.randint(256, (3, 32),
                                         dtype=torch.uint8))
    
    codec.decompress_residuals(a.residuals,
                               codec.bucket_weights,
                               codec.reversed_bit_map,
                               codec.decompression_lookup_table,
                               a.codes,
                               codec.centroids,
                               codec.dim,
                               codec.nbits)
    

    This returns zero vectors. The index has content (I examined the .pt files).

    Did anyone encounter this? is this a bug or an issue with GPU drivers?

    Thanks!

    opened by danielfleischer 12
Owner
Stanford Future Data Systems
We are a CS research group at Stanford building data-intensive systems
NAACL2021 - COIL Contextualized Lexical Retriever

COIL Repo for our NAACL paper, COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List. The code covers learning

Luyu Gao 108 Dec 31, 2022
I-BERT: Integer-only BERT Quantization

I-BERT: Integer-only BERT Quantization HuggingFace Implementation I-BERT is also available in the master branch of HuggingFace! Visit the following li

Sehoon Kim 139 Dec 27, 2022
Source code for NAACL 2021 paper "TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference"

TR-BERT Source code and dataset for "TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference". The code is based on huggaface's transformers.

THUNLP 37 Oct 30, 2022
LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021)

LV-BERT Introduction In this repo, we introduce LV-BERT by exploiting layer variety for BERT. For detailed description and experimental results, pleas

Weihao Yu 14 Aug 24, 2022
VD-BERT: A Unified Vision and Dialog Transformer with BERT

VD-BERT: A Unified Vision and Dialog Transformer with BERT PyTorch Code for the following paper at EMNLP2020: Title: VD-BERT: A Unified Vision and Dia

Salesforce 44 Nov 1, 2022
Pre-trained BERT Models for Ancient and Medieval Greek, and associated code for LaTeCH 2021 paper titled - "A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek"

Ancient Greek BERT The first and only available Ancient Greek sub-word BERT model! State-of-the-art post fine-tuning on Part-of-Speech Tagging and Mor

Pranaydeep Singh 22 Dec 8, 2022
PaddleRobotics is an open-source algorithm library for robots based on Paddle, including open-source parts such as human-robot interaction, complex motion control, environment perception, SLAM positioning, and navigation.

Simplified Chinese | English — PaddleRobotics is an open-source robotics algorithm library based on Paddle, covering human-robot interaction, complex motion control, environment perception, and SLAM localization and navigation. Human-robot interaction: TFVT-HRI, a proactive multimodal interaction technology that takes vision, speech, and touch-sensor input to the robot

null 185 Dec 26, 2022
Repo for CVPR2021 paper "QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information"

QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information by Masato Tamura, Hiroki Ohashi, and Tomoaki Yosh

null 105 Dec 23, 2022
[CVPR 2021] Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion

[CVPR 2021] Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion

Rex Cheng 364 Jan 3, 2023
CPF: Learning a Contact Potential Field to Model the Hand-object Interaction

Contact Potential Field This repo contains model, demo, and test codes of our paper: CPF: Learning a Contact Potential Field to Model the Hand-object

Lixin YANG 99 Dec 26, 2022
Synthesizing Long-Term 3D Human Motion and Interaction in 3D in CVPR2021

Long-term-Motion-in-3D-Scenes This is an implementation of the CVPR'21 paper "Synthesizing Long-Term 3D Human Motion and Interaction in 3D". Please ch

Jiashun Wang 76 Dec 13, 2022
Populating 3D Scenes by Learning Human-Scene Interaction https://posa.is.tue.mpg.de/

Populating 3D Scenes by Learning Human-Scene Interaction [Project Page] [Paper] License Software Copyright License for non-commercial scientific resea

Mohamed Hassan 81 Nov 8, 2022
This is the repo for the paper `SumGNN: Multi-typed Drug Interaction Prediction via Efficient Knowledge Graph Summarization'. (published in Bioinformatics'21)

SumGNN: Multi-typed Drug Interaction Prediction via Efficient Knowledge Graph Summarization This is the code for our paper ``SumGNN: Multi-typed Drug

Yue Yu 58 Dec 21, 2022
PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World [ACL 2021]

piglet PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World [ACL 2021] This repo contains code and data for PIGLeT. If you like

Rowan Zellers 51 Oct 8, 2022
Official repository for HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR'21, Oral Presentation)

Official PyTorch Implementation for HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR'2021, Oral Presentation) HOTR: End-to-

Kakao Brain 114 Nov 28, 2022
Code for KDD'20 "An Efficient Neighborhood-based Interaction Model for Recommendation on Heterogeneous Graph"

Heterogeneous INteract and aggreGatE (GraphHINGE) This is a pytorch implementation of GraphHINGE model. This is the experiment code in the following w

Jinjiarui 69 Nov 24, 2022
This's an implementation of deepmind Visual Interaction Networks paper using pytorch

Visual-Interaction-Networks An implementation of Deepmind visual interaction networks in Pytorch. Introduction For the purpose of understanding the ch

Mahmoud Gamal Salem 166 Dec 6, 2022
Pytorch Implementation of Interaction Networks for Learning about Objects, Relations and Physics

Interaction-Network-Pytorch Pytorch Implementraion of Interaction Networks for Learning about Objects, Relations and Physics. Interaction Network is a

null 117 Nov 5, 2022
GBIM(Gesture-Based Interaction map)

GBIM (Gesture-Based Interaction Map): an interactive map based on a deep visual neural network. It watches the user's gestures through the computer's camera and uses them to drive simple map interactions. The network uses the lightweight PPYOLO Tiny and MobileNet V3 small models provided by PaddleX, keeping the whole model at roughly 10 MB, so it can locate and recognize gestures quickly even on CPU.

null 8 Feb 10, 2022