ColBERT: Contextualized Late Interaction over BERT (SIGIR'20)

Update: if you're looking for the ColBERTv2 code, you can find it, alongside a new and simpler API, in the new_api branch.

ColBERT

ColBERT is a fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds.

Figure 1: ColBERT's late interaction, efficiently scoring the fine-grained similarity between a query and a passage.

As Figure 1 illustrates, ColBERT relies on fine-grained contextual late interaction: it encodes each passage into a matrix of token-level embeddings (shown above in blue). Then at search time, it embeds every query into another matrix (shown in green) and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators.
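To make the late interaction concrete, below is a minimal PyTorch sketch of MaxSim scoring for a single query-passage pair. It illustrates the operator described above, not the repo's batched implementation:

import torch

def late_interaction_score(Q, D):
    # Q: (num_query_tokens, dim) query token embeddings.
    # D: (num_doc_tokens, dim) passage token embeddings.
    # L2-normalize so dot products are cosine similarities.
    Q = torch.nn.functional.normalize(Q, dim=-1)
    D = torch.nn.functional.normalize(D, dim=-1)
    sim = Q @ D.T  # (num_query_tokens, num_doc_tokens) similarity matrix
    # MaxSim: each query token keeps its best-matching passage token;
    # the passage score is the sum of these maxima over query tokens.
    return sim.max(dim=1).values.sum()

score = late_interaction_score(torch.randn(32, 128), torch.randn(180, 128))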

These rich interactions allow ColBERT to surpass the quality of single-vector representation models, while scaling efficiently to large corpora. You can read more in our papers:

  • ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (SIGIR 2020)
  • Relevance-guided Supervision for OpenQA with ColBERT (TACL 2021)
  • ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction (NAACL 2022)

Installation

ColBERT (currently: v0.2.0) requires Python 3.7+ and PyTorch 1.6+, and uses the HuggingFace Transformers library.

We strongly recommend creating a conda environment using:

conda env create -f conda_env.yml
conda activate colbert-v0.2

If you face any problems, please open a new issue and we'll help you promptly!

Overview

Using ColBERT on a dataset typically involves the following steps.

Step 0: Preprocess your collection. At its simplest, ColBERT works with tab-separated (TSV) files: a file (e.g., collection.tsv) will contain all passages and another (e.g., queries.tsv) will contain a set of queries for searching the collection.

Step 1: Train a ColBERT model. You can train your own ColBERT model and validate performance on a suitable development set.

Step 2: Index your collection. Once you're happy with your ColBERT model, you need to index your collection to permit fast retrieval. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.

Step 3: Search the collection with your queries. Given your model and index, you can issue queries over the collection to retrieve the top-k passages for each query.

Below, we illustrate these steps via an example run on the MS MARCO Passage Ranking task.

Data

This repository works directly with a simple tab-separated file format to store queries, passages, and top-k ranked lists.

  • Queries: each line is qid \t query text.
  • Collection: each line is pid \t passage text.
  • Top-k Ranking: each line is qid \t pid \t rank.

This works directly with the data format of the MS MARCO Passage Ranking dataset. You will need the training triples (triples.train.small.tar.gz), the official top-1000 ranked lists for the dev set queries (top1000.dev), and the dev set relevant passages (qrels.dev.small.tsv). For indexing the full collection, you will also need the list of passages (collection.tar.gz).
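For illustration, here is a minimal Python sketch of reading these files (the file names follow the examples above; this is not the repo's own loader):

def read_tsv(path):
    # Each line: id \t text (qid for queries, pid for passages).
    data = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            idx, text = line.rstrip("\n").split("\t", 1)
            data[int(idx)] = text
    return data

collection = read_tsv("collection.tsv")  # pid -> passage text
queries = read_tsv("queries.tsv")        # qid -> query text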

Training

Training requires a list of <query, positive passage, negative passage> tab-separated triples.

You can supply full-text triples, where each line is query text \t positive passage text \t negative passage text. Alternatively, you can supply query and passage IDs as a JSONL file with one [qid, pid+, pid-] triple per line, in which case you should specify --collection path/to/collection.tsv and --queries path/to/queries.train.tsv.
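As a sketch, reading one line of each accepted triples format in Python (the JSONL file name here is hypothetical; only the formats are from the repo):

import json

# Full-text format: query \t positive passage \t negative passage.
with open("triples.train.small.tsv", encoding="utf-8") as f:
    query, positive, negative = next(f).rstrip("\n").split("\t")

# ID format: one [qid, pid+, pid-] JSON array per line, to be resolved
# against --queries and --collection.
with open("triples.train.ids.jsonl", encoding="utf-8") as f:  # hypothetical name
    qid, positive_pid, negative_pid = json.loads(next(f))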

CUDA_VISIBLE_DEVICES="0,1,2,3" \
python -m torch.distributed.launch --nproc_per_node=4 -m \
colbert.train --amp --doc_maxlen 180 --mask-punctuation --bsize 32 --accum 1 \
--triples /path/to/MSMARCO/triples.train.small.tsv \
--root /root/to/experiments/ --experiment MSMARCO-psg --similarity l2 --run msmarco.psg.l2

You can use one or more GPUs by modifying CUDA_VISIBLE_DEVICES and --nproc_per_node.

Validation

Before indexing into ColBERT, you can compare a few checkpoints by re-ranking a top-k set of documents per query. This will use ColBERT on-the-fly: it will compute document representations during query evaluation.

This script requires the top-k list per query, provided as a tab-separated file whose every line contains a tuple queryID \t passageID \t rank, where rank is {1, 2, 3, ...} for each query. The script also accepts the format of MS MARCO's top1000.dev and top1000.eval, and you can optionally supply relevance judgments (qrels) for evaluation. This is a tab-separated file whose every line has a quadruple <query ID, 0, passage ID, 1>, like qrels.dev.small.tsv.
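For intuition, here is a simplified Python sketch of computing MRR@10 from files in these two formats; it is a stand-in for the script's own evaluation, not a copy of it:

from collections import defaultdict

def mrr_at_10(ranking_path, qrels_path):
    relevant = defaultdict(set)
    with open(qrels_path) as f:
        for qid, _, pid, _ in (line.split() for line in f):  # qid 0 pid 1
            relevant[qid].add(pid)

    best = {}  # qid -> rank of the first relevant passage within the top 10
    with open(ranking_path) as f:
        for qid, pid, rank in (line.split()[:3] for line in f):  # qid pid rank
            if pid in relevant.get(qid, ()) and int(rank) <= 10:
                best[qid] = min(best.get(qid, 10), int(rank))

    return sum(1.0 / r for r in best.values()) / max(len(relevant), 1)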

Example command:

python -m colbert.test --amp --doc_maxlen 180 --mask-punctuation \
--collection /path/to/MSMARCO/collection.tsv \
--queries /path/to/MSMARCO/queries.dev.small.tsv \
--topk /path/to/MSMARCO/top1000.dev  \
--checkpoint /root/to/experiments/MSMARCO-psg/train.py/msmarco.psg.l2/checkpoints/colbert-200000.dnn \
--root /root/to/experiments/ --experiment MSMARCO-psg  [--qrels path/to/qrels.dev.small.tsv]

Indexing

For fast retrieval, indexing precomputes the ColBERT representations of passages.
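In spirit, indexing encodes every passage once and saves its token-embedding matrix. A simplified sketch follows; encode_fn is a hypothetical wrapper around the trained document encoder, and the flat list of tensors stands in for the repo's actual storage layout:

import torch

def index_collection(encode_fn, passages, out_path, bsize=256):
    # encode_fn: maps a list of passage strings to a list of
    # (num_tokens, dim) tensors -- hypothetical, not a repo API.
    matrices = []
    for start in range(0, len(passages), bsize):  # like --bsize 256
        matrices.extend(encode_fn(passages[start:start + bsize]))
    torch.save(matrices, out_path)  # one matrix per passage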

Example command:

CUDA_VISIBLE_DEVICES="0,1,2,3" OMP_NUM_THREADS=6 \
python -m torch.distributed.launch --nproc_per_node=4 -m \
colbert.index --amp --doc_maxlen 180 --mask-punctuation --bsize 256 \
--checkpoint /root/to/experiments/MSMARCO-psg/train.py/msmarco.psg.l2/checkpoints/colbert-200000.dnn \
--collection /path/to/MSMARCO/collection.tsv \
--index_root /root/to/indexes/ --index_name MSMARCO.L2.32x200k \
--root /root/to/experiments/ --experiment MSMARCO-psg

The index created here allows you to re-rank the top-k passages retrieved by another method (e.g., BM25).

We typically recommend that you use ColBERT for end-to-end retrieval, where it directly finds its top-k passages from the full collection. For this, you need FAISS indexing.

FAISS Indexing for end-to-end retrieval

For end-to-end retrieval, you should index the document representations into FAISS.

python -m colbert.index_faiss \
--index_root /root/to/indexes/ --index_name MSMARCO.L2.32x200k \
--partitions 32768 --sample 0.3 \
--root /root/to/experiments/ --experiment MSMARCO-psg
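Conceptually, this step clusters the stored token embeddings into --partitions cells so that search only probes a few cells per query. Below is a rough faiss sketch of that idea, with made-up sizes (the repo's index_faiss handles the real embeddings, and the MS MARCO run above uses 32768 partitions):

import faiss
import numpy as np

dim, partitions = 128, 4096
embeddings = np.random.randn(100_000, dim).astype("float32")  # stand-in for ColBERT token embeddings

quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFFlat(quantizer, dim, partitions)
index.train(embeddings[: int(0.3 * len(embeddings))])  # like --sample 0.3
index.add(embeddings)

index.nprobe = 32                                    # like --nprobe at search time
distances, ids = index.search(embeddings[:4], 1024)  # like --faiss_depth 1024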

Retrieval

In the simplest case, you want to retrieve from the full collection:

python -m colbert.retrieve \
--amp --doc_maxlen 180 --mask-punctuation --bsize 256 \
--queries /path/to/MSMARCO/queries.dev.small.tsv \
--nprobe 32 --partitions 32768 --faiss_depth 1024 \
--index_root /root/to/indexes/ --index_name MSMARCO.L2.32x200k \
--checkpoint /root/to/experiments/MSMARCO-psg/train.py/msmarco.psg.l2/checkpoints/colbert-200000.dnn \
--root /root/to/experiments/ --experiment MSMARCO-psg

You may also want to re-rank a top-k set that you've retrieved before with ColBERT or with another model. For this, use colbert.rerank similarly and additionally pass --topk.

If you have a large set of queries (or want to reduce memory usage), use batch-mode retrieval and/or re-ranking. This can be done by passing --batch --retrieve_only to colbert.retrieve and passing --batch --log-scores to colbert.rerank alongside --topk with the unordered.tsv output of this retrieval run.

Some use cases (e.g., building a user-facing search engine) require more control over retrieval. For those, you typically don't want to use the command line for retrieval. Instead, you want to import our retrieval API from Python and work with it directly (e.g., to build a simple REST API). Instructions for this are coming soon, but you will just need to adapt/modify the retrieval loop in colbert/ranking/retrieval.py#L33.
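Until then, here is a hedged sketch of such a wrapper, assuming you have adapted that retrieval loop into a search(query, k) function; the function, route, and choice of Flask are illustrative, not part of the repo:

from flask import Flask, jsonify, request

app = Flask(__name__)

def search(query, k=10):
    # Hypothetical: adapt the loop in colbert/ranking/retrieval.py to
    # return a list of (pid, score) pairs for a single query string.
    raise NotImplementedError

@app.route("/search")
def handle_search():
    query = request.args.get("query", "")
    k = int(request.args.get("k", 10))
    return jsonify([{"pid": pid, "score": score} for pid, score in search(query, k)])

if __name__ == "__main__":
    app.run(port=8080)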

Releases

  • v0.2.0: Sep 2020
  • v0.1.0: June 2020

Comments
  • question about unordered.tsv

    @okhat

    Because I have lots of queries that I want to process, I wanted to train in batches so I used the following command for retrieval:

    !python -m colbert.retrieve --amp --doc_maxlen 512 --query_maxlen 512 --bsize 1 \
    --queries small_test_queries.tsv --partitions 65536 --index_root ./experiments/indexes --index_name large_train_index \
    --checkpoint ./experiments/dirty/train.py/2021-12-06_08.01.48/checkpoints/colbert-32000.dnn \
    --depth 10000 --batch --retrieve_only
    

    In doing so it creates a file "unordered.tsv", but the results in the file look weird (screenshot omitted).

    From my understanding, the columns are (query id, document id, rank), but the rank column is filled with -1.

    When I run validation on a single query using ColBERT on the fly, it produces pretty good results, though of course it is slow because I have not done the necessary preprocessing. (However, I believe this suggests that my model has been trained properly, so the issue probably does not have to do with BERT itself.)

    opened by puzzlecollector 33
  • Can't build faiss index

    Thanks for the great repo! I'm trying to build a faiss index for retrieval, but can't get the script to run. I was originally using python3.8 and torch 1.8 in a docker container, but also downgraded to torch 1.6 to see if that would work.

    I'm running

    CUDA_VISIBLE_DEVICES="0,1" OMP_NUM_THREADS=1 \
    python3.8 -m torch.distributed.launch --nproc_per_node=1 -m \
            index --root $PWD/experiments/ --amp --doc_maxlen 180 --mask-punctuation --bsize 256 \
            --checkpoint out-of-the-box-model.pt \
            --collection passages.tsv \
            --index_root /faiss --index_name INDEX \
            --root $PWD/experiments/ --experiment out_of_the_box \
            --local_rank 0
    

    But I get the error "Default process group is not initialized". That also happens even if I manually put dist.init_process_group('nccl') in the script.

    Do you know why this is happening?

    Thanks!

    opened by JamesDeAntonis 23
  • Instructions on using ColBERT

    @okhat I am trying to use ColBERT for a document retrieval project I am working on and I'd like to ask if I have understood the procedure correctly. I am trying to perform a ranking task based on the similarity of passages. So if a query passage comes in, then among the document passages I have, the system has to retrieve the top-K most similar documents to the query.

    • Because both the queries and documents are long, I guess I would have to first preprocess them using utility/preprocess/docs2passages. If my understanding is correct, this method simply chunks the long text in a sliding-window manner, right?

    • Afterwards I need to prepare the dataset following the format query \t positive passage \t negative passage in .tsv format. Then I type in the following command in the command line to train my custom ColBERT model:

    python -m colbert.train --amp --doc_maxlen 180 --mask-punctuation --bsize 32 --accum 1 --triples train_dataset.tsv
    
    • Once the training is complete, a model checkpoint file will be saved. From this step onwards, I am planning to use the saved checkpoint and the pyterrier framework.

    • I am aware that first I need to index all my test documents using FAISS, and I am planning to use the following code to do so:

    import pyterrier_colbert.indexing
    indexer = pyterrier_colbert.indexing.ColBERTIndexer(checkpoint, "/content", "colbertindex", chunksize=3)
    indexer.index(test_df)
    

    Here the test_df is a corpus of test documents (pandas dataframe) where it will have two columns: docno and text.

    • Once the indexing is done, I will proceed to passing in the test queries to rank the top-K documents that were indexed in the previous step. To achieve this I will use the following code (again using pyterrier):
    pyterrier_colbert_factory = indexer.ranking_factory()
    
    colbert_e2e = pyterrier_colbert_factory.end_to_end()
    (colbert_e2e % 10).search(<query text>)
    

    Does this step look about right? I will proceed as I have written above and if I encounter any problems I will ask again in this thread. Thank you :)

    opened by puzzlecollector 22
  • Errors when trying to interface directly with the underlying API for re-ranking

    I keep getting the pictured error (screenshot omitted).

    Upon investigation, I see that stride is referenced here but isn't defined earlier in the method. Can you please explain whether this is intended or a bug?

    Thanks

    opened by JamesDeAntonis 18
  • Make possible to pip install

    Hello, thanks for your repository and SIGIR paper. We would like to develop wrappers on top of ColBERT. Would it be possible to make the repo compatible with pip? This would need:

    • make a setup.py
    • rename the src directory to colbert
    opened by cmacdonald 14
  • Performance Issues with RoBERTa Models

    I am currently training a multilingual model with your approach, and with bert-base-multilingual-uncased it works great. Now I have tried switching to xlm-roberta-base (which in general is better pre-trained than mBERT), but performance is far off. Both are trained on the same system with the same batch size.

    Here is a plot of the loss over training steps (plot omitted).

    Evaluation performance is very different as well: mBERT @ 32k steps reaches MRR@10 = 0.22, while XLM-RoBERTa @ 32k steps reaches MRR@10 = 0.07.

    As RoBERTa uses a BPE vocab, I had to add the unused tokens by hand and initialize their embeddings randomly (transformers does that with mean=0 and std=0.02):

    self.tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
    self.tokenizer.add_tokens(["[unused0]"])
    self.tokenizer.add_tokens(["[unused1]"])

    self.skiplist = {w: True for w in string.punctuation}

    self.bert = XLMRobertaModel(config)
    self.bert.resize_token_embeddings(len(self.tokenizer))


    https://github.com/Phil1108/ColBERT/blob/b786e2e7ef1c13bac97a5f8a35aa9e5ff9f3425f/src/model.py#L24

    Is the bad performance caused by RoBERTa not including next-sentence prediction in its pre-training? Or is the ColBERT approach not transferable to BPE vocabs? Or is my way of adding unused tokens to the vocab causing undetected problems?

    opened by Phil1108 13
  • Error when indexing with ColBERTv2

    Hi, I'm using the new_api branch and the provided ColBERTv2 checkpoint to index a 2M dataset, but the indexer doesn't seem to be able to create an ivf.pt, which I assume is related to faiss (screenshot omitted). It seems like something is getting lost. I'll debug, but I just thought of dropping this here first.

    opened by vjeronymo2 10
  • about ColBERT(BertPreTrainedModel)

    Hello, I am reading your code to replicate the experiment. I have some questions about the model in "model.py".

    1. In the query() function, "queries" are word lists, so they cannot be passed to the self.tokenizer.encode() function. The standard input for tokenizer.encode() should be text.
    2. In the doc() function,
    docs = [["[unused1]"] + self._tokenize(d)[:self.doc_maxlen-3] for d in docs]
    

    the result of "self._tokenize()" is a word list, not a word-piece list, so it is improper to cut it with doc_maxlen, which limits the number of word-piece tokens.
    3. Although the paper says "Unlike queries, we do not append [mask] tokens to documents.", in the code the encoding function is "_encode()" for both queries and docs, with the same [mask] padding.

    opened by KaishuaiXu 9
  • UnboundLocalError: local variable 'batch_idx' referenced before assignment

    I am trying to train a ColBERT model on a new dataset based on the code snippet from the README. I get stuck here:

    #> Starting...
    nranks = 2 	 num_gpus = 2 	 device=0
    #> Starting...
    nranks = 2 	 num_gpus = 2 	 device=1
    Using config.bsize = 16 (per process) and config.accumsteps = 1
    {
        "ncells": null,
        "centroid_score_threshold": null,
        "ndocs": null,
        "index_path": null,
        "nbits": 1,
        "kmeans_niters": 4,
        "resume": false,
        "similarity": "cosine",
        "bsize": 32,
        "accumsteps": 1,
        "lr": 3e-6,
        "maxsteps": 500000,
        "save_every": null,
        "warmup": null,
        "warmup_bert": null,
        "relu": false,
        "nway": 2,
        "use_ib_negatives": false,
        "reranker": false,
    ...
    [Jul 18, 19:30:32] #> Got 98380 queries. All QIDs are unique.
    
    [Jul 18, 19:30:32] #> Got 98380 queries. All QIDs are unique.
    
    Some weights of HF_ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    Some weights of HF_ColBERT were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['linear.weight']
    You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
    [Jul 18, 19:30:42] #> Done with all triples!
    Process Process-3:
    Traceback (most recent call last):
      File "/home/IAIS/ebritochac/anaconda3/envs/colbert/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
        self.run()
      File "/home/IAIS/ebritochac/anaconda3/envs/colbert/lib/python3.7/multiprocessing/process.py", line 99, in run
        self._target(*self._args, **self._kwargs)
      File "../colbert/infra/launcher.py", line 115, in setup_new_process
        return_val = callee(config, *args)
      File "../colbert/training/training.py", line 146, in train
        ckpt_path = manage_checkpoints(config, colbert, optimizer, batch_idx+1, savepath=None, consumed_all_triples=True)
    UnboundLocalError: local variable 'batch_idx' referenced before assignment
    
    opened by ebritoc 8
  • Training stopping after few hundred steps

    Hello Omar, thank you very much for sharing your awesome work. I am currently trying to test ColBERT on the BioASQ 8 dataset (3,200 queries with approximately 10 relevant documents per query in the training set).

    I have two questions about ColBERT.

    1. The main problem I face is that when I train ColBERT on a triples.p file, the training sometimes stops after a couple hundred or thousand steps: i.e., each step takes less than a second, and then it stops printing and saving checkpoints, but the process is still running and doesn't exit. Is there somewhere in the code where training stops when the average loss doesn't improve anymore? I don't see it in the code. And my GPU still has available memory.

    2. Also, I was wondering if there is a reason why you don't consider epochs in your training code. I guess that for the MS MARCO dataset, the triples.p file is long enough. But in general, there is no problem iterating over the training set multiple times, right? In particular, I want to sample the negatives not randomly but from BM25 negatives.

    Again thank you for this great repository. Alexandre

    opened by alexjout 8
  • ModuleNotFoundError: No module named 'utility'

    Hi, I wanted to use your model in my Python project and therefore added it (in a virtual pip env) using the following command:

    pip install git+https://github.com/stanford-futuredata/ColBERT
    

    However, when I start my Python project and try to use your demo code from the demo notebook, I always get the following error:

    Traceback (most recent call last):
      File "C:\backend\webservice.py", line 8, in <module>
        from controllers.colbert_controller import ColBertController
      File "C:\backend\controllers\colbert_controller.py", line 3, in <module>
        from colbert.infra import Run, RunConfig, ColBERTConfig
      File "C:\backend\venv\lib\site-packages\colbert\__init__.py", line 1, in <module>
        from .trainer import Trainer
      File "C:\backend\venv\lib\site-packages\colbert\trainer.py", line 1, in <module>
        from colbert.infra.run import Run
      File "C:\backend\venv\lib\site-packages\colbert\infra\__init__.py", line 1, in <module>
        from .run import *
      File "C:\backend\venv\lib\site-packages\colbert\infra\run.py", line 7, in <module>
        from colbert.infra.config import RunConfig
      File "C:\backend\venv\lib\site-packages\colbert\infra\config\__init__.py", line 1, in <module>
        from .config import *
      File "C:\backend\venv\lib\site-packages\colbert\infra\config\config.py", line 3, in <module>
        from .base_config import BaseConfig
      File "C:\backend\venv\lib\site-packages\colbert\infra\config\base_config.py", line 11, in <module>
        from utility.utils.save_metadata import get_metadata_only
    ModuleNotFoundError: No module named 'utility'
    

    Any idea how to solve this error? Thanks.

    opened by lalenzos 7
  • Possible to train from checkpointed model with new triples?

    I have successfully trained a new model from scratch using triples generated from the ground truth of queries and relevant docs in my collection. However, when trying to train with these triples starting from the provided ColBERTv2 checkpoint, I run into an assertion error:

    AssertionError: Q.size(0)=1024, D_padded.size(0)=32
    Traceback (most recent call last):
      <skipped previous calls>
      File "<repo_base>/ColBERT/colbert/modeling/colbert.py", line 173, in colbert_score
        assert Q.size(0) in [1, D_padded.size(0)], f"Q.size(0)={Q.size(0)}, D_padded.size(0)={D_padded.size(0)}"
    AssertionError: Q.size(0)=1024, D_padded.size(0)=32
    

    (I added the assertion message.)

    I tried to do this by following the code snippet in the training section of the README, but making the following change:

    checkpoint_path = trainer.train(checkpoint=initial_checkpoint_path) 
    

    where initial_checkpoint_path points to the dir containing the downloaded checkpoint.

    Is this the correct way? Thanks.

    opened by bagchisu 0
  • indices should be either on cpu or on the same device as the indexed tensor (cpu)

    It happens when I run intro.ipynb:

    results = searcher.search(query, k=3)
    

    The output is:

    RuntimeError                              Traceback (most recent call last)
    Cell In[19], line 6
          3 print(f"#> {query}")
          5 # Find the top-3 passages for this query
    ----> 6 results = searcher.search(query, k=3)
          8 # Print out the top-k retrieved passages
          9 for passage_id, passage_rank, passage_score in zip(*results):
    
    File ~/ColBERT/docs/../colbert/searcher.py:61, in Searcher.search(self, text, k, filter_fn)
         59 def search(self, text: str, k=10, filter_fn=None):
         60     Q = self.encode(text)
    ---> 61     return self.dense_search(Q, k, filter_fn=filter_fn)
    
    File ~/ColBERT/docs/../colbert/searcher.py:108, in Searcher.dense_search(self, Q, k, filter_fn)
        105     if self.config.ndocs is None:
        106         self.configure(ndocs=max(k * 4, 4096))
    --> 108 pids, scores = self.ranker.rank(self.config, Q, filter_fn=filter_fn)
        110 return pids[:k], list(range(1, k+1)), scores[:k]
    
    File ~/ColBERT/docs/../colbert/search/index_storage.py:79, in IndexScorer.rank(self, config, Q, filter_fn)
         77 def rank(self, config, Q, filter_fn=None):
         78     with torch.inference_mode():
    ---> 79         pids, centroid_scores = self.retrieve(config, Q)
         81         if filter_fn is not None:
         82             pids = filter_fn(pids)
    
    File ~/ColBERT/docs/../colbert/search/index_storage.py:69, in IndexScorer.retrieve(self, config, Q)
         67 def retrieve(self, config, Q):
         68     Q = Q[:, :config.query_maxlen]   # NOTE: Candidate generation uses only the query tokens
    ---> 69     embedding_ids, centroid_scores = self.generate_candidates(config, Q)
         71     return embedding_ids, centroid_scores
    File ~/ColBERT/docs/../colbert/search/candidate_generation.py:55, in CandidateGeneration.generate_candidates(self, config, Q)
         52     Q = Q.cuda().half()
         53 assert Q.dim() == 2
    ---> 55 pids, centroid_scores = self.generate_candidate_pids(Q, ncells)
         57 sorter = pids.sort()
         58 pids = sorter.values
    
    File ~/ColBERT/docs/../colbert/search/candidate_generation.py:34, in CandidateGeneration.generate_candidate_pids(self, Q, ncells)
         31 def generate_candidate_pids(self, Q, ncells):
         32     cells, scores = self.get_cells(Q, ncells)
    ---> 34     pids, cell_lengths = self.ivf.lookup(cells)
         35     if self.use_gpu:
         36         pids = pids.cuda()
    
    File ~/ColBERT/docs/../colbert/search/strided_tensor.py:75, in StridedTensor.lookup(self, pids, output)
         74 def lookup(self, pids, output='packed'):
    ---> 75     pids, lengths, offsets = self._prepare_lookup(pids)
         77     if self.use_gpu:
         78         stride = lengths.max().item()
    
    File ~/ColBERT/docs/../colbert/search/strided_tensor.py:67, in StridedTensor._prepare_lookup(self, pids)
         65     pids = pids.cuda()
         66 pids = pids.long()
    ---> 67 lengths = self.lengths[pids]
         68 if self.use_gpu:
         69     lengths = lengths.cuda()
    
    RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
    

    It seems like the device setting got mismatched, so I overrode the _prepare_lookup function:

    def _prepare_lookup(self, pids):
        if isinstance(pids, list):
            pids = torch.tensor(pids)

        assert pids.dim() == 1

        if self.use_gpu:
            pids = pids.cuda()
        pids = pids.long()
        lengths = self.lengths[pids]
        if self.use_gpu:
            lengths = lengths.cuda()
        offsets = self.offsets[pids]

        return pids, lengths, offsets
    

    But another device mismatch happened.

    opened by lyj201002 4
  • Subclass from colbert.data.Collection fails cast assertion

    The cast() classmethod in colbert.data.Collection uses the following to test the type of the obj argument:

            if type(obj) is cls:
                return obj
    

    This fails when the collection object passed to colbert.Searcher is a subclass of Collection. Perhaps a better implementation for the type check in cast would be:

            if isinstance(obj, cls):
                return obj
    
    opened by bagchisu 3
  • intro.ipynb notebook not running first cell on CPU environment: missing modules

    If you try to run the first cell, you get at the end of the trace the error ModuleNotFoundError: No module named 'git'. I fixed it with

    pip install GitPython
    

    Then you get an error about missing transformers, so I ran this:

     pip install transformers[torch]
    

    Then it runs fine. These two packages should be added to the CPU environment YAML file, or added to the README.md as part of the install instructions.

    opened by jramirezpr 2
  • Decompression returns zero vectors

    Hi, when reconstructing vectors from codes and residuals I always get zero vectors. The relevant code is related to the torch extensions. See the minimal example:

    import torch

    from colbert.indexing.codecs.residual import ResidualCodec, ResidualEmbeddings
    
    codec = ResidualCodec.load("/path/to/index/")
    
    a = ResidualEmbeddings(torch.Tensor([1,2,3]),
                           torch.randint(256, (3, 32),
                                         dtype=torch.uint8))
    
    codec.decompress_residuals(a.residuals,
                               codec.bucket_weights,
                               codec.reversed_bit_map,
                               codec.decompression_lookup_table,
                               a.codes,
                               codec.centroids,
                               codec.dim,
                               codec.nbits)
    

    This returns zero vectors. The index has content (I examined the .pt files).

    Did anyone encounter this? is this a bug or an issue with GPU drivers?

    Thanks!

    opened by danielfleischer 12
Owner
Stanford Future Data Systems
We are a CS research group at Stanford building data-intensive systems
NAACL2021 - COIL Contextualized Lexical Retriever

COIL Repo for our NAACL paper, COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List. The code covers learning

Luyu Gao 108 Dec 31, 2022
I-BERT: Integer-only BERT Quantization

I-BERT: Integer-only BERT Quantization HuggingFace Implementation I-BERT is also available in the master branch of HuggingFace! Visit the following li

Sehoon Kim 139 Dec 27, 2022
Source code for NAACL 2021 paper "TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference"

TR-BERT Source code and dataset for "TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference". The code is based on huggaface's transformers.

THUNLP 37 Oct 30, 2022
LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021)

LV-BERT Introduction In this repo, we introduce LV-BERT by exploiting layer variety for BERT. For detailed description and experimental results, pleas

Weihao Yu 14 Aug 24, 2022
VD-BERT: A Unified Vision and Dialog Transformer with BERT

VD-BERT: A Unified Vision and Dialog Transformer with BERT PyTorch Code for the following paper at EMNLP2020: Title: VD-BERT: A Unified Vision and Dia

Salesforce 44 Nov 1, 2022
Pre-trained BERT Models for Ancient and Medieval Greek, and associated code for LaTeCH 2021 paper titled - "A Pilot Study for BERT Language Modelling and Morphological Analysis for Ancient and Medieval Greek"

Ancient Greek BERT The first and only available Ancient Greek sub-word BERT model! State-of-the-art post fine-tuning on Part-of-Speech Tagging and Mor

Pranaydeep Singh 22 Dec 8, 2022
PaddleRobotics is an open-source algorithm library for robots based on Paddle, including open-source parts such as human-robot interaction, complex motion control, environment perception, SLAM positioning, and navigation.

Simplified Chinese | English — PaddleRobotics is an open-source robotics algorithm library based on Paddle, covering human-robot interaction, complex motion control, environment perception, and SLAM localization and navigation. Human-robot interaction: TFVT-HRI, a proactive multimodal interaction technology that takes vision, speech, and touch-sensor input to the robot

null 185 Dec 26, 2022
Repo for CVPR2021 paper "QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information"

QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information by Masato Tamura, Hiroki Ohashi, and Tomoaki Yosh

null 105 Dec 23, 2022
[CVPR 2021] Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion

[CVPR 2021] Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion

Rex Cheng 364 Jan 3, 2023
CPF: Learning a Contact Potential Field to Model the Hand-object Interaction

Contact Potential Field This repo contains model, demo, and test codes of our paper: CPF: Learning a Contact Potential Field to Model the Hand-object

Lixin YANG 99 Dec 26, 2022
Synthesizing Long-Term 3D Human Motion and Interaction in 3D in CVPR2021

Long-term-Motion-in-3D-Scenes This is an implementation of the CVPR'21 paper "Synthesizing Long-Term 3D Human Motion and Interaction in 3D". Please ch

Jiashun Wang 76 Dec 13, 2022
Populating 3D Scenes by Learning Human-Scene Interaction https://posa.is.tue.mpg.de/

Populating 3D Scenes by Learning Human-Scene Interaction [Project Page] [Paper] License Software Copyright License for non-commercial scientific resea

Mohamed Hassan 81 Nov 8, 2022
This is the repo for the paper `SumGNN: Multi-typed Drug Interaction Prediction via Efficient Knowledge Graph Summarization'. (published in Bioinformatics'21)

SumGNN: Multi-typed Drug Interaction Prediction via Efficient Knowledge Graph Summarization This is the code for our paper ``SumGNN: Multi-typed Drug

Yue Yu 58 Dec 21, 2022
PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World [ACL 2021]

piglet PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World [ACL 2021] This repo contains code and data for PIGLeT. If you like

Rowan Zellers 51 Oct 8, 2022
Official repository for HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR'21, Oral Presentation)

Official PyTorch Implementation for HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR'2021, Oral Presentation) HOTR: End-to-

Kakao Brain 114 Nov 28, 2022
Code for KDD'20 "An Efficient Neighborhood-based Interaction Model for Recommendation on Heterogeneous Graph"

Heterogeneous INteract and aggreGatE (GraphHINGE) This is a pytorch implementation of GraphHINGE model. This is the experiment code in the following w

Jinjiarui 69 Nov 24, 2022
This's an implementation of deepmind Visual Interaction Networks paper using pytorch

Visual-Interaction-Networks An implementation of Deepmind visual interaction networks in Pytorch. Introduction For the purpose of understanding the ch

Mahmoud Gamal Salem 166 Dec 6, 2022
Pytorch Implementation of Interaction Networks for Learning about Objects, Relations and Physics

Interaction-Network-Pytorch Pytorch Implementraion of Interaction Networks for Learning about Objects, Relations and Physics. Interaction Network is a

null 117 Nov 5, 2022
GBIM(Gesture-Based Interaction map)

GBIM (Gesture-Based Interaction Map): an interactive map based on a deep visual neural network. It watches the user's gestures through the computer's camera and uses them to drive simple map interactions. The network uses the lightweight PPYOLO Tiny and MobileNet V3 small models provided by PaddleX, keeping the whole model at roughly 10 MB, so it can locate and recognize gestures quickly even on CPU.

null 8 Feb 10, 2022