Search with BERT vectors in Solr and Elasticsearch

Dmitry Kan

Last update: Dec 29, 2022

Overview

BERT models with Solr and Elasticsearch

[Video: Streamlit search demo with Solr]

[Video: Streamlit search demo with Elasticsearch]

This code is described in the following Medium stories, taking one step at a time:

Neural Search with BERT and Solr (August 18, 2020)

Fun with Apache Lucene and BERT Embeddings (November 15, 2020)

Speeding up BERT Search in Elasticsearch (March 15, 2021)

Ask Me Anything about Vector Search (June 20, 2021). This blog post answers the 3 most interesting questions asked during the AMA session at Berlin Buzzwords 2021. The video recording is available here: https://www.youtube.com/watch?v=blFe2yOD1WA

Tech stack:

bert-as-service
Hugging Face
solr / elasticsearch
streamlit
Python 3.7

Code for dealing with Solr has been copied from the great (and highly recommended) https://github.com/o19s/hello-ltr project.

Install tensorflow

pip install tensorflow==1.15.3

If you try to install tensorflow 2.3, the BERT service will fail to start; there is an existing issue about it.

If you encounter issues with the above installation, consider installing the full list of packages:

pip install -r requirements_freeze.txt

Let's install the bert-as-service components

pip install bert-serving-server

pip install bert-serving-client

Download a pre-trained BERT model into the bert-model/ directory in this project. I have chosen uncased_L-12_H-768_A-12.zip for this experiment. Unzip it.

Now let's start the BERT service

bash start_bert_server.sh

Run a sample bert client

python src/bert_client.py

to compute vectors for 3 sample sentences:

    Bert vectors for sentences ['First do it', 'then do it right', 'then do it better'] : [[ 0.13186474  0.32404128 -0.82704437 ... -0.3711958  -0.39250174
      -0.31721866]
     [ 0.24873531 -0.12334424 -0.38933852 ... -0.44756213 -0.5591355
      -0.11345179]
     [ 0.28627345 -0.18580122 -0.30906814 ... -0.2959366  -0.39310536
       0.07640187]]
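Vectors like these become useful for search once they can be compared with each other. The following is a minimal sketch, assuming cosine similarity as the comparison metric; the function name and the toy 3-dimensional vectors are made up for illustration (the real BERT vectors here have 768 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


# Toy vectors standing in for sentence embeddings (hypothetical values).
query = [1.0, 0.0, 1.0]
docs = {"doc_a": [1.0, 0.0, 1.0], "doc_b": [0.0, 1.0, 0.0]}

# Rank document ids by similarity to the query, most similar first.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)
```

In a real setup the search engine does this scoring; the sketch only shows the arithmetic behind it.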

This sets the stage for our further experiments with Solr.

Dataset

This is by far the key ingredient of every experiment. You want to find an interesting collection of texts that are suitable for semantic-level search. Well, maybe all texts are. I have chosen a collection of abstracts from DBPedia, which I downloaded from here: https://wiki.dbpedia.org/dbpedia-version-2016-04 and placed into the data/dbpedia directory in bz2 format. You don't need to extract this file onto disk: the provided code will read directly from the compressed file.
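Reading straight from the compressed file can be sketched with Python's standard bz2 module. This is not the project's own parser: the function name iter_abstracts and the regular expression are hypothetical, and the assumed line shape (subject URI, predicate URI, quoted English abstract, final dot) should be checked against the actual dump.

```python
import bz2
import re

# Assumed line shape (N-Triples-like):
# <subject-uri> <predicate-uri> "abstract text"@en .
TRIPLE_RE = re.compile(r'^<([^>]+)>\s+<[^>]+>\s+"(.*)"@en\s+\.\s*$')


def iter_abstracts(path, limit=None):
    """Yield (uri, abstract) pairs from a .bz2 file without unpacking it to disk."""
    count = 0
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            match = TRIPLE_RE.match(line)
            if match is None:
                continue  # skip comment lines and anything that does not fit the pattern
            yield match.group(1), match.group(2)
            count += 1
            if limit is not None and count >= limit:
                break
```

For example, iterating over iter_abstracts("data/dbpedia/long_abstracts_en.ttl.bz2", limit=3) would yield the first three abstracts while decompressing only as much of the file as needed.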

Preprocessing and Indexing: Solr

Before running preprocessing / indexing, you need to configure the vector plugin, which allows indexing and querying vector data. You can find the plugin for Solr 8.x here: https://github.com/DmitryKey/solr-vector-scoring

After the plugin's jar has been added, configure it in solrconfig.xml like so:

The schema also requires an addition: a field of type VectorField is needed in order to index vector data:

Find ready-made schema and solrconfig here: https://github.com/DmitryKey/bert-solr-search/tree/master/solr_conf

Let's preprocess the downloaded abstracts, and index them in Solr. First, execute the following command to start Solr:

bin/solr start -m 2g

During processing you may notice the following warning:

<...>/bert-solr-search/venv/lib/python3.7/site-packages/bert_serving/client/__init__.py:299: UserWarning: some of your sentences have more tokens than "max_seq_len=500" set on the server, as consequence you may get less-accurate or truncated embeddings.
here is what you can do:
- disable the length-check by create a new "BertClient(check_length=False)" when you do not want to display this warning
- or, start a new server with a larger "max_seq_len"
  '- or, start a new server with a larger "max_seq_len"' % self.length_limit)

The index_dbpedia_abstracts_solr.py script will output statistics, including the maximum number of tokens observed per abstract:

Maximum tokens observed per abstract: 697
Flushing 100 docs
Committing changes
All done. Took: 82.46466588973999 seconds
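The log above suggests that documents are sent in batches of 100 while the longest abstract is tracked. A minimal sketch of that pattern follows; it is not the project's code. The function index_in_batches and its flush callback are hypothetical, and whitespace splitting is only a rough stand-in for the BERT tokenizer, so the counts will differ from the real ones.

```python
def index_in_batches(abstracts, flush, batch_size=100):
    """Pass abstracts to `flush` in fixed-size batches; return the longest token count seen."""
    max_tokens = 0
    batch = []
    for text in abstracts:
        # Whitespace splitting approximates tokenization for this sketch only.
        max_tokens = max(max_tokens, len(text.split()))
        batch.append(text)
        if len(batch) == batch_size:
            flush(batch)
            batch = []
    if batch:
        flush(batch)  # do not lose the final, partially filled batch
    return max_tokens


# Tiny demo: collect the batches in a list instead of sending them to a server.
sent = []
longest = index_in_batches(["a b c", "d e", "f"], flush=sent.append, batch_size=2)
print(f"Maximum tokens observed per abstract: {longest}")
```

Knowing the maximum token count helps when deciding whether to raise max_seq_len on the BERT server or to accept truncated embeddings.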

We know how many abstracts there are:

bzcat data/dbpedia/long_abstracts_en.ttl.bz2 | wc -l
5045733

Preprocessing and Indexing: Elasticsearch

This project implements several ways to index vector data:

src/index_dbpedia_abstracts_elastic.py: vanilla Elasticsearch, using the dense_vector data type
src/index_dbpedia_abstracts_elastiknn.py: Elastiknn plugin, which implements its own data types; I used elastiknn_dense_float_vector
src/index_dbpedia_abstracts_opendistro.py: OpenDistro for Elasticsearch, which uses nmslib to build Hierarchical Navigable Small World (HNSW) graphs during indexing

Each indexer relies on a ready-made Elasticsearch mapping file, which can be found in the es_conf/ directory.
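For the vanilla Elasticsearch case, a mapping with a dense_vector field might look roughly like the sketch below. The field names "text" and "vector" and the helper build_vector_mapping are assumptions for illustration; the actual mappings are the files in es_conf/.

```python
import json


def build_vector_mapping(dims=768, vector_field="vector", text_field="text"):
    """Return an index mapping with a regular text field and a dense_vector field."""
    return {
        "mappings": {
            "properties": {
                # Analyzed text, usable for BM25 keyword search.
                text_field: {"type": "text"},
                # Fixed-length float vector; dims must match the embedding model.
                vector_field: {"type": "dense_vector", "dims": dims},
            }
        }
    }


mapping = build_vector_mapping()
print(json.dumps(mapping, indent=2))
```

The Elastiknn and OpenDistro indexers use their own field types, so this sketch does not apply to them as-is.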

Preprocessing and Indexing: GSI APU

In order to use the GSI APU solution, a user needs to produce two files: a numpy 2D array with vectors of the desired dimension (768 in my case) and a pickle file with document ids matching the document ids of the said vectors in Elasticsearch.

After these data files get uploaded to the GSI server, the same data gets indexed in Elasticsearch. The APU-powered search is performed on up to 3 Leda-G PCIe APU boards. Since I ran into indexing performance issues with the bert-as-service solution, I decided to take the SBERT approach from Hugging Face to prepare the numpy and pickle array files. This allowed me to index into Elasticsearch freely at any time, without waiting for days. You can use this script to do this on DBPedia data, which allows choosing between:

EmbeddingModel.HUGGING_FACE_SENTENCE (SBERT)
EmbeddingModel.BERT_UNCASED_768 (bert-as-service)

To generate the numpy and pickle files, use the following script: src/create_gsi_files.py. This script produces two files:

data/1000000_EmbeddingModel.HUGGING_FACE_SENTENCE_vectors.npy
data/1000000_EmbeddingModel.HUGGING_FACE_SENTENCE_vectors_docids.pkl

Both files are perfectly suitable for indexing with Solr and Elasticsearch.
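The shape of these two files can be illustrated with a short sketch. It is not the code of create_gsi_files.py: the helper save_vectors_and_ids, the float32 dtype, and the toy 4-dimensional vectors are assumptions, and only the idea of one array row per document id comes from the description above.

```python
import os
import pickle
import tempfile

import numpy as np


def save_vectors_and_ids(vectors, doc_ids, out_dir, prefix):
    """Write a 2D float32 array (.npy) and the matching list of document ids (.pkl)."""
    matrix = np.asarray(vectors, dtype=np.float32)
    if matrix.ndim != 2 or matrix.shape[0] != len(doc_ids):
        raise ValueError("expected one row of the 2D array per document id")
    npy_path = os.path.join(out_dir, f"{prefix}_vectors.npy")
    pkl_path = os.path.join(out_dir, f"{prefix}_vectors_docids.pkl")
    np.save(npy_path, matrix)
    with open(pkl_path, "wb") as handle:
        pickle.dump(list(doc_ids), handle)
    return npy_path, pkl_path


# Tiny demo with toy vectors, then a round trip to check both files line up.
with tempfile.TemporaryDirectory() as tmp_dir:
    npy_path, pkl_path = save_vectors_and_ids(
        [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]], ["doc-1", "doc-2"], tmp_dir, "demo"
    )
    loaded = np.load(npy_path)
    with open(pkl_path, "rb") as handle:
        loaded_ids = pickle.load(handle)
    print(loaded.shape, loaded_ids)
```

Row i of the array belongs to the i-th id in the pickle file, so the two files must always be written and read together.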

To test the GSI plugin, you will need to upload these files to the GSI server so that they can be loaded into both Elasticsearch and the APU.

Running the BERT search demo

There are two streamlit demos for running BERT search, one for Solr and one for Elasticsearch. Each demo compares BERT search to BM25-based search. The following assumes that you have bert-as-service up and running (if not, launch it with bash start_bert_server.sh) and either Elasticsearch or Solr running with an index containing a field with embeddings.

To run a demo, execute the following on the command line from the project root:

# for experiments with Elasticsearch
streamlit run src/search_demo_elasticsearch.py

# for experiments with Solr
streamlit run src/search_demo_solr.py
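To make the BM25 versus vector comparison concrete for vanilla Elasticsearch, the two request bodies could be built roughly as follows. This is a sketch, not the demo's code: the helper names and the field names "text" and "vector" are hypothetical, and the script syntax shown is the one accepted by Elasticsearch 7.6 and later (earlier 7.x releases expected doc['vector'] instead of the field name string).

```python
def bm25_query(text, text_field="text", size=10):
    """Keyword search: a `match` query, scored with BM25 by default."""
    return {"size": size, "query": {"match": {text_field: text}}}


def vector_query(query_vector, vector_field="vector", size=10):
    """Exact (brute-force) vector scoring with script_score and cosineSimilarity."""
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # Adding 1.0 keeps the score non-negative, which Elasticsearch requires.
                    "source": f"cosineSimilarity(params.query_vector, '{vector_field}') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
    }


print(bm25_query("history of Finland"))
print(vector_query([0.1, 0.2, 0.3]))
```

Both bodies can be sent to the same _search endpoint, which makes it easy to show the two result lists side by side.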

Comments

Streamlit JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Dear @DmitryKey , it is a pleasure to follow your guides and go through the examples. Thank you for providing such interesting tools.

Concerning "Neural Search with BERT and Solr" everything seems to go alright including indexing and searching on the solr server, however, once I start streamlit the following errors occur when I search. Do you have any idea about what it may be?

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Traceback:

File "/home/n/documents/my_jina_app/baas/env/lib/python3.6/site-packages/streamlit/script_runner.py", line 337, in _run_script
    exec(code, module.__dict__)
File "/home/n/documents/my_jina_app/baas/bert-solr-search/src/search_demo_solr.py", line 122, in <module>
    docs, query_time, numfound = sc.query("vector", query)
File "src/client/solr_client.py", line 102, in query
    resp = resp.json()
File "/home/n/documents/my_jina_app/baas/env/lib/python3.6/site-packages/requests/models.py", line 898, in json
    return complexjson.loads(self.text, **kwargs)
File "/home/n/anaconda3/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
File "/home/n/anaconda3/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/n/anaconda3/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None

Thank you for your consideration.

opened by xeisberg 15

problem for indexing the abstracts.

Hi, I ran "python src/index_dbpedia_abstracts_solr.py" because I could not find "index_dbpedia_abstracts.py". The function "parse_dbpedia_data" takes two parameters, but here five are passed. How can I solve this?

opened by wacharlin 8

about request.get and post.

Hi, I want to know how to update Solr. I don't understand the meaning of the code. If I want to push data to Solr, should I call commit() and use requests.get()? Why can't I use POST? And what do flush() and commit() mean? Could you please explain? Thank you.

opened by wacharlin 5

solr conf

I followed the steps given in "https://github.com/DmitryKey/solr-vector-scoring" to configure the Solr plugin on the automatically generated configuration file, and the vector can indeed be stored, so the vector plugin part is configured. However, queries run from the Solr admin do not work well, and there are almost no search results. After that, I used the Solr configuration you provided and found that the query works very well. I noticed that your Solr configuration differs from the default. Can you elaborate on your Solr configuration process, or point out the part that improves the query? Thank you.

opened by wacharlin 5

running problem

Hi, thank you for your reply. I have solved the problem, so I closed it. But now when I run "streamlit run src/search_demo_solr.py" I hit a new problem that I don't know how to handle: the status is 500 for BERT, while BM25 works. How can I solve it?

opened by wacharlin 5

Typo in create_gsi_files.py

There is a small typo in https://github.com/DmitryKey/bert-solr-search/blob/master/src/create_gsi_files.py with vectors_to_gis_files which prevents one from running the script. vectors_to_gsi_files should be imported and called.

opened by nina-marjanovic 3

the problem of solr

Hi, I want to know how to index in Solr. After starting Solr, I replaced my own "solrconfig.xml" and "managed-schema" with the files from "solr_conf/8.0.0", but I get an error and I don't know whether this step is right. Could you help me with how to configure Solr? Thank you.

opened by wacharlin 2

elastiknn_unsupported_operation_exception

Hello Dmitry! I've read your blog on Medium, and your work really helped me to make our search system smarter! I've added SBERT and ElastiKNN to our system, and the results are amazing! I only have a bug/problem with my index pattern in Kibana when using the Elastiknn type elastiknn_dense_float_vector. When I create an index pattern and go to the "discover" tab in Kibana, I get this error:

Do you have any leads on how can I fix this? And again, thanks a lot Dmitry for sharing your experience with us in your blog/Github.

opened by anasben7 1

solr-conf

Hi, could you please provide a config (including managed-schema and solrconfig.xml) for Solr 7.2.0? I don't know how to configure it and would like to ask for your help. Thank you.

opened by wacharlin 1

Upgrade to lucene / solr 8.x

Main issue is that CustomScoreQuery got deprecated and removed. Upgrade to solr 8.x directly depends on upgrade of https://github.com/DmitryKey/solr-vector-scoring

opened by DmitryKey 1

Optimize ODFE config to index 1M vectors

As discussed in https://dmitry-kan.medium.com/speeding-up-bert-search-in-elasticsearch-750f1f34f455 the current configuration of ODFE allows indexing a maximum of 200k vectors.

The goal is to index 1M vectors to compare with all other KNN implementations.

opened by DmitryKey 0

Owner

Dmitry Kan

I build search engines. Host of the Vector Podcast: https://www.youtube.com/channel/UCCIMPfR7TXyDvlDRXjVhP1g
