Semantic search through Wikipedia with the Weaviate vector search engine
Weaviate is an open source vector search engine with built-in vectorization and question answering modules. We imported the complete English-language Wikipedia article dataset into a single Weaviate instance to run semantic search queries through the Wikipedia articles. In addition, we imported all the graph relations between the articles. We have made the import scripts, pre-processed articles, and backup available so that you can run the complete setup yourself.
In this repository, you'll find the three steps needed to replicate the import, but there are also downloads available that let you skip the first two steps.
If you like what you see, a ⭐ on GitHub would be appreciated!
Additional links:
- 💡 Live Demo: Weaviate GraphQL front-end
- 💡 Live Demo: Weaviate RESTful endpoint
- Weaviate documentation
- Weaviate on GitHub
- PyTorch-BigGraph search with the Weaviate vector search engine (similar project)
- [BLOG] Semantic search through Wikipedia with Weaviate (GraphQL, Sentence-BERT, and BERT Q&A)
- [VIDEO] Wikipedia Vector Search Demo with Weaviate (Henry AI Labs)
Frequently Asked Questions

| Q | A |
|---|---|
| Can I run this setup with a non-English dataset? | Yes – first, you need to go through the whole process (i.e., start with Step 1). E.g., if you want French, you can download the French version of Wikipedia like this: https://dumps.wikimedia.org/frwiki/latest/frwiki-latest-pages-articles.xml.bz2 (note that `en` is replaced with `fr`). Next, you need to change the Weaviate vectorizer module to an appropriate language model. You can choose one of the out-of-the-box language models or add your own model, as outlined in the Weaviate documentation (see the sketch below this table for what the swap looks like). |
| Can I run this setup with all languages? | Yes – you can follow two strategies: use a multilingual model, or extend the Weaviate schema to store different languages in different classes. The latter has the upside that you can use multiple vectorizers (e.g., one per language) or a more elaborate sharding strategy. But in the end, both are possible. |
| Can I run this with Kubernetes? | Of course. You need to start from Step 2, but if you follow the Kubernetes setup in the docs you should be good :-) |
| Can I run this with my own data? | Yes! This is just a demo dataset; you can use any data you have and like. Go to the Weaviate docs or join our Slack to get started. |
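As a rough idea of what swapping the vectorizer looks like, below is a minimal docker-compose sketch for a multilingual model. The image tag and service name are assumptions; check the Weaviate documentation for the currently published inference images.

# Hypothetical docker-compose excerpt: replace the English inference
# container with a multilingual sentence-transformers model.
services:
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-paraphrase-multilingual-MiniLM-L12-v2
    environment:
      ENABLE_CUDA: "1"  # set to "0" when no GPU is available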
Acknowledgments
- The `t2v-transformers` module used contains the sentence-transformers `multi-qa-MiniLM-L6-cos-v1` transformer created by the SBERT team
- Thanks to the team of Obsei for sharing the idea on our Slack channel
Stats

| description | value |
|---|---|
| Articles imported | 11,348,257 |
| Paragraphs imported | 27,377,159 |
| Graph cross-references | 125,447,595 |
| Wikipedia version | truthy October 9th, 2021 |
| Machine for inference | 12 CPU – 100 GB RAM – 250 GB SSD – 1 x NVIDIA Tesla P4 |
| Weaviate version | v1.7.2 |
| Dataset size | 122 GB |
Import
There are three steps in the import process. You can also skip the first two and directly import the backup (Step 3).
Step 1: Process the Wikipedia dump
In this step, the Wikipedia dump is processed and cleaned (wiki markup and HTML tags are removed, etc.). The output is a JSON Lines file that will be used in the next step.
Process from the Wikimedia dump:
$ cd step-1
$ wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
$ bzip2 -d enwiki-latest-pages-articles.xml.bz2
$ pip3 install -r requirements.txt
$ python3 process.py
Processing takes a few hours, so you probably want to run it in the background:
$ nohup python3 -u process.py &
You can also download the processed file from October 9th, 2021, and skip the above steps:
$ wget https://storage.googleapis.com/semi-technologies-public-data/wikipedia-en-articles.json.gz
$ gunzip wikipedia-en-articles.json.gz
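Once you have wikipedia-en-articles.json, you can sanity-check it with a few lines of Python. This is only a sketch: the exact per-line field names (e.g., title) are assumptions and may differ from what process.py emits.

# Peek at the first few records of the processed JSON Lines file.
import json

with open("wikipedia-en-articles.json", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        article = json.loads(line)   # one article per line
        print(article.get("title"))  # assumed field name
        if i == 4:                   # only inspect the first five records
            break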
Step 2: Import the dataset and vectorize the content
Weaviate takes care of the complete import and vectorization process, but you'll need some GPU and CPU muscle to achieve this. Bear in mind that this is only needed at import time. If you don't want to spend the resources on the import, you can skip to the next step and download the Weaviate backup instead. The machine needed for inference is far cheaper.
We will be using a single Weaviate instance and four Tesla T4 GPUs, each loaded with 8 models. To use them efficiently, we place an NGINX load balancer between Weaviate and the vectorizers, as sketched below.
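A hypothetical docker-compose excerpt of that idea is shown here; the actual configuration ships in step-2, and the service names, ports, and number of replicas are assumptions.

# Several transformer containers behind an NGINX reverse proxy, which
# Weaviate addresses as a single inference endpoint.
services:
  weaviate:
    image: semitechnologies/weaviate:1.7.2
    environment:
      TRANSFORMERS_INFERENCE_API: "http://loadbalancer:8080"
  loadbalancer:
    image: nginx
    # nginx.conf round-robins requests over the t2v-transformers replicas
  t2v-transformers-1:
    image: semitechnologies/transformers-inference:sentence-transformers-multi-qa-MiniLM-L6-cos-v1
    environment:
      ENABLE_CUDA: "1"
  # ... more t2v-transformers replicas, pinned to the available GPUs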
- Every Weaviate text2vec-transformers module will use the multi-qa-MiniLM-L6-cos-v1 sentence transformer.
- The volume is mounted outside the container to `/var/weaviate`. This allows us to use that folder as a backup that can be imported in the next step.
- Make sure to have Docker Compose with GPU support installed.
- The import script assumes that the JSON file is called `wikipedia-en-articles.json`.
$ cd step-2
$ docker-compose up -d
$ pip3 install -r requirements.txt
$ python3 import.py
The import takes a few hours, so you probably want to run it in the background:
$ nohup python3 -u import.py &
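The heavy lifting is done by import.py. Purely for reference, a batched import with the Weaviate Python client looks roughly like the sketch below; the structure of the JSON Lines records and the omission of Article objects and cross-references are simplifications, so treat the details as assumptions.

# Minimal sketch of a batched Paragraph import with the Python client (v3 API).
import json
import weaviate

client = weaviate.Client("http://localhost:8080")
client.batch.configure(batch_size=512, dynamic=True)

with open("wikipedia-en-articles.json", "r", encoding="utf-8") as f, client.batch as batch:
    for line in f:
        article = json.loads(line)
        for i, paragraph in enumerate(article.get("paragraphs", [])):
            batch.add_data_object(
                data_object={
                    "title": paragraph.get("title"),
                    "content": paragraph.get("content"),
                    "order": i,
                },
                class_name="Paragraph",
            )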
After the import is done, you can shut down the Docker containers by running `docker-compose down`.
You can now query the dataset!
Step 3: Load from backup
Start here if you want to work with a backup of the dataset without importing it yourself.
You can now run the dataset! We advise running it with 1 GPU, but you can also run it on CPU only (without the Q&A module). The machine you need for inference is significantly smaller.
Note that Weaviate needs some time to restore the backup (roughly 15 minutes with the setup mentioned above). You can follow the status of the restore in the Docker logs of the Weaviate container.
# clone this repository
$ git clone https://github.com/semi-technologies/semantic-search-through-Wikipedia-with-Weaviate/
# go into the backup dir
$ cd step-3
# download the Weaviate backup
$ curl https://storage.googleapis.com/semi-technologies-public-data/weaviate-1.8.0-rc.2-backup-wikipedia-py-en-multi-qa-MiniLM-L6-cos.tar.gz -O
# untar the backup (112G unpacked)
$ tar -xvzf weaviate-1.8.0-rc.2-backup-wikipedia-py-en-multi-qa-MiniLM-L6-cos.tar.gz
# get the unpacked directory
$ echo $(pwd)/var/weaviate
# use the above result (e.g., /home/foobar/var/weaviate)
# update volumes in docker-compose.yml (NOT PERSISTENCE_DATA_PATH!) to the above output
# (e.g.,
# volumes:
# - /home/foobar/var/weaviate:/var/lib/weaviate
# )
#
# With 12 CPUs this process takes about 12 to 15 minutes to complete.
# The Weaviate instance will be available right away, but the cache is still pre-filling during this time
With GPU
$ cd step-3
$ docker-compose -f docker-compose-gpu.yml up -d
Without GPU
$ cd step-3
$ docker-compose -f docker-compose-no-gpu.yml up -d
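To follow the restore progress mentioned above, you can tail the logs of the Weaviate container. The service name `weaviate` is an assumption based on the compose file; swap in the no-GPU compose file if that is what you started.

$ docker-compose -f docker-compose-gpu.yml logs -f weaviate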
Example queries
"Where is the States General of The Netherlands located?" try it live!
##
# Using the Q&A module I
##
{
Get {
Paragraph(
ask: {
question: "Where is the States General of The Netherlands located?"
properties: ["content"]
}
limit: 1
) {
_additional {
answer {
result
certainty
}
}
content
title
}
}
}
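The same question can also be asked from the Python client. A minimal sketch, assuming a local instance on http://localhost:8080:

# Run the Q&A query above with the Weaviate Python client (v3 API).
import weaviate

client = weaviate.Client("http://localhost:8080")

result = (
    client.query
    .get("Paragraph", ["content", "title", "_additional { answer { result certainty } }"])
    .with_ask({
        "question": "Where is the States General of The Netherlands located?",
        "properties": ["content"],
    })
    .with_limit(1)
    .do()
)
print(result)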
"What was the population of the Dutch city Utrecht in 2019?" try it live!
##
# Using the Q&A module II
##
{
Get {
Paragraph(
ask: {
question: "What was the population of the Dutch city Utrecht in 2019?"
properties: ["content"]
}
limit: 1
) {
_additional {
answer {
result
certainty
}
}
content
title
}
}
}
About the concept "Italian food" try it live!
##
# Generic question about Italian food
##
{
Get {
Paragraph(
nearText: {
concepts: ["Italian food"]
}
limit: 50
) {
content
order
title
inArticle {
... on Article {
title
}
}
}
}
}
"What was Michael Brecker's first saxophone?" in the Wikipedia article about "Michael Brecker" try it live!
##
# Mixing scalar queries and semantic search queries
##
{
Get {
Paragraph(
ask: {
question: "What was Michael Brecker's first saxophone?"
properties: ["content"]
}
where: {
operator: Equal
path: ["inArticle", "Article", "title"]
valueString: "Michael Brecker"
}
limit: 1
) {
_additional {
answer {
result
}
}
content
order
title
inArticle {
... on Article {
title
}
}
}
}
}
Get all Wikipedia graph connections for "jazz saxophone players" try it live!
##
# Mixing semantic search queries with graph connections
##
{
Get {
Paragraph(
nearText: {
concepts: ["jazz saxophone players"]
}
limit: 25
) {
content
order
title
inArticle {
... on Article { # <== Graph connection I
title
hasParagraphs { # <== Graph connection II
... on Paragraph {
title
}
}
}
}
}
}
}
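All of the queries above can also be posted to the RESTful GraphQL endpoint directly. A minimal curl sketch, assuming a local instance on port 8080:

$ curl http://localhost:8080/v1/graphql \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"query": "{ Get { Paragraph(nearText: {concepts: [\"jazz saxophone players\"]} limit: 5) { title content } } }"}'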