Inverted index creation and query search mechanism on Wikipedia pages.

Piyush Atri

Last update: Nov 27, 2021


Overview

Wikipedia Search Engine

Step 1 : Installing Requirements

Install the "stemming" module for Python using pip (pip install stemming).

Step 2 : Parsing the Data

To parse the data, run the file "WikipediaIndexer.py".

The syntax is: "python WikipediaIndexer.py "

It parses the whole dump and writes the index files to the 'indexFiles' directory. It also creates the document-to-title mapping file 'docTitleMap.txt' in the current directory, which the search module uses later. (NOTE: In the current code, the index is flushed to disk after every 5000 documents. To change this, edit the value of the 'WRITE_PAGES_TO_FILE' constant in config.py.)
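The batching described above can be sketched roughly as follows. This is a minimal illustration, not the code in WikipediaIndexer.py: the tokenizer, the 'term docID:freq' line layout, the file names 'index0.txt', 'index1.txt', and the helper names are all assumptions, and stemming and XML parsing are left out. The batch size is set to 2 so the small demo produces more than one file.

```python
import os
import re
import tempfile
from collections import defaultdict

WRITE_PAGES_TO_FILE = 2  # tiny value for the demo; the README mentions 5000


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (no stemming here)."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def flush(index, out_dir, batch_no):
    """Write one partial index: one 'term docID:freq ...' line per term, sorted by term."""
    path = os.path.join(out_dir, f"index{batch_no}.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as fh:
        for term in sorted(index):
            postings = " ".join(f"{doc}:{tf}" for doc, tf in sorted(index[term].items()))
            fh.write(f"{term} {postings}\n")
    return path


def build_partial_indexes(pages, out_dir):
    """pages yields (doc_id, text); returns the list of partial index files written."""
    index = defaultdict(lambda: defaultdict(int))
    written, seen = [], 0
    for doc_id, text in pages:
        for tok in tokenize(text):
            index[tok][doc_id] += 1
        seen += 1
        if seen % WRITE_PAGES_TO_FILE == 0:
            written.append(flush(index, out_dir, len(written)))
            index.clear()  # free memory before the next batch
    if index:  # leftover pages that did not fill a whole batch
        written.append(flush(index, out_dir, len(written)))
    return written


pages = [(1, "Apple pie"), (2, "apple tree"), (3, "Tree house")]
out_dir = tempfile.mkdtemp()
files = build_partial_indexes(pages, out_dir)
with open(files[0], encoding="utf-8") as fh:
    first_lines = fh.read().splitlines()
print(first_lines)
```

Writing each partial file sorted by term is what makes a later streaming merge possible without loading every file into memory.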

Step 3 : Merging the Indexes and Creating Secondary Indexes

This task is done by WikipediaIndexer.py itself; no separate command is needed. It takes the index files from the 'pathOfFolder' directory, populates the 'finalIndex' directory with index files of the given chunk size, and creates a secondary index named 'secondaryIndex.txt' in the same folder. (NOTE: In the current code, the chunk size is 20000.)
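One common way to do this kind of merge is a k-way merge over term-sorted partial files, sketched below. It is an assumption that the project works this way: here the chunk size counts terms per file, each secondary index line holds the first term of a chunk plus the chunk's file name, and the names merge_indexes, read_terms, and 'part0.txt' are hypothetical. Only 'finalIndex' and 'secondaryIndex.txt' come from the README.

```python
import heapq
import os
import tempfile
from itertools import groupby

CHUNK_SIZE = 2  # terms per final index file in this demo; the README mentions 20000


def read_terms(path):
    """Yield (term, postings) pairs from one term-sorted partial index file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            term, _, postings = line.rstrip("\n").partition(" ")
            yield term, postings


def merge_indexes(partial_paths, final_dir):
    """K-way merge partial files into chunked files plus 'secondaryIndex.txt'."""
    merged = heapq.merge(*(read_terms(p) for p in partial_paths), key=lambda tp: tp[0])
    chunk, secondary, chunk_no = [], [], 0

    def write_chunk():
        nonlocal chunk, chunk_no
        name = f"index{chunk_no}.txt"
        with open(os.path.join(final_dir, name), "w", encoding="utf-8") as fh:
            fh.write("\n".join(chunk) + "\n")
        # Record the first term of the chunk so a search can locate the right file.
        secondary.append(f"{chunk[0].split(' ', 1)[0]} {name}")
        chunk, chunk_no = [], chunk_no + 1

    for term, group in groupby(merged, key=lambda tp: tp[0]):
        postings = " ".join(p for _, p in group)  # join postings of the same term
        chunk.append(f"{term} {postings}")
        if len(chunk) == CHUNK_SIZE:
            write_chunk()
    if chunk:
        write_chunk()
    with open(os.path.join(final_dir, "secondaryIndex.txt"), "w", encoding="utf-8") as fh:
        fh.write("\n".join(secondary) + "\n")
    return secondary


# Demo with two hand-written partial files.
work = tempfile.mkdtemp()
partials = []
for i, body in enumerate(["apple 1:1 2:1\npie 1:1\ntree 2:1\n", "house 3:1\ntree 3:1\n"]):
    path = os.path.join(work, f"part{i}.txt")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(body)
    partials.append(path)
final_dir = os.path.join(work, "finalIndex")
os.makedirs(final_dir)
secondary = merge_indexes(partials, final_dir)
print(secondary)
```

Because heapq.merge reads its inputs lazily, only one line per partial file is held in memory at a time.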

Step 4 : Running the Search Engine

To search for queries, run the file 'search.py'. It loads the index from 'finalIndex' (both primary and secondary). It also uses the file 'titleoffset.txt' (which must be in the working directory) to display the titles corresponding to the docIDs. Once loaded, it prompts the user for a query and then displays the top K results.

The format of the result is:

	DocID : Title (here the DocID is from the XML dump)
	...
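A lookup through a secondary index followed by a top K selection could look like the sketch below. Everything here is an assumption made for illustration: the in-memory dictionaries stand in for the files on disk, the score is a plain sum of term frequencies, and the names postings_for and search are hypothetical. The README does not say how search.py ranks results.

```python
import bisect
import heapq
from collections import defaultdict

# Hypothetical in-memory stand-ins for the files on disk.
secondary = [("apple", "index0.txt"), ("pie", "index1.txt")]  # (first term, chunk file)
chunks = {
    "index0.txt": {"apple": {1: 1, 2: 1}, "house": {3: 1}},
    "index1.txt": {"pie": {1: 1}, "tree": {2: 1, 3: 2}},
}
titles = {1: "Apple pie", 2: "Apple tree", 3: "Tree house"}

first_terms = [term for term, _ in secondary]


def postings_for(term):
    """Binary-search the secondary index for the chunk that could hold the term."""
    pos = bisect.bisect_right(first_terms, term) - 1
    if pos < 0:  # term sorts before the first indexed term
        return {}
    return chunks[secondary[pos][1]].get(term, {})


def search(query, k=10):
    """Return up to k (doc_id, title) pairs, highest summed term frequency first."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, tf in postings_for(term).items():
            scores[doc_id] += tf
    # Ties are broken in favour of the smaller doc_id.
    top = heapq.nlargest(k, scores.items(), key=lambda item: (item[1], -item[0]))
    return [(doc_id, titles[doc_id]) for doc_id, _ in top]


print(search("apple tree", k=2))
```

The secondary index keeps the lookup cheap: only the one chunk whose first term is at or before the query term needs to be consulted.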

To specify normal queries, type them normally. For field queries, follow the format:

	f1:<terms> f2:<terms> ...
	where f1, f2 are fields: t - title, b - body, r - references, i - infobox, c - categories, e - external links
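As a rough sketch of how such a field query might be split into per-field terms, assuming each single-letter tag is followed by a colon and then space-separated terms (the exact syntax accepted by search.py may differ, and parse_query and the 'any' key are hypothetical names):

```python
import re

# Field tags taken from the list above.
FIELDS = {"t": "title", "b": "body", "r": "references",
          "i": "infobox", "c": "categories", "e": "external links"}

# A field tag is a single known letter at a word boundary, followed by a colon.
FIELD_TAG = re.compile(r"\b([tbrice]):")


def parse_query(query):
    """Return {field_name: [terms]}; a plain query is returned under the key 'any'."""
    # split() with one capture group yields [before, tag1, text1, tag2, text2, ...]
    parts = FIELD_TAG.split(query)
    parsed = {}
    if parts[0].strip():
        parsed["any"] = parts[0].lower().split()
    for tag, text in zip(parts[1::2], parts[2::2]):
        parsed.setdefault(FIELDS[tag], []).extend(text.lower().split())
    return parsed


print(parse_query("t:sachin b:world cup"))
print(parse_query("world cup"))
```

With this split, each field's terms can be looked up only against the postings recorded for that field.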
