An open source, non-profit search engine implemented in Python

Overview

Mwmbl: No ads, no tracking, no cruft, no profit

Mwmbl is a non-profit, ad-free, free-libre and free-lunch search engine with a focus on usability and speed. At the moment it is little more than an idea together with a proof of concept implementation of the web front-end and search technology on a very small index. A crawler is still to be implemented.

Our vision is a community working to provide top quality search particularly for hackers, funded purely by donations.

Why a non-profit search engine?

The motives of ad-funded search engines are at odds with providing an optimal user experience. These sites are optimised for ad revenue, with user experience taking second place. This means that pages are loaded with ads which are often not clearly distinguished from search results. Also, eitland on Hacker News comments:

Thinking about it it seems logical that for a search engine that practically speaking has monopoly both on users and as mattgb points out - [to some] degree also on indexing - serving the correct answer first is just dumb: if they can keep me going between their search results and tech blogs with their ads embedded one, two or five times extra that means one, two or five times more ad impressions.

But what about...?

The space of alternative search engines has expanded rapidly in recent years. Here's a very incomplete list of some that have interested me:

  • YaCy - an open source distributed search engine
  • search.marginalia.nu - a search engine favouring text-heavy websites
  • Gigablast - a privacy-focused search engine whose owner makes money by selling the technology to third parties
  • Brave
  • DuckDuckGo

Of these, YaCy is the closest in spirit to the idea of a non-profit search engine. The index is distributed across a peer-to-peer network. Unfortunately this design decision makes search very slow.

Marginalia Search is fantastic, but it is more of a personal project than an open source community.

All other search engines that I've come across are for-profit. Please let me know if I've missed one!

Designing for non-profit

To be a good search engine, we need to store many items, but the cost of running the engine is at least proportional to the number of items stored. Our main consideration is thus to reduce the cost per item stored.

The design is founded on the observation that most items rank for a small set of terms. In the extreme version of this, where each item ranks for a single term, the usual inverted index design is grossly inefficient, since we have to store each term at least twice: once in the index and once in the item data itself.

Our design is a giant hash map. We have a single store consisting of a fixed number N of pages. Each page is of a fixed size (currently 4096 bytes to match a page of memory), and consists of a compressed list of items. Given a term for which we want an item to rank, we compute a hash of the term, a value between 0 and N - 1. The item is then stored in the corresponding page.

To retrieve pages, we simply compute the hash of the terms in the user query and load the corresponding pages, filter the items to those containing the term and rank the items. Since each page is small, this can be done very quickly.

Because we compress the list of items, we can rank for more than a single term and maintain an index smaller than the inverted index design. Well, that's the theory. This idea has yet to be tested out on a large scale.
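The store-and-retrieve scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual mwmbl implementation: the in-memory page array, the JSON item encoding and the CRC32 hash are all assumptions made for the example.

```python
import json
import zlib

PAGE_SIZE = 4096   # fixed page size, matching one page of memory
NUM_PAGES = 1024   # N: total number of pages in the store

# The store: a fixed number of pages, each a compressed list of items
pages = [zlib.compress(json.dumps([]).encode()) for _ in range(NUM_PAGES)]


def page_index(term: str) -> int:
    """Hash a term to a page number between 0 and NUM_PAGES - 1."""
    return zlib.crc32(term.encode()) % NUM_PAGES


def store_item(term: str, item: dict) -> None:
    """Store an item in the page for a term it should rank for."""
    i = page_index(term)
    items = json.loads(zlib.decompress(pages[i]))
    items.append(item)
    compressed = zlib.compress(json.dumps(items).encode())
    if len(compressed) > PAGE_SIZE:
        raise ValueError("page full")  # a real index would evict low-ranking items
    pages[i] = compressed


def search(query: str) -> list[dict]:
    """Load the page for each query term, then filter items containing the term."""
    results = []
    for term in query.lower().split():
        items = json.loads(zlib.decompress(pages[page_index(term)]))
        results.extend(item for item in items if term in item["terms"])
    return results
```

Since each page is at most 4096 bytes, a query touches only as many small pages as it has terms, which is what keeps retrieval fast.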

Crawling

Our current index is a small sample of the excellent Common Crawl, restricted to English content and domains which score highly on average in Hacker News submissions. It is likely for a variety of reasons that we will want to go beyond Common Crawl data at some point, so building a crawler becomes inevitable. We plan to start work on a distributed crawler, probably implemented as a browser extension that can be installed by volunteers.

How to contribute

There are lots of ways to help:

  • Volunteer to test out the distributed crawler when it's ready
  • Help out with development of the engine itself
  • Donate some money towards hosting costs and/or founding an official non-profit organisation

If you would like to help in any of these or other ways, thank you! Please email the main author (email address is in the git commit history).

Comments
  • OpenSearch is broken

    I don't know if this is intentional at the moment, but I noticed this when I was poking around the page.

    On https://mwmbl.org/, search for application/opensearchdescription+xml in the markup, which shows this: [screenshot]


    I think we should be expecting something like this: https://search.brave.com/

    <link rel="search" type="application/opensearchdescription+xml" title="Brave Search" href="https://cdn.search.brave.com/serp/v1/static/brand/c57da39655b0b08603d88711f8e33aae50500cbcd8d2fc70a0d01e105cbd0985-opensearch.xml">
    
    bug 
    opened by IcecreamSlut 9
  • Unable to run mwmbl locally / index.tinysearch corrupt?

    Hi everyone,

    I'm trying to run a local development env for mwmbl, but I'm experiencing some difficulties.
    Been following the dev guide: https://github.com/mwmbl/mwmbl/wiki/Development-FAQ

    Steps I've done:

    1. Cloning the repo:
    git clone [email protected]:raypatterson77/mwmbl.git

    2. Setting up Python venv:
    python -m venv .

    3. Activating Python venv:
    source bin/activate

    4. Installing dependencies:
    pip install .

    5. Downloaded the index file (several times) and placed it in the data folder
    6. Trying to run mwmbl:
    mwmbl-tinysearchengine --config config/tinysearchengine.yaml
    

    Errors:

    Running

    mwmbl-tinysearchengine --config config/tinysearchengine.yaml  
    

    Gives the error:

    usage: mwmbl-tinysearchengine [-h] --index INDEX --terms TERMS
    mwmbl-tinysearchengine: error: the following arguments are required: --index, --terms
    

    NOTE: /config/tinysearchengine.yaml is not in the master branch, but still available in the update-readme-for-new-crawler branch. Even with /config/tinysearchengine.yaml present,

    mwmbl-tinysearchengine --config config/tinysearchengine.yaml  
    

    is failing with the error above.

    Trying to use the --index and --terms parameters as follows:

    mwmbl-tinysearchengine --index data/index.tinysearch --terms data/terms.csv  
    

    NOTE: the terms CSV file is not present anywhere; I'm not sure what exactly is expected to be in the file, but from the errors I was getting,

    term,count
    

    needs to be present. Creating the file terms.csv under data/terms.csv with above content gives me the error:

    Terms [] []
    Traceback (most recent call last):
      File "/home/o/Dokumente/code/python/mwmbl-dev/bin/mwmbl-tinysearchengine", line 8, in <module>
        sys.exit(main())
      File "/home/o/Dokumente/code/python/mwmbl-dev/lib/python3.10/site-packages/mwmbl/tinysearchengine/app.py", line 39, in main
        with TinyIndex(item_factory=Document, index_path=args.index) as tiny_index:
      File "/home/o/Dokumente/code/python/mwmbl-dev/lib/python3.10/site-packages/mwmbl/tinysearchengine/indexer.py", line 81, in __init__
        metadata = TinyIndexMetadata.from_bytes(metadata_bytes)
      File "/home/o/Dokumente/code/python/mwmbl-dev/lib/python3.10/site-packages/mwmbl/tinysearchengine/indexer.py", line 51, in from_bytes
        raise ValueError("This doesn't seem to be an index file")
    ValueError: This doesn't seem to be an index file
    

    There is no sha sum to validate the integrity of the index.tinysearch file, but as mentioned, I downloaded it multiple times and I don't think the file got corrupted during the download process.

    Am I missing something, or is there a problem with the recent version of mwmbl? And is there an example of the terms CSV file anywhere? Is config/tinysearchengine.yaml no longer in the master branch for some reason? If so, I can make a pull request to add it back from the update-readme-for-new-crawler branch.

    Docker

    Trying to set up the dev env with Docker ends in similar problems. Unfortunately I haven't documented it well, but config/tinysearchengine.yaml is necessary for building the image, as defined in the Dockerfile. After building:

    sudo docker run -p 8080:8080 mwmbl                                                                                                                                                                        
    usage: mwmbl-tinysearchengine [-h] --index INDEX --terms TERMS
    mwmbl-tinysearchengine: error: the following arguments are required: --index, --terms
    

    Env

    uname -a                                                                                                                                                                                                   
    Linux o 5.16.7-arch1-1
    
    python --version                                                                                                                                                                                          
    Python 3.10.2
    
    pip --version                                                                                                                                                                                                 
    pip 21.2.4
    
    docker version                                                                                                                                                                                                
    Client:
     Version:           20.10.12
     API version:       1.41
     Go version:        go1.17.5
     Git commit:        e91ed5707e
     Built:             Mon Dec 13 22:31:40 2021
     OS/Arch:           linux/amd64
     Context:           default
     Experimental:      true
    
    bug 
    opened by raypatterson77 5
  • Prioritise root URLs in search ranking

    At the moment if you search for "facebook" you will get results about facebook, whereas you should probably get facebook.com/. We should prioritise such root URLs if they exist.

    enhancement 
    opened by daoudclarke 5
  • Building Image from Dockerfile failed

    In Step 14/15 copying the data folder is failing, because the folder is not present. If I manually create the data folder, the image will build, but I can not start the container:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/venv/lib/python3.9/site-packages/tinysearchengine/app.py", line 13, in <module>
        tiny_index = TinyIndex(Document, index_path, NUM_PAGES, PAGE_SIZE)
      File "/venv/lib/python3.9/site-packages/tinysearchengine/indexer.py", line 76, in __init__
        self.index_file = open(self.index_path, 'rb')
    FileNotFoundError: [Errno 2] No such file or directory: '/data/index.tinysearch'

    Which files have to be present in this data folder?

    documentation 
    opened by raypatterson77 4
  • Determine whether or not we still need to uninstall numpy on EMR

    Can you please comment on why pip uninstall needs to be invoked three times in a row?

    https://github.com/mwmbl/mwmbl/blob/03ca368b2acb1a23edb839df7884452d7c26f81d/indexer/bootstrap.sh#L5

    enhancement 
    opened by setop 3
  • [Feature request] Language-based content summary

    I have noticed that some websites which are available in multiple languages show the content summary in the language in which they were indexed.

    I think the content summary should instead be selected based on the language of the web browser requesting it. Maybe the same website should be indexed several times to get content summaries in different languages, or this could be done through the extension with prefixed languages.

    [screenshot 2022-12-05_11-14]

    In this picture, the http link is shown in a Cyrillic-based language instead of English, which is the language my web browser uses.

    [screenshot 2022-12-05_11-15]

    In this other picture, one of the links is shown in French.

    opened by EchedeyLR 2
  • Flood in browser history

    While you are typing a query, the search results update and so does the URL containing the query. There is no need to store these intermediate query URLs in browser history, because they bulk it up and make the history less readable.

    [screenshot: browser history flood]

    bug 
    opened by qualterz 2
  • Store index metadata along with index

    At the moment, the number of pages and the page size are stored in the code. This doesn't make sense as different indexes can have different page sizes and number of pages. Instead I suggest storing this metadata either:

    1. in the index itself, in which case we could sacrifice the first 4096 bytes for metadata so as to maintain the physical memory page boundaries
    2. or, we could use a separate file stored along with the index.

    My current preference is for 1.
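    Option 1 could be sketched roughly as follows. This is a hypothetical illustration, not the actual mwmbl code: the JSON encoding, the length prefix and the function names are all assumptions.

```python
import json
import struct

PAGE_SIZE = 4096  # sacrifice exactly one page so memory page boundaries are kept


def pack_metadata(num_pages: int, page_size: int, version: int = 1) -> bytes:
    """Serialise index metadata into one full page at the start of the index."""
    payload = json.dumps(
        {"version": version, "num_pages": num_pages, "page_size": page_size}
    ).encode()
    # Length-prefix the payload, then pad with zero bytes to exactly one page
    header = struct.pack("<I", len(payload)) + payload
    assert len(header) <= PAGE_SIZE, "metadata must fit in a single page"
    return header.ljust(PAGE_SIZE, b"\x00")


def unpack_metadata(page: bytes) -> dict:
    """Read metadata back from the first page of an index file."""
    (length,) = struct.unpack("<I", page[:4])
    return json.loads(page[4 : 4 + length])
```

    With this layout, page i of the index simply lives at byte offset (i + 1) * page_size, so the rest of the indexer is unchanged apart from the offset.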

    enhancement 
    opened by daoudclarke 2
  • Create an evaluation dataset

    We can use the Bing API to:

    1. Identify common search queries using the Autosuggest API. To do this we can query e.g. "n" to autosuggest and get back common queries that begin with "n", e.g. ["next", "news", ...]. This can then be bootstrapped by putting in these common queries to get longer queries. So, send "news" to autosuggest to get ["news uk", "news bbc", "news today"].
    2. Given each query, retrieve the top N results for each query from the Web Search API.

    We should ideally collect a dataset of at least 2,000 queries, which we can split into a development set and a test set of 1,000 queries each.

    For the evaluation we will want to filter the retrieved results to the same set of domains that we are currently restricted to (top HN scoring domains).
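    The bootstrapping in step 1 is a breadth-first expansion over query prefixes. A rough sketch, where `autosuggest` is a hypothetical callable wrapping the Bing Autosuggest API (mapping a prefix to a list of suggested queries):

```python
from collections import deque


def bootstrap_queries(autosuggest, seeds, max_queries=2000):
    """Collect common queries by feeding autosuggest results back in as inputs.

    autosuggest: callable mapping a prefix string to a list of suggested queries.
    seeds: initial short prefixes, e.g. single letters like "n".
    """
    seen = set()
    queue = deque(seeds)
    while queue and len(seen) < max_queries:
        prefix = queue.popleft()
        for suggestion in autosuggest(prefix):
            if suggestion not in seen:
                seen.add(suggestion)
                queue.append(suggestion)  # bootstrap: expand this query further
    return seen
```

    For example, seeding with "n" might yield ["news", "next", ...], and re-querying "news" then yields ["news uk", "news bbc", ...], exactly as described above.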

    enhancement 
    opened by daoudclarke 2
  • renamed package to mwmbl

    Merge request includes

    • renamed package to mwmbl in pyproject.toml
    • tinysearchengine and indexer modules have been moved into mwmbl package folder
    • analyse module has been left as is in the root of the repo
    • import statements in tinysearchengine now use mwmbl.tinysearchengine
    • import statements in indexer now use mwmbl.indexer or mwmbl.tinysearchengine or relative imports like .paths
    • import statements in analyse now use mwmbl.indexer or mwmbl.tinysearchengine
    • final CMD in Dockerfile now uses updated path mwmbl.tinysearchengine.app
    • fixed a couple of import statement errors in tinysearchengine/indexer.py

    Notes for reviewer

    • Fixes #15
    • It's recommended to merge #13 before this PR. This PR is self-sufficient, but it will have incomplete dependencies.
    • Tested that building the Dockerfile still works, but the container does not run due to the known problem that the index file is missing.
    • Tested that the mwmbl.indexer.extract_local.py runs without failing imports (ends without doing anything since input_queue is empty)
    • Unable to test mwmbl.tinysearchengine.app due to missing index file.
    opened by nitred 2
  • Choice in Git hosting does not match project philosophy

    Reading the article over at https://daoudclarke.net/search%20engines/2022/07/10/non-profit-search-engine was interesting, goals like these are indeed worth fighting for:

    Even if I make the web better for one person, it’s worth it. Because the way things are is just wrong.

    Though, it is a bit puzzling that the choice in tooling does not follow that same spirit. Why is a project like this hosted on the Google of Git hosting services: GitHub?

    Please consider moving to more open alternatives, e.g. GitLab.

    • https://www.theregister.com/2022/06/30/software_freedom_conservancy_quits_github/
    opened by pennersr 1
  • Community aspect

    Something between Lemmy and an indexer/search engine, where the visibility of each website would differ per instance based on user ratings. Some instances could also block certain domains: for example, an instance for FOSS enthusiasts that blocked big tech domains, or an instance for leftists that blocked western media.

    To search a niche topic you could select (if it existed) the instance focusing on that. Instead of doing kung-fu on a general search engine.

    It would show all links except those from blocked domains, rather than only the ones posted by users as Lemmy does. But alongside the links you would also have a button to show a comment section, giving a community aspect like Lemmy. Comments would have multiple levels, votes, and different sorting options, just like Lemmy.

    The rating system would require community moderation: a pyramidal, trust-based moderation system like Discourse's, so that an instance admin would only have to deal with a few users to keep the instance free of bots and ill-intentioned users skewing the ratings.

    enhancement 
    opened by JediMaster25 3
  • Prepare an API endpoint for testing crawlers

    During development of crawler code, it would be significantly helpful to have a crawler API endpoint that lets crawlers access certain testing pages.

    Some ideas:

    • api.crawler-test.mwmbl.org: The API endpoint.
    • target.crawler-test.mwmbl.org/200.html, target.crawler-test.mwmbl.org/404.html, target.crawler-test.mwmbl.org/noindex.html, target.crawler-test.mwmbl.org/disallow.html, ...: Test cases.
    • disallow-all.crawler-test.mwmbl.org: Another test case.
    • etc.

    Launching such an API endpoint locally seems useful too. In principle we can share most of the code between the public test endpoint and a custom local test endpoint.

    enhancement 
    opened by omasanori 0
  • Add some sites to crawler

    opened by daoudclarke 6
  • Homepage of indexed sites are missing

    Great project! Searching for e.g. Wikipedia, New York Times or Hacker News does not lead to the start page of the website but to random sub-pages. Same issue when searching for the exact start page URL directly.

    bug 
    opened by samuel-git 0