A web search server for ParlAI, including Blenderbot2.

Overview

Description

Querying the server

The server reacting appropriately

  • Uses html2text to strip the markup out of the page.
  • Uses beautifulsoup4 to parse the title.
  • Currently only uses the googlesearch module to query Google for URLs, but the code is modular and search-engine agnostic, so support for new search engines can be added very easily.

Using the googlesearch module is very slow because it parses web pages instead of querying web services. This is fine for playing with the model, but makes that searcher unusable for training or large-scale inference.

To be able to train, one would have to pay for a search service, for example Google Cloud's or Microsoft Azure's, and derive the Search class to query it.
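
For example, a derived class for Azure's Bing Web Search API might look like the sketch below. The `Search` base class here is a stand-in (the real base class in search_server.py may differ), but the endpoint, the `Ocp-Apim-Subscription-Key` header, and the `webPages.value` response field are standard Bing v7 conventions:

```python
from typing import List

class Search:
    """Stand-in for the server's search-engine interface."""
    def search(self, query: str, n: int) -> List[str]:
        raise NotImplementedError

class BingSearch(Search):
    """Queries Azure's Bing Web Search v7 API instead of scraping Google."""
    ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"

    def __init__(self, subscription_key: str):
        self.subscription_key = subscription_key

    def build_request(self, query: str, n: int):
        # Pure helper: assembles the HTTP request pieces without sending them.
        headers = {"Ocp-Apim-Subscription-Key": self.subscription_key}
        params = {"q": query, "count": n}
        return self.ENDPOINT, headers, params

    def search(self, query: str, n: int) -> List[str]:
        import requests  # only needed for the live HTTP call
        url, headers, params = self.build_request(query, n)
        resp = requests.get(url, headers=headers, params=params, timeout=15)
        resp.raise_for_status()
        # Bing returns candidate pages under webPages.value.
        return [page["url"] for page in resp.json()["webPages"]["value"]]
```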

Quick Start:

First install the requirements:

pip install -r requirements.txt

Run this command in one terminal tab:

python search_server.py serve --host 0.0.0.0:8080

[Optional] You can then test the server with

curl -X POST "http://0.0.0.0:8080" -d "q=baseball&n=1"
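
The same request can be issued from Python. The form fields (`q`, `n`) match the curl command above; the `response` list of url/title/content dicts is the shape ParlAI's SearchEngineRetriever expects, but double-check both against your running server:

```python
def build_payload(query: str, n: int) -> dict:
    # Same form fields as the curl command: the query and the number of results.
    return {"q": query, "n": n}

def parse_response(body: dict) -> list:
    # Assumed response shape: {"response": [{"url": ..., "title": ..., "content": ...}, ...]}
    return [(doc.get("title"), doc.get("url")) for doc in body.get("response", [])]

def web_search(query: str, n: int, host: str = "http://0.0.0.0:8080") -> list:
    import requests  # only needed for the live HTTP call
    resp = requests.post(host, data=build_payload(query, n), timeout=30)
    resp.raise_for_status()
    return parse_response(resp.json())
```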

Then for example start Blenderbot2 in a different terminal tab:

python -m parlai interactive --model-file zoo:blenderbot2/blenderbot2_3B/model --search_server 0.0.0.0:8080

Colab

There is a Jupyter notebook; just run it. Some Colab instances run out of memory and some don't, so it may take a few attempts.

Testing the server:

You need to already be running a server, started by calling serve with the same hostname and port. This command creates a parlai.agents.rag.retrieve_api.SearchEngineRetriever, tries to connect, sends a query, and parses the answer.

python search_server.py test_server --host 0.0.0.0:8080

Testing the parser:

python search_server.py test_parser www.some_url_of_your_choice.com/

Comments
  • How do I call your server from the macOS terminal command line?

    When trying to set up Blenderbot2 with your search server, it does not seem to fetch the queries. On my Mac, in the terminal window, here's what I did:

    $ git clone https://github.com/RodolfoFigueroa/ParlAI.git ./ParlAI
    $ git clone https://github.com/pytorch/fairseq ./fairseq
    $ pip install -r ./ParlAI/requirements.txt
    $ cd ParlAI/
    $ python ./setup.py develop
    $ cd ../fairseq
    $ pip install --editable ./
    $ cd ..
    $ git clone https://github.com/JulesGM/ParlAI_SearchEngine
    $ pip install -r ./ParlAI_SearchEngine/requirements.txt
    $ python ./ParlAI_SearchEngine/search_server.py serve --host 0.0.0.0:8080 &
    $ parlai interactive --model-file zoo:blenderbot2/blenderbot2_400M/model --search-server http://0.0.0.0:8080

    10:39:39 | Current ParlAI commit: ad57e5281de76ed42a378bdd2bcdef2fafc54ab8
    Enter [DONE] if you want to end the episode, [EXIT] to quit.
    10:39:39 | creating task(s): interactive
    Enter Your Message: Who went into space this week?
    blenderbot2/lib/python3.7/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at ../aten/src/ATen/native/BinaryOps.cpp:467.)
      return torch.floor_divide(self, other)

    [BlenderBot2Fid]: I'm not sure, but I'm pretty sure it was Elon Musk. POTENTIALLY_UNSAFE_
    Enter Your Message: who won the superbowl this year
    [BlenderBot2Fid]: I don't think Elon Musk has ever been to space, but it would be cool if he did.
    Enter Your Message:
    [BlenderBot2Fid]: That's a good question. I don't know who won the Super Bowl this year, though.
    Enter Your Message:

    and when I try to curl I get the following error:

    $ curl -X GET "http://0.0.0.0:8080?q=baseball"
    127.0.0.1 - - [21/Jul/2021 10:32:34] code 501, message Unsupported method ('GET')
    127.0.0.1 - - [21/Jul/2021 10:32:34] "GET /?q=baseball HTTP/1.1" 501 -

    Error response

    Error code: 501

    Message: Unsupported method ('GET').

    Error code explanation: HTTPStatus.NOT_IMPLEMENTED - Server does not support this operation.

    any help would be appreciated. Thank you.

    opened by hitchingsh 6
  • Adding custom files with information to the search engine

    I've tried the search engine using your colab.ipynb and it worked perfectly after several trials. It seems that Colab works better at some hours, and fails when resource demand is high.

    I am still wondering how I can add a custom directory of text files to your ParlAI_SearchEngine, so that relevant documents can also be found in that directory. This way one could add custom information to the web search. Any ideas?

    opened by alelasantillan 4
  • Failed To Connect To Port

    While chatting with Blenderbot using the $HOST search server, it stopped taking inputs after a while, and when I checked with the command

    !curl -X POST $HOST -d "q=biggest gpt model&n=1"

    it threw the error

    curl: (7) Failed to connect to 0.0.0.0 port 1111: Connection refused

    opened by Jawad1347 2
  • [Fix] Making port an integer

    Thanks for making this repo!

    I have been using it and noticed that I needed to make a small fix to make it work.

    The port needs to be converted to an integer, otherwise it will throw an error.

    def _parse_host(host: str) -> Tuple[str, int]:
        splitted = host.split(":")
        hostname = splitted[0]
        port = splitted[1] if len(splitted) > 1 else _DEFAULT_PORT
        port = int(port)  # the port must be an integer
        return hostname, port
    
    opened by mansimov 2
  • ParlAI_SearchEngine license

    Hi. Thanks for making this project public.

    I want to fork this repository and add files, but the license is not specified. Could you please specify the license? Or is it set to standard copyright?

    opened by scy6500 1
  • Made 3 improvements.  First, set timeout to 15 seconds so it does not…

    Made 3 improvements. First, set the timeout to 15 seconds so it does not get stuck. Second, limited the text returned to 2K per result. Third, set a time filter so that only web pages indexed within the last 30 days are returned. These changes improve the information returned and the answers given by ParlAI.

    opened by hitchingsh 1
  • Fixes

    I still need lxml and chardet to run it, so it would be great if you could add them to the requirements.

    When I do the query:

    curl -X POST "http://0.0.0:8090" -d "q=Wandavision&n=5"

    it crashes search_server.py. To fix it, I added the following line so that the server does not crash if the title is nonexistent:

    if output_dict["title"]:
        output_dict["title"] = output_dict["title"].replace("\n", "").replace("\r", "")
    

    Note: this error also crashes Blenderbot2. You can reproduce it by typing the following sentence into Blenderbot2 while using your search_server.py:

    I like the TV show Wandavision

    Note: I have made and tested this fix for both of the cases above, and both now work properly. Could you merge this into your GitHub repo?

    PS: Thank you so much for writing this server! It's been a life saver.

    opened by hitchingsh 1
  • breaks after some queries

    When using this together with parlai interactive, the following error appears after 2-4 conversation turns.

    requests.exceptions.ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=8080): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000027318DBB388>: Failed to establish a new connection: [WinError 10049]
    

    the full trace:

    Traceback (most recent call last):
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\urllib3\connection.py", line 175, in _new_conn
        (self._dns_host, self.port), self.timeout, **extra_kw
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\urllib3\util\connection.py", line 95, in create_connection
        raise err
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\urllib3\util\connection.py", line 85, in create_connection
        sock.connect(sa)
    OSError: [WinError 10049] The requested address is not valid in its context
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\urllib3\connectionpool.py", line 710, in urlopen
        chunked=chunked,
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\urllib3\connectionpool.py", line 398, in _make_request
        conn.request(method, url, **httplib_request_kw)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\urllib3\connection.py", line 239, in request
        super(HTTPConnection, self).request(method, url, body=body, headers=headers)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\http\client.py", line 1281, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\http\client.py", line 1327, in _send_request
        self.endheaders(body, encode_chunked=encode_chunked)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\http\client.py", line 1276, in endheaders
        self._send_output(message_body, encode_chunked=encode_chunked)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\http\client.py", line 1036, in _send_output
        self.send(msg)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\http\client.py", line 976, in send
        self.connect()
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\urllib3\connection.py", line 205, in connect
        conn = self._new_conn()
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\urllib3\connection.py", line 187, in _new_conn
        self, "Failed to establish a new connection: %s" % e
    urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x0000027318DBB388>: Failed to establish a new connection: [WinError 10049] The requested address is not valid in its context
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\requests\adapters.py", line 450, in send
        timeout=timeout
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\urllib3\connectionpool.py", line 786, in urlopen
        method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\urllib3\util\retry.py", line 592, in increment
        raise MaxRetryError(_pool, url, error or ResponseError(cause))
    urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='0.0.0.0', port=8080): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000027318DBB388>: Failed to establish a new connection: [WinError 10049] The requested address is not valid in its context'))
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\Scripts\parlai.exe\__main__.py", line 7, in <module>
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\__main__.py", line 14, in main
        superscript_main()
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\core\script.py", line 325, in superscript_main
        return SCRIPT_REGISTRY[cmd].klass._run_from_parser_and_opt(opt, parser)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\core\script.py", line 108, in _run_from_parser_and_opt
        return script.run()
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\scripts\interactive.py", line 118, in run
        return interactive(self.opt)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\scripts\interactive.py", line 93, in interactive
        world.parley()
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\tasks\interactive\worlds.py", line 89, in parley
        acts[1] = agents[1].act()
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\core\torch_agent.py", line 2143, in act
        response = self.batch_act([self.observation])[0]
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\core\torch_agent.py", line 2239, in batch_act
        output = self.eval_step(batch)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\projects\blenderbot2\agents\blenderbot2.py", line 790, in eval_step
        output = super().eval_step(batch)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\agents\rag\rag.py", line 290, in eval_step
        output = super().eval_step(batch)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\core\torch_generator_agent.py", line 876, in eval_step
        batch, self.beam_size, maxlen, prefix_tokens=prefix_tokens
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\agents\rag\rag.py", line 673, in _generate
        gen_outs = self._rag_generate(batch, beam_size, max_ts, prefix_tokens)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\agents\rag\rag.py", line 713, in _rag_generate
        self, batch, beam_size, max_ts, prefix_tokens
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\core\torch_generator_agent.py", line 1094, in _generate
        encoder_states = model.encoder(*self._encoder_input(batch))
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\projects\blenderbot2\agents\modules.py", line 821, in encoder
        segments,
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\projects\blenderbot2\agents\modules.py", line 226, in encoder
        num_memory_decoder_vecs,
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\projects\blenderbot2\agents\modules.py", line 357, in retrieve_and_concat
        search_queries, query_vec, search_indices
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\projects\blenderbot2\agents\modules.py", line 519, in perform_search
        query_vec[search_indices]  # type: ignore
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\agents\rag\retrievers.py", line 411, in retrieve
        docs, scores = self.retrieve_and_score(query)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\agents\rag\retrievers.py", line 1192, in retrieve_and_score
        search_results_batach = self.search_client.retrieve(search_queries, self.n_docs)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\agents\rag\retrieve_api.py", line 132, in retrieve
        return [self._retrieve_single(q, num_ret) for q in queries]
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\agents\rag\retrieve_api.py", line 132, in <listcomp>
        return [self._retrieve_single(q, num_ret) for q in queries]
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\agents\rag\retrieve_api.py", line 111, in _retrieve_single
        search_server_resp = self._query_search_server(search_query, num_ret)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\parlai\agents\rag\retrieve_api.py", line 89, in _query_search_server
        server_response = requests.post(server, data=req)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\requests\api.py", line 117, in post
        return request('post', url, data=data, json=json, **kwargs)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\requests\api.py", line 61, in request
        return session.request(method=method, url=url, **kwargs)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\requests\sessions.py", line 529, in request
        resp = self.send(prep, **send_kwargs)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\requests\sessions.py", line 645, in send
        r = adapter.send(request, **kwargs)
      File "C:\Users\Usuario\anaconda3\envs\parlaisearch\lib\site-packages\requests\adapters.py", line 519, in send
        raise ConnectionError(e, request=request)
    requests.exceptions.ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=8080): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000027318DBB388>: Failed to establish a new connection: [WinError 10049] The requested address is not valid in its context'))
    

    @JulesGM @klshuster

    opened by alexvaca0 2
  • Using ParlAI

    Just wondering, is there any way I could make a class to get a response, to simply call the other procedures with ParlAI, to make it easier to use? Also, is there a way to increase the number of websites and areas of text it parses, so that we can expand the knowledge?

    opened by HelloIshHere 0
  • Added Bing Search Engine, 10X Speedup, Cleaner HTML. Made architectural changes requested by JulesGM

    @JulesGM I have implemented all the architectural and stylistic suggestions you requested. This new pull request adds Bing Search, since that is what was used in the ParlAI Blenderbot2 paper. It also allows you to limit the text per URL, since Blenderbot currently only uses the first 512 characters, and to strip out HTML menus. You can also return a clean summary of each web page 10X faster, since it does not need to fetch each URL. I have updated the README with examples so you can quickly test these options. Overall, it enables the search engine to return significantly higher-quality text to Blenderbot2. I will send you a separate private email with the URLs to each of these test URLs, which I have deployed as Docker containers on Google Cloud, in case you do not have a Bing Search subscription key and want to test them. Thank you again for your time.

    opened by hitchingsh 0
  • Accept pull request with Bing added, 10X speed up option, cleaner web page text returned to ParlAI, command line args?

    @JulesGM I just did a big commit on my fork and would like to submit it as a pull request, if you think you'd be willing to accept it. I was wondering if you could take a look at it first; I'm also open to making suggested changes before submitting, including stylistic ones like how I coded the command-line args.

    Commit: https://github.com/hitchingsh/ParlAI_SearchEngine/commit/14f0fba2e255c8d5ef077e8ec43b7b7e50fe5194

    opened by hitchingsh 3
Owner
Jules Gagnon-Marchand
MSc student at Mila in NLP. Previously a research intern at Google Brain, Google AI research, and Huawei AI. [email protected] https://www.linkedin.com/in/julesgm