Neural search engine for AI papers

Giancarlo Fissore

Last update: Dec 24, 2022

Papers search

Neural search engine for ML papers.

Demo

Usage is simple: input an abstract, get the matching papers. The following demo also showcases the finetuning functionality (notice how the paper marked as "irrelevant" is assigned a lower score after finetuning).

Dataset

We use a stripped-down version of the Kaggle arXiv Dataset that retains only the following categories: cs.AI, cs.CL, cs.CV, cs.LG, cs.MA, cs.NE.
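The category filtering described above can be sketched roughly as follows. This is only an illustration, not the project's actual processing script: it assumes the Kaggle metadata format (one JSON object per line, with a space-separated `categories` string) and assumes a paper is kept when at least one of its categories is in the list. The function name `filter_papers` and the sample records are hypothetical.

```python
import json
from typing import Iterable, Iterator

# Categories retained in the stripped-down dataset (taken from the text above).
KEPT_CATEGORIES = {"cs.AI", "cs.CL", "cs.CV", "cs.LG", "cs.MA", "cs.NE"}


def filter_papers(lines: Iterable[str], kept: set = KEPT_CATEGORIES) -> Iterator[dict]:
    """Yield the records that have at least one category in `kept`.

    Assumption: each line is a JSON object with a space-separated
    `categories` string, as in the Kaggle arXiv metadata snapshot.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        categories = set(record.get("categories", "").split())
        if categories & kept:
            yield record


if __name__ == "__main__":
    # Hypothetical sample records, for illustration only.
    sample = [
        '{"id": "0001", "categories": "cs.CV cs.LG", "abstract": "..."}',
        '{"id": "0002", "categories": "math.PR", "abstract": "..."}',
    ]
    for paper in filter_papers(sample):
        print(paper["id"], paper["categories"])
```

For the real dataset, the same function could be fed an open file handle instead of a list, so that the full snapshot never has to be loaded in memory at once.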

Setting up the environment

Clone the repository

git clone https://github.com/fissoreg/papers-search/
cd papers-search

Run the following commands in both the frontend and backend folders:

cd folder_to_go_into/ # `folder_to_go_into` is either `frontend` or `backend`

python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip
pip install -r requirements.txt

Indexing

The app works by suggesting papers whose abstracts are similar to the one you provide. The suggestions come from a database of published papers, so every paper in that database must be indexed before the system can work. This is a lengthy operation, but it needs to be performed only once:

cd backend
python src/app.py --index

For testing, you can index a small number of papers by passing the --n argument:

python src/app.py --index --n 10
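To give an intuition for what indexing and matching involve, here is a minimal, self-contained sketch. It is not the project's implementation: the real app relies on Jina AI and Sentence-Transformers, whereas this toy example swaps the neural encoder for a plain bag-of-words vector so that it runs with the standard library only. All names (`embed`, `build_index`, `search`) and the sample abstracts are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a neural encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(abstracts: dict) -> dict:
    """Embed every abstract once, up front (the 'indexing' step)."""
    return {paper_id: embed(text) for paper_id, text in abstracts.items()}


def search(index: dict, query: str, top_k: int = 3) -> list:
    """Return the `top_k` (paper_id, score) pairs most similar to `query`."""
    q = embed(query)
    scored = sorted(((cosine(q, vec), pid) for pid, vec in index.items()), reverse=True)
    return [(pid, score) for score, pid in scored[:top_k]]


if __name__ == "__main__":
    # Hypothetical abstracts, for illustration only.
    abstracts = {
        "paper-a": "Convolutional networks for image classification",
        "paper-b": "Reinforcement learning for robot control",
    }
    index = build_index(abstracts)
    for pid, score in search(index, "image classification with deep networks"):
        print(pid, round(score, 3))
```

The design point is the same as in the app: the expensive step (embedding every paper) happens once at indexing time, so each query only needs to embed one abstract and compare it against the stored vectors.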

Running the app

Run the app only after indexing is complete (see the Indexing section above).

Run the backend

cd backend
python3 src/app.py

In a new terminal, run the frontend

cd frontend
streamlit run app.py

Connect to http://localhost:8501/ (with your favourite browser).

Formatting, linting and testing

Refer to the Makefile for the specific commands.

To format code following the black standard

$ make format

Code linting with flake8

$ make lint

Testing

$ make testdeps
$ make test

Testing with coverage analysis

$ make coverage

Format, test and coverage

$ make build

Contributing

This project is in its starting phase. If you are interested in contributing, don't hesitate to get in touch! (Or go straight to the Issues ;)).

Acknowledgements

Made possible by:

Jina AI
Sentence-Transformers
arXiv: Thank you to arXiv for use of its open access interoperability.
Kaggle

Comments

Specify dependency versions in requirements.txt.

Because the dependency versions are not specified, the latest versions are installed by default. This should be corrected, as the import from jina.types.document.generators import from_csv in backend/app.py is not compatible with the latest version of jina.

opened by Andrea-Valentini 2
Replaces pip with poetry to handle deps management

This commit aims to replace pip with poetry to handle dependency management.

All the information about the dependency versions and the Python version for both backend and frontend is in the .toml file.

The backend, frontend, and dev-related dependencies are defined in different groups and can be installed separately with the command poetry install --only sample_group.

opened by Andrea-Valentini 1
Dataset processing and download.

The dataset currently in use is a stripped-down version of the Kaggle arXiv Dataset in which only the following categories are retained: cs.AI, cs.CL, cs.CV, cs.LG, cs.MA, cs.NE.

We should self-host this dataset, provide the scripts to process it, and keep it up-to-date with the original ArXiv.
good first issue

opened by fissoreg 4
