Web interface for browsing, search and filtering recent arxiv submissions

Overview

arxiv sanity preserver

Update Nov 27, 2021: you may wish to look at my from-scratch re-write of arxiv-sanity: arxiv-sanity-lite. It is a smaller version of arxiv-sanity that focuses on the core value proposition, is significantly less likely to ever go down, scales better, and has a few additional goodies such as multiple possible tags per account, regular emails of new papers of interest, etc. It is also running live on arxiv-sanity-lite.com.

This project is a web interface that attempts to tame the overwhelming flood of papers on Arxiv. It allows researchers to keep track of recent papers, search for papers, sort papers by similarity to any paper, see recent popular papers, add papers to a personal library, and get personalized recommendations of (new or old) Arxiv papers. This code is currently running live at www.arxiv-sanity.com/, where it's serving 25,000+ Arxiv papers from Machine Learning (cs.[CV|AI|CL|LG|NE]/stat.ML) over the last ~3 years. With this code base you could replicate the website for any of your favorite subsets of Arxiv by simply changing the categories in fetch_papers.py.

Code layout

There are two large parts of the code:

Indexing code. Uses the Arxiv API to download the most recent papers in any categories you like, then downloads all papers, extracts all text, and creates tfidf vectors based on the content of each paper. This code is therefore concerned with the backend scraping and computation: building up a database of arxiv papers, calculating content vectors, creating thumbnails, computing SVMs for people, etc.

User interface. Then there is a web server (based on Flask/Tornado/sqlite) that allows searching through the database and filtering papers by similarity, etc.
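
To make the indexing half more concrete, here is a minimal, standard-library-only sketch of tfidf vectors over words and bigrams, and of ranking papers by cosine similarity to a chosen paper. It is an illustration only: the function names are hypothetical, the weighting formula is an assumption, and the actual analyze.py relies on scikit-learn rather than this code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens plus adjacent-word bigrams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        # Smoothed idf (an assumption for this sketch): terms that occur in
        # every document still keep a small positive weight.
        vec = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Dot product of two L2-normalised sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def rank_by_similarity(vectors, query_index):
    """Indices of all other documents, most similar first."""
    scores = [(cosine(vectors[query_index], v), i)
              for i, v in enumerate(vectors) if i != query_index]
    return [i for _, i in sorted(scores, reverse=True)]
```

Calling rank_by_similarity(tfidf_vectors(texts), 0) on a list of extracted paper texts would return the other papers ordered by similarity to the first one.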

Dependencies

Several: you will need numpy, feedparser (to process xml files), scikit-learn (for the tfidf vectorizer and training of SVMs), flask (for serving the results), flask_limiter, and tornado (if you want to run the flask server in production). Also dateutil and scipy, and sqlite3 for the database (accounts, library support, etc.). Most of these are easy to get through pip, e.g.:

$ virtualenv env                # optional: use virtualenv
$ source env/bin/activate       # optional: use virtualenv
$ pip install -r requirements.txt

You will also need ImageMagick and pdftotext, which you can install on Ubuntu as sudo apt-get install imagemagick poppler-utils. Bleh, that's a lot of dependencies isn't it.
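
Since the pipeline shells out to these external programs, it can help to check that they are on your PATH before starting a long run. The following is a small sketch, not part of the repository: the helper name is hypothetical, and it assumes ImageMagick is exposed as the convert executable and poppler-utils as pdftotext.

```python
import shutil

# Assumed executable names: ImageMagick's `convert` and poppler-utils' `pdftotext`.
REQUIRED_TOOLS = ["convert", "pdftotext"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from missing_tools() suggests both programs are installed; anything else names what still needs installing.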

Processing pipeline

The processing pipeline requires you to run a series of scripts, and at this stage I really encourage you to manually inspect each script, as they may contain various inline settings you might want to change. In order, the processing pipeline is:

  1. Run fetch_papers.py to query arxiv API and create a file db.p that contains all information for each paper. This script is where you would modify the query, indicating which parts of arxiv you'd like to use. Note that if you're trying to pull too many papers arxiv will start to rate limit you. You may have to run the script multiple times, and I recommend using the arg --start-index to restart where you left off when you were last interrupted by arxiv.
  2. Run download_pdfs.py, which iterates over all papers in the parsed pickle and downloads them into the folder pdf.
  3. Run parse_pdf_to_text.py to export all text from the pdfs to files in txt.
  4. Run thumb_pdf.py to export thumbnails of all pdfs to thumb.
  5. Run analyze.py to compute tfidf vectors for all documents based on bigrams. Saves the pickle files tfidf.p, tfidf_meta.p and sim_dict.p.
  6. Run buildsvm.py to train SVMs for all users (if any) and export the pickle user_sim.p.
  7. Run make_cache.py for various preprocessing so that server starts faster (and make sure to run sqlite3 as.db < schema.sql if this is the very first time ever you're starting arxiv-sanity, which initializes an empty database).
  8. Start the mongodb daemon in the background. Mongodb can be installed by following the instructions here - https://docs.mongodb.com/tutorials/install-mongodb-on-ubuntu/.
     • Start the mongodb server with - sudo service mongod start.
     • Verify that the server is running in the background: the last line of the /var/log/mongodb/mongod.log file must be - [initandlisten] waiting for connections on port
  9. Run the flask server with serve.py. Visit localhost:5000 and enjoy sane viewing of papers!
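
For step 1, the category query and the restart offset can be pictured with the sketch below. It only builds request URLs for the public arXiv API (search_query, sortBy, start and max_results are that API's parameters) and makes no network calls. The function names and batch size are hypothetical, and fetch_papers.py may construct its query differently.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(categories, start=0, max_results=100):
    """Build an arXiv API URL for papers in any of the given categories."""
    search_query = " OR ".join(f"cat:{c}" for c in categories)
    params = {
        "search_query": search_query,
        "sortBy": "lastUpdatedDate",
        "start": start,            # offset into the result list
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def batch_starts(start_index, total, per_batch=100):
    """Offsets to request, beginning at start_index (e.g. after an interruption)."""
    return list(range(start_index, total, per_batch))
```

If a run was interrupted after 200 results, batch_starts(200, 500) gives the remaining offsets, which is the same idea as restarting with --start-index.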

Optionally you can also run the twitter_daemon.py in a screen session, which uses your Twitter API credentials (stored in twitter.txt) to query Twitter periodically looking for mentions of papers in the database, and writes the results to the pickle file twitter.p.

I have a simple shell script that runs these commands one by one, and every day I run this script to fetch new papers, incorporate them into the database, and recompute all tfidf vectors/classifiers. More details on this process below.

protip: numpy/BLAS: The script analyze.py does quite a lot of heavy lifting with numpy. I recommend that you carefully set up your numpy to use BLAS (e.g. OpenBLAS), otherwise the computations will take a long time. With ~25,000 papers and ~5000 users the script runs in several hours on my current machine with a BLAS-linked numpy.

Running online

If you'd like to run the flask server online (e.g. on AWS), run it as python serve.py --prod.

You also want to create a secret_key.txt file and fill it with random text (see top of serve.py).
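
One way to produce that random text is Python's secrets module. This is a sketch under the assumption that serve.py simply reads the file's contents as the key; the helper name and key length are hypothetical.

```python
import secrets
from pathlib import Path


def write_secret_key(path="secret_key.txt", n_bytes=32):
    """Write a random hex string to `path` and return it."""
    key = secrets.token_hex(n_bytes)  # 2 hex characters per byte
    Path(path).write_text(key)
    return key
```

Run write_secret_key() once in the repository directory and keep the resulting file out of version control.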

Current workflow

Running the site live is not currently set up for a fully automatic plug and play operation. Instead it's a bit of a manual process and I thought I should document how I'm keeping this code alive right now. I have a script that performs the following update early morning after arxiv papers come out (~midnight PST):

python fetch_papers.py
python download_pdfs.py
python parse_pdf_to_text.py
python thumb_pdf.py
python analyze.py
python buildsvm.py
python make_cache.py
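
The same sequence could also be driven from Python. The sketch below is not the script I actually use: the function name is hypothetical, and stopping at the first failing step is a design choice of this example, made so that a failed download does not silently feed stale data into later steps.

```python
import subprocess
import sys

# The daily update steps listed above, in order.
PIPELINE = [
    "fetch_papers.py",
    "download_pdfs.py",
    "parse_pdf_to_text.py",
    "thumb_pdf.py",
    "analyze.py",
    "buildsvm.py",
    "make_cache.py",
]


def run_pipeline(scripts=PIPELINE, runner=subprocess.run):
    """Run each script with the current interpreter; stop at the first failure.

    Returns the list of scripts that completed successfully.
    """
    completed = []
    for script in scripts:
        result = runner([sys.executable, script])
        if result.returncode != 0:
            print(f"{script} failed with exit code {result.returncode}; stopping")
            break
        completed.append(script)
    return completed
```

Calling run_pipeline() from the repository directory runs the seven steps; the runner argument exists only so the function can be exercised without launching real processes.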

I run the server in a screen session, so screen -S serve to create it (or -r to reattach to it) and run:

python serve.py --prod --port 80

The server will load the new files and begin hosting the site. Note that on some systems you can't use port 80 without sudo. Your two options are to use iptables to reroute ports, or to use setcap to elevate the permissions of the python interpreter that runs serve.py. In this case I'd recommend careful permissions and maybe virtualenv, etc.

Comments
  • Legal?

    The page at https://arxiv.org/help/robots says: "Robots Beware: Indiscriminate automated downloads from this site are not permitted." This means your code is doing what arxiv explicitly forbids.

    opened by minzastro 8
  • Hosting "fork" for physics categories

    I've started indexing some of the Physics categories. My plan is to cover all of them, but I've started with physics.* and astro-ph.* for now.

    The site is currently hosted at http://physics.arxiv-sanity.nolife.de/

    I'd be willing to host it long term, if you want to focus on the already covered categories. Alternatively I could forward the PDFs, thumbnails and extracted texts to you, if you want to incorporate them in your site. What is your plan at the moment?

    How do you want to handle domain names for forks? As a sub domain, or should I register a different one?

    opened by Moredread 7
  • Recommended papers not showing

    Hi, the recommendation section of my profile is stuck at Oct. 4th, and I don't see any paper more recent than that. I have added several papers to the library after Oct. 4th, and I'm sure that new papers came out on Arxiv that should have been caught by the recommender. Is there any way I can troubleshoot this?

    Thanks, Daniele

    opened by danielegrattarola 6
  • Any plans for adding feature of voting (e.g. like/dislike) for papers?

    Again, would be nice to be able to rank/sort papers in a distributed fashion, e.g. among members of a research group. The more likes a paper collects the more likely it is to be discussed in the next reading group or the like.

    opened by andreas-bulling 6
  • Feature request: allow export of bibtex file for citations

    Hi, thanks for this great work. I was wondering if it would be possible to let users export the saved papers in their libraries as a bibtex file for later reference. This should be fairly easy to achieve if we use https://arxiv2bibtex.org. Basically, for each user, loop over the papers in their library, send their ids one by one to arxiv2bibtex.org, read the text in the frame or the textbox, and save it in a file. We could even use http://ams.org/mathscinet/ to find the correct citation information.

    opened by kirk86 4
  • Make codebase python3 compatible;

    Currently the arxiv-sanity-preserver is not python3 compatible. This pull request does the following:

    (1) reworks imports for urllib and pickle; (2) convert to print_function (from __future__ import print_function); (3) switch from iteritems() and iterkeys() in favor of items() and keys(); (4) starting to wrap functionality of various scripts into functions to remove dependencies on magic value (currently reworked download_pdfs.py - to allow for specifying the timeout, database file and output directory);

    With these changes I believe this should allow python3 to be used to run the code (note: pickling will have to be reworked - if you want to use python3 after running in python2).

    opened by kingtaurus 4
  • Would it be useful to have an extension to this project where you can see the ancestors and predecessors of a research paper?

    For instance, I'm reading this paper and I see it referred to ideas from previously published papers. I want to put this paper as a child of those research papers and maintain a tree so that I can keep track of the ideas from the paper in a systematic manner to aid my research. In other words, I want to visualize the path of knowledge that flows from one research paper to another.

    opened by syed-ahmed 4
  • Flip color scheme of tabs

    I'm curious if anyone else feels like the color scheme of the tabs is a bit unintuitive. It seems logical for the selected tab to appear "attached" to the rest of the page, while the unselected ones are shadowed.

    Current: screen shot 2018-10-02 at 4 57 56 pm

    Proposed: screen shot 2018-10-02 at 4 58 44 pm

    enhancement 
    opened by chrisfischer 3
  • Can not start webserver. No such table in database.

    Hi, sorry for my poor bug report. I'm new to github and such. I'm trying to use your program with two topics in the astrophysics domain. Everything processed fine until the webserver tried to read some tables.

    ~/arxiv-sanity-preserver ❯❯❯ ./venv/bin/python serve.py --prod
    /$HOME/arxiv-sanity-preserver/venv/lib/python2.7/site-packages/flask_limiter/extension.py:124: UserWarning: Use of the default get_ipaddr function is discouraged. Please refer to https://flask-limiter.readthedocs.org/#rate-limit-domain for the recommended configuration
      UserWarning
    Namespace(num_results=200, port=5000, prod=True)
    loading db.p...
    loading tfidf_meta.p...
    loading sim_dict.p...
    loading user_sim.p...
    precomputing papers date sorted...
    computing top papers...
    Traceback (most recent call last):
      File "serve.py", line 415, in <module>
        top_counts = get_popular()
      File "serve.py", line 409, in get_popular
        libs = sqldb.execute('''select * from library''').fetchall()
    sqlite3.OperationalError: no such table: library

    opened by plus13 3
  • Add COinS support for bibliography managers (Zotero, Mendeley)

    Popular bibliography managers support automatically extracting reference information from webpages that are properly annotated (with machine-readable info).

    This PR adds the requisite machine-readable info for arXiv-sanity paper lists, so that Zotero, etc. users can import papers directly from the arXiv-sanity paper list pages.

    The imported bibliography entries exactly match what you'd get when importing from the arXiv pages individually (minus attachments, unfortunately..)

    opened by hans 3
  • removed temporary thumbnail iteration

    The temporary thumbnail solution felt a bit too complicated, so I ran some tests.

    convert path/to/file.pdf -thumbnail x156 -quality 80 +append path/to/thumbnail.png does the very same job, and the code is easier to read.

    opened by edoput 3
  • Missing recent top papers; Is redirect to Arxiv-Sanity Lite intentional?

    First of all, Arxiv Sanity is :fire: . Awesome work.

    However, the old arxiv sanity preserver had an option to see the top recent papers.

    It was by far the most useful feature of arxiv sanity, but the lite version does not have it.

    I just want to ask: was removal of this feature intentional?

    opened by Isinlor 0
  • Arxiv sanity preserver down permanently?

    Hi,

    I noticed that recently arxiv-sanity.com now redirects to arxiv-sanity-lite.com. Does this mean arxiv-sanity is down permanently? If so, can we have an option to download our data? I had a lot of interesting papers liked that I hadn't had the chance to download before the site was shut down.

    opened by kzhang2 5
  • Introducing Skim - A Platform that helps you to skim through papers in this fast moving research world

    Skim is an alternative to arxiv-sanity with a lot of features.

    We built this inspired by arxiv-sanity and wanted to engineer a platform which is scalable and can evolve with time.

    Introductory Video: https://youtu.be/i6cpBQezPSA

    Features:

    • Charts: Equivalent to arxiv sanity's top hype section
    • Racks: Similar to Spotify, create a rack (playlist) of papers you like/want to read/work on. You can make a rack private/public.
    • Conferences: More than 50 conferences with deadlines, racks of their proceedings, acceptance rate statistics
    • Paper: Paper implementation links powered by Papers With Code

    We are currently in BETA and opening up. We are looking for people who can provide us with feedback. Request Invite: https://forms.gle/insGeS3Q8Z3XNES16

    cc: @karpathy

    opened by prabhuomkar 4
Owner
Andrej
I like to train Deep Neural Nets on large datasets.