Universal Online Judge Spider
Introduction
This is a spider for the Universal Online Judge (UOJ) system (https://uoj.ac/).
It also works for all other Online Judges using the UOJ system.
This spider is written in Python 3 and uses the Selenium WebDriver library for Python together with ChromeDriver.
It has only been tested on Ubuntu 20.04, so the commands in the following sections may not work on other systems.
Features
- Automatic login, no need to obtain cookies manually.
- Converts pages into PDFs with selectable text rather than simple screenshots.
- Automatically waits for MathJax to finish loading so that the mathematical formulas in the results are displayed correctly.
- Automatically skips pages that already exist (if the corresponding PDF file already exists locally).
- Support for proxy.
- Support for all websites using the UOJ system.
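This README does not say how the PDFs are produced. One possible approach, shown here only as a sketch, is to ask Chrome to print the page through the DevTools Protocol command Page.printToPDF, which keeps the text as text. The function name save_page_as_pdf and its parameters are hypothetical, and driver is assumed to be a Selenium Chrome/Chromium WebDriver, since execute_cdp_cmd is only available on those drivers.

```python
import base64
from pathlib import Path


def save_page_as_pdf(driver, pdf_path):
    """Print the page currently loaded in `driver` to `pdf_path`.

    Assumption: `driver` is a Selenium Chrome/Chromium WebDriver, which
    exposes `execute_cdp_cmd` for Chrome DevTools Protocol commands.
    """
    # Page.printToPDF returns the PDF as a base64 string under the "data" key.
    result = driver.execute_cdp_cmd("Page.printToPDF", {"printBackground": True})
    pdf_path = Path(pdf_path)
    # Create the target folder if it does not exist yet.
    pdf_path.parent.mkdir(parents=True, exist_ok=True)
    pdf_path.write_bytes(base64.b64decode(result["data"]))
    return pdf_path
```

Depending on the Chrome version, printing to PDF may only work when the browser runs in headless mode.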
Installation
1. Install python3 and ChromeDriver:
apt install python3 python3-pip chromium-browser chromium-chromedriver
2. Install the Selenium library for Python 3:
pip3 install selenium
3. Download this program
Usage
First, set the following variables:
# [Basic settings]
url = ""
username = ""
password = ""
start_number = 1
end_number = 100
save_dir = "downloads"
# [Advanced settings]
proxy = ""
page_404_title = "404 - "
max_login_time = 60
max_mathjax_start_time = 60
max_mathjax_load_time = 60
Basic settings
- url: the index URL of your target, e.g. https://uoj.ac/. Please note that the value must end with a slash (/).
- username: your username.
- password: your password.
- start_number: the number of the first problem to crawl (minimum).
- end_number: the number of the last problem to crawl (maximum).
- save_dir: the name of the folder where the results will be stored.
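To illustrate how the basic settings could fit together, here is a minimal sketch that turns the problem range into page URLs and skips problems whose PDF already exists. It is not the program's actual code. The helper names are hypothetical, the `problem/<number>` path follows the URL layout used on uoj.ac, and the `<number>.pdf` file naming is an assumption.

```python
from pathlib import Path


def problem_url(url, number):
    """Build the page URL of one problem from the index URL."""
    if not url.endswith("/"):
        raise ValueError("url must end in a slash, e.g. https://uoj.ac/")
    # Assumption: problems live under <index URL>problem/<number>.
    return f"{url}problem/{number}"


def pending_problems(url, start_number, end_number, save_dir):
    """Yield (number, page_url, pdf_path) for problems without a local PDF."""
    # end_number is inclusive, matching the "last problem" wording above.
    for number in range(start_number, end_number + 1):
        pdf_path = Path(save_dir) / f"{number}.pdf"  # assumed file naming
        if pdf_path.exists():
            continue  # already downloaded, skip it
        yield number, problem_url(url, number), pdf_path
```

For example, `list(pending_problems("https://uoj.ac/", 1, 100, "downloads"))` would list only the problems in that range that still need to be downloaded.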
Advanced settings
If you don't know what the advanced settings are for, it is best to leave them unchanged.
- proxy: the address of your proxy server, e.g. HTTP://127.0.0.1:1080 or SOCKS5://127.0.0.1:1081. Leave it blank (empty string) if you do not need to use a proxy.
- page_404_title: the title of the OJ's 404 page. You may use a substring of the title, like "404 - ". If the program gets a page title that contains this string, the download of that page will be skipped.
- max_login_time: the maximum wait time for a login attempt, in seconds.
- max_mathjax_start_time: the maximum wait time for a MathJax loading message to appear, in seconds.
- max_mathjax_load_time: the maximum wait time for a MathJax loading message to disappear (i.e. MathJax rendering is finished), in seconds.
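The following sketch shows one way the advanced settings might be applied. It is an illustration under stated assumptions, not the program's actual implementation. All function names are hypothetical. `message_visible` stands for any callable that reports whether the MathJax loading message is currently shown, and treating "the message never appeared" as "nothing to wait for" is an assumption.

```python
import time


def chrome_proxy_argument(proxy):
    """Return Chrome's --proxy-server switch for `proxy`, or None if blank."""
    return f"--proxy-server={proxy}" if proxy else None


def is_404_page(title, page_404_title):
    """A page counts as a 404 page if its title contains the configured string."""
    return page_404_title in title


def wait_until(condition, timeout, poll_interval=0.5):
    """Poll `condition` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


def wait_for_mathjax(message_visible, max_start_time, max_load_time,
                     poll_interval=0.5):
    """Wait for the loading message to appear, then to disappear.

    Returns False only if the message appeared but did not go away in time.
    """
    if not wait_until(message_visible, max_start_time, poll_interval):
        # Assumption: no message within the start timeout means no MathJax.
        return True
    return wait_until(lambda: not message_visible(), max_load_time, poll_interval)
```

With Selenium, the value returned by `chrome_proxy_argument` could be passed to `ChromeOptions.add_argument` when it is not None, and `message_visible` could wrap a lookup of the loading message element.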
After completing the setup, run:
python3 main.py
Sample result
License
MIT License.