🗃 Open-source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Overview

ArchiveBox
Open-source self-hosted web archiving.

โ–ถ๏ธ Quickstart | Demo | Github | Documentation | Info & Motivation | Community | Roadmap

"Your own personal internet archive" (website archiving / crawler)




ArchiveBox is a powerful self-hosted internet archiving solution written in Python. You feed it URLs of pages you want to archive, and it saves them to disk in a variety of formats depending on setup and content within.

🔢   Run ArchiveBox via Docker Compose (recommended), Docker, Apt, Brew, or Pip (see below).

apt/brew/pip3 install archivebox

archivebox init                       # run this in an empty folder
archivebox add 'https://example.com'  # start adding URLs to archive
curl https://example.com/rss.xml | archivebox add  # or add via stdin
archivebox schedule --every=day https://example.com/rss.xml

For each URL added, ArchiveBox saves several types of HTML snapshot (wget, Chrome headless, singlefile), a PDF, a screenshot, a WARC archive, any git repositories, images, audio, video, subtitles, article text, and more....

archivebox server --createsuperuser 0.0.0.0:8000   # use the interactive web UI
archivebox list 'https://example.com'  # use the CLI commands (--help for more)
ls ./archive/*/index.json              # or browse directly via the filesystem

You can then manage your snapshots via the filesystem, CLI, Web UI, SQLite DB (./index.sqlite3), Python API (alpha), REST API (alpha), or desktop app (alpha).
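For example, the SQLite index can be read directly with Python's stdlib. This is a minimal sketch, assuming the main index table is named core_snapshot with url/title/timestamp columns (an assumption based on ArchiveBox's Django models; verify against your own DB with `.schema` in the sqlite3 shell first):

```python
import sqlite3

def list_snapshots(db_path="./index.sqlite3"):
    """Return (url, title, timestamp) rows from an ArchiveBox index database.

    NOTE: the table/column names below are assumptions and may differ
    between ArchiveBox versions -- inspect your own DB first.
    """
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT url, title, timestamp FROM core_snapshot "
            "ORDER BY timestamp DESC"
        ).fetchall()
    finally:
        conn.close()
```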

At the end of the day, the goal is to sleep soundly knowing that the part of the internet you care about will be automatically preserved in multiple, durable long-term formats that will be accessible for decades (or longer).




โšก๏ธ   CLI Usage

# archivebox [subcommand] [--args]
archivebox --version
archivebox help
  • archivebox init/version/status/config/manage to administer your collection
  • archivebox add/remove/update/list to manage Snapshots in the archive
  • archivebox schedule to regularly pull in fresh URLs from bookmarks/history/Pocket/Pinboard/RSS/etc.
  • archivebox oneshot to archive single URLs without starting a whole collection
  • archivebox shell/manage dbshell to open a REPL for the Python API (alpha) or SQL API

Demo | Screenshots | Usage



Quickstart

🖥   Supported OSs: Linux/BSD, macOS, Windows     🎮   CPU Architectures: x86, amd64, arm7, arm8 (raspi >=3)     📦   Distributions: docker/apt/brew/pip3/npm (in order of completeness)

(click to expand your preferred ► distribution below for full setup instructions)

Get ArchiveBox with docker-compose on any platform (recommended, everything included out-of-the-box)

First make sure you have Docker installed: https://docs.docker.com/get-docker/

# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
curl -O 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/docker-compose.yml'
docker-compose run archivebox init
docker-compose run archivebox --version

# start the webserver and open the UI (optional)
docker-compose run archivebox manage createsuperuser
docker-compose up -d
open 'http://127.0.0.1:8000'

# you can also add links and manage your archive via the CLI:
docker-compose run archivebox add 'https://example.com'
echo 'https://example.com' | docker-compose run -T archivebox add
docker-compose run archivebox status
docker-compose run archivebox help  # to see more options

# when passing stdin/stdout via the cli, use the -T flag
echo 'https://example.com' | docker-compose run -T archivebox add
docker-compose run -T archivebox list --html --with-headers > index.html

This is the recommended way to run ArchiveBox because it includes all the extractors like:
chrome, wget, youtube-dl, git, etc., full-text search w/ sonic, and many other great features.

Get ArchiveBox with docker on any platform

First make sure you have Docker installed: https://docs.docker.com/get-docker/

# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
docker run -v $PWD:/data -it archivebox/archivebox init
docker run -v $PWD:/data -it archivebox/archivebox --version

# start the webserver and open the UI (optional)
docker run -v $PWD:/data -it -p 8000:8000 archivebox/archivebox server --createsuperuser 0.0.0.0:8000
open http://127.0.0.1:8000

# you can also add links and manage your archive via the CLI:
docker run -v $PWD:/data -it archivebox/archivebox add 'https://example.com'
docker run -v $PWD:/data -it archivebox/archivebox status
docker run -v $PWD:/data -it archivebox/archivebox help  # to see more options

# when passing stdin/stdout via the cli, use only -i (not -it)
echo 'https://example.com' | docker run -v $PWD:/data -i archivebox/archivebox add
docker run -v $PWD:/data -i archivebox/archivebox list --html --with-headers > index.html
Get ArchiveBox with apt on Ubuntu/Debian

This method should work on all Ubuntu/Debian based systems, including x86, amd64, arm7, and arm8 CPUs (e.g. Raspberry Pis >=3).

If you're on Ubuntu >= 20.04, add the apt repository with add-apt-repository:

(on other Ubuntu/Debian-based systems follow the ♰ instructions below)

# add the repo to your sources and install the archivebox package using apt
sudo apt install software-properties-common
sudo add-apt-repository -u ppa:archivebox/archivebox
sudo apt install archivebox
# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
npm install --prefix . 'git+https://github.com/ArchiveBox/ArchiveBox.git'
archivebox init
archivebox --version

# start the webserver and open the web UI (optional)
archivebox server --createsuperuser 0.0.0.0:8000
open http://127.0.0.1:8000

# you can also add URLs and manage the archive via the CLI and filesystem:
archivebox add 'https://example.com'
archivebox status
archivebox list --html --with-headers > index.html
archivebox list --json --with-headers > index.json
archivebox help  # to see more options

♰ On other Ubuntu/Debian-based systems, add these sources directly to /etc/apt/sources.list.d/:

echo "deb http://ppa.launchpad.net/archivebox/archivebox/ubuntu focal main" | sudo tee /etc/apt/sources.list.d/archivebox.list
echo "deb-src http://ppa.launchpad.net/archivebox/archivebox/ubuntu focal main" | sudo tee -a /etc/apt/sources.list.d/archivebox.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C258F79DCC02E369
sudo apt update
sudo apt install archivebox
sudo snap install chromium
archivebox --version
# then scroll back up and continue the initialization instructions above

(you may need to install some other dependencies manually however)

Get ArchiveBox with brew on macOS

First make sure you have Homebrew installed: https://brew.sh/#install

# install the archivebox package using homebrew
brew install archivebox/archivebox/archivebox

# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
npm install --prefix . 'git+https://github.com/ArchiveBox/ArchiveBox.git'
archivebox init
archivebox --version

# start the webserver and open the web UI (optional)
archivebox server --createsuperuser 0.0.0.0:8000
open http://127.0.0.1:8000

# you can also add URLs and manage the archive via the CLI and filesystem:
archivebox add 'https://example.com'
archivebox status
archivebox list --html --with-headers > index.html
archivebox list --json --with-headers > index.json
archivebox help  # to see more options
Get ArchiveBox with pip on any platform

First make sure you have Python >= 3.7 installed: https://realpython.com/installing-python/

# install the archivebox package using pip3
pip3 install archivebox

# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
npm install --prefix . 'git+https://github.com/ArchiveBox/ArchiveBox.git'
archivebox init
archivebox --version
# Install any missing extras like wget/git/chrome/etc. manually as needed

# start the webserver and open the web UI (optional)
archivebox server --createsuperuser 0.0.0.0:8000
open http://127.0.0.1:8000

# you can also add URLs and manage the archive via the CLI and filesystem:
archivebox add 'https://example.com'
archivebox status
archivebox list --html --with-headers > index.html
archivebox list --json --with-headers > index.json
archivebox help  # to see more options

No matter which install method you choose, they all roughly follow this 3-step process and all provide the same CLI, Web UI, and on-disk data format.

  1. Install ArchiveBox: apt/brew/pip3 install archivebox
  2. Start a collection: archivebox init
  3. Start archiving: archivebox add 'https://example.com'




DEMO: https://archivebox.zervice.io
Quickstart | Usage | Configuration

Key Features




Input formats

ArchiveBox supports many input formats for URLs, including Pocket & Pinboard exports, Browser bookmarks, Browser history, plain text, HTML, markdown, and more!

echo 'http://example.com' | archivebox add
archivebox add 'https://example.com/some/page'
archivebox add < ~/Downloads/firefox_bookmarks_export.html
archivebox add < any_text_with_urls_in_it.txt
archivebox add --depth=1 'https://example.com/some/downloads.html'
archivebox add --depth=1 'https://news.ycombinator.com#2020-12-12'

# (if using docker add -i when passing via stdin)
echo 'https://example.com' | docker run -v $PWD:/data -i archivebox/archivebox add

# (if using docker-compose add -T when passing via stdin)
echo 'https://example.com' | docker-compose run -T archivebox add

See the Usage: CLI page for documentation and examples.

It also includes a built-in scheduled import feature (archivebox schedule) and a browser bookmarklet, so you can pull in URLs from RSS feeds, websites, or the filesystem regularly or on demand.
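To illustrate how the plain-text input format works conceptually, here is a rough sketch of pulling URLs out of arbitrary text (ArchiveBox's real parsers are far more thorough; this is not its actual implementation):

```python
import re

def extract_urls(text):
    """Return de-duplicated URLs found anywhere in a blob of text.

    Naive sketch: grab http(s) runs and trim trailing punctuation.
    """
    urls = []
    for match in re.findall(r"https?://\S+", text):
        url = match.rstrip(".,;)]>\"'")
        if url and url not in urls:
            urls.append(url)
    return urls
```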

Output formats

All of ArchiveBox's state (including the index, snapshot data, and config file) is stored in a single folder called the "ArchiveBox data folder". All archivebox CLI commands must be run from inside this folder, and you first create it by running archivebox init.

The on-disk layout is optimized to be easy to browse by hand and durable long-term. The main index is a standard sqlite3 database (it can also be exported as static JSON/HTML), and the archive snapshots are organized by date-added timestamp in the archive/ subfolder. Each snapshot subfolder includes a static JSON and HTML index describing its contents, and the snapshot extractor outputs are plain files within the folder (e.g. media/example.mp4, git/somerepo.git, static/someimage.png, etc.).

# to browse your index statically without running the archivebox server, run:
archivebox list --html --with-headers > index.html
archivebox list --json --with-headers > index.json
# if running these commands with docker-compose, add -T:
# docker-compose run -T archivebox list ...

# then open the static index in a browser
open index.html

# or browse the snapshots via filesystem directly
ls ./archive/<timestamp>/
  • Index: index.html & index.json HTML and JSON index files containing metadata and details
  • Title, Favicon, Headers Response headers, site favicon, and parsed site title
  • Wget Clone: example.com/page-name.html wget clone of the site with warc/<timestamp>.gz
  • Chrome Headless
    • SingleFile: singlefile.html HTML snapshot rendered with headless Chrome using SingleFile
    • PDF: output.pdf Printed PDF of site using headless chrome
    • Screenshot: screenshot.png 1440x900 screenshot of site using headless chrome
    • DOM Dump: output.html DOM Dump of the HTML after rendering using headless chrome
    • Readability: article.html/json Article text extraction using Readability
  • Archive.org Permalink: archive.org.txt A link to the saved site on archive.org
  • Audio & Video: media/ all audio/video files + playlists, including subtitles & metadata with youtube-dl
  • Source Code: git/ clone of any repository found on github, bitbucket, or gitlab links
  • More coming soon! See the Roadmap...
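Because every snapshot folder carries its own static index.json, the whole archive can be inspected with a few lines of stdlib Python (a sketch assuming the ./archive/&lt;timestamp&gt;/ layout described above):

```python
import json
from pathlib import Path

def iter_snapshot_indexes(archive_dir="./archive"):
    """Yield (timestamp, parsed index.json) for each ./archive/<timestamp>/ folder."""
    for folder in sorted(Path(archive_dir).iterdir()):
        index_file = folder / "index.json"
        if folder.is_dir() and index_file.exists():
            yield folder.name, json.loads(index_file.read_text())
```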

It does everything out-of-the-box by default, but you can disable or tweak individual archive methods via environment variables or config file.

archivebox config --set SAVE_ARCHIVE_DOT_ORG=False
archivebox config --set YOUTUBEDL_ARGS='--max-filesize=500m'
archivebox config --help



Dependencies

You don't need to install every dependency; ArchiveBox automatically enables the relevant modules based on whatever you have available. That said, it's recommended to use the official Docker image, which comes with everything preinstalled.

If you so choose, you can also install ArchiveBox and its dependencies directly on any Linux or macOS systems using the system package manager or by running the automated setup script.

ArchiveBox is written in Python 3, so it requires python3 and pip3 to be available on your system. It also uses a set of optional but highly recommended external dependencies for archiving sites: wget (for plain HTML, static files, and WARC saving), chromium (for screenshots, PDFs, JS execution, and more), youtube-dl (for audio and video), git (for cloning git repos), nodejs (for readability and singlefile), and more.
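The autodetection amounts to checking which of these binaries are on your $PATH, roughly like this (the binary names listed here are illustrative; ArchiveBox's real checks also probe versions and config overrides):

```python
import shutil

# illustrative subset of the optional binaries ArchiveBox can use
OPTIONAL_DEPS = ["wget", "curl", "git", "node", "youtube-dl", "chromium"]

def detect_deps(binaries=OPTIONAL_DEPS):
    """Map each binary name to its resolved path, or None if not on $PATH."""
    return {name: shutil.which(name) for name in binaries}
```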




Caveats

If you're importing URLs containing secret slugs or pages with private content (e.g Google Docs, CodiMD notepads, etc), you may want to disable some of the extractor modules to avoid leaking private URLs to 3rd party APIs during the archiving process.

# don't do this:
archivebox add 'https://docs.google.com/document/d/12345somelongsecrethere'
archivebox add 'https://example.com/any/url/you/want/to/keep/secret/'

# ...without first disabling the extractors that share the URL with 3rd-party APIs:
archivebox config --set SAVE_ARCHIVE_DOT_ORG=False   # disable saving all URLs in Archive.org
archivebox config --set SAVE_FAVICON=False      # optional: only the domain is leaked, not full URL
archivebox config --set CHROME_BINARY=chromium  # optional: switch to chromium to avoid Chrome phoning home to Google

Be aware that malicious JS in an archived page can also read the contents of other pages in your archive, because the snapshots' CSRF and XSS protections are imperfect. See the Security Overview page for more details.

# visiting an archived page with malicious JS:
https://127.0.0.1:8000/archive/1602401954/example.com/index.html

# example.com/index.js can now make a request to read everything:
https://127.0.0.1:8000/index.html
https://127.0.0.1:8000/archive/*
# then example.com/index.js can send it off to some evil server

Support for saving multiple snapshots of each site over time will be added soon (along with the ability to view diffs of the changes between runs). For now, ArchiveBox is designed to archive each URL with each extractor type only once. A workaround to take multiple snapshots of the same URL is to make each one unique by appending a hash fragment:

archivebox add 'https://example.com#2020-10-24'
...
archivebox add 'https://example.com#2020-10-25'
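The same workaround can be scripted for a batch of URLs. This is a sketch; dated_variant is a hypothetical helper for illustration, not part of ArchiveBox:

```python
from datetime import date

def dated_variant(url, on=None):
    """Append a date fragment so ArchiveBox treats the URL as distinct.

    The #YYYY-MM-DD fragment is ignored by servers when fetching the page,
    but makes the URL unique in the index, sidestepping the
    one-snapshot-per-URL behavior described above.
    """
    on = on or date.today()
    return f"{url}#{on.isoformat()}"
```

You could then pipe the output of this helper into archivebox add on a schedule to accumulate dated snapshots.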



Screenshots

(screenshots: brew install archivebox, archivebox version, archivebox init, archivebox add, the data directory, and the server UI's add, list, and detail pages)




Background & Motivation

Vast treasure troves of knowledge are lost every day on the internet to link rot. As a society, we have an imperative to preserve some important parts of that treasure, just like we preserve our books, paintings, and music in physical libraries long after the originals go out of print or fade into obscurity.

Whether it's to resist censorship by saving articles before they get taken down or edited, or just to save a collection of early 2010s Flash games you love to play, having the tools to archive internet content enables you to save the stuff you care most about before it disappears.


Image from WTF is Link Rot?...

The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don't think everything should be preserved in an automated fashion, making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about.

Because modern websites are complicated and often rely on dynamic content, ArchiveBox archives the sites in several different formats beyond what public archiving services like Archive.org and Archive.is are capable of saving. Using multiple methods and the market-dominant browser to execute JS ensures we can save even the most complex, finicky websites in at least a few high-quality, long-term data formats.

All the archived links are stored by date bookmarked in ./archive/<timestamp>, and everything is indexed nicely with JSON & HTML files. The intent is for all the content to be viewable with common software in 50 - 100 years without needing to run ArchiveBox in a VM.

Comparison to Other Projects

โ–ถ Check out our community page for an index of web archiving initiatives and projects.

The aim of ArchiveBox is to go beyond what the Wayback Machine and other public archiving services can do, by adding a headless browser to replay sessions accurately, and by automatically extracting all the content in multiple redundant formats that will survive being passed down to historians and archivists through many generations.

User Interface & Intended Purpose

ArchiveBox differentiates itself from similar projects by being a simple, one-shot CLI interface for users to ingest bulk feeds of URLs over extended periods, as opposed to being a backend service that ingests individual, manually-submitted URLs from a web UI. That said, you can also add URLs via the web interface provided by the Django frontend.

Private Local Archives vs Centralized Public Archives

Unlike crawler software that starts from a seed URL and works outwards, or public tools like Archive.org designed for users to manually submit links from the public internet, ArchiveBox tries to be a set-and-forget archiver suitable for archiving your entire browsing history, RSS feeds, or bookmarks, including private/authenticated content that you wouldn't otherwise share with a centralized service (do not do this until v0.5 is released with some security fixes). Also, by having each user store their own content locally, we can save much larger portions of everyone's browsing history than a shared centralized service would be able to handle.

Storage Requirements

Because ArchiveBox is designed to ingest a firehose of browser history and bookmark feeds to a local disk, it can be much more disk-space intensive than a centralized service like the Internet Archive or Archive.today. However, as storage space gets cheaper and compression improves, you should be able to use it continuously over the years without having to delete anything. In my experience, ArchiveBox uses about 5GB per 1000 articles, but your mileage may vary depending on which options you have enabled and what types of sites you're archiving. By default, it archives everything in as many formats as possible, meaning it takes more space than using a single method, but more content is accurately replayable over extended periods of time. Storage requirements can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by setting SAVE_MEDIA=False to skip audio & video files.
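As a back-of-the-envelope calculation using the ~5GB-per-1000-articles figure above (the media fraction below is an illustrative guess, not a measured number):

```python
def estimate_storage_gb(num_snapshots, gb_per_1000=5.0, save_media=True,
                        media_fraction=0.75):
    """Rough storage estimate for an ArchiveBox collection.

    gb_per_1000 comes from the ~5GB/1000-articles figure in the text;
    media_fraction (share of space taken by audio/video) is an
    illustrative assumption, not a measured value.
    """
    total = num_snapshots / 1000 * gb_per_1000
    if not save_media:
        total *= (1 - media_fraction)
    return total
```

So archiving ~10,000 pages lands around 50GB with media enabled, far less with SAVE_MEDIA=False.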



Learn more

Whether you want to learn which organizations are the big players in the web archiving space, want to find a specific open-source tool for your web archiving need, or just want to see where archivists hang out online, our Community Wiki page serves as an index of the broader web archiving community. Check it out to learn about some of the coolest web archiving projects and communities on the web!




Documentation

We use the Github wiki system and Read the Docs (WIP) for documentation.

You can also access the docs locally by looking in the ArchiveBox/docs/ folder.

Getting Started

Reference

More Info




ArchiveBox Development

All contributions to ArchiveBox are welcome! Check our issues and Roadmap for things to work on, and please open an issue to discuss your proposed implementation before starting work. Otherwise we may have to close your PR if it doesn't align with our roadmap.

Low hanging fruit / easy first tickets:

Setup the dev environment

1. Clone the main code repo (making sure to pull the submodules as well)

git clone --recurse-submodules https://github.com/ArchiveBox/ArchiveBox
cd ArchiveBox
git checkout dev  # or the branch you want to test
git submodule update --init --recursive
git pull --recurse-submodules

2. Option A: Install the Python, JS, and system dependencies directly on your machine

# Install ArchiveBox + python dependencies
python3 -m venv .venv && source .venv/bin/activate && pip install -e '.[dev]'
# or: pipenv install --dev && pipenv shell

# Install node dependencies
npm install

# Check to see if anything is missing
archivebox --version
# install any missing dependencies manually, or use the helper script:
./bin/setup.sh

2. Option B: Build the docker container and use that for development instead

# Optional: develop via docker by mounting the code dir into the container
# if you edit e.g. ./archivebox/core/models.py on the docker host, runserver
# inside the container will reload and pick up your changes
docker build . -t archivebox
docker run -it --rm archivebox version
docker run -it --rm -p 8000:8000 \
    -v $PWD/data:/data \
    -v $PWD/archivebox:/app/archivebox \
    archivebox server 0.0.0.0:8000 --debug --reload

Common development tasks

See the ./bin/ folder and read the source of the bash scripts within. You can also run all these in Docker. For more examples see the Github Actions CI/CD tests that are run: .github/workflows/*.yaml.

Run in DEBUG mode

archivebox config --set DEBUG=True
# or
archivebox server --debug ...

Build and run a Github branch

docker build -t archivebox:dev https://github.com/ArchiveBox/ArchiveBox.git#dev
docker run -it -v $PWD:/data archivebox:dev ...

Run the linters

./bin/lint.sh

(uses flake8 and mypy)

Run the integration tests

./bin/test.sh

(uses pytest -s)

Make migrations or enter a django shell

Make sure to run this whenever you change things in models.py.

cd archivebox/
./manage.py makemigrations

cd path/to/test/data/
archivebox shell
archivebox manage dbshell


Build the docs, pip package, and docker image

(Normally CI takes care of this, but these scripts can be run to do it manually)

./bin/build.sh

# or individually:
./bin/build_docs.sh
./bin/build_pip.sh
./bin/build_deb.sh
./bin/build_brew.sh
./bin/build_docker.sh

Roll a release

(Normally CI takes care of this, but these scripts can be run to do it manually)

./bin/release.sh

# or individually:
./bin/release_docs.sh
./bin/release_pip.sh
./bin/release_deb.sh
./bin/release_brew.sh
./bin/release_docker.sh




This project is maintained mostly in my spare time with help from generous contributors and Monadical (✨ hire them for dev work!).


Sponsor us on Github




Issues
  • v0.4 (first Django release)


    The v0.4 Release

    A bunch of big changes:

    • pip install archivebox is now available
    • beginnings of transition to Django while maintaining a mostly backwards-compatible CLI
    • using argparse instead of hand-written CLI system: see archivebox/cli/archivebox.py
    • new subcommands-based CLI for archivebox (see below)

    For more info, see: https://github.com/pirate/ArchiveBox/wiki/Roadmap

    Released in this version:

    Install Methods:

    Note: apt, brew are now available as of v0.5

    Command Line Interface:

    Web UI:

    • ✅ / Main index
    • ✅ /add Page to add new links to the archive (but needs improvement)
    • ✅ /archive/<timestamp>/ Snapshot details page
    • ✅ /archive/<timestamp>/<url> live wget archive of page
    • ✅ /archive/<timestamp>/<extractor> get a specific extractor output for a given snapshot
    • ✅ /archive/<url> shortcut to view most recent snapshot of given url
    • ✅ /archive/<url_hash> shortcut to view most recent snapshot of given url
    • ✅ /admin Admin interface to view and edit archive data

    Python API:

    (Red โŒ features are still unfinished and will be released in later versions)

    opened by pirate 46
  • Error on Windows 10 when adding URL: UnicodeEncodeError: 'charmap' codec can't encode: character maps to <undefined>


    [i] [2021-03-27 04:40:48] ArchiveBox v0.5.4: archivebox add https://youtube.com/
        > E:\ArchiveBox
    
    [!] Warning: Missing 6 recommended dependencies
        ! WGET_BINARY: wget (unable to detect version)
        ! SINGLEFILE_BINARY: single-file (unable to detect version)
          Hint: npm install --prefix . "git+https://github.com/ArchiveBox/ArchiveBox.git"
                or archivebox config --set SAVE_SINGLEFILE=False to silence this warning
    
        ! READABILITY_BINARY: readability-extractor (unable to detect version)
          Hint: npm install --prefix . "git+https://github.com/ArchiveBox/ArchiveBox.git"
                or archivebox config --set SAVE_READABILITY=False to silence this warning
    
        ! MERCURY_BINARY: mercury-parser (unable to detect version)
          Hint: npm install --prefix . "git+https://github.com/ArchiveBox/ArchiveBox.git"
                or archivebox config --set SAVE_MERCURY=False to silence this warning
    
        ! CHROME_BINARY: unable to find binary (unable to detect version)
        ! RIPGREP_BINARY: rg (unable to detect version)
    
    [+] [2021-03-27 04:40:52] Adding 1 links to index (crawl depth=0)...
        > Saved verbatim input to sources/E:\ArchiveBox\sources\1616820052-import.txt
        > Parsed 1 URLs from input (Plain Text)
        > Found 1 new URLs not already in index
    
    [*] [2021-03-27 04:40:52] Writing 1 links to main index...
    √ E:\ArchiveBox\index.sqlite3
    
    [▶] [2021-03-27 04:40:52] Starting archiving of 1 snapshots in index...
        ! Failed to archive link: UnicodeEncodeError: 'charmap' codec can't encode character '\u25be' in position 9443: character maps to <undefined>
    
    Traceback (most recent call last):
      File "d:\python\lib\runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "d:\python\lib\runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "D:\Python\Scripts\archivebox.exe\__main__.py", line 7, in <module>
        from .cli import main
      File "d:\python\lib\site-packages\archivebox\cli\__init__.py", line 129, in main
        run_subcommand(
      File "d:\python\lib\site-packages\archivebox\cli\__init__.py", line 69, in run_subcommand
        module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
      File "d:\python\lib\site-packages\archivebox\cli\archivebox_add.py", line 85, in main
        add(
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\main.py", line 592, in add
        archive_links(new_links, overwrite=False, **archive_kwargs)
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\extractors\__init__.py", line 173, in archive_links
        archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\extractors\__init__.py", line 95, in archive_link
        write_link_details(link, out_dir=out_dir, skip_sql_index=False)
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\index\__init__.py", line 333, in write_link_details
        write_html_link_details(link, out_dir=out_dir)
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\index\html.py", line 79, in write_html_link_details
        atomic_write(str(Path(out_dir) / HTML_INDEX_FILENAME), rendered_html)
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\system.py", line 47, in atomic_write
        f.write(contents)
      File "d:\python\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u25be' in position 9443: character maps to <undefined>
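    The last frames show the root cause: on Windows, Python's open() defaults to the locale codepage (cp1252 here), so writing the snapshot's HTML index fails on any character outside it. A sketch of the usual fix (this atomic_write is a stand-in for illustration, not ArchiveBox's actual implementation): always pass an explicit encoding when writing text files.

```python
import os
import tempfile

def atomic_write(path, contents):
    """Write text to `path` atomically, forcing UTF-8 so the platform's
    default locale codepage (e.g. cp1252 on Windows) can't raise
    UnicodeEncodeError on characters like '\u25be'."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(contents)
        os.replace(tmp, path)  # atomic on both POSIX and Windows
    except BaseException:
        os.unlink(tmp)
        raise
```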
    
    is: bug difficulty: easy 
    opened by Leontking 41
  • Bugfix: docker-compose instructions create a sonic container that fails to start


    Describe the bug

    I followed the docker-compose instructions from the README. This is the result:

    [[email protected] archivebox]# docker-compose ps
             Name                        Command                State             Ports
    --------------------------------------------------------------------------------------------
    archivebox_archivebox_1   dumb-init -- /app/bin/dock ...   Up         0.0.0.0:8000->8000/tcp
    archivebox_sonic_1        sonic -c /etc/sonic.cfg          Exit 101
    
    [[email protected] archivebox]# docker-compose logs sonic
    Attaching to archivebox_sonic_1
    sonic_1       | thread 'main' panicked at 'cannot read config file: Os { code: 21, kind: Other, message: "Is a directory" }', src/config/reader.rs:24:14
    sonic_1       | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    sonic_1       | thread 'main' panicked at 'cannot read config file: Os { code: 21, kind: Other, message: "Is a directory" }', src/config/reader.rs:24:14
    sonic_1       | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    

    Search seems to work anyway.

    I would expect one of:

    a. sonic container is not created by default if it requires the user to manually create a config and is not necessary to run ArchiveBox
    b. config.cfg is created for me by the init script, using the environment variable I set in the docker-compose file
    c. config.cfg is not required by sonic (however, this is not the case: https://github.com/valeriansaliou/sonic/issues/197)

    Steps to reproduce

    From the README:

    # create a new empty directory and initalize your collection (can be anywhere)
    mkdir ~/archivebox && cd ~/archivebox
    curl -O https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/docker-compose.yml
    docker-compose run archivebox init
    docker-compose run archivebox --version
    
    # start the webserver and open the UI (optional)
    docker-compose run archivebox manage createsuperuser
    docker-compose up -d
    open http://127.0.0.1:8000
    
    # you can also add links and manage your archive via the CLI:
    docker-compose run archivebox add 'https://example.com'
    docker-compose run archivebox status
    docker-compose run archivebox help  # to see more options
    

    ArchiveBox version

    [[email protected] archivebox]# docker-compose run archivebox --version
    Starting archivebox_sonic_1 ... done
    Creating archivebox_archivebox_run ... done
    ArchiveBox v0.5.3
    Cpython Linux Linux-5.9.1-arch1-1-x86_64-with-glibc2.28 x86_64 (in Docker)
    
    [i] Dependency versions:
     √  ARCHIVEBOX_BINARY     v0.5.3          valid     /usr/local/bin/archivebox
     √  PYTHON_BINARY         v3.9.1          valid     /usr/local/bin/python3.9
     √  DJANGO_BINARY         v3.1.3          valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
     √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
     √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget
     √  NODE_BINARY           v15.5.1         valid     /usr/bin/node
     √  SINGLEFILE_BINARY     v0.1.14         valid     /node/node_modules/single-file/cli/single-file
     √  READABILITY_BINARY    v0.1.0          valid     /node/node_modules/readability-extractor/readability-extractor
     √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
     √  GIT_BINARY            v2.20.1         valid     /usr/bin/git
     √  YOUTUBEDL_BINARY      v2021.01.03     valid     /usr/local/bin/youtube-dl
     √  CHROME_BINARY         v87.0.4280.88   valid     /usr/bin/chromium
     √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg

    [i] Source-code locations:
     √  PACKAGE_DIR           22 files        valid     /app/archivebox
     √  TEMPLATES_DIR         3 files         valid     /app/archivebox/themes

    [i] Secrets locations:
     -  CHROME_USER_DATA_DIR  -               disabled
     -  COOKIES_FILE          -               disabled

    [i] Data locations:
     √  OUTPUT_DIR            6 files         valid     /data
     √  SOURCES_DIR           1 files         valid     ./sources
     √  LOGS_DIR              0 files         valid     ./logs
     √  ARCHIVE_DIR           1 files         valid     ./archive
     √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
     √  SQL_INDEX             204.0 KB        valid     ./index.sqlite3
    
    [[email protected] archivebox]# docker version
    Client:
     Version:           20.10.2
     API version:       1.40
     Go version:        go1.15.6
     Git commit:        2291f610ae
     Built:             Tue Jan  5 19:56:21 2021
     OS/Arch:           linux/amd64
     Context:           default
     Experimental:      true
    
    Server:
     Engine:
      Version:          19.03.13-ce
      API version:      1.40 (minimum version 1.12)
      Go version:       go1.15.2
      Git commit:       4484c46d9d
      Built:            Sat Sep 26 12:03:35 2020
      OS/Arch:          linux/amd64
      Experimental:     false
     containerd:
      Version:          v1.4.1.m
      GitCommit:        c623d1b36f09f8ef6536a057bd658b3aa8632828.m
     runc:
      Version:          1.0.0-rc92
      GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
     docker-init:
      Version:          0.19.0
      GitCommit:        de40ad0
    
    [[email protected] archivebox]# docker-compose version
    docker-compose version 1.27.4, build 40524192
    docker-py version: 4.3.1
    CPython version: 3.7.7
    OpenSSL version: OpenSSL 1.1.0l  10 Sep 2019
    
    is: bug difficulty: easy status: done touches: documentation 
    opened by JohnMaguire 28
  • Question: ... How to fix Permission denied: '/data'

    Question: ... How to fix Permission denied: '/data'

    I'm following the setup instructions using docker-compose.

    When I run docker-compose run archivebox init I get

    [i] [2020-11-16 13:38:31] ArchiveBox v0.4.21: archivebox init
        > /data
    
    Traceback (most recent call last):
      File "/usr/local/bin/archivebox", line 33, in <module>
        sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
      File "/app/archivebox/cli/__init__.py", line 123, in main
        run_subcommand(
      File "/app/archivebox/cli/__init__.py", line 63, in run_subcommand
        module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
      File "/app/archivebox/cli/archivebox_init.py", line 33, in main
        init(
      File "/app/archivebox/util.py", line 113, in typechecked_function
        return func(*args, **kwargs)
      File "/app/archivebox/main.py", line 259, in init
        is_empty = not len(set(os.listdir(out_dir)) - ALLOWED_IN_OUTPUT_DIR)
    PermissionError: [Errno 13] Permission denied: '/data'
    

    How can I fix this, please?
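The traceback shows `os.listdir('/data')` failing, i.e. the user inside the container cannot read the mounted data directory. A quick host-side check expressing the same requirement (a sketch; on a real setup run it as, or compare against, the UID/GID the container actually runs as):

```python
import os

def data_dir_usable(path):
    """True if the current user can list, write, and traverse the
    directory -- the minimum `archivebox init` needs from /data."""
    return os.path.isdir(path) and os.access(path, os.R_OK | os.W_OK | os.X_OK)

print(data_dir_usable("./data"))
```

If this is False for the container's user, `chown -R` the host directory to that UID/GID (or relax its permissions) and re-run `docker-compose run archivebox init`.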

    is: bug touches: config difficulty: easy status: done 
    opened by Prn-Ice 27
  • RSS parser falls back to full-text and imports unneeded URLs from metadata fields

    RSS parser falls back to full-text and imports unneeded URLs from metadata fields

    It looks like Shaarli feeds are not being parsed correctly and markup is being included in the link structure (much like ticket 134 for pocket). Also, it looks like shaarli detail and tag pages are being parsed as source links, making the import much slower and leading to clutter in the archive.

    You can use the public shaarli demo to reproduce this.

    There's a demo (U: demo / PW: demo) running on https://demo.shaarli.org/.

    1. Add any link to this instance.

    The Atom feed then looks like this, e.g. with just one link (this is what's being parsed as the input file):

    <?xml  version="1.0" encoding="UTF-8" ?>
    <feed xmlns="http://www.w3.org/2005/Atom">
      <title>Shaarli demo (master)</title>
      <subtitle>Shaared links</subtitle>
      
        <updated>2019-01-30T06:06:01+00:00</updated>
      
      <link rel="self" href="https://demo.shaarli.org/?do=atom" />
      
      <author>
        <name>https://demo.shaarli.org/</name>
        <uri>https://demo.shaarli.org/</uri>
      </author>
      <id>https://demo.shaarli.org/</id>
      <generator>Shaarli</generator>
      
        <entry>
          <title>Aktuelle Trojaner-Welle: Emotet lauert in gefälschten Rechnungsmails | heise online</title>
          
            <link href="https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html" />
          
          <id>https://demo.shaarli.org/?cEV4vw</id>
          
            <published>2019-01-30T06:06:01+00:00</published>
            <updated>2019-01-30T06:06:01+00:00</updated>
          
          <content type="html" xml:lang="en"><![CDATA[<div class="markdown"><p>&#8212; <a href="https://demo.shaarli.org/?cEV4vw">Permalink</a></p></div>]]></content>
          
          
        </entry>
      
    </feed>
    

    Note that ArchiveBox wants to include 8 links from this:

    Adding 8 new links from /data/sources/demo.shaarli.org-1548828643.txt to /data/index.json
    

    Most likely this is because 8 instances of http:// were found (that's just my speculation). However, the expected behaviour is that only the source link is parsed and added, not Shaarli detail pages like https://demo.shaarli.org/?cEV4vw that contain nothing but the actual link to the source (again). IMO that doesn't make sense. It's even worse if a link has tags, because every tag then leads to another URL being crawled.
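The expected behaviour described above can be sketched with the stdlib XML parser: walk only `<entry>` elements and collect their `<link href>` attributes, so metadata like `<id>`, `<uri>`, and `<name>` never become snapshot URLs. This is a sketch of the idea, not ArchiveBox's actual parser:

```python
import xml.etree.ElementTree as ET

# Atom elements are namespaced; ElementTree needs the full qualified name
ATOM = "{http://www.w3.org/2005/Atom}"

def entry_links(atom_xml):
    """Extract only the real target URLs from an Atom feed string."""
    root = ET.fromstring(atom_xml)
    urls = []
    for entry in root.iter(ATOM + "entry"):       # only per-entry content
        for link in entry.iter(ATOM + "link"):    # only real <link> elements
            href = link.get("href")
            if href:
                urls.append(href)
    return urls
```

Against the one-entry demo feed above, this should yield just the single heise.de article URL instead of 8 speculative ones.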

    1. Grab the Atom Feed https://demo.shaarli.org/?do=atom and import to ArchiveBox: docker-compose exec archivebox /bin/archive https://demo.shaarli.org/?do=atom
    2. You will see that markup fragments end up in the parser:
    [email protected]:/volume1/docker/ArchiveBox/ArchiveBox-master# docker-compose exec archivebox /bin/archive https://demo.shaarli.org/?do=atom
    [*] [2019-01-30 06:10:43] Downloading https://demo.shaarli.org/?do=atom > /data/sources/demo.shaarli.org-1548828643.txt
    [+] [2019-01-30 06:11:02] Adding 8 new links from /data/sources/demo.shaarli.org-1548828643.txt to /data/index.json
    [√] [2019-01-30 06:11:18] Updated main index files:
        > /data/index.json
        > /data/index.html
    [▶] [2019-01-30 06:11:18] Updating files for 8 links in archive...
    [+] [2019-01-30 06:11:27] "Aktuelle Trojaner-Welle: Emotet lauert in gefälschten Rechnungsmails | heise online - Shaarli demo (master)"
        https://demo.shaarli.org/?cEV4vw
        > /data/archive/1548828660 (new)
          > favicon
          > wget
            Got wget response code 8:
              Total wall clock time: 5.1s
              Downloaded: 20 files, 1.1M in 0.7s (1.54 MB/s)
            Some resources were skipped: 404 Not Found
            Run to see full output:
                cd /data/archive/1548828660;
                wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828689 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/?cEV4vw
          > pdf
          > screenshot
          > dom
          > archive_org
          > git
          > media
          √ index.json
          √ index.html
    [+] [2019-01-30 06:11:50] "Aktuelle Trojaner-Welle: Emotet lauert in gefälschten Rechnungsmails | heise online - Shaarli demo (master)"
        https://demo.shaarli.org/?cEV4vw</id>
        > /data/archive/1548828659 (new)
          > favicon
          > wget
            Got wget response code 8:
              Total wall clock time: 5.1s
              Downloaded: 20 files, 1.1M in 0.7s (1.54 MB/s)
            Some resources were skipped: 404 Not Found
            Run to see full output:
                cd /data/archive/1548828659;
                wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828710 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/?cEV4vw</id>
          > pdf
          > screenshot
          > dom
          > archive_org
            Failed: Exception BadQueryException: Illegal character in query at index 32: https://demo.shaarli.org/?cEV4vw</id>
            Run to see full output:
                curl --location --head --max-time 60 --get https://web.archive.org/save/https://demo.shaarli.org/?cEV4vw</id>
          > git
          > media
          √ index.json
          √ index.html
    [+] [2019-01-30 06:12:10] "comments_outline_white"
        https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html
        > /data/archive/1548828658 (new)
          > favicon
          > wget
            Got wget response code 4:
              Total wall clock time: 38s
              Downloaded: 128 files, 6.0M in 12s (502 KB/s)
            Some resources were skipped: Got an error from the server
            Run to see full output:
                cd /data/archive/1548828658;
                wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828730 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html
          > pdf
          > screenshot
          > dom
          > archive_org
          > git
          > media
            got youtubedl response code 1:
    b'ERROR: Unable to extract container ID; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.\n'
            Failed: Exception Failed to download media
            Run to see full output:
                cd /data/archive/1548828658;
                youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs -x -k --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html
          √ index.json
          √ index.html
    [+] [2019-01-30 06:13:06] "https://demo.shaarli.org/</id>"
        https://demo.shaarli.org/</id>
        > /data/archive/1548828657 (new)
          > favicon
          > wget
            Got wget response code 8:
              https://demo.shaarli.org/%3C/id%3E:
              2019-01-30 06:13:07 ERROR 404: Not Found.
            Some resources were skipped: 404 Not Found
            Run to see full output:
                cd /data/archive/1548828657;
                wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828786 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/</id>
          > pdf
          > screenshot
          > dom
          > archive_org
            Failed: Exception BadQueryException: Illegal character in path at index 25: https://demo.shaarli.org/</id>
            Run to see full output:
                curl --location --head --max-time 60 --get https://web.archive.org/save/https://demo.shaarli.org/</id>
          > git
          > media
            got youtubedl response code 1:
    b"WARNING: Could not send HEAD request to https://demo.shaarli.org/</id>: HTTP Error 404: Not Found\nERROR: Unable to download webpage: HTTP Error 404: Not Found (caused by <HTTPError 404: 'Not Found'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.\n"
            Failed: Exception Failed to download media
            Run to see full output:
                cd /data/archive/1548828657;
                youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs -x -k --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://demo.shaarli.org/</id>
          √ index.json
          √ index.html
    [+] [2019-01-30 06:13:16] "https://demo.shaarli.org/</uri>"
        https://demo.shaarli.org/</uri>
        > /data/archive/1548828656 (new)
          > favicon
          > wget
            Got wget response code 8:
              https://demo.shaarli.org/%3C/uri%3E:
              2019-01-30 06:13:17 ERROR 404: Not Found.
            Some resources were skipped: 404 Not Found
            Run to see full output:
                cd /data/archive/1548828656;
                wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828796 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/</uri>
          > pdf
          > screenshot
          > dom
          > archive_org
            Failed: Exception BadQueryException: Illegal character in path at index 25: https://demo.shaarli.org/</uri>
            Run to see full output:
                curl --location --head --max-time 60 --get https://web.archive.org/save/https://demo.shaarli.org/</uri>
          > git
          > media
            got youtubedl response code 1:
    b"WARNING: Could not send HEAD request to https://demo.shaarli.org/</uri>: HTTP Error 404: Not Found\nERROR: Unable to download webpage: HTTP Error 404: Not Found (caused by <HTTPError 404: 'Not Found'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.\n"
            Failed: Exception Failed to download media
            Run to see full output:
                cd /data/archive/1548828656;
                youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs -x -k --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://demo.shaarli.org/</uri>
          √ index.json
          √ index.html
    [+] [2019-01-30 06:13:25] "Shaarli demo (master)"
        https://demo.shaarli.org/?do=atom
        > /data/archive/1548828655 (new)
          > favicon
          > wget
          > pdf
          > screenshot
          > dom
          > archive_org
          > git
          > media
          √ index.json
          √ index.html
    [+] [2019-01-30 06:13:36] "https://demo.shaarli.org/</name>"
        https://demo.shaarli.org/</name>
        > /data/archive/1548828655.0 (new)
          > favicon
          > wget
            Got wget response code 8:
              https://demo.shaarli.org/%3C/name%3E:
              2019-01-30 06:13:37 ERROR 404: Not Found.
            Some resources were skipped: 404 Not Found
            Run to see full output:
                cd /data/archive/1548828655.0;
                wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828816 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/</name>
          > pdf
          > screenshot
          > dom
          > archive_org
            Failed: Exception BadQueryException: Illegal character in path at index 25: https://demo.shaarli.org/</name>
            Run to see full output:
                curl --location --head --max-time 60 --get https://web.archive.org/save/https://demo.shaarli.org/</name>
          > git
          > media
            got youtubedl response code 1:
    b"WARNING: Could not send HEAD request to https://demo.shaarli.org/</name>: HTTP Error 404: Not Found\nERROR: Unable to download webpage: HTTP Error 404: Not Found (caused by <HTTPError 404: 'Not Found'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.\n"
            Failed: Exception Failed to download media
            Run to see full output:
                cd /data/archive/1548828655.0;
                youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs -x -k --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://demo.shaarli.org/</name>
          √ index.json
          √ index.html
    [+] [2019-01-30 06:13:45] "http://www.w3.org/2005/Atom"
        http://www.w3.org/2005/Atom
        > /data/archive/1548828644 (new)
          > favicon
          > wget
          > pdf
          > screenshot
          > dom
          > archive_org
            Failed: Exception LiveDocumentNotAvailableException: http://www.w3.org/2005/Atom: live document unavailable: java.net.SocketTimeoutException: Read timed out
            Run to see full output:
                curl --location --head --max-time 60 --get https://web.archive.org/save/http://www.w3.org/2005/Atom
          > git
          > media
          √ index.json
          √ index.html
    [√] [2019-01-30 06:15:28] Update of 8 links complete (4.17 min)
        - 8 entries skipped
        - 41 entries updated
        - 15 errors
    

    (note the </id> at the end of the links)

    is: bug status: needs followup difficulty: easy 
    opened by mawmawmawm 24
  • WIP: Create python package from repository

    WIP: Create python package from repository

    This will create a python package installable using pip.

    The package can be later published on pypi for easier access.

    Before merging I would squash everything into one commit if approved.

    Scripts

    The installation provides an archive command that is available from the shell and executes the archive.py script.
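For reference, a console script like this is usually declared via a setuptools entry point; a minimal sketch of the wiring (the `archiver.archive:main` target is a hypothetical module path, not necessarily this repo's actual layout):

```python
# This dict would be passed as setup(entry_points=...) in setup.py;
# pip then generates an `archive` executable that calls the named function.
ENTRY_POINTS = {
    "console_scripts": [
        "archive = archiver.archive:main",  # hypothetical module:function target
    ],
}

print(ENTRY_POINTS["console_scripts"][0])
```

With this in place, `pip install .` puts an `archive` command on the user's PATH instead of requiring them to invoke the script by file path.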

    Setup

    The important part is the setup.py file as it contains metadata and instructions for pip.

    I filled it in with the information I could find and it should be OK, but as you are the author, please review it.

    config.py

    As this file is considered editable by the user, maybe we should move it somewhere more suitable (e.g. ~/.config/bookmark-archiver/config.py) and access it at runtime.

    opened by edoput 23
  • Discussion: new name!

    Discussion: new name!

    Hey everyone! I have a big refactor in the works with some breaking changes, and I thought I'd take this opportunity to re-release BA with a better name and a 1.0 version. The new release modularizes BA into a python package, which lets people import individual parts for their own uses (e.g. parsers, link archiving, screenshotting, indexing). It fixes a lot of the bad decisions I made early on (e.g. using timestamps as unique keys instead of sha256 hashes of the URLs). It also adds a backend with a web GUI for searching and adding imports.

    The new name should be easy to find and type in a python packaging context and should be related to web archiving somehow.

    Requirements for a new name:

    • one word
    • no symbols or spaces (since it's going to be imported as a python package, like from webfreeze.pocket import parse_links)
    • should be 1st in google results when released with a new name (i.e. no competing projects/keywords)
    • should be intuitively related to web archiving

    Potential ideas:

    • WebFreeze
    • Freezekit
    • ArchiveKit
    • WebCooler

    Comment with your name suggestions/ideas!

    status: idea phase 
    opened by pirate 23
  • Link parsing: Pinboard private feeds don't seem to get parsed properly

    Link parsing: Pinboard private feeds don't seem to get parsed properly

    I would love to have the cron job that monitors my Pocket feed also monitor my private Pinboard feed. However, no matter which method I use to pass the feed to bookmark-archiver using the instructions, all have their own unique failure.

    If I pass a public feed, like http://feeds.pinboard.in/rss/u:username/, it works fine. But if I pass a private feed, like https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/, it errors out. I have tried the RSS, JSON, and Text feeds, and none work.

    Examples here (I've simply replaced the actual feed I used to test with the demo URL Pinboard provides):

    ./archive "https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/"

    [*] [2018-10-18 21:14:03] Downloading https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897243.txt
    [X] No links found :(
    

    ./archive "https://feeds.pinboard.in/json/secret:xxxx/u:username/private/"

    [*] [2018-10-18 21:13:46] Downloading https://feeds.pinboard.in/json/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897226.txt
    Traceback (most recent call last):
      File "./archive", line 161, in <module>
        links = merge_links(archive_path=out_dir, import_path=source)
      File "./archive", line 53, in merge_links
        raw_links = parse_links(import_path)
      File "/home/USERNAME/datahoarding/bookmark-archiver/archiver/parse.py", line 54, in parse_links
        links += list(parser_func(file))
      File "/home/USERNAME/bookmark-archiver/archiver/parse.py", line 108, in parse_json_export
        url = erg['url']
    KeyError: 'url'
    

    ./archive "https://feeds.pinboard.in/text/secret:xxxx/u:username/private/"

    [*] [2018-10-18 21:17:57] Downloading https://feeds.pinboard.in/text/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897477.txt
    [X] No links found :(
    

    Even though the script says that links are not found, they are definitely there, and simply pasting the URL into a browser outputs the feed in the proper format. I used this script successfully with other methods, like the Pinboard manual export, the Pocket manual export and RSS feed, and browser export. Is this just not a supported method for importing/monitoring?
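The KeyError in the JSON case comes from parse_json_export assuming every item has a 'url' key, while Pinboard's feeds use different key names (the private JSON feed appears to use 'u', exports use 'href'; these key names are my assumption about the formats). A tolerant parser can try several keys and skip items that have none:

```python
import json

def parse_json_links(text, keys=("url", "href", "u")):
    """Collect link URLs from a JSON list of bookmark objects,
    skipping items that carry no recognizable URL key."""
    links = []
    for item in json.loads(text):
        for key in keys:
            if key in item:
                links.append(item[key])
                break  # take the first matching key per item
    return links
```

This degrades gracefully instead of crashing the whole import when one feed format names its URL field differently.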

    is: bug status: needs followup 
    opened by drpfenderson 19
  • Full-text search

    Full-text search

    Summary

    This PR adds the ability to do full-text search 🎉

    Related issues

    #22 #24

    Changes these areas

    • [ ] Bugfixes
    • [x] Feature behavior
    • [ ] Command line interface
    • [ ] Configuration options
    • [x] Internal architecture
    • [ ] Snapshot data layout on disk
    opened by jdcaballerov 19
  • Running `archivebox init` via pip install on Windows 10 triggers

    Running `archivebox init` via pip install on Windows 10 triggers "File not found" error

    I'm on Windows 10. I tried to install ArchiveBox from pip, but after I ran npm install --prefix . 'git+https://github.com/ArchiveBox/ArchiveBox.git' and then archivebox init, it gave a "File Not Found" error.

    is: bug 
    opened by DUOLabs333 18
  • Question: Do we really need healthcheck?

    Question: Do we really need healthcheck?

    Today I noticed that my disk write rate had increased over the past months. When I investigated, I found that Docker was doing a large amount of disk writes, seemingly from regularly rewriting config.v2.json in the directory of the ArchiveBox container. When I inspected that file, I found logs for the healthcheck, and the container had been in an unhealthy state for a long time because I had changed the service port of the container in docker-compose.yml. I tried bringing it back to a healthy state by changing the port back, but the problem was still there, and the disk write rate I could see at that point was ~15MB per 30s. Eventually I disabled the healthcheck in docker-compose.yml and the problem stopped.

    There are two observations:

    • the unhealthy state had been there for a long time, and nothing ever brought that up to me
    • the healthcheck may produce large disk write

    So I wonder whether it's really meaningful to have the healthcheck at all. It doesn't seem to be useful that it can inform users if something goes wrong, and it seems to be wasting resources. Should we drop it? Or disable it by default in the provided docker-compose.yml?

    opened by upsuper 3
  • Question: How to run AB on localhost but store data on NAS?

    Question: How to run AB on localhost but store data on NAS?

    Hello!

    I'm using docker-compose. Following the title of this issue: I first tried changing the whole data dir to a mounted path pointing to my NAS, but I got this error:

      archivebox_1  | [X] OSError: Failed to write /data/ArchiveBox.conf with fcntl.F_FULLFSYNC. ([Errno 22] Invalid argument)
      archivebox_1  |     You can store the archive/ subfolder on a hard drive or network share that doesn't support support syncronous writes,
      archivebox_1  |     but the main folder containing the index.sqlite3 and ArchiveBox.conf files must be on a filesystem that supports FSYNC.
    

    I cannot figure out how to separate data/archive/ from the rest as I wish to store this on my NAS. I tried symlinking it but it just complains that data/archive/ already exists.

    I would prefer to have as much data as possible on my NAS, so any or all of data/{logs,sonic,sources}/ as well.

    How do I setup this with docker-compose?
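The OSError above is ArchiveBox probing whether the target filesystem supports synchronous writes (fcntl.F_FULLFSYNC on macOS, fsync elsewhere). You can run a similar probe against a candidate mount before pointing the data dir at it. This is a sketch of the idea using plain os.fsync, not ArchiveBox's exact check:

```python
import os

def supports_fsync(dirpath):
    """Write a probe file in dirpath and fsync it; returns False on
    filesystems/mounts that reject synchronous writes."""
    probe = os.path.join(dirpath, ".fsync_probe")
    try:
        fd = os.open(probe, os.O_WRONLY | os.O_CREAT)
        try:
            os.write(fd, b"probe")
            os.fsync(fd)  # the call that fails on unsupported mounts
        finally:
            os.close(fd)
        return True
    except OSError:
        return False
    finally:
        try:
            os.remove(probe)
        except OSError:
            pass
```

If the NAS mount fails this probe, keep index.sqlite3 and ArchiveBox.conf on a local filesystem and mount only the archive/ subfolder from the NAS, as the error message suggests.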

    Thank you kindly in advance!

    opened by iwconfig 2
  • Support requested in setting up Archivebox

    Support requested in setting up Archivebox

    I am looking for someone who can assist me in configuring ArchiveBox in the proper way to support my project. My project is about a list of price plans.

    Currently my list of URLs looks like this: (screenshot omitted)

    The list is being used as an import file for a dashboard and the user who retrieves a particular company will be guided to the url mentioned in this table. Every URL points to a single page with a price scheme of the company. At the end of December each of these urls will get updated with information for 2022 or will get deleted and replaced with some other url for 2022. I would like to store the information for the current year and keep it accessible. Therefore I would like to replace every URL with a URL that refers to a specific domain that holds an archive of all these pages. I have no interest in storing additional pages that are linked to this URL, etc. It's only about the information on the given url.

    I am not enough of an expert in web pages, and ArchiveBox has just too many features for me to get this done safe and sound before the year comes to an end and these pages are replaced.

    Can someone assist me in setting up such a project with ArchiveBox?

    Thanks

    opened by sebastiaan6907 1
  • Bug: Unable to download TikTok page

    Bug: Unable to download TikTok page

    Describe the bug

    Not sure if this is a bug, a limitation of the TikTok website, or a limitation of ArchiveBox, but ArchiveBox fails to download all of the TikToks on a user's page.

    Steps to reproduce

    Ran ArchiveBox with depth=1 using the media archive method.

    Screenshots or log output

    [+] [2021-11-12 23:56:55] Adding 1 links to index (crawl depth=1)...
        > Saved verbatim input to sources/1636761415-import.txt
        > Parsed 1 URLs from input (Generic TXT)
    [*] Starting crawl of 1 sites 1 hop out from starting point
        > Downloading https://vm.tiktok.com/REDACTED_URL contents
        > Saved verbatim input to sources/1636761415.248872-crawl-vm.tiktok.com.txt
        > Parsed 0 URLs from input (Failed to parse)
        > Found 1 new URLs not already in index
    [*] [2021-11-12 23:56:55] Writing 1 links to main index...
        √ ./index.sqlite3
    [▶] [2021-11-12 23:56:55] Starting archiving of 1 snapshots in index...
    [+] [2021-11-12 23:56:55] "vm.tiktok.com/REDACTED_URL"
        https://vm.tiktok.com/REDACTED_URL
        > ./archive/1636761415.248872
          > media
            Extractor failed: Failed to save media
            Got youtube-dl response code: 1.
            WARNING: The program functionality for this site has been marked as broken, and will probably not work.
            ERROR: Unable to download JSON metadata: HTTP Error 502: Bad Gateway (caused by ); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
            Run to see full output:
                cd /data/archive/1636761415.248872;
                youtube-dl --write-description --write-info-json --write-annotations --write-thumbnail --no-call-home --write-sub --all-subs --write-auto-sub --convert-subs=srt --yes-playlist --continue --ignore-errors --geo-bypass --add-metadata --max-filesize=750m https://vm.tiktok.com/REDACTED_URL
            2 files (238.4 KB) in 0:00:03s
    [√] [2021-11-12 23:56:59] Update of 1 pages complete (3.54 sec)
        - 0 links skipped
        - 7 links updated
        - 3 links had errors

        Hint: To manage your archive in a Web UI, run:
            archivebox server 0.0.0.0:8000

    ArchiveBox version

    ArchiveBox v0.6.3
    Cpython Linux Linux-5.4.0-90-generic-x86_64-with-glibc2.28 x86_64
    IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep
    
    [i] Dependency versions:
     โˆš  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox                                                   
     โˆš  PYTHON_BINARY         v3.9.8          valid     /usr/local/bin/python3.9                                                    
     โˆš  DJANGO_BINARY         v3.1.13         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py           
     โˆš  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                                               
     โˆš  WGET_BINARY           v1.20.1         valid     /usr/bin/wget                                                               
     โˆš  NODE_BINARY           v15.14.0        valid     /usr/bin/node                                                               
     โˆš  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
     โˆš  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
     โˆš  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
     โˆš  GIT_BINARY            v2.20.1         valid     /usr/bin/git                                                                
     โˆš  YOUTUBEDL_BINARY      v2021.06.06     valid     /usr/local/bin/youtube-dl                                                   
     โˆš  CHROME_BINARY         v90.0.4430.212  valid     /usr/bin/chromium                                                           
     โˆš  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                                                 
    
    [i] Source-code locations:
     โˆš  PACKAGE_DIR           23 files        valid     /app/archivebox                                                             
     โˆš  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
     -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              
    
    [i] Secrets locations:
     -  CHROME_USER_DATA_DIR  -               disabled                                                                              
     -  COOKIES_FILE          -               disabled                                                                              
    
    [i] Data locations:
     โˆš  OUTPUT_DIR            5 files         valid     /data                                                                       
     โˆš  SOURCES_DIR           12 files        valid     ./sources                                                                   
     โˆš  LOGS_DIR              1 files         valid     ./logs                                                                      
     โˆš  ARCHIVE_DIR           3 files         valid     ./archive                                                                   
     โˆš  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
     โˆš  SQL_INDEX             228.0 KB        valid     ./index.sqlite3 
    
    opened by aidenmitchell 2
  • Bug: Empty image spaces where images are supposed to be

    Describe the bug

    Empty spaces appear where images are supposed to be. Both the SingleFile and wget snapshots show empty images.

    Steps to reproduce

    Go to https://mariushosting.com/ and archive any of the posts

    Screenshots or log output

    https://ibb.co/QJHGWzC

    ArchiveBox version

    latest

    opened by Unrepentant-Atheist 9
  • Feature Request: Hide previews for non-existent archive methods

    Type

    • [ ] General question or discussion
    • [ ] Propose a brand new feature
    • [x] Request modification of existing behavior or design

    What is the problem that your feature request solves

    Users see preview panes even when the corresponding archive method is not available. If a user tries to view one of these non-existent previews, they get an error message, which is likely to confuse users without a technical background.

    Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

    Only show previews that actually exist.

    How badly do you want this new feature?

    • [ ] It's an urgent deal-breaker, I can't live without it
    • [ ] It's important to add it in the near-mid term future
    • [x] It would be nice to have eventually

    • [ ] I'm willing to contribute dev time / money to fix this issue
    • [x] I like ArchiveBox so far / would recommend it to a friend
    • [ ] I've had a lot of difficulty getting ArchiveBox set up
    status: idea phase 
    opened by thenktor 1
  • Bug: Disable archive method selection if method is not available

    Describe the bug

    I've disabled most archive methods (e.g. SAVE_MERCURY=False), but when adding a new link to the archive the archive method selection field still shows the disabled methods. You can select them, but of course they just do not work.

    I'd expect disabled methods not to appear in this list.

    Screenshots or log output

    archive-methods

    ArchiveBox version

    ArchiveBox v0.6.2
    Cpython Linux Linux-5.13.19_1-x86_64-with-glibc2.28 x86_64
    IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep
    
    [i] Dependency versions:
     โˆš  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox                                                   
     โˆš  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9                                                    
     โˆš  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py           
     โˆš  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                                               
     -  WGET_BINARY           -               disabled  /usr/bin/wget                                                               
     โˆš  NODE_BINARY           v15.14.0        valid     /usr/bin/node                                                               
     โˆš  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
     -  READABILITY_BINARY    -               disabled  /node/node_modules/readability-extractor/readability-extractor              
     -  MERCURY_BINARY        -               disabled  /node/node_modules/@postlight/mercury-parser/cli.js                         
     -  GIT_BINARY            -               disabled  /usr/bin/git                                                                
     -  YOUTUBEDL_BINARY      -               disabled  /usr/local/bin/youtube-dl                                                   
     โˆš  CHROME_BINARY         v90.0.4430.93   valid     /usr/bin/chromium                                                           
     โˆš  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                                                 
    
    [i] Source-code locations:
     โˆš  PACKAGE_DIR           22 files        valid     /app/archivebox                                                             
     โˆš  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
     -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              
    
    [i] Secrets locations:
     -  CHROME_USER_DATA_DIR  -               disabled                                                                              
     -  COOKIES_FILE          -               disabled                                                                              
    
    [i] Data locations:
     โˆš  OUTPUT_DIR            5 files         valid     /data                                                                       
     โˆš  SOURCES_DIR           9 files         valid     ./sources                                                                   
     โˆš  LOGS_DIR              1 files         valid     ./logs                                                                      
     โˆš  ARCHIVE_DIR           6 files         valid     ./archive                                                                   
     โˆš  CONFIG_FILE           283.0 Bytes     valid     ./ArchiveBox.conf                                                           
     โˆš  SQL_INDEX             236.0 KB        valid     ./index.sqlite3
    
    opened by thenktor 0
  • ArchiveBox For YunoHost

    I wasn't sure of the best place to message you, but I wanted to let the ArchiveBox developers know that I finished packaging ArchiveBox for YunoHost.

    Hopefully it makes ArchiveBox easier for some folks to install and run, and cross-pollinates with some new communities.

    opened by mhfowler 1
  • Bug: Unable to delete snapshots

    I am using the latest version of the ArchiveBox web UI. When I try to delete a snapshot, it always says 'no actions selected'. How could this happen? By the way, I am running ArchiveBox on an arm64 (aarch64) platform.

    opened by orange2008 3
  • Bug: ArchiveBox add for Wallabag Atom feed doesn't work

    Describe the bug

    ArchiveBox add for Wallabag Atom feed doesn't work.

    Initially noticed that schedule doesn't work, and figured out that it's because Wallabag Atom feed doesn't work.

    Steps to reproduce

    Run archivebox add:

    archivebox add --parser=wallabag_atom --depth=1 https://wallabag.../feed/user/token/all
    

    Screenshots or log output

    [i] [2021-10-02 05:59:25] ArchiveBox v0.6.2: archivebox add --parser=wallabag_atom --depth=1 https://wallabag.../feed/user/token/all
        > /data
    
    [+] [2021-10-02 05:59:26] Adding 1 links to index (crawl depth=1)...
        > Saved verbatim input to sources/1633154366-import.txt
    
    [X] No links found using Wallabag Atom parser
        Hint: Try a different parser or double check the input?
    
        > Parsed 0 URLs from input (Wallabag Atom)
        > Found 0 new URLs not already in index
    
    [*] [2021-10-02 05:59:26] Writing 0 links to main index...
        โˆš ./index.sqlite3
    

    ArchiveBox version

    ArchiveBox v0.6.2
    Cpython Linux Linux-5.11.0-36-generic-x86_64-with-glibc2.28 x86_64
    IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=sonic
    
    [i] Dependency versions:
     โˆš  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox
     โˆš  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9
     โˆš  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
     โˆš  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
     -  WGET_BINARY           -               disabled  /usr/bin/wget
     โˆš  NODE_BINARY           v15.14.0        valid     /usr/bin/node
     โˆš  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
     โˆš  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
     -  MERCURY_BINARY        -               disabled  /node/node_modules/@postlight/mercury-parser/cli.js
     -  GIT_BINARY            -               disabled  /usr/bin/git
     -  YOUTUBEDL_BINARY      -               disabled  /usr/local/bin/youtube-dl
     โˆš  CHROME_BINARY         v90.0.4430.93   valid     /usr/bin/chromium
     โˆš  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg
    
    [i] Source-code locations:
     โˆš  PACKAGE_DIR           22 files        valid     /app/archivebox
     โˆš  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
     -  CUSTOM_TEMPLATES_DIR  -               disabled
    
    [i] Secrets locations:
     -  CHROME_USER_DATA_DIR  -               disabled
     -  COOKIES_FILE          -               disabled
    
    [i] Data locations:
     โˆš  OUTPUT_DIR            9 files         valid     /data
     โˆš  SOURCES_DIR           35 files        valid     ./sources
     โˆš  LOGS_DIR              2 files         valid     ./logs
     โˆš  ARCHIVE_DIR           102 files       valid     ./archive
     โˆš  CONFIG_FILE           420.0 Bytes     valid     ./ArchiveBox.conf
     โˆš  SQL_INDEX             1.1 MB          valid     ./index.sqlite3
    
    opened by m0nhawk 0
Releases(v0.6.2)
  • v0.6.2(Apr 10, 2021)

    New features

    • new ArchiveResult log in the admin web UI, with full editing ability of individual extractor outputs + list of outputs under each Snapshot admin entry
    • ability to save multiple snapshots of the same URL over time using new Re-snapshot button
    • add init --quick and server --quick-init options to quickly update the db version without doing a full re-init (for users with large archive collections this will make version upgrades a lot faster / less painful)
    • add new archivebox setup command and archivebox init --setup flag to aid in automatically installing dependencies and creating a superuser during initial setup
    • new SNAPSHOTS_PER_PAGE=40 and MEDIA_MAX_SIZE=750m config options
    • allow hotlinking directly to specific extractor output on the snapshot detail page using URL #hash e.g. /archive/<timestamp>/index.html#git
    • add ability to view the snapshot matching a given URL by visiting /archive/https://example.com/some/url -> redirects to -> /archive/<timestamp>/index.html (also works without a scheme: /archive/example.com)
    • #660 add ability to tag URLs while adding them via the web UI and via the CLI using archivebox add --tag=tag1,tag2,tag3 ...
    • #659 add back ability to override visual styling with custom HTML and CSS using new config option CUSTOM_TEMPLATES_DIR
    • ability to add and remove multiple tags at once from the snapshot admin using autocompleting dropdown
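
    A rough sketch of the URL-to-snapshot lookup behavior described above (all names and data here are hypothetical illustrations, not ArchiveBox internals):

    ```python
    # Hypothetical sketch: map a URL (with or without scheme) to its most
    # recent snapshot timestamp and build the redirect target path.
    from typing import Optional

    SNAPSHOTS = {  # url -> list of snapshot timestamps (illustrative data)
        "example.com/some/url": ["1617000000.0", "1617999999.0"],
    }

    def normalize(url):
        # strip the scheme so /archive/example.com also resolves
        for scheme in ("https://", "http://"):
            if url.startswith(scheme):
                return url[len(scheme):]
        return url

    def redirect_target(url) -> Optional[str]:
        timestamps = SNAPSHOTS.get(normalize(url))
        if not timestamps:
            return None
        # most recent snapshot wins
        return "/archive/{}/index.html".format(max(timestamps))

    print(redirect_target("https://example.com/some/url"))  # -> /archive/1617999999.0/index.html
    print(redirect_target("example.com/some/url"))          # works without a scheme too
    ```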

    Enhancements

    • lots of performance improvements! (in testing with 100k entries, the main index was brought down from 10-14 second load times to ~110ms once cache warms up)
    • full text search now works on the public snapshot list
    • dates and times are now localized to your browser's timezone instead of showing in UTC
    • integrity and correctness improvements to readability, mercury, warc, and other extractors
    • video subtitles and description are now added to the full-text search index as well (including youtube's autogenerated transcripts in all languages)
    • log all errors with full tracebacks to new data/logs/errors.log file (so users no longer have to run in --debug mode to see error details)
    • better archivebox schedule logging and changed logfile location to ./logs/schedule.log
    • better docker-compose setup experience with sonic config example in docker-compose.yml
    • add Django Debug Toolbar + djdt_flamegraph for developers to profile UI performance
    • add --overwrite flag support to archivebox schedule, archived urls get added similarly to add --overwrite
    • #644 remove Bootstrap and jQuery network requests to CDNs by inlining them instead
    • #647 allow filtering by ArchiveResult status in the Snapshot admin UI to select only links that have been archived or not archived
    • #550 kill all orphan child processes after each extractor finishes to prevent dangling chromium/node subprocesses and memory leaks
    • 3276434 add new SEARCH_BACKEND_TIMEOUT config option to tune the amount of time the search backend can take before it gives up
    • more diagnostic info added to the Snapshot admin view including most recent status code, content type, detected server, etc
    • make the order of the table columns, layout, and spacing the same on the public view and private view (also remove DataTable, we're not using it)
    • better snapshot grid page (faster load times, nicer CSS for tags and cards, more actions supported and metadata shown)
    • added Cache-Control headers to dramatically speed up load times by caching favicons, screenshots, etc. in browsers/upstreams
    • new project releases page https://releases.archivebox.io and demo url https://demo.archivebox.io
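
    The caching idea above can be sketched like this (the max-age value and path rule here are assumptions for illustration, not ArchiveBox's actual headers):

    ```python
    # Sketch: snapshot output files are immutable once written, so they can
    # be cached aggressively; index pages change as new snapshots arrive.
    def cache_headers(path):
        if path.startswith("/archive/"):
            return {"Cache-Control": "public, max-age=86400"}
        return {"Cache-Control": "no-cache"}

    print(cache_headers("/archive/1617999999.0/screenshot.png"))  # cached for a day
    print(cache_headers("/"))                                     # always revalidated
    ```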

    Bugfixes

    • #673 fix searching by URL substring in Snapshot admin list
    • #658 fix Snapshot admin action buttons not working in Safari and some other browsers
    • #678 fix AssertionError when archivebox attempted to archive with CHROME_BINARY=None because Chrome was not found on the host system
    • #654 fix some issues with sonic attempting to index massive text blobs or binary blobs on some pages and hanging
    • #674 fix UTF-8 encoding problems with file reading/writing on Windows (supporting a Python pkg on Windows is unreasonably painful, y'all)
    • #433 fix deleted items sometimes reappearing on next import/update
    • #473 fix issue preventing use of archivebox python API inside raw REPL (not using archivebox shell)
    • fix stdin/stdout/stderr handling for some edge cases in Docker/Docker-Compose
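
    The Windows encoding bug in #674 is an instance of a general pitfall: `open()` without an explicit encoding uses the platform locale (e.g. cp1252 on Windows), which corrupts non-ASCII content. A minimal sketch of the fix pattern:

    ```python
    # Always pass encoding="utf-8" explicitly so file round-trips don't
    # depend on the platform's locale codepage.
    import os
    import tempfile

    def write_index(path, text):
        with open(path, "w", encoding="utf-8") as f:  # explicit, not locale-dependent
            f.write(text)

    def read_index(path):
        with open(path, "r", encoding="utf-8") as f:
            return f.read()

    path = os.path.join(tempfile.mkdtemp(), "index.txt")
    write_index(path, "网站存档 🗃")
    print(read_index(path))  # round-trips regardless of platform locale
    ```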

    Source code(tar.gz)
    Source code(zip)
    archivebox--0.6.2-1.big_sur.bottle.tar.gz(11.46 MB)
    archivebox-0.6.2-py3-none-any.whl(477.89 KB)
    archivebox-0.6.2.tar.gz(403.89 KB)
    archivebox_0.6.2-1_all.deb(281.89 KB)
    Electron-ArchiveBox-macOS-x64-0.6.2.app.zip(76.54 MB)
  • v0.5.6(Feb 9, 2021)

    • add ARMv7 and ARMv8 CPU support for apt / deb distribution on Launchpad PPA
    • fix nodesource apt repo not supported on i386 b90afc8
    • fix handling of skipped ArchiveResult entries with null output 0aea5ed
    • catch exception on import of old index.json into ArchiveResult 171bbeb
    • move debsign to release not build 66fb5b2
    • skip tests during debian build a32eac3
    • fix emptystrings in cmd_version causing exception a49884a
    • automate deb dist better and bump version 0e6ac39
    • fix assertion 6705354
    • change wording of db not found error 683a087
    Source code(tar.gz)
    Source code(zip)
  • v0.5.4(Feb 1, 2021)

    Thank you to the contributors who helped with the 181 commits in this release!
    @cdvv7788, @jdcaballerov, @thedanbob, @aggroskater, @mAAdhaTTah, @mario-campos, @mikaelf

    • fix migration failing due to null cmd_versions in older archives a3008c8
    • publish minor & major versions to DockerHub and set up CodeQL codeql-analysis.yml c5b7d9f, bbb6cc8
    • fix DATABASE_NAME posixpath, and dependencies dict bug 02bdb3b, 5c7842f
    • use relative imports for .util to fix windows import clash 72e2c7b
    • fix COOKIES_FILE config param breaking in wget ef7711f
    • Refactor should_save_extractor methods to accept overwrite parameter 5420903
    • Fix issue #617 by using mark_safe in combination with format_html โ€ฆ 1989275
    • make permission chowning on docker start less fancy, respect PUID/PGID #635
    • add createsuperuser flag to server command 39ec77e
    • fix files icons styling and use the db exclusively for rendering them, instead of filesystem f004058, 7d8fe66, 5c54bcc, 534ead2
    • limit youtubedl download size to 750m and stop splitting out audio files 3227f54
    • also search url, timestamp, tags on public index 8a4edb4
    • fix trailing slash problems and wget not detecting download path 9764a8e
    • add response status code to headers.json c089501
    • fix singlefile path used for sonic 24e2493
    • cleanup template layout in filesystem, new snapshot detail page UI
    Source code(tar.gz)
    Source code(zip)
    archivebox-0.5.4-py3-none-any.whl(385.10 KB)
    archivebox_0.5.4-1_all.deb(235.85 KB)
  • v0.5.3(Jan 6, 2021)

    • ArchiveResult moved to SQLite3 DB for performance @cdvv7788
    • lots of assorted bugfixes and improvements courtesy of @cdvv7788 and @jdcaballerov
    • new full-text search support with ripgrep and sonic courtesy of @jdcaballerov
    • new archivebox oneshot command for downloading a single site without starting a whole collection
    • new Pocket API importer courtesy of @mAAdhaTTah
    • new Wallabag importer courtesy of @ehainry
    • new extractor options on Add page courtesy of @BlipRanger
    • new apt/deb/homebrew/pip packaging setup into separate repos under new Github Org https://github.com/ArchiveBox
    • new official PPA and Docker Hub accounts https://hub.docker.com/r/archivebox/archivebox (with automatic armv7 builds courtesy of @chrismeller)
    • new Snapshot grid view courtesy of @jdcaballerov
    Source code(tar.gz)
    Source code(zip)
  • v0.4.24(Dec 3, 2020)

  • v0.4.21(Aug 18, 2020)

  • v0.4.17(Aug 18, 2020)

    • Fix bugs with parsing long URLs as paths
    • html-encoded URLs
    • new generic HTML parser
    • new --init and --overwrite flags on add
    • improve stdout and hints
    • fix Pull title button
    • other small bugfixes
    Source code(tar.gz)
    Source code(zip)
  • v0.4.16(Aug 18, 2020)

  • v0.4.15(Aug 18, 2020)

    • fix a bug where invalid URLs were parsed and imported, causing the whole archive process to crash
    • add support for scheduled archiving in docker
    docker run -v $PWD:/data archivebox schedule --foreground --every=day --depth=1 'https://getpocket.com/users/USERNAME/feed/all'
    
    # docker-compose.yml
    
    version: '3.7'
    
    services:
      archivebox:
        image: nikisweeting/archivebox:latest
        command: schedule --foreground --every=day --depth=1 'https://getpocket.com/users/USERNAME/feed/all'
        environment:
          - USE_COLOR=True
          - SHOW_PROGRESS=False
        volumes:
          - ./data:/data
    
    Source code(tar.gz)
    Source code(zip)
  • v0.4.14(Aug 14, 2020)

    Adds support for the Readability article text extractor. It runs on the SingleFile, wget, and DOM dump output by default, but if none of those are available it will download the article from scratch to do text extraction. This release also officially adds Docker support for ARM architectures, including the Raspberry Pi. The image size was also shrunk from 1.5GB to 452MB by making sure unnecessary build tools are uninstalled after the package build process.
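
    The fallback order described above can be sketched as a simple preference list (illustrative logic, not the actual extractor code):

    ```python
    # Sketch: prefer snapshot output that already exists on disk; only
    # re-download the page when no local copy is available.
    def pick_readability_source(available):
        for source in ("singlefile", "wget", "dom"):
            if available.get(source):
                return source
        return "fetch"  # nothing on disk: download the article from scratch

    print(pick_readability_source({"wget": True, "dom": True}))  # -> wget
    print(pick_readability_source({}))                           # -> fetch
    ```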

    Source code(tar.gz)
    Source code(zip)
  • v0.4.13(Aug 10, 2020)

  • v0.4.12(Aug 10, 2020)

  • v0.4.11(Aug 7, 2020)

    We add a major new archive method in this release: SingleFile. On bare metal it requires installing Node and Chrome/Chromium, but it works out-of-the-box in the Docker version.

    This finally allows ArchiveBox to pass all of the acid tests except one, and the archives for GitHub and many other sites are nicer than what wget was able to produce on its own.

    Source code(tar.gz)
    Source code(zip)
  • v0.4.9(Jul 28, 2020)

    ๐ŸŒ… v0.4 is officially released. This is a long-awaited 3rd-pass review over every corner of the archivebox UX. It adresses many of the fundamental shortcomings around index consistency by using a new SQLite database, with automatic migrations provided by django. It also smooths many of the rough edges, adds a new admin Web UI, a rich new CLI, closes 40+ github tickets, and is the first official release available on PyPI.

    • https://pypi.org/project/archivebox/ pip install archivebox
    • https://hub.docker.com/r/nikisweeting/archivebox docker run -v $PWD:/data nikisweeting/archivebox
    • https://archivebox.readthedocs.io/en/latest/
    • https://github.com/pirate/ArchiveBox/releases/tag/v0.4.9

    Enjoy!

    ๐ŸŽ‰ Big thanks to everyone who helped! Especially the Monadical team @cdvv7788 @apkallum @afreydev and also @drpfenderson who helped us track down the last few index importing bugs! ๐ŸŽ‰

    The docs still have some work left to finish updating, but the CLI help text is all up-to-date (when in doubt, just pass --help).
    Let us know if you find any rough edges here: https://github.com/pirate/ArchiveBox/issues/new/choose

    pip install archivebox
    
    cd path/to/your/archive/folder
    
    archivebox init  # this doubles as the migrate command, it will safely upgrade existing index files automatically
    archivebox add 'https://example.com'
    archivebox add 'https://getpocket.com/users/USERNAME/feed/all' --depth=1
    archivebox status
    archivebox server
    archivebox help
    

    Or if you prefer Docker, the CLI works exactly the same (archivebox [subcommand] [...args]):

    docker run -v $PWD:/data nikisweeting/archivebox init
    docker run -v $PWD:/data nikisweeting/archivebox add 'https://example.com'
    docker run -v $PWD:/data -p 8000 nikisweeting/archivebox server
    
    version: '3.7'
    
    services:
        archivebox:
            image: nikisweeting/archivebox:latest
            command: server 0.0.0.0:8000
            stdin_open: true
            tty: true
            ports:
                - 8000:8000
            environment:
                - USE_COLOR=True
            volumes:
                - ./data:/data
    


    New Features

    A bunch of big changes:

    • pip install archivebox is now available
    • full transition to Django Sqlite DB with migrations (making upgrades between versions much safer now)
    • maintains an intuitive and helpful CLI that's backwards-compatible with all previous archivebox data versions
    • uses argparse instead of hand-written CLI system: see archivebox/cli/archivebox.py
    • new subcommands-based CLI for archivebox (see below)
    • new Web UI with pagination, better search, filtering, permissions, and more
    • 30+ assorted bugfixes, new features, and tickets closed

    For more info, see: https://github.com/pirate/ArchiveBox/wiki/Roadmap

    Released in this version:

    Web UI:

    • โœ… / Main index
    • โœ… /add Page to add new links to the archive (but needs improvement)
    • โœ… /archive/<timestamp>/ Snapshot details page
    • โœ… /archive/<timestamp>/<url> live wget archive of page
    • โœ… /archive/<timestamp>/<extractor> get a specific extractor output for a given snapshot
    • โœ… /archive/<url> shortcut to view most recent snapshot of given url
    • โœ… /archive/<url_hash> shortcut to view most recent snapshot of given url
    • โœ… /admin Admin interface to view and edit archive data
    • โœ… /old.html Backwards-compatible static HTML index for the previous version


    Source code(tar.gz)
    Source code(zip)
  • v0.2.4(Feb 27, 2019)

    • better archive corruption guards (check structure invariants on every parse & save)
    • remove title prefetching in favor of new FETCH_TITLE archive method
    • slightly improved CLI output for parsing and remote url downloading
    • re-save index after archiving completes to update titles and urls
    • remove redundant derivable data from link json schema
    • markdown link parsing support
    • faster link parsing and better symbol handling using a new compiled URL_REGEX
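
    As a rough illustration of what a compiled URL regex looks like (a deliberately simplified pattern, not ArchiveBox's actual URL_REGEX):

    ```python
    # Simplified pattern: match http(s) URLs up to whitespace, quotes, or
    # closing brackets, so markdown links like [text](url) terminate cleanly.
    import re

    URL_REGEX = re.compile(r"https?://[^\s\"'<>\)\]]+")

    text = "see [docs](https://example.com/docs) and http://example.org/page too"
    print(URL_REGEX.findall(text))
    # -> ['https://example.com/docs', 'http://example.org/page']
    ```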
    Source code(tar.gz)
    Source code(zip)
  • v0.2.3(Feb 19, 2019)

    • fixed issues with parsing titles including trailing tags
    • fixed issues with titles defaulting to URLs instead of attempting to fetch
    • fixed issue where bookmark timestamps from RSS would be ignored and current ts used instead
    • fixed issue where ONLY_NEW would overwrite existing links in archive with only new ones
    • fixed lots of issues with URL parsing by using urllib.parse instead of hand-written lambdas
    • ignore robots.txt when using wget (ssshhh don't tell anyone ๐Ÿ˜)
    • fix RSS parser bailing out when there's whitespace around XML tags
    • fix issue with browser history export trying to run ls on wrong directory
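
    The urllib.parse change above replaces hand-written string lambdas with the stdlib parser; a sketch of the kind of helpers involved (hypothetical names):

    ```python
    # Use urllib.parse for robust URL decomposition instead of ad-hoc
    # string splitting.
    from urllib.parse import urlparse

    def domain(url):
        return urlparse(url).netloc

    def without_fragment(url):
        # drop the #fragment but keep scheme, host, path, and query
        return urlparse(url)._replace(fragment="").geturl()

    print(domain("https://example.com/path?q=1#frag"))            # -> example.com
    print(without_fragment("https://example.com/path?q=1#frag"))  # -> https://example.com/path?q=1
    ```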
    Source code(tar.gz)
    Source code(zip)
  • v0.2.2(Feb 7, 2019)

    This is a bugfix release, many parts of the parsing process have been improved or fixed.

    • Shaarli RSS export support
    • Fix issues with plain text link parsing including quotes, whitespace, and closing tags in URLs
    • add USER_AGENT to archive.org submissions so they can track archivebox usage
    • remove all icons similar to archive.org branding from archive UI
    • hide some of the noisier youtubedl and wget errors
    • set permissions on youtubedl media folder
    • fix chrome data dir incorrect path and quoting
    • better chrome binary finding
    • show which parser is used when importing links, show progress when fetching titles
    Source code(tar.gz)
    Source code(zip)
  • v0.2.1(Jan 11, 2019)

    This is a feature-packed release, so it's likely to be a little buggier than usual!

    New features:

    • ability to load any plain text list of links (also the new fallback for all parsers)
    • WARC file saving via wget: FETCH_WARC=True
    • Git repository downloading with git clone: FETCH_GIT=True GIT_DOMAINS=github.com,gitlab.com,bitbucket.org
    • Media downloading with youtube-dl: FETCH_MEDIA=True MEDIA_TIMEOUT=36000
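
    Options like FETCH_WARC=True above are passed as environment variables; a sketch of how such boolean env config might be read (illustrative, not the actual config module):

    ```python
    # Read a boolean config flag from the environment, accepting common
    # truthy spellings; anything else counts as False.
    import os

    def env_bool(name, default=False):
        return os.environ.get(name, str(default)).strip().lower() in ("true", "1", "yes")

    os.environ["FETCH_GIT"] = "True"
    os.environ.pop("FETCH_MEDIA", None)  # ensure unset for the demo

    print(env_bool("FETCH_GIT"))    # -> True
    print(env_bool("FETCH_MEDIA"))  # -> False (falls back to the default)
    ```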

    Bugfixes:

    • autodetect the correct chromium binary in almost all cases
    • create browser history export folder automatically
    • higher allowed timestamp precision

    New logo:

    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Dec 21, 2018)

  • v0.1.0(Jun 11, 2018)

    Warning: Running this version will move the old html/ output folder to the new location: output/.

    Changes:

    • entirely new folder structure & code layout
    • moved scripts into bin/ folder, symlinked setup and archive for backwards-compatibility
    • removed TEMPLATE_INDEX* config options, just symlink the files in templates/ to your custom versions
    • added support for ./bin/export-browser-history JSON imports of browsing history from Chrome and Firefox
    Source code(tar.gz)
    Source code(zip)
  • v0.0.3(Oct 30, 2017)

    New Features:

    • Support for parsing links from RSS feeds
    • Support for specifying a URL as well as local file paths: ./archive.py https://example.com/path/to/rss/feed.xml
    • Support for --user-data-dir for archiving restricted sites with chrome headless
    • Simple & Fancy HTML & JSON indexes for each individual link
    • Archive attempt history stored in link index.json

    Improvements:

    • Append to existing archive instead of overwriting the index each time
    • Reduced unnecessary config options, it should "just work"
    • Smartly dedupe and cleanup messy archive folders
    • Massively cleaned up codebase
    Source code(tar.gz)
    Source code(zip)
  • v0.0.2(Jul 4, 2017)

    • refactor codebase into separate files
    • check for minimum python version before running
    • fix utf-8 encoding errors when writing index.html
    • make index easier to customize with templates/ folder
    • WIP audio & video downloading with youtube-dl
    Source code(tar.gz)
    Source code(zip)
  • v0.0.1(Jul 4, 2017)

    It's reached a point where I'm comfortable bringing Bookmark Archiver out of alpha and into beta. This release supports a broad range of bookmark export files, works well with wget archiving, and produces clean, future-compatible archive folders.

    See the README for more details and a list of features. Future releases will have a changelog.

Owner
ArchiveBox
The self-hosted internet archiving solution by @pirate and @Monadical-SAS. #webarchiving #digipres