🗃 Open-source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Overview

ArchiveBox
Open-source self-hosted web archiving.

โ–ถ๏ธ Quickstart | Demo | Github | Documentation | Info & Motivation | Community | Roadmap

"Your own personal internet archive" (website archiving / crawler)




ArchiveBox is a powerful self-hosted internet archiving solution written in Python. You feed it URLs of pages you want to archive, and it saves them to disk in a variety of formats depending on setup and content within.

🔢   Run ArchiveBox via Docker Compose (recommended), Docker, Apt, Brew, or Pip (see below).

apt/brew/pip3 install archivebox

archivebox init                       # run this in an empty folder
archivebox add 'https://example.com'  # start adding URLs to archive
curl https://example.com/rss.xml | archivebox add  # or add via stdin
archivebox schedule --every=day https://example.com/rss.xml

For each URL added, ArchiveBox saves several types of HTML snapshot (wget, Chrome headless, singlefile), a PDF, a screenshot, a WARC archive, any git repositories, images, audio, video, subtitles, article text, and more....

archivebox server --createsuperuser 0.0.0.0:8000   # use the interactive web UI
archivebox list 'https://example.com'  # use the CLI commands (--help for more)
ls ./archive/*/index.json              # or browse directly via the filesystem

You can then manage your snapshots via the filesystem, CLI, Web UI, SQLite DB (./index.sqlite3), Python API (alpha), REST API (alpha), or desktop app (alpha).
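For example, the SQLite index can be read directly with Python's stdlib. This is a minimal sketch, assuming the main index table is named core_snapshot with url/title/timestamp columns (an assumption based on ArchiveBox's Django models; verify against your own DB with `.schema` in the sqlite3 shell first):

```python
import sqlite3

def list_snapshots(db_path="./index.sqlite3"):
    """Return (url, title, timestamp) rows from an ArchiveBox index database.

    NOTE: the table/column names below are assumptions and may differ
    between ArchiveBox versions -- inspect your own DB first.
    """
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT url, title, timestamp FROM core_snapshot "
            "ORDER BY timestamp DESC"
        ).fetchall()
    finally:
        conn.close()
```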

At the end of the day, the goal is to sleep soundly knowing that the part of the internet you care about will be automatically preserved in multiple, durable long-term formats that will be accessible for decades (or longer).




โšก๏ธ   CLI Usage

# archivebox [subcommand] [--args]
archivebox --version
archivebox help
  • archivebox init/version/status/config/manage to administer your collection
  • archivebox add/remove/update/list to manage Snapshots in the archive
  • archivebox schedule to regularly pull in fresh URLs from bookmarks/history/Pocket/Pinboard/RSS/etc.
  • archivebox oneshot to archive single URLs without starting a whole collection
  • archivebox shell/manage dbshell to open a REPL for the Python API (alpha) or SQL API

Demo | Screenshots | Usage



Quickstart

🖥   Supported OSs: Linux/BSD, macOS, Windows     🎮   CPU Architectures: x86, amd64, arm7, arm8 (raspi >=3)     📦   Distributions: docker/apt/brew/pip3/npm (in order of completeness)

(click to expand your preferred ► distribution below for full setup instructions)

Get ArchiveBox with docker-compose on any platform (recommended, everything included out-of-the-box)

First make sure you have Docker installed: https://docs.docker.com/get-docker/

# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
curl -O 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/docker-compose.yml'
docker-compose run archivebox init
docker-compose run archivebox --version

# start the webserver and open the UI (optional)
docker-compose run archivebox manage createsuperuser
docker-compose up -d
open 'http://127.0.0.1:8000'

# you can also add links and manage your archive via the CLI:
docker-compose run archivebox add 'https://example.com'
echo 'https://example.com' | docker-compose run -T archivebox add
docker-compose run archivebox status
docker-compose run archivebox help  # to see more options

# when passing stdin/stdout via the cli, use the -T flag
echo 'https://example.com' | docker-compose run -T archivebox add
docker-compose run -T archivebox list --html --with-headers > index.html

This is the recommended way to run ArchiveBox because it includes all the extractors like:
chrome, wget, youtube-dl, git, etc., full-text search w/ sonic, and many other great features.

Get ArchiveBox with docker on any platform

First make sure you have Docker installed: https://docs.docker.com/get-docker/

# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
docker run -v $PWD:/data -it archivebox/archivebox init
docker run -v $PWD:/data -it archivebox/archivebox --version

# start the webserver and open the UI (optional)
docker run -v $PWD:/data -it -p 8000:8000 archivebox/archivebox server --createsuperuser 0.0.0.0:8000
open http://127.0.0.1:8000

# you can also add links and manage your archive via the CLI:
docker run -v $PWD:/data -it archivebox/archivebox add 'https://example.com'
docker run -v $PWD:/data -it archivebox/archivebox status
docker run -v $PWD:/data -it archivebox/archivebox help  # to see more options

# when passing stdin/stdout via the cli, use only -i (not -it)
echo 'https://example.com' | docker run -v $PWD:/data -i archivebox/archivebox add
docker run -v $PWD:/data -i archivebox/archivebox list --html --with-headers > index.html
Get ArchiveBox with apt on Ubuntu/Debian

This method should work on all Ubuntu/Debian based systems, including x86, amd64, arm7, and arm8 CPUs (e.g. Raspberry Pis >=3).

If you're on Ubuntu >= 20.04, add the apt repository with add-apt-repository:

(on other Ubuntu/Debian-based systems follow the ♰ instructions below)

# add the repo to your sources and install the archivebox package using apt
sudo apt install software-properties-common
sudo add-apt-repository -u ppa:archivebox/archivebox
sudo apt install archivebox
# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
npm install --prefix . 'git+https://github.com/ArchiveBox/ArchiveBox.git'
archivebox init
archivebox --version

# start the webserver and open the web UI (optional)
archivebox server --createsuperuser 0.0.0.0:8000
open http://127.0.0.1:8000

# you can also add URLs and manage the archive via the CLI and filesystem:
archivebox add 'https://example.com'
archivebox status
archivebox list --html --with-headers > index.html
archivebox list --json --with-headers > index.json
archivebox help  # to see more options

♰ On other Ubuntu/Debian-based systems, add these sources directly to /etc/apt/sources.list.d/:

echo "deb http://ppa.launchpad.net/archivebox/archivebox/ubuntu focal main" | sudo tee /etc/apt/sources.list.d/archivebox.list
echo "deb-src http://ppa.launchpad.net/archivebox/archivebox/ubuntu focal main" | sudo tee -a /etc/apt/sources.list.d/archivebox.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C258F79DCC02E369
sudo apt update
sudo apt install archivebox
sudo snap install chromium
archivebox --version
# then scroll back up and continue the initialization instructions above

(you may need to install some other dependencies manually however)

Get ArchiveBox with brew on macOS

First make sure you have Homebrew installed: https://brew.sh/#install

# install the archivebox package using homebrew
brew install archivebox/archivebox/archivebox

# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
npm install --prefix . 'git+https://github.com/ArchiveBox/ArchiveBox.git'
archivebox init
archivebox --version

# start the webserver and open the web UI (optional)
archivebox server --createsuperuser 0.0.0.0:8000
open http://127.0.0.1:8000

# you can also add URLs and manage the archive via the CLI and filesystem:
archivebox add 'https://example.com'
archivebox status
archivebox list --html --with-headers > index.html
archivebox list --json --with-headers > index.json
archivebox help  # to see more options
Get ArchiveBox with pip on any platform

First make sure you have Python >= 3.7 installed: https://realpython.com/installing-python/

# install the archivebox package using pip3
pip3 install archivebox

# create a new empty directory and initialize your collection (can be anywhere)
mkdir ~/archivebox && cd ~/archivebox
npm install --prefix . 'git+https://github.com/ArchiveBox/ArchiveBox.git'
archivebox init
archivebox --version
# Install any missing extras like wget/git/chrome/etc. manually as needed

# start the webserver and open the web UI (optional)
archivebox server --createsuperuser 0.0.0.0:8000
open http://127.0.0.1:8000

# you can also add URLs and manage the archive via the CLI and filesystem:
archivebox add 'https://example.com'
archivebox status
archivebox list --html --with-headers > index.html
archivebox list --json --with-headers > index.json
archivebox help  # to see more options

No matter which install method you choose, they all roughly follow this 3-step process and all provide the same CLI, Web UI, and on-disk data format.

  1. Install ArchiveBox: apt/brew/pip3 install archivebox
  2. Start a collection: archivebox init
  3. Start archiving: archivebox add 'https://example.com'




DEMO: https://archivebox.zervice.io
Quickstart | Usage | Configuration

Key Features




Input formats

ArchiveBox supports many input formats for URLs, including Pocket & Pinboard exports, Browser bookmarks, Browser history, plain text, HTML, markdown, and more!

echo 'http://example.com' | archivebox add
archivebox add 'https://example.com/some/page'
archivebox add < ~/Downloads/firefox_bookmarks_export.html
archivebox add < any_text_with_urls_in_it.txt
archivebox add --depth=1 'https://example.com/some/downloads.html'
archivebox add --depth=1 'https://news.ycombinator.com#2020-12-12'

# (if using docker add -i when passing via stdin)
echo 'https://example.com' | docker run -v $PWD:/data -i archivebox/archivebox add

# (if using docker-compose add -T when passing via stdin)
echo 'https://example.com' | docker-compose run -T archivebox add

See the Usage: CLI page for documentation and examples.

It also includes a built-in scheduled import feature (archivebox schedule) and a browser bookmarklet, so you can pull in URLs from RSS feeds, websites, or the filesystem regularly or on demand.
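To illustrate how the plain-text input format works conceptually, here is a rough sketch of pulling URLs out of arbitrary text (ArchiveBox's real parsers are far more thorough; this is not its actual implementation):

```python
import re

def extract_urls(text):
    """Return de-duplicated URLs found anywhere in a blob of text.

    Naive sketch: grab http(s) runs and trim trailing punctuation.
    """
    urls = []
    for match in re.findall(r"https?://\S+", text):
        url = match.rstrip(".,;)]>\"'")
        if url and url not in urls:
            urls.append(url)
    return urls
```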

Output formats

All of ArchiveBox's state (including the index, snapshot data, and config file) is stored in a single folder called the "ArchiveBox data folder". All archivebox CLI commands must be run from inside this folder, and you first create it by running archivebox init.

The on-disk layout is optimized to be easy to browse by hand and durable long-term. The main index is a standard sqlite3 database (it can also be exported as static JSON/HTML), and the archive snapshots are organized by date-added timestamp in the archive/ subfolder. Each snapshot subfolder includes a static JSON and HTML index describing its contents, and the snapshot extractor outputs are plain files within the folder (e.g. media/example.mp4, git/somerepo.git, static/someimage.png, etc.).

# to browse your index statically without running the archivebox server, run:
archivebox list --html --with-headers > index.html
archivebox list --json --with-headers > index.json
# if running these commands with docker-compose, add -T:
# docker-compose run -T archivebox list ...

# then open the static index in a browser
open index.html

# or browse the snapshots via filesystem directly
ls ./archive/<timestamp>/
  • Index: index.html & index.json HTML and JSON index files containing metadata and details
  • Title, Favicon, Headers Response headers, site favicon, and parsed site title
  • Wget Clone: example.com/page-name.html wget clone of the site with warc/<timestamp>.gz
  • Chrome Headless
    • SingleFile: singlefile.html HTML snapshot rendered with headless Chrome using SingleFile
    • PDF: output.pdf Printed PDF of site using headless chrome
    • Screenshot: screenshot.png 1440x900 screenshot of site using headless chrome
    • DOM Dump: output.html DOM Dump of the HTML after rendering using headless chrome
    • Readability: article.html/json Article text extraction using Readability
  • Archive.org Permalink: archive.org.txt A link to the saved site on archive.org
  • Audio & Video: media/ all audio/video files + playlists, including subtitles & metadata with youtube-dl
  • Source Code: git/ clone of any repository found on github, bitbucket, or gitlab links
  • More coming soon! See the Roadmap...
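Because every snapshot folder carries its own static index.json, the whole archive can be inspected with a few lines of stdlib Python (a sketch assuming the ./archive/&lt;timestamp&gt;/ layout described above):

```python
import json
from pathlib import Path

def iter_snapshot_indexes(archive_dir="./archive"):
    """Yield (timestamp, parsed index.json) for each ./archive/<timestamp>/ folder."""
    for folder in sorted(Path(archive_dir).iterdir()):
        index_file = folder / "index.json"
        if folder.is_dir() and index_file.exists():
            yield folder.name, json.loads(index_file.read_text())
```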

It does everything out-of-the-box by default, but you can disable or tweak individual archive methods via environment variables or config file.

archivebox config --set SAVE_ARCHIVE_DOT_ORG=False
archivebox config --set YOUTUBEDL_ARGS='--max-filesize=500m'
archivebox config --help



Dependencies

You don't need to install every dependency; ArchiveBox automatically enables the relevant modules based on whatever you have available. That said, it's recommended to use the official Docker image, which comes with everything preinstalled.

If you so choose, you can also install ArchiveBox and its dependencies directly on any Linux or macOS systems using the system package manager or by running the automated setup script.

ArchiveBox is written in Python 3, so it requires python3 and pip3 to be available on your system. It also uses a set of optional but highly recommended external dependencies for archiving sites: wget (for plain HTML, static files, and WARC saving), chromium (for screenshots, PDFs, JS execution, and more), youtube-dl (for audio and video), git (for cloning git repos), nodejs (for readability and singlefile), and more.
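The autodetection amounts to checking which of these binaries are on your $PATH, roughly like this (the binary names listed here are illustrative; ArchiveBox's real checks also probe versions and config overrides):

```python
import shutil

# illustrative subset of the optional binaries ArchiveBox can use
OPTIONAL_DEPS = ["wget", "curl", "git", "node", "youtube-dl", "chromium"]

def detect_deps(binaries=OPTIONAL_DEPS):
    """Map each binary name to its resolved path, or None if not on $PATH."""
    return {name: shutil.which(name) for name in binaries}
```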




Caveats

If you're importing URLs containing secret slugs or pages with private content (e.g Google Docs, CodiMD notepads, etc), you may want to disable some of the extractor modules to avoid leaking private URLs to 3rd party APIs during the archiving process.

# don't do this:
archivebox add 'https://docs.google.com/document/d/12345somelongsecrethere'
archivebox add 'https://example.com/any/url/you/want/to/keep/secret/'

# ...without first disabling the extractors that share the URL with 3rd-party APIs:
archivebox config --set SAVE_ARCHIVE_DOT_ORG=False   # disable saving all URLs in Archive.org
archivebox config --set SAVE_FAVICON=False      # optional: only the domain is leaked, not full URL
archivebox config --set CHROME_BINARY=chromium  # optional: switch to chromium to avoid Chrome phoning home to Google

Be aware that malicious JS in an archived page can also read the contents of other pages in your archive, because the snapshots' CSRF and XSS protections are imperfect. See the Security Overview page for more details.

# visiting an archived page with malicious JS:
https://127.0.0.1:8000/archive/1602401954/example.com/index.html

# example.com/index.js can now make a request to read everything:
https://127.0.0.1:8000/index.html
https://127.0.0.1:8000/archive/*
# then example.com/index.js can send it off to some evil server

Support for saving multiple snapshots of each site over time will be added soon (along with the ability to view diffs of the changes between runs). For now, ArchiveBox is designed to archive each URL with each extractor type only once. A workaround to take multiple snapshots of the same URL is to make each one unique by appending a hash fragment:

archivebox add 'https://example.com#2020-10-24'
...
archivebox add 'https://example.com#2020-10-25'
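The same workaround can be scripted for a batch of URLs. This is a sketch; dated_variant is a hypothetical helper for illustration, not part of ArchiveBox:

```python
from datetime import date

def dated_variant(url, on=None):
    """Append a date fragment so ArchiveBox treats the URL as distinct.

    The #YYYY-MM-DD fragment is ignored by servers when fetching the page,
    but makes the URL unique in the index, sidestepping the
    one-snapshot-per-URL behavior described above.
    """
    on = on or date.today()
    return f"{url}#{on.isoformat()}"
```

You could then pipe the output of this helper into archivebox add on a schedule to accumulate dated snapshots.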



Screenshots

(screenshots: brew install archivebox, archivebox version, archivebox init, archivebox add, the data directory, and the server UI's add, list, and detail pages)




Background & Motivation

Vast treasure troves of knowledge are lost every day on the internet to link rot. As a society, we have an imperative to preserve some important parts of that treasure, just like we preserve our books, paintings, and music in physical libraries long after the originals go out of print or fade into obscurity.

Whether it's to resist censorship by saving articles before they get taken down or edited, or just to save a collection of early 2010s Flash games you love to play, having the tools to archive internet content enables you to save the stuff you care most about before it disappears.


Image from WTF is Link Rot?...

The balance between the permanence and ephemeral nature of content on the internet is part of what makes it beautiful. I don't think everything should be preserved in an automated fashion, making all content permanent and never removable, but I do think people should be able to decide for themselves and effectively archive specific content that they care about.

Because modern websites are complicated and often rely on dynamic content, ArchiveBox archives the sites in several different formats beyond what public archiving services like Archive.org and Archive.is are capable of saving. Using multiple methods and the market-dominant browser to execute JS ensures we can save even the most complex, finicky websites in at least a few high-quality, long-term data formats.

All the archived links are stored by date bookmarked in ./archive/<timestamp>, and everything is indexed nicely with JSON & HTML files. The intent is for all the content to be viewable with common software in 50 - 100 years without needing to run ArchiveBox in a VM.

Comparison to Other Projects

โ–ถ Check out our community page for an index of web archiving initiatives and projects.

The aim of ArchiveBox is to go beyond what the Wayback Machine and other public archiving services can do, by adding a headless browser to replay sessions accurately, and by automatically extracting all the content in multiple redundant formats that will survive being passed down to historians and archivists through many generations.

User Interface & Intended Purpose

ArchiveBox differentiates itself from similar projects by being a simple, one-shot CLI interface for users to ingest bulk feeds of URLs over extended periods, as opposed to being a backend service that ingests individual, manually-submitted URLs from a web UI. That said, you can also add URLs via the web interface provided by the Django frontend.

Private Local Archives vs Centralized Public Archives

Unlike crawler software that starts from a seed URL and works outwards, or public tools like Archive.org designed for users to manually submit links from the public internet, ArchiveBox tries to be a set-and-forget archiver suitable for archiving your entire browsing history, RSS feeds, or bookmarks, including private/authenticated content that you wouldn't otherwise share with a centralized service (do not do this until v0.5 is released with some security fixes). Also, by having each user store their own content locally, we can save much larger portions of everyone's browsing history than a shared centralized service would be able to handle.

Storage Requirements

Because ArchiveBox is designed to ingest a firehose of browser history and bookmark feeds to a local disk, it can be much more disk-space intensive than a centralized service like the Internet Archive or Archive.today. However, as storage space gets cheaper and compression improves, you should be able to use it continuously over the years without having to delete anything. In my experience, ArchiveBox uses about 5GB per 1000 articles, but your mileage may vary depending on which options you have enabled and what types of sites you're archiving. By default, it archives everything in as many formats as possible, meaning it takes more space than using a single method, but more content is accurately replayable over extended periods of time. Storage requirements can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by setting SAVE_MEDIA=False to skip audio & video files.
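As a back-of-the-envelope calculation using the ~5GB-per-1000-articles figure above (the media fraction below is an illustrative guess, not a measured number):

```python
def estimate_storage_gb(num_snapshots, gb_per_1000=5.0, save_media=True,
                        media_fraction=0.75):
    """Rough storage estimate for an ArchiveBox collection.

    gb_per_1000 comes from the ~5GB/1000-articles figure in the text;
    media_fraction (share of space taken by audio/video) is an
    illustrative assumption, not a measured value.
    """
    total = num_snapshots / 1000 * gb_per_1000
    if not save_media:
        total *= (1 - media_fraction)
    return total
```

So archiving ~10,000 pages lands around 50GB with media enabled, far less with SAVE_MEDIA=False.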



Learn more

Whether you want to learn which organizations are the big players in the web archiving space, want to find a specific open-source tool for your web archiving need, or just want to see where archivists hang out online, our Community Wiki page serves as an index of the broader web archiving community. Check it out to learn about some of the coolest web archiving projects and communities on the web!




Documentation

We use the Github wiki system and Read the Docs (WIP) for documentation.

You can also access the docs locally by looking in the ArchiveBox/docs/ folder.

Getting Started

Reference

More Info




ArchiveBox Development

All contributions to ArchiveBox are welcome! Check our issues and Roadmap for things to work on, and please open an issue to discuss your proposed implementation before starting work. Otherwise we may have to close your PR if it doesn't align with our roadmap.

Low hanging fruit / easy first tickets:

Setup the dev environment

1. Clone the main code repo (making sure to pull the submodules as well)

git clone --recurse-submodules https://github.com/ArchiveBox/ArchiveBox
cd ArchiveBox
git checkout dev  # or the branch you want to test
git submodule update --init --recursive
git pull --recurse-submodules

2. Option A: Install the Python, JS, and system dependencies directly on your machine

# Install ArchiveBox + python dependencies
python3 -m venv .venv && source .venv/bin/activate && pip install -e '.[dev]'
# or: pipenv install --dev && pipenv shell

# Install node dependencies
npm install

# Check to see if anything is missing
archivebox --version
# install any missing dependencies manually, or use the helper script:
./bin/setup.sh

2. Option B: Build the docker container and use that for development instead

# Optional: develop via docker by mounting the code dir into the container
# if you edit e.g. ./archivebox/core/models.py on the docker host, runserver
# inside the container will reload and pick up your changes
docker build . -t archivebox
docker run -it --rm archivebox version
docker run -it --rm -p 8000:8000 \
    -v $PWD/data:/data \
    -v $PWD/archivebox:/app/archivebox \
    archivebox server 0.0.0.0:8000 --debug --reload

Common development tasks

See the ./bin/ folder and read the source of the bash scripts within. You can also run all these in Docker. For more examples see the Github Actions CI/CD tests that are run: .github/workflows/*.yaml.

Run in DEBUG mode

archivebox config --set DEBUG=True
# or
archivebox server --debug ...

Build and run a Github branch

docker build -t archivebox:dev https://github.com/ArchiveBox/ArchiveBox.git#dev
docker run -it -v $PWD:/data archivebox:dev ...

Run the linters

./bin/lint.sh

(uses flake8 and mypy)

Run the integration tests

./bin/test.sh

(uses pytest -s)

Make migrations or enter a django shell

Make sure to run this whenever you change things in models.py.

cd archivebox/
./manage.py makemigrations

cd path/to/test/data/
archivebox shell
archivebox manage dbshell


Build the docs, pip package, and docker image

(Normally CI takes care of this, but these scripts can be run to do it manually)

./bin/build.sh

# or individually:
./bin/build_docs.sh
./bin/build_pip.sh
./bin/build_deb.sh
./bin/build_brew.sh
./bin/build_docker.sh

Roll a release

(Normally CI takes care of this, but these scripts can be run to do it manually)

./bin/release.sh

# or individually:
./bin/release_docs.sh
./bin/release_pip.sh
./bin/release_deb.sh
./bin/release_brew.sh
./bin/release_docker.sh




This project is maintained mostly in my spare time with help from generous contributors and Monadical (✨ hire them for dev work!).


Sponsor us on Github




Issues
  • v0.4 (first Django release)


    The v0.4 Release

    A bunch of big changes:

    • pip install archivebox is now available
    • beginnings of transition to Django while maintaining a mostly backwards-compatible CLI
    • using argparse instead of hand-written CLI system: see archivebox/cli/archivebox.py
    • new subcommands-based CLI for archivebox (see below)

    For more info, see: https://github.com/pirate/ArchiveBox/wiki/Roadmap

    Released in this version:

    Install Methods:

    Note: apt, brew are now available as of v0.5

    Command Line Interface:

    Web UI:

    • ✅ / Main index
    • ✅ /add Page to add new links to the archive (but needs improvement)
    • ✅ /archive/<timestamp>/ Snapshot details page
    • ✅ /archive/<timestamp>/<url> live wget archive of page
    • ✅ /archive/<timestamp>/<extractor> get a specific extractor output for a given snapshot
    • ✅ /archive/<url> shortcut to view most recent snapshot of given url
    • ✅ /archive/<url_hash> shortcut to view most recent snapshot of given url
    • ✅ /admin Admin interface to view and edit archive data

    Python API:

    (Red โŒ features are still unfinished and will be released in later versions)

    opened by pirate 46
  • Error on Windows 10 when adding URL: UnicodeEncodeError: 'charmap' codec can't encode: character maps to <undefined>


    [i] [2021-03-27 04:40:48] ArchiveBox v0.5.4: archivebox add https://youtube.com/
        > E:\ArchiveBox
    
    [!] Warning: Missing 6 recommended dependencies
        ! WGET_BINARY: wget (unable to detect version)
        ! SINGLEFILE_BINARY: single-file (unable to detect version)
          Hint: npm install --prefix . "git+https://github.com/ArchiveBox/ArchiveBox.git"
                or archivebox config --set SAVE_SINGLEFILE=False to silence this warning
    
        ! READABILITY_BINARY: readability-extractor (unable to detect version)
          Hint: npm install --prefix . "git+https://github.com/ArchiveBox/ArchiveBox.git"
                or archivebox config --set SAVE_READABILITY=False to silence this warning
    
        ! MERCURY_BINARY: mercury-parser (unable to detect version)
          Hint: npm install --prefix . "git+https://github.com/ArchiveBox/ArchiveBox.git"
                or archivebox config --set SAVE_MERCURY=False to silence this warning
    
        ! CHROME_BINARY: unable to find binary (unable to detect version)
        ! RIPGREP_BINARY: rg (unable to detect version)
    
    [+] [2021-03-27 04:40:52] Adding 1 links to index (crawl depth=0)...
        > Saved verbatim input to sources/E:\ArchiveBox\sources\1616820052-import.txt
        > Parsed 1 URLs from input (Plain Text)
        > Found 1 new URLs not already in index
    
    [*] [2021-03-27 04:40:52] Writing 1 links to main index...
    √ E:\ArchiveBox\index.sqlite3
    
    [▶] [2021-03-27 04:40:52] Starting archiving of 1 snapshots in index...
        ! Failed to archive link: UnicodeEncodeError: 'charmap' codec can't encode character '\u25be' in position 9443: character maps to <undefined>
    
    Traceback (most recent call last):
      File "d:\python\lib\runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "d:\python\lib\runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "D:\Python\Scripts\archivebox.exe\__main__.py", line 7, in <module>
        from .cli import main
      File "d:\python\lib\site-packages\archivebox\cli\__init__.py", line 129, in main
        run_subcommand(
      File "d:\python\lib\site-packages\archivebox\cli\__init__.py", line 69, in run_subcommand
        module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
      File "d:\python\lib\site-packages\archivebox\cli\archivebox_add.py", line 85, in main
        add(
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\main.py", line 592, in add
        archive_links(new_links, overwrite=False, **archive_kwargs)
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\extractors\__init__.py", line 173, in archive_links
        archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\extractors\__init__.py", line 95, in archive_link
        write_link_details(link, out_dir=out_dir, skip_sql_index=False)
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\index\__init__.py", line 333, in write_link_details
        write_html_link_details(link, out_dir=out_dir)
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\index\html.py", line 79, in write_html_link_details
        atomic_write(str(Path(out_dir) / HTML_INDEX_FILENAME), rendered_html)
      File "d:\python\lib\site-packages\archivebox\util.py", line 112, in typechecked_function
        return func(*args, **kwargs)
      File "d:\python\lib\site-packages\archivebox\system.py", line 47, in atomic_write
        f.write(contents)
      File "d:\python\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u25be' in position 9443: character maps to <undefined>
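    The last frames show the root cause: on Windows, Python's open() defaults to the locale codepage (cp1252 here), so writing the snapshot's HTML index fails on any character outside it. A sketch of the usual fix (this atomic_write is a stand-in for illustration, not ArchiveBox's actual implementation): always pass an explicit encoding when writing text files.

```python
import os
import tempfile

def atomic_write(path, contents):
    """Write text to `path` atomically, forcing UTF-8 so the platform's
    default locale codepage (e.g. cp1252 on Windows) can't raise
    UnicodeEncodeError on characters like '\u25be'."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(contents)
        os.replace(tmp, path)  # atomic on both POSIX and Windows
    except BaseException:
        os.unlink(tmp)
        raise
```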
    
    is: bug difficulty: easy 
    opened by Leontking 41
  • Bugfix: docker-compose instructions create a sonic container that fails to start


    Describe the bug

    I followed the docker-compose instructions from the README. This is the result:

    [[email protected] archivebox]# docker-compose ps
             Name                        Command                State             Ports
    --------------------------------------------------------------------------------------------
    archivebox_archivebox_1   dumb-init -- /app/bin/dock ...   Up         0.0.0.0:8000->8000/tcp
    archivebox_sonic_1        sonic -c /etc/sonic.cfg          Exit 101
    
    [[email protected] archivebox]# docker-compose logs sonic
    Attaching to archivebox_sonic_1
    sonic_1       | thread 'main' panicked at 'cannot read config file: Os { code: 21, kind: Other, message: "Is a directory" }', src/config/reader.rs:24:14
    sonic_1       | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    sonic_1       | thread 'main' panicked at 'cannot read config file: Os { code: 21, kind: Other, message: "Is a directory" }', src/config/reader.rs:24:14
    sonic_1       | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    

    Search seems to work anyway.

    I would expect one of:

    a. sonic container is not created by default if it requires the user to manually create a config and is not necessary to run ArchiveBox
    b. config.cfg is created for me by the init script, using the environment variable I set in the docker-compose file
    c. config.cfg is not required by sonic (however, this is not the case: https://github.com/valeriansaliou/sonic/issues/197)

    Steps to reproduce

    From the README:

    # create a new empty directory and initalize your collection (can be anywhere)
    mkdir ~/archivebox && cd ~/archivebox
    curl -O https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/docker-compose.yml
    docker-compose run archivebox init
    docker-compose run archivebox --version
    
    # start the webserver and open the UI (optional)
    docker-compose run archivebox manage createsuperuser
    docker-compose up -d
    open http://127.0.0.1:8000
    
    # you can also add links and manage your archive via the CLI:
    docker-compose run archivebox add 'https://example.com'
    docker-compose run archivebox status
    docker-compose run archivebox help  # to see more options
    

    ArchiveBox version

    [[email protected] archivebox]# docker-compose run archivebox --version
    Starting archivebox_sonic_1 ... done
    Creating archivebox_archivebox_run ... done
    ArchiveBox v0.5.3
    Cpython Linux Linux-5.9.1-arch1-1-x86_64-with-glibc2.28 x86_64 (in Docker)
    
    [i] Dependency versions:
     √  ARCHIVEBOX_BINARY     v0.5.3          valid     /usr/local/bin/archivebox
     √  PYTHON_BINARY         v3.9.1          valid     /usr/local/bin/python3.9
     √  DJANGO_BINARY         v3.1.3          valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
     √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
     √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget
     √  NODE_BINARY           v15.5.1         valid     /usr/bin/node
     √  SINGLEFILE_BINARY     v0.1.14         valid     /node/node_modules/single-file/cli/single-file
     √  READABILITY_BINARY    v0.1.0          valid     /node/node_modules/readability-extractor/readability-extractor
     √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
     √  GIT_BINARY            v2.20.1         valid     /usr/bin/git
     √  YOUTUBEDL_BINARY      v2021.01.03     valid     /usr/local/bin/youtube-dl
     √  CHROME_BINARY         v87.0.4280.88   valid     /usr/bin/chromium
     √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg

    [i] Source-code locations:
     √  PACKAGE_DIR           22 files        valid     /app/archivebox
     √  TEMPLATES_DIR         3 files         valid     /app/archivebox/themes

    [i] Secrets locations:
     -  CHROME_USER_DATA_DIR  -               disabled
     -  COOKIES_FILE          -               disabled

    [i] Data locations:
     √  OUTPUT_DIR            6 files         valid     /data
     √  SOURCES_DIR           1 files         valid     ./sources
     √  LOGS_DIR              0 files         valid     ./logs
     √  ARCHIVE_DIR           1 files         valid     ./archive
     √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
     √  SQL_INDEX             204.0 KB        valid     ./index.sqlite3
    
    [[email protected] archivebox]# docker version
    Client:
     Version:           20.10.2
     API version:       1.40
     Go version:        go1.15.6
     Git commit:        2291f610ae
     Built:             Tue Jan  5 19:56:21 2021
     OS/Arch:           linux/amd64
     Context:           default
     Experimental:      true
    
    Server:
     Engine:
      Version:          19.03.13-ce
      API version:      1.40 (minimum version 1.12)
      Go version:       go1.15.2
      Git commit:       4484c46d9d
      Built:            Sat Sep 26 12:03:35 2020
      OS/Arch:          linux/amd64
      Experimental:     false
     containerd:
      Version:          v1.4.1.m
      GitCommit:        c623d1b36f09f8ef6536a057bd658b3aa8632828.m
     runc:
      Version:          1.0.0-rc92
      GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
     docker-init:
      Version:          0.19.0
      GitCommit:        de40ad0
    
    [[email protected] archivebox]# docker-compose version
    docker-compose version 1.27.4, build 40524192
    docker-py version: 4.3.1
    CPython version: 3.7.7
    OpenSSL version: OpenSSL 1.1.0l  10 Sep 2019
    
    is: bug difficulty: easy status: done touches: documentation 
    opened by JohnMaguire 28
  • Question: ... How to fix Permission denied: '/data'

    Question: ... How to fix Permission denied: '/data'

    I'm following the setup instructions using docker-compose.

    When I run docker-compose run archivebox init I get

    [i] [2020-11-16 13:38:31] ArchiveBox v0.4.21: archivebox init
        > /data
    
    Traceback (most recent call last):
      File "/usr/local/bin/archivebox", line 33, in <module>
        sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
      File "/app/archivebox/cli/__init__.py", line 123, in main
        run_subcommand(
      File "/app/archivebox/cli/__init__.py", line 63, in run_subcommand
        module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
      File "/app/archivebox/cli/archivebox_init.py", line 33, in main
        init(
      File "/app/archivebox/util.py", line 113, in typechecked_function
        return func(*args, **kwargs)
      File "/app/archivebox/main.py", line 259, in init
        is_empty = not len(set(os.listdir(out_dir)) - ALLOWED_IN_OUTPUT_DIR)
    PermissionError: [Errno 13] Permission denied: '/data'
    

    How can I fix this, please?
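The traceback shows `os.listdir('/data')` failing, i.e. the user inside the container cannot read the mounted data directory. A quick host-side check expressing the same requirement (a sketch; on a real setup run it as, or compare against, the UID/GID the container actually runs as):

```python
import os

def data_dir_usable(path):
    """True if the current user can list, write, and traverse the
    directory -- the minimum `archivebox init` needs from /data."""
    return os.path.isdir(path) and os.access(path, os.R_OK | os.W_OK | os.X_OK)

print(data_dir_usable("./data"))
```

If this is False for the container's user, `chown -R` the host directory to that UID/GID (or relax its permissions) and re-run `docker-compose run archivebox init`.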

    is: bug touches: config difficulty: easy status: done 
    opened by Prn-Ice 27
  • RSS parser falls back to full-text and imports unneeded URLs from metadata fields

    RSS parser falls back to full-text and imports unneeded URLs from metadata fields

    It looks like Shaarli feeds are not being parsed correctly and markup is being included in the link structure (much like ticket 134 for pocket). Also, it looks like shaarli detail and tag pages are being parsed as source links, making the import much slower and leading to clutter in the archive.

    You can use the public shaarli demo to reproduce this.

    There's a demo (U: demo / PW: demo) running on https://demo.shaarli.org/.

    1. Add any link to this instance.

    The Atom feed then looks like this, e.g. with just one link (this is what's being parsed as the input file):

    <?xml  version="1.0" encoding="UTF-8" ?>
    <feed xmlns="http://www.w3.org/2005/Atom">
      <title>Shaarli demo (master)</title>
      <subtitle>Shaared links</subtitle>
      
        <updated>2019-01-30T06:06:01+00:00</updated>
      
      <link rel="self" href="https://demo.shaarli.org/?do=atom" />
      
      <author>
        <name>https://demo.shaarli.org/</name>
        <uri>https://demo.shaarli.org/</uri>
      </author>
      <id>https://demo.shaarli.org/</id>
      <generator>Shaarli</generator>
      
        <entry>
          <title>Aktuelle Trojaner-Welle: Emotet lauert in gefälschten Rechnungsmails | heise online</title>
          
            <link href="https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html" />
          
          <id>https://demo.shaarli.org/?cEV4vw</id>
          
            <published>2019-01-30T06:06:01+00:00</published>
            <updated>2019-01-30T06:06:01+00:00</updated>
          
          <content type="html" xml:lang="en"><![CDATA[<div class="markdown"><p>&#8212; <a href="https://demo.shaarli.org/?cEV4vw">Permalink</a></p></div>]]></content>
          
          
        </entry>
      
    </feed>
    

    Note that ArchiveBox wants to include 8 links from this:

    Adding 8 new links from /data/sources/demo.shaarli.org-1548828643.txt to /data/index.json
    

    Most likely this is because 8 instances of http:// were found (that's just my speculation). However, the expected behaviour is that only the source link is parsed and added, not Shaarli detail pages like https://demo.shaarli.org/?cEV4vw that contain nothing but the actual link to the source (again). IMO that doesn't make sense. It's even worse if a link has tags, because every tag then leads to another URL being crawled.
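The expected behaviour described above can be sketched with the stdlib XML parser: walk only `<entry>` elements and collect their `<link href>` attributes, so metadata like `<id>`, `<uri>`, and `<name>` never become snapshot URLs. This is a sketch of the idea, not ArchiveBox's actual parser:

```python
import xml.etree.ElementTree as ET

# Atom elements are namespaced; ElementTree needs the full qualified name
ATOM = "{http://www.w3.org/2005/Atom}"

def entry_links(atom_xml):
    """Extract only the real target URLs from an Atom feed string."""
    root = ET.fromstring(atom_xml)
    urls = []
    for entry in root.iter(ATOM + "entry"):       # only per-entry content
        for link in entry.iter(ATOM + "link"):    # only real <link> elements
            href = link.get("href")
            if href:
                urls.append(href)
    return urls
```

Against the one-entry demo feed above, this should yield just the single heise.de article URL instead of 8 speculative ones.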

    1. Grab the Atom Feed https://demo.shaarli.org/?do=atom and import to ArchiveBox: docker-compose exec archivebox /bin/archive https://demo.shaarli.org/?do=atom
    2. You will see that markup fragments end up in the parser:
    [email protected]:/volume1/docker/ArchiveBox/ArchiveBox-master# docker-compose exec archivebox /bin/archive https://demo.shaarli.org/?do=atom
    [*] [2019-01-30 06:10:43] Downloading https://demo.shaarli.org/?do=atom > /data/sources/demo.shaarli.org-1548828643.txt
    [+] [2019-01-30 06:11:02] Adding 8 new links from /data/sources/demo.shaarli.org-1548828643.txt to /data/index.json
    [√] [2019-01-30 06:11:18] Updated main index files:
        > /data/index.json
        > /data/index.html
    [▶] [2019-01-30 06:11:18] Updating files for 8 links in archive...
    [+] [2019-01-30 06:11:27] "Aktuelle Trojaner-Welle: Emotet lauert in gefälschten Rechnungsmails | heise online - Shaarli demo (master)"
        https://demo.shaarli.org/?cEV4vw
        > /data/archive/1548828660 (new)
          > favicon
          > wget
            Got wget response code 8:
              Total wall clock time: 5.1s
              Downloaded: 20 files, 1.1M in 0.7s (1.54 MB/s)
            Some resources were skipped: 404 Not Found
            Run to see full output:
                cd /data/archive/1548828660;
                wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828689 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/?cEV4vw
          > pdf
          > screenshot
          > dom
          > archive_org
          > git
          > media
          √ index.json
          √ index.html
    [+] [2019-01-30 06:11:50] "Aktuelle Trojaner-Welle: Emotet lauert in gefälschten Rechnungsmails | heise online - Shaarli demo (master)"
        https://demo.shaarli.org/?cEV4vw</id>
        > /data/archive/1548828659 (new)
          > favicon
          > wget
            Got wget response code 8:
              Total wall clock time: 5.1s
              Downloaded: 20 files, 1.1M in 0.7s (1.54 MB/s)
            Some resources were skipped: 404 Not Found
            Run to see full output:
                cd /data/archive/1548828659;
                wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828710 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/?cEV4vw</id>
          > pdf
          > screenshot
          > dom
          > archive_org
            Failed: Exception BadQueryException: Illegal character in query at index 32: https://demo.shaarli.org/?cEV4vw</id>
            Run to see full output:
                curl --location --head --max-time 60 --get https://web.archive.org/save/https://demo.shaarli.org/?cEV4vw</id>
          > git
          > media
          √ index.json
          √ index.html
    [+] [2019-01-30 06:12:10] "comments_outline_white"
        https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html
        > /data/archive/1548828658 (new)
          > favicon
          > wget
            Got wget response code 4:
              Total wall clock time: 38s
              Downloaded: 128 files, 6.0M in 12s (502 KB/s)
            Some resources were skipped: Got an error from the server
            Run to see full output:
                cd /data/archive/1548828658;
                wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828730 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html
          > pdf
          > screenshot
          > dom
          > archive_org
          > git
          > media
            got youtubedl response code 1:
    b'ERROR: Unable to extract container ID; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.\n'
            Failed: Exception Failed to download media
            Run to see full output:
                cd /data/archive/1548828658;
                youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs -x -k --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html
          √ index.json
          √ index.html
    [+] [2019-01-30 06:13:06] "https://demo.shaarli.org/</id>"
        https://demo.shaarli.org/</id>
        > /data/archive/1548828657 (new)
          > favicon
          > wget
            Got wget response code 8:
              https://demo.shaarli.org/%3C/id%3E:
              2019-01-30 06:13:07 ERROR 404: Not Found.
            Some resources were skipped: 404 Not Found
            Run to see full output:
                cd /data/archive/1548828657;
                wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828786 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/</id>
          > pdf
          > screenshot
          > dom
          > archive_org
            Failed: Exception BadQueryException: Illegal character in path at index 25: https://demo.shaarli.org/</id>
            Run to see full output:
                curl --location --head --max-time 60 --get https://web.archive.org/save/https://demo.shaarli.org/</id>
          > git
          > media
            got youtubedl response code 1:
    b"WARNING: Could not send HEAD request to https://demo.shaarli.org/</id>: HTTP Error 404: Not Found\nERROR: Unable to download webpage: HTTP Error 404: Not Found (caused by <HTTPError 404: 'Not Found'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.\n"
            Failed: Exception Failed to download media
            Run to see full output:
                cd /data/archive/1548828657;
                youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs -x -k --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://demo.shaarli.org/</id>
          √ index.json
          √ index.html
    [+] [2019-01-30 06:13:16] "https://demo.shaarli.org/</uri>"
        https://demo.shaarli.org/</uri>
        > /data/archive/1548828656 (new)
          > favicon
          > wget
            Got wget response code 8:
              https://demo.shaarli.org/%3C/uri%3E:
              2019-01-30 06:13:17 ERROR 404: Not Found.
            Some resources were skipped: 404 Not Found
            Run to see full output:
                cd /data/archive/1548828656;
                wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828796 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/</uri>
          > pdf
          > screenshot
          > dom
          > archive_org
            Failed: Exception BadQueryException: Illegal character in path at index 25: https://demo.shaarli.org/</uri>
            Run to see full output:
                curl --location --head --max-time 60 --get https://web.archive.org/save/https://demo.shaarli.org/</uri>
          > git
          > media
            got youtubedl response code 1:
    b"WARNING: Could not send HEAD request to https://demo.shaarli.org/</uri>: HTTP Error 404: Not Found\nERROR: Unable to download webpage: HTTP Error 404: Not Found (caused by <HTTPError 404: 'Not Found'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.\n"
            Failed: Exception Failed to download media
            Run to see full output:
                cd /data/archive/1548828656;
                youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs -x -k --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://demo.shaarli.org/</uri>
          √ index.json
          √ index.html
    [+] [2019-01-30 06:13:25] "Shaarli demo (master)"
        https://demo.shaarli.org/?do=atom
        > /data/archive/1548828655 (new)
          > favicon
          > wget
          > pdf
          > screenshot
          > dom
          > archive_org
          > git
          > media
          √ index.json
          √ index.html
    [+] [2019-01-30 06:13:36] "https://demo.shaarli.org/</name>"
        https://demo.shaarli.org/</name>
        > /data/archive/1548828655.0 (new)
          > favicon
          > wget
            Got wget response code 8:
              https://demo.shaarli.org/%3C/name%3E:
              2019-01-30 06:13:37 ERROR 404: Not Found.
            Some resources were skipped: 404 Not Found
            Run to see full output:
                cd /data/archive/1548828655.0;
                wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828816 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/</name>
          > pdf
          > screenshot
          > dom
          > archive_org
            Failed: Exception BadQueryException: Illegal character in path at index 25: https://demo.shaarli.org/</name>
            Run to see full output:
                curl --location --head --max-time 60 --get https://web.archive.org/save/https://demo.shaarli.org/</name>
          > git
          > media
            got youtubedl response code 1:
    b"WARNING: Could not send HEAD request to https://demo.shaarli.org/</name>: HTTP Error 404: Not Found\nERROR: Unable to download webpage: HTTP Error 404: Not Found (caused by <HTTPError 404: 'Not Found'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.\n"
            Failed: Exception Failed to download media
            Run to see full output:
                cd /data/archive/1548828655.0;
                youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs -x -k --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://demo.shaarli.org/</name>
          √ index.json
          √ index.html
    [+] [2019-01-30 06:13:45] "http://www.w3.org/2005/Atom"
        http://www.w3.org/2005/Atom
        > /data/archive/1548828644 (new)
          > favicon
          > wget
          > pdf
          > screenshot
          > dom
          > archive_org
            Failed: Exception LiveDocumentNotAvailableException: http://www.w3.org/2005/Atom: live document unavailable: java.net.SocketTimeoutException: Read timed out
            Run to see full output:
                curl --location --head --max-time 60 --get https://web.archive.org/save/http://www.w3.org/2005/Atom
          > git
          > media
          √ index.json
          √ index.html
    [√] [2019-01-30 06:15:28] Update of 8 links complete (4.17 min)
        - 8 entries skipped
        - 41 entries updated
        - 15 errors
    

    (note the </id> at the end of the links)

    is: bug status: needs followup difficulty: easy 
    opened by mawmawmawm 24
  • WIP: Create python package from repository

    WIP: Create python package from repository

    This will create a python package installable using pip.

    The package can be later published on pypi for easier access.

    Before merging I would squash everything into one commit if approved.

    Scripts

    The installation provides an archive command that is available from the shell and executes the archive.py script.
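For reference, a console script like this is usually declared via a setuptools entry point; a minimal sketch of the wiring (the `archiver.archive:main` target is a hypothetical module path, not necessarily this repo's actual layout):

```python
# This dict would be passed as setup(entry_points=...) in setup.py;
# pip then generates an `archive` executable that calls the named function.
ENTRY_POINTS = {
    "console_scripts": [
        "archive = archiver.archive:main",  # hypothetical module:function target
    ],
}

print(ENTRY_POINTS["console_scripts"][0])
```

With this in place, `pip install .` puts an `archive` command on the user's PATH instead of requiring them to invoke the script by file path.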

    Setup

    The important part is the setup.py file as it contains metadata and instructions for pip.

    I filled it in with the information I could find and it should be OK, but as you are the author, please review it.

    config.py

    As this file is considered editable by the user, maybe we should move it somewhere more suitable (e.g. ~/.config/bookmark-archiver/config.py) and access it at runtime.

    opened by edoput 23
  • Discussion: new name!

    Discussion: new name!

    Hey everyone! I have a big refactor in the works with some breaking changes, and I thought I'd take this opportunity to re-release BA with a better name and a 1.0 version. The new release modularizes BA into a python package, which lets people import individual parts for their own uses (e.g. parsers, link archiving, screenshotting, indexing). It fixes a lot of the bad decisions I made early on (e.g. using timestamps as unique keys instead of sha256 hashes of the URLs). It also adds a backend with a web GUI for searching and adding imports.

    The new name should be easy to find and type in a python packaging context and should be related to web archiving somehow.

    Requirements for a new name:

    • one word
    • no symbols or spaces (since it's going to be imported as a python package, like from webfreeze.pocket import parse_links)
    • should be 1st in google results when released with a new name (i.e. no competing projects/keywords)
    • should be intuitively related to web archiving

    Potential ideas:

    • WebFreeze
    • Freezekit
    • ArchiveKit
    • WebCooler

    Comment with your name suggestions/ideas!

    status: idea phase 
    opened by pirate 23
  • Link parsing: Pinboard private feeds don't seem to get parsed properly

    Link parsing: Pinboard private feeds don't seem to get parsed properly

    I would love to have the cron job that monitors my Pocket feed also monitor my private Pinboard feed. However, no matter which method I use to pass the feed to bookmark-archiver using the instructions, all have their own unique failure.

    If I pass a public feed, like http://feeds.pinboard.in/rss/u:username/, it works fine. But if I pass a private feed, like https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/, it errors out. I have tried the RSS, JSON, and Text feeds, and none work.

    Examples here (I've simply replaced the actual feed I used to test with the demo URL Pinboard provides):

    ./archive "https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/"

    [*] [2018-10-18 21:14:03] Downloading https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897243.txt
    [X] No links found :(
    

    ./archive "https://feeds.pinboard.in/json/secret:xxxx/u:username/private/"

    [*] [2018-10-18 21:13:46] Downloading https://feeds.pinboard.in/json/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897226.txt
    Traceback (most recent call last):
      File "./archive", line 161, in <module>
        links = merge_links(archive_path=out_dir, import_path=source)
      File "./archive", line 53, in merge_links
        raw_links = parse_links(import_path)
      File "/home/USERNAME/datahoarding/bookmark-archiver/archiver/parse.py", line 54, in parse_links
        links += list(parser_func(file))
      File "/home/USERNAME/bookmark-archiver/archiver/parse.py", line 108, in parse_json_export
        url = erg['url']
    KeyError: 'url'
    

    ./archive "https://feeds.pinboard.in/text/secret:xxxx/u:username/private/"

    [*] [2018-10-18 21:17:57] Downloading https://feeds.pinboard.in/text/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897477.txt
    [X] No links found :(
    

    Even though the script says that links are not found, they are definitely there, and simply pasting the URL into a browser outputs the feed in the proper format. I used this script successfully with other methods, like the Pinboard manual export, the Pocket manual export and RSS feed, and browser export. Is this just not a supported method for importing/monitoring?
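The KeyError in the JSON case comes from parse_json_export assuming every item has a 'url' key, while Pinboard's feeds use different key names (the private JSON feed appears to use 'u', exports use 'href'; these key names are my assumption about the formats). A tolerant parser can try several keys and skip items that have none:

```python
import json

def parse_json_links(text, keys=("url", "href", "u")):
    """Collect link URLs from a JSON list of bookmark objects,
    skipping items that carry no recognizable URL key."""
    links = []
    for item in json.loads(text):
        for key in keys:
            if key in item:
                links.append(item[key])
                break  # take the first matching key per item
    return links
```

This degrades gracefully instead of crashing the whole import when one feed format names its URL field differently.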

    is: bug status: needs followup 
    opened by drpfenderson 19
  • Full-text search

    Full-text search

    Summary

    This PR adds the ability to do full-text search 🎉

    Related issues

    #22 #24

    Changes these areas

    • [ ] Bugfixes
    • [x] Feature behavior
    • [ ] Command line interface
    • [ ] Configuration options
    • [x] Internal architecture
    • [ ] Snapshot data layout on disk
    opened by jdcaballerov 19
  • Running `archivebox init` via pip install on Windows 10 triggers

    Running `archivebox init` via pip install on Windows 10 triggers "File not found" error

    I'm on Windows 10. I tried to install ArchiveBox from pip, but after I ran npm install --prefix . 'git+https://github.com/ArchiveBox/ArchiveBox.git' and then archivebox init, it gave a "File Not Found" error.

    is: bug 
    opened by DUOLabs333 18
  • Question: Do we really need healthcheck?

    Question: Do we really need healthcheck?

    Today I noticed that my disk write rate had increased over the past months. When I investigated, I found that Docker was doing a large amount of disk writes, seemingly from regularly rewriting config.v2.json in the directory of the ArchiveBox container. When I inspected that file, I found logs for the healthcheck, and the container had been in an unhealthy state for a long time because I had changed the service port of the container in docker-compose.yml. I tried bringing it back to a healthy state by changing the port back, but the problem was still there, and the disk write rate I could see at that point was ~15MB per 30s. Eventually I disabled the healthcheck in docker-compose.yml and the problem stopped.

    There are two observations:

    • the unhealthy state had been there for a long time, and nothing ever brought that up to me
    • the healthcheck may produce large disk write

    So I wonder whether it's really meaningful to have the healthcheck at all. It doesn't seem to be useful that it can inform users if something goes wrong, and it seems to be wasting resources. Should we drop it? Or disable it by default in the provided docker-compose.yml?

    opened by upsuper 3
  • Question: How to run AB on localhost but store data on NAS?

    Question: How to run AB on localhost but store data on NAS?

    Hello!

    I'm using docker-compose. Following the title of this issue: I first tried changing the whole data dir to a mounted path pointing to my NAS, but I got this error:

      archivebox_1  | [X] OSError: Failed to write /data/ArchiveBox.conf with fcntl.F_FULLFSYNC. ([Errno 22] Invalid argument)
      archivebox_1  |     You can store the archive/ subfolder on a hard drive or network share that doesn't support support syncronous writes,
      archivebox_1  |     but the main folder containing the index.sqlite3 and ArchiveBox.conf files must be on a filesystem that supports FSYNC.
    

    I cannot figure out how to separate data/archive/ from the rest as I wish to store this on my NAS. I tried symlinking it but it just complains that data/archive/ already exists.

    I would prefer to have as much data as possible on my NAS, so any or all of data/{logs,sonic,sources}/ as well.

    How do I setup this with docker-compose?
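The OSError above is ArchiveBox probing whether the target filesystem supports synchronous writes (fcntl.F_FULLFSYNC on macOS, fsync elsewhere). You can run a similar probe against a candidate mount before pointing the data dir at it. This is a sketch of the idea using plain os.fsync, not ArchiveBox's exact check:

```python
import os

def supports_fsync(dirpath):
    """Write a probe file in dirpath and fsync it; returns False on
    filesystems/mounts that reject synchronous writes."""
    probe = os.path.join(dirpath, ".fsync_probe")
    try:
        fd = os.open(probe, os.O_WRONLY | os.O_CREAT)
        try:
            os.write(fd, b"probe")
            os.fsync(fd)  # the call that fails on unsupported mounts
        finally:
            os.close(fd)
        return True
    except OSError:
        return False
    finally:
        try:
            os.remove(probe)
        except OSError:
            pass
```

If the NAS mount fails this probe, keep index.sqlite3 and ArchiveBox.conf on a local filesystem and mount only the archive/ subfolder from the NAS, as the error message suggests.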

    Thank you kindly in advance!

    opened by iwconfig 2
  • Support requested in setting up Archivebox

    Support requested in setting up Archivebox

    I am looking for someone who can assist me in configuring ArchiveBox in the proper way to support my project. My project is about a list of price plans.

    Currently my list of URLs looks like this: (screenshot omitted)

    The list is being used as an import file for a dashboard and the user who retrieves a particular company will be guided to the url mentioned in this table. Every URL points to a single page with a price scheme of the company. At the end of December each of these urls will get updated with information for 2022 or will get deleted and replaced with some other url for 2022. I would like to store the information for the current year and keep it accessible. Therefore I would like to replace every URL with a URL that refers to a specific domain that holds an archive of all these pages. I have no interest in storing additional pages that are linked to this URL, etc. It's only about the information on the given url.

    I am not enough of an expert in web pages, and ArchiveBox has just too many features for me to get this done safe and sound before the year comes to an end and these pages are replaced.

    Can someone assist me in setting up such a project with ArchiveBox?

    Thanks

    opened by sebastiaan6907 1
  • Bug: Unable to download TikTok page

    Bug: Unable to download TikTok page

    Describe the bug

    Not sure if this is a bug, a limitation of the TikTok website, or a limitation of ArchiveBox, but ArchiveBox fails to download all of the TikToks on a user's page.

    Steps to reproduce

    Ran ArchiveBox with depth=1 using the media archive method.

    Screenshots or log output

    [+] [2021-11-12 23:56:55] Adding 1 links to index (crawl depth=1)...
        > Saved verbatim input to sources/1636761415-import.txt
        > Parsed 1 URLs from input (Generic TXT)
    [*] Starting crawl of 1 sites 1 hop out from starting point
        > Downloading https://vm.tiktok.com/REDACTED_URL contents
        > Saved verbatim input to sources/1636761415.248872-crawl-vm.tiktok.com.txt
        > Parsed 0 URLs from input (Failed to parse)
        > Found 1 new URLs not already in index
    [*] [2021-11-12 23:56:55] Writing 1 links to main index...
        √ ./index.sqlite3
    [▶] [2021-11-12 23:56:55] Starting archiving of 1 snapshots in index...
    [+] [2021-11-12 23:56:55] "vm.tiktok.com/REDACTED_URL"
        https://vm.tiktok.com/REDACTED_URL
        > ./archive/1636761415.248872
          > media
            Extractor failed: Failed to save media
            Got youtube-dl response code: 1.
            WARNING: The program functionality for this site has been marked as broken, and will probably not work.
            ERROR: Unable to download JSON metadata: HTTP Error 502: Bad Gateway (caused by ); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
            Run to see full output:
                cd /data/archive/1636761415.248872;
                youtube-dl --write-description --write-info-json --write-annotations --write-thumbnail --no-call-home --write-sub --all-subs --write-auto-sub --convert-subs=srt --yes-playlist --continue --ignore-errors --geo-bypass --add-metadata --max-filesize=750m https://vm.tiktok.com/REDACTED_URL
            2 files (238.4 KB) in 0:00:03s
    [√] [2021-11-12 23:56:59] Update of 1 pages complete (3.54 sec)
        - 0 links skipped
        - 7 links updated
        - 3 links had errors

        Hint: To manage your archive in a Web UI, run:
            archivebox server 0.0.0.0:8000

    ArchiveBox version

    ArchiveBox v0.6.3
    Cpython Linux Linux-5.4.0-90-generic-x86_64-with-glibc2.28 x86_64
    IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep
    
    [i] Dependency versions:
     โˆš  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox                                                   
     โˆš  PYTHON_BINARY         v3.9.8          valid     /usr/local/bin/python3.9                                                    
     โˆš  DJANGO_BINARY         v3.1.13         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py           
     โˆš  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                                               
     โˆš  WGET_BINARY           v1.20.1         valid     /usr/bin/wget                                                               
     โˆš  NODE_BINARY           v15.14.0        valid     /usr/bin/node                                                               
     โˆš  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
     โˆš  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
     โˆš  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
     โˆš  GIT_BINARY            v2.20.1         valid     /usr/bin/git                                                                
     โˆš  YOUTUBEDL_BINARY      v2021.06.06     valid     /usr/local/bin/youtube-dl                                                   
     โˆš  CHROME_BINARY         v90.0.4430.212  valid     /usr/bin/chromium                                                           
     โˆš  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                                                 
    
    [i] Source-code locations:
     โˆš  PACKAGE_DIR           23 files        valid     /app/archivebox                                                             
     โˆš  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
     -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              
    
    [i] Secrets locations:
     -  CHROME_USER_DATA_DIR  -               disabled                                                                              
     -  COOKIES_FILE          -               disabled                                                                              
    
    [i] Data locations:
     โˆš  OUTPUT_DIR            5 files         valid     /data                                                                       
     โˆš  SOURCES_DIR           12 files        valid     ./sources                                                                   
     โˆš  LOGS_DIR              1 files         valid     ./logs                                                                      
     โˆš  ARCHIVE_DIR           3 files         valid     ./archive                                                                   
     โˆš  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
     โˆš  SQL_INDEX             228.0 KB        valid     ./index.sqlite3 
    
    opened by aidenmitchell 2
  • Bug: Empty image spaces where images are supposed to be

    Describe the bug

    Empty spaces appear where images are supposed to be. Both the SingleFile and wget snapshots show empty images.

    Steps to reproduce

    Go to https://mariushosting.com/ and archive any of the posts

    Screenshots or log output

    https://ibb.co/QJHGWzC

    ArchiveBox version

    latest

    opened by Unrepentant-Atheist 9
  • Feature Request: Hide previews for non-existent archive methods

    Type

    • [ ] General question or discussion
    • [ ] Propose a brand new feature
    • [x] Request modification of existing behavior or design

    What is the problem that your feature request solves

    Users see preview panes even when the corresponding archive method is not available. If a user tries to view one of these non-existent previews, they get an error message, which is likely to confuse users without a technical background.

    Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

    Only show previews that actually exist.

    How badly do you want this new feature?

    • [ ] It's an urgent deal-breaker, I can't live without it
    • [ ] It's important to add it in the near-mid term future
    • [x] It would be nice to have eventually

    • [ ] I'm willing to contribute dev time / money to fix this issue
    • [x] I like ArchiveBox so far / would recommend it to a friend
    • [ ] I've had a lot of difficulty getting ArchiveBox set up
    status: idea phase 
    opened by thenktor 1
  • Bug: Disable archive method selection if method is not available

    Describe the bug

    I've disabled most archive methods (e.g. SAVE_MERCURY=False), but when adding a new link to the archive the archive method selection field still shows the disabled methods. You can select them, but of course they just do not work.

    I'd expect disabled methods not to appear in this list.

    Screenshots or log output

    archive-methods

    ArchiveBox version

    ArchiveBox v0.6.2
    Cpython Linux Linux-5.13.19_1-x86_64-with-glibc2.28 x86_64
    IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep
    
    [i] Dependency versions:
     โˆš  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox                                                   
     โˆš  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9                                                    
     โˆš  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py           
     โˆš  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                                               
     -  WGET_BINARY           -               disabled  /usr/bin/wget                                                               
     โˆš  NODE_BINARY           v15.14.0        valid     /usr/bin/node                                                               
     โˆš  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
     -  READABILITY_BINARY    -               disabled  /node/node_modules/readability-extractor/readability-extractor              
     -  MERCURY_BINARY        -               disabled  /node/node_modules/@postlight/mercury-parser/cli.js                         
     -  GIT_BINARY            -               disabled  /usr/bin/git                                                                
     -  YOUTUBEDL_BINARY      -               disabled  /usr/local/bin/youtube-dl                                                   
     โˆš  CHROME_BINARY         v90.0.4430.93   valid     /usr/bin/chromium                                                           
     โˆš  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                                                 
    
    [i] Source-code locations:
     โˆš  PACKAGE_DIR           22 files        valid     /app/archivebox                                                             
     โˆš  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
     -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              
    
    [i] Secrets locations:
     -  CHROME_USER_DATA_DIR  -               disabled                                                                              
     -  COOKIES_FILE          -               disabled                                                                              
    
    [i] Data locations:
     โˆš  OUTPUT_DIR            5 files         valid     /data                                                                       
     โˆš  SOURCES_DIR           9 files         valid     ./sources                                                                   
     โˆš  LOGS_DIR              1 files         valid     ./logs                                                                      
     โˆš  ARCHIVE_DIR           6 files         valid     ./archive                                                                   
     โˆš  CONFIG_FILE           283.0 Bytes     valid     ./ArchiveBox.conf                                                           
     โˆš  SQL_INDEX             236.0 KB        valid     ./index.sqlite3
    
    opened by thenktor 0
  • ArchiveBox For YunoHost

    I wasn't sure of the best place to message you, but I wanted to let the ArchiveBox developers know that I finished packaging ArchiveBox for YunoHost.

    Hopefully it makes ArchiveBox easier for some folks to install and run, and cross-pollinates with some new communities.

    opened by mhfowler 1
  • Bug: Unable to delete snapshots

    I am using the latest version of the ArchiveBox web UI. When I try to delete a snapshot, it always says 'no actions selected'. How could this happen? By the way, I am running ArchiveBox on an arm64 (aarch64) platform.

    opened by orange2008 3
  • Bug: ArchiveBox add for Wallabag Atom feed doesn't work

    Describe the bug

    ArchiveBox add for Wallabag Atom feed doesn't work.

    Initially noticed that schedule doesn't work, and figured out that it's because Wallabag Atom feed doesn't work.

    Steps to reproduce

    Run archivebox add:

    archivebox add --parser=wallabag_atom --depth=1 https://wallabag.../feed/user/token/all
    

    Screenshots or log output

    [i] [2021-10-02 05:59:25] ArchiveBox v0.6.2: archivebox add --parser=wallabag_atom --depth=1 https://wallabag.../feed/user/token/all
        > /data
    
    [+] [2021-10-02 05:59:26] Adding 1 links to index (crawl depth=1)...
        > Saved verbatim input to sources/1633154366-import.txt
    
    [X] No links found using Wallabag Atom parser
        Hint: Try a different parser or double check the input?
    
        > Parsed 0 URLs from input (Wallabag Atom)
        > Found 0 new URLs not already in index
    
    [*] [2021-10-02 05:59:26] Writing 0 links to main index...
        โˆš ./index.sqlite3
    

    ArchiveBox version

    ArchiveBox v0.6.2
    Cpython Linux Linux-5.11.0-36-generic-x86_64-with-glibc2.28 x86_64
    IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=sonic
    
    [i] Dependency versions:
     โˆš  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox
     โˆš  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9
     โˆš  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
     โˆš  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
     -  WGET_BINARY           -               disabled  /usr/bin/wget
     โˆš  NODE_BINARY           v15.14.0        valid     /usr/bin/node
     โˆš  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
     โˆš  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
     -  MERCURY_BINARY        -               disabled  /node/node_modules/@postlight/mercury-parser/cli.js
     -  GIT_BINARY            -               disabled  /usr/bin/git
     -  YOUTUBEDL_BINARY      -               disabled  /usr/local/bin/youtube-dl
     โˆš  CHROME_BINARY         v90.0.4430.93   valid     /usr/bin/chromium
     โˆš  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg
    
    [i] Source-code locations:
     โˆš  PACKAGE_DIR           22 files        valid     /app/archivebox
     โˆš  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
     -  CUSTOM_TEMPLATES_DIR  -               disabled
    
    [i] Secrets locations:
     -  CHROME_USER_DATA_DIR  -               disabled
     -  COOKIES_FILE          -               disabled
    
    [i] Data locations:
     โˆš  OUTPUT_DIR            9 files         valid     /data
     โˆš  SOURCES_DIR           35 files        valid     ./sources
     โˆš  LOGS_DIR              2 files         valid     ./logs
     โˆš  ARCHIVE_DIR           102 files       valid     ./archive
     โˆš  CONFIG_FILE           420.0 Bytes     valid     ./ArchiveBox.conf
     โˆš  SQL_INDEX             1.1 MB          valid     ./index.sqlite3
    
    opened by m0nhawk 0
Releases(v0.6.2)
  • v0.6.2(Apr 10, 2021)

    New features

    • new ArchiveResult log in the admin web UI, with full editing ability of individual extractor outputs + list of outputs under each Snapshot admin entry
    • ability to save multiple snapshots of the same URL over time using new Re-snapshot button
    • add init --quick and server --quick-init options to quickly update the db version without doing a full re-init (for users with large archive collections this will make version upgrades a lot faster / less painful)
    • add new archivebox setup command and archivebox init --setup flag to aid in automatically installing dependencies and creating a superuser during initial setup
    • new SNAPSHOTS_PER_PAGE=40 and MEDIA_MAX_SIZE=750m config options
    • allow hotlinking directly to specific extractor output on the snapshot detail page using URL #hash e.g. /archive/<timestamp>/index.html#git
    • add ability to view the snapshot matching a given URL by visiting /archive/https://example.com/some/url -> redirects to -> /archive/<timestamp>/index.html (also works without a scheme: /archive/example.com)
    • #660 add ability to tag URLs while adding them via the web UI and via the CLI using archivebox add --tag=tag1,tag2,tag3 ...
    • #659 add back ability to override visual styling with custom HTML and CSS using new config option CUSTOM_TEMPLATES_DIR
    • ability to add and remove multiple tags at once from the snapshot admin using autocompleting dropdown
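
    A rough sketch of the URL-to-snapshot lookup behavior described above (all names and data here are hypothetical illustrations, not ArchiveBox internals):

    ```python
    # Hypothetical sketch: map a URL (with or without scheme) to its most
    # recent snapshot timestamp and build the redirect target path.
    from typing import Optional

    SNAPSHOTS = {  # url -> list of snapshot timestamps (illustrative data)
        "example.com/some/url": ["1617000000.0", "1617999999.0"],
    }

    def normalize(url):
        # strip the scheme so /archive/example.com also resolves
        for scheme in ("https://", "http://"):
            if url.startswith(scheme):
                return url[len(scheme):]
        return url

    def redirect_target(url) -> Optional[str]:
        timestamps = SNAPSHOTS.get(normalize(url))
        if not timestamps:
            return None
        # most recent snapshot wins
        return "/archive/{}/index.html".format(max(timestamps))

    print(redirect_target("https://example.com/some/url"))  # -> /archive/1617999999.0/index.html
    print(redirect_target("example.com/some/url"))          # works without a scheme too
    ```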

    Enhancements

    • lots of performance improvements! (in testing with 100k entries, the main index was brought down from 10-14 second load times to ~110ms once cache warms up)
    • full text search now works on the public snapshot list
    • dates and times are now localized to your browser's timezone instead of showing in UTC
    • integrity and correctness improvements to readability, mercury, warc, and other extractors
    • video subtitles and description are now added to the full-text search index as well (including youtube's autogenerated transcripts in all languages)
    • log all errors with full tracebacks to new data/logs/errors.log file (so users no longer have to run in --debug mode to see error details)
    • better archivebox schedule logging and changed logfile location to ./logs/schedule.log
    • better docker-compose setup experience with sonic config example in docker-compose.yml
    • add Django Debug Toolbar + djdt_flamegraph for developers to profile UI performance
    • add --overwrite flag support to archivebox schedule, archived urls get added similarly to add --overwrite
    • #644 remove Bootstrap and jQuery network requests to CDNs by inlining them instead
    • #647 allow filtering by ArchiveResult status in the Snapshot admin UI to select only links that have been archived or not archived
    • #550 kill all orphan child processes after each extractor finishes to prevent dangling chromium/node subprocesses and memory leaks
    • 3276434 add new SEARCH_BACKEND_TIMEOUT config option to tune the amount of time the search backend can take before it gives up
    • more diagnostic info added to the Snapshot admin view including most recent status code, content type, detected server, etc
    • make the order of the table columns, layout, and spacing the same on the public view and private view (also remove DataTable, we're not using it)
    • better snapshot grid page (faster load times, nicer CSS for tags and cards, more actions supported and metadata shown)
    • added Cache-Control headers to dramatically speed up load times by caching favicons, screenshots, etc. in browsers/upstreams
    • new project releases page https://releases.archivebox.io and demo url https://demo.archivebox.io
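
    The caching idea above can be sketched like this (the max-age value and path rule here are assumptions for illustration, not ArchiveBox's actual headers):

    ```python
    # Sketch: snapshot output files are immutable once written, so they can
    # be cached aggressively; index pages change as new snapshots arrive.
    def cache_headers(path):
        if path.startswith("/archive/"):
            return {"Cache-Control": "public, max-age=86400"}
        return {"Cache-Control": "no-cache"}

    print(cache_headers("/archive/1617999999.0/screenshot.png"))  # cached for a day
    print(cache_headers("/"))                                     # always revalidated
    ```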

    Bugfixes

    • #673 fix searching by URL substring in Snapshot admin list
    • #658 fix Snapshot admin action buttons not working in Safari and some other browsers
    • #678 fix AssertionError when archivebox attempted to archive with CHROME_BINARY=None because Chrome was not found on the host system
    • #654 fix some issues with sonic attempting to index massive text blobs or binary blobs on some pages and hanging
    • #674 fix UTF-8 encoding problems with file reading/writing on Windows (supporting a Python pkg on Windows is unreasonably painful, y'all)
    • #433 fix deleted items sometimes reappearing on next import/update
    • #473 fix issue preventing use of archivebox python API inside raw REPL (not using archivebox shell)
    • fix stdin/stdout/stderr handling for some edge cases in Docker/Docker-Compose
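
    The Windows encoding bug in #674 is an instance of a general pitfall: `open()` without an explicit encoding uses the platform locale (e.g. cp1252 on Windows), which corrupts non-ASCII content. A minimal sketch of the fix pattern:

    ```python
    # Always pass encoding="utf-8" explicitly so file round-trips don't
    # depend on the platform's locale codepage.
    import os
    import tempfile

    def write_index(path, text):
        with open(path, "w", encoding="utf-8") as f:  # explicit, not locale-dependent
            f.write(text)

    def read_index(path):
        with open(path, "r", encoding="utf-8") as f:
            return f.read()

    path = os.path.join(tempfile.mkdtemp(), "index.txt")
    write_index(path, "网站存档 🗃")
    print(read_index(path))  # round-trips regardless of platform locale
    ```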

    Source code(tar.gz)
    Source code(zip)
    archivebox--0.6.2-1.big_sur.bottle.tar.gz(11.46 MB)
    archivebox-0.6.2-py3-none-any.whl(477.89 KB)
    archivebox-0.6.2.tar.gz(403.89 KB)
    archivebox_0.6.2-1_all.deb(281.89 KB)
    Electron-ArchiveBox-macOS-x64-0.6.2.app.zip(76.54 MB)
  • v0.5.6(Feb 9, 2021)

    • add ARMv7 and ARMv8 CPU support for apt / deb distribution on Launchpad PPA
    • fix nodesource apt repo not supported on i386 b90afc8
    • fix handling of skipped ArchiveResult entries with null output 0aea5ed
    • catch exception on import of old index.json into ArchiveResult 171bbeb
    • move debsign to release not build 66fb5b2
    • skip tests during debian build a32eac3
    • fix emptystrings in cmd_version causing exception a49884a
    • automate deb dist better and bump version 0e6ac39
    • fix assertion 6705354
    • change wording of db not found error 683a087
    Source code(tar.gz)
    Source code(zip)
  • v0.5.4(Feb 1, 2021)

    Thank you to the contributors who helped with the 181 commits in this release!
    @cdvv7788, @jdcaballerov, @thedanbob, @aggroskater, @mAAdhaTTah, @mario-campos, @mikaelf

    • fix migration failing due to null cmd_versions in older archives a3008c8
    • publish minor & major versions to DockerHub and set up CodeQL codeql-analysis.yml c5b7d9f, bbb6cc8
    • fix DATABASE_NAME posixpath, and dependencies dict bug 02bdb3b, 5c7842f
    • use relative imports for .util to fix windows import clash 72e2c7b
    • fix COOKIES_FILE config param breaking in wget ef7711f
    • Refactor should_save_extractor methods to accept overwrite parameter 5420903
    • Fix issue #617 by using mark_safe in combination with format_html โ€ฆ 1989275
    • make permission chowning on docker start less fancy, respect PUID/PGID #635
    • add createsuperuser flag to server command 39ec77e
    • fix files icons styling and use the db exclusively for rendering them, instead of filesystem f004058, 7d8fe66, 5c54bcc, 534ead2
    • limit youtubedl download size to 750m and stop splitting out audio files 3227f54
    • also search url, timestamp, tags on public index 8a4edb4
    • fix trailing slash problems and wget not detecting download path 9764a8e
    • add response status code to headers.json c089501
    • fix singlefile path used for sonic 24e2493
    • cleanup template layout in filesystem, new snapshot detail page UI
    Source code(tar.gz)
    Source code(zip)
    archivebox-0.5.4-py3-none-any.whl(385.10 KB)
    archivebox_0.5.4-1_all.deb(235.85 KB)
  • v0.5.3(Jan 6, 2021)

    • ArchiveResult moved to SQLite3 DB for performance @cdvv7788
    • lots of assorted bugfixes and improvements courtesy of @cdvv7788 and @jdcaballerov
    • new full-text search support with ripgrep and sonic courtesy of @jdcaballerov
    • new archivebox oneshot command for downloading a single site without starting a whole collection
    • new Pocket API importer courtesy of @mAAdhaTTah
    • new Wallabag importer courtesy of @ehainry
    • new extractor options on Add page courtesy of @BlipRanger
    • new apt/deb/homebrew/pip packaging setup into separate repos under new Github Org https://github.com/ArchiveBox
    • new official PPA and Docker Hub accounts https://hub.docker.com/r/archivebox/archivebox (with automatic armv7 builds courtesy of @chrismeller)
    • new Snapshot grid view courtesy of @jdcaballerov
    Source code(tar.gz)
    Source code(zip)
  • v0.4.24(Dec 3, 2020)

  • v0.4.21(Aug 18, 2020)

  • v0.4.17(Aug 18, 2020)

    • Fix bugs with parsing long URLs as paths
    • html-encoded URLs
    • new generic HTML parser
    • new --init and --overwrite flags on add
    • improve stdout and hints
    • fix Pull title button
    • other small bugfixes
    Source code(tar.gz)
    Source code(zip)
  • v0.4.16(Aug 18, 2020)

  • v0.4.15(Aug 18, 2020)

    • fix a bug where invalid URLs were parsed and imported, causing the whole archive process to crash
    • add support for scheduled archiving in docker
    docker run -v $PWD:/data archivebox schedule --foreground --every=day --depth=1 'https://getpocket.com/users/USERNAME/feed/all'
    
    # docker-compose.yml
    
    version: '3.7'
    
    services:
      archivebox:
        image: nikisweeting/archivebox:latest
        command: schedule --foreground --every=day --depth=1 'https://getpocket.com/users/USERNAME/feed/all'
        environment:
          - USE_COLOR=True
          - SHOW_PROGRESS=False
        volumes:
          - ./data:/data
    
    Source code(tar.gz)
    Source code(zip)
  • v0.4.14(Aug 14, 2020)

    Adds support for the Readability article text extractor. It runs on the SingleFile, wget, and DOM dump output by default, but if none of those are available it will download the article from scratch to do text extraction. This release also officially adds Docker support for ARM architectures, including the Raspberry Pi. The image size was also shrunk from 1.5GB to 452MB by making sure unnecessary build tools are uninstalled after the package build process.
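
    The fallback order described above can be sketched as a simple preference list (illustrative logic, not the actual extractor code):

    ```python
    # Sketch: prefer snapshot output that already exists on disk; only
    # re-download the page when no local copy is available.
    def pick_readability_source(available):
        for source in ("singlefile", "wget", "dom"):
            if available.get(source):
                return source
        return "fetch"  # nothing on disk: download the article from scratch

    print(pick_readability_source({"wget": True, "dom": True}))  # -> wget
    print(pick_readability_source({}))                           # -> fetch
    ```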

    Source code(tar.gz)
    Source code(zip)
  • v0.4.13(Aug 10, 2020)

  • v0.4.12(Aug 10, 2020)

  • v0.4.11(Aug 7, 2020)

    We add a major new archive method in this release: SingleFile. On bare metal it requires installing Node and Chrome/Chromium, but it works out-of-the-box in the Docker version.

    This finally allows ArchiveBox to pass all of the acid tests except one, and the archives for GitHub and many other sites are nicer than what wget was able to produce on its own.

    Source code(tar.gz)
    Source code(zip)
  • v0.4.9(Jul 28, 2020)

    ๐ŸŒ… v0.4 is officially released. This is a long-awaited 3rd-pass review over every corner of the archivebox UX. It adresses many of the fundamental shortcomings around index consistency by using a new SQLite database, with automatic migrations provided by django. It also smooths many of the rough edges, adds a new admin Web UI, a rich new CLI, closes 40+ github tickets, and is the first official release available on PyPI.

    • https://pypi.org/project/archivebox/ pip install archivebox
    • https://hub.docker.com/r/nikisweeting/archivebox docker run -v $PWD:/data nikisweeting/archivebox
    • https://archivebox.readthedocs.io/en/latest/
    • https://github.com/pirate/ArchiveBox/releases/tag/v0.4.9

    Enjoy!

    ๐ŸŽ‰ Big thanks to everyone who helped! Especially the Monadical team @cdvv7788 @apkallum @afreydev and also @drpfenderson who helped us track down the last few index importing bugs! ๐ŸŽ‰

    The docs still have some work left to finish updating, but the CLI help text is all up-to-date (when in doubt, just pass --help).
    Let us know if you find any rough edges here: https://github.com/pirate/ArchiveBox/issues/new/choose

    pip install archivebox
    
    cd path/to/your/archive/folder
    
    archivebox init  # this doubles as the migrate command, it will safely upgrade existing index files automatically
    archivebox add 'https://example.com'
    archivebox add 'https://getpocket.com/users/USERNAME/feed/all' --depth=1
    archivebox status
    archivebox server
    archivebox help
    

    Or if you prefer Docker, the CLI works exactly the same (archivebox [subcommand] [...args]):

    docker run -v $PWD:/data nikisweeting/archivebox init
    docker run -v $PWD:/data nikisweeting/archivebox add 'https://example.com'
    docker run -v $PWD:/data -p 8000 nikisweeting/archivebox server
    
    version: '3.7'
    
    services:
        archivebox:
            image: nikisweeting/archivebox:latest
            command: server 0.0.0.0:8000
            stdin_open: true
            tty: true
            ports:
                - 8000:8000
            environment:
                - USE_COLOR=True
            volumes:
                - ./data:/data
    


    New Features

    A bunch of big changes:

    • pip install archivebox is now available
    • full transition to Django Sqlite DB with migrations (making upgrades between versions much safer now)
    • maintains an intuitive and helpful CLI that's backwards-compatible with all previous archivebox data versions
    • uses argparse instead of hand-written CLI system: see archivebox/cli/archivebox.py
    • new subcommands-based CLI for archivebox (see below)
    • new Web UI with pagination, better search, filtering, permissions, and more
    • 30+ assorted bugfixes, new features, and tickets closed

    For more info, see: https://github.com/pirate/ArchiveBox/wiki/Roadmap

    Released in this version:

    Web UI:

    • โœ… / Main index
    • โœ… /add Page to add new links to the archive (but needs improvement)
    • โœ… /archive/<timestamp>/ Snapshot details page
    • โœ… /archive/<timestamp>/<url> live wget archive of page
    • โœ… /archive/<timestamp>/<extractor> get a specific extractor output for a given snapshot
    • โœ… /archive/<url> shortcut to view most recent snapshot of given url
    • โœ… /archive/<url_hash> shortcut to view most recent snapshot of given url
    • โœ… /admin Admin interface to view and edit archive data
    • โœ… /old.html Backwards-compatible static HTML index for the previous version


    Source code(tar.gz)
    Source code(zip)
  • v0.2.4(Feb 27, 2019)

    • better archive corruption guards (check structure invariants on every parse & save)
    • remove title prefetching in favor of new FETCH_TITLE archive method
    • slightly improved CLI output for parsing and remote url downloading
    • re-save index after archiving completes to update titles and urls
    • remove redundant derivable data from link json schema
    • markdown link parsing support
    • faster link parsing and better symbol handling using a new compiled URL_REGEX
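
    As a rough illustration of what a compiled URL regex looks like (a deliberately simplified pattern, not ArchiveBox's actual URL_REGEX):

    ```python
    # Simplified pattern: match http(s) URLs up to whitespace, quotes, or
    # closing brackets, so markdown links like [text](url) terminate cleanly.
    import re

    URL_REGEX = re.compile(r"https?://[^\s\"'<>\)\]]+")

    text = "see [docs](https://example.com/docs) and http://example.org/page too"
    print(URL_REGEX.findall(text))
    # -> ['https://example.com/docs', 'http://example.org/page']
    ```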
    Source code(tar.gz)
    Source code(zip)
  • v0.2.3(Feb 19, 2019)

    • fixed issues with parsing titles including trailing tags
    • fixed issues with titles defaulting to URLs instead of attempting to fetch
    • fixed issue where bookmark timestamps from RSS would be ignored and current ts used instead
    • fixed issue where ONLY_NEW would overwrite existing links in archive with only new ones
    • fixed lots of issues with URL parsing by using urllib.parse instead of hand-written lambdas
    • ignore robots.txt when using wget (ssshhh don't tell anyone ๐Ÿ˜)
    • fix RSS parser bailing out when there's whitespace around XML tags
    • fix issue with browser history export trying to run ls on wrong directory
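
    The urllib.parse change above replaces hand-written string lambdas with the stdlib parser; a sketch of the kind of helpers involved (hypothetical names):

    ```python
    # Use urllib.parse for robust URL decomposition instead of ad-hoc
    # string splitting.
    from urllib.parse import urlparse

    def domain(url):
        return urlparse(url).netloc

    def without_fragment(url):
        # drop the #fragment but keep scheme, host, path, and query
        return urlparse(url)._replace(fragment="").geturl()

    print(domain("https://example.com/path?q=1#frag"))            # -> example.com
    print(without_fragment("https://example.com/path?q=1#frag"))  # -> https://example.com/path?q=1
    ```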
    Source code(tar.gz)
    Source code(zip)
  • v0.2.2(Feb 7, 2019)

    This is a bugfix release, many parts of the parsing process have been improved or fixed.

    • Shaarli RSS export support
    • Fix issues with plain text link parsing including quotes, whitespace, and closing tags in URLs
    • add USER_AGENT to archive.org submissions so they can track archivebox usage
    • remove all icons similar to archive.org branding from archive UI
    • hide some of the noisier youtubedl and wget errors
    • set permissions on youtubedl media folder
    • fix chrome data dir incorrect path and quoting
    • better chrome binary finding
    • show which parser is used when importing links, show progress when fetching titles
    Source code(tar.gz)
    Source code(zip)
  • v0.2.1(Jan 11, 2019)

    This is a feature-packed release, so it's likely to be a little buggier than usual!

    New features:

    • ability to load any plain text list of links (also the new fallback for all parsers)
    • WARC file saving via wget: FETCH_WARC=True
    • Git repository downloading with git clone: FETCH_GIT=True GIT_DOMAINS=github.com,gitlab.com,bitbucket.org
    • Media downloading with youtube-dl: FETCH_MEDIA=True MEDIA_TIMEOUT=36000
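
    Options like FETCH_WARC=True above are passed as environment variables; a sketch of how such boolean env config might be read (illustrative, not the actual config module):

    ```python
    # Read a boolean config flag from the environment, accepting common
    # truthy spellings; anything else counts as False.
    import os

    def env_bool(name, default=False):
        return os.environ.get(name, str(default)).strip().lower() in ("true", "1", "yes")

    os.environ["FETCH_GIT"] = "True"
    os.environ.pop("FETCH_MEDIA", None)  # ensure unset for the demo

    print(env_bool("FETCH_GIT"))    # -> True
    print(env_bool("FETCH_MEDIA"))  # -> False (falls back to the default)
    ```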

    Bugfixes:

    • autodetect the correct chromium binary in almost all cases
    • create browser history export folder automatically
    • higher allowed timestamp precision

    New logo:

    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Dec 21, 2018)

  • v0.1.0(Jun 11, 2018)

    Warning: Running this version will move the old html/ output folder to the new location: output/.

    Changes:

    • entirely new folder structure & code layout
    • moved scripts into bin/ folder, symlinked setup and archive for backwards-compatibility
    • removed TEMPLATE_INDEX* config options, just symlink the files in templates/ to your custom versions
    • added support for ./bin/export-browser-history JSON imports of browsing history from Chrome and Firefox
    Source code(tar.gz)
    Source code(zip)
  • v0.0.3(Oct 30, 2017)

    New Features:

    • Support for parsing links from RSS feeds
    • Support for specifying a URL as well as local file paths: ./archive.py https://example.com/path/to/rss/feed.xml
    • Support for --user-data-dir for archiving restricted sites with chrome headless
    • Simple & Fancy HTML & JSON indexes for each individual link
    • Archive attempt history stored in link index.json

    Improvements:

    • Append to existing archive instead of overwriting the index each time
    • Reduced unnecessary config options, it should "just work"
    • Smartly dedupe and cleanup messy archive folders
    • Massively cleaned up codebase
    Source code(tar.gz)
    Source code(zip)
  • v0.0.2(Jul 4, 2017)

    • refactor codebase into separate files
    • check for minimum python version before running
    • fix utf-8 encoding errors when writing index.html
    • make index easier to customize with templates/ folder
    • WIP audio & video downloading with youtube-dl
    Source code(tar.gz)
    Source code(zip)
  • v0.0.1(Jul 4, 2017)

    It's reached a point where I'm comfortable bringing Bookmark Archiver out of alpha and into beta. This release supports a broad range of bookmark export files, works well with wget archiving, and produces clean, future-compatible archive folders.

    See the README for more details and a list of features. Future releases will have a changelog.

Owner
ArchiveBox
The self-hosted internet archiving solution by @pirate and @Monadical-SAS. #webarchiving #digipres