Command-line program to download documents from web portals.

Overview

Command-line document download made easy.



Highlights

  • list available documents in JSON format or download them
  • filter documents using
    • string matching
    • regular expressions or
    • jq queries
  • display captchas or QR codes for interactive input
  • writing new plugins is easy
  • existing plugins (some of them even work):
    • amazon
    • ing.de
    • dkb.de
    • o2.de
    • kabel.vodafone.de
    • conrad.de
    • elster.de



Dependencies

document-dl is a Python 3 package; installing it with pip (see below) pulls in its Python dependencies automatically. Selenium-based plugins additionally need the browser selected with --browser and a matching webdriver installed.



Installation

$ git clone --recursive https://github.com/heeplr/document-dl
$ cd document-dl
$ pip install .



Usage

Display Help:

$ document-dl -h
Usage: document-dl [OPTIONS] COMMAND [ARGS]...

  download documents from web portals

Options:
  -u, --username TEXT             login id  [env var: DOCDL_USERNAME]
  -p, --password TEXT             secret password  [env var: DOCDL_PASSWORD]
  -m, --match <ATTRIBUTE PATTERN>...
                                  only output documents where attribute
                                  contains pattern string  [env var:
                                  DOCDL_MATCH]

  -r, --regex <ATTRIBUTE REGEX>...
                                  only output documents where attribute value
                                  matches regex  [env var: DOCDL_REGEX]

  -j, --jq JQ_EXPRESSION          only output documents if json query matches
                                  document's attributes (see
                                  https://stedolan.github.io/jq/manual/ )
                                  [env var: DOCDL_JQ]

  -H, --headless BOOLEAN          show browser window if false  [env var:
                                  DOCDL_HEADLESS; default: True]

  -b, --browser [chrome|edge|firefox|ie|opera|safari|webkitgtk]
                                  webdriver to use for selenium based plugins
                                  [env var: DOCDL_BROWSER; default: chrome]

  -t, --timeout INTEGER           seconds to wait for data before terminating
                                  connection  [env var: DOCDL_TIMEOUT;
                                  default: 15]

  -i, --image-loading BOOLEAN     Turn off image loading when False  [env var:
                                  DOCDL_IMAGE_LOADING; default: False]

  -a, --action [download|list]    download or just list documents  [env var:
                                  DOCDL_ACTION; default: list]

  -h, --help                      Show this message and exit.

Commands:
  amazon    amazon.de (invoices)
  conrad    conrad.de (invoices)
  dkb       dkb.de with photoTAN (postbox)
  elster    elster.de with path to .pfx certfile as username (postbox)
  ing       banking.ing.de with photoTAN (postbox)
  o2        o2online.de (invoices/postbox)
  vodafone  kabel.vodafone.de (postbox, invoices)

Display plugin-specific help (currently there is a bug in click that prompts for username and password before displaying the help):

$ document-dl ing --help
Usage: document-dl ing [OPTIONS]

  banking.ing.de with photoTAN (postbox)

Options:
  -k, --diba-key TEXT  DiBa Key  [env var: DOCDL_DIBA_KEY]
  -h, --help           Show this message and exit.
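
Global options go before the plugin name and plugin-specific options after it. For example (the DiBa key value is an illustrative placeholder):

$ document-dl --action download ing --diba-key XXXXXXXXXX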



Examples

List all documents from vodafone.de, prompt for username/password:

$ document-dl vodafone

Same, but show browser window this time:

$ document-dl --headless=false vodafone

Download all documents from conrad.de, pass credentials as command-line arguments:

$ document-dl --username mylogin --password mypass --action download conrad

Download all documents from conrad.de, pass credentials as env vars:

$ DOCDL_USERNAME='mylogin' DOCDL_PASSWORD='mypass' document-dl --action download conrad

Download all documents from o2online.de where "doctype" attribute contains "BILL":

$ document-dl --match doctype BILL --action download o2

You can also use regular expressions to filter documents:

$ document-dl --regex date '^(2021-04|2021-05).*$' o2

List all documents from o2online.de where year >= 2019:

$ document-dl --jq 'select(.year >= 2019)' o2
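
jq queries are evaluated against each document's attribute object. The exact attributes differ per plugin; a typical object might look roughly like this (illustrative sketch only):

{"id": 3, "doctype": "BILL", "year": 2021, "date": "2021-05-04"}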

Download document from elster.de with id == 15:

$ document-dl --jq 'contains({id: 15})' --action download elster
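
Since listings are printed as JSON, the output can also be post-processed with external tools, e.g. pretty-printed with a standalone jq (assuming jq is installed):

$ document-dl vodafone | jq .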



Writing a plugin

Plugins are click-plugins, which in turn are normal @click.command functions registered in setup.py.

  • put your plugin into "docdl/plugins"

  • write your plugin class:

    • if you just need requests, inherit from docdl.WebPortal and use self.session that's initialized for you
    • if you need selenium, inherit from docdl.SeleniumWebPortal and use self.webdriver that's initialized for you

  • add click glue code

  • add your plugin to setup.py docdl_plugins registry

For example, a minimal requests-based plugin:

import click

import docdl
import docdl.cli
import docdl.util

class MyPlugin(docdl.WebPortal):

    URL_LOGIN = "https://myservice.com/login"

    def login(self):
        response = self.session.get(self.URL_LOGIN)
        # ... authenticate ...
        if not_logged_in:
            return False
        return True

    def logout(self):
        # ... logout ...
        pass

    def documents(self):
        # iterate over all available documents
        for count, document in enumerate(all_documents):

            # scrape:
            #  * document attributes
            #    * it's recommended to assign an incremental "id"
            #      attribute to every document
            #    * if you set a "filename" attribute, it will be used to
            #      rename the downloaded file
            #    * dates should be parsed to datetime.datetime objects
            #      docdl.util.parse_date() should parse the most common strings
            #
            # also you must scrape either:
            #  * the download URL
            #
            # or (for SeleniumWebPortal plugins):
            #  * the DOM element that triggers download. It is expected
            #    that the download starts immediately after click() on
            #    the DOM element
            # or implement a custom download() method

            yield docdl.Document(
                url=this_documents_url,
                # download_element = <some selenium element to click>
                attributes={
                    "id": count,
                    "category": "invoices",
                    "title": this_documents_title,
                    "filename": this_documents_target_filename,
                    "date": docdl.util.parse_date(some_date_string)
                }
            )


    def download(self, document):
        """you shouldn't need this for most web portals"""
        # ... save file to os.getcwd() ...
        return self.rename_after_download(document, filename)


@click.command()
@click.pass_context
def myplugin(ctx):
    """plugin description (what, documents, are, scraped)"""
    docdl.cli.run(ctx, MyPlugin)

and in setup.py:

# ...
setup(
    # ...
    packages=find_packages(),
    entry_points={
        'docdl_plugins': [
            # ...
            'myplugin=docdl.plugins.myplugin:myplugin',
            # ...
        ],
        # ...
    },
    # ...
)
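
For selenium-based plugins the skeleton looks the same, except that the class inherits from docdl.SeleniumWebPortal, drives self.webdriver instead of self.session and usually yields a download_element instead of a url. The following is a rough sketch in the style of the example above; the locators and lowercase placeholder variables are assumptions for illustration, not part of the real API:

import click

import docdl
import docdl.cli
import docdl.util


class MySeleniumPlugin(docdl.SeleniumWebPortal):

    URL_LOGIN = "https://myservice.example/login"

    def login(self):
        # open the login page in the selenium-controlled browser
        self.webdriver.get(self.URL_LOGIN)
        # ... fill in the login form, submit, wait for the portal to load ...
        return logged_in_successfully

    def logout(self):
        # ... click the logout button ...
        pass

    def documents(self):
        # ... navigate to the document list and collect its rows ...
        for count, row in enumerate(document_rows):
            yield docdl.Document(
                # clicking this element is expected to start the download
                download_element=row.find_element("css selector", "a.download"),
                attributes={
                    "id": count,
                    "title": row.text,
                    "date": docdl.util.parse_date(some_date_string)
                }
            )


@click.command()
@click.pass_context
def myseleniumplugin(ctx):
    """my selenium plugin (what, documents, are, scraped)"""
    docdl.cli.run(ctx, MySeleniumPlugin)

Such a plugin is registered in the docdl_plugins entry point exactly like the requests-based example above.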



Security

Beware that your login credentials are most probably saved in your shell history when you pass them as command-line arguments. Use the interactive input prompt to avoid that, or set the environment variables in a way that keeps them out of your history.
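
For example, in bash you can read the password into the environment variable that document-dl picks up, without the password value ever appearing in your history (a sketch; adapt to your shell):

$ read -rs DOCDL_PASSWORD && export DOCDL_PASSWORD
$ document-dl --username mylogin --action download conrad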



Bugs

document-dl is still in a very early state of development and a lot of things don't work yet. In particular, a ton of edge cases still need to be covered. If you find a bug, please open an issue or send a pull request.

  • --browser settings other than chrome probably don't work unless you help to test them
  • some services offer more documents/data than currently scraped



TODO

  • logging
  • better documentation
  • properly parse RFC 6266 (Content-Disposition)
  • delete action


Comments

  • Testing Firefox

    • Mac OS 11.5.2
    • Firefox 91.0.1 + Selenium

    I'm running this command:

     document-dl -b firefox -u NUMBER -p 'PASSWORD' --action download --jq 'contains({id: 0})' o2
    

    It starts a pure Firefox without anything installed (it's a new profile I think). After a while it exits and I'm getting these errors:

    Traceback (most recent call last):
      File "/usr/local/bin/document-dl", line 8, in <module>
        sys.exit(documentdl())
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1137, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1062, in main
        rv = self.invoke(ctx)
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1668, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 763, in invoke
        return __callback(*args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/docdl/plugins/o2.py", line 133, in o2
        docdl.cli.run(ctx, O2)
      File "/usr/local/lib/python3.9/site-packages/docdl/cli.py", line 150, in run
        plugin = plugin_class(
      File "/usr/local/lib/python3.9/site-packages/docdl/__init__.py", line 147, in __init__
        self._init_webdriver(webdriver_opts, arguments['webdriver'])
      File "/usr/local/lib/python3.9/site-packages/docdl/__init__.py", line 272, in _init_webdriver
        self.webdriver = webdriver.Firefox(
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/firefox/webdriver.py", line 190, in __init__
        executor = ExtensionConnection("127.0.0.1", self.profile,
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/firefox/extension_connection.py", line 52, in __init__
        self.binary.launch_browser(self.profile, timeout=timeout)
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/firefox/firefox_binary.py", line 73, in launch_browser
        self._wait_until_connectable(timeout=timeout)
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/firefox/firefox_binary.py", line 109, in _wait_until_connectable
        raise WebDriverException(
    selenium.common.exceptions.WebDriverException: Message: Can't load the profile. Possible firefox version mismatch. 
    You must use GeckoDriver instead for Firefox 48+. Profile Dir: /var/folders/td/x4r_b40s2r5bvlrlrby13ptw0000gn/T/tmp799ugx0k 
    If you specified a log_file in the FirefoxBinary constructor, check it for details.
    opened by CyrosX
  • amazon: add support for limiting to a single year

    It is already possible to filter for documents (or orders, in the Amazon context) of a certain year. Nevertheless, documents for all years will still be inspected, which can become time-consuming. This PR adds a filter so that processing itself is limited as well.

    Currently, it is only implemented for the Amazon plugin. A global "--year" switch might be well worth considering but would also involve API changes. Therefore I chose to only propose this first step.

    opened by sdx23
  • use selenium<4.3.0

    selenium 4.3.0 deprecated Opera and find_element_by_*, which are still used here, so we need to use an earlier version.

    selenium changelog: https://github.com/SeleniumHQ/selenium/blob/a4995e2c096239b42c373f26498a6c9bb4f2b3e7/py/CHANGES

    opened by cyroxx
  • new feature: preserve document time as file mtime

    Metadata is always nice to have, but having it, we should also use it. Specifically, set the file mtime to the document date. Actually, I even think that should be the default option. The download time is still available in other attributes, but I don't really see where it'd be important.

    This might just as well fit into rename_after_download which should then be postprocess_after_download, but I had no strong preference so it landed in cli.py.

    Up to now I didn't check whether the date attribute actually exists. I guess it should for any document.

    Regarding click options: I find it strange to supply "=true" to bool options. Imho such options should be simple switches that enable the non-default behaviour (e.g. also in --headless), as is common in a lot of other programs. However, for now I've stuck with the existing style.

    opened by sdx23
  • RFC: limited time-range

    Let me first say that I totally agree with the statement in #4 that we should take care not to clutter the namespace [1]. After all, I see unification/standardization as one big aspect of this project that distinguishes it from me hacking a short standalone selenium script for just the websites/services I need.

    As well, I do agree that adding a special "--year" switch or the like is redundant with the ability of jq querying (and that also put me at unease with my suggestion in #4). But I also see the point of dates being something special since they are (possibly the main) restriction on which documents to process.

    That is important on the one hand for speed / not doing a lot of useless work (see #4). This is relevant for regular downloading as mentioned in #1: when the script runs periodically once a month, it is surely fine to only (try to) download documents from within the last 60 days.

    On the other hand, the scraped website itself may raise that question (and that's why I bring up this topic again). I'm currently developing a plugin for smartbroker [2], which displays the postbox by default as a search form [3]. It allows selecting predefined ranges (the last x days with x in [10, 30, ... 360]) or alternatively specifying your own range. So I could either -- quite arbitrarily -- select "last 360 days" or do something (possibly stupid?) like setting the range to 1970-01-01 until today.

    Now this is specific to the plugin in question, and in principle I'd just leave it as is (360 days) and possibly add an option making all documents available by forcing "1970-01-01 until today". But from a user perspective it might get confusing what one must do for which plugin to behave as expected from experience with the others.

    Not sure whether I'm overthinking this; it can still be changed at some later point. Nevertheless I wanted to bring it to attention and ask for comments.

    [1] Sidenote (offtopic here): this raises the question whether short CLI options should be discouraged in plugins.
    [2] https://github.com/sdx23/document-dl/tree/smartbroker
    [3] screenshot: 2022-01-07-180639_624x359_scrot

    question
    opened by sdx23
  • Support for Remote selenium webdriver (Docker Version?)

    Is it possible to configure a remote server for doing all the Selenium/browser work? There is webdriver.Remote in Selenium's Python bindings to configure the IP/hostname, and https://hub.docker.com/u/selenium provides Docker images for all browsers.

    Thought about building a stack like this: a document-dl container and a selenium chrome container connected through a docker network, plus a volume bind for an invoice directory (auto sub-dirs for all invoice providers).

    Alternative: everything in one container (https://nander.cc/using-selenium-within-a-docker-container).

    enhancement
    opened by CyrosX