Command-line program to download documents from web portals.

Overview

Command-line document download made easy.



Highlights

  • list available documents in JSON format or download them
  • filter documents using
    • string matching
    • regular expressions or
    • jq queries
  • display captchas or QR codes for interactive input
  • writing new plugins is easy
  • existing plugins (some of them even work):
    • amazon
    • ing.de
    • dkb.de
    • o2.de
    • kabel.vodafone.de
    • conrad.de
    • elster.de



Dependencies

document-dl is a Python 3 package; installing it with pip (see below) pulls in its Python dependencies automatically. Selenium-based plugins additionally need the browser selected with --browser and a matching webdriver installed.



Installation

$ git clone --recursive https://github.com/heeplr/document-dl
$ cd document-dl
$ pip install .



Usage

Display Help:

$ document-dl -h
Usage: document-dl [OPTIONS] COMMAND [ARGS]...

  download documents from web portals

Options:
  -u, --username TEXT             login id  [env var: DOCDL_USERNAME]
  -p, --password TEXT             secret password  [env var: DOCDL_PASSWORD]
  -m, --match <ATTRIBUTE PATTERN>...
                                  only output documents where attribute
                                  contains pattern string  [env var:
                                  DOCDL_MATCH]

  -r, --regex <ATTRIBUTE REGEX>...
                                  only output documents where attribute value
                                  matches regex  [env var: DOCDL_REGEX]

  -j, --jq JQ_EXPRESSION          only output documents if json query matches
                                  document's attributes (see
                                  https://stedolan.github.io/jq/manual/ )
                                  [env var: DOCDL_JQ]

  -H, --headless BOOLEAN          show browser window if false  [env var:
                                  DOCDL_HEADLESS; default: True]

  -b, --browser [chrome|edge|firefox|ie|opera|safari|webkitgtk]
                                  webdriver to use for selenium based plugins
                                  [env var: DOCDL_BROWSER; default: chrome]

  -t, --timeout INTEGER           seconds to wait for data before terminating
                                  connection  [env var: DOCDL_TIMEOUT;
                                  default: 15]

  -i, --image-loading BOOLEAN     Turn off image loading when False  [env var:
                                  DOCDL_IMAGE_LOADING; default: False]

  -a, --action [download|list]    download or just list documents  [env var:
                                  DOCDL_ACTION; default: list]

  -h, --help                      Show this message and exit.

Commands:
  amazon    amazon.de (invoices)
  conrad    conrad.de (invoices)
  dkb       dkb.de with photoTAN (postbox)
  elster    elster.de with path to .pfx certfile as username (postbox)
  ing       banking.ing.de with photoTAN (postbox)
  o2        o2online.de (invoices/postbox)
  vodafone  kabel.vodafone.de (postbox, invoices)

Display plugin-specific help (currently there is a bug in click that prompts for username and password before displaying the help):

$ document-dl ing --help
Usage: document-dl ing [OPTIONS]

  banking.ing.de with photoTAN (postbox)

Options:
  -k, --diba-key TEXT  DiBa Key  [env var: DOCDL_DIBA_KEY]
  -h, --help           Show this message and exit.
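
Global options go before the plugin name and plugin-specific options after it. For example (the DiBa key value is an illustrative placeholder):

$ document-dl --action download ing --diba-key XXXXXXXXXX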



Examples

List all documents from vodafone.de, prompt for username/password:

$ document-dl vodafone

Same, but show browser window this time:

$ document-dl --headless=false vodafone

Download all documents from conrad.de, pass credentials as command-line arguments:

$ document-dl --username mylogin --password mypass --action download conrad

Download all documents from conrad.de, pass credentials as env vars:

$ DOCDL_USERNAME='mylogin' DOCDL_PASSWORD='mypass' document-dl --action download conrad

Download all documents from o2online.de where "doctype" attribute contains "BILL":

$ document-dl --match doctype BILL --action download o2

You can also use regular expressions to filter documents:

$ document-dl --regex date '^(2021-04|2021-05).*$' o2

List all documents from o2online.de where year >= 2019:

$ document-dl --jq 'select(.year >= 2019)' o2
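
jq queries are evaluated against each document's attribute object. The exact attributes differ per plugin; a typical object might look roughly like this (illustrative sketch only):

{"id": 3, "doctype": "BILL", "year": 2021, "date": "2021-05-04"}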

Download document from elster.de with id == 15:

$ document-dl --jq 'contains({id: 15})' --action download elster
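
Since listings are printed as JSON, the output can also be post-processed with external tools, e.g. pretty-printed with a standalone jq (assuming jq is installed):

$ document-dl vodafone | jq .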



Writing a plugin

Plugins are click-plugins, which in turn are normal @click.command functions registered in setup.py.

  • put your plugin into "docdl/plugins"

  • write your plugin class:

    • if you just need requests, inherit from docdl.WebPortal and use self.session that's initialized for you
    • if you need selenium, inherit from docdl.SeleniumWebPortal and use self.webdriver that's initialized for you

  • add click glue code

  • add your plugin to setup.py docdl_plugins registry

For example, a minimal requests-based plugin:

import click

import docdl
import docdl.cli
import docdl.util

class MyPlugin(docdl.WebPortal):

    URL_LOGIN = "https://myservice.com/login"

    def login(self):
        response = self.session.get(self.URL_LOGIN)
        # ... authenticate ...
        if not_logged_in:
            return False
        return True

    def logout(self):
        # ... logout ...
        pass

    def documents(self):
        # iterate over all available documents
        for count, document in enumerate(all_documents):

            # scrape:
            #  * document attributes
            #    * it's recommended to assign an incremental "id"
            #      attribute to every document
            #    * if you set a "filename" attribute, it will be used to
            #      rename the downloaded file
            #    * dates should be parsed to datetime.datetime objects
            #      docdl.util.parse_date() should parse the most common strings
            #
            # also you must scrape either:
            #  * the download URL
            #
            # or (for SeleniumWebPortal plugins):
            #  * the DOM element that triggers download. It is expected
            #    that the download starts immediately after click() on
            #    the DOM element
            # or implement a custom download() method

            yield docdl.Document(
                url=this_documents_url,
                # download_element = <some selenium element to click>
                attributes={
                    "id": count,
                    "category": "invoices",
                    "title": this_documents_title,
                    "filename": this_documents_target_filename,
                    "date": docdl.util.parse_date(some_date_string)
                }
            )


    def download(self, document):
        """you shouldn't need this for most web portals"""
        # ... save file to os.getcwd() ...
        return self.rename_after_download(document, filename)


@click.command()
@click.pass_context
def myplugin(ctx):
    """plugin description (what, documents, are, scraped)"""
    docdl.cli.run(ctx, MyPlugin)

and in setup.py:

# ...
setup(
    # ...
    packages=find_packages(),
    entry_points={
        'docdl_plugins': [
            # ...
            'myplugin=docdl.plugins.myplugin:myplugin',
            # ...
        ],
        # ...
    },
    # ...
)
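
For selenium-based plugins the skeleton looks the same, except that the class inherits from docdl.SeleniumWebPortal, drives self.webdriver instead of self.session and usually yields a download_element instead of a url. The following is a rough sketch in the style of the example above; the locators and lowercase placeholder variables are assumptions for illustration, not part of the real API:

import click

import docdl
import docdl.cli
import docdl.util


class MySeleniumPlugin(docdl.SeleniumWebPortal):

    URL_LOGIN = "https://myservice.example/login"

    def login(self):
        # open the login page in the selenium-controlled browser
        self.webdriver.get(self.URL_LOGIN)
        # ... fill in the login form, submit, wait for the portal to load ...
        return logged_in_successfully

    def logout(self):
        # ... click the logout button ...
        pass

    def documents(self):
        # ... navigate to the document list and collect its rows ...
        for count, row in enumerate(document_rows):
            yield docdl.Document(
                # clicking this element is expected to start the download
                download_element=row.find_element("css selector", "a.download"),
                attributes={
                    "id": count,
                    "title": row.text,
                    "date": docdl.util.parse_date(some_date_string)
                }
            )


@click.command()
@click.pass_context
def myseleniumplugin(ctx):
    """my selenium plugin (what, documents, are, scraped)"""
    docdl.cli.run(ctx, MySeleniumPlugin)

Such a plugin is registered in the docdl_plugins entry point exactly like the requests-based example above.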



Security

Beware that your login credentials are most probably saved in your shell history when you pass them as command-line arguments. Use the interactive input prompt to avoid that, or set the environment variables in a way that keeps them out of your history.
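
For example, in bash you can read the password into the environment variable that document-dl picks up, without the password value ever appearing in your history (a sketch; adapt to your shell):

$ read -rs DOCDL_PASSWORD && export DOCDL_PASSWORD
$ document-dl --username mylogin --action download conrad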



Bugs

document-dl is still in a very early state of development and a lot of things don't work yet. In particular, a ton of edge cases still need to be covered. If you find a bug, please open an issue or send a pull request.

  • --browser settings other than chrome probably don't work unless you help to test them
  • some services offer more documents/data than currently scraped



TODO

  • logging
  • better documentation
  • properly parse RFC 6266 (Content-Disposition)
  • delete action


Comments

  • Testing Firefox

    • Mac OS 11.5.2
    • Firefox 91.0.1 + Selenium

    I'm running this command:

     document-dl -b firefox -u NUMBER -p 'PASSWORD' --action download --jq 'contains({id: 0})' o2
    

    It starts a pure Firefox without anything installed (it's a new profile I think). After a while it exits and I'm getting these errors:

    Traceback (most recent call last):
      File "/usr/local/bin/document-dl", line 8, in <module>
        sys.exit(documentdl())
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1137, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1062, in main
        rv = self.invoke(ctx)
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1668, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/lib/python3.9/site-packages/click/core.py", line 763, in invoke
        return __callback(*args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "/usr/local/lib/python3.9/site-packages/docdl/plugins/o2.py", line 133, in o2
        docdl.cli.run(ctx, O2)
      File "/usr/local/lib/python3.9/site-packages/docdl/cli.py", line 150, in run
        plugin = plugin_class(
      File "/usr/local/lib/python3.9/site-packages/docdl/__init__.py", line 147, in __init__
        self._init_webdriver(webdriver_opts, arguments['webdriver'])
      File "/usr/local/lib/python3.9/site-packages/docdl/__init__.py", line 272, in _init_webdriver
        self.webdriver = webdriver.Firefox(
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/firefox/webdriver.py", line 190, in __init__
        executor = ExtensionConnection("127.0.0.1", self.profile,
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/firefox/extension_connection.py", line 52, in __init__
        self.binary.launch_browser(self.profile, timeout=timeout)
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/firefox/firefox_binary.py", line 73, in launch_browser
        self._wait_until_connectable(timeout=timeout)
      File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/firefox/firefox_binary.py", line 109, in _wait_until_connectable
        raise WebDriverException(
    selenium.common.exceptions.WebDriverException: Message: Can't load the profile. Possible firefox version mismatch. 
    You must use GeckoDriver instead for Firefox 48+. Profile Dir: /var/folders/td/x4r_b40s2r5bvlrlrby13ptw0000gn/T/tmp799ugx0k 
    If you specified a log_file in the FirefoxBinary constructor, check it for details.
    opened by CyrosX
  • amazon: add support for limiting to a single year

    It is already possible to filter for documents (or orders, in the Amazon context) of a certain year. Nevertheless, documents for all years will still be inspected, which can become time-consuming. This PR adds a filter so that processing itself is limited as well.

    Currently, it is only implemented for the Amazon plugin. A global "--year" switch might be well worth considering but would also involve API changes. Therefore I chose to only propose this first step.

    opened by sdx23
  • use selenium<4.3.0

    selenium 4.3.0 deprecated Opera and find_element_by_*, which are still used here, so we need to use an earlier version.

    selenium changelog: https://github.com/SeleniumHQ/selenium/blob/a4995e2c096239b42c373f26498a6c9bb4f2b3e7/py/CHANGES

    opened by cyroxx
  • new feature: preserve document time as file mtime

    Metadata is always nice to have, but having it, we should also use it. Specifically, set the file mtime to the document date. Actually, I even think that should be the default option. The download time is still available in other attributes, but I don't really see where it'd be important.

    This might just as well fit into rename_after_download which should then be postprocess_after_download, but I had no strong preference so it landed in cli.py.

    Up to now I didn't check whether the date attribute actually exists. I guess it should for any document.

    Regarding click options: I find it strange to supply "=true" to bool options. Imho such options should be simple switches that enable the non-default behaviour (e.g. also in --headless), as is common in a lot of other programs. However, for now I've stuck with the existing style.

    opened by sdx23
  • RFC: limited time-range

    Let me first say that I totally agree with the statement in #4 that we should take care not to clutter the namespace [1]. After all, I see unification/standardization as one big aspect of this project that distinguishes it from me hacking a short standalone selenium script for just the websites/services I need.

    As well, I do agree that adding a special "--year" switch or the like is redundant with the ability of jq querying (and that also put me at unease with my suggestion in #4). But I also see the point of dates being something special since they are (possibly the main) restriction on which documents to process.

    That is important on the one hand for speed / not doing a lot of useless work (see #4). This is relevant for regular downloading as mentioned in #1: when the script runs periodically once a month, it is surely fine to only (try to) download documents from within the last 60 days.

    On the other hand, the scraped website itself may raise that question (and that's why I bring up this topic again). I'm currently developing a plugin for smartbroker [2], which displays the postbox by default as a search form [3]. It allows selecting predefined ranges (the last x days with x in [10, 30, ... 360]) or alternatively specifying your own range. So I could either -- quite arbitrarily -- select "last 360 days" or do something (possibly stupid?) like setting the range to 1970-01-01 until today.

    Now this is specific to the plugin in question, and in principle I'd just leave it as is (360 days) and possibly add an option making all documents available by forcing "1970-01-01 until today". But from a user perspective it might get confusing what one must do for which plugin to behave as expected from experience with the others.

    Not sure whether I'm overthinking this; it can still be changed at some later point. Nevertheless I wanted to bring it to attention and ask for comments.

    [1] Sidenote (offtopic here): this raises the question whether short CLI options should be discouraged in plugins.
    [2] https://github.com/sdx23/document-dl/tree/smartbroker
    [3] screenshot: 2022-01-07-180639_624x359_scrot

    question
    opened by sdx23
  • Support for Remote selenium webdriver (Docker Version?)

    Is it possible to configure a remote server for doing all the Selenium/browser work? There is webdriver.Remote in Selenium's Python bindings to configure the IP/hostname, and https://hub.docker.com/u/selenium provides Docker images for all browsers.

    Thought about building a stack like this: a document-dl container and a selenium chrome container connected through a docker network, plus a volume bind for an invoice directory (auto sub-dirs for all invoice providers).

    Alternative: everything in one container (https://nander.cc/using-selenium-within-a-docker-container).

    enhancement
    opened by CyrosX