{"id":20819358,"url":"https://github.com/heeplr/document-dl","last_synced_at":"2025-05-07T15:22:46.327Z","repository":{"id":45103035,"uuid":"384407986","full_name":"heeplr/document-dl","owner":"heeplr","description":"Command line program to download documents from web portals","archived":false,"fork":false,"pushed_at":"2023-11-06T04:58:43.000Z","size":237,"stargazers_count":22,"open_issues_count":4,"forks_count":3,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-31T11:21:17.374Z","etag":null,"topics":["document-dl","plaintext-accounting","python","scraper","scraping","scraping-websites","selenium"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"unlicense","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/heeplr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-07-09T10:52:10.000Z","updated_at":"2024-12-30T22:26:11.000Z","dependencies_parsed_at":"2023-01-31T00:00:47.999Z","dependency_job_id":null,"html_url":"https://github.com/heeplr/document-dl","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heeplr%2Fdocument-dl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heeplr%2Fdocument-dl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heeplr%2Fdocument-dl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/heeplr%2Fdocument-dl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/heeplr","download_url":"https://codeload.github.com/heeplr/document-dl/tar.gz/refs
/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252903005,"owners_count":21822353,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-dl","plaintext-accounting","python","scraper","scraping","scraping-websites","selenium"],"created_at":"2024-11-17T22:06:09.877Z","updated_at":"2025-05-07T15:22:46.305Z","avatar_url":"https://github.com/heeplr.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n----\n# command line document download made easy\n[![Pylint](https://github.com/heeplr/document-dl/actions/workflows/pylint.yml/badge.svg)](https://github.com/heeplr/document-dl/actions/workflows/pylint.yml)\n[![flake8](https://github.com/heeplr/document-dl/actions/workflows/flake8.yml/badge.svg)](https://github.com/heeplr/document-dl/actions/workflows/flake8.yml)\n----\n\nLike [youtube-dl](https://youtube-dl.org/) can download videos from various\nwebsites, document-dl can download documents like invoices, messages, reports, etc.\n\nIt can save you from regularly logging into your account to download new\ndocuments.\n\nWebsites that don't require any form of 2FA can be polled without interaction\nregularly using a cron job so documents are downloaded automatically.\n\n\u003cbr\u003e\n\n## Highlights\n\n* list available documents in json format or download them\n* filter documents using\n  * **string matching**\n  * **regular expressions** or\n  * **[jq queries](https://stedolan.github.io/jq/manual/)**\n* display captcha or QR codes for interactive input\n* writing new plugins is easy\n* 
existing plugins (some of them even work):\n  * amazon\n  * ing.de\n  * handyvertrag.de\n  * dkb.de\n  * o2.de\n  * www.vodafone.de\n  * conrad.de\n  * elster.de\n  * strato.de\n\n\n\u003cbr\u003e\u003cbr\u003e\n## Dependencies\n* [python](https://python.org)\n* [click](https://github.com/pallets/click)\n* [click-plugins](https://github.com/click-contrib/click-plugins)\n* [jq](https://github.com/mwilliamson/jq.py)\n* [python-dateutil](https://dateutil.readthedocs.io/en/stable/)\n* [requests](https://docs.python-requests.org/en/master/)\n* [selenium](https://selenium-python.readthedocs.io/) (default webdriver is \"chrome\")\n* [slugify](https://github.com/un33k/python-slugify)\n* [watchdog](https://github.com/gorakhargosh/watchdog)\n\n\u003cbr\u003e\u003cbr\u003e\n## Installation (for debian bullseye)\n\n```sh\n$ apt install git python3-dev python3-pip python3-selenium chromium-chromedriver\n$ pip3 install --user git+https://github.com/heeplr/document-dl.git\n```\n\nor for developers:\n\n```sh\n$ git clone --recursive https://github.com/heeplr/document-dl\n$ cd document-dl\n$ pip install --user --editable .\n```\n\n\u003cbr\u003e\u003cbr\u003e\n## Usage\n\nDisplay Help:\n\n```sh\n$ document-dl -h\nUsage: document-dl [OPTIONS] COMMAND [ARGS]...\n\n  download documents from web portals\n\nOptions:\n  -u, --username TEXT             login id  [env var: DOCDL_USERNAME]\n  -p, --password TEXT             secret password  [env var: DOCDL_PASSWORD]\n  -m, --match \u003cATTRIBUTE PATTERN\u003e...\n                                  only output documents where attribute\n                                  contains pattern string  [env var:\n                                  DOCDL_STRING_MATCHES]\n  -r, --regex \u003cATTRIBUTE REGEX\u003e...\n                                  only output documents where attribute value\n                                  matches regex  [env var:\n                                  DOCDL_REGEX_MATCHES]\n  -j, --jq JQ_EXPRESSION          only 
output documents if json query matches\n                                  document's attributes (see\n                                  https://stedolan.github.io/jq/manual/ )\n                                  [env var: DOCDL_JQ_MATCHES]\n  -H, --headless / --show         show/hide browser window  [env var:\n                                  DOCDL_HEADLESS; default: headless]\n  -b, --browser [chrome|edge|firefox|ie|safari|webkitgtk]\n                                  webdriver to use for selenium based plugins\n                                  [env var: DOCDL_BROWSER; default: chrome]\n  -t, --timeout INTEGER           seconds to wait for data before terminating\n                                  connection  [env var: DOCDL_TIMEOUT;\n                                  default: 25]\n  -i, --image-loading BOOLEAN     Turn off image loading when False  [env var:\n                                  DOCDL_IMAGE_LOADING; default: False]\n  -l, --list                      list documents  [env var: DOCDL_ACTION;\n                                  default: list]\n  -d, --download                  download documents  [env var: DOCDL_ACTION;\n                                  default: list]\n  -f, --format [list|dicts]       choose between line buffered output of json\n                                  dicts or single json list  [env var:\n                                  DOCDL_OUTPUT_FORMAT; default: dicts]\n  -D, --debug                     use selenium remote debugging on port 9222\n                                  [env var: DOCDL_DEBUG]\n  -h, --help                      Show this message and exit.\n\nCommands:\n  amazon        amazon.com (invoices)\n  believe       believebackstage.com (financial reports + catalog export)\n  conrad        conrad.de (invoices)\n  dkb           dkb.de with chipTAN QR (postbox)\n  elster        elster.de with path to .pfx certfile as username (postbox)\n  handyvertrag  service.handyvertrag.de (invoices, call record)\n  ing           
banking.ing.de with photoTAN (postbox)\n  o2            o2online.de (invoices, call record, postbox)\n  strato        strato.de (invoices)\n  vodafone      www.vodafone.de (invoices)\n```\n\nDisplay plugin-specific help:\n(**currently there is a [bug in click](https://github.com/pallets/click/issues/1369)\nthat prompts for username and password before displaying the help**)\n\n```\n$ document-dl ing --help\nUsage: document-dl ing [OPTIONS]\n\n  banking.ing.de with photoTAN (postbox)\n\nOptions:\n  -k, --diba-key TEXT  DiBa Key  [env var: DOCDL_DIBA_KEY]\n  -h, --help           Show this message and exit.\n```\n\n\u003cbr\u003e\u003cbr\u003e\n## Examples\n\nList all documents from vodafone.de, prompt for username/password:\n```sh\n$ document-dl vodafone\n```\n\nSame, but show browser window this time:\n```sh\n$ document-dl --show vodafone\n```\n\nDownload all documents from conrad.de, pass credentials as commandline arguments:\n```sh\n$ document-dl --username mylogin --password mypass --action download conrad\n```\n\nDownload all documents from conrad.de, pass credentials as env vars:\n```sh\n$ DOCDL_USERNAME='mylogin' DOCDL_PASSWORD='mypass' document-dl --action download conrad\n```\n\nDownload all documents from o2online.de where \"category\" attribute contains \"BILL\":\n```sh\n$ document-dl --match category BILL --action download o2\n```\n\nYou can also use regular expressions to filter documents:\n```sh\n$ document-dl --regex date '^(2021-04|2021-05).*$' o2\n```\n\nList all documents from o2online.de where year \u003e= 2019:\n```sh\n$ document-dl --jq 'select(.year \u003e= 2019)' o2\n```\n\nDownload document from elster.de with id == 15:\n```sh\n$ document-dl --jq 'contains({id: 15})' --action download elster\n```\n\nYou can create a config file ```.o2_documentdlrc``` like 
so:\n```sh\nDOCDL_PLUGIN=\"o2\"\nDOCDL_USERNAME=\"01771234567\"\nDOCDL_PASSWORD=\"super-secret-password\"\nDOCDL_ACTION=\"download\"\nDOCDL_DSTPATH=\"${HOME}/Documents/o2\"\nDOCDL_TIMEOUT=\"30\"\n```\n\nthen invoke document-dl in a script like so:\n\n```sh\n#!/bin/bash\n\nCONFIG=\"${HOME}/.config/.o2_documentdlrc\"\n\n# load config (abort if it cannot be read)\nset -a\n. \"${CONFIG}\" || { echo \"error parsing config ${CONFIG}\" \u003e\u00262; exit 1; }\nset +a\n# cd to target dir\ncd \"${DOCDL_DSTPATH}\" || exit 1\n# download documents\n/usr/bin/document-dl \"${DOCDL_PLUGIN}\"\n```\n\n\n\u003cbr\u003e\u003cbr\u003e\n## Security\nBEWARE that your login credentials are most probably **saved in your shell\nhistory when you pass them as commandline arguments**.\nYou can use the input prompt to avoid that or set environment variables\nsecurely.\nMake sure to set secure permissions when saving credentials on a trusted system (e.g. ```chmod 0600 \u003cfile\u003e```).\n\n\u003cbr\u003e\u003cbr\u003e\n## Writing a plugin\n\nPlugins are [click-plugins](https://github.com/click-contrib/click-plugins), which\nin turn are normal @click.command functions that are registered in setup.py.\n\nRoughly, you have to:\n\n* put your plugin into *\"docdl/plugins/myplugin.py\"*\n* write your plugin class, e.g. 
MyPlugin():\n  * if you just need python requests, inherit from ```docdl.WebPortal``` and use\n    ```self.session``` that's initialized for you\n  * if you need selenium, inherit from ```docdl.SeleniumWebPortal``` and use\n    ```self.webdriver``` that's initialized for you\n  * add a\n    * login() method,\n    * logout() method and\n    * documents() generator that yields ```docdl.Document()``` instances\n    * optional: download() method if you need to do more fancy stuff than downloading a URL and saving it to a file\n* add click glue code\n* add your plugin to setup.py docdl_plugins registry\n\nCheck out the other plugins for examples.\n\n### requests plugin example\n\n```python\nimport click\nimport docdl\nimport docdl.util\n\nclass MyPlugin(docdl.WebPortal):\n\n    URL_LOGIN = \"https://myservice.com/login\"\n    URL_LOGOUT = \"https://myservice.com/logout\"\n\n    def login(self):\n        # maybe load some session cookie\n        request = self.session.get(self.URL_LOGIN)\n        # authenticate\n        request = self.session.post(\n            self.URL_LOGIN,\n            data={ 'username': self.username, 'password': self.password }\n        )\n        # return false if login failed, true otherwise\n        if not request.ok:\n            return False\n        return True\n\n    def logout(self):\n        request = self.session.get(self.URL_LOGOUT)\n\n    def documents(self):\n        # acquire list of documents\n        # ...\n\n        # iterate over all available documents\n        for count, document in enumerate(all_documents):\n\n            # scrape:\n            #  * document attributes\n            #    * it's recommended to assign an incremental \"id\"\n            #      attribute to every document\n            #    * if you set a \"filename\" attribute, it will be used to\n            #      rename the downloaded file\n            #    * dates should be parsed to datetime.datetime objects\n            #      docdl.util.parse_date() should parse the most common 
strings\n            #\n            # also you must scrape either:\n            #  * the download URL\n            #\n            # or (for SeleniumWebPortal plugins):\n            #  * the DOM element that triggers download. It is expected\n            #    that the download starts immediately after click() on\n            #    the DOM element\n            # or implement a custom download() method\n\n            yield docdl.Document(\n                url=this_documents_url,\n                # download_element = \u003csome selenium element to click\u003e\n                attributes={\n                    \"id\": count,\n                    \"category\": \"invoices\",\n                    \"title\": this_documents_title,\n                    \"filename\": this_documents_target_filename,\n                    \"date\": docdl.util.parse_date(some_date_string)\n                }\n            )\n\n\n    def download(self, document):\n        \"\"\"you shouldn't need this for most web portals\"\"\"\n        # ... save file to os.getcwd() ...\n        return self.rename_after_download(document, filename)\n\n\n@click.command()\n@click.pass_context\ndef myplugin(ctx):\n    \"\"\"plugin description (what, documents, are, scraped)\"\"\"\n    docdl.cli.run(ctx, MyPlugin)\n\n```\n\n### selenium plugin example\n\nTBD\n\n\n### register plugin\n\n...in setup.py:\n\n```\n# ...\nsetup(\n    # ...\n    packages=find_packages(\n        # ...\n    ),\n    entry_points={\n        'docdl_plugins': [\n            # ...\n            'myplugin=docdl.plugins.myplugin:myplugin',\n            # ...\n        ],\n        # ...\n    }\n)\n```\n\n\n\u003cbr\u003e\u003cbr\u003e\n## Bugs\ndocument-dl is still in a very early stage of development and a lot of\nthings don't work yet. 
In particular, many edge cases still need to be\ncovered.\nIf you find a bug, please [open an issue](https://github.com/heeplr/document-dl/issues)\nor send a pull request.\n\n* --browser settings besides **chrome** probably don't work unless you help to test them\n* some services offer more documents/data than are currently scraped\n\n\n\u003cbr\u003e\u003cbr\u003e\n## TODO\n* logging\n* better documentation\n* properly parse RFC 6266\n* delete action\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fheeplr%2Fdocument-dl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fheeplr%2Fdocument-dl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fheeplr%2Fdocument-dl/lists"}