{"id":13910383,"url":"https://github.com/Disane87/docudigger","last_synced_at":"2025-07-18T09:31:18.487Z","repository":{"id":64419216,"uuid":"573151842","full_name":"Disane87/docudigger","owner":"Disane87","description":"Website scraper for getting invoices automagically as pdf (useful for taxes or DMS)","archived":false,"fork":false,"pushed_at":"2025-05-18T17:51:08.000Z","size":6744,"stargazers_count":79,"open_issues_count":35,"forks_count":10,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-06-19T01:49:37.519Z","etag":null,"topics":["dms","invoices","nodejs","scraping"],"latest_commit_sha":null,"homepage":"https://blog.disane.dev","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Disane87.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE.md","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":null,"patreon":null,"open_collective":null,"ko_fi":"disanedev","tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"lfx_crowdfunding":null,"custom":null}},"created_at":"2022-12-01T20:11:38.000Z","updated_at":"2025-06-18T21:33:52.000Z","dependencies_parsed_at":"2023-02-18T19:15:20.032Z","dependency_job_id":"f012a9f7-7e41-448d-b367-209c45b35dcf","html_url":"https://github.com/Disane87/docudigger","commit_stats":null,"previous_names":[],"tags_count":97,"template":false,"template_full_name":null,"purl":"pkg:github/Disane87/docudigger","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Disane87%2Fdocudigger","tags_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/Disane87%2Fdocudigger/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Disane87%2Fdocudigger/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Disane87%2Fdocudigger/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Disane87","download_url":"https://codeload.github.com/Disane87/docudigger/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Disane87%2Fdocudigger/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265733851,"owners_count":23819423,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dms","invoices","nodejs","scraping"],"created_at":"2024-08-07T00:01:16.005Z","updated_at":"2025-07-18T09:31:18.177Z","avatar_url":"https://github.com/Disane87.png","language":"TypeScript","funding_links":["https://ko-fi.com/disanedev"],"categories":["TypeScript"],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003eWelcome to docudigger 👋\u003c/h1\u003e\n\u003cp\u003e\n  \u003cimg alt=\"npm\" src=\"https://img.shields.io/npm/v/@disane-dev/docudigger/latest\"\u003e\n  \u003cimg alt=\"GitHub package.json dependency version (subfolder of monorepo)\" src=\"https://img.shields.io/github/package-json/dependency-version/Disane87/docudigger/puppeteer\"\u003e\n\n  \u003cimg src=\"https://img.shields.io/badge/npm-%3E%3D9.1.2-blue.svg\" /\u003e\n  \u003cimg src=\"https://img.shields.io/badge/node-%3E%3D18.12.1-blue.svg\" /\u003e\n  \u003ca href=\"#\" 
target=\"_blank\"\u003e\n    \u003cimg alt=\"License: MIT\" src=\"https://img.shields.io/badge/License-MIT-yellow.svg\" /\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/disane87/docudigger/pkgs/container/docudigger\" target=\"_blank\"\u003e\n    \u003cimg alt=\"Docker\" src=\"https://ghcr-badge.egpl.dev/disane87/docudigger/latest_tag?color=%2344cc11\u0026ignore=latest\u0026label=Docker\u0026trim=\" /\u003e\n  \u003c/a\u003e\n  \n\u003c/p\u003e\n\n\u003e Document scraper for getting invoices automagically as pdf (useful for taxes or DMS)\n\n### 🏠 [Homepage](https://repo.disane.dev/Disane/docudigger#readme)\n\n## Configuration\n\nAll settings can be changed via `CLI`, env variable (even when using docker).\n\n| Setting                 | Description                                                                                                                | Default value   |\n| ----------------------- | -------------------------------------------------------------------------------------------------------------------------- | --------------- |\n| AMAZON_USERNAME         | Your Amazon username                                                                                                       | `null`          |\n| AMAZON_PASSWORD         | Your amazon password                                                                                                       | `null`          |\n| AMAZON_TLD              | Amazon top level domain                                                                                                    | `de`            |\n| AMAZON_YEAR_FILTER      | Only extracts invoices from this year (i.e. 2023)                                                                          | `2023`          |\n| AMAZON_PAGE_FILTER      | Only extracts invoices from this page (i.e. 
2)                                                                             | `null`          |\n| ONLY_NEW                | Tracks already scraped documents and starts a new run at the last scraped one                                              | `true`          |\n| FILE_DESTINATION_FOLDER | Destination path for all scraped documents                                                                                 | `./documents/`  |\n| FILE_FALLBACK_EXTENSION | Fallback extension when no extension can be determined                                                                     | `.pdf`          |\n| DEBUG                   | Debug flag (sets the loglevel to DEBUG)                                                                                    | `false`         |\n| SUBFOLDER_FOR_PAGES     | Creates subfolders for every scraped page/plugin                                                                           | `false`         |\n| LOG_PATH                | Sets the log path                                                                                                          | `./logs/`       |\n| LOG_LEVEL               | Log level (see https://github.com/winstonjs/winston#logging-levels)                                                        | `info`          |\n| RECURRING               | Flag for executing the script periodically. Needs 'RECURRING_PATTERN' to be set. Default `true`when using docker container | `false`         |\n| RECURRING_PATTERN       | Cron pattern to execute periodically. 
Needs RECURRING to be true                                                           | `*/30 * * * *`  |\n| TZ                      | Timezone used for docker environments                                                                                      | `Europe/Berlin` |\n\n## Install\n\n```sh\nnpm install\n```\n\n## Usage\n\n\u003c!-- usage --\u003e\n```sh-session\n$ npm install -g @disane-dev/docudigger\n$ docudigger COMMAND\nrunning command...\n$ docudigger (--version)\n@disane-dev/docudigger/2.0.7 linux-x64 node-v20.18.0\n$ docudigger --help [COMMAND]\nUSAGE\n  $ docudigger COMMAND\n...\n```\n\u003c!-- usagestop --\u003e\n\n\u003e [!IMPORTANT]  \n\u003e Don't forget to include `--ignore-scripts` in your install command.\n\n## `docudigger scrape all`\n\nScrapes all websites periodically (default for docker environment)\n\n```\nUSAGE\n  $ docudigger scrape all [--json] [--logLevel trace|debug|info|warn|error] [-d] [-l \u003cvalue\u003e] [-c \u003cvalue\u003e -r]\n\nFLAGS\n  -c, --recurringCron=\u003cvalue\u003e  [default: * * * * *] Cron pattern to execute periodically\n  -d, --debug\n  -l, --logPath=\u003cvalue\u003e        [default: ./logs/] Log path\n  -r, --recurring\n  --logLevel=\u003coption\u003e          [default: info] Specify level for logging.\n                               \u003coptions: trace|debug|info|warn|error\u003e\n\nGLOBAL FLAGS\n  --json  Format output as json.\n\nDESCRIPTION\n  Scrapes all websites periodically\n\nEXAMPLES\n  $ docudigger scrape all\n```\n\n## `docudigger scrape amazon`\n\nUsed to get invoices from amazon\n\n```\nUSAGE\n  $ docudigger scrape amazon -u \u003cvalue\u003e -p \u003cvalue\u003e [--json] [--logLevel trace|debug|info|warn|error] [-d] [-l\n    \u003cvalue\u003e] [-c \u003cvalue\u003e -r] [--fileDestinationFolder \u003cvalue\u003e] [--fileFallbackExentension \u003cvalue\u003e] [-t \u003cvalue\u003e]\n    [--yearFilter \u003cvalue\u003e] [--pageFilter \u003cvalue\u003e] [--onlyNew]\n\nFLAGS\n  -c, 
--recurringCron=\u003cvalue\u003e        [default: */30 * * * *] Cron pattern to execute periodically\n  -d, --debug\n  -l, --logPath=\u003cvalue\u003e              [default: ./logs/] Log path\n  -p, --password=\u003cvalue\u003e             (required) Password\n  -r, --recurring\n  -t, --tld=\u003cvalue\u003e                  [default: de] Amazon top level domain\n  -u, --username=\u003cvalue\u003e             (required) Username\n  --fileDestinationFolder=\u003cvalue\u003e    [default: ./data/] Amazon top level domain\n  --fileFallbackExentension=\u003cvalue\u003e  [default: .pdf] Amazon top level domain\n  --logLevel=\u003coption\u003e                [default: info] Specify level for logging.\n                                     \u003coptions: trace|debug|info|warn|error\u003e\n  --onlyNew                          Gets only new invoices\n  --pageFilter=\u003cvalue\u003e               Filters a page\n  --yearFilter=\u003cvalue\u003e               Filters a year\n\nGLOBAL FLAGS\n  --json  Format output as json.\n\nDESCRIPTION\n  Used to get invoices from amazon\n\n  Scrapes amazon invoices\n\nEXAMPLES\n  $ docudigger scrape amazon\n```\n\n## Docker\n\n```sh\ndocker run \\\n  -e AMAZON_USERNAME='[YOUR MAIL]' \\\n  -e AMAZON_PASSWORD='[YOUR PW]' \\\n  -e AMAZON_TLD='de' \\\n  -e AMAZON_YEAR_FILTER='2024' \\\n  -e AMAZON_PAGE_FILTER='1' \\\n  -e LOG_LEVEL='info' \\\n  -v \"C:/temp/docudigger/:/home/node/docudigger\" \\\n  ghcr.io/disane87/docudigger\n```\n\n## Dev-Time 🪲\n\n### NPM\n\n```npm\nnpm install\n[Change created .env for your needs]\nnpm run start\n```\n\n## Author\n\n👤 **Marco Franke**\n\n- Website: http://byte-style.de\n- Github: [@Disane87](https://github.com/Disane87)\n- LinkedIn: [@marco-franke-799399136](https://linkedin.com/in/marco-franke-799399136)\n\n## 🤝 Contributing\n\nContributions, issues and feature requests are welcome!\u003cbr /\u003eFeel free to check [issues page](https://repo.disane.dev/Disane/docudigger/issues). 
You can also take a look at the [contributing guide](https://repo.disane.dev/Disane/docudigger/blob/master/CONTRIBUTING.md).\n\n## Show your support\n\nGive a ⭐️ if this project helped you!\n\n---\n\n_This README was generated with ❤️ by [readme-md-generator](https://github.com/kefranabg/readme-md-generator)_\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDisane87%2Fdocudigger","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FDisane87%2Fdocudigger","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDisane87%2Fdocudigger/lists"}