Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Disane87/docudigger
Website scraper for getting invoices automagically as pdf (useful for taxes or DMS)
- Host: GitHub
- URL: https://github.com/Disane87/docudigger
- Owner: Disane87
- License: mit
- Created: 2022-12-01T20:11:38.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-04-13T12:09:08.000Z (7 months ago)
- Last Synced: 2024-04-14T00:38:37.139Z (7 months ago)
- Topics: dms, invoices, nodejs, scraping
- Language: TypeScript
- Homepage: https://blog.disane.dev
- Size: 5.63 MB
- Stars: 29
- Watchers: 2
- Forks: 6
- Open Issues: 17
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Funding: .github/FUNDING.yml
- License: LICENSE.md
- Code of conduct: CODE_OF_CONDUCT.md
README
Welcome to docudigger 👋
> Document scraper for getting invoices automagically as pdf (useful for taxes or DMS)
### 🏠 [Homepage](https://repo.disane.dev/Disane/docudigger#readme)
## Configuration
All settings can be changed via the CLI or environment variables (even when using Docker).
| Setting | Description | Default value |
| ----------------------- | -------------------------------------------------------------------------------------------------------------------------- | --------------- |
| AMAZON_USERNAME | Your Amazon username | `null` |
| AMAZON_PASSWORD | Your Amazon password | `null` |
| AMAZON_TLD | Amazon top level domain | `de` |
| AMAZON_YEAR_FILTER | Only extracts invoices from this year (e.g. 2023) | `2023` |
| AMAZON_PAGE_FILTER | Only extracts invoices from this page (e.g. 2) | `null` |
| ONLY_NEW | Tracks already scraped documents and starts a new run at the last scraped one | `true` |
| FILE_DESTINATION_FOLDER | Destination path for all scraped documents | `./documents/` |
| FILE_FALLBACK_EXTENSION | Fallback extension when no extension can be determined | `.pdf` |
| DEBUG | Debug flag (sets the loglevel to DEBUG) | `false` |
| SUBFOLDER_FOR_PAGES | Creates subfolders for every scraped page/plugin | `false` |
| LOG_PATH | Sets the log path | `./logs/` |
| LOG_LEVEL | Log level (see https://github.com/winstonjs/winston#logging-levels) | `info` |
| RECURRING | Flag for executing the script periodically. Needs `RECURRING_PATTERN` to be set. Defaults to `true` when using the Docker container | `false` |
| RECURRING_PATTERN | Cron pattern for periodic execution. Needs `RECURRING` to be set to `true` | `*/30 * * * *` |
| TZ | Timezone used for Docker environments | `Europe/Berlin` |
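For example, these settings could be collected in an `.env` file and passed to the container or the local run. This is a minimal sketch; the variable names come from the table above, but all values shown here are placeholders, not shipped defaults:

```sh
# .env – example values only, adjust to your account and paths
AMAZON_USERNAME=you@example.com
AMAZON_PASSWORD=your-password
AMAZON_TLD=de
AMAZON_YEAR_FILTER=2023
FILE_DESTINATION_FOLDER=./documents/
LOG_LEVEL=info
```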
## Install

```sh
npm install
```

## Usage
```sh-session
$ npm install -g @disane-dev/docudigger
$ docudigger COMMAND
running command...
$ docudigger (--version)
@disane-dev/docudigger/2.0.6 linux-x64 node-v20.13.1
$ docudigger --help [COMMAND]
USAGE
$ docudigger COMMAND
...
```

## `docudigger scrape all`
Scrapes all websites periodically (default for docker environment)
```
USAGE
  $ docudigger scrape all [--json] [--logLevel trace|debug|info|warn|error] [-d] [-l <value>] [-c <value> -r]

FLAGS
  -c, --recurringCron=<value>  [default: * * * * *] Cron pattern to execute periodically
  -d, --debug
  -l, --logPath=<value>        [default: ./logs/] Log path
  -r, --recurring
      --logLevel=<value>       [default: info] Specify level for logging.
GLOBAL FLAGS
  --json  Format output as json.

DESCRIPTION
  Scrapes all websites periodically

EXAMPLES
$ docudigger scrape all
```
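A recurring run could combine the flags listed above; this is only a sketch, and the cron pattern and log level are example values:

```sh
# Scrape all supported sites every 30 minutes with debug logging (example values)
docudigger scrape all -r -c "*/30 * * * *" --logLevel debug -l ./logs/
```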
## `docudigger scrape amazon`

Used to get invoices from Amazon
```
USAGE
  $ docudigger scrape amazon -u <value> -p <value> [--json] [--logLevel trace|debug|info|warn|error] [-d]
    [-l <value>] [-c <value> -r] [--fileDestinationFolder <value>] [--fileFallbackExentension <value>] [-t <value>]
    [--yearFilter <value>] [--pageFilter <value>] [--onlyNew]

FLAGS
  -c, --recurringCron=<value>            [default: * * * * *] Cron pattern to execute periodically
  -d, --debug
  -l, --logPath=<value>                  [default: ./logs/] Log path
  -p, --password=<value>                 (required) Password
  -r, --recurring
  -t, --tld=<value>                      [default: de] Amazon top level domain
  -u, --username=<value>                 (required) Username
      --fileDestinationFolder=<value>    [default: ./data/] Destination path for all scraped documents
      --fileFallbackExentension=<value>  [default: .pdf] Fallback extension when no extension can be determined
      --logLevel=<value>                 [default: info] Specify level for logging.
      --onlyNew                          Gets only new invoices
      --pageFilter=<value>               Filters a page
      --yearFilter=<value>               Filters a year

GLOBAL FLAGS
  --json  Format output as json.

DESCRIPTION
  Used to get invoices from Amazon

  Scrapes Amazon invoices
EXAMPLES
$ docudigger scrape amazon
```
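A one-off Amazon run for a single year might be invoked like this; it is a sketch using the documented flags, and the credentials and filter values are placeholders:

```sh
# Fetch only new 2023 invoices from amazon.de (placeholder credentials)
docudigger scrape amazon -u "you@example.com" -p "your-password" -t de --yearFilter 2023 --onlyNew
```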
## Docker

```sh
docker run \
-e AMAZON_USERNAME='[YOUR MAIL]' \
-e AMAZON_PASSWORD='[YOUR PW]' \
-e AMAZON_TLD='de' \
-e AMAZON_YEAR_FILTER='2020' \
-e AMAZON_PAGE_FILTER='1' \
-e LOG_LEVEL='info' \
-v "C:/temp/docudigger/:/home/node/docudigger" \
ghcr.io/disane87/docudigger
```
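Since `RECURRING` defaults to `true` in the container, a long-running setup only needs a cron pattern and a mounted document folder. A sketch; the schedule and host path are examples, not required values:

```sh
# Run detached and scrape every night at 03:00 (example schedule and host path)
docker run -d \
  -e AMAZON_USERNAME='[YOUR MAIL]' \
  -e AMAZON_PASSWORD='[YOUR PW]' \
  -e RECURRING_PATTERN='0 3 * * *' \
  -e TZ='Europe/Berlin' \
  -v "/srv/docudigger/:/home/node/docudigger" \
  ghcr.io/disane87/docudigger
```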
## Dev-Time 🪲

### NPM

```sh
npm install
# Change the created .env to your needs
npm run start
```
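Instead of editing `.env`, the settings can also be passed inline for a quick local test; this is a sketch with placeholder credentials:

```sh
# One-off local run with inline settings (placeholder values)
AMAZON_USERNAME=you@example.com AMAZON_PASSWORD=your-password DEBUG=true npm run start
```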
## Author

👤 **Marco Franke**
- Website: http://byte-style.de
- Github: [@Disane87](https://github.com/Disane87)
- LinkedIn: [@marco-franke-799399136](https://linkedin.com/in/marco-franke-799399136)

## 🤝 Contributing
Contributions, issues and feature requests are welcome!
Feel free to check the [issues page](https://repo.disane.dev/Disane/docudigger/issues). You can also take a look at the [contributing guide](https://repo.disane.dev/Disane/docudigger/blob/master/CONTRIBUTING.md).

## Show your support
Give a ⭐️ if this project helped you!
---
_This README was generated with ❤️ by [readme-md-generator](https://github.com/kefranabg/readme-md-generator)_