https://github.com/ovh/website-evidence-collector-batch
A tool to launch website-evidence-collector on several URLs or Sitemaps and generate a full report.
- Host: GitHub
- URL: https://github.com/ovh/website-evidence-collector-batch
- Owner: ovh
- License: eupl-1.2
- Created: 2020-02-18T15:13:50.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2023-07-18T20:58:55.000Z (over 1 year ago)
- Last Synced: 2024-05-09T23:44:34.757Z (8 months ago)
- Topics: cookies, rgpd
- Language: JavaScript
- Homepage:
- Size: 134 KB
- Stars: 14
- Watchers: 14
- Forks: 5
- Open Issues: 8
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
# website-evidence-collector-batch
A tool to launch [website-evidence-collector](https://github.com/EU-EDPS/website-evidence-collector) on several URLs or Sitemaps and generate a full report.
## Prerequisites
You need to have [website-evidence-collector](https://github.com/EU-EDPS/website-evidence-collector) installed on your machine.
See [installation guide](https://github.com/EU-EDPS/website-evidence-collector#installation).
## Install
```bash
$ npm install -g git+https://github.com/ovh/website-evidence-collector-batch.git
```
## Usage
```bash
$ website-evidence-collector-batch --config="/path/to/config/file"
```
Your results will be stored in the output folder, organized as follows:
* **full_results**: the individual report for each page (JSON and HTML)
* **report.html**: the full HTML report of all pages
* **report.json**: the full JSON report of all pages
* **report_simplified.json**: the simplified report of all pages (with only the list of cookies/localStorage/beacons)
## Configuration
Create a config file with the following configuration:
```yaml
output: '/path/to/output/folder'            # (required) Path to the output folder
workers: 4                                  # (optional) Number of concurrent workers (default is the CPU count)
dnt: true                                   # (optional) Set Do-Not-Track (default is false)
firstPartyUri: 'https://ovhcloud.com/fr/'   # (required) First Party URI
urls:                                       # (required unless `sitemaps` is set) List of URLs to grab
  - 'https://ovhcloud.com/fr/url1'
  - 'https://ovhcloud.com/fr/url2'
sitemaps:                                   # (required unless `urls` is set) Sitemaps list containing URLs to grab (can be files or URLs)
  - url: 'https://ovhcloud.com/fr/sitemap.xml'
    exclude: '/^exclude/these/url$/'
  - file: '/path/to/sitemap_custom.xml'
setCookie: cookies.txt                      # (optional) --set-cookie option to be passed to website-evidence-collector
                                            # see https://github.com/EU-EDPS/website-evidence-collector/blob/master/FAQ.md#how-do-i-gather-evidence-with-given-consent
```
You must provide at least one item in `urls` and/or `sitemaps`.
You can create your config file in **JSON** or **YAML** format.
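As a rough sketch of the JSON variant, the following Node.js script builds a config object from the keys documented above, checks the "at least one item in `urls` and/or `sitemaps`" rule, and writes the result to a file. The helper names `validateConfig` and `writeConfig` and the file name `wec-batch.config.json` are hypothetical, and the checks only mirror the rules stated in this README; the tool's own validation may differ.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: checks only the rules stated in this README.
// Returns a list of error messages (empty when the config looks valid).
function validateConfig(config) {
  const errors = [];
  if (!config.output) errors.push('"output" is required');
  if (!config.firstPartyUri) errors.push('"firstPartyUri" is required');
  const urlCount = Array.isArray(config.urls) ? config.urls.length : 0;
  const sitemapCount = Array.isArray(config.sitemaps) ? config.sitemaps.length : 0;
  if (urlCount + sitemapCount === 0) {
    errors.push('provide at least one item in "urls" and/or "sitemaps"');
  }
  return errors;
}

// Hypothetical helper: serializes the config as JSON and returns the file path.
function writeConfig(config, filePath) {
  const errors = validateConfig(config);
  if (errors.length > 0) throw new Error(errors.join('; '));
  fs.writeFileSync(filePath, JSON.stringify(config, null, 2) + '\n');
  return filePath;
}

// Same values as the YAML example above, with a single sitemap entry.
const config = {
  output: '/path/to/output/folder',
  workers: 4,
  dnt: true,
  firstPartyUri: 'https://ovhcloud.com/fr/',
  urls: ['https://ovhcloud.com/fr/url1', 'https://ovhcloud.com/fr/url2'],
  sitemaps: [{ file: '/path/to/sitemap_custom.xml' }],
};

const configPath = writeConfig(config, path.join(os.tmpdir(), 'wec-batch.config.json'));
console.log(`Config written to ${configPath}`);
```

The resulting file can then be passed to the tool with `--config`, as in the Usage section.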
## FAQ
### Why do you launch multiple parallel instances of the tool instead of using the `--browse-link` parameter?
You can use the parameter [--browse-link](https://github.com/EU-EDPS/website-evidence-collector/blob/master/lib/argv.js#L35) to launch the tool on a set of URLs.
In this case, the URLs are browsed one after another, which takes a long time.
This tool will launch multiple instances in parallel, and then merge the results into one report.
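The idea of running several collections at once and then merging them can be illustrated with a small worker pool. This is only a sketch under stated assumptions, not the tool's actual implementation: `runInParallel`, `fakeCollect` and `mergeReports` are hypothetical names, and `fakeCollect` uses a timer as a stand-in for one website-evidence-collector run.

```javascript
// Hypothetical sketch: apply `task` to every item, with at most `workers`
// tasks in flight at any time. Results keep the order of `items`.
async function runInParallel(items, workers, task) {
  const results = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index.
  async function worker() {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index]);
    }
  }

  const poolSize = Math.max(1, Math.min(workers, items.length));
  await Promise.all(Array.from({ length: poolSize }, () => worker()));
  return results;
}

// Stand-in for a single collector run (assumption: the real tool
// launches website-evidence-collector here instead of a timer).
function fakeCollect(url) {
  return new Promise((resolve) => {
    setTimeout(() => resolve({ url, cookies: [] }), 10);
  });
}

// Hypothetical merge step: combine the per-page results into one summary.
function mergeReports(reports) {
  return { pages: reports.length, urls: reports.map((r) => r.url) };
}

const urls = [
  'https://ovhcloud.com/fr/url1',
  'https://ovhcloud.com/fr/url2',
  'https://ovhcloud.com/fr/url3',
];

const merged = runInParallel(urls, 4, fakeCollect).then(mergeReports);
merged.then((report) => console.log(report));
```

With a pool like this, total time is bounded by the slowest worker's queue rather than by the sum of all page visits, which is consistent with the speedup reported for this tool.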
As an example, on a set of **100** URLs, our benchmark showed this approach to be about **3x** faster:
| Tool                                                  | Duration  |
|-------------------------------------------------------|-----------|
| website-evidence-collector with --browse-link         | ~13min41s |
| website-evidence-collector-batch (4 CPUs / 4 workers) | ~04min38s |
## Credits
This tool is based on the great tool from @rriemann-eu: [website-evidence-collector](https://github.com/EU-EDPS/website-evidence-collector).
## License
This tool is licensed under the same license as [website-evidence-collector](https://github.com/EU-EDPS/website-evidence-collector/blob/master/LICENSE.txt). See the [LICENSE.txt](LICENSE.txt) file.