https://github.com/stratosphereips/collectress
Collectress (/kəˈlɛktɹɪs/) is a Python tool designed for downloading web data feeds periodically and consistently.
- Host: GitHub
- URL: https://github.com/stratosphereips/collectress
- Owner: stratosphereips
- License: gpl-2.0
- Created: 2023-07-16T19:30:35.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2024-12-27T20:07:20.000Z (over 1 year ago)
- Last Synced: 2025-09-05T15:03:51.592Z (9 months ago)
- Topics: feeds, feeds-downloader, threat-intelligence, web-downloader
- Language: Python
- Size: 111 KB
- Stars: 5
- Watchers: 3
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: docs/CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Citation: CITATION.cff
- Security: SECURITY.md
Collectress is a Python tool for downloading web data feeds periodically and consistently. The feeds to download are listed in a YAML feed file. Each feed gets its own directory, and within it the downloaded data is stored in subdirectories named after the current date.
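The per-feed, per-date layout can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `dated_feed_dir` and the feed name `example_feed` are hypothetical, and the `YYYY/MM/DD` nesting is assumed from the feature list.

```python
from datetime import date
from pathlib import Path


def dated_feed_dir(workdir: str, feed_name: str, day: date) -> Path:
    """Return <workdir>/<feed_name>/YYYY/MM/DD for the given day (hypothetical helper)."""
    # date objects accept strftime codes in format specs, so the month and day are zero-padded
    return Path(workdir) / feed_name / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"


# Build the target directory for today's download of a feed named "example_feed".
target = dated_feed_dir("data_feeds", "example_feed", date.today())
print(target.as_posix())
```

Calling `target.mkdir(parents=True, exist_ok=True)` before writing would create the missing directories in one step.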
## Features
- Downloads content from multiple feeds specified in a YAML file
- Creates a directory for each feed
- Content stored in a date-structured directory format (YYYY/MM/DD)
- Handles errors gracefully, allowing the tool to continue even if a single operation fails
- Command-line arguments for input, output, and cache
- Download optimisation through an ETag cache
- Logs a JSON-formatted comprehensive activity summary per script run
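The ETag-based optimisation can be illustrated with a conditional HTTP request. The sketch below is not taken from the Collectress source. It assumes the cache file is a flat JSON object mapping each feed URL to its last seen ETag, and the helper names `load_cache`, `conditional_headers`, and `fetch_if_changed` are hypothetical.

```python
import json
import urllib.error
import urllib.request
from pathlib import Path
from typing import Optional


def load_cache(path: str) -> dict:
    """Load the URL -> ETag mapping (assumed format), or start empty if missing or blank."""
    cache_file = Path(path)
    if not cache_file.exists() or not cache_file.read_text().strip():
        return {}
    return json.loads(cache_file.read_text())


def conditional_headers(url: str, cache: dict) -> dict:
    """Build an If-None-Match header when an ETag is cached for this URL."""
    etag = cache.get(url)
    return {"If-None-Match": etag} if etag else {}


def fetch_if_changed(url: str, cache: dict) -> Optional[bytes]:
    """Return the new content, or None when the server answers 304 Not Modified."""
    request = urllib.request.Request(url, headers=conditional_headers(url, cache))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            etag = response.headers.get("ETag")
            if etag:
                cache[url] = etag  # remember the validator for the next run
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib surfaces 304 as an HTTPError
            return None
        raise
```

A caller would write the updated `cache` back to disk with `json.dump` after each run, so that unchanged feeds are skipped the next time.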
## Usage
Run Collectress from the command line as follows. Each run creates a `log.json` file:
```bash
python collectress.py -f data_feeds.yml -w data_feeds/ -e etag_cache.json
```
Parameters:
```bash
  -h, --help            show this help message and exit
  -e ECACHE, --ecache ECACHE
                        eTag cache for optimizing downloads
  -f FEED, --feed FEED  YAML file containing the feeds
  -w WORKDIR, --workdir WORKDIR
                        The root of the output directory
```
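For reference, an `argparse` parser that accepts the same three options could look like the sketch below. The option names and help strings are copied from the help output above. Everything else is an assumption: the function name `build_parser` is hypothetical, and whether the real tool marks any option as required or supplies defaults is not stated here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (no defaults assumed)."""
    parser = argparse.ArgumentParser(prog="collectress.py")
    parser.add_argument("-e", "--ecache", help="eTag cache for optimizing downloads")
    parser.add_argument("-f", "--feed", help="YAML file containing the feeds")
    parser.add_argument("-w", "--workdir", help="The root of the output directory")
    return parser


# Parse the same arguments as the usage example, passed explicitly instead of via sys.argv.
args = build_parser().parse_args(
    ["-f", "data_feeds.yml", "-w", "data_feeds/", "-e", "etag_cache.json"]
)
print(args.feed, args.workdir, args.ecache)
```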
## Docker Usage
Collectress can be used through its Docker image:
```bash
docker run --rm \
  -e TZ=$(readlink /etc/localtime | sed -e 's,/usr/share/zoneinfo/,,' ) \
  -v ${PWD}/data_feeds.yml:/collectress/data_feeds.yml \
  -v ${PWD}/log.json:/collectress/log.json \
  -v ${PWD}/etag_cache.json:/collectress/etag_cache.json \
  -v ${PWD}/data_output:/data \
  ghcr.io/stratosphereips/collectress:main \
  python collectress.py -f data_feeds.yml -e etag_cache.json -w /data
```
## About
This tool was developed at the Stratosphere Laboratory at the Czech Technical University in Prague.