https://github.com/stratosphereips/collectress

Collectress (/kəˈlɛktɹɪs/) is a Python tool designed for downloading web data feeds periodically and consistently.

Host: GitHub
URL: https://github.com/stratosphereips/collectress
Owner: stratosphereips
License: gpl-2.0
Created: 2023-07-16T19:30:35.000Z
Default Branch: main
Last Pushed: 2024-12-27T20:07:20.000Z
Last Synced: 2025-09-05T15:03:51.592Z
Topics: feeds, feeds-downloader, threat-intelligence, web-downloader
Language: Python
Homepage:
Size: 111 KB
Stars: 5
Watchers: 3
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: docs/CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Citation: CITATION.cff
- Security: SECURITY.md

[![Python package](https://github.com/stratosphereips/collectress/actions/workflows/python-checks.yml/badge.svg)](https://github.com/stratosphereips/collectress/actions/workflows/python-checks.yml)
[![Validate-YAML](https://github.com/stratosphereips/collectress/actions/workflows/validate-yml.yml/badge.svg)](https://github.com/stratosphereips/collectress/actions/workflows/validate-yml.yml)
[![CodeQL](https://github.com/stratosphereips/collectress/actions/workflows/github-code-scanning/codeql/badge.svg)](https://github.com/stratosphereips/collectress/actions/workflows/github-code-scanning/codeql)
[![Docker GHCR](https://github.com/stratosphereips/collectress/actions/workflows/docker-publish.yml/badge.svg)](https://github.com/stratosphereips/collectress/actions/workflows/docker-publish.yml)
[![Docker Hub CI](https://github.com/stratosphereips/collectress/actions/workflows/docker-hub.yml/badge.svg)](https://github.com/stratosphereips/collectress/actions/workflows/docker-hub.yml)

Collectress is a Python tool designed for downloading web data feeds periodically and consistently. The feeds to download are specified in a YAML feed file. Each feed gets its own directory, and downloaded data is filed in directories named after the current date.
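The storage layout described above can be illustrated with a small sketch. It is not taken from the Collectress source: the helper name `feed_output_dir` is hypothetical, and nesting the date directories under the feed directory (`<workdir>/<feed>/YYYY/MM/DD`) is an assumption based on the description.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def feed_output_dir(workdir: str, feed_name: str, day: Optional[date] = None) -> Path:
    """Build <workdir>/<feed_name>/YYYY/MM/DD (hypothetical helper, assumed layout)."""
    day = day or date.today()
    # Zero-padded components so directories sort chronologically.
    return Path(workdir, feed_name, f"{day:%Y}", f"{day:%m}", f"{day:%d}")


if __name__ == "__main__":
    # Example with a fixed date; omit the last argument to use today's date.
    print(feed_output_dir("data_feeds", "example_feed", date(2024, 12, 27)))
```

A real run would also need to create the directory, for example with `Path.mkdir(parents=True, exist_ok=True)`, before writing the downloaded content.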

## Features

- Downloads content from multiple feeds specified in a YAML file
- Creates a directory for each feed
- Content stored in a date-structured directory format (YYYY/MM/DD)
- Handles errors gracefully, allowing the tool to continue even if a single operation fails
- Command-line arguments for input, output, and cache
- Download optimisation through an ETag cache
- Logs a JSON-formatted comprehensive activity summary per script run
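To illustrate the ETag cache feature, here is a minimal sketch of how such a cache is commonly used with HTTP conditional requests. It is an assumption about the general technique, not the Collectress implementation: the function names are hypothetical, and the cache is assumed to be a JSON object mapping each feed URL to the last ETag seen.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def load_etag_cache(path: str) -> Dict[str, str]:
    """Load the URL -> ETag mapping, starting empty if the file is missing."""
    cache_file = Path(path)
    if not cache_file.exists():
        return {}
    return json.loads(cache_file.read_text(encoding="utf-8"))


def conditional_headers(cache: Dict[str, str], url: str) -> Dict[str, str]:
    """Return an If-None-Match header when an ETag is cached for this URL."""
    etag = cache.get(url)
    return {"If-None-Match": etag} if etag else {}


def record_response(cache: Dict[str, str], url: str,
                    status: int, etag: Optional[str]) -> bool:
    """Update the cache; return True when the body is new and should be saved."""
    if status == 304:  # Not Modified: the previously downloaded copy is current
        return False
    if etag:
        cache[url] = etag
    return True
```

The headers from `conditional_headers` would be sent with the download request. A `304 Not Modified` response then means the feed has not changed since the last run, so nothing needs to be stored again.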

## Usage

Collectress can be run from the command line as follows (a `log.json` will be created upon execution):

```bash
python collectress.py -f data_feeds.yml -w data_feeds/ -e etag_cache.json
```

Parameters:
```bash
  -h, --help            show this help message and exit
  -e ECACHE, --ecache ECACHE
                        eTag cache for optimizing downloads
  -f FEED, --feed FEED  YAML file containing the feeds
  -w WORKDIR, --workdir WORKDIR
                        The root of the output directory
```
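
For readers who want to see how options like these are typically declared, the following sketch rebuilds the documented flags with `argparse`. Only the flag names and help strings come from the help output above. The function name `build_parser` is hypothetical, and whether each option is required or has a default value in Collectress is not stated here, so none are set.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Recreate the documented flags (a sketch, not the actual Collectress code)."""
    parser = argparse.ArgumentParser(prog="collectress.py")
    parser.add_argument("-e", "--ecache", help="eTag cache for optimizing downloads")
    parser.add_argument("-f", "--feed", help="YAML file containing the feeds")
    parser.add_argument("-w", "--workdir", help="The root of the output directory")
    return parser


# Parse the same arguments as the usage example above.
args = build_parser().parse_args(
    ["-f", "data_feeds.yml", "-w", "data_feeds/", "-e", "etag_cache.json"]
)
print(args.feed, args.workdir, args.ecache)
```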

## Usage Docker

Collectress can be used through its Docker image:

```bash
docker run --rm \
  -e TZ="$(readlink /etc/localtime | sed -e 's,/usr/share/zoneinfo/,,')" \
  -v "${PWD}/data_feeds.yml:/collectress/data_feeds.yml" \
  -v "${PWD}/log.json:/collectress/log.json" \
  -v "${PWD}/etag_cache.json:/collectress/etag_cache.json" \
  -v "${PWD}/data_output:/data" \
  ghcr.io/stratosphereips/collectress:main \
  python collectress.py -f data_feeds.yml -e etag_cache.json -w /data
```

# About

This tool was developed at the Stratosphere Laboratory at the Czech Technical University in Prague.
