https://github.com/mtulio/s3-stream
S3 Streaming data to output broker, logger or database
- Host: GitHub
- URL: https://github.com/mtulio/s3-stream
- Owner: mtulio
- License: apache-2.0
- Created: 2017-08-19T20:11:40.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2021-03-11T03:57:54.000Z (over 5 years ago)
- Last Synced: 2025-02-24T03:17:49.485Z (over 1 year ago)
- Language: Python
- Homepage:
- Size: 21.5 KB
- Stars: 0
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# s3-stream
S3stream streams text data from S3 objects, such as CloudFront logs, to a given
output. Currently supported output:
* Kafka
Planned outputs:
* Syslog (in development)
* Elasticsearch
* Raw logs
## Overview
This project fetches an S3 file referenced by an SQS notification (we assume
that you have already set it up), optionally filters its content, and publishes
the result to Kafka or another output provider.
The basic architecture is:
```
|_S3_| -> |_SQS_|<-----------.
| |
.--------------:---------------|--------------.
:|_INIT_| | |-|_ONE_SHOOT_|:--> sys.exit(0)
: | | |_SQS_DELETE_| :
: '-->|_SQS_POOLER_|<--------| :
: | | :
: |_PARSE_MSG_| | :
: | | :
: |_S3_GET_| | :
: | | :
: |_EXTRACTOR_| | :
: | | :
: |_FILTER_| | :
: | | :
: |_PUBLISH_| | :
: | |_OK_| |_FAIL_|--:--> sys.exit(1)
: | |_______| :
'--------------:---------------|--------------'
| |
| |_RESULT_|
__________| |
| |
|_KAFKA_|------------------->|
|_ELASTICSEARCH_|----------->|
|_GRAYLOG_GELF_|------------>|
|_SYSLOG_|------------------>|
```
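The flow in the diagram (parse message, get object, extract, filter, publish) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `parse_s3_event`, `extract_lines`, `filter_lines` and `process` are hypothetical, the substring filter is an assumed stand-in for whatever filtering the project applies, and the S3 download and the output are passed in as plain callables so the sketch runs without AWS or Kafka. It assumes the SQS message body carries a standard S3 event notification, optionally wrapped in an SNS envelope.

```python
import gzip
import json
from urllib.parse import unquote_plus


def parse_s3_event(body):
    """Return (bucket, key) pairs from an SQS message body holding an S3 event."""
    event = json.loads(body)
    # If the event was relayed through SNS, the S3 event is a JSON string
    # stored under the "Message" key of the envelope.
    if "Message" in event:
        event = json.loads(event["Message"])
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys in S3 notifications are URL-encoded.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def extract_lines(raw):
    """Decompress a gzip payload and return its text lines."""
    return gzip.decompress(raw).decode("utf-8").splitlines()


def filter_lines(lines, needle=None):
    """Keep lines containing `needle`; keep everything when no filter is set."""
    if needle is None:
        return list(lines)
    return [line for line in lines if needle in line]


def process(body, fetch, publish, needle=None):
    """Run parse -> get -> extract -> filter -> publish for one message.

    `fetch(bucket, key)` returns the object bytes and `publish(line)` sends
    one line to the output; both are injected (hypothetical hooks).
    Returns the number of published lines.
    """
    count = 0
    for bucket, key in parse_s3_event(body):
        for line in filter_lines(extract_lines(fetch(bucket, key)), needle):
            publish(line)
            count += 1
    return count
```

In a real deployment `fetch` would wrap an S3 download and `publish` a Kafka producer call; the message would be deleted from the queue only after `process` returns without raising.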
* Limitations
The project assumes that the SNS topic already exists. You need to monitor both the queue and s3stream; otherwise, if the processor stops, the queue can grow drastically (watch the costs).
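One way to watch the queue is to read its approximate depth periodically. The sketch below is an assumption about how such a check could look, not part of s3stream: `queue_backlog` and `backlog_exceeded` are hypothetical helpers, the threshold of 1000 is arbitrary, and `sqs_client` is expected to behave like a boto3 SQS client (its `get_queue_attributes` call returns attribute values as strings).

```python
def queue_backlog(sqs_client, queue_url):
    """Return the approximate number of messages waiting in the queue."""
    response = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    # SQS reports attribute values as strings.
    return int(response["Attributes"]["ApproximateNumberOfMessages"])


def backlog_exceeded(sqs_client, queue_url, threshold=1000):
    """Return True when the backlog is above the (arbitrary) threshold."""
    return queue_backlog(sqs_client, queue_url) > threshold
```

With boto3 installed, `sqs_client` would be `boto3.client("sqs")`; the result can feed whatever alerting is already in place.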
## Use case
1) Near-real-time CloudFront log processor: parse the gzipped log files that CloudFront delivers to S3 and publish them to a Kafka topic, which Graylog can then read through an input plugin and index in Elasticsearch.
This solution (parser and processor) can consume a lot of IOPS. We have successfully tested it on an i3.large instance running both s3stream and Kafka on ephemeral NVMe storage, processing millions of log messages per hour.
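To illustrate the parsing step of this use case, here is a small sketch that turns decompressed CloudFront log lines into dictionaries. It assumes the CloudFront standard log layout (comment lines starting with `#`, a `#Fields:` header naming the columns, tab-separated values); the function name `parse_cloudfront_log` is hypothetical and the sample in the usage note lists only a few columns for brevity.

```python
def parse_cloudfront_log(lines):
    """Yield one dict per log entry, keyed by the names in the #Fields header."""
    fields = []
    for line in lines:
        if line.startswith("#Fields:"):
            # Column names are separated by spaces in the header.
            fields = line[len("#Fields:"):].split()
        elif line.startswith("#") or not line.strip():
            # Skip other comment lines (such as #Version) and blank lines.
            continue
        else:
            # Values are tab-separated and follow the header order.
            yield dict(zip(fields, line.split("\t")))
```

For example, feeding it `["#Version: 1.0", "#Fields: date time sc-status", "2017-08-19\t20:11:40\t200"]` yields a single entry whose `sc-status` value is the string `"200"`. Each dict could then be serialized (for instance as JSON) before being published to the Kafka topic.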
## Goals
* Filter messages before stream output
* Support an SQS poller that runs at an interval
* Support stream to kafka
* Support stream to syslog
* Support stream to gelf (graylog server)
* Support stream to Elasticsearch
* Support dry-run
* Support running as a daemon
* Support retrieving information about the last executions
* Support running the consumer when it is already running
* Improve consumer metrics