https://github.com/mtulio/s3-stream

S3 Streaming data to output broker, logger or database

# s3-stream

S3stream streams S3 object text data, such as CloudFront logs, to a given
output. Supported outputs:

* Kafka

Planned outputs:

* Syslog (in dev)
* Elasticsearch
* Raw logs

## Overview

This project fetches an S3 file referenced by an SQS notification (we assume
that you have already created the queue), optionally filters its contents, and
publishes the result to Kafka or another output provider.
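The notification step can be sketched as follows. This is a minimal sketch, assuming the queue receives standard S3 event notifications, optionally wrapped in an SNS envelope; `parse_s3_notification` is a hypothetical helper name and is not taken from the project.

```python
import json
from urllib.parse import unquote_plus


def parse_s3_notification(body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs found in one SQS message body.

    Assumes the standard S3 event notification layout. If the message
    came through SNS (without raw delivery), the event is a JSON string
    stored under the "Message" field of the envelope.
    """
    payload = json.loads(body)
    if "Message" in payload and "Records" not in payload:
        payload = json.loads(payload["Message"])  # unwrap the SNS envelope

    objects = []
    # Messages without "Records" (such as the S3 test event) yield nothing.
    for record in payload.get("Records", []):
        s3 = record.get("s3")
        if not s3:
            continue  # not an S3 event record
        bucket = s3["bucket"]["name"]
        # Object keys are URL-encoded in S3 event notifications.
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects
```

Each returned pair identifies one object to download; a single message may reference several objects.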

The simplified architecture is:

```
  |_S3_| -> |_SQS_|<-----------.
               |               |
.--------------:---------------|--------------.
:|_INIT_|      |               |-|_ONE_SHOOT_|:--> sys.exit(0)
:   |          |        |_SQS_DELETE_|        :
:   '-->|_SQS_POOLER_|<--------|              :
:              |               |              :
:        |_PARSE_MSG_|         |              :
:              |               |              :
:         |_S3_GET_|           |              :
:              |               |              :
:        |_EXTRACTOR_|         |              :
:              |               |              :
:         |_FILTER_|           |              :
:              |               |              :
:         |_PUBLISH_|          |              :
:              |             |_OK_| |_FAIL_|--:--> sys.exit(1)
:              |               |_______|      :
'--------------:---------------|--------------'
               |               |
               |           |_RESULT_|
     __________|               |
    |                          |
  |_KAFKA_|------------------->|
  |_ELASTICSEARCH_|----------->|
  |_GRAYLOG_GELF_|------------>|
  |_SYSLOG_|------------------>|

```
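One reading of the diagram is a polling loop that deletes a message only after it has been published, and that maps the OK and FAIL branches to exit codes. The sketch below is an assumption, not the project's code: `run` and `handle` are hypothetical names, `sqs` is assumed to behave like a boto3 SQS client, and skipping the delete in dry-run mode is a guess at the intended behavior.

```python
import time


def run(sqs, queue_url, handle, one_shot=False, interval=5.0, dry_run=False):
    """Poll SQS and hand each message body to `handle`.

    `handle` stands in for PARSE_MSG -> S3_GET -> EXTRACTOR -> FILTER ->
    PUBLISH and should raise on failure. Returns a process exit code
    (0 = OK, 1 = FAIL).
    """
    while True:
        response = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=10
        )
        for message in response.get("Messages", []):
            try:
                handle(message["Body"])
            except Exception as exc:
                print(f"publish failed: {exc}")
                return 1  # FAIL: the message stays on the queue for a retry
            if not dry_run:
                # SQS_DELETE: acknowledge only after a successful publish.
                # Assumption: dry-run leaves messages on the queue.
                sqs.delete_message(
                    QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
                )
        if one_shot:
            return 0  # ONE_SHOOT: a single polling pass, then exit
        time.sleep(interval)
```

With boto3 installed, a one-shot run could look like `sys.exit(run(boto3.client("sqs"), queue_url, handle, one_shot=True))`, where `queue_url` and `handle` are supplied by the caller.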

* Limitations

The project assumes that the SNS topic already exists. You need to monitor both the queue and s3stream; otherwise, when the processor stops, the queue can grow drastically (watch the costs).
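Queue monitoring can be as simple as reading the approximate backlog from SQS. The following is a minimal sketch, assuming `sqs` behaves like a boto3 SQS client; the function names and the threshold of 1000 messages are illustrative and not part of the project.

```python
def queue_backlog(sqs, queue_url):
    """Return the approximate number of messages waiting in the queue."""
    response = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    # SQS returns attribute values as strings.
    return int(response["Attributes"]["ApproximateNumberOfMessages"])


def backlog_alert(sqs, queue_url, threshold=1000):
    """Return a warning string when the backlog exceeds `threshold`, else None."""
    depth = queue_backlog(sqs, queue_url)
    if depth > threshold:
        return f"queue backlog is {depth} messages (threshold {threshold})"
    return None
```

Running such a check on a schedule, separately from s3stream itself, keeps the alert working when the processor is down.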

## Use case

1) Near real-time CloudFront log processor: parse the gzip log files that CloudFront writes to S3 and publish them to a Kafka topic, so that Graylog can read the topic through an input plugin and index the messages in Elasticsearch.

This solution (parser and processor) can consume a lot of IOPS. We have successfully tested it on an i3.large instance running both s3stream and Kafka on ephemeral storage (NVMe), processing millions of log messages per hour.
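The extraction and filtering steps for this use case can be sketched as below. This is a minimal sketch, assuming CloudFront standard log files (gzip-compressed, tab-separated, with a `#Fields:` header line); `extract_cloudfront_records` and its `keep` parameter are hypothetical names, not the project's API.

```python
import gzip


def extract_cloudfront_records(gz_bytes, keep=None):
    """Decompress a CloudFront standard log file and return a list of dicts.

    Field names come from the file's own "#Fields:" header line. `keep` is
    an optional predicate; records for which it returns False are dropped.
    """
    text = gzip.decompress(gz_bytes).decode("utf-8")
    fields = []
    records = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#"):
            continue  # skip blank lines and other headers such as "#Version:"
        record = dict(zip(fields, line.split("\t")))
        if keep is None or keep(record):
            records.append(record)
    return records
```

For example, passing `keep=lambda r: r["sc-status"] != "200"` would forward only non-200 responses to the output.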

## Goals

* Filter messages before stream output
* Support running the SQS poller at an interval
* Support streaming to Kafka
* Support streaming to syslog
* Support streaming to GELF (Graylog server)
* Support streaming to Elasticsearch
* Support dry-run
* Support running as a daemon
* Support retrieving information about the last executions
* Support running the consumer when it is already running
* Improve consumer metrics