https://github.com/ronaldkanyepi/log-realtime-analysis
A scalable architecture for real-time log processing and visualization. Built with a Kafka-Spark ETL pipeline, DynamoDB for storing aggregate real-time metrics, and Python Dash for interactive dashboards. Designed for high-throughput log ingestion, real-time monitoring, and long-term storage.
https://github.com/ronaldkanyepi/log-realtime-analysis
dash docker docker-compose docker-container dynamodb etl etl-pipeline hdfs kafka kafka-consumer kafka-producer kafka-streams kafka-topic logs python realtime spark spark-streaming streaming visualization
Last synced: 11 months ago
JSON representation
A scalable architecture for real-time log processing and visualization. Built with a Kafka-Spark ETL pipeline, DynamoDB for storing aggregate real-time metrics, and Python Dash for interactive dashboards. Designed for high-throughput log ingestion, real-time monitoring, and long-term storage.
- Host: GitHub
- URL: https://github.com/ronaldkanyepi/log-realtime-analysis
- Owner: ronaldkanyepi
- License: mit
- Created: 2024-12-24T06:09:42.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-24T06:57:22.000Z (over 1 year ago)
- Last Synced: 2025-06-06T23:01:58.440Z (about 1 year ago)
- Topics: dash, docker, docker-compose, docker-container, dynamodb, etl, etl-pipeline, hdfs, kafka, kafka-consumer, kafka-producer, kafka-streams, kafka-topic, logs, python, realtime, spark, spark-streaming, streaming, visualization
- Language: Python
- Homepage:
- Size: 1.14 MB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Log Realtime Analysis
## Overview
Log Realtime Analysis is a robust real-time log aggregation and visualization system designed to handle high-throughput logs using a Kafka-Spark ETL pipeline. For example, it can process application logs tracking user requests, error rates, and API response times in real-time. It integrates with DynamoDB for real-time metrics storage and visualizes key system insights using Python and Dash Plotly Library. The setup uses Docker for containerized deployment, ensuring seamless development and deployment workflows.
## Features
- **Log Ingestion:** High-throughput log streaming with Kafka.
- **Real-Time Aggregation:** Spark processes logs per minute for metrics like request counts, error rates, and response times.
- **Metrics Storage:** Aggregated metrics stored in DynamoDB for fast querying. DynamoDB is optimized for low-latency, high-throughput queries, making it ideal for real-time dashboard applications.
- **Data Storage:** Historical logs saved in HDFS as Parquet files for long-term analysis.
- **Interactive Dashboard:** Dash application with real-time updates and SLA metrics visualization.
## Architecture

1. **Input Topic:** `logging_info` for real-time log ingestion.
- **Purpose:** High-throughput, fault-tolerant log streaming.
2. **Real-Time Aggregation with Spark**
- **Processing Logic:** Aggregates logs per minute for metrics like request counts, error rates, and response times.
- **Output Topic:** `agg_logging_info` with structured metrics.
3. **Downstream Processing**
- **DynamoDB:** Stores real-time metrics for dashboards with low-latency queries.
- **HDFS:** Stores aggregated logs in Parquet format for long-term analysis.
4. **Visualization with Python Dash**
- **Purpose:** Auto-refreshing dashboards show live system metrics, request rates, error types, and performance insights.
---
## Dockerized Services
### Zookeeper
- **Image:** `bitnami/zookeeper:latest`
- **Ports:** `2181:2181`
- **Volume:** `${HOST_SHARED_DIR}/zookeeper:/bitnami/zookeeper`
### Kafka
- **Image:** `bitnami/kafka:latest`
- **Ports:** `9092:9092`, `29092:29092`
- **Volume:** `${HOST_SHARED_DIR}/kafka:/bitnami/kafka`
### DynamoDB Local
- **Image:** `amazon/dynamodb-local:latest`
- **Ports:** `8000:8000`
- **Volume:** `${HOST_SHARED_DIR}/dynamodb-local:/data`
### DynamoDB Admin
- **Image:** `aaronshaf/dynamodb-admin`
- **Ports:** `8001:8001`
### Spark Jupyter
- **Image:** `jupyter/all-spark-notebook:python-3.11.6`
- **Ports:** `8888:8888`, `4040:4040`
- **Volume:** `${HOST_SHARED_DIR}/spark-jupyter-data:/home/jovyan/data`
---
## Dashboard
The Python Dash application provides an intuitive interface for monitoring real-time metrics and logs. Key features include:
- SLA gauge visualization.
- Log-level distribution pie chart.
- Average response time by API.
- Top APIs with highest error counts.
- Real-time log-level line graph.
### Dashboard Components
1. **SLA Gauge:** Visualizes the system's SLA percentage.
2. **Log Level Distribution:** Displays the proportion of different log levels.
3. **Average Response Time:** Bar chart showing average response times for APIs.
4. **Top Error-Prone APIs:** Table listing APIs with the highest error counts.
5. **Log Counts Over Time:** Line chart of log counts aggregated by log levels.


---
## How to Run
### Prerequisites
- Docker and Docker Compose installed.
- Shared directory setup for volume bindings.
- Replace `${HOST_SHARED_DIR}` with your host directory.
- Replace `${IP_ADDRESS}` with your host machine IP.
### Steps
1. **Start the Services:**
```bash
docker-compose up -d
```
2. **Access Jupyter Notebook:**
Open `http://localhost:8888` or check the logs for the notebook in Docker for the full URL
3. **Run the Dash App:**
```bash
python ui/ui-prod.py
```
Access the dashboard at `http://127.0.0.1:8050`.
4. **Kafka Setup:**
- Create topics:
```bash
python kafka/kafka_producer.py
```
---
## Data Pipeline
1. **Log Generation:** Logs are streamed to Kafka's `airbnb_system_logs` topic.
2. **Spark Processing:** Spark consumes logs, aggregates them, and produces structured metrics to `agg_airbnb_system_logs`.
3. **Metrics Storage:** Aggregated data is stored in DynamoDB for real-time querying.
4. **Long-Term Storage:** Historical logs are stored in HDFS in Parquet format.
---
## Files
- `docker-compose.yml`: Docker configuration for services.
- `ui/ui-prod.py`: Dash application for visualizing logs and metrics.
- `kafka/kafka_topic.py`: Script for creating Kafka Topics one for granular logs and the other for aggregate logs from spark.
- `kafka/kafka_producer.py`: Script for simulating logs
- `spark/spark-portfolio.ipynb`: Consumes granular logs from the topic `logging_info` and aggregates the log data by minute intervals, computes statistics (count, avg, max, min response times), and streams the results in JSON format to the Kafka topic`agg_logging_info`
- `spark/spark_kafka.py`: Consumes log messages from a Kafka topic, parses them, and stores aggregated log metrics into a DynamoDB table.
## Future Enhancements
- Integrate machine learning for anomaly detection.
- Add support for multiple regions in DynamoDB.
- Implement alerting (sms and email) for SLA breaches.
- Enhance dashboard for customizable user settings.
---
## License
This project is licensed under the MIT License.
---
## Contributors
- **Ronald Nyasha Kanyepi** - [GitHub](https://github.com/ronaldkanyepi). For any inquiries, please contact [kanyepironald@gmail.com](mailto\:kanyepironald@gmail.com).