{"id":23499628,"url":"https://github.com/ronaldkanyepi/log-realtime-analysis","last_synced_at":"2025-07-20T16:04:16.079Z","repository":{"id":269521731,"uuid":"907677309","full_name":"ronaldkanyepi/Log-Realtime-Analysis","owner":"ronaldkanyepi","description":"A scalable architecture for real-time log processing and visualization. Built with a Kafka-Spark ETL pipeline, DynamoDB for storing aggregate real-time metrics, and Python Dash for interactive dashboards. Designed for high-throughput log ingestion, real-time monitoring, and long-term storage.","archived":false,"fork":false,"pushed_at":"2024-12-24T06:57:22.000Z","size":1200,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-06T23:01:58.440Z","etag":null,"topics":["dash","docker","docker-compose","docker-container","dynamodb","etl","etl-pipeline","hdfs","kafka","kafka-consumer","kafka-producer","kafka-streams","kafka-topic","logs","python","realtime","spark","spark-streaming","streaming","visualization"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ronaldkanyepi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-24T06:09:42.000Z","updated_at":"2025-06-04T02:13:28.000Z","dependencies_parsed_at":null,"dependency_job_id":"c9540169-2b7d-4af8-9459-e08d42ff9388","html_url":"https://github.com/ronaldkanyepi/Log-Realtime-Analysis","commit_stats":null,"previous_names":["ronaldkanyepi/realtime-log-analysis"],"tags_count":0,"template":false,"temp
late_full_name":null,"purl":"pkg:github/ronaldkanyepi/Log-Realtime-Analysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ronaldkanyepi%2FLog-Realtime-Analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ronaldkanyepi%2FLog-Realtime-Analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ronaldkanyepi%2FLog-Realtime-Analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ronaldkanyepi%2FLog-Realtime-Analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ronaldkanyepi","download_url":"https://codeload.github.com/ronaldkanyepi/Log-Realtime-Analysis/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ronaldkanyepi%2FLog-Realtime-Analysis/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266152255,"owners_count":23884474,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dash","docker","docker-compose","docker-container","dynamodb","etl","etl-pipeline","hdfs","kafka","kafka-consumer","kafka-producer","kafka-streams","kafka-topic","logs","python","realtime","spark","spark-streaming","streaming","visualization"],"created_at":"2024-12-25T06:18:00.854Z","updated_at":"2025-07-20T16:04:15.788Z","avatar_url":"https://github.com/ronaldkanyepi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Log Realtime Analysis\n\n## Overview\n\nLog Realtime Analysis is a robust real-time log aggregation and 
visualization system designed to handle high-throughput logs using a Kafka-Spark ETL pipeline. For example, it can process application logs tracking user requests, error rates, and API response times in real time. It integrates with DynamoDB for real-time metrics storage and visualizes key system insights with Python and the Plotly Dash library. The setup uses Docker for containerized deployment, which keeps development and deployment environments consistent.\n\n## Features\n\n- **Log Ingestion:** High-throughput log streaming with Kafka.\n- **Real-Time Aggregation:** Spark processes logs per minute for metrics like request counts, error rates, and response times.\n- **Metrics Storage:** Aggregated metrics stored in DynamoDB for fast querying. DynamoDB is optimized for low-latency, high-throughput queries, which suits real-time dashboard applications.\n- **Data Storage:** Historical logs saved in HDFS as Parquet files for long-term analysis.\n- **Interactive Dashboard:** Dash application with real-time updates and SLA metrics visualization.\n\n## Architecture\n\n![kafka_flow.gif](ui%2Fassets%2Fkafka_flow.gif)\n\n1. **Input Topic:** `logging_info` for real-time log ingestion.\n   - **Purpose:** High-throughput, fault-tolerant log streaming.\n\n2. **Real-Time Aggregation with Spark**\n   - **Processing Logic:** Aggregates logs per minute for metrics like request counts, error rates, and response times.\n   - **Output Topic:** `agg_logging_info` with structured metrics.\n\n3. **Downstream Processing**\n   - **DynamoDB:** Stores real-time metrics for dashboards with low-latency queries.\n   - **HDFS:** Stores aggregated logs in Parquet format for long-term analysis.\n\n4. 
**Visualization with Python Dash**\n\n   - **Purpose:** Auto-refreshing dashboards show live system metrics, request rates, error types, and performance insights.\n\n---\n\n## Dockerized Services\n\n### Zookeeper\n\n- **Image:** `bitnami/zookeeper:latest`\n- **Ports:** `2181:2181`\n- **Volume:** `${HOST_SHARED_DIR}/zookeeper:/bitnami/zookeeper`\n\n### Kafka\n\n- **Image:** `bitnami/kafka:latest`\n- **Ports:** `9092:9092`, `29092:29092`\n- **Volume:** `${HOST_SHARED_DIR}/kafka:/bitnami/kafka`\n\n### DynamoDB Local\n\n- **Image:** `amazon/dynamodb-local:latest`\n- **Ports:** `8000:8000`\n- **Volume:** `${HOST_SHARED_DIR}/dynamodb-local:/data`\n\n### DynamoDB Admin\n\n- **Image:** `aaronshaf/dynamodb-admin`\n- **Ports:** `8001:8001`\n\n### Spark Jupyter\n- **Image:** `jupyter/all-spark-notebook:python-3.11.6`\n- **Ports:** `8888:8888`, `4040:4040`\n- **Volume:** `${HOST_SHARED_DIR}/spark-jupyter-data:/home/jovyan/data`\n\n---\n\n## Dashboard\n\nThe Python Dash application provides an intuitive interface for monitoring real-time metrics and logs. Key features include:\n\n- SLA gauge visualization.\n- Log-level distribution pie chart.\n- Average response time by API.\n- Top APIs with highest error counts.\n- Real-time log-level line graph.\n\n### Dashboard Components\n\n1. **SLA Gauge:** Visualizes the system's SLA percentage.\n2. **Log Level Distribution:** Displays the proportion of different log levels.\n3. **Average Response Time:** Bar chart showing average response times for APIs.\n4. **Top Error-Prone APIs:** Table listing APIs with the highest error counts.\n5. 
**Log Counts Over Time:** Line chart of log counts aggregated by log level.\n\n![img.png](ui/assets/dashboard-1.png)\n\n![img.png](ui/assets/dashboard-2.png)\n\n---\n\n## How to Run\n\n### Prerequisites\n- Docker and Docker Compose installed.\n- A shared host directory set up for the volume bindings.\n- Replace `${HOST_SHARED_DIR}` with the path to that host directory.\n- Replace `${IP_ADDRESS}` with your host machine's IP address.\n\n### Steps\n\n1. **Start the Services:**\n   ```bash\n   docker-compose up -d\n   ```\n2. **Access Jupyter Notebook:**\n   Open `http://localhost:8888`, or check the notebook container's Docker logs for the full URL.\n3. **Run the Dash App:**\n   ```bash\n   python ui/ui-prod.py\n   ```\n   Access the dashboard at `http://127.0.0.1:8050`.\n4. **Kafka Setup:**\n   - Create the topics, then start the simulated log producer:\n     ```bash\n     python kafka/kafka_topic.py\n     python kafka/kafka_producer.py\n     ```\n\n---\n\n## Data Pipeline\n\n1. **Log Generation:** Logs are streamed to Kafka's `logging_info` topic.\n2. **Spark Processing:** Spark consumes the logs, aggregates them, and produces structured metrics to `agg_logging_info`.\n3. **Metrics Storage:** Aggregated data is stored in DynamoDB for real-time querying.\n4. **Long-Term Storage:** Historical logs are stored in HDFS in Parquet format.\n\n---\n\n## Files\n\n- `docker-compose.yml`: Docker Compose configuration for the services.\n- `ui/ui-prod.py`: Dash application for visualizing logs and metrics.\n- `kafka/kafka_topic.py`: Script that creates the Kafka topics, one for granular logs and one for the aggregated logs from Spark.\n- `kafka/kafka_producer.py`: Script that simulates logs.\n- `spark/spark-portfolio.ipynb`: Consumes granular logs from the `logging_info` topic, aggregates them by minute intervals, computes statistics (count, average, maximum, and minimum response times), and streams the results as JSON to the Kafka topic `agg_logging_info`.\n- `spark/spark_kafka.py`: Consumes log messages from a Kafka topic, parses them, and stores aggregated log metrics in a DynamoDB table.\n\n\n## Future Enhancements\n\n- Integrate machine learning for anomaly detection.\n- Add support for multiple regions in DynamoDB.\n- Implement alerting (SMS and email) for SLA breaches.\n- Enhance the dashboard with customizable user settings.\n\n---\n\n## License\nThis project is licensed under the MIT License.\n\n---\n\n## Contributors\n- **Ronald Nyasha Kanyepi** - [GitHub](https://github.com/ronaldkanyepi). For any inquiries, please contact [kanyepironald@gmail.com](mailto:kanyepironald@gmail.com).\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fronaldkanyepi%2Flog-realtime-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fronaldkanyepi%2Flog-realtime-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fronaldkanyepi%2Flog-realtime-analysis/lists"}