https://github.com/ikramhasan/lgtm-stack
An observability stack based on OpenTelemetry (OTel) and the Grafana "LGTM" suite, with an example app provided to demonstrate orchestration.
- Host: GitHub
- URL: https://github.com/ikramhasan/lgtm-stack
- Owner: ikramhasan
- Created: 2025-12-20T20:56:03.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-12-21T12:21:11.000Z (6 months ago)
- Last Synced: 2025-12-23T03:43:17.572Z (6 months ago)
- Topics: fastapi, grafana, grafana-dashboard, lgtm, lgtm-stack, loki, mimir, minio, observability, python, tempo
- Language: Python
- Homepage:
- Size: 41 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# LGTM Stack POC (Loki, Grafana, Tempo, Mimir)
An observability stack based on OpenTelemetry (OTel) and the Grafana "LGTM" suite.
## Architecture
- **Loki**: Log aggregation.
- **Grafana**: Visualization and dashboards.
- **Tempo**: Distributed tracing.
- **Mimir**: Scalable long-term storage for Prometheus metrics.
- **OTel Collector**: Central gateway for receiving and routing telemetry data.
- **MinIO/S3**: Object storage backend for long-term data retention.
## Quick Start
### 1. Prerequisites
- Docker and Docker Compose.
- [uv](https://github.com/astral-sh/uv) (for running the example app).
### 2. Environment Setup
Copy the example environment file and adjust if necessary:
```bash
cp .env.example .env
```
### 2.1 MinIO Setup (Optional)
To use MinIO for local development, follow the steps under "Using Local MinIO (Default)" below.
### 3. Start the Stack
```bash
docker-compose up -d
```
This starts Loki, Tempo, Mimir, Grafana, the OTel Collector, and a local MinIO instance.
### 4. Run the Example Application
```bash
cd example/fastapi-app
uv sync
uv run python main.py
```
Trigger some data by visiting `http://localhost:8000/process`.
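A single request produces very little telemetry, so it can help to call the endpoint repeatedly. The following is a minimal sketch using only the Python standard library; the function name `generate_traffic`, the request count, and the `opener` parameter are illustrative assumptions and not part of the repository.

```python
import urllib.error
import urllib.request


def generate_traffic(url, count=20, timeout=5.0, opener=urllib.request.urlopen):
    """Send `count` GET requests to `url` and return a tally of outcomes.

    Keys of the returned dict are HTTP status codes, or the string
    "unreachable" when no HTTP response was received at all.
    `opener` is injectable so the function can be exercised without a server.
    """
    tally = {}
    for _ in range(count):
        try:
            with opener(url, timeout=timeout) as response:
                status = response.status
        except urllib.error.HTTPError as exc:
            # 4xx/5xx responses are raised as HTTPError; record the code.
            status = exc.code
        except urllib.error.URLError:
            # Connection refused, DNS failure, timeout, and similar.
            status = "unreachable"
        tally[status] = tally.get(status, 0) + 1
    return tally


# Example call (assumes the example app is listening on port 8000):
# print(generate_traffic("http://localhost:8000/process", count=50))
```

Error responses are tallied rather than raised, so the same loop can also be used to feed the error panels.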
## Configuration: MinIO vs. AWS S3
The stack is currently configured to use **MinIO** for local development.
### Using Local MinIO (Default)
In your `.env` file:
```env
S3_ENDPOINT=host.docker.internal:9000
S3_INSECURE=true
S3_FORCE_PATH_STYLE=true
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin
```
Run `docker compose up -d` from the `example/minio` directory to start the MinIO instance.
Once started, MinIO is available at `http://localhost:9000` with the default credentials `minioadmin/minioadmin`.
Open the MinIO dashboard and create the buckets `loki-logs`, `tempo-traces`, and `mimir-metrics`.
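To make the relationship between these variables concrete, the sketch below parses `.env`-style text and derives the base URL that the storage clients would end up talking to. It assumes that `S3_INSECURE=true` means plain HTTP instead of HTTPS, which is how the `insecure` S3 option is commonly interpreted; the helper names `parse_env` and `s3_base_url` are hypothetical.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def s3_base_url(settings):
    """Build the endpoint URL; assumes S3_INSECURE=true selects http://."""
    insecure = settings.get("S3_INSECURE", "false").lower() == "true"
    scheme = "http" if insecure else "https"
    return f"{scheme}://{settings['S3_ENDPOINT']}"


example = """
# Local MinIO
S3_ENDPOINT=host.docker.internal:9000
S3_INSECURE=true
S3_FORCE_PATH_STYLE=true
"""
print(s3_base_url(parse_env(example)))
```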
### Using AWS S3
To switch to production AWS S3:
1. Update `.env`:
- `S3_ENDPOINT`: `s3.us-east-1.amazonaws.com` (or your region's endpoint).
- `S3_INSECURE`: `false`.
- `S3_FORCE_PATH_STYLE`: `false`.
- `AWS_ACCESS_KEY_ID` & `AWS_SECRET_ACCESS_KEY`: Your AWS credentials.
2. Ensure the buckets (`loki-logs`, `tempo-traces`, `mimir-metrics`) exist in your AWS account or update the bucket name variables in `.env`.
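The `S3_FORCE_PATH_STYLE` flag changes how bucket URLs are formed: path-style puts the bucket in the URL path (what MinIO typically expects), while virtual-hosted style puts it in the hostname (the AWS default). A small illustrative sketch, with the hypothetical helper `bucket_url`:

```python
def bucket_url(endpoint, bucket, force_path_style, insecure=False):
    """Return the URL a client would use to address `bucket`.

    Path-style:           scheme://endpoint/bucket
    Virtual-hosted style: scheme://bucket.endpoint
    """
    scheme = "http" if insecure else "https"
    if force_path_style:
        return f"{scheme}://{endpoint}/{bucket}"
    return f"{scheme}://{bucket}.{endpoint}"


# Local MinIO settings from the section above.
print(bucket_url("host.docker.internal:9000", "loki-logs", True, insecure=True))
# AWS S3 settings from this section.
print(bucket_url("s3.us-east-1.amazonaws.com", "loki-logs", False))
```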
### Mimir Metrics & Dashboards
Use the following table to set up your primary observability dashboard. These metrics are exported by the FastAPI application.
| Panel Name | Visualization | Query (PromQL) | Description |
| :--- | :--- | :--- | :--- |
| **Total Request Rate** | Time series | `sum(rate(http_requests_total[$__rate_interval])) by (http_target)` | Real-time traffic per endpoint (Requests/sec). |
| **Error Rate (%)** | Stat | `100 * sum(rate(http_errors_total[$__range])) / sum(rate(http_requests_total[$__range]))` | Percentage of requests resulting in 4xx/5xx errors over the selected time range. |
| **P95 Latency** | Time series | `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le))` | 95th percentile response time for all endpoints. |
| **Active Requests** | Gauge | `sum(http_server_active_requests)` | Number of concurrent requests being processed. |
| **Errors by Endpoint** | Bar chart | `sum(increase(http_errors_total[$__range])) by (http_target)` | Total errors grouped by path over the selected time range. |
| **Top 5 Slowest Paths** | Table | `topk(5, sum(rate(http_request_duration_seconds_sum[$__range])) by (http_target) / sum(rate(http_request_duration_seconds_count[$__range])) by (http_target))` | List of endpoints with the highest average latency. |
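The P95 value is an estimate, not an exact measurement: `histogram_quantile` locates the bucket that contains the requested rank and interpolates linearly inside it. The sketch below is a simplified Python illustration of that idea, not PromQL's actual implementation; the function name `estimate_quantile` and the sample bucket counts are made up for the example, and edge cases such as native histograms are ignored.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs whose
    last upper bound is +Inf, mirroring the `le` label of *_bucket series.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(estimate_quantile(0.95, sample))
```

Because of the interpolation, the accuracy of the panel depends on how finely the bucket boundaries are spaced around the latencies of interest.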
#### How to Add a Panel
1. Click **+ Add** in the top right of your dashboard -> **Visualization**.
2. Select **Mimir** as the data source.
3. Paste the **Query** from the table above.
4. Set the **Title** to the Panel Name.
5. Select the **Visualization** type from the right sidebar.
6. Click **Save** or **Apply**.
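The same panels can also be described as JSON and pasted into a dashboard's JSON model instead of being clicked together. The sketch below builds a minimal panel definition from one table row. Treat it as an approximation: the data source UID `mimir` is a placeholder for whatever UID your Grafana instance assigns, and real panels carry additional fields (such as `gridPos`) that are left out here.

```python
import json

# Maps the "Visualization" column of the table to Grafana panel type IDs.
PANEL_TYPES = {
    "Time series": "timeseries",
    "Stat": "stat",
    "Gauge": "gauge",
    "Bar chart": "barchart",
    "Table": "table",
}


def build_panel(title, visualization, expr, datasource_uid="mimir"):
    """Return a minimal panel dict; `datasource_uid` is a placeholder."""
    return {
        "title": title,
        "type": PANEL_TYPES[visualization],
        # Mimir is queried through Grafana's Prometheus data source type.
        "datasource": {"type": "prometheus", "uid": datasource_uid},
        "targets": [{"refId": "A", "expr": expr}],
    }


panel = build_panel("Active Requests", "Gauge", "sum(http_server_active_requests)")
print(json.dumps(panel, indent=2))
```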
### Loki Logs & Analysis
Loki allows you to query logs using **LogQL**. The stack is configured to automatically label logs with metadata like `service_name` and `deployment_environment`.
#### Key Queries
| Panel Name | Visualization | Query (LogQL) | Description |
| :--- | :--- | :--- | :--- |
| **Application Logs** | Logs | `{service_name="fastapi-service"}` | Live stream of all logs from the FastAPI app. |
| **Error Log Stream** | Logs | `{service_name="fastapi-service"} \|= "error"` | Filtered stream showing only lines containing "error" (case-sensitive; use `\|~ "(?i)error"` for a case-insensitive match). |
| **Log Volume** | Time series | `count_over_time({service_name="fastapi-service"}[$__interval])` | Number of log lines produced per interval. |
| **Severity Distribution** | Pie chart | `sum by (level) (count_over_time({service_name="fastapi-service"}[$__range]))` | Breakdown of log levels (INFO, ERROR, WARN) for the selected time range. |
| **Error Frequency** | Time series | `count_over_time({service_name="fastapi-service"} \|= "error" [$__interval])` | Number of log lines containing "error" per interval. |
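To clarify what the line filters do, here is a small Python emulation. In LogQL, `|=` is a case-sensitive substring match and `|~` is a regular expression match, so `ERROR` is only caught by the case-insensitive regex form. The bucketing function only approximates `count_over_time` with fixed, non-overlapping windows; the sample log lines and function names are invented for the example.

```python
import re
from collections import Counter


def line_contains(lines, needle):
    """Emulates `|= "needle"`: case-sensitive substring match."""
    return [(ts, line) for ts, line in lines if needle in line]


def line_matches(lines, pattern):
    """Emulates `|~ "pattern"`: regex match anywhere in the line."""
    rx = re.compile(pattern)
    return [(ts, line) for ts, line in lines if rx.search(line)]


def count_per_interval(lines, interval_seconds):
    """Count lines per fixed window, keyed by the window start time."""
    counts = Counter((ts // interval_seconds) * interval_seconds for ts, _ in lines)
    return dict(sorted(counts.items()))


# (timestamp in seconds, log line)
logs = [
    (0, "INFO started"),
    (5, "ERROR failed to process"),
    (12, "error: timeout"),
    (65, "an error occurred"),
]
print(len(line_contains(logs, "error")))
print(count_per_interval(line_matches(logs, "(?i)error"), 60))
```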
#### How to Add a Log Panel
1. Click **+ Add** -> **Visualization**.
2. Select **Loki** as the data source.
3. Paste one of the **Queries** above.
4. Select the matching **Visualization** type from the right sidebar.
#### Trace Correlation (Loki -> Tempo)
When viewing logs in the **Explore** tab or a **Logs** panel:
1. Click on a log line to expand it.
2. Look for the `trace_id` field.
3. Click the **Tempo** button next to the ID to instantly see the full distributed trace for that specific log entry.
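This correlation only works if each log record carries the trace ID of the request that produced it. The example app presumably gets this from its OpenTelemetry instrumentation; the standard-library sketch below is a stand-in that shows the idea only. All names in it (`current_trace_id`, `TraceIdFilter`, `build_logger`, `new_trace_id`) are hypothetical.

```python
import contextvars
import io
import logging
import secrets

# Holds the trace ID of the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream):
    """Create a logger that prints `trace_id=<id>` on every line."""
    logger = logging.getLogger("trace-demo")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def new_trace_id():
    """Return 32 hex characters, the length of a W3C trace ID."""
    return secrets.token_hex(16)


stream = io.StringIO()
logger = build_logger(stream)
trace_id = new_trace_id()
current_trace_id.set(trace_id)
logger.info("processing request")
print(stream.getvalue(), end="")
```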
### Advanced Observability Patterns
Beyond basic metrics, you can leverage the full power of the LGTM stack with these advanced patterns:
| Pattern / Metric | Visualization | Query | Description |
| :--- | :--- | :--- | :--- |
| **RED: Rate** | Time series | `sum(rate(http_requests_total[$__rate_interval]))` | Request rate per second. |
| **RED: Errors** | Time series | `sum(rate(http_errors_total[$__rate_interval]))` | Error rate per second. |
| **RED: Duration** | Time series | `histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le))` | 90th percentile response time. |
| **Latency Heatmap** | Heatmap | `sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (le)` | Visual distribution of latency buckets. |
| **Log Severity** | Time series / Bar gauge | `sum by (level) (count_over_time({service_name="fastapi-service"} [$__range]))` | Monitor log health by severity over time. |
| **Apdex Score** | Stat | `(sum(rate(http_request_duration_seconds_bucket{le="0.5"}[$__range])) + sum(rate(http_request_duration_seconds_bucket{le="1.0"}[$__range]))) / 2 / sum(rate(http_request_duration_seconds_count[$__range]))` | Single score (0-1) for user satisfaction, with 0.5 s as the satisfied threshold and 1.0 s as the tolerated threshold. |
| **Resource Grouping** | Time series | `sum(rate(http_requests_total[$__rate_interval])) by (service_version, deployment_environment)` | Compare performance across versions/environments. |
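Apdex is conventionally defined as (satisfied + tolerating / 2) / total. Histogram buckets are cumulative, so the `le="1.0"` bucket already includes every request counted in the `le="0.5"` bucket; the tolerating count is the difference between the two. The worked sketch below uses invented request counts to show the arithmetic.

```python
def apdex(satisfied_cumulative, tolerable_cumulative, total):
    """Compute Apdex from cumulative bucket counts.

    satisfied_cumulative: requests at or below the satisfied threshold
    tolerable_cumulative: requests at or below the tolerated threshold
                          (includes the satisfied ones)
    total:                all requests
    """
    if total == 0:
        return None
    tolerating = tolerable_cumulative - satisfied_cumulative
    return (satisfied_cumulative + tolerating / 2) / total


# 80 requests <= 0.5 s, 95 requests <= 1.0 s, 100 requests overall.
print(apdex(80, 95, 100))
```

Algebraically this equals `(satisfied_cumulative + tolerable_cumulative) / 2 / total`, which is the form that works directly on cumulative buckets.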
> [!TIP]
> - Update the `fastapi-service` service name to your application name.
> - **Dynamic Time Ranges**: Instead of hardcoding `[5m]`, use Grafana global variables:
>   - **`[$__range]`**: Adjusts to the exact time period selected in the dashboard picker (e.g., Last 1 hour). Use this for total counts (with `increase()`) or "Stat" panels.
>   - **`[$__rate_interval]`**: Automatically calculates the best interval for `rate()` based on the graph's time range and resolution. Use this for Time series graphs.
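As a rough illustration of how `$__rate_interval` behaves: Grafana documents it as the larger of `$__interval` plus the scrape interval and four times the scrape interval, which guarantees that `rate()` always sees enough samples. The sketch below reproduces that rule; the 15-second scrape interval is an assumed default, so substitute the value configured for your data source.

```python
def rate_interval(interval_seconds, scrape_interval_seconds=15):
    """Approximate Grafana's $__rate_interval in seconds.

    interval_seconds:        the panel's $__interval
    scrape_interval_seconds: assumed scrape interval of the data source
    """
    return max(
        interval_seconds + scrape_interval_seconds,
        4 * scrape_interval_seconds,
    )


# Zoomed in: the floor of four scrape intervals applies.
print(rate_interval(15))
# Zoomed out: the window grows with the panel interval.
print(rate_interval(120))
```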
## Debugging Tips
- **Unhealthy Ring**: If Mimir/Loki report ring issues, ensure `replication_factor` is set to `1` in the YAML configs for single-node setups.
- **Log Ingestion**: Check the OTel Collector logs (`docker logs otel-collector`) to see if data is being received and exported correctly.
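- **S3 Connectivity**: Ensure the S3 endpoint is reachable from *within* the Docker containers. On macOS, `host.docker.internal` resolves to the host machine, so the containers reach MinIO on the host's port 9000. On Linux, this hostname may need an `extra_hosts` entry mapping `host.docker.internal` to `host-gateway`.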
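For the S3 connectivity check, a plain TCP probe is often enough to tell a networking problem from a credentials problem. A minimal sketch follows; the helper name `can_connect` is hypothetical, and the probe has to run from an environment that shares the container's network to be meaningful (many service images do not ship Python).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call (endpoint taken from the default .env values):
# print(can_connect("host.docker.internal", 9000))
```

A `False` result for the configured `S3_ENDPOINT` points at DNS or port mapping rather than at bucket names or access keys.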