https://github.com/ayushpramanik/mlh-production-engineering
Production engineering and SRE-focused infrastructure project built during the MLH Production Engineering hackathon, covering scalable backend systems, observability, Docker, CI/CD, monitoring, load testing, and reliability engineering.
https://github.com/ayushpramanik/mlh-production-engineering
flask grafana k6 nginx postgresql prometheus pytest redis-cache
Last synced: 15 days ago
JSON representation
Production engineering and SRE-focused infrastructure project built during the MLH Production Engineering hackathon, covering scalable backend systems, observability, Docker, CI/CD, monitoring, load testing, and reliability engineering.
- Host: GitHub
- URL: https://github.com/ayushpramanik/mlh-production-engineering
- Owner: AyushPramanik
- License: mit
- Created: 2026-04-03T16:20:20.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-05-25T14:21:05.000Z (about 1 month ago)
- Last Synced: 2026-05-25T16:24:58.221Z (about 1 month ago)
- Topics: flask, grafana, k6, nginx, postgresql, prometheus, pytest, redis-cache
- Language: Python
- Homepage:
- Size: 351 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# MLH PE Hackathon — URL Shortener + Observability Platform
A production-grade URL shortener with full observability: structured logging, Prometheus metrics, Grafana dashboards, Redis caching, and email alerting.
**Stack:** Flask · Peewee ORM · PostgreSQL 16 · Redis 7 · Prometheus · Grafana · Docker Compose
---
## Architecture
```
┌─────────────────────────────────────────┐
│ Docker Compose Network │
│ │
Browser / Client ─────┤──► web (Flask :5000) ──► db (Postgres) │
:5001 │ │ │ │
│ │ └──────► redis (:6379) │
Grafana :3000 ◄───────┤ │ │
│ ▼ │
Prometheus :9090 ◄────┤──► /prometheus (metrics scrape) │
│ │
└─────────────────────────────────────────┘
Request flow:
Client → Flask before_request (timing start, DB connect)
→ Route handler (cache check → DB query → cache write)
→ Flask after_request (record metrics, log JSON)
→ Response
```
---
## Prerequisites
- **Docker Desktop** (includes Docker Compose) — [install](https://docs.docker.com/get-docker/)
- **uv** (for local development without Docker):
```bash
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
```
---
## Quick Start (Docker — recommended)
```bash
# 1. Clone the repo
git clone && cd mlh-pe-hackathon
# 2. Start everything (app + DB + Redis + Prometheus + Grafana)
docker compose up -d
# 3. Wait ~15 seconds for services to be healthy, then verify
curl http://localhost:5001/health
# → {"status": "ok", "hostname": "..."}
# 4. (Optional) Seed with sample data
docker compose exec web uv run python seed/seed.py
# 5. Open the monitoring dashboard
open http://localhost:5001/dashboard # built-in live dashboard
open http://localhost:3000 # Grafana (admin / admin)
open http://localhost:9090 # Prometheus
```
That's it — the entire stack runs with one command.
---
## Quick Start (Local Development)
```bash
# 1. Install dependencies
uv sync
# 2. Start only the backing services
docker compose up -d db redis
# 3. Configure environment
cp .env.example .env
# Edit .env if your DB credentials differ from the defaults
# 4. Run the Flask server
uv run run.py
# 5. Verify
curl http://localhost:5000/health
```
---
## Running Tests
```bash
uv run pytest
```
Tests mock Redis and run against a real PostgreSQL database. Coverage must stay above 50%.
---
## Project Structure
```
mlh-pe-hackathon/
├── app/
│ ├── __init__.py # App factory, middleware, system routes
│ ├── database.py # Peewee ORM setup and connection hooks
│ ├── cache.py # Redis client and get/set/delete helpers
│ ├── alerting.py # Background email alert manager
│ ├── prometheus_metrics.py # Prometheus counter/histogram/gauge definitions
│ ├── metrics_store.py # Thread-safe sliding-window error tracker
│ ├── logging_config.py # Structured JSON logging setup
│ ├── models/ # Peewee data models (User, URL, Event)
│ ├── routes/ # Blueprint route handlers
│ │ ├── users.py # User CRUD + bulk load
│ │ ├── url.py # URL shortener CRUD + redirect
│ │ └── events.py # Audit log queries
│ └── templates/
│ └── dashboard.html # Self-contained live monitoring dashboard
├── docs/
│ ├── RUNBOOK.md # Incident response playbooks
│ ├── DEPLOY.md # Deployment and rollback guide
│ ├── TROUBLESHOOTING.md # Common problems and fixes
│ ├── DECISIONS.md # Architecture decision log
│ └── CAPACITY.md # Capacity planning and limits
├── monitoring/
│ ├── prometheus.yml # Scrape config
│ └── grafana/provisioning # Auto-provisioned datasource + dashboard
├── scaling/
│ ├── nginx/nginx.conf # Load balancer config (multi-replica)
│ └── load_test/ # k6 and Python load test scripts
├── seed/
│ ├── seed.py # Database seeding script
│ └── *.csv # Sample data files
├── tests/ # pytest test suite
├── docker-compose.yml # Full stack orchestration
├── Dockerfile # App container (python:3.13-slim + uv)
├── pyproject.toml # Dependencies and pytest config
├── .env.example # All environment variables documented
└── fire_drill.py # Alert testing tool (safe, no real incidents needed)
```
---
## API Reference
All endpoints return JSON. Errors follow `{"error": ""}`.
### System
| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Health check — returns `{"status":"ok","hostname":"..."}` |
| GET | `/metrics` | JSON system metrics — CPU%, memory MB, request counts |
| GET | `/prometheus` | Prometheus scrape endpoint (text/plain exposition format) |
| GET | `/logs?limit=100` | Recent application log entries as JSON array |
| GET | `/alert-status` | Current alert states (service_down, high_error_rate) |
| GET | `/dashboard` | Live monitoring dashboard (HTML) |
### Users
| Method | Path | Description |
|--------|------|-------------|
| GET | `/users?page=1&per_page=20` | List users (paginated, cached) |
| POST | `/users` | Create user — body: `{"username":"…","email":"…"}` |
| GET | `/users/` | Get user by ID (cached) |
| PUT | `/users/` | Update username/email |
| DELETE | `/users/` | Hard delete user |
| GET | `/users//urls` | All URLs belonging to a user |
| POST | `/users/bulk` | Bulk import from CSV — body: `{"file":"users.csv"}` |
### URLs
| Method | Path | Description |
|--------|------|-------------|
| GET | `/urls?user_id=&is_active=&page=1&per_page=20` | List URLs (filtered, cached) |
| POST | `/urls` | Create short URL — body: `{"original_url":"…","title":"…","user_id":1}` |
| POST | `/shorten` | Alias for POST /urls — body: `{"url":"…"}` |
| GET | `/urls/` | Get URL by ID (cached) |
| PUT/PATCH | `/urls/` | Update URL fields |
| DELETE | `/urls/` | Soft delete (sets `is_active=false`) |
| GET | `/urls//events` | Audit log for a URL |
| POST | `/urls/bulk` | Bulk import from CSV — body: `{"file":"urls.csv"}` |
| GET | `/` | Redirect to original URL (302) |
### Events
| Method | Path | Description |
|--------|------|-------------|
| GET | `/events?url_id=&user_id=&event_type=` | List events (filtered) |
| POST | `/events` | Create event — body: `{"url_id":1,"event_type":"created"}` |
| GET | `/events/` | Get event by ID |
### Response shapes
**User object:**
```json
{"id": 1, "username": "alice", "email": "alice@example.com", "created_at": "2024-01-01T12:00:00"}
```
**URL object:**
```json
{
"id": 1, "user_id": 1, "short_code": "abc123",
"original_url": "https://example.com", "title": "Example",
"is_active": true, "created_at": "…", "updated_at": "…"
}
```
**Event object:**
```json
{"id": 1, "url_id": 1, "user_id": 1, "event_type": "created", "timestamp": "…", "details": "…"}
```
---
## Monitoring URLs
| Service | URL | Credentials |
|---------|-----|-------------|
| App | http://localhost:5001 | — |
| Live Dashboard | http://localhost:5001/dashboard | — |
| Grafana | http://localhost:3000 | admin / admin |
| Prometheus | http://localhost:9090 | — |
---
## Alerting (Email)
Copy `.env.example` to `.env` and fill in the SMTP section to enable email alerts:
```bash
SMTP_USER=you@gmail.com
SMTP_PASSWORD=your-app-password # Gmail App Password (not your login password)
ALERT_EMAIL_TO=oncall@yourteam.com
```
Two alerts fire automatically:
- **Service Down** — PostgreSQL unreachable
- **High Error Rate** — >10% of requests return 5xx in a 2-minute window
See [`docs/RUNBOOK.md`](docs/RUNBOOK.md) for response playbooks.
---
## Error Handling
| Status | Meaning |
|--------|---------|
| 200 / 201 | Success |
| 302 | Redirect (short URL) |
| 400 | Bad request — validation failed (see `"error"` field) |
| 404 | Resource not found or inactive |
| 405 | Wrong HTTP method |
| 409 | Conflict — duplicate username or email |
| 500 | Unexpected server error (never an HTML stack trace) |
---
## Further Reading
- [`docs/DEPLOY.md`](docs/DEPLOY.md) — how to deploy and rollback
- [`docs/TROUBLESHOOTING.md`](docs/TROUBLESHOOTING.md) — common problems and fixes
- [`docs/DECISIONS.md`](docs/DECISIONS.md) — why we chose each technology
- [`docs/CAPACITY.md`](docs/CAPACITY.md) — how many users can we handle
- [`docs/RUNBOOK.md`](docs/RUNBOOK.md) — incident response playbooks