{"id":51081183,"url":"https://github.com/ab1nv/bales","last_synced_at":"2026-06-23T18:02:36.970Z","repository":{"id":360348031,"uuid":"1247967819","full_name":"ab1nv/bales","owner":"ab1nv","description":"High-throughput dynamic ML inference gateway","archived":false,"fork":false,"pushed_at":"2026-05-26T02:33:15.000Z","size":173,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-05-26T04:19:00.123Z","etag":null,"topics":["fastapi","inference","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ab1nv.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"buy_me_a_coffee":"ab1nv"}},"created_at":"2026-05-24T02:38:35.000Z","updated_at":"2026-05-26T02:33:19.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ab1nv/bales","commit_stats":null,"previous_names":["ab1nv/bales"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/ab1nv/bales","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ab1nv%2Fbales","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ab1nv%2Fbales/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ab1nv%2Fbales/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ab1nv%2Fbales/manifests","owner_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ab1nv","download_url":"https://codeload.github.com/ab1nv/bales/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ab1nv%2Fbales/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34700915,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-23T02:00:07.161Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fastapi","inference","pytorch"],"created_at":"2026-06-23T18:02:33.383Z","updated_at":"2026-06-23T18:02:36.944Z","avatar_url":"https://github.com/ab1nv.png","language":"Python","funding_links":["https://buymeacoffee.com/ab1nv","https://www.buymeacoffee.com/ab1nv"],"categories":[],"sub_categories":[],"readme":"# BALES - High-Throughput ML Inference Gateway\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/ab1nv/bales\"\u003e\n    \u003cimg src=\"https://i.ibb.co/23GJHSh5/bales-logo-1.png\" alt=\"BALES Logo\" width=\"200\" /\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eZero-downtime inference with dynamic batching and priority scheduling.\u003c/strong\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://www.python.org/\"\u003e\u003cimg src=\"https://img.shields.io/badge/python-3.14-blue?logo=python\" 
alt=\"Python 3.14\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/ab1nv/bales/actions/workflows/ci.yml\"\u003e\u003cimg src=\"https://img.shields.io/github/actions/workflow/status/ab1nv/bales/ci.yml?branch=master\u0026label=CI\u0026logo=github\" alt=\"CI\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/ab1nv/bales/actions/workflows/docker-build.yml\"\u003e\u003cimg src=\"https://img.shields.io/github/actions/workflow/status/ab1nv/bales/docker-build.yml?branch=master\u0026label=Docker\u0026logo=docker\" alt=\"Docker Build\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://codecov.io/gh/ab1nv/bales\"\u003e\u003cimg src=\"https://img.shields.io/codecov/c/gh/ab1nv/bales?logo=codecov\" alt=\"Coverage\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/ab1nv/bales/pulls\"\u003e\u003cimg src=\"https://img.shields.io/github/issues-pr/ab1nv/bales?label=PRs\u0026logo=github\" alt=\"Pull Requests\" /\u003e\u003c/a\u003e\n  \u003ca href=\"LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-MIT-yellow.svg\" alt=\"License: MIT\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#quick-start\"\u003eQuick Start\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#architecture\"\u003eArchitecture\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#configuration\"\u003eConfiguration\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#running-and-stress-testing\"\u003eRunning \u0026 Stress Testing\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#benchmarking\"\u003eBenchmarking\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#security\"\u003eSecurity\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#api-reference\"\u003eAPI Reference\u003c/a\u003e \u0026bull;\n  \u003ca href=\"#development\"\u003eDevelopment\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n## Overview\n\nBALES is a production-ready inference gateway designed for high-throughput CPU-based ML serving. 
It combines **Redis-backed priority queues**, **dynamic request batching**, and **atomic model hot-swapping** to deliver:\n\n- **\u003e8,000 req/s** throughput on CPU\n- **P99 latency \u003c12ms** at `batch_size=32`\n- **Zero-downtime** model reloads without dropping in-flight requests\n\nBuilt with **FastAPI**, **PyTorch**, and **asyncio**, BALES is engineered for safety-first concurrency: inference never blocks the event loop, and every request future is guaranteed to resolve or time out cleanly.\n\n## Table of Contents\n\n- [Overview](#overview)\n- [Quick Start](#quick-start)\n  - [Local Development](#local-development)\n  - [Docker](#docker)\n- [Architecture](#architecture)\n  - [Data Flow](#data-flow)\n  - [Key Invariants](#key-invariants)\n- [Configuration](#configuration)\n- [Running and Stress Testing](#running-and-stress-testing)\n  - [Start the Server](#start-the-server)\n  - [Smoke Test](#smoke-test)\n  - [Stress Test with curl](#stress-test-with-curl)\n  - [Concurrent Load with wrk or hey](#concurrent-load-with-wrk-or-hey)\n- [Benchmarking](#benchmarking)\n  - [Isolated Batcher](#isolated-batcher)\n  - [Full-Stack Load Test with Locust](#full-stack-load-test-with-locust)\n  - [Interpreting Results](#interpreting-results)\n- [Security](#security)\n- [API Reference](#api-reference)\n  - [POST /infer](#post-infer)\n  - [GET /health](#get-health)\n  - [POST /models/model_id/reload](#post-modelsmodel_idreload)\n  - [GET /metrics](#get-metrics)\n- [Development](#development)\n\n---\n\n## Quick Start\n\n### Local Development\n\n**Prerequisites:** Python 3.14+, Redis 7+, [uv](https://docs.astral.sh/uv/)\n\n```bash\n# 1. Clone the repository\ngit clone https://github.com/ab1nv/bales.git\ncd bales\n\n# 2. Install dependencies (first time)\nuv sync --extra dev\n\n# 3. Start Redis (if not already running)\nredis-server --save \"\" --appendonly no\n\n# 4. 
Run the server\nuv run python main.py\n```\n\nThe gateway will be available at `http://localhost:8000`.\n\n### Docker\n\n```bash\n# Build and start everything (Redis + Bales)\ndocker compose up --build\n\n# Optional: include Prometheus for metrics scraping\ndocker compose --profile monitoring up --build\n```\n\n---\n\n## Architecture\n\n### Data Flow\n\n```mermaid\nflowchart TD\n    Client[\"Client\"]\n    Routes[\"FastAPI Routes\"]\n    Queue[\"Redis Priority Queue\"]\n    Consumer[\"Consumer Loop\"]\n    Batcher[\"Dynamic Batcher\"]\n    Torch[\"PyTorch run_in_executor\"]\n    Response[\"Response Future\"]\n\n    Client --\u003e|POST /infer| Routes\n    Routes --\u003e|push request| Queue\n    Queue --\u003e|pop batch| Consumer\n    Consumer --\u003e|preprocess \u0026 submit| Batcher\n    Batcher --\u003e|stack tensors| Torch\n    Torch --\u003e|postprocess| Response\n    Response --\u003e|resolve future| Routes\n    Routes --\u003e|JSON response| Client\n```\n\n### Key Invariants\n\n1. **PyTorch inference NEVER runs on the event loop thread** -- always dispatched via `run_in_executor`.\n2. **A request NEVER touches a half-loaded model during hot-swap** -- atomic reference replacement under an async lock.\n3. **A request NEVER gets dropped during hot-swap** -- in-flight requests hold a local reference to the old model until GC cleans up.\n4. **`request_id`** is the single source of truth linking API -\u003e queue -\u003e batcher -\u003e response.\n5. **`pending_futures`** is the ONLY place futures are stored.\n\n---\n\n## Configuration\n\nAll configuration is read from environment variables (with sensible defaults). 
Create a `.env` file from the example:\n\n```bash\ncp .env.example .env\n```\n\n| Variable | Default | Description |\n|----------|---------|-------------|\n| `REDIS_URL` | `redis://localhost:6379/0` | Redis connection string |\n| `MAX_BATCH_SIZE` | `32` | Maximum requests per batch |\n| `BATCH_WINDOW_MS` | `5.0` | Collection window in milliseconds |\n| `BATCHER_TIMEOUT_S` | `5.0` | Client timeout before 504 |\n| `DEFAULT_MODEL_ID` | `stub_v1` | Default registered model |\n| `THREAD_POOL_SIZE` | `4` | Executor threads for torch inference |\n| `HOST` | `0.0.0.0` | Server bind host |\n| `PORT` | `8000` | Server port |\n| `LOG_LEVEL` | `info` | Logging level |\n| `ENABLE_PROMETHEUS` | `true` | Enable metrics export |\n\n\u003e **Note:** `workers` must remain `1` for in-process shared state (`pending_futures`). Scale horizontally with Docker replicas instead.\n\n---\n\n## Running and Stress Testing\n\n### Start the Server\n\n```bash\n# Local (requires Redis running)\nuv run python main.py\n\n# Or with Docker (includes Redis)\ndocker compose up --build\n```\n\nThe server will start on `http://localhost:8000`.\n\n### Smoke Test\n\nVerify the server is healthy and can serve inference:\n\n```bash\n# Health check\ncurl http://localhost:8000/health\n\n# Single inference request\ncurl -X POST http://localhost:8000/infer \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model_id\": \"stub_v1\",\n    \"model_type\": \"classification\",\n    \"payload\": {\"input\": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3]}\n  }'\n```\n\n\u003e **Note:** The stub model expects exactly 128 floats in the `input` array. 
The example above is truncated for readability.\n\n### Stress Test with curl\n\nSend 1,000 sequential requests (prefix the loop with `time` to measure the total duration):\n\n```bash\n# Generate a valid 128-element input\ninput_json=$(python3 -c \"import json; print(json.dumps({'input': [0.1]*128}))\")\n\n# Sequential stress test\nfor i in {1..1000}; do\n  curl -s -X POST http://localhost:8000/infer \\\n    -H \"Content-Type: application/json\" \\\n    -d \"{\n      \\\"model_id\\\": \\\"stub_v1\\\",\n      \\\"model_type\\\": \\\"classification\\\",\n      \\\"priority\\\": 2,\n      \\\"payload\\\": $input_json\n    }\" \u003e /dev/null\ndone\n```\n\n### Concurrent Load with wrk or hey\n\nFor true concurrency testing, use a load generator. The inline payloads below are shortened for readability; substitute a full 128-element `input` array, since the stub model expects exactly 128 floats.\n\n**Using hey (simple, single binary):**\n```bash\n# Install: go install github.com/rakyll/hey@latest\n# Or: apt-get install hey\n\n# Run 50,000 requests with 500 concurrent connections\nhey -n 50000 -c 500 -m POST \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model_id\":\"stub_v1\",\"model_type\":\"classification\",\"priority\":2,\"payload\":{\"input\":[0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1]}}' \\\n  http://localhost:8000/infer\n```\n\n**Using wrk (more accurate, multi-threaded):**\n```bash\n# Install wrk first: https://github.com/wg/wrk/wiki/Installing-Wrk-on-Linux\n\n# Create a Lua script for POST body\ncat \u003e infer.lua \u003c\u003c 'EOF'\nwrk.method = \"POST\"\nwrk.headers[\"Content-Type\"] = \"application/json\"\nwrk.body = '{\"model_id\":\"stub_v1\",\"model_type\":\"classification\",\"priority\":2,\"payload\":{\"input\":[0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1]}}'\nEOF\n\n# Run with 12 threads, 400 connections, for 30 seconds\nwrk -t12 -c400 -d30s -s infer.lua http://localhost:8000/infer\n```\n\n**Monitor during stress test:**\n```bash\n# Watch queue depth and pending requests\ncurl http://localhost:8000/health | python3 -m json.tool\n\n# Watch Prometheus metrics\ncurl http://localhost:8000/metrics | grep 
bales_\n```\n\n---\n\n## Benchmarking\n\n### Isolated Batcher\n\nTest pure batching throughput (no HTTP or Redis overhead). This tells you the theoretical maximum of the batching engine:\n\n```bash\nuv run python benchmarks/profile_batcher.py\n```\n\nExpected output:\n```\n=== Batcher Benchmark ===\nRequests:          10,000\nElapsed:           1.23s\nThroughput:        8,130 req/s\nP50 latency:       2.45ms\nP99 latency:       8.12ms\nAvg batch size:    31.2\nModel calls:       321  (vs 10000 individual = 31.2x reduction)\nTarget:            \u003e8,000 req/s, P99 \u003c12ms\nPass:              PASS\n```\n\n**What to tune:**\n- If throughput is low (\u003c 8,000): increase `BATCH_WINDOW_MS` slightly (try 5ms -\u003e 8ms) to allow more requests to accumulate per batch\n- If P99 is high (\u003e 12ms): reduce `BATCH_WINDOW_MS` (try 5ms -\u003e 3ms) or increase `THREAD_POOL_SIZE`\n- If avg batch size is low (\u003c 20): increase concurrent load or increase `BATCH_WINDOW_MS`\n\n### Full-Stack Load Test with Locust\n\nBenchmark the complete HTTP -\u003e Redis -\u003e Batcher pipeline:\n\n```bash\n# Install locust (already in dev dependencies)\nuv sync --extra dev\n\n# Run headless load test\nuv run locust -f benchmarks/locustfile.py \\\n  --headless -u 500 -r 100 \\\n  --run-time 60s --host http://localhost:8000\n```\n\n**Parameters explained:**\n- `-u 500`: simulate 500 concurrent users\n- `-r 100`: spawn 100 users per second until the user count is reached\n- `--run-time 60s`: run for 60 seconds\n- `--host http://localhost:8000`: target the local server\n\n**After the run, Locust prints:**\n- Total requests per second (RPS)\n- Average, median, and percentile latencies\n- Failure count and error rate\n\n### Interpreting Results\n\n| Metric | Target | What to do if failing |\n|--------|--------|----------------------|\n| Throughput | \u003e 8,000 req/s | Increase `-u` (users) in Locust. Check CPU usage with `htop`. |\n| P99 latency | \u003c 12ms | Reduce `BATCH_WINDOW_MS` or increase `THREAD_POOL_SIZE`. 
Check `/health` for queue backlog. |\n| Error rate | 0% | Check logs for timeout or Redis connection errors. Verify `pending_futures` is 0 in `/health`. |\n| Avg batch size | \u003e 20 | Should be close to `MAX_BATCH_SIZE` (32). If low, increase load or window. |\n\n**Comparison checklist:**\n1. Run `profile_batcher.py` first to establish the ceiling (no HTTP/Redis overhead)\n2. Run `locustfile.py` to measure real-world throughput\n3. Compare: `locust RPS` should be ~60-80% of `profile_batcher RPS` due to HTTP + Redis overhead\n4. If gap is larger: HTTP layer or Redis is the bottleneck, not the batcher\n5. If gap is small: batcher is the bottleneck, tune `THREAD_POOL_SIZE` or `BATCH_WINDOW_MS`\n\n---\n\n## Security\n\nBALES follows security best practices:\n\n- **Input validation:** All requests are validated via Pydantic v2 before entering the pipeline.\n- **No shell execution:** `weights_path` in hot-swap is validated with `Path.exists()` and never passed to shell commands.\n- **Resource limits:** Docker Compose enforces CPU (`4.0`) and memory (`2G`) caps.\n- **No Redis persistence:** Queue data is ephemeral by design (`--save \"\" --appendonly no`) to avoid I/O overhead and accidental data retention.\n- **Single worker:** Prevents shared-state corruption; horizontal scaling is done via container replicas behind a load balancer.\n- **Healthchecks:** Docker `HEALTHCHECK` polls `/health` every 10s to detect degraded state.\n- **Non-root container:** The Docker image runs as an unprivileged `bales` user.\n\n---\n\n## API Reference\n\n### POST /infer\n\nSubmit an inference request.\n\n**Request body:**\n\n```json\n{\n  \"model_id\": \"stub_v1\",\n  \"model_type\": \"classification\",\n  \"priority\": 2,\n  \"payload\": {\n    \"input\": [0.1, 0.2, ...]\n  }\n}\n```\n\n**Response:**\n\n```json\n{\n  \"request_id\": \"uuid\",\n  \"model_id\": \"stub_v1\",\n  \"result\": { \"label\": 3, \"confidence\": 0.95 },\n  \"latency_ms\": 4.123,\n  \"batch_size\": 16,\n  
\"queued_ms\": 1.234\n}\n```\n\n### GET /health\n\nReturns system health, registered models, queue depths, and pending request count.\n\n```bash\ncurl http://localhost:8000/health\n```\n\n### POST /models/{model_id}/reload\n\nHot-swap a model's weights without dropping traffic.\n\n**Request body:**\n\n```json\n{\n  \"weights_path\": \"./weights/new_model.pt\"\n}\n```\n\n**Example:**\n```bash\ncurl -X POST http://localhost:8000/models/stub_v1/reload \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"weights_path\": \"./weights/stub_v2.pt\"}'\n```\n\n### GET /metrics\n\nPrometheus scrape endpoint exposing:\n\n- `bales_requests_total` - Total inference requests by model_id and status\n- `bales_request_latency_ms` - End-to-end latency distribution\n- `bales_batch_size` - Number of requests in each dispatched batch\n- `bales_queue_depth` - Number of requests waiting in priority queue\n\n```bash\ncurl http://localhost:8000/metrics\n```\n\n---\n\n## Development\n\n```bash\n# Run the test suite (requires Redis on localhost:6379)\nuv run pytest tests/ -v\n\n# Run a specific test file\nuv run pytest tests/test_integration.py -v\n\n# Run with coverage\nuv run pytest tests/ -v --cov=. 
--cov-report=html\n\n# Profile the batcher\nuv run python benchmarks/profile_batcher.py\n\n# Run the load test\nuv run locust -f benchmarks/locustfile.py --headless -u 500 -r 100 --run-time 60s --host http://localhost:8000\n\n# Lint check\nuv run ruff check .\n\n# Format check\nuv run ruff format --check .\n\n# Type check\nuv run ty check\n```\n\n---\n\n\u003cp align=\"center\"\u003e\n  Built with FastAPI + PyTorch + Redis + uv\u003cbr\u003e\n  \u003ca href=\"https://www.buymeacoffee.com/ab1nv\" target=\"_blank\"\u003eBuy me a coffee\u003c/a\u003e\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fab1nv%2Fbales","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fab1nv%2Fbales","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fab1nv%2Fbales/lists"}