https://github.com/vcal-project/ai-firewall
OpenAI-compatible LLM gateway that reduces API costs using Redis exact cache and Qdrant semantic cache.
- Host: GitHub
- URL: https://github.com/vcal-project/ai-firewall
- Owner: vcal-project
- License: apache-2.0
- Created: 2026-03-12T19:51:50.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-06-10T10:17:29.000Z (9 days ago)
- Last Synced: 2026-06-10T11:20:50.237Z (9 days ago)
- Topics: ai-cost-optimization, ai-gateway, ai-infrastructure, llm, openai, qdrant, redis, rust, semantic-cache, vector-search
- Language: Rust
- Homepage: https://vcal-project.com
- Size: 1.18 MB
- Stars: 7
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
# AI Cost Firewall




## Pilot-ready OpenAI-compatible gateway for LLM caching, cost control, and observability
AI Cost Firewall is a lightweight OpenAI-compatible API gateway that reduces LLM API cost and latency through two cache layers:
* exact cache using Redis
* semantic cache using Qdrant
Only cache misses are forwarded to the upstream LLM endpoint.
v0.2.0 is the first pilot-ready milestone of AI Cost Firewall. It consolidates the v0.1.x work into a stable OpenAI-compatible gateway model for caching, cost visibility, and operational diagnostics.
AI Cost Firewall is developed and maintained by VCAL Labs, Inc.
---
# Why AI Cost Firewall?
LLM applications frequently generate repeated or semantically similar prompts.
Without caching, every request results in:
- repeated upstream API calls
- additional token usage
- higher cost
- avoidable latency
AI Cost Firewall introduces a two-layer cache:
1. Exact cache (Redis)
2. Semantic cache (Qdrant)
The firewall acts as an “nginx for LLM APIs”:
- applications call AI Cost Firewall
- the firewall evaluates exact and semantic cache reuse
- only cache misses reach the upstream provider
Supported OpenAI-compatible providers include:
- OpenAI
- Ollama
- LM Studio
- vLLM
- LiteLLM
- OpenRouter
---
# v0.2.1 Release Focus
AI Cost Firewall v0.2.1 builds on the v0.2.0 pilot-ready baseline with additional gateway controls, clearer fail-open behavior, and improved deployment diagnostics.
This release focuses on:
* configurable exact cache enable/disable behavior
* explicit Redis/exact-cache fail-open behavior
* separate upstream and embedding timeout controls
* request body and prompt-size protection
* independent exact and semantic cache store controls
* per-request cache bypass using `X-AIF-Cache-Bypass`
* metrics endpoint access-control configuration
* configurable readiness dependency behavior for Redis, Qdrant, and upstream providers
* improved Grafana Overview and Diagnostics dashboards
* cache-bypass visibility in Prometheus and Grafana
* cleaner Docker runtime image for release testing
* continued support for OpenAI-compatible chat and embedding APIs
v0.2.1 is an operational hardening release. It keeps the v0.2.0 architecture stable while making the gateway easier to test, debug, and deploy in pilot and production-like environments.
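
A client can opt out of caching for a single call by sending the `X-AIF-Cache-Bypass` header. The sketch below builds such a request with Python's standard library. Only the header name comes from this README; the header value `true`, the helper name `build_bypass_request`, and the base URL are assumptions for illustration, so check the project documentation for the accepted values.

```python
import json
import urllib.request


def build_bypass_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a chat-completion request that asks the gateway to skip its caches."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: "true" is an accepted value for this header.
            "X-AIF-Cache-Bypass": "true",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request. If bypass works as described, the request should be counted in the cache-bypass metrics rather than as a cache hit.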
---
# Included Dashboards
AI Cost Firewall v0.2.1 includes Grafana dashboards for cost visibility, cache effectiveness, and operational diagnostics.
The dashboards are included in the Docker deployment files and are automatically provisioned by Grafana when using the provided Docker Compose setup.
## Cost Savings Overview
30-minute cold-cache demo run with local simulated OpenAI-compatible upstream.
The Overview dashboard shows the high-level cost and cache impact of AI Cost Firewall.
It shows:
- total request volume
- estimated chat-completion cost
- gross savings from cache reuse
- embedding overhead
- net savings after embedding cost
- net savings percentage
- cache hit rate
- exact and semantic cache activity
- cache bypass request rate
- per-model spend and savings
- savings by cache type
This dashboard is intended for quick validation, demos, and cost-savings reviews.
---
## Semantic Diagnostics
![AI Cost Firewall semantic diagnostics dashboard](assets/grafana/ai-firewall-diagnostics-021.png)
Semantic diagnostics from the same cold-cache demo run, including readiness, threshold behavior, lookup latency, and cache activity.
The Diagnostics dashboard provides a deeper operational view of semantic-cache behavior and runtime health.
It shows:
- readiness state
- semantic lookup volume
- semantic threshold pass/fail behavior
- semantic candidate evaluation
- expired semantic entries skipped during lookup
- semantic lookup latency
- upstream and embedding latency
- embedding overhead by operation
- gross vs net semantic savings
- exact vs semantic savings
- semantic cache misses vs threshold passes
- semantic store health
- runtime and provider pressure signals
- provider error classes
This dashboard is intended for troubleshooting, tuning semantic similarity thresholds, validating fail-open behavior, and understanding runtime cache behavior during pilots.
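
To make the threshold and expiry signals above more concrete, here is a minimal sketch of semantic candidate evaluation. It assumes cosine similarity scoring and a hypothetical `Candidate` record with an `expires_at` timestamp; the gateway's actual scoring and data model live in Qdrant and may differ.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class Candidate:
    response: str
    vector: Sequence[float]
    expires_at: float  # Unix timestamp; hypothetical field for this sketch


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def evaluate(
    query: Sequence[float],
    candidates: List[Candidate],
    threshold: float,
    now: float,
) -> Tuple[Optional[Candidate], Dict[str, int]]:
    """Pick the best unexpired candidate at or above the threshold, with counters."""
    stats = {"expired_skipped": 0, "threshold_pass": 0, "threshold_fail": 0}
    best: Optional[Candidate] = None
    best_score = -1.0
    for cand in candidates:
        if cand.expires_at <= now:
            stats["expired_skipped"] += 1  # expired entries never count as hits
            continue
        score = cosine(query, cand.vector)
        if score >= threshold:
            stats["threshold_pass"] += 1
            if score > best_score:
                best, best_score = cand, score
        else:
            stats["threshold_fail"] += 1
    return best, stats
```

Raising the threshold in this model trades semantic hits for fewer wrong reuses, which is the kind of tuning the pass/fail panels are meant to inform.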
---
# Deployment Patterns
AI Cost Firewall includes ready-to-run deployment examples under:
```text
deploy/examples/
```
Available patterns:
| Pattern | Description |
|---|---|
| `openai-cloud/` | Fastest cloud evaluation path |
| `local-ollama/` | Fully local OpenAI-compatible deployment |
| `hybrid-openai-local-embeddings/` | OpenAI chat + local embeddings |
| `openrouter/` | OpenRouter upstream with OpenAI embeddings |
| `local-full-stack/` | Fully local stack with dashboards |
Each example includes:
- `docker-compose.yml`
- minimal configuration
- example requests
- expected behavior
- expected metrics
- optional observability overlays
---
# Architecture Overview
![AI Cost Firewall architecture diagram](assets/architecture/ai-cost-firewall-diagram.png)
Client applications send requests to AI Cost Firewall instead of directly to the LLM provider.
The firewall:
1. validates requests
2. checks exact cache
3. checks semantic cache
4. forwards only cache misses upstream
5. exposes metrics and operational diagnostics
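
The request flow above can be sketched in a few lines of Python. This is an illustration only: an in-memory dict stands in for Redis, the semantic lookup and the upstream call are injected callables, and the SHA-256 key derivation is an assumption rather than the gateway's actual keying scheme.

```python
import hashlib
import json
from typing import Callable, Dict, Optional, Tuple


def exact_key(request: dict) -> str:
    """Hash a canonical JSON form so identical requests share one key (assumed scheme)."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def handle(
    request: dict,
    exact_cache: Dict[str, str],
    semantic_lookup: Callable[[dict], Optional[str]],
    call_upstream: Callable[[dict], str],
) -> Tuple[str, str]:
    """Return (response, outcome) where outcome is exact_hit, semantic_hit, or miss."""
    # 1. Validate the request.
    if not request.get("model") or not request.get("messages"):
        raise ValueError("request needs 'model' and 'messages'")
    # 2. Check the exact cache.
    key = exact_key(request)
    if key in exact_cache:
        return exact_cache[key], "exact_hit"
    # 3. Check the semantic cache.
    cached = semantic_lookup(request)
    if cached is not None:
        return cached, "semantic_hit"
    # 4. Forward the miss upstream and remember the answer.
    response = call_upstream(request)
    exact_cache[key] = response
    return response, "miss"
```

Step 5, metrics, would amount to counting the returned outcome labels.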
Full architecture documentation:
```text
docs/architecture.md
```
---
# Quick Start (Docker)
## Prerequisites
Install:
- Docker
- Docker Compose
Verify installation:
```bash
docker --version
docker compose version
```
---
## Clone the repository
```bash
git clone https://github.com/vcal-project/ai-firewall.git
cd ai-firewall
```
Copy the example configuration:
```bash
cp configs/ai-firewall.conf.example configs/ai-firewall.conf
```
Edit the configuration and add your API key:
```bash
nano configs/ai-firewall.conf
```
---
## Start the stack
The default deployment starts:
- AI Cost Firewall
- Redis
- Qdrant
- Prometheus
- Grafana
```bash
docker compose pull
docker compose up -d
```
---
## Validate the deployment
```bash
curl http://localhost:8080/healthz
curl http://localhost:8080/readyz
curl http://localhost:8080/version
```
Expected responses from `/healthz` and `/readyz`, in order:
```text
OK
ready
```
The `/version` endpoint returns release metadata, including the AI Cost Firewall version, the release title, and the OpenAI compatibility model.
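
In scripts it can be convenient to wait for both probes before sending traffic. The helper below is a sketch: only the `/healthz` and `/readyz` paths come from this README, while the function names, retry settings, and the assumption that a healthy probe answers with HTTP 200 are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True when the endpoint answers with HTTP 200 (assumed success code)."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx replies all count as "not yet".
        return False


def wait_until_ready(
    base_url: str,
    probe: Callable[[str], bool] = http_probe,
    attempts: int = 30,
    delay: float = 1.0,
) -> bool:
    """Poll liveness, then readiness, until both pass or attempts run out."""
    for _ in range(attempts):
        if probe(f"{base_url}/healthz") and probe(f"{base_url}/readyz"):
            return True
        time.sleep(delay)
    return False
```

For example, `wait_until_ready("http://localhost:8080")` returns `True` once both endpoints respond successfully, or `False` after the attempts are exhausted.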
---
## Example Request
```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-2024-07-18",
    "messages": [
      {"role": "user", "content": "Explain Redis briefly."}
    ]
  }'
```
Run the same request twice.
- The first request should go upstream.
- The second request should be served from cache.
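
The same check can be scripted. The sketch below sends one payload repeatedly and records wall-clock latency per call; `post_chat` and `time_repeats` are hypothetical helper names, and the sender is injectable so the timing loop can be exercised without a running gateway.

```python
import json
import time
import urllib.request
from typing import Callable, List

PAYLOAD = {
    "model": "gpt-4o-mini-2024-07-18",
    "messages": [{"role": "user", "content": "Explain Redis briefly."}],
}


def post_chat(base_url: str, payload: dict) -> bytes:
    """POST a chat-completion payload to the gateway and return the raw body."""
    req = urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read()


def time_repeats(send: Callable[[dict], bytes], payload: dict, repeats: int = 2) -> List[float]:
    """Send the same payload several times and return the latency of each call in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        send(payload)
        timings.append(time.perf_counter() - start)
    return timings
```

Calling `time_repeats(lambda p: post_chat("http://localhost:8080", p), PAYLOAD)` against a running stack should typically show a much smaller second timing. Latency is only a hint, though; the cache-hit counters on `/metrics` are the more reliable confirmation.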
---
# Operational Features
AI Cost Firewall includes operational safeguards and observability features designed for real deployments.
## Runtime Features
- readiness and liveness endpoints
- graceful shutdown with request draining
- startup dependency validation
- nginx-style configuration reload (SIGHUP)
- structured Prometheus metrics
- semantic cache lifecycle control
- upstream timeout tracking
- request size protection
- runtime diagnostics
- configurable semantic cache fail-open behavior
---
## Health Endpoints
| Endpoint | Purpose |
|---|---|
| `/healthz` | Process liveness |
| `/readyz` | Ready to serve traffic |
---
## Configuration Validation
Validate configuration statically before startup:
```bash
docker compose run --rm firewall \
  --config /configs/ai-firewall.conf \
  --test-config
```
Expected output:
```text
configuration OK
```
---
## Semantic Cache Fail-Open Behavior
When `semantic_cache_fail_open` is enabled, runtime semantic cache lookup or embedding failures skip semantic cache and continue to the upstream LLM endpoint.
This setting applies to runtime semantic cache behavior. It does not bypass startup dependency validation when semantic cache is enabled. If semantic cache is enabled, Qdrant must be reachable during startup and the configured vector size must match the collection.
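
The runtime part of this behavior can be pictured as a guarded lookup. The sketch below is an assumption-laden illustration: `SemanticCacheError` stands in for any Qdrant lookup or embedding failure, and re-raising when fail-open is disabled is a guess at the fail-closed path, which this README does not describe.

```python
from typing import Callable, Optional


class SemanticCacheError(Exception):
    """Stand-in for a runtime semantic lookup or embedding failure."""


def answer(
    prompt: str,
    semantic_lookup: Callable[[str], Optional[str]],
    call_upstream: Callable[[str], str],
    fail_open: bool,
) -> str:
    """Serve from the semantic cache when possible, otherwise go upstream."""
    try:
        cached = semantic_lookup(prompt)
    except SemanticCacheError:
        if not fail_open:
            raise  # assumed fail-closed behavior: surface the error
        cached = None  # fail open: treat the failure as a cache miss
    if cached is not None:
        return cached
    return call_upstream(prompt)
```

With `fail_open=True`, a semantic cache outage costs only the savings from that layer; requests still reach the upstream endpoint.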
---
## Print Loaded Configuration
```bash
docker compose run --rm firewall \
  --config /configs/ai-firewall.conf \
  --print-config
```
Secrets are automatically masked.
---
# OpenAI-Compatible Providers
AI Cost Firewall supports practical OpenAI-compatible deployments while keeping a simple flat configuration model.
The current model is:
```text
upstream_provider openai_compatible;
embedding_provider openai_compatible;
```
This means AI Cost Firewall expects OpenAI-style chat and embedding APIs. It does not yet provide provider-specific configuration blocks or native provider-specific request transformations.
Common OpenAI-compatible deployment patterns include:
| Runtime or Gateway | Usage Pattern |
| ------------------ | ------------------------------------------- |
| OpenAI | Cloud OpenAI-compatible chat and embeddings |
| Ollama | Local OpenAI-compatible model endpoint |
| LM Studio | Local OpenAI-compatible model endpoint |
| vLLM | Self-hosted OpenAI-compatible serving |
| LiteLLM | Gateway in front of multiple providers |
| OpenRouter | OpenAI-compatible hosted gateway |
Example configuration:
```text
upstream_provider openai_compatible;
upstream_base_url https://api.openai.com;
upstream_api_key sk-your-key;
embedding_provider openai_compatible;
embedding_base_url https://api.openai.com;
embedding_api_key sk-your-key;
```
The upstream provider and embedding provider may use different OpenAI-compatible base URLs.
Important limitations:
* AI Cost Firewall does not claim universal compatibility with every OpenAI-like API.
* Native Anthropic, Gemini, Mistral, and Cohere APIs are not directly supported in v0.2.0.
* Mistral, Anthropic, Gemini, or other providers may be used only when exposed through an OpenAI-compatible layer such as LiteLLM, OpenRouter, or another compatible gateway.
* Provider-specific config blocks, fallback chains, native provider transformations, and provider-specific pricing catalogs are intentionally postponed until after v0.2.0.
See:
```text
configs/examples/
deploy/examples/
docs/provider-compatibility.md
```
---
# Metrics Overview
Metrics are exposed at:
```text
http://localhost:8080/metrics
```
Example metrics:
```text
aif_requests_total
aif_cache_exact_hits
aif_cache_semantic_hits
aif_model_cost_micro_usd_total
aif_gross_saved_micro_usd_total
aif_net_saved_micro_usd_total
aif_embedding_overhead_micro_usd_total
```
AI Cost Firewall reports:
- gross chat-completion savings
- embedding overhead
- net savings after embedding cost
- cache hit ratios
- semantic cache diagnostics
- per-model traffic and cost metrics
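
As a rough illustration of how these counters relate, the sketch below sums counters from Prometheus text output and derives net savings and a hit rate. The metric names are the ones listed above; the sample values, the `model` label, the absence of trailing timestamps, and the hit-rate formula (exact plus semantic hits over total requests) are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict

# Made-up sample values, for illustration only.
SAMPLE = """\
# TYPE aif_requests_total counter
aif_requests_total{model="a"} 80
aif_requests_total{model="b"} 20
aif_cache_exact_hits 60
aif_cache_semantic_hits 15
aif_gross_saved_micro_usd_total 5000
aif_embedding_overhead_micro_usd_total 500
"""


def sum_counters(exposition: str) -> Dict[str, float]:
    """Sum each metric across its label sets, ignoring comment lines."""
    totals: Dict[str, float] = defaultdict(float)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")  # assumes no trailing timestamp
        name = series.split("{", 1)[0]
        totals[name] += float(value)
    return dict(totals)


def savings_summary(totals: Dict[str, float]) -> Dict[str, float]:
    """Net savings = gross savings minus embedding overhead, converted from micro-USD."""
    gross = totals.get("aif_gross_saved_micro_usd_total", 0.0)
    overhead = totals.get("aif_embedding_overhead_micro_usd_total", 0.0)
    requests = totals.get("aif_requests_total", 0.0)
    hits = totals.get("aif_cache_exact_hits", 0.0) + totals.get("aif_cache_semantic_hits", 0.0)
    return {
        "net_saved_usd": (gross - overhead) / 1_000_000,
        "hit_rate": hits / requests if requests else 0.0,
    }
```

In practice the gateway already exports `aif_net_saved_micro_usd_total`, so the subtraction here mainly shows how the gross and overhead counters fit together.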
---
# Configuration
AI Cost Firewall uses a simple nginx-style configuration format.
Minimal example:
```text
listen_addr 0.0.0.0:8080;
redis_url redis://redis:6379;
upstream_provider openai_compatible;
upstream_base_url https://api.openai.com;
upstream_api_key sk-your-key;
semantic_cache_enabled true;
```
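
For tooling that needs to read such a file, a flat `name value;` layout is simple to parse. The sketch below is not the gateway's parser; it assumes one directive per line, a trailing semicolon, and `#` at the start of a line for comments, none of which is confirmed beyond the examples in this README.

```python
from typing import Dict

EXAMPLE = """\
listen_addr 0.0.0.0:8080;
redis_url redis://redis:6379;
upstream_provider openai_compatible;
upstream_base_url https://api.openai.com;
upstream_api_key sk-your-key;
semantic_cache_enabled true;
"""


def parse_directives(text: str) -> Dict[str, str]:
    """Parse flat 'name value;' lines into a dict, raising on malformed lines."""
    directives: Dict[str, str] = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        if not line.endswith(";"):
            raise ValueError(f"line {lineno}: missing ';'")
        parts = line[:-1].split(None, 1)
        if len(parts) != 2 or not parts[1].strip():
            raise ValueError(f"line {lineno}: expected 'name value;'")
        directives[parts[0]] = parts[1].strip()
    return directives
```

All values come back as strings; interpreting `true` as a boolean or validating URLs is left to the caller. For authoritative validation, use the `--test-config` flag described above.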
Full documentation:
- `docs/config-reference.md`
- `docs/provider-compatibility.md`
- `docs/quickstart.md`
---
# Benchmarks
AI Cost Firewall v0.2.0 has been benchmarked with a local simulated OpenAI-compatible upstream provider to isolate gateway behavior, Redis/Qdrant integration, cache effectiveness, and Prometheus metrics without external API cost or provider rate-limit noise.
In a 30-minute cache-effectiveness benchmark, AI Cost Firewall sustained 30 RPS with 0% request failures, p95 latency of 9.03 ms, and a 98.86% aggregate cache-hit rate.
In a single-VM high-load benchmark, AI Cost Firewall sustained approximately 500 RPS for 5 minutes with 0% HTTP failures. Higher RPS values caused instability in the single-VM test environment, so this should be treated as a local benchmark observation, not a universal capacity limit.
See [BENCHMARKS.md](BENCHMARKS.md) for benchmark methodology, environment, limitations, and detailed results.
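
As a back-of-envelope reading of the cache-effectiveness figures, the arithmetic below estimates how many requests would have reached the upstream. It assumes a constant 30 RPS for the full 30 minutes and that the aggregate hit rate applies to every request; these are simplifications, not reported results, and `BENCHMARKS.md` remains the source for actual counts.

```python
RPS = 30                 # sustained request rate from the benchmark summary
DURATION_S = 30 * 60     # 30-minute run, in seconds
HIT_RATE = 0.9886        # aggregate cache-hit rate from the benchmark summary

total_requests = RPS * DURATION_S
cache_hits = round(total_requests * HIT_RATE)
upstream_calls = total_requests - cache_hits

print(total_requests, cache_hits, upstream_calls)
```

Under these assumptions, roughly 54,000 requests would produce only about 616 upstream calls.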
---
# Troubleshooting
See:
- `docs/troubleshooting.md`
- `docs/provider-compatibility.md`
- `docs/operation.md`
Common issues include:
- incorrect upstream base URLs
- provider TLS/certificate failures
- embedding dimension mismatches
- Qdrant vector-size mismatch
- unsupported provider behavior
- semantic threshold tuning
---
# Documentation
| Document | Description |
|---|---|
| `docs/architecture.md` | System architecture |
| `docs/config-reference.md` | Configuration directives |
| `docs/faq.md` | Frequently asked questions |
| `docs/how-it-works.md` | Request flow and cache logic |
| `docs/metrics-and-costs.md` | Cost and savings accounting |
| `docs/operation.md` | Runtime behavior |
| `docs/provider-compatibility.md` | OpenAI-compatible providers |
| `docs/quickstart.md` | Extended setup guide |
| `docs/troubleshooting.md` | Troubleshooting guide |
Full documentation:
https://ai-firewall.docs.vcal-project.com/
---
# Build from Source
```bash
git clone https://github.com/vcal-project/ai-firewall.git
cd ai-firewall
cargo build --release
cargo run --release
```
---
# Testing
Run tests:
```bash
cargo test
```
AI Cost Firewall includes tests for:
- configuration validation
- request validation
- semantic cache requirements
- semantic cache fail-open behavior
- environment variable parsing
- request size parsing
- cost accounting logic
---
# Contributing
Contributions are welcome.
Areas where contributions are especially valuable:
- documentation
- performance
- observability
- provider compatibility
- deployment examples
- testing
See:
```text
CONTRIBUTING.md
```
---
# Integration with VCAL Semantic Cache
AI Cost Firewall can optionally integrate with VCAL Semantic Cache for advanced semantic caching and distributed vector storage.
https://vcal-project.com/vcal-server
---
# License
Apache License 2.0