https://github.com/ctrl-gaurav/effgen
[ICML 2026] effGen: Enabling Small Language Models as Capable Autonomous Agents
- Host: GitHub
- URL: https://github.com/ctrl-gaurav/effgen
- Owner: ctrl-gaurav
- License: apache-2.0
- Created: 2026-01-31T07:26:17.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-05-19T14:44:19.000Z (24 days ago)
- Last Synced: 2026-05-19T15:38:38.285Z (24 days ago)
- Topics: agentic-ai, agents, large-language-models, small-language-models
- Language: Python
- Homepage: http://effgen.org/
- Size: 3.41 MB
- Stars: 162
- Watchers: 4
- Forks: 27
- Open Issues: 40
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md
---
## News & Updates
| Date | Update |
|:---|:---|
| **27 May 2026** | **v0.2.10 Released**: Security, Edge & DX - secret scanning (gitleaks), SBOM (CycloneDX), pip-audit CI, sandboxed CodeExecutor (SubprocessSandbox + DockerSandbox), OAuth2/OIDC + RBAC + audit log, Docker + Helm, AWS Lambda (Mangum), Cloudflare Worker edge proxy, VSCode extension, Jupyter magics, live dashboard. [See changelog](CHANGELOG.md#0210---2026-05-27) |
| **23 May 2026** | **v0.2.9 Released**: Observability & Reliability - structured JSON logs + secret redaction, OTel samplers + canonical span spec, Prometheus histograms, SLO tracking, circuit breakers, bulkheads, jittered retries, chaos harness, fuzz suite, `effgen loadtest` CLI, Alertmanager rules. [See changelog](CHANGELOG.md#029---2026-05-23) |
| **21 May 2026** | **v0.2.8 Released**: First-class multimodal input - image, audio, and video across 6 providers (Gemini, OpenAI, Groq, Anthropic, Together, HF). New `multimodal` preset, `MultimodalDescribeTool`, unified `Message` content schema, 5 cookbook walkthroughs. [See changelog](CHANGELOG.md#028---2026-05-21) |
| **20 May 2026** | **v0.2.7 Released**: 31 prompt templates across 7 domains (research, coding, data/SQL, legal, medical, creative, business), with golden eval harness, interactive playground, and auto-generated gallery. [See changelog](CHANGELOG.md#027---2026-05-20) |
| **19 May 2026** | **v0.2.6 Released**: 14 new tools - OCR, AudioTranscribe, ImageInfo, ImageCaption, PDF, DOCX, Excel, Weather, Geocode, Maps, EmailSMTP, EmailIMAP, SlackWebhook, DiscordWebhook. New presets: `media`, `notify`. 58+ built-in tools total. [See changelog](CHANGELOG.md#026---2026-05-19) |
| **18 May 2026** | **v0.2.5 Released**: 13 new free tools - PubMed, ArXiv, SemanticScholar, RSS, News, YouTubeTranscript, YouTubeMetadata, Reddit, HackerNews, Translate, LanguageDetect, QRGenerate, QRRead. 44+ built-in tools total. [See changelog](CHANGELOG.md#025---2026-05-18) |
| **14 May 2026** | **v0.2.4 Released**: ModelRouter with CostBased/LatencyBased/FirstAvailable policies, transparent provider failover, cross-process SQLite rate-limit coordination, persistent cost tracker + `effgen cost` dashboard CLI. [See changelog](CHANGELOG.md#024---2026-05-14) |
| **4 May 2026** | **v0.2.3 Released**: 5 new cloud backends (Groq, Together AI, Fireworks, Replicate, HuggingFace Inference), 9 providers total. Unified ProviderRegistry, `effgen doctor` auth check, backend parity matrix. [See changelog](CHANGELOG.md#023---2026-05-04) |
| **28 Apr 2026** | **v0.2.2 Released**: Gemini 3.x/2.5/2.0 registry, `thinking_budget`, Google Search grounding, Files API, Gemini native tools (GoogleSearch, UrlContext, CodeExecution). Anthropic Claude 4.7 registry, extended thinking, prompt caching (`cache_control`), streaming polish, experimental native tools. [See changelog](CHANGELOG.md#022---2026-04-28) |
| **25 Apr 2026** | **v0.2.1 Released**: Cerebras backend (4 free-tier models, streaming, native tool-calling, rate-limit coordinator, cost tracking) + OpenAI gpt-5/gpt-5.4-nano/o-series with `reasoning_effort`, prompt caching, structured outputs v2, and OpenAI native tools (web_search, code_interpreter, file_search). [See changelog](CHANGELOG.md#021---2026-04-25) |
| **9 Apr 2026** | **v0.2.0 Released**: Major release - native tool calling, guardrails, multi-agent orchestration, RAG pipeline, 31 tools, eval framework, production API server, MLX Apple Silicon support, Python & TypeScript SDKs. [See changelog](CHANGELOG.md#020---2026-04-09) |
| **8 Apr 2026** | **MLX & Apple Silicon support merged** (PR #4): Native Metal GPU acceleration via MLX & MLX-VLM backends, hardware detection, 5 Gradio GUI examples. `pip install effgen[mlx]` |
| **25 Mar 2026** | **v0.1.3 Released**: Verification hardening - smarter loop detection, "skip the tool" prompting, model-aware token counting, sub-agent depth limits, circuit breaker persistence. [See changelog](CHANGELOG.md#013---2026-03-25) |
| **12 Mar 2026** | **v0.1.2 Released**: Test-driven hardening - 10 example agents, 19 bug fixes, cross-model compatibility matrix (11 models, 73% pass rate). [See changelog](CHANGELOG.md#012---2026-03-12) |
| **6 Mar 2026** | **v0.1.1 Released**: Stabilization - fixed license/metadata consistency, improved error handling, added 6 examples, expanded test suite. [See changelog](CHANGELOG.md#011---2026-03-06) |
| **1 Mar 2026** | **v0.1.0 Released**: Major feature release - 14 built-in tools, agent presets, plugin system, real streaming, memory integration, ACP/MCP protocols, CI/CD, and comprehensive test suite. [See changelog](CHANGELOG.md#010---2026-03-01) |
| **3 Feb 2026** | **v0.0.2 Released**: vLLM backend fixes with automatic chat template support, GPU memory control, improved OOM error handling, and multi-model family compatibility |
| **2 Feb 2026** | Preprint available: [EffGen: Enabling Small Language Models as Capable Autonomous Agents](https://arxiv.org/abs/2602.00887) |
| **31 Jan 2026** | Initial release of effGen framework **(v0.0.1)** |
---
## What is effGen?
**effGen** turns Small Language Models (SLMs) into capable AI agents. While most agent frameworks assume massive LLMs, effGen is **optimized from the ground up** for smaller, efficient models, so agents stay fast and capable without the compute overhead.
```python
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator, PythonREPL
# Load a small but mighty model
model = load_model("Qwen/Qwen2.5-1.5B-Instruct", quantization="4bit")
# Create agent with tools
config = AgentConfig(
    name="math_agent",
    model=model,
    tools=[Calculator(), PythonREPL()]
)
agent = Agent(config=config)
# Run computation
result = agent.run("What is 24344 * 334?")
print(f"Answer: {result.output}")
```
---
## Installation
> **Requires Python 3.10 or newer.** Tested on Python 3.10, 3.11, 3.12, 3.13, 3.14.
### From PyPI (Recommended)
```bash
pip install effgen
```
### Apple Silicon (MLX, Recommended for Mac)
```bash
pip install "effgen[mlx]"       # Text models on Apple Silicon (quotes keep zsh from globbing the brackets)
pip install "effgen[mlx-vlm]"   # Vision-Language models on Apple Silicon
```
### With vLLM for Faster Inference
```bash
pip install "effgen[vllm]"
```
### Everything in one shot
```bash
pip install "effgen[all]" # installs vLLM + RAG + vector-DB + search + cloud-secrets + monitoring + ...
```
### Optional: flash-attn (NVIDIA GPUs only, 2 steps)
> `flash-attn` is **not** in `[all]` on purpose: its own `setup.py` imports
> `torch` before pip's isolated build environment has torch installed (a
> well-known upstream bug), so bundling it would break `pip install effgen[all]`
> for everyone. Install it in two steps instead:
```bash
pip install "effgen[all]" # step 1: gets torch + the rest
pip install flash-attn --no-build-isolation # step 2: reuses the torch from step 1
```
See [docs/installation.md](docs/installation.md) for the full guide.
### From Source
```bash
git clone https://github.com/ctrl-gaurav/effGen.git
cd effGen
# Quick install
./install.sh
# Full install (includes vLLM + dev tools)
./install.sh --full
# Manual install
pip install -e .
```
---
## Quick Start
### CLI Usage
```bash
# Run a task
effgen run "What is the capital of France?"
# Interactive chat
effgen chat
# Start API server
effgen serve --port 8000
# List available presets
effgen presets
# Check infrastructure health
effgen health
# Interactive wizard
effgen
```
### Python API
```python
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator
# Load model
model = load_model("Qwen/Qwen2.5-1.5B-Instruct", quantization="4bit")
# Configure agent
config = AgentConfig(
    name="calculator_agent",
    model=model,
    tools=[Calculator()],
    system_prompt="You are a helpful math assistant."
)
# Create and run
agent = Agent(config=config)
result = agent.run("Calculate 15% tip on $85.50")
print(result.output)
```
### Apple Silicon (MLX)
```python
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator
# Load MLX model: native Metal GPU, unified memory, no CPU-GPU transfer
model = load_model("LiquidAI/LFM2.5-1.2B-Instruct-MLX-8bit", engine="mlx")
config = AgentConfig(
    name="mlx_agent",
    model=model,
    tools=[Calculator()],
)
agent = Agent(config=config)
result = agent.run("What is sqrt(144) + 2^10?")
print(result.output)
```
---
## Features
| Feature | Details |
|---------|---------|
| SLM Optimized | Small models |
| Apple Silicon | MLX + Metal GPU |
| Guardrails | PII, injection, safety |
| RAG Pipeline | Ingest, search, cite |
| Multi-Agent | DAG workflows |
| Multimodal | image/audio/video |
| Production API | OpenAI-compat |
| Observability | metrics/traces/SLOs |
---
## What's New in v0.2.9
Observability & Reliability: production-ready telemetry in v0.2.9
**effGen v0.2.9** ships the full observability and reliability stack. All telemetry is asynchronous and non-blocking, so a failed export never fails inference.
**Structured JSON logging with secret redaction.** Every log line is a JSON object: `{ts, level, module, event, attributes, trace_id, span_id}`. The built-in `Redactor` strips OpenAI, Anthropic, Cerebras, Google, HF, Groq, Bearer, Slack, and Discord webhook patterns at the encoder, so secrets matching those patterns never reach a log file.
```python
from effgen.observability import get_logger
log = get_logger(__name__)
log.event("model.call.started", provider="cerebras", model="llama3.1-8b", cached_tokens=0)
# -> {"ts": "2026-05-23T...", "level": "INFO", "event": "model.call.started", ...}
```
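To make the "redact at the encoder" idea concrete, here is a minimal standard-library sketch. It is not effGen's `Redactor`: the two regex patterns, `JsonFormatter`, and `format_event` are hypothetical names chosen for illustration, and the real pattern set is broader.

```python
import json
import logging
import re

# Hypothetical patterns for illustration only; the real Redactor covers more providers.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{8,}"),       # OpenAI-style API key
    re.compile(r"Bearer\s+[A-Za-z0-9._-]+"),   # Bearer token
]


def redact(text: str) -> str:
    """Replace anything that matches a secret pattern with a fixed marker."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


class JsonFormatter(logging.Formatter):
    """Serialize each record to one JSON object, then redact the encoded string."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "module": record.name,
            "event": record.getMessage(),
            "attributes": getattr(record, "attributes", {}),
        }
        # Redacting after encoding means no field can bypass the filter.
        return redact(json.dumps(payload))


def format_event(event: str, **attributes) -> str:
    """Build a log record carrying structured attributes and format it."""
    record = logging.LogRecord("demo", logging.INFO, "demo.py", 0, event, None, None)
    record.attributes = attributes
    return JsonFormatter().format(record)


line = format_event("model.call.started", api_key="sk-abcdefgh12345678")
print(line)
```

Running the redaction on the encoded string, rather than on individual fields, is the design choice that keeps a newly added attribute from leaking by accident.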
**Prometheus histograms + SLO tracking.** `effgen_model_call_latency_seconds`, `effgen_tool_call_latency_seconds`, `effgen_agent_iteration_latency_seconds`, and `effgen_tokens_total` now expose histogram buckets at `/metrics`. `SLOTracker` maintains a rolling-window error budget and `burn_rate()` at `/slo`.
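As a rough illustration of what a rolling-window error budget and burn rate mean, the sketch below uses the common definition "observed error rate divided by allowed error rate". `RollingSLO` is a hypothetical class, not effGen's `SLOTracker`, and it assumes a window counted in calls rather than in seconds.

```python
from collections import deque


class RollingSLO:
    """Hypothetical rolling-window SLO tracker (illustrative, not effGen's SLOTracker)."""

    def __init__(self, target: float, window: int):
        self.target = target                   # e.g. 0.99 means 99% of calls must succeed
        self.outcomes = deque(maxlen=window)   # True = success, False = failure

    def record(self, ok: bool) -> None:
        # The deque drops the oldest outcome once the window is full.
        self.outcomes.append(ok)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def burn_rate(self) -> float:
        """Observed error rate divided by the allowed error rate (1 - target)."""
        return self.error_rate() / (1.0 - self.target)


slo = RollingSLO(target=0.9, window=10)
for ok in [True] * 8 + [False] * 2:
    slo.record(ok)
print(slo.error_rate(), slo.burn_rate())
```

A burn rate above 1 means the error budget is being spent faster than the target allows, which is the usual trigger for an alert.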
**Configurable OTel samplers + canonical span spec.** Choose `AlwaysOn`, `AlwaysOff`, `TraceIdRatio(p)`, or `RateLimited(per_second)` in config. `effgen/observability/spans.py` is the single source of truth for every span attribute name, replacing string literals that were previously scattered across adapters.
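For readers unfamiliar with ratio-based sampling, the sketch below shows the usual idea: compare the low 64 bits of the trace ID against a threshold, so every service makes the same decision for the same trace. `should_sample` is a hypothetical helper written for this note; how effGen's `TraceIdRatio(p)` is implemented internally is not stated here.

```python
import random

TRACE_ID_MASK = (1 << 64) - 1  # use only the low 64 bits of a 128-bit trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministically keep roughly `ratio` of all traces."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (TRACE_ID_MASK + 1))
    return (trace_id & TRACE_ID_MASK) < bound


rng = random.Random(0)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(should_sample(t, 0.25) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID, a trace is either sampled everywhere or nowhere, which avoids partial traces.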
**Reliability primitives.** Four layers now protect every adapter call:
| Primitive | Class | What it does |
|-----------|-------|-------------|
| Timeouts | `ReliabilityConfig` | `model_call=60s`, `tool_call=30s`, `http=20s`; set explicitly on every httpx client |
| Retries | `@retryable(Retry(...))` | Jittered exponential backoff for 5xx / 429 / network errors; emits OTel events |
| Circuit breaker | `CircuitBreaker` | CLOSED → OPEN → HALF_OPEN per provider; isolates misbehaving backends |
| Bulkhead | `Bulkhead` | Per-provider concurrency + queue limit; prevents provider starvation |
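The circuit-breaker row in the table above can be pictured as a small state machine. This is a minimal sketch, assuming a consecutive-failure threshold and a fixed cool-down; `SimpleBreaker` and its parameters are hypothetical and do not mirror effGen's `CircuitBreaker` API.

```python
import time


class SimpleBreaker:
    """Hypothetical three-state breaker: CLOSED -> OPEN -> HALF_OPEN."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after   # seconds to wait before probing again
        self.clock = clock               # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None
        self.state = "CLOSED"

    def allow(self) -> bool:
        """Return True if a call may go through right now."""
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.reset_after:
            self.state = "HALF_OPEN"     # cool-down elapsed: let probe calls through
        return self.state != "OPEN"

    def record_success(self) -> None:
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self) -> None:
        self.failures += 1
        # A failed probe, or too many consecutive failures, (re)opens the circuit.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()


breaker = SimpleBreaker(failure_threshold=2, reset_after=10.0)
breaker.record_failure()
breaker.record_failure()
print(breaker.state, breaker.allow())
```

Keeping one breaker per provider is what isolates a misbehaving backend: calls to it are rejected immediately while the others continue to serve traffic.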
**Deterministic chaos harness.** Inject `NetworkTimeout`, `Http5xx`, `Http429`, `SlowResponse`, `PartialResponse`, or `MalformedJSON` faults with `Chaos(seed)`. Four canonical scenarios (fallback on 5xx, Retry-After honoured, timeout fires cleanly, AllProvidersFailed) all pass deterministically across 10 seeds.
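What makes a chaos run deterministic is that every fault decision comes from a seeded random generator. The sketch below illustrates that property only; `fault_schedule` and its `fault_rate` parameter are hypothetical and are not part of effGen's `Chaos` API.

```python
import random

# Fault names borrowed from the list above; the schedule logic is illustrative.
FAULTS = ["NetworkTimeout", "Http5xx", "Http429", "SlowResponse"]


def fault_schedule(seed: int, calls: int, fault_rate: float = 0.3) -> list:
    """Return, for each call, the fault to inject or None for a healthy call."""
    rng = random.Random(seed)  # private generator: global random state is untouched
    schedule = []
    for _ in range(calls):
        if rng.random() < fault_rate:
            schedule.append(rng.choice(FAULTS))
        else:
            schedule.append(None)
    return schedule


# The same seed always yields the same schedule, so a failing run can be replayed.
print(fault_schedule(seed=7, calls=10) == fault_schedule(seed=7, calls=10))
```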
**Fuzz suite.** Hypothesis runs 500 examples against all 66 `BaseTool` subclasses, random `ContentPart` message sequences, and the router's provider-availability logic. No unhandled exceptions, no secret leaks.
**Load-testing CLI + Alertmanager rules.**
```bash
# Run a 30-second load test (JSON report prints to stdout by default)
effgen loadtest --concurrency 10 --duration 30 --scenario fixed
# Or write the report to a file with --output
effgen loadtest --concurrency 10 --duration 30 --output report.json
# Integrate with Alertmanager
cp docs/observability/alert_rules.yaml /etc/prometheus/rules/effgen.yaml
```
See [docs/observability/overview.md](docs/observability/overview.md) for full setup, [docs/observability/metrics.md](docs/observability/metrics.md) for all metric definitions, and [docs/observability/alerting.md](docs/observability/alerting.md) for Alertmanager integration.
## What's New in v0.2.8
First-class multimodal in v0.2.8: image, audio & video across 6 providers
**effGen v0.2.8** makes multimodal input a first-class citizen. Send images, audio clips, and short video to any vision-capable provider through a unified `Message` schema; the adapter handles the translation, not your code.
**Image input**: Gemini, OpenAI gpt-4o, Groq, Anthropic (code-only), Together, HF. Automatic resize/MIME validation via `image_pre.py`. Raises `CapabilityNotSupportedError` cleanly when the provider doesn't support vision.
**Audio input**: Gemini native inline audio, OpenAI Whisper transcription + gpt-4o audio, HF Inference ASR. Auto-downsamples to 16 kHz mono; chunks files that exceed the provider's maximum duration. Anthropic raises `CapabilityNotSupportedError`.
**Video input**: native video for providers that accept raw video (Gemini); frame-sampling fallback (ffmpeg) for all others. Raises `MissingSystemDependency` with install hints when ffmpeg is absent.
**Unified message schema**: `TextPart`, `ImagePart`, `AudioPart`, `VideoPart` form a typed `ContentPart` union. `Message.content` is always a `List[ContentPart]`; the backwards-compatible string constructor still works.
**`multimodal` preset**: `create_agent("multimodal", model)` wires Gemini Flash-Lite (primary) + OpenAI gpt-4o-mini (fallback) with `ImageInfo`, `ImageCaption`, `OCR`, `AudioTranscribe`, `MultimodalDescribeTool`, and the full tool suite.
**5 cookbook walkthroughs**: image Q&A, audio transcribe + reason, video summarize, OCR + LLM structured extraction, chart reading from an image. All in `docs/cookbook/`.
```python
from effgen import image_from, audio_from, video_from
from effgen.core.messages import Message, Role
from effgen.presets import create_agent
from effgen import load_model
model = load_model("gemini-2.0-flash", provider="gemini")
agent = create_agent("multimodal", model)
# Image question
img = image_from("https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/240px-PNG_transparency_demonstration_1.png")
msg = Message(role=Role.USER, content=[img, "What is in this image?"])
result = agent.run_message(msg)
print(result.output)
# Audio transcription
aud = audio_from("/tmp/clip.mp3")
msg = Message(role=Role.USER, content=[aud, "Transcribe and summarize."])
result = agent.run_message(msg)
```
```bash
# Multimodal preset
effgen run --preset multimodal "Describe this image" --image /tmp/photo.jpg
# Check capability
python -c "from effgen.models.capabilities import Capability; print(Capability.vision)"
```
See [docs/multimodal/overview.md](docs/multimodal/overview.md) for the full architecture and [docs/cookbook/README.md](docs/cookbook/README.md) for the cookbook index.
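The unified message schema described in the v0.2.8 notes can be pictured with plain dataclasses. This is a sketch under stated assumptions: the class names reuse those from the notes, but the fields (`text`, `url`, `mime_type`) and the normalization logic are guesses for illustration, not effGen's actual definitions.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class TextPart:
    text: str


@dataclass(frozen=True)
class ImagePart:
    url: str
    mime_type: str = "image/png"   # assumed default, for illustration


# A typed union: every element of Message.content is one of these parts.
ContentPart = Union[TextPart, ImagePart]


@dataclass
class Message:
    role: str
    content: List[ContentPart] = field(default_factory=list)

    def __post_init__(self):
        # Backwards-compatible path: a bare string becomes a single TextPart,
        # and strings inside a list are wrapped so content is always typed parts.
        if isinstance(self.content, str):
            self.content = [TextPart(self.content)]
        else:
            self.content = [TextPart(p) if isinstance(p, str) else p for p in self.content]


legacy = Message("user", "hello")
mixed = Message("user", [ImagePart("https://example.com/cat.png"), "What is in this image?"])
print(legacy.content)
print(mixed.content)
```

Normalizing at construction time means adapters only ever see a list of typed parts, which is what lets each provider adapter do its own translation.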
### 31 prompt templates in v0.2.7: Prompt Library, Eval Harness & Interactive Playground
**effGen v0.2.7** adds a curated, domain-organized **Prompt Library** with 31 reusable templates across 7 domains, paired with a golden evaluation harness and an interactive playground CLI. See the [full gallery](docs/prompts/gallery.md).
**Research**: literature review (zero-shot + CoT), paper summary, citation extraction, methodology critique.
**Coding**: code review, bug diagnosis, refactoring plan, test generation, docstring fill.
**Data / SQL**: NL-to-SQL with warnings, SQL explain, SQL optimize, data profile, ETL plan.
**Legal**: contract summary, clause classify, research brief. All templates include a mandatory legal disclaimer.
**Medical**: symptom triage, drug interaction, medical literature synthesis. All templates include a mandatory medical disclaimer.
**Creative**: story continuation (zero-shot + few-shot), poetry forms, character bio, world building.
**Business**: meeting summary, email draft (formal/casual), OKR generation, SWOT analysis, elevator pitch.
```bash
# Discover and browse
effgen prompts list
effgen prompts list --domain research
effgen prompts list --format markdown
# Inspect and evaluate
effgen prompts show research.literature_review.v1.cot
effgen prompts eval
effgen prompts eval --domain coding --live --model llama3.1-8b
# Interactive playground
effgen prompts playground
```
```python
from effgen.prompts.library import registry
p = registry.get("data.sql_from_nl.v1")
sql_prompt = p.template(
    schema_ddl="CREATE TABLE orders (id INT, customer TEXT, total FLOAT, created_at DATE)",
    question="Total revenue per customer this month",
    dialect="postgresql",
)
```
See [docs/prompts/gallery.md](docs/prompts/gallery.md) for the full template catalog and [docs/prompts/library.md](docs/prompts/library.md) for the framework overview.
### 14 new tools in v0.2.6: OCR, Audio, Images, Documents, Geo/Weather & Communications
**effGen v0.2.6** adds 14 new built-in tools across document, media, and communication categories, bringing the total to **58+**. Two new presets (`media`, `notify`) are also introduced.
1. **OCR**: `OCRTool` (Tesseract local + OCR.space fallback; `OCRBackendUnavailable` raised with install instructions).
```python
from effgen.tools.builtin.ocr import OCRTool
result = OCRTool().execute({"operation": "extract", "image_path": "/tmp/scan.png"})
print(result["data"]["text"])
```
2. **Audio Transcription**: `AudioTranscribeTool` (faster-whisper local; HF Inference fallback; GPU auto-detected).
```python
from effgen.tools.builtin.audio_transcribe import AudioTranscribeTool
result = AudioTranscribeTool().execute({"operation": "transcribe", "audio_path": "/tmp/clip.mp3"})
```
3. **Image Analysis**: `ImageInfoTool` (Pillow metadata, zero network) + `ImageCaptionTool` (vision-capable model router).
4. **Document Parsing**: `PDFTool` (pypdf + pdfplumber), `DOCXTool` (python-docx), `ExcelTool` (openpyxl + pandas). All added to `research` and `general` presets.
```python
from effgen.tools.builtin.pdf import PDFTool
result = PDFTool().execute({"operation": "text", "path": "/tmp/paper.pdf"})
```
5. **Geo / Weather**: `WeatherTool` (Open-Meteo, free, no auth), `GeocodeTool` (Nominatim/OSM, 1 req/s), `MapsTool` (staticmap PNG renderer).
```python
from effgen.tools.builtin.geocode import GeocodeTool
result = GeocodeTool().execute({"operation": "geocode", "address": "San Francisco, CA"})
```
6. **Email & Webhooks**: `EmailSMTPTool`, `EmailIMAPTool`, `SlackWebhookTool`, `DiscordWebhookTool`. All in the new `notify` preset. Webhook URLs are redacted in logs.
```python
from effgen.tools.builtin.slack_webhook import SlackWebhookTool
result = SlackWebhookTool().execute({"operation": "post", "text": "Deploy complete!"})
```
See the [full tool gallery](docs/tools/gallery.md) for quickstart snippets for all 58+ tools.
### 13 new free tools in v0.2.5: Research, News, YouTube, Social, Translation & QR
**effGen v0.2.5** adds 13 free, no-auth-required tools, bringing the built-in tool count above 44. All tools integrate with the `research` and `general` presets.
1. **Academic Research**: `PubMedTool` (NCBI, 3 ops, built-in rate limiting), `ArXivTool` (Atom feed + PDF download), `SemanticScholarTool` (search + citations + references).
```python
from effgen.tools.builtin.arxiv import ArXivTool
tool = ArXivTool()
result = tool.execute({"operation": "search", "query": "transformer attention", "max_results": 5})
```
2. **News & RSS**: `RSSFeedTool` (any RSS/Atom feed), `NewsTool` (BBC, Reuters, HN, NPR, etc. + optional NewsAPI.org key).
```python
from effgen.tools.builtin.news import NewsTool
result = NewsTool().execute({"operation": "top_headlines", "category": "technology"})
```
3. **YouTube**: `YouTubeTranscriptTool` (captions without Google API key), `YouTubeMetadataTool` (via yt-dlp, public content only).
4. **Social Media**: `RedditTool` (public JSON, no OAuth), `HackerNewsTool` (Firebase API, no auth).
5. **Translation & Language Detection**: `TranslateTool` (LibreTranslate + offline argostranslate fallback), `LanguageDetectTool` (55+ languages, fully offline).
6. **QR Codes**: `QRGenerateTool` (generate locally), `QRReadTool` (decode from image, with OpenCV fallback if zbar is unavailable).
See the [full tool gallery](docs/tools/gallery.md) for quickstart snippets for all 58+ tools.
### Top 5 features from v0.2.4: ModelRouter & Cost Optimizer
1. **`PolicyBasedRouter`**: composable routing engine with three built-in policies. Pick the cheapest provider within your budget, the fastest under your SLA, or simply the first available, and combine them freely.
```python
from effgen import PolicyBasedRouter, RoutingContext, CostBasedPolicy, LatencyBasedPolicy
from effgen.models.capabilities import Capability
router = PolicyBasedRouter(policies=[LatencyBasedPolicy(), CostBasedPolicy()])
ctx = RoutingContext(
    prompt_tokens_estimate=500,
    user_budget_usd=0.01,
    latency_budget_ms=3000,
    required_capabilities={Capability.chat},
)
decision = router.route(ctx)
print(decision.chosen)      # e.g., ProviderModelPair("cerebras", "llama3.1-8b")
print(decision.eliminated)  # [(pair, reason), ...], so every exclusion is explainable
```
2. **Transparent failover**: `route_and_execute(ctx, fn)` retries on rate limits, 5xx responses, and timeouts, then moves to the next-best provider. Each hop fires a `RouterEvent` to registered subscribers.
```python
from effgen import load_model
def call_provider(pair):
    model = load_model(pair.model_id, provider=pair.provider)
    return model.generate("Hello!").text

router.subscribe(
    lambda event: print(
        f"Failover: {event.from_provider}/{event.from_model} "
        f"-> {event.to_provider}/{event.to_model}"
    )
)
result = router.route_and_execute(ctx, call_provider)
```
3. **Cross-process SQLite rate-limit coordination.** Share a single rate-limit budget across multiple workers:
```python
from effgen import RateLimitCoordinator, SQLiteRateLimitStore
store = SQLiteRateLimitStore("~/.effgen/rate_limits.sqlite")
coordinator = RateLimitCoordinator(storage=store) # WAL-mode, BEGIN IMMEDIATE
```
4. **Persistent cost tracking + `effgen cost` CLI.** Every API call persists to SQLite, so spend can be queried instantly:
```bash
effgen cost today # per-provider per-model table
effgen cost week # rolling 7-day view
effgen cost by-provider # lifetime totals
effgen cost set-budget 1.0 # set $1/day cap (BudgetExceededError at 100%)
```
5. **Fully explainable decisions + budget guard.** `RouterDecision` records every eliminated provider and why (`"rate_limited"`, `"no_key"`, `"cost_exceeds_budget"`, `"latency_exceeds_sla"`). Configure a daily spend cap; the router automatically fails over to a free-tier provider when the budget is hit.
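The "explainable decision" idea from the list above can be sketched without effGen at all: filter candidates one rule at a time and record why each one was dropped. `Candidate`, `route`, and the provider names below are hypothetical; only the reason strings are taken from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    provider: str
    cost_usd: float      # estimated cost of this request
    latency_ms: int      # expected latency
    has_key: bool = True


def route(candidates, budget_usd, latency_budget_ms):
    """Return (chosen, eliminated), where eliminated lists (provider, reason) pairs."""
    eliminated = []
    survivors = []
    for c in candidates:
        if not c.has_key:
            eliminated.append((c.provider, "no_key"))
        elif c.cost_usd > budget_usd:
            eliminated.append((c.provider, "cost_exceeds_budget"))
        elif c.latency_ms > latency_budget_ms:
            eliminated.append((c.provider, "latency_exceeds_sla"))
        else:
            survivors.append(c)
    # Among the survivors, pick the cheapest; None when nothing qualifies.
    chosen = min(survivors, key=lambda c: c.cost_usd, default=None)
    return chosen, eliminated


candidates = [
    Candidate("paid-fast", cost_usd=0.05, latency_ms=200),
    Candidate("free-slow", cost_usd=0.0, latency_ms=5000),
    Candidate("cheap", cost_usd=0.002, latency_ms=800),
]
chosen, eliminated = route(candidates, budget_usd=0.01, latency_budget_ms=3000)
print(chosen.provider, eliminated)
```

Returning the eliminated list alongside the choice is what lets a caller answer "why not provider X?" after the fact.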
### Top 5 features from v0.2.3
1. **5 new cloud backends**: `GroqAdapter`, `TogetherAdapter`, `FireworksAdapter`, `ReplicateAdapter`, `HFInferenceAdapter`, each with streaming, native tools, rate-limit coordination, and cost tracking. 9 providers total.
```python
model = load_model("llama-3.1-8b-instant", provider="groq")
model = load_model("Qwen/Qwen2.5-72B-Instruct", provider="hf")
```
2. **Unified ProviderRegistry**: `list_providers()`, `list_models(provider)`, `lookup(model_id)` consolidated across all 9 adapters. `AmbiguousModelError` on bare IDs shared across providers.
3. **`effgen doctor`**: new CLI command showing which providers have API keys configured.
4. **Backend parity matrix**: canonical agentic task ("(17 × 23) + sqrt(144) = 403") runs identically across all providers; streaming and error surfaces verified uniform. See `docs/providers/parity.md`.
5. **HuggingFace Router support**: `HFInferenceAdapter` with 124-model dynamic catalog, `refresh_models()` + `check_drift()`, `ModelUnavailableError` with `suggest_alternatives()`, and custom Inference Endpoint URL.
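Why a bare model ID can be ambiguous is easiest to see with a toy registry. The sketch below is hypothetical throughout: `CATALOG`, `lookup`, and `AmbiguousModel` are stand-ins written for this note, and `"shared-model"` is a made-up ID, not a real catalog entry.

```python
class AmbiguousModel(LookupError):
    """Hypothetical stand-in for an 'ambiguous bare model ID' error."""


# Made-up catalog for illustration; "shared-model" is not a real model ID.
CATALOG = {
    "groq": {"llama-3.1-8b-instant"},
    "hf": {"Qwen/Qwen2.5-72B-Instruct", "shared-model"},
    "together": {"shared-model"},
}


def lookup(model_id, provider=None):
    """Resolve a model ID to the provider that serves it."""
    if provider is not None:
        if model_id not in CATALOG.get(provider, set()):
            raise KeyError(f"{provider} does not serve {model_id}")
        return provider
    owners = sorted(p for p, models in CATALOG.items() if model_id in models)
    if not owners:
        raise KeyError(model_id)
    if len(owners) > 1:
        # Refuse to guess: the caller must disambiguate with provider=...
        raise AmbiguousModel(f"{model_id} is served by {', '.join(owners)}; pass provider=...")
    return owners[0]


print(lookup("llama-3.1-8b-instant"))
print(lookup("shared-model", provider="together"))
```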
### Top 5 features from v0.2.2 (and earlier)
1. **Gemini 3.x/2.5/2.0 + Gemma families**: full model registry with correct context windows, output limits, and feature flags; SDK migrated to `google-genai>=1.0.0`.
2. **Gemini `thinking_budget`**: activate Gemini's internal reasoning with `GenerationConfig(thinking_budget=8192, include_thoughts=True)`; the thinking trace surfaces in `ModelResponse.metadata["thinking"]`.
3. **Gemini grounding + Files API**: `GenerationConfig(grounding=True)` injects Google Search; `upload_file(path)` passes PDFs/images to the model with a 2 GiB guard.
4. **Gemini native tools**: `GoogleSearchTool`, `GeminiUrlContextTool`, `GeminiCodeExecutionTool` activate server-side Gemini capabilities in any Agent. Parallel function calls are handled automatically.
5. **Anthropic Claude 4.7, extended thinking, prompt caching**: full Claude 4.x registry; `GenerationConfig.thinking` for extended reasoning; `mark_cached()` + `AgentConfig.cache_system_prompt/cache_tools` for `cache_control`; cache tokens surfaced in usage.
### Top 5 features from v0.2.1
1. **Cerebras backend**: 4 free-tier models (`llama3.1-8b`, `qwen-3-235b-a22b-instruct-2507`, `gpt-oss-120b`, `zai-glm-4.7`) with streaming, native function-calling, automatic RPM/TPM/RPD/TPD rate-limit coordination, and per-call cost tracking. `pip install effgen[cerebras]` and set `CEREBRAS_API_KEY`.
```python
from effgen import load_model
model = load_model("llama3.1-8b", provider="cerebras")
```
2. **OpenAI gpt-5 / gpt-5.4-nano / o-series reasoning models**: full registry coverage with `reasoning_effort` (`minimal`/`low`/`medium`/`high`) and `max_reasoning_tokens` on `GenerationConfig`. Reasoning payloads are routed only to reasoning-capable models.
3. **OpenAI prompt caching surfacing**: `cached_input_tokens` exposed on `ModelResponse.usage`; `AgentConfig.stable_system_prompt=True` keeps the system prompt anchored at position 0 to maximize OpenAI's automatic ≥1024-token prefix cache hit rate.
4. **Structured outputs v2**: `OpenAIAdapter.generate_structured()` with strict JSON Schema; `to_openai_schema(pydantic_model)` inlines `$ref`s and forces `additionalProperties: false`; refusals raise `ModelRefusalError`.
5. **OpenAI native tools**: `OpenAIWebSearchTool`, `OpenAICodeInterpreterTool`, `OpenAIFileSearchTool` route through OpenAI's Responses API and compose with effGen's local tools in the same agent. `ToolIncompatibleError` fires at Agent init when paired with a non-OpenAI model.
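The schema transformation mentioned under structured outputs (inline `$ref`s, force `additionalProperties: false`) can be sketched on a plain dict. `inline_refs` is a hypothetical helper, not effGen's `to_openai_schema`; it assumes local `#/$defs/...` references, a non-recursive schema, and no properties literally named `$ref` or `$defs`.

```python
import copy


def inline_refs(schema: dict) -> dict:
    """Return a new schema with local $defs references inlined and objects closed."""
    defs = schema.get("$defs", {})

    def walk(node):
        if isinstance(node, list):
            return [walk(item) for item in node]
        if not isinstance(node, dict):
            return node
        if "$ref" in node:
            # "#/$defs/Address" -> "Address"; substitute the definition in place.
            name = node["$ref"].rsplit("/", 1)[-1]
            return walk(copy.deepcopy(defs[name]))
        out = {key: walk(value) for key, value in node.items() if key != "$defs"}
        if out.get("type") == "object":
            out["additionalProperties"] = False  # strict mode: reject unknown keys
        return out

    return walk(schema)


schema = {
    "type": "object",
    "properties": {"address": {"$ref": "#/$defs/Address"}},
    "$defs": {
        "Address": {"type": "object", "properties": {"city": {"type": "string"}}}
    },
}
print(inline_refs(schema))
```

The function builds new dicts rather than mutating its input, so the original schema can still be used for local validation.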
### Top 5 features from v0.2.0
1. **Native Tool Calling**: Qwen, Llama, Mistral models use built-in function calling instead of text parsing. Set `tool_calling_mode="native"` or `"hybrid"`. Structured JSON/Pydantic output validation included.
2. **Guardrails & Safety**: PII detection, prompt injection blocking, toxicity filtering, tool permissions. One-liner: `get_guardrail_preset("strict")`.
3. **Production RAG Pipeline**: Ingest PDF/DOCX/HTML/Markdown, semantic+BM25 hybrid search, reranking, inline citations. `create_agent("rag", model, knowledge_base="./docs/")`.
4. **Production API Server**: OpenAI-compatible `/v1/chat/completions`, request queuing, agent pooling, multi-tenancy, API keys. Drop-in OpenAI replacement with local SLMs.
5. **Apple Silicon Native**: MLX & MLX-VLM backends for M1/M2/M3/M4. Metal GPU acceleration, unified memory. `pip install effgen[mlx]`.
---
## Agent Presets
Get started instantly with ready-to-use agent configurations:
```python
from effgen import load_model
from effgen.presets import create_agent
model = load_model("Qwen/Qwen2.5-3B-Instruct", quantization="4bit")
# One-line agent creation
math_agent = create_agent("math", model) # Calculator + PythonREPL
research_agent = create_agent("research", model) # WebSearch + URLFetch + Wikipedia
coding_agent = create_agent("coding", model) # CodeExecutor + PythonREPL + FileOps + Bash
general_agent = create_agent("general", model) # All tools
rag_agent = create_agent("rag", model, knowledge_base="./docs/") # RAG pipeline
minimal_agent = create_agent("minimal", model) # Direct inference, no tools
```
```bash
# CLI preset support
effgen run --preset math "What is sqrt(144)?"
effgen run --preset research "Tell me about quantum computing"
```
---
## Built-in Tools (58+)
| Tool | Description |
|------|-------------|
| Calculator | Math & Units |
| WebSearch | DuckDuckGo |
| CodeExecutor | Sandboxed |
| PythonREPL | Interactive |
| FileOps | Read/Write |
| Retrieval | RAG + BM25 |
| AgenticSearch | ripgrep |
| BashTool | Shell Cmds |
| WeatherTool | Open-Meteo |
| JSONTool | Query/Validate |
| DateTimeTool | Timezones |
| TextProcessing | Regex/Count |
| URLFetch | Web Scrape |
| Wikipedia | Free API |
| PubMed | NCBI / Free |
| ArXiv | Papers + PDF |
| SemanticScholar | Citations |
| RSSFeed | Any Feed |
| News | BBC/Reuters/HN |
| YouTubeTranscript | No API key |
| YouTubeMetadata | yt-dlp |
| Reddit | Public JSON |
| HackerNews | Firebase API |
| Translate | LibreTranslate |
| LanguageDetect | Offline / 55+ |
| QRGenerate | Local / No net |
| QRRead | Local Decode |
| +more | Finance, DevOps |
---
## Prompt Library (New in v0.2.7)
effGen ships a curated catalog of **31 reusable prompt templates** across 7 domains, each with a golden evaluation test and CLI access. Browse the [full gallery](docs/prompts/gallery.md).
| Domain | Templates | Variants |
|--------|-----------|----------|
| Research | 5 | zero-shot, CoT, structured, tool-augmented |
| Coding | 5 | zero-shot, CoT, structured, few-shot, tool-augmented |
| Data / SQL | 5 | zero-shot, CoT, structured, few-shot, tool-augmented |
| Legal | 3 | zero-shot, structured, tool-augmented |
| Medical | 3 | structured, tool-augmented |
| Creative | 5 | zero-shot, CoT, structured, few-shot |
| Business | 5 | zero-shot, CoT, structured, few-shot |
```bash
effgen prompts list # browse all 31 templates
effgen prompts show research.paper_summary.v1 # inspect a template
effgen prompts eval # run golden eval (no model needed)
effgen prompts playground # interactive REPL
```
```python
from effgen.prompts.library import registry
# Get and render a template
p = registry.get("coding.code_review.v1")
prompt = p.template(code="def add(a, b): return a + b", language="python")
# Search templates
cot_prompts = registry.search(variant="cot")
sql_prompts = registry.search(domain="data")
```
> Legal and medical templates enforce a mandatory non-advice disclaimer in every rendered output, verified by unit tests.
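One simple way to guarantee that a disclaimer appears in every rendered prompt is to append it inside the render step itself, so callers cannot forget it. The sketch below uses only `string.Template`; `DisclaimedTemplate` and the disclaimer wording are hypothetical and do not reflect how effGen's prompt library is implemented.

```python
from string import Template

# Hypothetical wording, for illustration only.
DISCLAIMER = "This is general information, not legal advice."


class DisclaimedTemplate:
    """Prompt template whose rendered output always ends with a disclaimer."""

    def __init__(self, body: str, disclaimer: str):
        self._template = Template(body)
        self._disclaimer = disclaimer

    def render(self, **fields) -> str:
        # substitute() raises KeyError if a placeholder is left unfilled.
        text = self._template.substitute(**fields)
        return f"{text}\n\n{self._disclaimer}"


template = DisclaimedTemplate("Summarize this contract: $contract", DISCLAIMER)
print(template.render(contract="Lease agreement"))
```

A unit test for such a template only needs to assert that the rendered string ends with the disclaimer, which mirrors the "verified by unit tests" approach mentioned above.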
---
## Examples
### GUI Applications (Gradio)
```bash
# Visual agent & tool development
python examples/basic/chat_gui_mlx.py # MLX Chat: streaming chat with Apple Silicon models (port 7860)
python examples/basic/agent_viz_mlx.py # Agent Visualizer: step-by-step reasoning + code editor (port 7860)
python examples/basic/tool_builder_gui.py # Tool Builder: visually create custom tools (port 7863)
python examples/basic/tool_tester_gui.py # Tool Tester: browse, test, inspect all 58+ tools (port 7864)
```
### Apple Silicon (MLX)
```bash
python examples/basic/basic_agent_mlx.py # Basic MLX agent with calculator
python examples/basic/chat_gui_mlx.py --autoload # Chat GUI with auto model loading
python examples/basic/agent_viz_mlx.py --autoload # Agent visualizer with auto model loading
```
### Core Agent Examples
```bash
python examples/basic/qa_agent.py # Q&A agent (no tools)
python examples/basic/calculator_agent.py # Math with Calculator + PythonREPL
python examples/tools/advanced_multi_tool_agent.py # 5 tools + fallback chains
python examples/tools/file_operations_agent.py # File read/write/search
python examples/tools/coding_agent.py # Code execution + iteration
python examples/advanced/conversational_agent.py # Multi-turn memory
python examples/advanced/advanced_streaming_agent.py # Token streaming with callbacks
python examples/advanced/data_processing_agent.py # JSON & data pipelines
python examples/advanced/multi_agent_pipeline.py # Multi-agent orchestration
python examples/advanced/error_recovery_agent.py # Error handling patterns
```
### Quick-Start Examples
```bash
python examples/basic/basic_agent.py # Basic agent (Transformers)
python examples/basic/basic_agent_vllm.py # Basic agent (vLLM - 5-10x faster)
python examples/plugins_presets/preset_agents.py # Ready-to-use agent presets
python examples/web_retrieval/streaming_agent.py # Simple streaming
python examples/web_retrieval/memory_agent.py # Simple multi-turn memory
python examples/tools/multi_tool_agent.py # Simple multi-tool
python examples/web_retrieval/weather_agent.py # Weather via Open-Meteo (free)
python examples/plugins_presets/plugin_example.py # Custom tool plugins
python examples/web_retrieval/web_agent.py # Web search agent
python examples/web_retrieval/retrieval_agent.py # RAG-based retrieval
```
> See [examples/utils/compatibility_matrix.md](examples/utils/compatibility_matrix.md) for model compatibility across all agents.
More Examples
### Multi-Tool Agent
```python
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator, WebSearch, PythonREPL

model = load_model("Qwen/Qwen2.5-3B-Instruct")
config = AgentConfig(
    name="research_agent",
    model=model,
    tools=[Calculator(), WebSearch(), PythonREPL()],
    system_prompt="You are a research assistant.",
)
agent = Agent(config=config)
result = agent.run("Search for the population of Tokyo and calculate what percentage it is of Japan's total population")
```
### Streaming
```python
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator

model = load_model("Qwen/Qwen2.5-3B-Instruct", quantization="4bit")
agent = Agent(config=AgentConfig(
    name="stream_demo", model=model,
    tools=[Calculator()], enable_streaming=True,
))
# Print tokens as they arrive instead of waiting for the full response
for token in agent.stream("What is 2 + 2?"):
    print(token, end="", flush=True)
```
### Memory (Multi-Turn)
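```python
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig

model = load_model("Qwen/Qwen2.5-3B-Instruct")
agent = Agent(config=AgentConfig(
    name="memory_demo", model=model,
    tools=[], enable_memory=True,
))
# The second call can refer back to facts stated in the first
agent.run("My name is Alice and I'm working on quantum computing.")
result = agent.run("What's my name and what am I working on?")
# → "Your name is Alice and you're working on quantum computing."
```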
### Retrieval Agent (RAG)
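```python
from effgen import Agent, load_model
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Retrieval

model = load_model("Qwen/Qwen2.5-3B-Instruct")
# Index the local ./docs directory as the knowledge base
retrieval_tool = Retrieval(knowledge_base_path="./docs")
config = AgentConfig(name="qa_agent", model=model, tools=[retrieval_tool])
agent = Agent(config=config)
result = agent.run("What does the documentation say about configuration?")
```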
---
## Multi-Model Support
effGen supports **9 cloud inference providers** and 4 local backends, tested across 11+ model families:
| Backend | Platform | Install | Best For |
|---------|----------|---------|----------|
| **MLX** | Apple Silicon (M1/M2/M3/M4) | `effgen[mlx]` | Native Metal GPU, unified memory, 4/8-bit quantization |
| **MLX-VLM** | Apple Silicon | `effgen[mlx-vlm]` | Vision-Language models (Qwen2-VL, LLaVA, Phi-3 Vision, 30+ architectures) |
| **vLLM** | NVIDIA GPU | `effgen[vllm]` | High-throughput batch inference |
| **Transformers** | Any (CPU/GPU) | *(bundled)* | Universal compatibility, local models |
| **OpenAI** | Cloud API | *(bundled)* | gpt-5/gpt-5.4/o-series, reasoning_effort, structured outputs, native tools |
| **Anthropic** | Cloud API | *(bundled)* | Claude 4.7/4.x, extended thinking, prompt caching, native tools |
| **Google Gemini** | Cloud API | *(bundled)* | Gemini 3.x/2.5/2.0, thinking_budget, grounding, Files API, native tools |
| **Cerebras** | Cloud API | `effgen[cerebras]` | 4 free-tier models (llama3.1-8b, qwen-3-235b), ultra-low latency |
| **Groq** | Cloud API | `effgen[groq]` | 16 models (llama-3.3-70b, mixtral, qwen3-32b), ultra-fast free-tier inference |
| **Together AI** | Cloud API | `effgen[together]` | 163-model catalog (llama, deepseek, qwen, mistral), per-model pricing |
| **Fireworks** | Cloud API | `effgen[fireworks]` | 80 chat models (54 tool-capable), serverless + dedicated |
| **Replicate** | Cloud API | `effgen[replicate]` | 38 models, async run-poll, SSE streaming, compute-second billing |
| **HuggingFace** | Cloud API | `effgen[hf]` | 124-model HF Router catalog, custom Inference Endpoints, free serverless tier |
### Provider Auth Check
```bash
# See which API keys are configured
effgen doctor
```
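For a quick check outside the CLI, the same idea can be sketched in a few lines. This is not what `effgen doctor` does internally; the environment-variable names below are assumptions based on common provider conventions, so confirm them against each provider's documentation.

```python
import os

# Assumed environment-variable names; verify against each provider's docs
# and effGen's configuration reference before relying on them.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "groq": "GROQ_API_KEY",
}


def configured_providers(env=None) -> dict:
    """Map each provider to True when its key variable is set and non-empty."""
    env = os.environ if env is None else env
    return {
        name: bool(env.get(var, "").strip())
        for name, var in PROVIDER_ENV_VARS.items()
    }


for provider, ok in configured_providers().items():
    print(f"{provider:<10} {'configured' if ok else 'missing'}")
```

Passing `env` explicitly keeps the function easy to test without touching the real environment.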
### Quick Cloud Start
```python
from effgen import load_model, Agent
from effgen.core.agent import AgentConfig
from effgen.tools.builtin import Calculator
# Any of the 9 cloud providers
model = load_model("llama-3.1-8b-instant", provider="groq") # Groq
# model = load_model("meta-llama/Llama-3.3-70B-Instruct-Turbo", provider="together")
# model = load_model("Qwen/Qwen2.5-72B-Instruct", provider="hf")
agent = Agent(config=AgentConfig(name="agent", model=model, tools=[Calculator()]))
result = agent.run("What is (17 * 23) + sqrt(144)?")
print(result.output) # → 403
```
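The expected answer in the example above can be checked by hand, which is useful when judging whether a small model used the calculator correctly. A minimal check using only the standard library:

```python
import math

# (17 * 23) + sqrt(144) = 391 + 12 = 403
expected = 17 * 23 + math.sqrt(144)
print(expected)  # 403.0
```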
### Top Recommended Models
| Model | Size | Compatibility |
|-------|------|---------------|
| **LFM2.5-1.2B-Instruct-MLX-8bit** | 1.2B | Apple Silicon optimized, fast agentic |
| **Qwen2.5-1.5B-Instruct** | 1.5B | 10/10 agents pass |
| **Qwen2.5-3B-Instruct** | 3B | 10/10 agents pass (recommended default) |
| **Phi-4-mini-instruct** | 3.8B | 10/10 agents pass |
| Qwen3-1.7B | 1.7B | 9.5/10 |
| Qwen2.5-7B-Instruct | 7B | 9/10 |
| Llama-3.2-3B-Instruct | 3B | 8.5/10 |
> Full matrix with 11 models x 10 agents: [compatibility_matrix.md](examples/utils/compatibility_matrix.md)
---
## Security
| Feature | Description |
|---------|-------------|
| Docker Sandbox | Isolated execution |
| Input Validation | Auto sanitization |
| Rate Limiting | Configurable limits |
> For security policies and vulnerability reporting, see [SECURITY.md](SECURITY.md)
---
## Deployment
effGen v0.2.10 ships production-ready deployment recipes for four targets: Docker, Kubernetes (Helm), AWS Lambda, and Cloudflare Workers.
### Docker
Multi-stage build with a non-root user, read-only filesystem, and `/health` healthcheck. See [`docs/deploy/docker.md`](docs/deploy/docker.md).
```bash
docker build -f deploy/docker/Dockerfile -t effgen:0.2.10 .
docker run -p 8000:8000 --env-file .env effgen:0.2.10
curl http://localhost:8000/health
```
### Kubernetes / Helm
Full Helm chart with Deployment, Service, Ingress, NetworkPolicy, PDB, and HPA (scales on CPU + `effgen_model_call_latency_seconds`). See [`docs/deploy/kubernetes.md`](docs/deploy/kubernetes.md).
```bash
helm lint deploy/k8s/helm/effgen/
helm install effgen deploy/k8s/helm/effgen/ --set image.tag=0.2.10
```
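To reason about how an HPA with two metrics behaves, the standard Kubernetes scaling rule can be sketched as follows. The metric readings and targets are hypothetical, not values from the effGen chart, and the sketch omits the HPA's tolerance band and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """Kubernetes HPA rule: ceil(currentReplicas * currentValue / targetValue)."""
    return math.ceil(current_replicas * current_value / target_value)


def hpa_decision(current_replicas: int, metrics: dict) -> int:
    """With several metrics, the HPA takes the largest per-metric proposal."""
    return max(
        desired_replicas(current_replicas, current, target)
        for current, target in metrics.values()
    )


# Hypothetical readings: (current value, target value) per metric.
metrics = {
    "cpu_utilization_percent": (90.0, 60.0),
    "effgen_model_call_latency_seconds": (1.2, 1.0),
}
print(hpa_decision(4, metrics))  # 6
```

Here CPU proposes 6 replicas and latency proposes 5, so the larger value wins. Whichever metric is furthest above its target drives the scale-up.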
### AWS Lambda
Mangum adapter wrapping the FastAPI app. Cold start < 3 s; warm call < 100 ms. SAM template included. See [`docs/deploy/lambda.md`](docs/deploy/lambda.md).
```bash
cd deploy/aws_lambda
sam build && sam deploy --guided
```
### Cloudflare Worker
Thin edge proxy handling CORS, Bearer JWT auth, and KV-backed rate limiting before forwarding to your backend. See [`docs/deploy/cloudflare.md`](docs/deploy/cloudflare.md).
```bash
cd deploy/cloudflare
wrangler deploy # staging: wrangler deploy --env staging
```
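The rate-limiting step can be illustrated with a fixed-window counter. This is a sketch in Python rather than the Worker's own code: a plain dict stands in for the KV namespace, and the class name, limit, and window size are assumptions made for illustration.

```python
import time
from typing import Optional


class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each `window_s`-second window."""

    def __init__(self, limit: int, window_s: int, store: Optional[dict] = None) -> None:
        self.limit = limit
        self.window_s = window_s
        self.store = {} if store is None else store  # stands in for a KV namespace

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        window = int(now // self.window_s)
        key = f"{client_id}:{window}"  # one counter per client per window
        count = self.store.get(key, 0)
        if count >= self.limit:
            return False
        self.store[key] = count + 1
        return True


limiter = FixedWindowLimiter(limit=2, window_s=60)
print([limiter.allow("client-a", now=0.0) for _ in range(3)])  # [True, True, False]
print(limiter.allow("client-a", now=61.0))  # True
```

A real KV store is eventually consistent and needs a TTL on each counter key, so limits enforced this way at the edge are approximate rather than exact.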
---
## Developer Experience
### VSCode Extension
Prompt-template completion, inline "Run" code lens on `LibraryPrompt` definitions, and hover docs, all sourced from the effGen registry. See [`docs/dx/vscode.md`](docs/dx/vscode.md).
```bash
cd tools/vscode-effgen
npm ci && npm run compile
# Install: Extensions → ··· → Install from VSIX → vscode-effgen-*.vsix
```
### Jupyter Magics
```python
# Run each group below in its own notebook cell
%load_ext effgen.jupyter

%effgen_chat "What is 17 * 23?"

%%effgen_agent general
Summarise the top HackerNews stories today and rank them by interest.

%effgen_metrics
```
See [`docs/dx/jupyter.md`](docs/dx/jupyter.md).
### Live Dashboard
The API server serves a real-time SPA at `/dashboard` (no auth required). Panels: span stream (SSE), Prometheus metrics, recent agent runs with token counts and cost, SLO burn rates. See [`docs/dx/dashboard.md`](docs/dx/dashboard.md).
```bash
EFFGEN_DEV_MODE=1 effgen serve --port 8000
open http://localhost:8000/dashboard
```
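For readers unfamiliar with the SLO burn-rate panel, the conventional definition is the observed error ratio divided by the error budget. The sketch below uses that definition; whether the dashboard computes it exactly this way is an assumption, and the numbers are hypothetical.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - slo_target)."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


# Hypothetical window: 5 failed runs out of 1000 against a 99% SLO.
print(round(burn_rate(errors=5, total=1000, slo_target=0.99), 2))  # 0.5
```

A burn rate of 1 means the error budget would be used up exactly at the end of the SLO window. Values above 1 mean it runs out early.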
---
## Security
### Secret Scanning
Gitleaks pre-commit hook + CI workflow (`secret-scan.yml`) catch secrets before they reach the repo. Install the hook once:
```bash
pip install pre-commit && pre-commit install
```
### Sandboxed Code Execution
`CodeExecutor` defaults to `SubprocessSandbox` (rootless user-namespace, network blocked, isolated `/tmp`) or `DockerSandbox` when Docker is available. To opt out (not recommended):
```bash
EFFGEN_SANDBOX_BACKEND=off effgen run ... # loud warning emitted
```
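To show the general shape of subprocess-based isolation, here is a deliberately minimal sketch. It is far weaker than the sandbox described above (no user namespace, no network blocking) and is not effGen's `SubprocessSandbox`. `run_isolated` is a hypothetical helper, and the stripped-down environment is an assumption that works on Linux and macOS but may need extra variables on Windows.

```python
import os
import subprocess
import sys
import tempfile


def run_isolated(code: str, timeout_s: float = 5.0) -> subprocess.CompletedProcess:
    """Run `code` in a child interpreter with a throwaway working directory."""
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores PYTHON* vars and user site
            cwd=workdir,                         # scratch directory, deleted afterwards
            env={"PATH": os.defpath},            # avoid leaking parent env (API keys, tokens)
            capture_output=True,
            text=True,
            timeout=timeout_s,                   # raises subprocess.TimeoutExpired on overrun
        )


result = run_isolated("print(sum(range(10)))")
print(result.stdout.strip())  # 45
```

Even this small version shows three habits worth keeping: a scratch working directory, a scrubbed environment, and a hard timeout.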
### API Server Auth
Protect your API server with OAuth2/OIDC (any OIDC provider, such as Auth0, Keycloak, or Cognito):
```bash
export EFFGEN_OIDC_ISSUER=https://your-tenant.auth0.com/
export EFFGEN_OIDC_CLIENT_ID=your-client-id
export EFFGEN_OIDC_JWKS_URI=https://your-tenant.auth0.com/.well-known/jwks.json
effgen serve --port 8000
```
See [`docs/server/auth.md`](docs/server/auth.md), [`docs/server/rbac.md`](docs/server/rbac.md), and [`docs/server/audit.md`](docs/server/audit.md).
---
## Citation
If you use **effGen** in your research, please cite our paper:
```bibtex
@software{srivastava2026effgen,
title={effGen: Enabling Small Language Models as Capable Autonomous Agents},
author={Gaurav Srivastava and Aafiya Hussain and Chi Wang and Yingyan Celine Lin and Xuan Wang},
year={2026},
eprint={2602.00887},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.00887},
}
```
---
## License
Apache License 2.0. See [LICENSE](LICENSE) for details.
---