{"id":38559001,"url":"https://github.com/kenyony/flexllm","last_synced_at":"2026-02-24T13:01:14.215Z","repository":{"id":331881324,"uuid":"1132161461","full_name":"KenyonY/flexllm","owner":"KenyonY","description":"High-Performance LLM Client for Production Batch processing with checkpoint recovery, response caching, load balancing, and cost tracking","archived":false,"fork":false,"pushed_at":"2026-02-07T07:05:58.000Z","size":502,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-07T16:59:23.905Z","etag":null,"topics":["claude","claude-code","gemini","llm","mllm","openai","python","rate-limit"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/KenyonY.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"docs/roadmap.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-11T12:57:51.000Z","updated_at":"2026-02-07T07:06:02.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/KenyonY/flexllm","commit_stats":null,"previous_names":["kenyony/flexllm"],"tags_count":20,"template":false,"template_full_name":null,"purl":"pkg:github/KenyonY/flexllm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KenyonY%2Fflexllm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KenyonY%2Fflexllm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositor
ies/KenyonY%2Fflexllm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KenyonY%2Fflexllm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/KenyonY","download_url":"https://codeload.github.com/KenyonY/flexllm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KenyonY%2Fflexllm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29783615,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-24T10:45:18.109Z","status":"ssl_error","status_checked_at":"2026-02-24T10:45:09.911Z","response_time":75,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["claude","claude-code","gemini","llm","mllm","openai","python","rate-limit"],"created_at":"2026-01-17T07:46:01.267Z","updated_at":"2026-02-24T13:01:14.207Z","avatar_url":"https://github.com/KenyonY.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003eflexllm\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003cstrong\u003eHigh-Performance LLM Client for Production\u003c/strong\u003e\u003cbr\u003e\n    \u003cem\u003eBatch processing with checkpoint recovery, response caching, load balancing, and cost tracking\u003c/em\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n 
   \u003ca href=\"https://pypi.org/project/flexllm/\"\u003e\n        \u003cimg src=\"https://img.shields.io/pypi/v/flexllm?color=brightgreen\u0026style=flat-square\" alt=\"PyPI version\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/KenyonY/flexllm/blob/main/LICENSE\"\u003e\n        \u003cimg alt=\"License\" src=\"https://img.shields.io/github/license/KenyonY/flexllm.svg?color=blue\u0026style=flat-square\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://pypistats.org/packages/flexllm\"\u003e\n        \u003cimg alt=\"pypi downloads\" src=\"https://img.shields.io/pypi/dm/flexllm?style=flat-square\"\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n## Why flexllm?\n\n**Built for production batch processing at scale.**\n\n```python\nfrom flexllm import LLMClient\n\nclient = LLMClient(base_url=\"https://api.openai.com/v1\", model=\"gpt-4\", api_key=\"...\")\n\n# Process 100k requests with automatic checkpoint recovery\n# Interrupted at 50k? Just restart - it continues from 50,001\nresults = await client.chat_completions_batch(\n    messages_list,\n    output_jsonl=\"results.jsonl\",  # Progress saved here\n    show_progress=True,\n    track_cost=True,  # Real-time cost display\n)\n```\n\n**Scale out across multiple endpoints with zero code change.**\n\n```python\nfrom flexllm import LLMClient\n\n# Same LLMClient API, just pass endpoints for multi-node\nclient = LLMClient(\n    endpoints=[\n        {\"base_url\": \"http://gpu1:8000/v1\", \"model\": \"qwen\", \"concurrency_limit\": 50},\n        {\"base_url\": \"http://gpu2:8000/v1\", \"model\": \"qwen\", \"concurrency_limit\": 20},\n        {\"base_url\": \"http://gpu3:8000/v1\", \"model\": \"qwen\"},\n    ],\n    fallback=True,  # Auto-switch on endpoint failure\n)\n\nresults = await client.chat_completions_batch(messages_list, output_jsonl=\"results.jsonl\")\n```\n\n---\n\n## Features\n\n| Feature                          | Description                                                           
          |\n| -------------------------------- | ------------------------------------------------------------------------------- |\n| **Checkpoint Recovery**    | Batch jobs auto-resume from interruption - process millions of requests safely  |\n| **Multi-Endpoint Pool**   | Distribute tasks across GPU nodes with shared-queue dynamic balancing and automatic failover |\n| **Response Caching**       | Built-in caching with TTL and IPC multi-process sharing                         |\n| **Cost Tracking**          | Real-time cost monitoring with budget control                                   |\n| **High-Performance Async** | Fine-grained concurrency control, QPS limiting, and streaming                   |\n| **Multi-Provider**         | Supports OpenAI-compatible APIs, Gemini, Claude                                 |\n| **Multimodal Preprocessing** | Auto-convert local files/URLs to base64 for `image_url`, `video_url`, `audio_url`, `input_audio` |\n| **Agent (Tool-Use Loop)**  | AgentClient with automatic tool calling, parallel execution, multi-turn chat, and built-in tools (read/write/edit/glob/grep/bash) |\n\n---\n\n## Installation\n\n```bash\npip install flexllm\n\n# With all features\npip install flexllm[all]\n```\n\n### Claude Code Integration\n\nEnable Claude Code to use flexllm for LLM API calls, batch processing, and more:\n\n```bash\nflexllm install-skill\n```\n\nAfter installation, Claude Code gains the ability to use flexllm across all your projects.\n\n---\n\n## Quick Start\n\n### Basic Usage\n\n```python\nfrom flexllm import LLMClient\n\n# Recommended: use context manager for proper resource cleanup\nasync with LLMClient(\n    model=\"gpt-4\",\n    base_url=\"https://api.openai.com/v1\",\n    api_key=\"your-api-key\"\n) as client:\n    # Async call\n    response = await client.chat_completions([\n        {\"role\": \"user\", \"content\": \"Hello!\"}\n    ])\n\n# Sync version (also supports context manager)\nwith LLMClient(model=\"gpt-4\", 
base_url=\"...\", api_key=\"...\") as client:\n    response = client.chat_completions_sync([\n        {\"role\": \"user\", \"content\": \"Hello!\"}\n    ])\n\n# Get token usage\nresult = await client.chat_completions(\n    messages=[{\"role\": \"user\", \"content\": \"Hello!\"}],\n    return_usage=True,  # Returns ChatCompletionResult with usage info\n)\nprint(f\"Tokens: {result.usage}\")  # {'prompt_tokens': 10, 'completion_tokens': 5, ...}\n```\n\n### Batch Processing with Checkpoint Recovery\n\nProcess millions of requests safely. If interrupted, just restart - it continues from where it left off.\n\n```python\nmessages_list = [\n    [{\"role\": \"user\", \"content\": f\"Question {i}\"}]\n    for i in range(100000)\n]\n\n# Interrupted at 50,000? Re-run and it continues from 50,001.\nresults = await client.chat_completions_batch(\n    messages_list,\n    output_jsonl=\"results.jsonl\",  # Progress saved here\n    show_progress=True,\n)\n```\n\n### Multi-Endpoint Pool\n\nDistribute batch tasks across multiple GPU nodes / API endpoints. Faster endpoints automatically handle more tasks via a shared queue model, with automatic failover and health monitoring.\n\n\u003e Single endpoint: pass `model`/`base_url`. Multiple endpoints: pass `endpoints`. 
Same `LLMClient`, same API.\n\n```python\nfrom flexllm import LLMClient\n\nclient = LLMClient(\n    endpoints=[\n        # Each endpoint can have independent rate limits\n        {\"base_url\": \"http://gpu1:8000/v1\", \"model\": \"qwen\", \"concurrency_limit\": 50, \"max_qps\": 100},\n        {\"base_url\": \"http://gpu2:8000/v1\", \"model\": \"qwen\", \"concurrency_limit\": 20, \"max_qps\": 50},\n        {\"base_url\": \"http://gpu3:8000/v1\", \"model\": \"qwen\"},\n    ],\n    fallback=True,               # Auto-switch on endpoint failure\n    failure_threshold=3,         # Mark unhealthy after 3 consecutive failures\n    recovery_time=60.0,          # Try to recover after 60 seconds\n)\n\n# Single request — automatic failover across endpoints\nresult = await client.chat_completions(messages)\n\n# Distributed batch — shared queue, dynamic load balancing, checkpoint recovery\nresults = await client.chat_completions_batch(\n    messages_list,\n    distribute=True,\n    output_jsonl=\"results.jsonl\",\n    track_cost=True,\n)\n\n# Streaming with failover\nasync for chunk in client.chat_completions_stream(messages):\n    print(chunk, end=\"\", flush=True)\n```\n\n**Highlights:**\n- **Shared Queue**: Faster endpoints automatically pull more tasks — no manual tuning needed\n- **Automatic Failover**: Failed requests retry on healthy endpoints; unhealthy nodes auto-recover\n- **Per-Endpoint Config**: Independent `concurrency_limit` and `max_qps` for each endpoint\n- **Full Feature Support**: Checkpoint recovery, caching, cost tracking all work with Pool\n\n### Response Caching\n\n```python\nfrom flexllm import LLMClient, ResponseCacheConfig\n\nclient = LLMClient(\n    model=\"gpt-4\",\n    base_url=\"https://api.openai.com/v1\",\n    api_key=\"your-api-key\",\n    cache=ResponseCacheConfig(enabled=True, ttl=3600),  # 1 hour TTL\n)\n\n# First call: API request (~2s, ~$0.01)\nresult1 = await client.chat_completions(messages)\n\n# Second call: Cache hit (~0.001s, 
$0)\nresult2 = await client.chat_completions(messages)\n```\n\n### Cost Tracking\n\n```python\n# Track costs during batch processing\nresults, cost_report = await client.chat_completions_batch(\n    messages_list,\n    return_cost_report=True,\n)\nprint(f\"Total cost: ${cost_report.total_cost:.4f}\")\n\n# Real-time cost display in progress bar\nresults = await client.chat_completions_batch(\n    messages_list,\n    track_cost=True,  # Shows 💰 $0.0012 in progress bar\n)\n```\n\n### Streaming\n\n```python\n# Token-by-token streaming\nasync for chunk in client.chat_completions_stream(messages):\n    print(chunk, end=\"\", flush=True)\n\n# Batch streaming - process results as they complete\nasync for result in client.iter_chat_completions_batch(messages_list):\n    process(result)\n```\n\n### Thinking Mode (Reasoning Models)\n\nUnified interface for DeepSeek-R1, Qwen3, Claude extended thinking, Gemini thinking.\n\n```python\nresult = await client.chat_completions(\n    messages,\n    thinking=True,      # Enable thinking\n    return_raw=True,\n)\n\n# Unified parsing across all providers\nparsed = client.parse_thoughts(result.data)\nprint(\"Thinking:\", parsed[\"thought\"])\nprint(\"Answer:\", parsed[\"answer\"])\n```\n\n### Multimodal Preprocessing\n\nAutomatically convert local file paths and URLs to base64 data URIs. 
Supports images, videos, and audio — just pass local paths in your messages:\n\n```python\nfrom flexllm.msg_processors import messages_preprocess\n\nmessages = [\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\"type\": \"image_url\", \"image_url\": {\"url\": \"/path/to/image.png\"}},\n            {\"type\": \"video_url\", \"video_url\": {\"url\": \"/path/to/video.mp4\"}},\n            {\"type\": \"input_audio\", \"input_audio\": {\"data\": \"/path/to/audio.wav\", \"format\": \"wav\"}},\n            {\"type\": \"text\", \"text\": \"Describe what you see and hear.\"},\n        ],\n    }\n]\n\n# All local paths → base64 data URIs (async)\nprocessed = await messages_preprocess(messages)\nresult = await client.chat_completions(processed)\n```\n\n| Content type   | Source field       | Output format             |\n|----------------|--------------------|---------------------------|\n| `image_url`    | `image_url.url`    | `data:image/...;base64,…` (with resize support) |\n| `video_url`    | `video_url.url`    | `data:video/...;base64,…` |\n| `audio_url`    | `audio_url.url`    | `data:audio/...;base64,…` |\n| `input_audio`  | `input_audio.data` | Raw base64 (no `data:` prefix, OpenAI format) |\n\nSupported sources: local file paths, `file://` URIs, HTTP/HTTPS URLs, existing `data:` URIs (passthrough).\nClaude and Gemini clients automatically convert these to their native formats.\n\n### Tool Calls (Function Calling)\n\n```python\ntools = [{\n    \"type\": \"function\",\n    \"function\": {\n        \"name\": \"get_weather\",\n        \"description\": \"Get weather information\",\n        \"parameters\": {\n            \"type\": \"object\",\n            \"properties\": {\"location\": {\"type\": \"string\"}},\n            \"required\": [\"location\"],\n        },\n    },\n}]\n\nresult = await client.chat_completions(\n    messages=[{\"role\": \"user\", \"content\": \"What's the weather in Tokyo?\"}],\n    tools=tools,\n    
return_usage=True,\n)\n\nif result.tool_calls:\n    for call in result.tool_calls:\n        print(f\"Call: {call.function['name']}({call.function['arguments']})\")\n```\n\n### Agent (Tool-Use Loop)\n\n`AgentClient` wraps `LLMClient` and handles the tool-calling loop automatically: LLM calls → execute tools → feed results back → repeat until done.\n\n```python\nfrom flexllm import AgentClient, LLMClient\n\nclient = LLMClient(model=\"gpt-4\", base_url=\"...\", api_key=\"...\")\n\nagent = AgentClient(\n    client=client,\n    system=\"You are a helpful assistant.\",\n    tools=[{...}],                        # OpenAI-format tool definitions\n    tool_executor=my_tool_fn,             # (name, arguments_json) -\u003e result\n    max_rounds=10,\n)\n\n# Stateless single task\nresult = await agent.run(\"Check the weather in Beijing\")\n# result.content, result.rounds, result.tool_calls, result.usage\n\n# Stateful multi-turn chat (auto-maintains message history)\nr1 = await agent.chat(\"Hello\")\nr2 = await agent.chat(\"Check the weather\")   # carries r1 context\nagent.reset()\n\n# Structured output with Pydantic\nfrom pydantic import BaseModel\nclass Decision(BaseModel):\n    action: str\n    reason: str\n\nresult = await agent.run(\"Analyze this\", response_format=Decision)\nresult.parsed  # -\u003e Decision(action=\"approve\", reason=\"...\")\n```\n\n---\n\n## CLI\n\n```bash\n# Quick ask\nflexllm ask \"What is Python?\"\n\n# Interactive chat\nflexllm chat\n\n# Batch processing with cost tracking\nflexllm batch input.jsonl -o output.jsonl --track-cost\nflexllm batch input.jsonl -o output.jsonl -n 5           # First 5 records only\nflexllm batch data.jsonl -o out.jsonl -uf text -sf sys   # Custom field names\n\n# Model management\nflexllm list              # Configured models\nflexllm models            # Remote available models\nflexllm set-model gpt-4   # Set default model\nflexllm test              # Test connection\nflexllm init              # Initialize config 
file\n\n# Serve - wrap LLM as HTTP API (for fine-tuned model deployment)\nflexllm serve -m qwen-finetuned -s \"You are an assistant\"\nflexllm serve --thinking true -p 8000 -v  # With thinking mode + request logging\n\n# Agent mode with built-in tools\nflexllm agent --tools code \"Read main.py and analyze it\"  # Code tools (read/edit/glob/grep/bash)\nflexllm agent --tools all \"Create and modify files\"       # All tools (includes write)\nflexllm agent --tools code -v \"Debug the issue\"           # Verbose mode (show execution details)\nflexllm chat --tools code                                 # Interactive multi-turn agent\nflexllm agent --tools shell,dtflow \"Clean data.jsonl\"     # Legacy CLI tools\n\n# Utilities\nflexllm pricing gpt-4     # Query model pricing\nflexllm credits           # Check API key balance\nflexllm mock              # Start mock LLM server for testing\n```\n\n### Configuration\n\nConfig file location: `~/.flexllm/config.yaml`\n\nSee [config.example.yaml](config.example.yaml) for a comprehensive configuration example with all available options, or [config.quickstart.yaml](config.quickstart.yaml) for a minimal quick-start template.\n\n```yaml\n# Default model\ndefault: \"gpt-4\"\n\n# Global system prompt (applied to all commands unless overridden)\nsystem: \"You are a helpful assistant.\"\n\n# Global user content template (applied to all user messages unless overridden)\n# Use {content} as placeholder for original user content\n# user_template: \"{content}/detail\"\n\n# Model list\nmodels:\n  - id: gpt-4\n    name: gpt-4\n    provider: openai\n    base_url: https://api.openai.com/v1\n    api_key: your-api-key\n    system: \"You are a GPT-4 assistant.\"  # Model-specific system prompt (optional)\n\n  - id: local-finetuned\n    name: local-finetuned\n    provider: openai\n    base_url: http://localhost:8000/v1\n    api_key: EMPTY\n    user_template: \"{content}/detail\"  # Model-specific user template for fine-tuned models (optional)\n    # Model params: any 
field beyond meta fields (id/name/provider/base_url/api_key/system/user_template)\n    # is automatically passed through to the LLM API\n    max_tokens: 512\n    temperature: 0.3\n\n  - id: local-ollama\n    name: local-ollama\n    provider: openai\n    base_url: http://localhost:11434/v1\n    api_key: EMPTY\n\n# Batch command config (optional)\nbatch:\n  concurrency: 20\n  cache: true\n  track_cost: true\n  system: \"You are a batch processing assistant.\"  # Batch-specific system prompt (optional)\n  # user_template: \"[INST]{content}[/INST]\"  # Batch-specific user template (optional)\n```\n\n**Model params priority** (higher priority overrides lower):\n1. CLI argument (e.g., `-t 0.5`, `--max-tokens 100`)\n2. Batch config (batch command only, e.g., `batch.temperature`)\n3. Model config (e.g., `models[].temperature`, `models[].max_tokens`)\n4. Command defaults (e.g., chat/chat-web defaults: temperature=0.7, max_tokens=2048)\n\nAny field in model config beyond the meta fields (`id`, `name`, `provider`, `base_url`, `api_key`, `system`, `user_template`) is treated as a model call parameter and automatically passed through to the LLM API.\n\n**System prompt priority** (higher priority overrides lower):\n1. CLI argument (`-s/--system`)\n2. Batch config (`batch.system`)\n3. Model config (`models[].system`)\n4. Global config (`system`)\n\n**User template priority** (higher priority overrides lower):\n1. CLI argument (`--user-template`)\n2. Batch config (`batch.user_template`)\n3. Model config (`models[].user_template`)\n4. Global config (`user_template`)\n\nUser template uses `{content}` as placeholder for original user content. 
Useful for fine-tuned models requiring specific prompt formats (e.g., `\"{content}/detail\"`, `\"[INST]{content}[/INST]\"`).\n\nEnvironment variables (higher priority than config file):\n\n- `FLEXLLM_BASE_URL` / `OPENAI_BASE_URL`\n- `FLEXLLM_API_KEY` / `OPENAI_API_KEY`\n- `FLEXLLM_MODEL` / `OPENAI_MODEL`\n\n---\n\n## Architecture\n\n```\nflexllm/\n├── clients/           # All client implementations\n│   ├── base.py        # Abstract base class (LLMClientBase)\n│   ├── llm.py         # Unified entry point (LLMClient)\n│   ├── openai.py      # OpenAI-compatible backend\n│   ├── gemini.py      # Google Gemini backend\n│   ├── claude.py      # Anthropic Claude backend\n│   ├── pool.py        # Multi-endpoint load balancer\n│   └── router.py      # Provider routing strategies\n├── agent/             # Agent layer (tool-use loop)\n│   ├── client.py      # AgentClient implementation\n│   ├── types.py       # AgentResult, ToolCallRecord\n│   └── tools/         # Built-in tools (read/write/edit/glob/grep/bash)\n├── cli/               # CLI commands and helpers\n├── pricing/           # Cost estimation and tracking\n├── serve.py           # HTTP API server (flexllm serve)\n├── cache/             # Response caching with IPC\n├── async_api/         # High-performance async engine\n└── msg_processors/    # Multi-modal message processing\n```\n\nThe architecture follows a simple layered design:\n\n```\nAgentClient (tool-use loop, multi-turn chat, structured output)\n    │\n    └── LLMClient (single endpoint or multi-endpoint)\n            │                                  │\n            │                                  ├── ProviderRouter (round_robin)\n            │                                  ├── Health Monitor (failure threshold + auto recovery)\n            │                                  └── Shared Task Queue (dynamic load balancing)\n            │                                  │\n            └──────────── Backend Clients ─────┘\n                            ├── 
OpenAIClient\n                            ├── GeminiClient\n                            └── ClaudeClient\n                                    │\n                                    └── LLMClientBase (Abstract - 4 methods to implement)\n                                            │\n                                            ├── ConcurrentRequester (Async engine)\n                                            ├── ResponseCache (Caching layer)\n                                            └── CostTracker (Cost monitoring)\n```\n\n---\n\n## API Reference\n\n### LLMClient\n\n```python\nLLMClient(\n    provider: str = \"auto\",        # \"auto\", \"openai\", \"gemini\", \"claude\"\n    model: str,                    # Model name\n    base_url: str = None,          # API base URL (required for openai)\n    api_key: str = \"EMPTY\",        # API key\n    cache: ResponseCacheConfig,    # Cache config\n    concurrency_limit: int = 10,   # Max concurrent requests\n    max_qps: float = None,         # Max requests per second\n    retry_times: int = 3,          # Retry count on failure\n    timeout: int = 120,            # Request timeout (seconds)\n)\n```\n\n### Main Methods\n\n| Method                                         | Description                 |\n| ---------------------------------------------- | --------------------------- |\n| `chat_completions(messages)`                 | Single async request        |\n| `chat_completions_sync(messages)`            | Single sync request         |\n| `chat_completions_batch(messages_list)`      | Batch async with checkpoint |\n| `iter_chat_completions_batch(messages_list)` | Streaming batch results     |\n| `chat_completions_stream(messages)`          | Token-by-token streaming    |\n\n### AgentClient\n\n```python\nAgentClient(\n    client: LLMClient,                # LLMClient instance (composition, not inheritance)\n    system: str = None,                # System prompt\n    tools: list[dict] = None,          # OpenAI-format 
tool definitions\n    tool_executor: Callable = None,    # (name, arguments_json) -\u003e result (sync or async)\n    max_rounds: int = 10,              # Max tool-calling rounds per run\n    max_context_tokens: int = None,    # Optional context window limit\n)\n```\n\n| Method               | Description                                           |\n| -------------------- | ----------------------------------------------------- |\n| `run(user_input)`    | Stateless single task with tool-use loop              |\n| `chat(user_input)`   | Stateful multi-turn chat (auto-maintains history)     |\n| `reset()`            | Clear conversation history                            |\n\nReturns `AgentResult` with `.content`, `.rounds`, `.tool_calls`, `.usage`, `.parsed`.\n\n---\n\n## License\n\nApache 2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkenyony%2Fflexllm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkenyony%2Fflexllm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkenyony%2Fflexllm/lists"}