https://github.com/g-eoj/cragents
Agents that constrain token generation.
llguidance pydantic-ai vllm
Last synced: 4 months ago
- Host: GitHub
- URL: https://github.com/g-eoj/cragents
- Owner: g-eoj
- License: apache-2.0
- Created: 2025-10-14T21:32:08.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2026-01-31T00:47:43.000Z (4 months ago)
- Last Synced: 2026-01-31T11:45:26.444Z (4 months ago)
- Topics: llguidance, pydantic-ai, vllm
- Language: Python
- Homepage: https://g-eoj.github.io/cragents/
- Size: 428 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# `cragents`
**C**onstrain **R**easoning **Agents** to limit reasoning output.
## Why
> And I'm thinking While I'm thinking... (Crackerman, Stone Temple Pilots, 1992)
Reasoning models use a lot of tokens for their reasoning output.
This is resource intensive while not necessarily improving accuracy - have you ever seen a reasoning model talk itself out of the right answer?
So it may be desirable to limit the tokens used.
Doing so can:
- Improve response speed
- Decrease GPU memory requirements
- Provide more space in the context for stuff that matters
- Improve accuracy on user queries that do not require extended analysis
## How
`cragents` provides a utility to constrain `pydantic-ai` [agents](https://ai.pydantic.dev/agents/) when [vLLM](https://docs.vllm.ai/en/stable/) serves the agent's model.
It limits the number of paragraphs in the reasoning output and the number of sentences in each paragraph.
Both limits are configurable.
```py
import os

from cragents import CRAgent, vllm_model_profile
from pydantic_ai.models.openai import OpenAIChatModel, OpenAIChatModelSettings
from pydantic_ai.providers.openai import OpenAIProvider

# Point pydantic-ai at the vLLM server's OpenAI-compatible endpoint.
model = OpenAIChatModel(
    model_name=os.environ["VLLM_MODEL_NAME"],
    provider=OpenAIProvider(
        api_key=os.environ["VLLM_API_KEY"],
        base_url=os.environ["VLLM_BASE_URL"],
    ),
    profile=vllm_model_profile,
    settings=OpenAIChatModelSettings(
        max_tokens=1000,
    ),
)

agent = CRAgent(model)
# Top-level `await` needs an async context, such as a notebook or `python -m asyncio`.
# Allow at most one reasoning paragraph containing at most one sentence.
await agent.constrain_reasoning(reasoning_paragraph_limit=1, reasoning_sentence_limit=1)
run = await agent.run("Hi")
```
Inspecting the `ThinkingPart`s shows that output is constrained.
```py
from pydantic_ai.messages import ThinkingPart

for message in run.all_messages():
    for part in message.parts:
        if isinstance(part, ThinkingPart):
            print(part)
```
```sh
ThinkingPart(content='\nOkay, the user said "Hi".\n', id='content', provider_name='openai')
```
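To check the limits programmatically rather than by eye, the reasoning text can be counted with a small helper. This is a rough sketch, not part of `cragents`: `count_reasoning_units` is a hypothetical name, and it assumes paragraphs are separated by blank lines and sentences end in `.`, `!` or `?`, which may differ from how `cragents` defines them.

```py
import re


def count_reasoning_units(content: str) -> tuple[int, list[int]]:
    """Return (paragraph count, sentences per paragraph) for reasoning text."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", content) if p.strip()]
    # Assumption: a sentence ends with '.', '!' or '?', optionally followed by
    # a closing quote or bracket, then whitespace or the end of the paragraph.
    sentence_end = re.compile(r"[.!?]+[\"')\]]*(?=\s|$)")
    sentences = [len(sentence_end.findall(p)) for p in paragraphs]
    return len(paragraphs), sentences


# The reasoning content from the output shown earlier: one paragraph, one sentence.
print(count_reasoning_units('\nOkay, the user said "Hi".\n'))
```

A paragraph that was cut off before its closing punctuation counts as zero sentences here, so treat the numbers as a heuristic.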
For the above example, vLLM was run on a single RTX 4090:
```sh
uv run vllm serve "Qwen/Qwen3-VL-8B-Thinking-FP8" --gpu-memory-utilization 0.92 --api-key $VLLM_API_KEY --enable-auto-tool-choice --tool-call-parser hermes --max-model-len 40000 --guided-decoding-backend guidance
```
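The `guidance` backend in the command above lets vLLM restrict decoding to tokens that keep the output inside a grammar. To make "at most N paragraphs of at most M sentences" concrete, here is a sketch that expresses such a limit as a bounded regular expression. It is an illustration only: the README does not say how `cragents` builds its constraint, `reasoning_pattern` is a hypothetical helper, and the definition of a sentence is deliberately simplistic.

```py
import re


def reasoning_pattern(paragraph_limit: int, sentence_limit: int) -> re.Pattern[str]:
    """Build a regex accepting at most the given number of paragraphs and sentences."""
    if paragraph_limit < 1 or sentence_limit < 1:
        raise ValueError("limits must be at least 1")
    # Hypothetical, simplified sentence: text without terminators or newlines,
    # followed by a single terminator.
    sentence = r"[^.!?\n]+[.!?]"
    # One paragraph: 1 to sentence_limit sentences separated by a space.
    paragraph = rf"{sentence}(?: {sentence}){{0,{sentence_limit - 1}}}"
    # Whole text: 1 to paragraph_limit paragraphs separated by a blank line.
    body = rf"{paragraph}(?:\n\n{paragraph}){{0,{paragraph_limit - 1}}}"
    return re.compile(body)


pattern = reasoning_pattern(paragraph_limit=1, sentence_limit=1)
print(bool(pattern.fullmatch('Okay, the user said "Hi".')))       # True
print(bool(pattern.fullmatch("First thought. Second thought.")))  # False
```

A constrained decoder would apply a rule like this while tokens are generated, so a second sentence could never be emitted; the regex here only checks text after the fact.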
### Limitations
- Only models that use the `<think></think>` tokens to denote reasoning will work
- Only models that use the `<tool_call></tool_call>` tokens to denote tool calls will work
- vLLM must be started without a reasoning parser (`pydantic-ai` will still extract reasoning content correctly)
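Because vLLM runs without a reasoning parser, reasoning arrives inline in the response text, wrapped in delimiter tokens, and is separated out on the client side. The sketch below shows the general idea, assuming the delimiters are `<think>` and `</think>` as emitted by Qwen3 thinking models. `split_reasoning` is a hypothetical helper, not the `pydantic-ai` implementation, and the raw string is made up for illustration.

```py
def split_reasoning(
    text: str, open_tag: str = "<think>", close_tag: str = "</think>"
) -> tuple[str, str]:
    """Split raw model text into (reasoning, answer)."""
    start = text.find(open_tag)
    end = text.find(close_tag)
    if start == -1 or end == -1 or end < start:
        # No well-formed reasoning block: treat everything as the answer.
        return "", text
    reasoning = text[start + len(open_tag):end]
    answer = text[:start] + text[end + len(close_tag):]
    return reasoning, answer.strip()


# Made-up raw response used only to exercise the helper.
raw = '<think>\nOkay, the user said "Hi".\n</think>\n\nHello! How can I help?'
print(split_reasoning(raw))
```

A model that marks reasoning with different tokens would fall into the "no reasoning block" branch here, which is one way to see why the tag requirement matters.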