https://github.com/g-eoj/cragents

Agents that constrain token generation.
Host: GitHub
URL: https://github.com/g-eoj/cragents
Owner: g-eoj
License: apache-2.0
Created: 2025-10-14T21:32:08.000Z (8 months ago)
Default Branch: main
Last Pushed: 2026-01-31T00:47:43.000Z (4 months ago)
Last Synced: 2026-01-31T11:45:26.444Z (4 months ago)
Topics: llguidance, pydantic-ai, vllm
Language: Python
Homepage: https://g-eoj.github.io/cragents/
Size: 428 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE

# `cragents`

**C**onstrain **R**easoning **Agents** to limit reasoning output.

## Why

> And I'm thinking While I'm thinking... (Crackerman, Stone Temple Pilots, 1992)

Reasoning models use a lot of tokens for their reasoning output.

This is resource-intensive and does not necessarily improve accuracy. Have you ever seen a reasoning model talk itself out of the right answer?

So it may be desirable to limit the tokens used.

Doing so can:

- Improve response speed

- Decrease GPU memory requirements

- Provide more space in the context for stuff that matters

- Improve accuracy on user queries that do not require extended analysis

## How

`cragents` provides a utility to constrain `pydantic-ai` [agents](https://ai.pydantic.dev/agents/) when [vLLM](https://docs.vllm.ai/en/stable/) serves the agent's model.

It limits the number of paragraphs, and the number of sentences per paragraph, in the reasoning output.

The limits are configurable.

```py
import os

from cragents import CRAgent, vllm_model_profile
from pydantic_ai.models.openai import OpenAIChatModel, OpenAIChatModelSettings
from pydantic_ai.providers.openai import OpenAIProvider

# Point the OpenAI-compatible client at the vLLM server.
model = OpenAIChatModel(
    model_name=os.environ["VLLM_MODEL_NAME"],
    provider=OpenAIProvider(
        api_key=os.environ["VLLM_API_KEY"],
        base_url=os.environ["VLLM_BASE_URL"],
    ),
    profile=vllm_model_profile,
    settings=OpenAIChatModelSettings(
        max_tokens=1000,
    ),
)

agent = CRAgent(model)

# Top-level `await` needs an async context, such as a notebook or `python -m asyncio`.
# Limit reasoning to one paragraph containing one sentence.
await agent.constrain_reasoning(reasoning_paragraph_limit=1, reasoning_sentence_limit=1)
run = await agent.run("Hi")
```

Inspecting the `ThinkingPart`s shows that the reasoning output is constrained to a single sentence.

```py
from pydantic_ai.messages import ThinkingPart

for message in run.all_messages():
    for part in message.parts:
        if isinstance(part, ThinkingPart):
            print(part)
```

```sh
ThinkingPart(content='\nOkay, the user said "Hi".\n', id='content', provider_name='openai')
```

For the above example, vLLM was run on a single RTX 4090:

```sh
uv run vllm serve "Qwen/Qwen3-VL-8B-Thinking-FP8" \
    --gpu-memory-utilization 0.92 \
    --api-key $VLLM_API_KEY \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --max-model-len 40000 \
    --guided-decoding-backend guidance
```
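
To make the paragraph and sentence limits more concrete, here is a minimal sketch of the shape of text such a constraint allows. It is not how `cragents` builds its constraint (the source does not show that); the function name `reasoning_pattern`, the sentence definition (text ending in `.`, `!`, or `?`), and the blank-line paragraph separator are all assumptions made for illustration. The regex only checks finished text, whereas the real constraint is enforced during decoding.

```python
import re


def reasoning_pattern(paragraph_limit: int, sentence_limit: int) -> re.Pattern[str]:
    """Hypothetical helper: regex for at most N paragraphs of at most M sentences."""
    if paragraph_limit < 1 or sentence_limit < 1:
        raise ValueError("limits must be at least 1")
    # Assumed sentence shape: no terminator or newline inside, closed by . ! or ?
    sentence = r"[^.!?\n]+[.!?]"
    # A paragraph is 1..sentence_limit sentences separated by single spaces.
    paragraph = rf"{sentence}(?: {sentence}){{0,{sentence_limit - 1}}}"
    # Reasoning is 1..paragraph_limit paragraphs separated by blank lines,
    # wrapped in the leading and trailing newline seen in the sample output.
    return re.compile(rf"\n{paragraph}(?:\n\n{paragraph}){{0,{paragraph_limit - 1}}}\n")


sample = '\nOkay, the user said "Hi".\n'  # ThinkingPart content from the example
print(bool(reasoning_pattern(1, 1).fullmatch(sample)))  # True
print(bool(reasoning_pattern(1, 1).fullmatch("\nFirst. Second.\n")))  # False
```

Because the limits translate into bounded repetition, the allowed text can be described by a grammar or regex, which is the kind of description a guided-decoding backend can enforce token by token.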

### Limitations

- Only models that use the `<think></think>` tokens to denote reasoning will work

- Only models that use the `<tool_call></tool_call>` tokens to denote tool calls will work

- vLLM must be started without a reasoning parser (`pydantic-ai` will still extract reasoning content correctly)
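
As a rough illustration of the last point, the sketch below shows how reasoning can be separated from the answer on the client side when the server returns the raw text. It is not `pydantic-ai`'s implementation; the helper name `split_reasoning` is hypothetical, and the `<think>`/`</think>` delimiters are an assumption based on what Qwen3 thinking models emit.

```python
import re

# Assumed delimiters; other model families may use different tokens.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> tuple[str, str]:
    """Hypothetical helper: return (reasoning, answer) from raw model text."""
    match = THINK_RE.search(raw)
    if match is None:
        # No reasoning block found, so everything is the answer.
        return "", raw.strip()
    reasoning = match.group(1)
    # Drop the reasoning block and keep the surrounding text as the answer.
    answer = raw[: match.start()] + raw[match.end():]
    return reasoning, answer.strip()


raw = '<think>\nOkay, the user said "Hi".\n</think>\n\nHello!'
reasoning, answer = split_reasoning(raw)
print(repr(reasoning))  # '\nOkay, the user said "Hi".\n'
print(answer)  # Hello!
```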
