https://github.com/leonardosul/mixture-llm
Combine multiple LLMs for better outputs.
https://github.com/leonardosul/mixture-llm
llms moa python
Last synced: 4 months ago
JSON representation
Combine multiple LLMs for better outputs.
- Host: GitHub
- URL: https://github.com/leonardosul/mixture-llm
- Owner: leonardosul
- License: mit
- Created: 2025-12-16T06:57:15.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-12-18T04:25:22.000Z (6 months ago)
- Last Synced: 2025-12-20T15:28:11.160Z (6 months ago)
- Topics: llms, moa, python
- Language: Python
- Homepage: https://leonardosul.github.io/mixture-llm/
- Size: 47.9 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
# mixture-llm
[](https://github.com/leonardosul/mixture-llm/actions/workflows/ci.yaml)
[](https://pypi.org/project/mixture-llm/)
[](https://leonardosul.github.io/mixture-llm)
Combine LLMs to beat the best single LLM.
The Mixture-of-Agents architecture achieved **65.1% on AlpacaEval 2.0** using only open-source models—surpassing GPT-4o's 57.5%. This library gives you the building blocks to construct these pipelines.
## Install
```bash
pip install mixture-llm
```
## Quick start
```python
from mixture_llm import Propose, Aggregate, run
pipeline = [
Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5", "llama-3.3-70b"]),
Aggregate("gpt-5-nano-2025-08-07"),
]
result, history = await run(pipeline, "What is quantum computing?", my_client)
```
## Paper-accurate pipelines
### Together MoA (65.1% AlpacaEval)
The benchmark-winning configuration from [Wang et al. (2024)](https://arxiv.org/abs/2406.04692): 3 layers, 6 diverse proposers, Qwen aggregator.
```python
PROPOSERS = [
"wizardlm-2-8x22b",
"qwen1.5-110b-chat",
"qwen1.5-72b-chat",
"llama-3-70b-instruct",
"mixtral-8x22b-instruct",
"dbrx-instruct",
]
together_moa = [
Propose(PROPOSERS, temp=0.7, max_tokens=512),
Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
Aggregate("qwen1.5-110b-chat"),
]
```
### MoA-Lite (59.3% AlpacaEval)
Cost-optimized 2-layer variant—still beats GPT-4o.
```python
moa_lite = [
Propose(PROPOSERS, temp=0.7, max_tokens=512),
Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
Aggregate("qwen1.5-72b-chat"),
]
```
### Self-MoA (+6.6% over standard MoA)
[Li et al. (2025)](https://arxiv.org/abs/2502.00674) showed that sampling one top model multiple times can outperform diverse model mixtures.
```python
# Same model, multiple samples via temperature
self_moa = [
Propose(["gpt-5-nano-2025-08-07"] * 6, temp=0.7),
Aggregate("gpt-5-nano-2025-08-07"),
]
```
### With robustness (shuffle + dropout)
Prevents positional bias and improves diversity.
```python
robust_moa = [
Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5", "llama-70b", "gemini-2.5-flash"]),
Shuffle(),
Dropout(0.2),
Aggregate("gpt-5-nano-2025-08-07"),
]
```
## Steps
**LLM steps** — call models:
- `Propose(agents)` — generate initial responses in parallel
- `Synthesize(agents)` — each agent synthesizes all previous outputs
- `Aggregate(agent)` — single model combines everything into final output
- `Refine(agents)` — improve each response individually
- `Rank(agent, n)` — select top n responses by quality
- `Vote(agent)` — pick consensus answer
**Transform steps** — manipulate responses:
- `Shuffle()` — randomize order (prevents position bias)
- `Dropout(rate)` — randomly drop responses (improves robustness)
- `Sample(n)` — random subset
- `Take(n)` — first n responses
- `Filter(fn)` — keep responses matching predicate
- `Map(fn)` — transform each response
## Configuration
Every LLM step accepts `temp` and `max_tokens`:
```python
Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5"], temp=0.9, max_tokens=4096)
```
Override the synthesis prompt:
```python
Aggregate("gpt-5-nano-2025-08-07", prompt="Pick the single best response and return it verbatim.")
```
## Client examples
Your client is an async function with this signature:
```python
async def client(model, messages, temp, max_tokens) -> tuple[str, int, int]:
# Returns (response_text, input_tokens, output_tokens)
```
### OpenAI SDK (OpenAI + Anthropic models)
```python
from openai import AsyncOpenAI
openai_client = AsyncOpenAI()
anthropic_client = AsyncOpenAI(
base_url="https://api.anthropic.com/v1/",
api_key=os.environ["ANTHROPIC_API_KEY"],
)
async def multi_provider_client(model, messages, temp, max_tokens):
client = anthropic_client if model.startswith("claude") else openai_client
# GPT-5: max_completion_tokens, no temperature, minimal reasoning
is_gpt5 = model.startswith("gpt-5")
params = {"model": model, "messages": messages}
params.update({"max_completion_tokens": max_tokens, "reasoning_effort": "minimal"} if is_gpt5 else {"max_tokens": max_tokens, "temperature": temp})
resp = await client.chat.completions.create(**params)
return resp.choices[0].message.content, resp.usage.prompt_tokens, resp.usage.completion_tokens
# Mix providers in one pipeline
pipeline = [
Propose(["gpt-5-nano-2025-08-07", "claude-sonnet-4-5", "gpt-5-nano-2025-08-07"]),
Aggregate("claude-sonnet-4-5"),
]
```
### OpenRouter (access all models via one API)
```python
from openai import AsyncOpenAI
client = AsyncOpenAI(
base_url="https://openrouter.ai/api/v1",
api_key=os.environ["OPENROUTER_API_KEY"],
)
async def openrouter_client(model, messages, temp, max_tokens):
resp = await client.chat.completions.create(
model=model, messages=messages, temperature=temp, max_tokens=max_tokens
)
return resp.choices[0].message.content, resp.usage.prompt_tokens, resp.usage.completion_tokens
# Together MoA models via OpenRouter
PROPOSERS = [
"qwen/qwen-2.5-72b-instruct",
"meta-llama/llama-3.3-70b-instruct",
"mistralai/mixtral-8x22b-instruct",
]
together_moa_openrouter = [
Propose(PROPOSERS, temp=0.7, max_tokens=512),
Synthesize(PROPOSERS, temp=0.7, max_tokens=512),
Aggregate("qwen/qwen-2.5-72b-instruct"),
]
```
### Groq via LiteLLM (free tier)
Groq offers free access to several models. Great for experimentation.
```python
from litellm import acompletion
async def groq_client(model, messages, temp, max_tokens):
resp = await acompletion(
model=f"groq/{model}", messages=messages, temperature=temp, max_tokens=max_tokens
)
return resp.choices[0].message.content, resp.usage.prompt_tokens, resp.usage.completion_tokens
# Free Groq models (check console.groq.com/docs/rate-limits for current list)
GROQ_FREE = [
"llama-3.3-70b-versatile",
"llama-3.1-8b-instant",
"qwen/qwen3-32b",
"meta-llama/llama-4-scout-17b-16e-instruct",
]
free_moa = [
Propose(GROQ_FREE, temp=0.7, max_tokens=512),
Aggregate("llama-3.3-70b-versatile"),
]
# Self-MoA with Groq (single model, multiple samples)
free_self_moa = [
Propose(["llama-3.3-70b-versatile"] * 4, temp=0.7),
Aggregate("llama-3.3-70b-versatile"),
]
```
## Examples
The [`examples/`](examples/) directory contains tested, runnable scripts for different providers. See [`examples/EXAMPLES.md`](examples/EXAMPLES.md) for detailed documentation.
| Example | Provider | What You'll Learn |
|---------|----------|-------------------|
| [`openai_basic.py`](examples/openai_basic.py) | OpenAI | Basic MoA pattern (Propose → Aggregate), client setup, token tracking |
| [`openai_self_moa.py`](examples/openai_self_moa.py) | OpenAI | Self-MoA technique—one model sampled 6 times beats diverse mixtures |
| [`multi_provider.py`](examples/multi_provider.py) | OpenAI + Anthropic | Provider routing, Shuffle step to prevent position bias |
| [`openrouter_moa.py`](examples/openrouter_moa.py) | OpenRouter | 3-layer MoA (Propose → Synthesize → Aggregate), paper configuration |
| [`groq_free.py`](examples/groq_free.py) | Groq | Free experimentation, LiteLLM integration, Dropout for robustness |
| [`with_history.py`](examples/with_history.py) | Groq | Pipeline debugging, Rank step, execution history inspection |
```bash
# Install and run
pip install -e ".[examples]"
export OPENAI_API_KEY=sk-...
python examples/openai_basic.py
# Or try free with Groq
export GROQ_API_KEY=gsk_...
python examples/groq_free.py
```
## Key findings from the research
- **Aggregator quality matters 2x more than proposer quality** — invest in your final model
- **3 layers is the sweet spot** — diminishing returns beyond this
- **Diversity vs quality tradeoff** — Self-MoA shows a single great model can beat diverse mediocre ones
- **6 proposers optimal** — gains diminish after this point
## References
- Wang et al. "Mixture-of-Agents Enhances Large Language Model Capabilities" (2024) — [arXiv:2406.04692](https://arxiv.org/abs/2406.04692)
- Li et al. "Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?" (2025) — [arXiv:2502.00674](https://arxiv.org/abs/2502.00674)
## License
MIT