https://github.com/elibutters/cascadeinference
Cascade based inference for LLMs
https://github.com/elibutters/cascadeinference
cascade cascade-inference chatgpt claude gemini google openai
Last synced: 2 months ago
JSON representation
Cascade based inference for LLMs
- Host: GitHub
- URL: https://github.com/elibutters/cascadeinference
- Owner: elibutters
- License: mit
- Created: 2025-06-16T19:16:14.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-17T00:12:35.000Z (about 1 year ago)
- Last Synced: 2026-01-04T15:16:31.565Z (6 months ago)
- Topics: cascade, cascade-inference, chatgpt, claude, gemini, google, openai
- Language: Python
- Homepage:
- Size: 26.4 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Cascade Inference
Cascade based inference for large language models.
## Installation
```bash
pip install cascade-inference
# To use semantic agreement, install the optional dependencies:
pip install cascade-inference[semantic]
```
## Basic Usage
> **💡 Pro-Tip:** It is highly recommended to use Level 1 client models from the same or similar model families (e.g., all Llama-based, all Qwen-based). This improves the reliability of the `semantic` agreement strategy. If you mix models from different families (like Llama and Gemini), consider lowering the `threshold` in the agreement strategy to account for stylistic differences.
Using the library is as simple as a standard OpenAI API call.
```python
from openai import OpenAI
import cascade
import os
# Setup your clients
client = OpenAI(
base_url="https://openrouter.ai/api/v1",
api_key=os.environ.get("OPENROUTER_API_KEY"),
)
# Call the create function directly
response = cascade.chat.completions.create(
# Provide the ensemble of fast clients
level1_clients=[
(client, "meta-llama/llama-3.1-8b-instruct"),
(client, "google/gemini-flash-1.5")
],
# Provide the single, powerful client for escalation
level2_client=(client, "openai/gpt-4o"),
agreement_strategy="semantic", # or "strict"
messages=[
{"role": "user", "content": "What are the key differences between HBM3e and GDDR7 memory?"}
]
)
# The response object looks just like a standard OpenAI response
print(response.choices[0].message.content)
```
## Advanced Configuration
For more control, you can pass a dictionary to the `agreement_strategy` parameter. This allows you to fine-tune the agreement logic.
### 1. Changing the Semantic Similarity Threshold
You can adjust how strictly the semantic comparison is applied. The `threshold` is a value between 0 and 1, where 1 is a perfect match. The default is `0.9`.
```python
response = cascade.chat.completions.create(
# ... clients and messages ...
agreement_strategy={
"name": "semantic",
"threshold": 0.95 # Require a 95% similarity match
},
# ...
)
```
### 2. Using a Different Embedding Model
The default model is `sentence-transformers/all-MiniLM-L6-v2`, which is fast and lightweight. You can specify any other model compatible with the [**`FastEmbed`** library](https://qdrant.github.io/fastembed/examples/Supported_Models/).
Some other excellent choices from the supported models list include:
* `nomic-ai/nomic-embed-text-v1.5`
* `sentence-transformers/paraphrase-multilingual-mpnet-base-v2`: For multilingual use cases.
The library will automatically download and cache the new model on the first run.
```python
response = cascade.chat.completions.create(
# ... clients and messages ...
agreement_strategy={
"name": "semantic",
"model_name": "BAAI/bge-base-en-v1.5", # A larger, more powerful model
"threshold": 0.85 # It's good practice to adjust the threshold for a new model
},
# ...
)
```
### 3. Using a Remote Embedding Model
If local embedding is too slow, you can use the `remote_semantic` strategy. This feature is optimized for the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index) and is the recommended way to perform remote comparisons.
**Usage:**
You must provide a Hugging Face API key, which you can get for free from your account settings: [**huggingface.co/settings/tokens**](https://huggingface.co/settings/tokens).
The key can be passed directly via the `api_key` parameter or set as the `HUGGING_FACE_HUB_TOKEN` environment variable.
The default model is `sentence-transformers/all-mpnet-base-v2`, but you can easily use other models from the [**`sentence-transformers`**](https://huggingface.co/sentence-transformers) family on the Hub. We recommend the following models for the remote strategy:
* **Default & High-Quality:** `sentence-transformers/all-mpnet-base-v2`
* **Lightweight & Fast:** `sentence-transformers/all-MiniLM-L6-v2`
* **Multilingual:** `sentence-transformers/paraphrase-multilingual-mpnet-base-v2`
```python
response = cascade.chat.completions.create(
# ... clients and messages ...
agreement_strategy={
"name": "remote_semantic",
"model_name": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2", # A multilingual model
"threshold": 0.95,
"api_key": "hf_YourHuggingFaceToken" # Optional, can also be set via env variable
},
# ...
)
```
You can also point the strategy to a completely different API provider by overriding the `api_url`, but you may need to fork the `RemoteSemanticAgreement` class if the provider requires a different payload structure.