https://github.com/elibutters/cascadeinference

Cascade based inference for LLMs

Host: GitHub
URL: https://github.com/elibutters/cascadeinference
Owner: elibutters
License: mit
Created: 2025-06-16T19:16:14.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-06-17T00:12:35.000Z (about 1 year ago)
Last Synced: 2026-01-04T15:16:31.565Z (6 months ago)
Topics: cascade, cascade-inference, chatgpt, claude, gemini, google, openai
Language: Python
Homepage:
Size: 26.4 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Cascade Inference

Cascade based inference for large language models.

## Installation

```bash
pip install cascade-inference

# To use semantic agreement, install the optional dependencies
# (quoted so that shells such as zsh do not expand the brackets):
pip install "cascade-inference[semantic]"
```

## Basic Usage

> **💡 Pro-Tip:** Use Level 1 client models from the same or a similar model family (for example, all Llama-based or all Qwen-based). This makes the `semantic` agreement strategy more reliable. If you mix models from different families (such as Llama and Gemini), consider lowering the `threshold` in the agreement strategy to account for stylistic differences.

A cascade call looks much like a standard OpenAI chat completion call.

```python
import os

from openai import OpenAI

import cascade

# Set up your clients
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ.get("OPENROUTER_API_KEY"),
)

# Call the create function directly
response = cascade.chat.completions.create(
    # Provide the ensemble of fast clients
    level1_clients=[
        (client, "meta-llama/llama-3.1-8b-instruct"),
        (client, "google/gemini-flash-1.5"),
    ],
    # Provide the single, powerful client for escalation
    level2_client=(client, "openai/gpt-4o"),
    agreement_strategy="semantic",  # or "strict"
    messages=[
        {"role": "user", "content": "What are the key differences between HBM3e and GDDR7 memory?"}
    ],
)

# The response object looks just like a standard OpenAI response
print(response.choices[0].message.content)
```
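The README does not spell out the control flow behind `create`, so the following is only a minimal sketch of how a two-level cascade is commonly structured, not the library's implementation. The names `cascade_answer` and `strict_agreement` are hypothetical, and models are stood in for by plain callables: the fast Level 1 models each draft an answer, and the Level 2 model is called only when the drafts fail the agreement check.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[str], str]  # hypothetical stand-in for a (client, model) pair


def strict_agreement(answers: Sequence[str]) -> bool:
    """Illustrative 'strict' check: answers match after case/whitespace normalization."""
    normalized = {" ".join(answer.lower().split()) for answer in answers}
    return len(normalized) == 1


def cascade_answer(
    prompt: str,
    level1: Sequence[Model],
    level2: Model,
    agree: Callable[[Sequence[str]], bool] = strict_agreement,
) -> Tuple[str, str]:
    """Return (answer, tier), where tier records which level produced the answer."""
    drafts = [ask(prompt) for ask in level1]
    if drafts and agree(drafts):
        # The cheap models agree, so the expensive model is never called.
        return drafts[0], "level1"
    # Disagreement (or no Level 1 models): escalate to the stronger model.
    return level2(prompt), "level2"


# Toy models that return canned strings instead of calling an API.
fast_a = lambda prompt: "Paris"
fast_b = lambda prompt: "  paris "
fast_c = lambda prompt: "Lyon"
strong = lambda prompt: "Paris, France"

print(cascade_answer("Capital of France?", [fast_a, fast_b], strong))
print(cascade_answer("Capital of France?", [fast_a, fast_c], strong))
```

The first call is answered at Level 1 because both drafts normalize to the same string; the second escalates because the drafts differ.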

## Advanced Configuration

For more control, you can pass a dictionary to the `agreement_strategy` parameter. This allows you to fine-tune the agreement logic.

### 1. Changing the Semantic Similarity Threshold

You can adjust how strictly the semantic comparison is applied. The `threshold` is a value between 0 and 1, where 1 is a perfect match. The default is `0.9`.

```python
response = cascade.chat.completions.create(
    # ... clients and messages ...
    agreement_strategy={
        "name": "semantic",
        "threshold": 0.95  # Require a 95% similarity match
    },
    # ...
)
```
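To make the role of `threshold` concrete, here is a small sketch of a threshold check over embedding vectors. It assumes the score is cosine similarity and that every pair of Level 1 answers must clear the threshold; both are assumptions for illustration, since the README does not state the metric or the pairing rule. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def all_pairs_agree(embeddings: Sequence[Sequence[float]], threshold: float = 0.9) -> bool:
    """True when every pair of embeddings is at least `threshold` similar."""
    return all(
        cosine_similarity(u, v) >= threshold
        for u, v in combinations(embeddings, 2)
    )


# Toy vectors standing in for real sentence embeddings.
close_pair = [[1.0, 0.0], [0.9, 0.1]]
far_pair = [[1.0, 0.0], [0.0, 1.0]]

print(all_pairs_agree(close_pair, threshold=0.95))
print(all_pairs_agree(far_pair, threshold=0.95))
```

Raising the threshold makes agreement harder to reach, so more requests escalate to Level 2; lowering it does the opposite.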

### 2. Using a Different Embedding Model

The default model is `sentence-transformers/all-MiniLM-L6-v2`, which is fast and lightweight. You can specify any other model compatible with the [**`FastEmbed`** library](https://qdrant.github.io/fastembed/examples/Supported_Models/).

Some other excellent choices from the supported models list include:

*   `nomic-ai/nomic-embed-text-v1.5`

*   `sentence-transformers/paraphrase-multilingual-mpnet-base-v2`: For multilingual use cases.

The library will automatically download and cache the new model on the first run.

```python
response = cascade.chat.completions.create(
    # ... clients and messages ...
    agreement_strategy={
        "name": "semantic",
        "model_name": "BAAI/bge-base-en-v1.5",  # A larger, more powerful model
        "threshold": 0.85  # It's good practice to adjust the threshold for a new model
    },
    # ...
)
```
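Different embedding models spread their similarity scores differently, which is why a threshold tuned for one model may not carry over to another. One possible way to pick a starting value, sketched here under assumptions, is to score a few answer pairs you have judged by hand and place the threshold between the two groups. The helper `suggest_threshold` is hypothetical and not part of the library, and the scores below are invented purely for illustration.

```python
from typing import Sequence


def suggest_threshold(
    agree_scores: Sequence[float],
    disagree_scores: Sequence[float],
) -> float:
    """Midpoint between the lowest 'agree' score and the highest 'disagree' score."""
    lowest_agree = min(agree_scores)
    highest_disagree = max(disagree_scores)
    if highest_disagree >= lowest_agree:
        # No single cut-off separates the groups; the model or the labels need review.
        raise ValueError("score ranges overlap; no clean threshold separates them")
    return (lowest_agree + highest_disagree) / 2


# Invented similarity scores for hand-labeled answer pairs under a new model.
agree_scores = [0.91, 0.88, 0.93]   # pairs judged to say the same thing
disagree_scores = [0.70, 0.78]      # pairs judged to differ

print(round(suggest_threshold(agree_scores, disagree_scores), 2))
```

A value found this way is only a starting point; it is worth rechecking against real traffic before relying on it.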

### 3. Using a Remote Embedding Model

If local embedding is too slow, you can use the `remote_semantic` strategy. This feature is optimized for the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index) and is the recommended way to perform remote comparisons.

**Usage:**

You must provide a Hugging Face API key, which you can get for free from your account settings: [**huggingface.co/settings/tokens**](https://huggingface.co/settings/tokens).

The key can be passed directly via the `api_key` parameter or set as the `HUGGING_FACE_HUB_TOKEN` environment variable.

The default model is `sentence-transformers/all-mpnet-base-v2`, but you can easily use other models from the [**`sentence-transformers`**](https://huggingface.co/sentence-transformers) family on the Hub. We recommend the following models for the remote strategy:

*   **Default & High-Quality:** `sentence-transformers/all-mpnet-base-v2`

*   **Lightweight & Fast:** `sentence-transformers/all-MiniLM-L6-v2`

*   **Multilingual:** `sentence-transformers/paraphrase-multilingual-mpnet-base-v2`

```python
response = cascade.chat.completions.create(
    # ... clients and messages ...
    agreement_strategy={
        "name": "remote_semantic",
        "model_name": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",  # A multilingual model
        "threshold": 0.95,
        "api_key": "hf_YourHuggingFaceToken"  # Optional, can also be set via env variable
    },
    # ...
)
```
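The two ways of supplying the key can be pictured with the sketch below. It assumes an explicit `api_key` takes precedence over the `HUGGING_FACE_HUB_TOKEN` environment variable, which the README does not confirm, and the helper names `resolve_hf_token` and `auth_headers` are hypothetical. The bearer-token header is the form the Hugging Face Inference API expects.

```python
import os
from typing import Dict, Optional

ENV_VAR = "HUGGING_FACE_HUB_TOKEN"  # variable name taken from the text above


def resolve_hf_token(api_key: Optional[str] = None) -> str:
    """Prefer an explicit key, then the environment; fail loudly if neither is set."""
    token = api_key or os.environ.get(ENV_VAR)
    if not token:
        raise ValueError(f"No Hugging Face token: pass api_key or set {ENV_VAR}")
    return token


def auth_headers(token: str) -> Dict[str, str]:
    """Build the bearer-token header sent with each request."""
    return {"Authorization": f"Bearer {token}"}


# Example with a placeholder token (never hard-code a real one).
print(auth_headers(resolve_hf_token("hf_YourHuggingFaceToken")))
```

Reading the token from the environment keeps it out of source control, which is usually the safer of the two options.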

You can also point the strategy to a completely different API provider by overriding the `api_url`, but you may need to fork the `RemoteSemanticAgreement` class if the provider requires a different payload structure.
