An open API service indexing awesome lists of open source software.

https://github.com/centminmod/or-cli

Python command-line tool for interacting with AI models through the OpenRouter API/Cloudflare AI Gateway, or local self-hosted Ollama. Optionally support Microsoft LLMLingua prompt token compression
https://github.com/centminmod/or-cli

cloudflare-ai cloudflare-ai-gateway llm-inference llms ollama ollama-api openai-api openrouter openrouter-api

Last synced: 26 days ago
JSON representation

Python command-line tool for interacting with AI models through the OpenRouter API/Cloudflare AI Gateway, or local self-hosted Ollama. Optionally support Microsoft LLMLingua prompt token compression

Awesome Lists containing this project

README

          

I've been a paying user of ChatGPT Plus, Claude Pro, and Google Gemini Advanced since the beginning. However, for tasks involving automated text processing (summarization, transformation) of large datasets, I needed a non-GUI solution. I was using the OpenAI API, but then I discovered [OpenRouter AI](https://openrouter.ai) on February 16, 2025. OpenRouter AI offers a generous free tier, so I created `or-cli.py` to leverage it for my text processing needs. I've also added Ollama integration to use self-hosted models from [Hugging Face](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending).

# or-cli.py - OpenRouter AI Command-Line Interface

A versatile Python command-line tool for interacting with AI models through the [OpenRouter API](https://openrouter.ai/docs), supporting direct API calls, request caching via [Cloudflare AI Gateway](https://developers.cloudflare.com/ai-gateway/), or local model inference with [Ollama](https://ollama.ai/) which can optionally leverage [Microsoft LLMLingua](https://llmlingua.com/) prompt token compression techniques to reduce prompt token sizes.

## Table of Contents

- [Overview](#overview)
- [Key Features](#key-features)
- [Configuration](#configuration)
- [Requirements](#requirements)
- [Usage](#usage)
- [Command-Line Arguments](#command-line-arguments)
- [Example Usage](#example-usage)
- [Basic Usage](#basic-usage)
- [Working with Images](#working-with-images)
- [Model Selection](#model-selection)
- [Token Usage and Limits](#token-usage-and-limits)
- [Prompt Compression](#prompt-compression)
- [Multi-model Features](#multi-model-features)
- [Web Page Processing](#web-page-processing)
- [Xenforo Thread Summary](#xenforo-thread-summary)
- [Local Ollama Integration](#local-ollama-integration)
- [Conversational Exchanges](#conversational-exchanges)
- [Structured Output](#structured-output)
- [Technical Details](#technical-details)
- [Functions Overview](#functions-overview)
- [Yappi Profiling](#yappi-profiling)
- [Advanced Features](#advanced-features)
- [Prompt Compression](#prompt-compression)
- [Multi-model Evaluation](#multi-model-evaluation)
- [Web Page Processing](#web-page-processing)
- [Cloudflare AI Gateway Integration](#cloudflare-ai-gateway-integration)
- [Local Ollama Integration](#local-ollama-integration)

## Overview

`or-cli.py` is a powerful command-line interface (CLI) tool designed to communicate with AI language models through multiple pathways:

1. **Direct API Access** to OpenRouter's extensive model catalog - https://openrouter.ai/docs/api-reference/overview
2. **Cloudflare AI Gateway** for request proxying with intelligent caching - https://developers.cloudflare.com/ai-gateway/
3. **Local Ollama API** for on-premise model inference - https://ollama.com and use self-hosted LLM models from [Hugging Face](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending)

The tool streamlines AI interactions for a wide range of applications, from simple text completions to sophisticated multi-model evaluations, webpage analysis, and token-optimized prompt engineering.

## Key Features

- **Multimodal Support**: Send text prompts with optional image inputs
- **Code-Aware Processing**: Special handling for code snippets with the `--code` flag
- **Webpage Handling**: Convert HTML to markdown with intelligent content extraction via `--webpage`
- **Advanced Prompt Compression**: Reduce token usage by up to 60% with Microsoft LLMLingua compression techniques that can reduce token size by up to 60%!
- **Multi-Model Workflows**:
- **Evaluation Mode**: Have models evaluate each other's responses with `--eval`
- **Comparison Mode**: Get parallel responses from multiple models with `--multi`
- **Conversation Support**: Maintain context across messages with `--follow-up`
- **Usage Analytics**: Track token consumption and costs with `--tokens`
- **Debugging Tools**: Detailed logging with `--debug` and performance profiling with `--yappi`
- **Customizable Generation**: Fine-tune outputs with temperature, seed, top_p and more
- **JSON Structured Outputs**: Format responses as structured data when needed

## Configuration

**Configure API Keys and Environment**:
- Set your OpenRouter API key via the `OPENROUTER_API_KEY` environment variable or the `--api-key` flag
- For Cloudflare AI Gateway: set `USE_CLOUDFLARE_AI_GATEWAY=y`, `CF_ACCOUNT_ID`, and optionally `CF_GATEWAY_ID`
- For Ollama: ensure your instance is running at `http://localhost:11434/v1`

## Requirements

- **Python**: 3.6 or higher
- **Core Dependencies**:
```
requests # HTTP communication
openai # OpenAI SDK for API formatting
aiohttp # Asynchronous HTTP for webpage fetching
beautifulsoup4 # HTML parsing
html2text # HTML to Markdown conversion
htmlmin # HTML minification
orjson # Fast JSON processing
```
- **Optional Dependencies**:
```
llmlingua # Required for --compress features
yappi # Required for performance profiling
```

Install all dependencies with:
```bash
pip install requests openai aiohttp beautifulsoup4 html2text htmlmin orjson llmlingua yappi
```

## Usage

Run the script with command-line arguments to customize behavior. For full help:

```bash
python or-cli.py -h
usage: or-cli.py [-h] [-p PROMPT] [-m MESSAGE] [-c] [-i IMAGE] [--model MODEL] [--ollama] [--ollama-max-tokens OLLAMA_MAX_TOKENS] [-t] [-d] [--api-key API_KEY] [--temperature TEMPERATURE]
[--seed SEED] [--top-p TOP_P] [--max-tokens MAX_TOKENS] [--response-format RESPONSE_FORMAT] [--structured-outputs] [--include-reasoning] [--limits] [--eval] [--multi]
[--webpage WEBPAGE] [--condense [CONDENSE]] [--compress] [--compress-long] [--compress-long-question COMPRESS_LONG_QUESTION] [--compress-extended]
[--compress-batch-size COMPRESS_BATCH_SIZE] [--compress-force-token COMPRESS_FORCE_TOKEN] [--compress-rate COMPRESS_RATE] [--follow-up FOLLOW_UP] [--compress-save]
[--compress-save-path COMPRESS_SAVE_PATH] [-q] [--yappi] [--yappi-path YAPPI_PATH] [--yappi-export-format {callgrind,snakeviz,gprof2dot}]

Send a chat completion request to OpenRouter and optionally query generation stats or API key limits.

optional arguments:
-h, --help show this help message and exit
-p PROMPT, --prompt PROMPT
System prompt/instructions.
-m MESSAGE, --message MESSAGE
User message. If not provided, reads from stdin.
-c, --code Flag indicating message contains code (apply escaping).
-i IMAGE, --image IMAGE
Optional path to an image file to include.
--model MODEL LLM model to use (default: google/gemini-2.0-flash-lite-preview-02-05:free). When used with --eval, pass two or three comma-separated models.
--ollama Use local Ollama endpoint and default model.
--ollama-max-tokens OLLAMA_MAX_TOKENS
Override the maximum prompt token context size (default: 8000)
-t, --tokens Query and display token usage and cost info.
-d, --debug Enable detailed debug logging.
--api-key API_KEY Override API key (or set OPENROUTER_API_KEY environment variable).
--temperature TEMPERATURE
Sampling temperature (default: 0.3 for determinism)
--seed SEED Fixed seed for deterministic output (optional)
--top-p TOP_P Nucleus sampling top_p value (default: 1.0)
--max-tokens MAX_TOKENS
Upper limit for tokens to generate (optional)
--response-format RESPONSE_FORMAT
Response format as a JSON string (e.g., '{"type": "json_object"}'), or "json" as a shortcut.
--structured-outputs Enable structured outputs (optional)
--include-reasoning Include reasoning tokens in the response (optional)
--limits Check API key rate limits and usage
--eval Evaluate the first model's response with a second model (and optionally a third model) when using comma-separated models in --model
--multi Multi-model mode: all provided models respond to the same prompt
--webpage WEBPAGE Optional URL of a webpage to convert HTML to Markdown
--condense [CONDENSE]
Tiered condense level for --webpage mode: level 1 (default) uses first 1/3 and last 1/3; level 2 uses first 2/5 and last 1/5; level 3 uses first 1/5 and last 1/5 of pages.
Only applies if total pages > 10.
--compress Enable prompt compression using LLMLingua.
--compress-long Enable coarse-level compression using LongLLMLingua before LLMLingua-2
--compress-long-question COMPRESS_LONG_QUESTION
Override the default question for coarse compression (e.g. 'Summary this text').
--compress-extended If set with --compress, save extended compression parameters (fn_labeled_original_prompt and compressed_prompt_list) to files in compress_logs directory instead of printing
to stdout.
--compress-batch-size COMPRESS_BATCH_SIZE
Set the maximum batch size for prompt compression (default: 400)
--compress-force-token COMPRESS_FORCE_TOKEN
Set the maximum force token value for prompt compression (default: 10000)
--compress-rate COMPRESS_RATE
Set the compression rate for LLMLingua-2 (default: 0.4)
--follow-up FOLLOW_UP
Follow-up messages to send to the assistant
--compress-save If set with --compress, save the compressed prompt to a file instead of sending to the API
--compress-save-path COMPRESS_SAVE_PATH
File path to save the compressed prompt when --compress-save is used
-q, --quiet Hide header output (quiet mode)
--yappi Enable yappi profiling for performance analysis
--yappi-path YAPPI_PATH
Optional file path to save yappi profiling results
--yappi-export-format {callgrind,snakeviz,gprof2dot}
Optional export format for yappi profiling data: callgrind (for KCachegrind/QCachegrind), snakeviz (pstats format for SnakeViz), or gprof2dot (convert to a dot graph via
gprof2dot).

Examples: python or-cli.py --limits echo 'def foo(x): return x*2' | python or-cli.py -p 'You are a code explainer.' -m 'Explain the code:'
```

### Command-Line Arguments

| Flag | Description | Optional/Required | Default Value |
|------|-------------|-------------------|---------------|
| `-p`, `--prompt` | System prompt/instructions for the AI | Required* | N/A |
| `-m`, `--message` | User message to send | Optional | Reads from stdin |
| `-c`, `--code` | Indicates message contains code (applies escaping) | Optional | False |
| `-i`, `--image` | Path or URL to an image file to include + `-m` message prompt. Use a image input supporting model --model google/gemini-2.0-flash-001 | Optional | N/A |
| `--model` | AI model(s) to use; comma-separated for `--eval` or `--multi` | Optional | google/gemini-2.0-flash-lite-preview-02-05:free |
| `--ollama` | Use local Ollama API instead of OpenRouter | Optional | False |
| `--ollama-max-tokens` | Maximum prompt token context for Ollama | Optional | 8000 |
| `-t`, `--tokens` | Display token usage and cost information | Optional | False |
| `-d`, `--debug` | Enable detailed debug logging | Optional | False |
| `--api-key` | Override API key | Optional | From environment variable |
| `--temperature` | Sampling temperature (higher = more creative) | Optional | 0.3 |
| `--seed` | Fixed seed for deterministic output | Optional | None |
| `--top-p` | Nucleus sampling value | Optional | 1.0 |
| `--max-tokens` | Maximum tokens to generate | Optional | None |
| `--response-format` | JSON format specification | Optional | None |
| `--structured-outputs` | Enable structured outputs if model supports it | Optional | False |
| `--include-reasoning` | Include reasoning tokens in response | Optional | False |
| `--limits` | Check API key rate limits and usage | Optional | False |
| `--eval` | Evaluate first model's response with second/third model | Optional | False |
| `--multi` | Get responses from all specified models | Optional | False |
| `--webpage` | URL to convert to Markdown for input | Optional | N/A |
| `--condense` | Condense level (1-3) for Xenforo multi-page threads analysis | Optional | 1 (if specified) |
| `--compress` | Enable prompt compression with Microsoft LLMLingua | Optional | False |
| `--compress-long` | Enable two-stage compression pipeline | Optional | False |
| `--compress-long-question` | Custom prompt for coarse compression | Optional | "" |
| `--compress-extended` | Save detailed compression parameters | Optional | False |
| `--compress-batch-size` | Batch size for compression | Optional | 400 |
| `--compress-force-token` | Force token value for compression | Optional | 10000 |
| `--compress-rate` | Compression rate (0.0-1.0) | Optional | 0.4 |
| `--follow-up` | Add follow-up message(s) | Optional | [] |
| `--compress-save` | Save compressed prompt to file | Optional | False |
| `--compress-save-path` | File path for saved compressed prompt | Optional | ./compress-text.txt |
| `-q`, `--quiet` | Hide header output | Optional | False |
| `--yappi` | Enable yappi profiling | Optional | False |
| `--yappi-path` | Path for profiling results | Optional | N/A |
| `--yappi-export-format` | Format for profiling data export | Optional | N/A |

\* Required unless `--limits` is specified

**Notes:**
- **Webpage Condense Levels**:
- Level 1: First 1/3 + last 1/3 of pages
- Level 2: First 2/5 + last 1/5 of pages
- Level 3: First 1/5 + last 1/5 of pages

## Example Usage

### Basic Usage

Send a simple query to the default model - using default OpenRouter AI API endpoint and default Google Gemini 2.0 Flash Lite Preview LLM model. Add `--ollama` flag to use locally self-hosted Ollama and default local, llama3.2 LLM model::

```bash
python or-cli.py -p "You are a helpful assistant." -m "What is the capital of France?" -t
```

with `-q`

```bash
python or-cli.py -p "You are a helpful assistant." -m "What is the capital of France?" -q
The capital of France is Paris.
```

with `-t`:

```bash
python or-cli.py -p "You are a helpful assistant." -m "What is the capital of France?" -t

----- Assistant Response -----
The capital of France is Paris.

----- Generation Stats -----
Model Used: google/gemini-2.0-flash-lite-preview-02-05:free
Provider Name: Google
Generation Time: 44 ms
Prompt Tokens: 24
Completion Tokens: 7
Total Tokens: 31
Total Cost: $0
Usage: 0
Latency: 1036 ms
Native Tokens Prompt: 13
Native Tokens Completion: 8
Native Tokens Reasoning: 0
Native Tokens Total: 21
Cache Discount: None
Temperature: 0.3
Top P: 1.0
Seed: None
Max Tokens: None
Compress: False
Compress Rate (Setting): 0.4
Original Tokens (LLMLingua-2): N/A
Compressed Tokens (LLMLingua-2): N/A
Compression Rate (LLMLingua-2): N/A
LLMLingua-2 max_batch_size: N/A
LLMLingua-2 max_force_token: N/A
```

with `-d` debug mode:

```bash
python or-cli.py -p "You are a helpful assistant." -m "What is the capital of France?" -d
[DEBUG] Request Payload:
{
"model": "google/gemini-2.0-flash-lite-preview-02-05:free",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is the capital of France?"
}
],
"temperature": 0.3,
"top_p": 1.0,
"options": {
"num_ctx": 8000
},
"extra_body": {
"models": [
"google/gemini-2.0-flash-exp:free",
"google/gemini-exp-1206:free",
"google/gemini-2.0-pro-exp-02-05:free"
]
},
"structured_outputs": false,
"include_reasoning": false
}
[DEBUG] Model 'google/gemini-2.0-flash-lite-preview-02-05:free' does not support structured_outputs; omitting them.
[DEBUG] Using chat completion endpoint: https://gateway.ai.cloudflare.com/v1/CLOUDFLARE_ACC_ID/CLOUDFLARE_GATEWAY_ID/openrouter
[DEBUG] Raw Response:
{
"id": "gen-1740151914-twGTMOxVuGp4DxqZJIpN",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "The capital of France is Paris.\n",
"refusal": null,
"role": "assistant"
},
"native_finish_reason": "STOP"
}
],
"created": 1740151914,
"model": "google/gemini-2.0-flash-lite-preview-02-05",
"object": "chat.completion",
"usage": {
"completion_tokens": 8,
"prompt_tokens": 13,
"total_tokens": 21
},
"provider": "Google"
}

----- Assistant Response -----
The capital of France is Paris.
```

Use text from a file or pipe it from another command - using default OpenRouter AI API endpoint and default Google Gemini 2.0 Flash Lite Preview LLM model. Add `--ollama` flag to use locally self-hosted Ollama and default local, llama3.2 LLM model::

```bash
cat document.txt | python or-cli.py -p "Summarize this text:" -t
```

### Working with Images

Process an image with a text prompt - using default OpenRouter AI API endpoint. Would need to probably specifiy a LLM model that supports images from https://openrouter.ai/models. Seems only paid models support Images and cheapest model is [Google Gemini Flash 1.5](https://openrouter.ai/google/gemini-flash-1.5) `google/gemini-flash-1.5` at `$0.04/K` input imgs or [Google Gemini Flash 2.0](https://openrouter.ai/google/gemini-2.0-flash-001) `google/gemini-2.0-flash-001` at `$0.0258/K` input imgs. Note API only supports, PNG, JPEG, or WEBP image formats.

```bash
python or-cli.py -p "Describe what you see in detail:" -m "image" -i path/to/image.jpg --model google/gemini-2.0-flash-001
```
```bash
wget -O amazon.png https://assets.aboutamazon.com/2e/d7/ac71f1f344c39f8949f48fc89e71/amazon-logo-squid-ink-smile-orange.png

python or-cli.py -p "Describe what you see in detail:" -m "logo" -i amazon.png --model google/gemini-2.0-flash-001
```
```bash
python or-cli.py -p "Describe what you see in detail:" -m "logo" -i amazon.png --model google/gemini-2.0-flash-001

----- Assistant Response -----
The image shows the Amazon logo. The word "amazon" is written in a dark gray sans-serif font. Below the word is a curved orange arrow that starts under the "a" and ends at the "z", resembling a smile. The background is black.
```

Cost from OpenRouter AI API endpoint perspective for image + prompt tokens = $0.000499

![OpenRouter Google Gemini 2.0 Flash image cost metrics](/screenshots/openrouter-ai-image-processing-3.png)

![OpenRouter Google Gemini 2.0 Flash image cost metrics](/screenshots/openrouter-ai-image-processing-4.png)

From Cloudflare AI Gateway perspective

![Cloudflare AI Gateway metrics for image processing](/screenshots/openrouter-ai-image-processing-1.png)

![Cloudflare AI Gateway metrics for image processing](/screenshots/openrouter-ai-image-processing-2.png)

Using `-t` flag for token stats reporting cost of $0.0003416 for the prompt + image input:

```bash
python or-cli.py -p "Describe what you see in detail:" -m "logo" -i amazon.png --model google/gemini-2.0-flash-001 -t

----- Assistant Response -----
The image shows the logo for Amazon. The word "amazon" is written in a bold, sans-serif font in a dark gray color. Beneath the word, there is a curved orange line that resembles a smile. The line starts under the "a" and ends with an arrow pointing towards the "z," creating a visual connection between the two letters. The background is black.

----- Generation Stats -----
Model Used: google/gemini-2.0-flash-001
Provider Name: Google AI Studio
Generation Time: 571 ms
Prompt Tokens: 281
Completion Tokens: 77
Total Tokens: 358
Total Cost: $0.0003416
Usage: 0.0003416
Latency: 1862 ms
Native Tokens Prompt: 2850
Native Tokens Completion: 77
Native Tokens Reasoning: 0
Native Tokens Total: 2927
Cache Discount: None
Temperature: 0.3
Top P: 1.0
Seed: None
Max Tokens: None
Compress: False
Compress Rate (Setting): 0.4
Original Tokens (LLMLingua-2): N/A
Compressed Tokens (LLMLingua-2): N/A
Compression Rate (LLMLingua-2): N/A
LLMLingua-2 max_batch_size: N/A
LLMLingua-2 max_force_token: N/A
```

Repeated requests are being cached by Cloudflare AI Gateway reducing my costs.

![Cloudflare AI Gateway metrics for image processing](/screenshots/openrouter-ai-image-processing-6.png)

### Model Selection

Specify a particular model from OpenRouter:

```bash
python or-cli.py "You are a helpful assistant." -m "What is the capital of France?" -t --model google/gemini-2.0-flash-lite-preview-02-05
```

Specify a particular model from Ollama:

```bash
python or-cli.py "You are a helpful assistant." -m "What is the capital of France?" -t --ollama --model llama3.2
```

```bash
ollama list
NAME ID SIZE MODIFIED
llama3.2-custom:latest 6714623728ec 2.0 GB 17 hours ago
hf.co/bartowski/Mistral-Small-24B-Instruct-2501-GGUF:Q2_K 5d1899e4e37f 8.9 GB 25 hours ago
hf.co/bartowski/Mistral-Small-24B-Instruct-2501-GGUF:Q4_0 767466b55220 13 GB 25 hours ago
hf.co/bartowski/Mistral-Small-24B-Instruct-2501-GGUF:Q8_0 277756ddf3c1 25 GB 25 hours ago
hf.co/lmstudio-community/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q8_0 709f5ec4b28d 8.1 GB 25 hours ago
hf.co/Qwen/Qwen2.5-3B-Instruct-GGUF:Q8_0 b958eea7abce 3.6 GB 25 hours ago
llama3.2:latest a80c4f17acd5 2.0 GB 27 hours ago
hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0 66d1fb5ce973 3.4 GB 28 hours ago
```

### Token Usage and Limits

Check your OpenRouter AI API key limits and usage:

```bash
python or-cli.py --limits
```
```bash
python or-cli.py --limits

--- API Key Limits and Usage ---
Label: sk-or-v1-f20...469
Usage: 0 credits used
Credit Limit: Unlimited
Free Tier: True
Rate Limit: 10 requests per 10s
```

Track token usage for a request - using default OpenRouter AI API endpoint and default Google Gemini 2.0 Flash Lite Preview LLM model. Add `--ollama` flag to use locally self-hosted Ollama and default local, llama3.2 LLM model:

```bash
python or-cli.py -p "You are a helpful assistant." -m "What is the capital of France?" -t
```

### Prompt Compression

See [Web Page Processing](#web-page-processing) usage example of how [Microsoft LLMLingua](https://llmlingua.com/) prompt token compression is used.

![Microsoft LLMLingua prompt compression Screenshots](/screenshots/LLMLingua-overview.png)

![Microsoft LLMLingua-2 prompt compression Screenshots](/screenshots/LLMLingua-2.png)

### Multi-model Features

Compare responses from different models using `--multi` and comma separated list of OpenRouter AI API supported LLM models using `--model` flag:

Ask Meta Llama 3.3 70b Instruct model and Google Gemini 2.0 Flash Lite Preview model to both response to the same prompt.

```bash
python or-cli.py -p "You are a helpful assistant." -m "What is the capital of France?" --model google/gemini-2.0-flash-lite-preview-02-05,meta-llama/llama-3.3-70b-instruct:free --multi -t
```
```bash
python or-cli.py -p "You are a helpful assistant." -m "What is the capital of France?" --model google/gemini-2.0-flash-lite-preview-02-05,meta-llama/llama-3.3-70b-instruct:free --multi -t

----- Response from model google/gemini-2.0-flash-lite-preview-02-05 -----
The capital of France is Paris.

----- Response from model meta-llama/llama-3.3-70b-instruct:free -----
The capital of France is Paris.

----- Generation Stats for model google/gemini-2.0-flash-lite-preview-02-05 -----
ID: gen-1740152320-HbyNBokmpnJQNSt1GtZU
Created At: 2025-02-21T15:38:42.158665+00:00
Streamed: True
Finish Reason: stop
Native Finish Reason: STOP
Model Used: google/gemini-2.0-flash-lite-preview-02-05
Provider Name: Google AI Studio
Generation Time: 157 ms
Prompt Tokens: 24
Completion Tokens: 7
Total Tokens: 31
Total Cost: $0
Usage: 0
Latency: 644 ms
Native Tokens Prompt: 13
Native Tokens Completion: 8
Native Tokens Reasoning: 0
Native Tokens Total: 21
Cache Discount: None

----- Generation Stats for model meta-llama/llama-3.3-70b-instruct:free -----
ID: gen-1740152322-qRiPy0Sa4DhueEe3ELus
Created At: 2025-02-21T15:38:51.872362+00:00
Streamed: True
Finish Reason: stop
Native Finish Reason: stop
Model Used: meta-llama/llama-3.3-70b-instruct:free
Provider Name: Crusoe
Generation Time: 1182 ms
Prompt Tokens: 74
Completion Tokens: 7
Total Tokens: 81
Total Cost: $0
Usage: 0
Latency: 7512 ms
Native Tokens Prompt: 29
Native Tokens Completion: 8
Native Tokens Reasoning: 0
Native Tokens Total: 37
Cache Discount: None
```

Have models evaluate each other using `--eval` and comma separated list of OpenRouter AI API supported LLM models using `--model` flag:

Ask Meta Llama 3.3 70b Instruct model to evaluate the response from Google Gemini 2.0 Flash Lite Preview model.

```bash
python or-cli.py -p "You are a helpful assistant." -m "What is the capital of France?" --model google/gemini-2.0-flash-lite-preview-02-05,meta-llama/llama-3.3-70b-instruct:free --eval -t
```
```bash
python or-cli.py -p "You are a helpful assistant." -m "What is the capital of France?" --model google/gemini-2.0-flash-lite-preview-02-05,meta-llama/llama-3.3-70b-instruct:free --eval -t

----- First Model Response -----
The capital of France is Paris.

----- Evaluation Response (Second Model) -----
The response is accurate. It correctly identifies the capital of France as Paris.

There are no suggestions for improvement needed as the statement is straightforward and factual.

Improved response: The capital of France is Paris.

----- First Model Generation Stats -----
ID: gen-1740152320-HbyNBokmpnJQNSt1GtZU
Created At: 2025-02-21T15:38:42.158665+00:00
Streamed: True
Finish Reason: stop
Native Finish Reason: STOP
Model Used: google/gemini-2.0-flash-lite-preview-02-05
Provider Name: Google AI Studio
Generation Time: 157 ms
Prompt Tokens: 24
Completion Tokens: 7
Total Tokens: 31
Total Cost: $0
Usage: 0
Latency: 644 ms
Native Tokens Prompt: 13
Native Tokens Completion: 8
Native Tokens Reasoning: 0
Native Tokens Total: 21
Cache Discount: None
Temperature: 0.3
Top P: 1.0
Seed: None
Max Tokens: None
Compress: False
Compress Rate (Setting): 0.4
Original Tokens (LLMLingua-2): N/A
Compressed Tokens (LLMLingua-2): N/A
Compression Rate (LLMLingua-2): N/A
LLMLingua-2 max_batch_size: N/A
LLMLingua-2 max_force_token: N/A

----- Second Model Generation Stats -----
ID: gen-1740152387-GiQSMRE3GxO1J5A5jz7Z
Created At: 2025-02-21T15:39:49.71953+00:00
Streamed: True
Finish Reason: stop
Native Finish Reason: stop
Model Used: meta-llama/llama-3.3-70b-instruct:free
Provider Name: Chutes
Generation Time: 1221 ms
Prompt Tokens: 104
Completion Tokens: 43
Total Tokens: 147
Total Cost: $0
Usage: 0
Latency: 237 ms
Native Tokens Prompt: 59
Native Tokens Completion: 43
Native Tokens Reasoning: 0
Native Tokens Total: 102
Cache Discount: None
Temperature: 0.3
Top P: 1.0
Seed: None
Max Tokens: None
Compress: False
Compress Rate (Setting): 0.4
Original Tokens (LLMLingua-2): N/A
Compressed Tokens (LLMLingua-2): N/A
Compression Rate (LLMLingua-2): N/A
LLMLingua-2 max_batch_size: N/A
LLMLingua-2 max_force_token: N/A
```

### Web Page Processing

Analyze a web page - leveraging [Microsoft LLMLingua](https://llmlingua.com/) prompt token compression techniques to reduce input prompt token size by up to 60% (default Microsoft LLMLingua compression rate is set to 0.4 (40% of original prompt token size):

Using default OpenRouter AI API endpoint and default Google Gemini 2.0 Flash Lite Preview LLM model:

```bash
python or-cli.py --webpage https://awscli-get.centminmod.com/ | python or-cli.py --compress --compress-long --compress-save --compress-batch-size 500 --compress-save-path ./compress-awscli-get-long.txt

cat ./compress-awscli-get-long.txt | python or-cli.py -p "Act like expert summarizer. Summarize this web page." -t --temperature 0.7
```
```bash
cat ./compress-awscli-get-long.txt | python or-cli.py -p "Act like expert summarizer. Summarize this web page." -t --temperature 0.7

----- Assistant Response -----
This webpage provides instructions and examples for using the AWS CLI (Command Line Interface) and the s5cmd tool, primarily within the context of a Centmin Mod LEMP stack. It covers:

* **Installation and Basic Usage:** How to install the AWS CLI and s5cmd, including setting up environment variables for AWS access keys, secret keys, and default regions.
* **Configuration:** Explains how to configure the AWS CLI for different profiles, including setting up credentials and regions, and how to handle configuration files.
* **S3 Region Codes:** Lists S3 region codes for various AWS regions and also for Wasabi S3.
* **Custom Profiles:** Demonstrates how to create and use custom profiles for different services like Cloudflare R2, Linode Object Storage, DigitalOcean Spaces, and Backblaze B2. Each section provides specific configuration adjustments and example commands.
* **Cloudflare R2 Integration:** Provides specific configuration steps and commands for using the AWS CLI with Cloudflare R2 storage, including setting multipart upload thresholds and addressing styles.
* **s5cmd Usage:** Introduces s5cmd as a faster alternative to the AWS CLI, highlighting its advantages and limitations. It covers basic s5cmd commands and provides examples of how to use it with various S3-compatible services, including backing up newer files. It also includes performance comparisons with the AWS CLI.

----- Generation Stats -----
Model Used: google/gemini-2.0-flash-lite-preview-02-05:free
Provider Name: Google AI Studio
Generation Time: 1282 ms
Prompt Tokens: 2080
Completion Tokens: 288
Total Tokens: 2368
Total Cost: $0
Usage: 0
Latency: 864 ms
Native Tokens Prompt: 2447
Native Tokens Completion: 296
Native Tokens Reasoning: 0
Native Tokens Total: 2743
Cache Discount: None
Temperature: 0.7
Top P: 1.0
Seed: None
Max Tokens: None
Compress: False
Compress Rate (Setting): 0.4
Original Tokens (LLMLingua-2): N/A
Compressed Tokens (LLMLingua-2): N/A
Compression Rate (LLMLingua-2): N/A
LLMLingua-2 max_batch_size: N/A
LLMLingua-2 max_force_token: N/A
```

You can also pipe in the web page content - using default OpenRouter AI API endpoint and default Google Gemini 2.0 Flash Lite Preview LLM model:

```bash
python or-cli.py --webpage https://awscli-get.centminmod.com/ | python or-cli.py -p "Act like expert summarizer. Summarize this web page." --compress --compress-long --compress-batch-size 500 -t --temperature 0.7
```

Microsoft LLMLingua-2 and LongLLMLingua 2-stage prompt token compression reduced prompt token size to 47.9% of the original size or 52.1% reduction in prompt token size!. LLMLinua reported original token size of web page at 4,267 tokens and compressed it down to 2,045 tokens. OpenRouter AI API calculated it as processing 2,443 native prompt tokens after prompt token compression was applied.

```bash
python or-cli.py --webpage https://awscli-get.centminmod.com/ | python or-cli.py -p "Act like expert summarizer. Summarize this web page." --compress --compress-long --compress-batch-size 500 -t --temperature 0.7

----- Assistant Response -----
This webpage provides instructions and usage examples for the AWS CLI (Command Line Interface) and the s5cmd tool, particularly within a Centmin Mod LEMP stack environment. It covers:

* **Installation and Configuration:** Guides for installing and configuring the AWS CLI, including setting up profiles, environment variables for AWS credentials (access key, secret key, and default region), and output formats.
* **Region Codes:** Lists region codes for various AWS regions and also for Wasabi S3.
* **Custom Profiles:** Demonstrates how to create and use custom profiles for different cloud storage providers, such as Cloudflare R2, Linode Object Storage, DigitalOcean Spaces, and Backblaze B2. Includes specific configuration adjustments for each provider.
* **s5cmd Usage:** Introduces s5cmd as a faster alternative to the AWS CLI. It provides s5cmd commands for listing buckets and objects, and for backup operations, along with performance comparisons.

----- Generation Stats -----
Model Used: google/gemini-2.0-flash-lite-preview-02-05:free
Provider Name: Google
Generation Time: 849 ms
Prompt Tokens: 2078
Completion Tokens: 197
Total Tokens: 2275
Total Cost: $0
Usage: 0
Latency: 533 ms
Native Tokens Prompt: 2443
Native Tokens Completion: 201
Native Tokens Reasoning: 0
Native Tokens Total: 2644
Cache Discount: None
Temperature: 0.7
Top P: 1.0
Seed: None
Max Tokens: None
Compress: True
Compress Rate (Setting): 0.4
Original Tokens (LLMLingua-2): 4267
Compressed Tokens (LLMLingua-2): 2045
Compression Rate (LLMLingua-2): 47.9%
LLMLingua-2 max_batch_size: 500
LLMLingua-2 max_force_token: 10000
```

You can use `--compress-rate` to change default Microsoft LLMLingua-2 compression rate from `0.3` default i.e. use 15% `0.15` to reduce original prompt token size. The lower the compression rate = higher compression which impacts quality of key prompt text information.

```bash
python or-cli.py --webpage https://awscli-get.centminmod.com/ | python or-cli.py -p "Act like expert summarizer. Summarize this web page." --compress --compress-long --compress-batch-size 500 -t --temperature 0.7 --compress-rate 0.15
```

Microsoft LLMLingua-2 and LongLLMLingua 2-stage prompt token compression reduced prompt token size to 15.5% of the original size or 84.5% reduction in prompt token size!. LLMLinua reported original token size of web page at 4,267 tokens and compressed it down to 662 tokens. OpenRouter AI API calculated it as processing 909 native prompt tokens after prompt token compression was applied.

```bash
python or-cli.py --webpage https://awscli-get.centminmod.com/ | python or-cli.py -p "Act like expert summarizer. Summarize this web page." --compress --compress-long --compress-batch-size 500 -t --temperature 0.7 --compress-rate 0.15

----- Assistant Response -----
This webpage appears to be a technical guide or documentation related to configuring and using various cloud storage services like AWS S3, Cloudflare R2, DigitalOcean Spaces, and Linode Object Storage. It provides information on setting up credentials, specifying regions, and handling multipart uploads, along with potential troubleshooting steps and error messages. The content seems to be structured with code snippets and configuration examples, covering topics such as access keys, secret keys, default regions, and output formats.

----- Generation Stats -----
Model Used: google/gemini-2.0-flash-lite-preview-02-05:free
Provider Name: Google
Generation Time: 196 ms
Prompt Tokens: 686
Completion Tokens: 94
Total Tokens: 780
Total Cost: $0
Usage: 0
Latency: 635 ms
Native Tokens Prompt: 909
Native Tokens Completion: 95
Native Tokens Reasoning: 0
Native Tokens Total: 1004
Cache Discount: None
Temperature: 0.7
Top P: 1.0
Seed: None
Max Tokens: None
Compress: True
Compress Rate (Setting): 0.15
Original Tokens (LLMLingua-2): 4267
Compressed Tokens (LLMLingua-2): 662
Compression Rate (LLMLingua-2): 15.5%
LLMLingua-2 max_batch_size: 500
LLMLingua-2 max_force_token: 10000
```

Using locally self-hosted Ollama via `--ollama` flag and default local, llama3.2 LLM model:

```bash
python or-cli.py --webpage https://awscli-get.centminmod.com/ | python or-cli.py --compress --compress-long --compress-save --compress-batch-size 500 --compress-save-path ./compress-awscli-get-long.txt

cat ./compress-awscli-get-long.txt | python or-cli.py -p "Act like expert summarizer. Summarize this web page." -t --ollama --temperature 0.7 --model llama3.2-custom
```
```bash
cat ./compress-awscli-get-long.txt | python or-cli.py -p "Act like expert summarizer. Summarize this web page." -t --ollama --temperature 0.7 --model llama3.2-custom

----- Assistant Response -----
Here is a summary of the web page:

**Introduction to AWS CLI Installer**

The web page provides instructions on how to install and configure the AWS Command Line Interface (CLI) on various Linux distributions using Centmin Mod LEMP Stack.

**Installation Steps**

1. Download the AWS CLI installer script.
2. Run the script with `chmod` privileges.
3. Follow the prompts to set up your AWS credentials, region, and output format.

**Supported Regions and Profiles**

The web page lists supported regions for AWS S3 buckets:

* Wasabi: US East 1 N. Virginia, US East 2 N. Virginia, US West 1 Oregon
* DigitalOcean Spaces: sfo2
* Backblaze: US West 001

It also provides instructions on how to create custom profiles with specific settings.

**Comparison between AWS CLI and S5Cmd**

The web page compares the performance of the AWS CLI with S5Cmd, a faster alternative. While S5Cmd supports more features, it has limitations in terms of file size transfer and multiple user profiles.

**Key Features and Benefits**

* Fast performance using S5Cmd
* Supports reading existing AWS CLI configured credentials
* Compatible with LEMP Stack

Overall, the web page provides detailed instructions on installing and configuring the AWS CLI on various Linux distributions, as well as comparing it with S5Cmd.

----- Usage Stats (Ollama) -----
Model Used: llama3.2-custom
Prompt Tokens: 2085
Completion Tokens: 278
Total Tokens: 2363
```

You can also pipe in the web page content - using locally self-hosted Ollama via `--ollama` flag and default local, llama3.2 LLM model:

```bash
python or-cli.py --webpage https://awscli-get.centminmod.com/ | python or-cli.py -p "Act like expert summarizer. Summarize this web page." --compress --compress-long --compress-batch-size 500 -t --ollama --temperature 0.7 --model llama3.2-custom
```

Process a multi-page Xenforo forum thread efficiently using `--condense` level `1` to `3` flag - using default OpenRouter AI API endpoint and default Google Gemini 2.0 Flash Lite Preview LLM model:

```bash
python or-cli.py --condense 1 --webpage https://xenforo.com/community/threads/uk-online-safety-regulations-and-impact-on-forums.227661/ | python or-cli.py -p "Act like expert summarizer. Summarize this web page." --compress --compress-long --compress-batch-size 500 -t --temperature 0.7
```

Process a multi-page forum thread efficiently using locally self-hosted Ollama via `--ollama` flag and default local, llama3.2 LLM model:

```bash
python or-cli.py --condense 1 --webpage https://xenforo.com/community/threads/uk-online-safety-regulations-and-impact-on-forums.227661/ | python or-cli.py -p "Act like expert summarizer. Summarize this web page." --compress --compress-long --compress-batch-size 500 -t --temperature 0.7 --ollama
```

The `--condense` flag was written specifically for Xenforo thread analysis and controls how much of a multi‑page Xenforo thread is processed by discarding a portion of the middle pages. If the thread contains more than 10 pages, the tool applies dynamic breakpoints based on the total page count:

- **For threads with >10 pages but ≤20 pages:**
- **Level 1 (default):** Fetches approximately the first and last third of the pages.
- **Level 2:** Fetches roughly the first 2/5 and last 1/5 of pages.
- **Level 3:** Fetches roughly the first 1/5 and last 1/5 of pages.

- **For threads with >20, >30, >40, or >50 pages:**
The tool adjusts the fractions to maintain a similar token-to-page ratio. For example, for threads with more than 50 pages:
- **Level 1:** Fetches about 1/7 of the pages from the beginning and 1/7 from the end.
- **Level 2:** Fetches approximately 1/6 of the pages from the beginning and 1/12 from the end.
- **Level 3:** Fetches roughly 1/12 of the pages from both the beginning and the end.

If no level is specified (i.e. simply using `--condense`), it defaults to Level 1. You can adjust these fractions as needed to fine-tune the balance between content coverage and token count.

#### Xenforo Thread Summary

Or skip using `--condense` flag and leverage Micosoft LLMLingua prompt token compression via LLMLingua-2 + contexual optimization via LongLLMLingua to reduce Xenforo thread pages down to a manageable prompt token size. Default uses `--compress-rate 0.3` so reduces original prompt token size to 30% of original size = ~70% savings. Though in practise it ended up with ~48% savings.

```bash
time python or-cli.py --webpage https://xenforo.com/community/threads/uk-online-safety-regulations-and-impact-on-forums.227661/ | python or-cli.py -p "Act like expert summarizer. Summarize this Xenforo forum thread and all it's posts." --compress --compress-long --compress-batch-size 500 -t --temperature 0.7
```

With LLMLingua prompt token compression shows OpenRouter API reported native prompt tokens = 124,090

```bash
time python or-cli.py --webpage https://xenforo.com/community/threads/uk-online-safety-regulations-and-impact-on-forums.227661/ | python or-cli.py -p "Act like expert summarizer. Summarize this Xenforo forum thread and all it's posts." --compress --compress-long --compress-batch-size 500 -t --temperature 0.7

----- Assistant Response -----
Here's a summary of the Xenforo forum thread and its posts, acting as an expert summarizer:

**Overall Thread Summary:**

The thread discusses the impact of the UK's Online Safety Act (OSA) on online forums, particularly those using the Xenforo platform. The primary concerns revolve around the new regulations, which aim to protect users from illegal and harmful content, especially children. The discussions cover the implications for forum owners, including the need for age verification, content moderation, risk assessments, and data privacy. Many participants express significant anxiety and concern about the potential costs, complexities, and potential for censorship that the OSA introduces. Some forum owners are considering drastic measures, such as banning UK users, disabling features like DMs, or even shutting down their forums entirely. There's also a lot of discussion about the practical challenges of implementing the regulations, the lack of clear guidance, and the potential for unintended consequences.

**Key Themes and Issues:**

* **Age Verification:** The need for robust age verification methods is a central theme. The thread discusses the limitations of self-declaration, the costs and complexities of various verification services (open banking, ID checks, etc.), and the potential for circumventing age restrictions. A recurring concern is the cost and practicality of implementation, especially for smaller forums.
* **Content Moderation and Risk Assessment:** Forum owners are grappling with the requirements for content moderation and risk assessment. The thread explores the challenges of identifying and removing illegal/harmful content, especially in the context of DMs. Discussions include the use of AI tools, keyword filtering, and the need for human moderators. The need for written records, the 17 categories of illegal content, and the need to document steps taken are also highlighted.
* **Data Privacy and User Rights:** The intersection of the OSA with data privacy regulations (like GDPR) is a concern. The thread discusses the implications of collecting and storing user data for age verification and content moderation. The impact on users' rights to free speech and privacy is also a topic of debate.
* **Cost and Burden on Forum Owners:** A significant concern is the financial and administrative burden the OSA places on forum owners. The thread explores the potential costs of age verification, content moderation, legal compliance, and the impact on smaller, volunteer-run forums. Many participants express the view that the regulations are disproportionately burdensome.
* **Impact on Free Speech and Censorship:** The OSA's potential impact on free speech and the risk of censorship are debated. Participants express concerns about the definition of "harmful" content, the potential for overreach, and the chilling effect on open discussion. The discussions include comparisons with free speech laws in other countries (e.g., the US).
* **Technical Challenges and Solutions:** The thread discusses the technical challenges of implementing the OSA and explores potential solutions. These include the use of AI tools, third-party services (e.g., for age verification), and the need for Xenforo to provide features to aid compliance. Code examples and links to helpful resources are also shared.
* **Geoblocking and User Restrictions:** Some forum owners are considering or implementing geoblocking to restrict access from the UK and EU to avoid the regulations. The thread discusses the implications of this approach and the potential impact on user communities.
* **Uncertainty and Lack of Clarity:** A pervasive theme is the uncertainty surrounding the OSA. Participants express frustration with the lack of clear guidance from Ofcom and the difficulty of interpreting the regulations. The thread reflects a sense that the regulations are a "moving target" and that compliance is a constant challenge.

**Specific Points from the Posts (Illustrative Examples):**

* **Fear and Uncertainty:** "I'm terrified of the regulations" and "The uncertainty is killing me."
* **Financial Concerns:** "Age verification would bankrupt me" and "The costs are prohibitive."
* **Impact on User Experience:** "I'm afraid of the impact on our community" and "We might have to disable DMs."
* **AI and Moderation:** "AI is not a perfect solution, but it's a start" and "We need custom development for AI tools."
* **Legal Concerns:** "We need to consult with lawyers" and "The legal liability is a huge concern."
* **Comparison to GDPR:** "The GDPR overreaction is happening all over again."
* **Call for Action:** "Xenforo needs to provide tools" and "We need a clear, concise guide."
* **Specific Solutions:** Some users share links to helpful resources, templates, and add-ons.
* Users posted links to the Ofcom website and guidance documents.
* Users shared links to custom Xenforo templates to help set up risk assessments.
* Users discussed specific Xenforo addons to automatically block and report content.
* Users were working on a new Xenforo plugin to assist with the OSA.
* Users are exploring the potential for AI and machine learning to help with content moderation.
* **Example of Action:** Some forum owners are taking steps to comply, such as creating risk assessments, disabling certain features, and exploring age verification options.
* Users are discussing the pros and cons of AI-based content moderation.
* Several users are considering blocking IPs from the UK or EU.
* Users are concerned that the new rules are a targeted attack on smaller communities.
* Users are considering a combination of age verification tools and manual moderation.
* Users are discussing the costs of age verification services.
* Users are discussing the need for clear communication with users about any changes.
* Users are discussing the need for a risk assessment.

**Overall, the thread serves as a valuable resource for Xenforo forum owners navigating the complex landscape of the UK's Online Safety Act. It highlights the challenges, concerns, and potential solutions being discussed within the community.**

----- Generation Stats -----
Model Used: google/gemini-2.0-flash-lite-preview-02-05:free
Provider Name: Google AI Studio
Generation Time: 7448 ms
Prompt Tokens: 92713
Completion Tokens: 1223
Total Tokens: 93936
Total Cost: $0
Usage: 0
Latency: 977 ms
Native Tokens Prompt: 124090
Native Tokens Completion: 1251
Native Tokens Reasoning: 0
Native Tokens Total: 125341
Cache Discount: None
Temperature: 0.7
Top P: 1.0
Seed: None
Max Tokens: None
Compress: True
Compress Rate (Setting): 0.4
Original Tokens (LLMLingua-2): 192122
Compressed Tokens (LLMLingua-2): 93048
Compression Rate (LLMLingua-2): 48.4%
LLMLingua-2 max_batch_size: 500
LLMLingua-2 max_force_token: 10000

real 2m36.223s
user 11m7.163s
sys 1m16.138s
```

Re-run without LLMLingua prompt token compression shows OpenRouter API reported native prompt tokens = 237,950 compared to prompt token compression above reporting native prompt tokens = 124,090 - roughly 47.8% reduction in prompt token size.

```bash
time python or-cli.py --webpage https://xenforo.com/community/threads/uk-online-safety-regulations-and-impact-on-forums.227661/ | python or-cli.py -p "Act like expert summarizer. Summarize this Xenforo forum thread and all it's posts." -t --temperature 0.7

----- Assistant Response -----
Here's a summary of the Xenforo forum thread regarding the UK Online Safety Regulations and their impact on online forums:

**Overview:**

* The UK's Online Safety Act (OSA) came into effect in December 2024, with regulations starting March 17, 2025.
* The regulations aim to protect UK users from illegal and harmful online content.
* Forums with links to the UK, including those with UK users, are affected.
* Key requirements highlighted are strong age verification and content scanning.
* Some forums have announced closures due to the perceived burden of the regulations.
* A survey is available to check if a system is covered.

**Key Discussions and Concerns:**

* **Impact on Forum Owners:**
* The regulations are seen as a potential burden, with costs likely passed on to forum owners.
* Smaller forums (under 7 million UK users) face requirements for search and content moderation.
* There are concerns about the broad scope of the regulations and the resources required for compliance.
* Some forum owners considered blocking UK users to avoid compliance.
* **Age Verification:**
* Age verification is a key requirement, but the details on how to implement it are still being released.
* Concerns were raised about the effectiveness of self-declaration of age.
* There is a need for technically accurate, robust, reliable, and fair age assurance.
* **Content Moderation:**
* Content scanning, including AI-based detection, is discussed as a potential solution.
* Concerns were raised about the difficulty of moderating content, particularly in DMs.
* Some users have expressed concerns about the AI-based moderation.
* There is a desire for improved reporting tools and more granular moderation options.
* **Free Speech and Censorship:**
* The regulations are viewed by some as an infringement on free speech.
* There is a debate about what constitutes "hate speech" and "harmful content."
* The potential for overreach and censorship is a concern.
* **Potential Solutions and Strategies:**
* Some forum owners are considering banning UK users.
* AI-powered tools are suggested for content monitoring and moderation.
* Some users are creating risk assessment templates and guidance.
* There is a discussion of using third-party services for age verification and content scanning.
* Structuring a company to limit liability is proposed.
* **Specific Examples and Actions:**
* The closure of the LFGSS and Microcosm forums was cited as an example of the impact of the regulations.
* Some users shared their approach to content moderation and rule enforcement.
* Some users were blocking EU and UK users.
* One user shared a template for creating a risk assessment.

**Key Takeaways:**

* The UK Online Safety Act is causing uncertainty and concern among forum owners.
* Age verification and content moderation are the most significant challenges.
* The regulations could lead to increased costs, reduced functionality, and potential forum closures.
* The overall impact of the regulations is still unclear, as some guidance is still missing.
* There is a need for clear guidance, effective tools, and a balanced approach that protects users while respecting freedom of expression.

----- Generation Stats -----
Model Used: google/gemini-2.0-flash-lite-preview-02-05:free
Provider Name: Google AI Studio
Generation Time: 4747 ms
Prompt Tokens: 190085
Completion Tokens: 688
Total Tokens: 190773
Total Cost: $0
Usage: 0
Latency: 2373 ms
Native Tokens Prompt: 237950
Native Tokens Completion: 728
Native Tokens Reasoning: 0
Native Tokens Total: 238678
Cache Discount: None
Temperature: 0.7
Top P: 1.0
Seed: None
Max Tokens: None
Compress: False
Compress Rate (Setting): 0.4
Original Tokens (LLMLingua-2): N/A
Compressed Tokens (LLMLingua-2): N/A
Compression Rate (LLMLingua-2): N/A
LLMLingua-2 max_batch_size: N/A
LLMLingua-2 max_force_token: N/A

real 0m15.419s
user 0m4.573s
sys 0m0.166s
```

Instead of processing Xenforo thread repeatedly, we can cache and save a locally LLMLingua compressed copy `compress-xf-thread.txt` to be fed into `or-cli.py`:

```bash
python or-cli.py --webpage https://xenforo.com/community/threads/uk-online-safety-regulations-and-impact-on-forums.227661/ | python or-cli.py --compress --compress-long --compress-save --compress-batch-size 500 --compress-save-path ./compress-xf-thread.txt

cat ./compress-xf-thread.txt | python or-cli.py -p "Act like expert summarizer. Summarize this Xenforo forum thread and all it's posts." -t --temperature 0.7
```
```bash
cat ./compress-xf-thread.txt | python or-cli.py -p "Act like expert summarizer. Summarize this Xenforo forum thread and all it's posts." -t --temperature 0.7

----- Assistant Response -----
This XenForo forum thread discusses the implications of the UK's Online Safety Act (OSA) for online forums. Here's a summarized breakdown:

**Key Concerns & Discussions:**

* **Age Verification:** A central theme is the need for robust age verification methods to comply with OSA, with strong criticism of self-declaration and online payments as insufficient. The thread explores various methods, including third-party services, and the potential costs and privacy implications.
* **Content Moderation:** The burden of content moderation, especially for detecting and removing illegal content (CSAM, hate speech, etc.), is a major concern. Discussions revolve around AI tools, keyword filtering, and the challenges of accurately identifying and addressing harmful content. The role of moderators and the need for clear guidelines are also highlighted.
* **Impact on Small Forums:** A significant portion of the discussion focuses on the disproportionate burden the OSA places on smaller forums, particularly those run by volunteers or with limited resources. Concerns include the financial costs of compliance, the complexities of the regulations, and the potential for these forums to close down or restrict access.
* **Privacy and Free Speech:** The thread touches on the tension between the OSA's aims and the protection of user privacy and free speech. The potential for censorship and the impact of government overreach are discussed, along with concerns about data breaches and the chilling effect on online expression.
* **Geographic Issues:** The impact of the OSA on forums based outside the UK or with users from other countries is considered, including the possibility of blocking UK users or facing legal challenges.
* **Practical Implementation:** There's a lot of discussion about the practical steps forums need to take to comply with the OSA, including risk assessments, record-keeping, and the implementation of various safety measures. Several users share resources like template documents.
* **Specific Act Provisions:** Discussions touch on specific aspects of the OSA, such as requirements for reporting illegal content, handling private messages, and the need to have a nominated individual responsible for compliance.

**Key Concerns (summarized):**

* **Cost and Complexity:** The financial and administrative burdens of compliance are a major worry, especially for smaller forums.
* **Privacy Risks:** Concerns exist about the collection and use of user data for age verification and content moderation.
* **Censorship and Free Speech:** The potential for overzealous content moderation and the chilling effect on free speech are significant concerns.
* **Enforcement Uncertainty:** There's uncertainty about how the OSA will be enforced and the potential penalties for non-compliance.
* **Technical Challenges:** Implementing effective age verification and content moderation tools is technically challenging.

**Overall:**

The forum thread reflects a general sense of anxiety and uncertainty about the impact of the UK's Online Safety Act. While many users support the goal of protecting children online, there are significant concerns about the practical implications of the law and its potential consequences for online forums. The thread highlights the challenges of balancing safety with free speech, privacy, and the viability of online communities.

----- Generation Stats -----
Model Used: google/gemini-2.0-flash-lite-preview-02-05:free
Provider Name: Google AI Studio
Generation Time: 3759 ms
Prompt Tokens: 92921
Completion Tokens: 634
Total Tokens: 93555
Total Cost: $0
Usage: 0
Latency: 1017 ms
Native Tokens Prompt: 124365
Native Tokens Completion: 640
Native Tokens Reasoning: 0
Native Tokens Total: 125005
Cache Discount: None
Temperature: 0.7
Top P: 1.0
Seed: None
Max Tokens: None
Compress: False
Compress Rate (Setting): 0.4
Original Tokens (LLMLingua-2): N/A
Compressed Tokens (LLMLingua-2): N/A
Compression Rate (LLMLingua-2): N/A
LLMLingua-2 max_batch_size: N/A
LLMLingua-2 max_force_token: N/A

real 0m9.045s
user 0m0.568s
sys 0m0.074s
```

### Local Ollama Integration

Use a locally-running model via `--ollama` flag:

```bash
python or-cli.py -p "You are a helpful assistant." -m "Explain why the sky is blue" --ollama -t
```
```bash
python or-cli.py -p "You are a helpful assistant." -m "Explain why the sky is blue" --ollama -t

----- Assistant Response -----
The sky appears blue because of a phenomenon called scattering, which occurs when sunlight interacts with the tiny molecules of gases in the Earth's atmosphere.

Here's a simplified explanation:

1. **Sunlight enters the atmosphere**: When the sun rises or sets, its light travels through the air and hits the tiny molecules of gases such as nitrogen (N2) and oxygen (O2).
2. **Scattering occurs**: These gas molecules scatter the sunlight in all directions, but they scatter shorter (blue) wavelengths more than longer (red) wavelengths. This is known as Rayleigh scattering, named after the British physicist Lord Rayleigh, who first described it.
3. **Blue light is scattered more**: The blue light is scattered in all directions by the gas molecules, while the red light is scattered less and continues to travel in a straight line.
4. **Our eyes see the scattered blue light**: When we look up at the sky, our eyes see the scattered blue light that has been dispersed in all directions. This is why the sky appears blue during the daytime.

**Why isn't the sky red?**

The reason the sky doesn't appear red is because the longer wavelengths of light (like red and orange) are scattered less than the shorter wavelengths (like blue and violet). If the atmosphere were to scatter these longer wavelengths more, the sky would indeed appear reddish. However, this effect is much weaker than the scattering of blue light.

**Other factors that influence the color of the sky**

While Rayleigh scattering is the primary reason for the blue color of the sky, other factors can affect its appearance:

* **Dust and water vapor**: Tiny particles in the atmosphere can scatter light, making the sky appear more hazy or gray.
* **Pollution**: Air pollution can also scatter light, altering the apparent color of the sky.
* **Atmospheric conditions**: The angle of the sun, atmospheric pressure, and humidity can all impact the way light is scattered and perceived.

Now, next time you gaze up at a blue sky, remember the fascinating science behind its beauty!

----- Usage Stats (Ollama) -----
Model Used: llama3.2-custom
Prompt Tokens: 38
Completion Tokens: 421
Total Tokens: 459
```

### Conversational Exchanges

Maintain context across multiple exchanges - using default OpenRouter AI API endpoint and default Google Gemini 2.0 Flash Lite Preview LLM model. Add `--ollama` flag to use locally self-hosted Ollama and default local, llama3.2 LLM model:

```bash
python or-cli.py -p "You are a math tutor." -m "Explain calculus" --follow-up "What are derivatives?" --follow-up "How do they relate to integrals?" -t
```

```bash
python or-cli.py -p "You are an assistant." -m "Tell me a joke." -t --ollama --follow-up "Tell me another joke." -t
```
```bash
python or-cli.py -p "You are an assistant." -m "Tell me a joke." -t --ollama --follow-up "Tell me another joke." -t

----- Assistant Response -----
Here's one:

What do you call a fake noodle?

An impasta!

----- Usage Stats (Ollama) -----
Model Used: llama3.2-custom
Prompt Tokens: 35
Completion Tokens: 18
Total Tokens: 53

----- Follow-up Assistant Response -----
Here's another one:

Why couldn't the bicycle stand up by itself?

Because it was two-tired!

----- Follow-up Usage Stats (Ollama) -----
Model Used: llama3.2-custom
Prompt Tokens: 67
Completion Tokens: 23
Total Tokens: 90
```

### Structured Output

Get responses in JSON format - using default OpenRouter AI API endpoint and default Google Gemini 2.0 Flash Lite Preview LLM model. Add `--ollama` flag to use locally self-hosted Ollama and default local, llama3.2 LLM model:

```bash
python or-cli.py -p "You are an assistant." -m "Tell me a joke." --response-format json -t
```
```bash
python or-cli.py -p "You are an assistant." -m "Tell me a joke." --response-format json -t

----- Assistant Response -----
{
"type": "joke",
"setup": "Why don't scientists trust atoms?",
"punchline": "Because they make up everything!"
}

----- Generation Stats -----
Model Used: google/gemini-2.0-flash-lite-preview-02-05:free
Provider Name: Google
Generation Time: 57 ms
Prompt Tokens: 21
Completion Tokens: 35
Total Tokens: 56
Total Cost: $0
Usage: 0
Latency: 600 ms
Native Tokens Prompt: 10
Native Tokens Completion: 38
Native Tokens Reasoning: 0
Native Tokens Total: 48
Cache Discount: None
Temperature: 0.3
Top P: 1.0
Seed: None
Max Tokens: None
Compress: False
Compress Rate (Setting): 0.4
Original Tokens (LLMLingua-2): N/A
Compressed Tokens (LLMLingua-2): N/A
Compression Rate (LLMLingua-2): N/A
LLMLingua-2 max_batch_size: N/A
LLMLingua-2 max_force_token: N/A
```

## Technical Details

### Functions Overview

The script is structured around several key components:

#### Core Text Processing Functions

- `minify_markdown(text: str) -> str`:
Optimizes markdown content by removing redundant whitespace and normalizing formatting.

- `get_compressed_prompt(prompt: str, instruction: str, rate: float, compressor, cache: Dict[str, Any], use_context: bool, extended: bool = False) -> Dict[str, Any]`:
Manages prompt compression with caching to avoid redundant processing of identical content.

- `compress_prompt_optional(prompt: str, instruction: str, rate: float, compressor_cache: Dict[str, Any], llmlingua_config: Dict[str, Any], debug: bool, compress_long_flag: bool, compress_long_question: str = "") -> Dict[str, Any]`:
Implements a sophisticated two-stage compression pipeline that first applies optional coarse compression with LongLLMLingua for very large documents, then fine-grained compression with LLMLingua-2.

#### Web Content Handling

- `fetch_page(url: str, session: aiohttp.ClientSession, cache: Dict[str, Any], ttl: int = CACHE_TTL) -> str`:
Asynchronously retrieves web content with TTL-based caching to improve performance.

- `process_webpage(args: argparse.Namespace) -> str`:
Orchestrates webpage processing by extracting content (with special handling for forums), analyzing pagination, and converting HTML to markdown with appropriate condensation.

#### OpenRouterClient Class

The central client class that manages all API interactions:

- `setup_logging(debug: bool) -> logging.Logger`:
Configures the logging system with appropriate verbosity.

- `print_full_stats(stats_data: Dict[str, Any], model_label: str, debug: bool = False, extra_params: Optional[Dict[str, Any]] = None) -> None`:
Formats and displays detailed generation statistics including token counts, costs, and latency.

- `model_supports_structured_outputs(model: str) -> bool`:
Determines if a specified model supports structured output formats.

- `encode_image(image_path: str) -> str`:
Converts image files to base64 encoding for API transmission.

- `create_message_content(text: str, image_path: Optional[str] = None, is_code: bool = False) -> Union[str, List[Dict[str, Any]]]`:
Formats text and image content for API submission, with special handling for code.

- `send_completion_request(...)`:
Sends a fully-configured chat completion request to the appropriate endpoint.

- `get_generation_stats(generation_id: str) -> Dict[str, Any]`:
Retrieves detailed generation statistics for completed requests.

- `get_api_key_limits() -> Dict[str, Any]`:
Fetches API key usage information, rate limits, and credit balance.

#### Conversational Support

- `async_handle_follow_ups(client: OpenRouterClient, prompt: str, initial_message: str, initial_response: str, follow_up_messages: List[str], args: argparse.Namespace, comp_result: Optional[Dict[str, Any]], llmlingua_config: Dict[str, Any]) -> None`:
Manages multi-turn conversations with context preservation and optional compression.

#### Main Function

- `main()`:
Entry point that processes command-line arguments, initializes components, orchestrates requests, and manages the execution flow.

### Yappi Profiling

Yappi Profiling Support with three optional formats via `--yappi-export-format` flag:

```bash
python or-cli.py --limit --yappi --yappi-export-format callgrind

--- API Key Limits and Usage ---
Label: sk-or-v1-f20...469
Usage: 0 credits used
Credit Limit: Unlimited
Free Tier: True
Rate Limit: 10 requests per 10s

Profiling results saved in callgrind format to yappi.callgrind.
You can view these with KCachegrind/QCachegrind.
```
```bash
python or-cli.py --limit --yappi --yappi-export-format snakeviz

--- API Key Limits and Usage ---
Label: sk-or-v1-f20...469
Usage: 0 credits used
Credit Limit: Unlimited
Free Tier: True
Rate Limit: 10 requests per 10s

Profiling results saved in pstats format to yappi.pstats.
Run 'snakeviz yappi.pstats' to explore the profile interactively.
```
```bash
python or-cli.py --limit --yappi --yappi-export-format gprof2dot

--- API Key Limits and Usage ---
Label: sk-or-v1-f20...469
Usage: 0 credits used
Credit Limit: Unlimited
Free Tier: True
Rate Limit: 10 requests per 10s

Profiling data saved in callgrind format to yappi.callgrind and converted to dot graph at yappi.dot, then to PNG at yappi.png.
```

## Advanced Features

### Prompt Compression

`or-cli.py` implements state-of-the-art prompt compression technology with [LLMLingua-2](https://llmlingua.com/), enabling you to process documents that would otherwise exceed token limits. The compression system:

- Preserves critical semantic content while reducing token count
- Maintains important structural elements like newlines, bullets, and paragraph boundaries
- Offers single-stage compression (`--compress`) or two-stage pipeline (`--compress-long`) for extremely large documents
- Provides adjustable compression rates from 0.1 (aggressive) to 0.9 (minimal)
- Can save compressed prompts for inspection or reuse

The two-stage pipeline first applies coarse compression with `LongLLMLingua` to get the document within manageable size, then applies fine-grained `Microsoft LLMLingua-2` compression to preserve maximum semantic meaning.

### Multi-model Evaluation

The script offers two powerful multi-model workflows:

1. **Evaluation Mode** (`--eval`):
Creates an AI feedback loop where a second model evaluates the first model's response, with an optional third model for further refinement. This enables automatic quality assessment and improvement without human intervention.

2. **Comparison Mode** (`--multi`):
Gets parallel responses from up to four different models for the same prompt, allowing side-by-side comparison of capabilities, styles, and approaches across model providers and architectures.

These features facilitate model benchmarking, response optimization, and selecting the best model for specific use cases.

### Web Page Processing

The `--webpage` feature transforms web content into AI-ready format by:

- Converting HTML to clean, well-formatted markdown
- Detecting content structures (especially Xenforo forums) for intelligent extraction
- Supporting multi-page content with page detection and navigation
- Applying condensation strategies for lengthy content (via `--condense` levels)
- Minifying the result to reduce token consumption

For forum threads, the condensation system intelligently samples from the beginning and end of discussions to capture both the initial topic and conclusions.

### Cloudflare AI Gateway Integration

`or-cli.py` integrates with Cloudflare AI Gateway for enhanced performance and reliability:

- Request caching with configurable TTL (default: 900 seconds)
- Automatic cache bypassing for rate limit handling
- Support for additional authorization headers via `CF_AIG_AUTH` variable
- Potential cost savings through cache hits

To enable, set the following environment variables:
- `USE_CLOUDFLARE_AI_GATEWAY=y`
- `CF_ACCOUNT_ID` (required)
- `CF_GATEWAY_ID` (optional, defaults to "openrouter")

![Cloudflare AI Gateway Screenshots](/screenshots/cloudflare-ai-gateway-openrouter-api-7.png)

![Cloudflare AI Gateway Screenshots](/screenshots/cloudflare-ai-gateway-openrouter-api-8.png)

![Cloudflare AI Gateway Screenshots](/screenshots/cloudflare-ai-gateway-openrouter-api-9.png)

### Local Ollama Integration

For privacy, lower cost, or offline use cases, `or-cli.py` supports local models via Ollama:

- Automatic endpoint reconfiguration with the `--ollama` flag
- Context window size adjustment with `--ollama-max-tokens`
- Compatibility with Ollama's model library
- Support for both SDK-based and curl-based API methods

The tool adapts its parameters and behavior based on the endpoint type, providing a consistent interface regardless of whether you're using OpenRouter, Cloudflare AI Gateway, or Ollama.