{"id":32460161,"url":"https://github.com/leamsi9/llm-behaviour-lab","last_synced_at":"2025-10-26T11:22:37.800Z","repository":{"id":319121466,"uuid":"1075724919","full_name":"Leamsi9/llm-behaviour-lab","owner":"Leamsi9","description":"Web application for comparing the behaviour of multiple language models under different inference time configs and post-training inputs using Ollama.","archived":false,"fork":false,"pushed_at":"2025-10-16T23:58:44.000Z","size":98,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-17T18:03:39.695Z","etag":null,"topics":["ai","alignment","deterministic-analysis","fastapi","inference","inference-rules","interpretability","llm","llm-aligment","llm-evaluation","llm-inference","llm-tools","llm-training","machine-learning","model-comparison","ollama","python","web-app"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Leamsi9.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-13T22:48:42.000Z","updated_at":"2025-10-16T23:58:47.000Z","dependencies_parsed_at":null,"dependency_job_id":"d00e9630-c777-4f18-99ed-a0709c381d17","html_url":"https://github.com/Leamsi9/llm-behaviour-lab","commit_stats":null,"previous_names":["leamsi9/llm-behaviour-lab"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/Leamsi9/llm-behaviour-lab","repository_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/Leamsi9%2Fllm-behaviour-lab","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Leamsi9%2Fllm-behaviour-lab/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Leamsi9%2Fllm-behaviour-lab/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Leamsi9%2Fllm-behaviour-lab/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Leamsi9","download_url":"https://codeload.github.com/Leamsi9/llm-behaviour-lab/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Leamsi9%2Fllm-behaviour-lab/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281094081,"owners_count":26442699,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-26T02:00:06.575Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","alignment","deterministic-analysis","fastapi","inference","inference-rules","interpretability","llm","llm-aligment","llm-evaluation","llm-inference","llm-tools","llm-training","machine-learning","model-comparison","ollama","python","web-app"],"created_at":"2025-10-26T11:22:32.340Z","updated_at":"2025-10-26T11:22:37.791Z","avatar_url":"https://github.com/Leamsi9.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LLM 
Behaviour Lab\n\nA FastAPI-based web application for comparing multiple language models side by side using Ollama. Features dynamic model selection, per-pane controls, and stability limits to prevent system freezes.\n\n## Table of Contents\n\n- [Basic Workflow](#basic-workflow)\n- [Quick Start](#quick-start)\n- [Features](#features)\n- [Usage](#usage)\n- [Stability Features](#stability-features)\n- [API Endpoints](#api-endpoints)\n- [Configuration](#configuration)\n- [Troubleshooting](#troubleshooting)\n- [Project Structure](#project-structure)\n- [Dependencies](#dependencies)\n- [Performance Tips](#performance-tips)\n- [Contributing](#contributing)\n- [License](#license)\n- [Acknowledgments](#acknowledgments)\n\n\nAt its core, the LLM Behaviour Lab enables systematic exploration of how **deterministic, interpretable and corrigible human-defined parameters extrinsic to the model** interact with the **intrinsic, probabilistic model outputs**. These deterministic parameters include both the direct inference time configuration and code scaffolds (e.g. system/user prompts, temperature, token limits), and the post training inputs (e.g. Q\u0026A, instructions, preferences, reinforcements).\n\n\n## Basic Workflow\n1. **Select models** from the multi-select dropdown (hold Ctrl/Cmd for multiple)\n2. **Click \"Add Selected\"** to create comparison panes\n3. **Craft prompts** in the system/user input fields - this is where you control the deterministic variables\n4. **Adjust parameters** like temperature (0.0-2.0) and max tokens to see their effects\n5. **Click \"Generate\"** on individual panes or **\"Generate All\"** for batch\n6. **Use \"Stop\"** buttons to cancel generation\n7. **Add aliases** to distinguish between similar models\n\n## Quick Start\n\n### 1. Install Ollama\n```bash\n# Linux\ncurl -fsSL https://ollama.ai/install.sh | sh\n\n# macOS\nbrew install ollama\n\n# Windows: Download from https://ollama.ai/download\n```\n\n### 2. 
Pull Models\n```bash\nollama serve  # Start Ollama in another terminal\n\n# Pull some models to compare\nollama pull qwen2.5:7b       # Instruct model\nollama pull llama3.2:3b      # Smaller model\nollama pull gemma2:9b        # Different architecture\n```\n\n**Browse all available models**: Visit [https://ollama.com/search](https://ollama.com/search) to explore the full catalog of models available for comparison.\n\n### 3. Setup Python Environment\n```bash\n# Clone and setup\ngit clone https://github.com/Leamsi9/llm-behaviour-lab.git\ncd llm-behaviour-lab\n\n# Use the setup script\n./setup.sh\n\n# Or manually:\npython -m venv venv\nsource venv/bin/activate\npip install -r requirements.txt\n```\n\n### 4. Configure Stability Limits (Optional)\nEdit `.env` file with your system specs:\n```bash\n# For 32GB RAM systems (default)\nMAX_INPUT_LENGTH=12000\nMAX_CONTEXT_TOKENS=8192\nMAX_OUTPUT_TOKENS=4096\n\n# For 16GB RAM systems\nMAX_INPUT_LENGTH=8000\nMAX_CONTEXT_TOKENS=4096\nMAX_OUTPUT_TOKENS=2048\n```\n\n### 5. Run the Application\n```bash\nsource venv/bin/activate\nuvicorn app_ollama:app --host 0.0.0.0 --port 8000 --reload\n```\n\n### 6. 
Open UI\nNavigate to: `http://localhost:8000`\n\n## Features\n\n- ✅ **Multi-model comparison**: Compare any number of Ollama models simultaneously\n- ✅ **Dynamic model loading**: Automatically detects and lists all pulled Ollama models\n- ✅ **Per-pane controls**: Individual Generate/Stop/Clear/Remove buttons for each model\n- ✅ **Global controls**: Generate All and Stop All buttons for batch operations\n- ✅ **Real-time streaming**: Token-by-token generation with visual indicators\n- ✅ **Stability limits**: Configurable limits to prevent system freezes (.env file)\n- ✅ **Cancellation support**: Properly interrupts generation without leaving orphaned processes\n- ✅ **Token counting**: Detailed metrics (prompt tokens, completion tokens, latency, TPS)\n- ✅ **Model aliases**: Tag each model pane with custom labels\n- ✅ **Responsive UI**: Works on desktop and mobile devices\n\n## Usage\n\n### Comparison Strategies\nEach model comparison reveals insights about:\n\n#### The Deterministic Elements (Human-Controlled)\n- **System Prompt**: Defines the AI's role, personality, and behavioral constraints. To compare behaviour under the system prompts of major LLMs, see https://github.com/elder-plinius/CL4R1T4S for a collection of system prompts for major LLMs and tools which you can use.\n- **User Prompt**: The specific task or question being asked\n- **Temperature**: Controls randomness (0.0 = deterministic, 1.0 = creative, 2.0 = chaotic)\n- **Token limits**: Limits output length and computational cost\n- **Post training inputs**: Fine-tuning ([instruction, preference and reinforcement](https://www.interconnects.ai/p/the-state-of-post-training-2025)) comparable in this app by selecting base and fine tuned models (e.g. `qwen2.5:7b` vs `qwen2.5:7b-instruct`) or different fine-tuning approaches to the same base model (e.g. tool-using vs instruction-tuned vs abliterated (uncensored) vs niche-tuned). 
The post-training becomes intrinsic to the model, but the process of post-training relies on deterministic human artefacts extrinsic to the base models, in the form of explicit instructions, preferences and reinforcements that are fully interpretable and corrigible.\n\n#### The Probabilistic Elements (Model-Dependent)\n- **Architecture differences**: Transformer variants, attention mechanisms, parameter counts\n- **Training data**: What knowledge and patterns each model has learned\n- **Fine-tuning approach**: Base models vs instruction-tuned vs tool-using variants\n- **Token generation**: How each model chooses the next word given identical inputs\n\n#### Temperature Testing\nCompare the same model at different temperatures:\n- `llama3.2:3b [Temp 0.1]` - Precise, factual responses\n- `llama3.2:3b [Temp 0.7]` - Balanced creativity and coherence  \n- `llama3.2:3b [Temp 1.5]` - Highly creative, more unpredictable\n\n#### Architecture Comparison\nCompare different model families:\n- `qwen2.5:7b` vs `llama3.2:3b` vs `gemma2:9b` - Same prompt, different architectures\n- Base vs instruction-tuned variants of the same model\n- Small vs large parameter counts within the same family\n\n#### Fine-tuning Analysis\nCompare different training approaches:\n- Base models (raw, pre-training only)\n- Instruction-tuned (RLHF, aligned for helpfulness)\n- Tool-using variants (function calling, API integration)\n- Domain-specific fine-tunes (coding, medical, legal)\n\n### Advanced Features\n\n#### Model Aliases\nEach pane can have a custom alias (displayed in brackets):\n- `qwen2.5:7b [Base]` - for base model comparisons\n- `qwen2.5:7b [Creative]` - for creative writing tests\n- `llama3.2:3b [Fast]` - for quick iterations\n- `mistral:7b [Temp 0.1]` - for precise, factual responses\n\n#### Global Controls\n- **Generate All**: Start generation on all panes simultaneously\n- **Stop All**: Cancel all active generations\n- **Model Status**: Shows number of active WebSocket 
connections\n\n#### Per-Pane Controls\n- **Generate**: Start generation for this model\n- **Stop**: Cancel generation (with \"Stopping...\" feedback)\n- **Clear**: Reset output and metrics\n- **Remove**: Delete this pane and close its WebSocket\n\n## Stability Features\n\n### Input Validation\n- **Character limits**: `MAX_INPUT_LENGTH` prevents memory exhaustion\n- **Token capping**: `MAX_OUTPUT_TOKENS` limits generation length\n- **Context windows**: `MAX_CONTEXT_TOKENS` prevents overflow\n\n### System Protection\n- **Thread limiting**: Caps CPU usage to 4 threads\n- **Request timeouts**: `REQUEST_TIMEOUT` prevents infinite hangs\n- **HTTP cleanup**: Properly closes connections on cancellation\n\n### Emergency Recovery\nIf you experience freezes:\n```bash\n# Kill processes\npkill -9 ollama\npkill -9 python\n\n# Reduce limits in .env\nMAX_INPUT_LENGTH=4000\nMAX_CONTEXT_TOKENS=2048\n\n# Restart\nollama serve\nuvicorn app_ollama:app --reload\n```\n\n## API Endpoints\n\n### WebSocket `/ws`\nStreaming inference endpoint with cancellation support.\n\n**Request payload:**\n```json\n{\n  \"model_name\": \"qwen2.5:7b\",\n  \"system\": \"You are a helpful assistant.\",\n  \"user\": \"Explain quantum computing.\",\n  \"temp\": 0.7,\n  \"max_tokens\": 1024,\n  \"stop\": [\"USER:\", \"ASSISTANT:\", \"\u003c/s\u003e\"]\n}\n```\n\n**Response stream:**\n```json\n{\"token\": \"Quantum\"}\n{\"token\": \" computing\"}\n{\"token\": \" is\"}\n{\"token\": \"...\"}\n{\"token\": \"[DONE]\", \"done\": true, \"metrics\": {...}}\n```\n\n### GET `/api/models`\nReturns available Ollama models.\n\n**Response:**\n```json\n{\n  \"models\": [\"qwen2.5:7b\", \"llama3.2:3b\", \"gemma2:9b\"],\n  \"current\": {\n    \"base\": \"qwen2.5:7b-base\",\n    \"instruct\": \"qwen2.5:7b\"\n  }\n}\n```\n\n### GET `/api/health`\nHealth check endpoint.\n\n**Response:**\n```json\n{\n  \"status\": \"ok\",\n  \"ollama\": true,\n  \"websocket\": true,\n  \"models\": {\n    \"base\": \"qwen2.5:7b-base\",\n    
\"instruct\": \"qwen2.5:7b\"\n  }\n}\n```\n\n## Configuration\n\n### Environment Variables\nCreate a `.env` file in the project root from the `.env-example` file:\n\n```bash\n# Stability limits\nMAX_INPUT_LENGTH=8000          # Character limit for prompts\nMAX_CONTEXT_TOKENS=4096        # Ollama context window\nMAX_OUTPUT_TOKENS=2048         # Maximum generation length\nREQUEST_TIMEOUT=180.0          # Seconds before timeout\n```\n\n### System Recommendations\n\n| RAM | Input Length | Context Tokens | Output Tokens | Example Models |\n|-----|-------------|----------------|----------------|-------------------|\n| 8GB | 4,000 | 2,048 | 1,024 | `llama3.2:1b`, `phi3:mini` |\n| 16GB | 8,000 | 4,096 | 2,048 | `llama3.2:3b`, `mistral:7b` |\n| 32GB | 16,000 | 16,384 | 8,192 | `llama3:8b`, `mixtral:8x7b` |\n| 64GB | 32,000 | 32,768 | 16,384 | `llama3:70b`, `qwen2.5:72b` |\n\n## Troubleshooting\n\n### \"Cannot connect to Ollama\"\n```bash\n# Ensure Ollama is running\nollama serve\n\n# Check connection\ncurl http://localhost:11434/api/tags\n\n# Change port if needed\nexport OLLAMA_HOST=0.0.0.0:11435\n```\n\n### \"No models found\"\n```bash\n# Pull models\nollama pull qwen2.5:7b\nollama pull llama3.2:3b\n\n# List available\nollama list\n```\n\n### System Freezes\n1. **Reduce limits** in `.env`:\n   ```bash\n   MAX_INPUT_LENGTH=4000\n   MAX_CONTEXT_TOKENS=2048\n   ```\n\n2. **Use smaller models**:\n   ```bash\n   ollama pull llama3.2:1b\n   ```\n\n3. **Monitor resources**:\n   ```bash\n   htop                    # CPU/RAM\n   watch -n 1 nvidia-smi   # GPU (if available)\n   ```\n\n4. 
**Ollama Constraints**\n\nTo modify constraints directly in Ollama for better stability, set these environment variables before running `ollama serve`:\n\n```bash\nexport OLLAMA_NUM_THREADS=4       # Limit CPU threads to 4\nexport OLLAMA_GPU_LAYERS=35       # Limit GPU layers (0 disables GPU)\nexport OLLAMA_MAX_LOADED_MODELS=3 # Limit concurrent loaded models\n```\n\nThese environment variables allow fine-tuning Ollama's resource consumption to match your system's capabilities, preventing freezes and ensuring stable operation.\n\n### WebSocket Errors\n- Check browser console for connection issues\n- Ensure no firewall blocks WebSocket connections\n- Try different browser (Chrome recommended)\n\n## Project Structure\n\n```\nllm-behaviour-lab/\n├── app_ollama.py          # FastAPI application with Ollama integration\n├── static/\n│   └── ui_multi.html      # Multi-model comparison UI\n├── .env-example          # Environment configuration template\n├── .gitignore            # Git ignore rules\n├── requirements.txt      # Python dependencies\n├── setup.sh             # Automated setup script\n├── README.md            # This file\n└── Stability.md         # Detailed stability configuration\n```\n\n## Dependencies\n\n- **FastAPI**: Web framework\n- **Uvicorn**: ASGI server\n- **httpx**: HTTP client for Ollama API\n- **python-dotenv**: Environment configuration\n- **Ollama**: Local LLM inference server\n\n## Performance Tips\n\n1. **Model caching**: Pull frequently used models for faster startup\n2. **Concurrent limits**: Don't run too many large models simultaneously\n3. **GPU acceleration**: Ollama automatically uses GPU if available\n4. **Memory management**: Clear unused models with `ollama stop \u003cmodel\u003e`\n\n## Contributing\n\n1. Fork the repository\n2. Create a feature branch\n3. Make your changes\n4. Test with different model configurations\n5. 
Submit a pull request\n\n## License\n\nThis project is licensed under the **MIT License**.\n\n**This software is fully Free and Open Source.** You are free to:\n- ✅ Use it for any purpose (personal, commercial, educational)\n- ✅ Modify and distribute your changes\n- ✅ Include it in other projects\n- ✅ Use it in production environments\n\n### Author\n**Ismael Velasco** - Original developer and maintainer\n\n## Acknowledgments\n\n- [Ollama](https://ollama.ai/) for efficient local LLM inference\n- [FastAPI](https://fastapi.tiangolo.com/) for the web framework\n- [Meta Llama](https://ai.meta.com/llama/) and other model providers\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleamsi9%2Fllm-behaviour-lab","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fleamsi9%2Fllm-behaviour-lab","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleamsi9%2Fllm-behaviour-lab/lists"}