{"id":27414684,"url":"https://github.com/helpingai/inferno","last_synced_at":"2026-02-25T17:33:37.319Z","repository":{"id":286418020,"uuid":"961336674","full_name":"HelpingAI/inferno","owner":"HelpingAI","description":"A production-ready inference server supporting any AI model on all major hardware platforms (CPU, GPU, TPU, Apple Silicon). Inferno seamlessly deploys and serves language models from Hugging Face, local files, or GGUF format with automatic memory management and hardware optimization. Developed by HelpingAI.","archived":false,"fork":false,"pushed_at":"2025-04-06T11:08:03.000Z","size":1198,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-04-06T11:24:46.746Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HelpingAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-04-06T09:52:28.000Z","updated_at":"2025-04-06T11:08:06.000Z","dependencies_parsed_at":"2025-04-06T11:34:55.425Z","dependency_job_id":null,"html_url":"https://github.com/HelpingAI/inferno","commit_stats":null,"previous_names":["helpingai/inferno"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HelpingAI%2Finferno","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HelpingAI%2Finferno/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HelpingAI%2Finferno/relea
ses","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HelpingAI%2Finferno/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HelpingAI","download_url":"https://codeload.github.com/HelpingAI/inferno/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248845571,"owners_count":21170803,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-14T08:30:15.637Z","updated_at":"2026-02-25T17:33:37.312Z","avatar_url":"https://github.com/HelpingAI.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🔥 Inferno: Ignite Your Local AI Experience 🔥\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Inferno-Local%20LLM%20Server-orange?style=for-the-badge\u0026logo=python\u0026logoColor=white\" alt=\"Inferno Logo\"\u003e\n\n  \u003cp\u003e\u003cstrong\u003eUnleash the Blazing Power of Cutting-Edge LLMs on Your Own Hardware\u003c/strong\u003e\u003c/p\u003e\n\n  \u003cp\u003e\n    Run Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and other state-of-the-art language models locally with scorching-fast performance. 
Inferno provides an intuitive CLI and an OpenAI/Ollama-compatible API, putting the inferno of AI innovation directly in your hands.\n  \u003c/p\u003e\n\n  \u003c!-- Badges --\u003e\n  \u003cp\u003e\n    \u003ca href=\"LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-HelpingAI%20Open%20Source-blue?style=flat-square\" alt=\"License\"\u003e\u003c/a\u003e\n    \u003ca href=\"#requirements\"\u003e\u003cimg src=\"https://img.shields.io/badge/Python-3.9+-blue?style=flat-square\u0026logo=python\u0026logoColor=white\" alt=\"Python Version\"\u003e\u003c/a\u003e\n    \u003ca href=\"#installation\"\u003e\u003cimg src=\"https://img.shields.io/badge/Platform-Windows%20%7C%20macOS%20%7C%20Linux-lightgrey?style=flat-square\" alt=\"Platform\"\u003e\u003c/a\u003e\n  \u003c/p\u003e\n\n  \u003cdiv\u003e\n    \u003cimg src=\"https://img.shields.io/badge/GPU-Accelerated-76B900?style=for-the-badge\u0026logo=nvidia\u0026logoColor=white\" alt=\"GPU Accelerated\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/API-OpenAI%20Compatible-000000?style=for-the-badge\u0026logo=openai\u0026logoColor=white\" alt=\"OpenAI Compatible\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Models-Hugging%20Face-FFD21E?style=for-the-badge\u0026logo=huggingface\u0026logoColor=white\" alt=\"Hugging Face\"\u003e\n  \u003c/div\u003e\n\u003c/div\u003e\n\n---\n\n**Navigation**\n\n*   [✨ Overview](https://github.com/HelpingAI/inferno#-overview)\n*   [🚀 Key Features](https://github.com/HelpingAI/inferno#-key-features)\n*   [⚙️ Installation](#️-installation)\n    *   [Hardware Acceleration (Critical Prerequisite)](#hardware-acceleration-llama-cpp-python-critical-prerequisite)\n*   [🖥️ Command Line Interface (CLI)](#️-command-line-interface-cli)\n*   [🔥 Getting Started](https://github.com/HelpingAI/inferno#-getting-started)\n*   [📋 Usage Guide](https://github.com/HelpingAI/inferno#-usage-guide)\n    *   [Download Models](#download-a-model)\n    *   
[Quantization](#model-quantization)\n    *   [List Models](#list-downloaded-models)\n    *   [Start the Server](#start-the-server)\n    *   [Chat (CLI)](#chat-with-a-model)\n*   [🔌 API Usage](https://github.com/HelpingAI/inferno#-api-usage)\n    *   [OpenAI Compatible](#openai-api-endpoints)\n    *   [Ollama Compatible](#ollama-api-endpoints)\n    *   [Python Examples](#python-examples)\n*   [🐍 Native Python Client](https://github.com/HelpingAI/inferno#-native-python-client)\n*   [🧩 Integrations](https://github.com/HelpingAI/inferno#-integration-with-applications)\n*   [📦 Requirements](https://github.com/HelpingAI/inferno#-requirements)\n*   [🔧 Advanced Configuration](https://github.com/HelpingAI/inferno#-advanced-configuration)\n*   [🤝 Contributing](https://github.com/HelpingAI/inferno#-contributing)\n*   [📄 License](https://github.com/HelpingAI/inferno#-license)\n*   [📚 Full Documentation](https://github.com/HelpingAI/inferno#-full-documentation)\n\n---\n\n## ✨ Overview\n\nInferno is your personal gateway to the blazing frontier of Artificial Intelligence. Designed for both newcomers and seasoned developers, it provides a powerful yet user-friendly platform to run the latest Large Language Models (LLMs) directly on your local machine. Experience the raw power of models like Llama 3.3 and Phi-4 without relying on cloud services, ensuring full control over your data and costs.\n\nInferno offers an experience similar to Ollama but turbo-charged with enhanced features, including seamless Hugging Face integration, advanced quantization tools, and flexible model management. Its OpenAI \u0026 Ollama-compatible APIs ensure drop-in compatibility with your favorite AI frameworks and tools.\n\n\u003e [!TIP]\n\u003e New to local LLMs? Inferno makes it incredibly easy to get started. 
Pull a model and ignite your first conversation within minutes!\n\n## 🚀 Key Features\n\n- **Bleeding-Edge Model Support:** Run the latest models such as Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and more as soon as GGUF versions are available.\n\n- **Hugging Face Integration:** Download models with interactive file selection, repository browsing, and direct `repo_id:filename` targeting.\n\n- **Dual API Compatibility:** Serve models through both OpenAI and Ollama compatible API endpoints. Use Inferno with almost any AI client or framework.\n\n- **Native Python Client:** Includes a built-in, OpenAI-compatible Python client for seamless integration into your Python projects. Supports streaming, embeddings, multimodal inputs, and tool calling.\n\n- **Interactive CLI:** Command-line interface for downloading, managing, quantizing, and chatting with models.\n\n- **Blazing-Fast Inference:** GPU acceleration (CUDA, Metal, ROCm, Vulkan, SYCL) for faster response times. CPU acceleration via OpenBLAS is also supported.\n\n- **Real-time Streaming:** Get instant feedback with streaming support for both chat and completions APIs.\n\n- **Flexible Context Control:** Adjust the context window size (`n_ctx`) per model or session. Max context length is automatically detected from GGUF metadata.\n\n- **Smart Model Management:** List, show details, copy, remove, and see running models (`ps`). 
Includes RAM requirement estimates.\n\n- **Embeddings Generation:** Create embeddings using your local models via the API.\n\n- **Advanced Quantization:** Convert models between various GGUF quantization levels (including importance matrix methods like `iq4_nl`) with interactive comparison and RAM estimates.\n\n- **Keep-Alive Management:** Control how long models stay loaded in memory when idle.\n\n- **Fine-Grained Configuration:** Customize inference parameters such as GPU layers, threads, batch size, RoPE settings, and mlock.\n\n## ⚙️ Installation\n\n\u003e [!IMPORTANT]\n\u003e **Critical Prerequisite: Install `llama-cpp-python` First!**\n\u003e Inferno relies heavily on `llama-cpp-python`. For optimal performance, especially GPU acceleration, you **MUST** install `llama-cpp-python` with the correct hardware backend flags *before* installing Inferno. Failure to do this may result in suboptimal performance or CPU-only operation.\n\n### 1. Install `llama-cpp-python` with Hardware Acceleration\n\nChoose **one** of the following commands based on your hardware. 
See the detailed [Hardware Acceleration](#hardware-acceleration-llama-cpp-python-critical-prerequisite) section below for more options and explanations.\n\n*   **NVIDIA GPU (CUDA):**\n    ```bash\n    CMAKE_ARGS=\"-DGGML_CUDA=on\" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir\n    # Or use pre-built wheels if available (see details below)\n    ```\n*   **Apple Silicon GPU (Metal):**\n    ```bash\n    CMAKE_ARGS=\"-DGGML_METAL=on\" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir\n    # Or use pre-built wheels if available (see details below)\n    ```\n*   **AMD GPU (ROCm):**\n    ```bash\n    CMAKE_ARGS=\"-DGGML_HIPBLAS=on\" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir\n    ```\n*   **CPU Only (OpenBLAS):**\n    ```bash\n    CMAKE_ARGS=\"-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS\" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir\n    ```\n*   **Other Backends (Vulkan, SYCL, etc.):** See the detailed section below.\n\n\u003e [!TIP]\n\u003e Using a virtual environment (like `venv` or `conda`) is highly recommended. Ensure you have Python 3.9+ and the necessary build tools (CMake, C++ compiler) installed. Adding `--force-reinstall --upgrade --no-cache-dir` helps ensure a clean build against your system's libraries.\n\n### 2. 
Install Inferno\n\nOnce `llama-cpp-python` is installed with your desired backend, you can install Inferno directly from PyPI:\n\n```bash\n# Install the latest stable release from PyPI\npip install inferno-llm\n```\n\nOr, for development or the latest features, install from source:\n\n```bash\n# Clone the Inferno repository\ngit clone https://github.com/HelpingAI/inferno.git\ncd inferno\n\n# Install Inferno in editable mode (recommended for development)\npip install -e .\n\n# Or install with all optional dependencies (like quantization tools)\n# pip install -e \".[dev]\"\n```\n\n### Hardware Acceleration (`llama-cpp-python` Critical Prerequisite)\n\n`llama.cpp` (the engine behind `llama-cpp-python`) supports multiple hardware acceleration backends. You need to tell `pip` how to build `llama-cpp-python` using `CMAKE_ARGS`.\n\n\u003cdetails\u003e\n\u003csummary\u003eHow to Set Build Options (Environment Variables vs. CLI)\u003c/summary\u003e\n\nYou can set `CMAKE_ARGS` either as an environment variable before running `pip install` or directly via the `-C / --config-settings` flag.\n\n**Environment Variable Method (Linux/macOS):**\n```bash\nCMAKE_ARGS=\"-DOPTION=on\" pip install llama-cpp-python ...\n```\n\n**Environment Variable Method (Windows PowerShell):**\n```powershell\n$env:CMAKE_ARGS = \"-DOPTION=on\"\npip install llama-cpp-python ...\n```\n\n**CLI Method (Works Everywhere, Good for requirements.txt):**\n```bash\n# Use semicolons to separate multiple CMake args with -C\npip install llama-cpp-python -C cmake.args=\"-DOPTION1=on;-DOPTION2=off\" ...\n```\n\u003c/details\u003e\n\n\u003cdetails open\u003e\n\u003csummary\u003eSupported Backends (Install ONE)\u003c/summary\u003e\n\n*   **CUDA (NVIDIA):** Requires NVIDIA drivers \u0026 CUDA Toolkit.\n    ```bash\n    CMAKE_ARGS=\"-DGGML_CUDA=on\" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir\n    ```\n    *   **Pre-built Wheels (Alternative):** If you have CUDA 12.1-12.5 and Python 3.10-3.12, 
try:\n        ```bash\n        # Replace \u003ccuda-version\u003e with cu121, cu122, cu123, cu124, or cu125\n        pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir \\\n          --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/\u003ccuda-version\u003e\n        # Example: pip install ... --extra-index-url .../whl/cu121\n        ```\n\n*   **Metal (Apple Silicon):** Requires macOS 11.0+ \u0026 Xcode Command Line Tools.\n    ```bash\n    CMAKE_ARGS=\"-DGGML_METAL=on\" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir\n    ```\n    *   **Pre-built Wheels (Alternative):** If you have macOS 11.0+ and Python 3.10-3.12, try:\n        ```bash\n        pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir \\\n          --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal\n        ```\n\n*   **hipBLAS / ROCm (AMD):** Requires ROCm toolkit.\n    ```bash\n    CMAKE_ARGS=\"-DGGML_HIPBLAS=on\" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir\n    ```\n\n*   **OpenBLAS (CPU Acceleration):** Recommended for CPU-only systems. Requires OpenBLAS library installed.\n    ```bash\n    CMAKE_ARGS=\"-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS\" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir\n    ```\n\n*   **Vulkan:** Requires Vulkan SDK. 
May accelerate various GPUs (Intel, AMD).\n    ```bash\n    CMAKE_ARGS=\"-DGGML_VULKAN=on\" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir\n    ```\n\n*   **SYCL (Intel GPU):** Requires Intel oneAPI Base Toolkit.\n    ```bash\n    # Set up oneAPI environment first (adjust path as needed)\n    source /opt/intel/oneapi/setvars.sh\n    CMAKE_ARGS=\"-DGGML_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx\" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir\n    ```\n\n*   **RPC (Distributed):** For multi-machine inference setups.\n    ```bash\n    CMAKE_ARGS=\"-DGGML_RPC=on\" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir\n    ```\n\n\u003c/details\u003e\n\n## 🖥️ Command Line Interface (CLI)\n\nAccess Inferno's features through its intuitive CLI:\n\n```bash\n# Show available commands and options\ninferno --help\n\n# Alternatively, run as a Python module\npython -m inferno --help\n```\n\n**Core Commands:**\n\n| Command                    | Description                                          | Example                                                   |\n| :------------------------- | :--------------------------------------------------- | :-------------------------------------------------------- |\n| `pull \u003cmodel_id_or_path\u003e`  | Download models (GGUF) from Hugging Face             | `inferno pull meta-llama/Llama-3.3-8B-Instruct-GGUF`        |\n| `list` or `ls`             | List locally downloaded models \u0026 RAM estimates       | `inferno list`                                            |\n| `serve \u003cmodel_name_or_id\u003e` | Start API server (OpenAI \u0026 Ollama compatible)        | `inferno serve MyLlama3 --port 8080`                      |\n| `run \u003cmodel_name_or_id\u003e`   | Start interactive chat session in the terminal     | `inferno run MyLlama3`                                    |\n| `remove \u003cmodel_name\u003e`      | Delete a downloaded model               
             | `inferno remove MyLlama3`                                 |\n| `copy \u003csource\u003e \u003cdest\u003e`     | Duplicate a model locally                            | `inferno copy MyLlama3 MyLlama3-Experiment`               |\n| `show \u003cmodel_name\u003e`        | Display detailed model info (metadata, path, etc.)   | `inferno show MyLlama3`                                   |\n| `ps`                       | Show running Inferno server processes/models         | `inferno ps`                                              |\n| `quantize \u003cinput\u003e [out]`   | Convert models (HF or GGUF) to different quant levels | `inferno quantize hf:Qwen/Qwen3-0.6B Qwen3-0.6B-Q4_K_M` |\n| `compare \u003cmodels...\u003e`      | Compare specs of multiple local models               | `inferno compare ModelA ModelB`                           |\n| `estimate \u003cmodel_name\u003e`    | Show RAM usage estimates for different quants        | `inferno estimate MyLlama3-f16`                           |\n| `version`                  | Display Inferno version information                  | `inferno version`                                         |\n\n## 🔥 Getting Started\n\nLet's ignite your first model!\n\n1.  **Download a Model:** Choose a model from Hugging Face (GGUF format). Inferno helps you select the specific file.\n    ```bash\n    # Example: Download Llama 3.3 8B Instruct (will prompt for file selection)\n    inferno pull meta-llama/Llama-3.3-8B-Instruct-GGUF\n\n    # Example: Download Mistral Small 3.1\n    inferno pull mistralai/Mistral-Small-3.1-GGUF\n\n    # Example: Download Phi-4 Mini\n    inferno pull microsoft/Phi-4-mini-GGUF\n\n    # Example: Specify a direct file if you know the exact name\n    # inferno pull user/repo-GGUF:model-q4_k_m.gguf\n    ```\n    \u003e [!WARNING]\n    \u003e Some models require Hugging Face authentication. Run `huggingface-cli login` in your terminal beforehand if needed. 
Inferno will warn you about estimated RAM requirements.\n\n2.  **List Your Models:** Verify the download.\n    ```bash\n    inferno list\n    ```\n    *(You'll see your downloaded model listed, e.g., `Llama-3.3-8B-Instruct-GGUF`)*\n\n3.  **Chat with the Model:** Start an interactive session.\n    ```bash\n    inferno run Llama-3.3-8B-Instruct-GGUF\n    ```\n    Type your questions and press Enter. Use `/help` inside the chat for commands like changing the system prompt (`/set system ...`) or context size (`/set context ...`). Use `/bye` to exit.\n\n4.  **(Alternative) Start the API Server:** Serve the model for use with other applications.\n    ```bash\n    inferno serve Llama-3.3-8B-Instruct-GGUF --port 8000\n    ```\n    The server will be available at `http://localhost:8000`. You can now use clients pointing to `http://localhost:8000/v1` (OpenAI API) or `http://localhost:8000/api` (Ollama API).\n\n## 📋 Usage Guide\n\n### Download a Model\n\n```bash\n# Interactive download (recommended)\ninferno pull \u003crepo_id\u003e\n# Example: inferno pull google/gemma-1.1-7b-it-gguf\n\n# Direct file download\ninferno pull \u003crepo_id\u003e:\u003cfilename.gguf\u003e\n# Example: inferno pull google/gemma-1.1-7b-it-gguf:gemma-1.1-7b-it-Q4_K_M.gguf\n\n# HuggingFace prefix with repository ID\ninferno pull hf:\u003crepo_id\u003e\n# Example: inferno pull hf:mradermacher/DAN-Qwen3-1.7B-GGUF\n\n# HuggingFace prefix with repository ID and quantization\ninferno pull hf:\u003crepo_id\u003e:\u003cquantization\u003e\n# Example: inferno pull hf:mradermacher/DAN-Qwen3-1.7B-GGUF:Q2_K\n```\nInferno shows available GGUF files, sizes, and estimated RAM needed, warning if it exceeds your system's available RAM.\n\n### Model Quantization\n\nConvert models to smaller, faster GGUF formats. 
This is useful if you download a large `F16` model or want to experiment with different precision levels.\n\n```bash\n# Quantize a downloaded F16 GGUF model (interactive method selection)\ninferno quantize MyModel-f16 MyModel-Q4_K_M\n\n# Quantize directly from a Hugging Face repo (interactive)\n# This downloads the original (often PyTorch/Safetensors) model and converts it.\ninferno quantize hf:NousResearch/Hermes-2-Pro-Llama-3-8B Hermes-2-Pro-Llama-3-8B-Q5_K_M\n\n# Specify quantization method directly (e.g., q4_k_m)\ninferno quantize MyModel-f16 MyModel-Q4_K_M --method q4_k_m\n```\n\n**Common Quantization Methods:**\n\n| Method | Approx Bits | Size Multiplier (vs F16) | Use Case                                    |\n| :----- | :---------- | :----------------------- | :------------------------------------------ |\n| q2_k   | ~2.5 bits   | ~0.16x                   | Minimum RAM, experimental quality           |\n| iq3_m  | ~3.0 bits   | ~0.21x                   | Good quality for 3-bit (Importance Matrix)  |\n| q3_k_m | ~3.5 bits   | ~0.24x                   | Balanced low RAM / decent quality           |\n| iq4_nl | ~4.0 bits   | ~0.29x                   | Best 4-bit quality (Non-Linear Importance)  |\n| iq4_xs | ~4.0 bits   | ~0.29x                   | Extra small 4-bit (Importance Matrix)       |\n| q4_k_m | ~4.5 bits   | ~0.31x                   | **Excellent general-purpose balance**       |\n| q5_k_m | ~5.5 bits   | ~0.38x                   | Higher quality, moderate RAM increase       |\n| q6_k   | ~6.5 bits   | ~0.44x                   | Very high quality, significant RAM          |\n| q8_0   | ~8.5 bits   | ~0.53x                   | Near-lossless quality, highest non-F16 RAM |\n| f16    | 16.0 bits   | 1.00x                    | Full precision, highest RAM, source quality |\n\n\u003e [!NOTE]\n\u003e RAM estimates provided during quantization are approximate. Actual usage depends on context size and backend. 
Use `inferno estimate \u003cmodel_name\u003e` for more detailed projections.\n\n### List Downloaded Models\n\n```bash\ninferno list # or inferno ls\n```\nShows local model names, original repo, file size, quantization type, estimated base RAM, and download date.\n\n### Start the Server\n\n```bash\n# Serve a locally downloaded model\ninferno serve MyModel-Q4_K_M\n\n# Serve directly from Hugging Face (downloads if not present)\ninferno serve teknium/OpenHermes-2.5-Mistral-7B-GGUF:openhermes-2.5-mistral-7b.Q4_K_M.gguf\n\n# Specify host and port (0.0.0.0 makes it accessible on your network)\ninferno serve MyModel-Q4_K_M --host 0.0.0.0 --port 8080\n\n# Advanced: Offload layers to GPU, set context size\ninferno serve MyModel-Q4_K_M --n_gpu_layers 35 --n_ctx 8192\n```\n\u003e [!WARNING]\n\u003e Using `--host 0.0.0.0` exposes the server to your local network. Ensure your firewall settings are appropriate.\n\n### Chat with a Model\n\n```bash\ninferno run MyModel-Q4_K_M\n\n# Set context size on launch\ninferno run MyModel-Q4_K_M --n_ctx 4096\n```\n**In-Chat Commands:**\n\n| Command                 | Description                             |\n| :---------------------- | :---------------------------------------------------- |\n| `/help` or `/?`         | Show this help message                                |\n| `/bye`                  | Exit the chat                                         |\n| `/set system \u003cprompt\u003e`  | Set the system prompt                                 |\n| `/set context \u003csize\u003e`   | Set context window size (reloads model)               |\n| `/show context`         | Show the current and maximum context window size      |\n| `/clear` or `/cls`      | Clear the terminal screen                             |\n| `/reset`                | Reset chat history and system prompt                  |\n\n## 🔌 API Usage\n\nInferno exposes OpenAI and Ollama compatible API endpoints when you run `inferno serve`.\n\n*   **OpenAI Base URL:** 
`http://localhost:8000/v1` (Default port)\n*   **Ollama Base URL:** `http://localhost:8000/api` (Default port)\n\n### OpenAI API Endpoints\n\n*   `/v1/models` (GET): List available models (returns the currently served model).\n*   `/v1/chat/completions` (POST): Generate chat responses (supports streaming).\n*   `/v1/completions` (POST): Generate text completions (legacy, use chat).\n*   `/v1/embeddings` (POST): Generate text embeddings.\n\n### Ollama API Endpoints\n\n*   `/api/tags` (GET): List available models (returns the currently served model).\n*   `/api/chat` (POST): Generate chat responses (supports streaming).\n*   `/api/generate` (POST): Generate text completions.\n*   `/api/embed` (POST): Generate text embeddings.\n*   `/api/show` (POST): Show details for a loaded model.\n*   *(Model management endpoints like `/api/pull`, `/api/copy`, `/api/delete` are generally handled by the CLI)*\n\n### Python Examples\n\n#### Using `openai` Package\n\n```python\nimport openai\n\n# Point the official OpenAI client to your Inferno server\nclient = openai.OpenAI(\n    api_key=\"dummy-key\", # Required by the library, but not used by Inferno\n    base_url=\"http://localhost:8000/v1\" # Your Inferno server URL\n)\n\n# --- Chat Completion ---\ntry:\n    response = client.chat.completions.create(\n        model=\"MyModel-Q4_K_M\", # Must match the model name used in `inferno serve`\n        messages=[\n            {\"role\": \"system\", \"content\": \"You are Inferno, a helpful AI assistant.\"},\n            {\"role\": \"user\", \"content\": \"Explain the concept of quantization.\"}\n        ],\n        max_tokens=150,\n        temperature=0.7,\n        stream=False # Set to True for streaming\n    )\n    print(\"Full Response:\")\n    print(response.choices[0].message.content)\n\nexcept openai.APIConnectionError as e:\n    print(f\"Connection Error: Is the Inferno server running? 
{e}\")\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")\n\n\n# --- Streaming Chat Completion ---\ntry:\n    print(\"\\nStreaming Response:\")\n    stream = client.chat.completions.create(\n        model=\"MyModel-Q4_K_M\",\n        messages=[{\"role\": \"user\", \"content\": \"Write a short poem about fire.\"}],\n        stream=True\n    )\n    for chunk in stream:\n        if chunk.choices and chunk.choices[0].delta and chunk.choices[0].delta.content:\n            print(chunk.choices[0].delta.content, end=\"\", flush=True)\n    print() # Newline after stream\n\nexcept openai.APIConnectionError as e:\n    print(f\"Connection Error: Is the Inferno server running? {e}\")\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")\n\n# --- Embeddings ---\ntry:\n    response = client.embeddings.create(\n        model=\"MyModel-Q4_K_M\", # Ensure the model supports embeddings\n        input=\"Inferno is heating up the local AI scene!\"\n    )\n    print(f\"\\nEmbedding Vector (first 5 dims): {response.data[0].embedding[:5]}...\")\n    print(f\"Total dimensions: {len(response.data[0].embedding)}\")\n\nexcept openai.APIConnectionError as e:\n    print(f\"Connection Error: Is the Inferno server running? 
{e}\")\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")\n\n```\n\n#### Using `requests` (Ollama API)\n\n```python\nimport requests\n\nbase_url = \"http://localhost:8000/api\" # Ollama API base\nmodel_name = \"MyModel-Q4_K_M\" # Replace with your served model\n\n# --- Chat Completion ---\ntry:\n    response = requests.post(\n        f\"{base_url}/chat\",\n        json={\n            \"model\": model_name,\n            \"messages\": [{\"role\": \"user\", \"content\": \"Hello there!\"}],\n            \"stream\": False\n        }\n    )\n    response.raise_for_status() # Raise an exception for bad status codes\n    print(\"Ollama API Chat Response:\")\n    print(response.json()[\"message\"][\"content\"])\n\nexcept requests.exceptions.RequestException as e:\n    print(f\"Ollama API Error: {e}\")\n\n\n# --- Embeddings ---\ntry:\n    response = requests.post(\n        f\"{base_url}/embed\",\n        json={\n            \"model\": model_name,\n            \"input\": \"This is text to embed.\"\n        }\n    )\n    response.raise_for_status()\n    # Ollama's /api/embed returns a list of vectors, one per input\n    # (assuming Inferno mirrors that shape), so take the first vector.\n    embedding = response.json()[\"embeddings\"][0]\n    print(\"\\nOllama API Embedding (first 5 dims):\")\n    print(embedding[:5], \"...\")\n    print(f\"Total dimensions: {len(embedding)}\")\n\nexcept requests.exceptions.RequestException as e:\n    print(f\"Ollama API Error: {e}\")\n\n```\n\n## 🐍 Native Python Client\n\nInferno includes its own `InfernoClient`, a drop-in replacement for the official `openai` client, offering the same interface.\n\n```python\n# Ensure inferno is installed: pip install -e .\nfrom inferno.client import InfernoClient\nimport json # For parsing tool arguments if needed\n\n# Initialize pointing to your server\nclient = InfernoClient(\n    api_key=\"dummy\",\n    base_url=\"http://localhost:8000/v1\", # Use the OpenAI-compatible endpoint\n)\n\nmodel_name = \"MyModel-Q4_K_M\" # Replace with your served model\n\n# --- Basic Chat ---\nprint(\"--- Native Client Chat ---\")\nresponse = 
client.chat.create(\n    model=model_name,\n    messages=[{\"role\": \"user\", \"content\": \"What is Inferno?\"}],\n)\nprint(response[\"choices\"][0][\"message\"][\"content\"])\n\n\n# --- Multimodal Chat (Example - Requires a multimodal model like LLaVA) ---\n# Make sure you are serving a model that supports image input (e.g., a LLaVA GGUF)\n# model_name_multimodal = \"llava-v1.6-mistral-7b-GGUF\" # Example name\n# print(\"\\n--- Native Client Multimodal Chat (Requires Multimodal Model) ---\")\n# try:\n#     response = client.chat.create(\n#         model=model_name_multimodal, # Use the multimodal model name\n#         messages=[\n#             {\n#                 \"role\": \"user\",\n#                 \"content\": [\n#                     {\"type\": \"text\", \"text\": \"Describe this image.\"},\n#                     {\"type\": \"image_url\", \"image_url\": {\"url\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\"}}\n#                 ]\n#             }\n#         ]\n#     )\n#     print(response.get(\"choices\", [{}])[0].get(\"message\", {}).get(\"content\"))\n# except Exception as e:\n#     print(f\"Could not run multimodal example: {e}\")\n\n\n# --- Tool Calling (Example - Requires a model supporting tools/functions) ---\n# Make sure you are serving a model fine-tuned for tool/function calling\n# model_name_tools = \"hermes-2-pro-llama-3-8b-GGUF\" # Example name\n# print(\"\\n--- Native Client Tool Calling (Requires Tool-Supporting Model) ---\")\n# tools = [{\n#     \"type\": \"function\",\n#     \"function\": {\n#         \"name\": \"get_current_weather\",\n#         \"description\": \"Get the current weather in a given location\",\n#         \"parameters\": {\n#             \"type\": \"object\",\n#             \"properties\": {\n#                 \"location\": {\"type\": \"string\", \"description\": \"The city and state, e.g. 
San Francisco, CA\"},\n#                 \"unit\": {\"type\": \"string\", \"enum\": [\"celsius\", \"fahrenheit\"]}\n#             },\n#             \"required\": [\"location\"]\n#         }\n#     }\n# }]\n# try:\n#     response = client.chat.create(\n#         model=model_name_tools, # Use the tool-supporting model name\n#         messages=[{\"role\": \"user\", \"content\": \"What's the weather like in Boston?\"}],\n#         tools=tools,\n#         tool_choice=\"auto\" # or force: {\"type\": \"function\", \"function\": {\"name\": \"get_current_weather\"}}\n#     )\n#     message = response.get(\"choices\", [{}])[0].get(\"message\", {})\n#     if message.get(\"tool_calls\"):\n#         tool_call = message[\"tool_calls\"][0]\n#         function_name = tool_call[\"function\"][\"name\"]\n#         function_args = json.loads(tool_call[\"function\"][\"arguments\"])\n#         print(f\"Function Call Requested: {function_name}\")\n#         print(f\"Arguments: {function_args}\")\n#         # --- Here you would execute the function and send back the result ---\n#     else:\n#         print(f\"Response Content (No Tool Call): {message.get('content')}\")\n# except Exception as e:\n#     print(f\"Could not run tool calling example: {e}\")\n\n```\n\u003e [!TIP]\n\u003e The `InfernoClient` provides a familiar interface for developers already using the `openai` package, simplifying integration.\n\n## 🧩 Integration with Applications\n\nInferno's OpenAI API compatibility makes it easy to integrate with popular AI frameworks.\n\n```python\n# Example using LangChain\nfrom langchain_openai import ChatOpenAI\nfrom langchain_core.messages import HumanMessage\n\n# Configure LangChain to use your local Inferno server\nllm = ChatOpenAI(\n    model=\"MyModel-Q4_K_M\", # Your served model name\n    openai_api_key=\"dummy\",\n    openai_api_base=\"http://localhost:8000/v1\",\n    temperature=0.7,\n    streaming=True # Enable streaming if desired\n)\n\nprint(\"--- LangChain Integration 
Example ---\")\n# Simple invocation\n# response = llm.invoke([HumanMessage(content=\"Explain the difference between heat and temperature.\")])\n# print(response.content)\n\n# Streaming invocation\nprint(\"Streaming with LangChain:\")\nfor chunk in llm.stream(\"Write a haiku about a campfire.\"):\n    print(chunk.content, end=\"\", flush=True)\nprint()\n```\nThe same approach works with LlamaIndex, Semantic Kernel, Haystack, and any other tool that supports the OpenAI API standard. Just point it at your Inferno server's URL (`http://localhost:8000/v1`).\n\n## 📦 Requirements\n\n### Hardware\n\n*   **RAM:** This is the most critical factor.\n    *   ~2-4 GB RAM for 1-3B parameter models (e.g., Phi-3 Mini, Gemma 2B)\n    *   **8 GB+ RAM** recommended for 7-8B models (e.g., Llama 3.1 8B, Mistral 7B)\n    *   **16 GB+ RAM** recommended for 13B models\n    *   **32 GB+ RAM** needed for ~30B models\n    *   **64 GB+ RAM** needed for ~70B models\n*   **CPU:** A modern multi-core CPU. Performance scales with core count and speed.\n*   **GPU (Highly Recommended):** An NVIDIA, AMD, or Apple Silicon GPU significantly accelerates inference. VRAM requirements depend on the model size and number of layers offloaded (`--n_gpu_layers`). Even partial offloading helps.\n*   **Disk Space:** Enough space for downloaded models (GGUF files can range from ~1GB to 100GB+).\n\n\u003e [!WARNING]\n\u003e Running models requiring more RAM than physically available will lead to *extreme* slowdowns due to disk swapping. Check model RAM estimates (`inferno list`, `inferno estimate`) before running.\n\n### Software\n\n*   **Python:** 3.9 or newer.\n*   **Build Tools:** A C++ compiler (like GCC, Clang, or MSVC) and CMake are required for building `llama-cpp-python`.\n*   **Core Dependencies:** `llama-cpp-python`, `fastapi`, `uvicorn`, `rich`, `typer`, `huggingface-hub`, `pydantic`, `requests`. 
(Installed automatically with `pip install -e .`)\n*   **(Optional) Git:** For cloning the repository.\n\n## 🔧 Advanced Configuration\n\nPass these options to `inferno serve` and/or `inferno run` as indicated:\n\n| Option              | Description                                                      | Example                               | Default            | Applies to     |\n| :------------------ | :--------------------------------------------------------------- | :------------------------------------ | :----------------- | :------------- |\n| `--host \u003cip\u003e`       | IP address to bind the server to                                 | `--host 0.0.0.0`                      | `127.0.0.1`        | `serve` only   |\n| `--port \u003cnum\u003e`      | Port for the API server                                          | `--port 8080`                         | `8000`             | `serve` only   |\n| `--n_gpu_layers \u003cn\u003e`| Number of model layers to offload to GPU (-1 for max)            | `--n_gpu_layers 35`                   | `0` (CPU only)     | `serve`, `run` |\n| `--n_ctx \u003cn\u003e`       | Context window size (tokens), overrides auto-detection           | `--n_ctx 8192`                        | Auto/4096          | `serve`, `run` |\n| `--n_threads \u003cn\u003e`   | Number of CPU threads for computation                            | `--n_threads 4`                       | (Auto-detected)    | `serve`, `run` |\n| `--use_mlock`       | Force model to stay in RAM (prevents swapping if possible)       | `--use_mlock`                         | (Disabled)         | `serve`, `run` |\n\n\u003e [!TIP]\n\u003e For optimal CPU performance, set `--n_threads` to the number of *physical* cores on your CPU. Check your CPU specs (e.g., via Task Manager on Windows or `lscpu` on Linux). Start with `--n_gpu_layers -1` to offload as much as possible to VRAM, then reduce if you encounter memory errors.\n\n## 🤝 Contributing\n\nHelp fuel the fire! Contributions are highly welcome.\n\n1.  
**Fork** the repository on GitHub.\n2.  **Clone** your fork locally: `git clone https://github.com/\u003cyour-username\u003e/inferno.git`\n3.  Create a **new branch** for your changes: `git checkout -b feature/my-cool-feature` (or `git checkout -b bugfix/fix-that-issue`).\n4.  Make your changes and **commit** them with clear messages: `git commit -m \"Add feature X\"`\n5.  **Push** your branch to your fork: `git push origin feature/my-cool-feature`\n6.  Open a **Pull Request** (PR) from your branch to the `main` branch of the `HelpingAI/inferno` repository.\n\nPlease ensure your code follows basic Python best practices and includes relevant tests or documentation updates if applicable.\n\n## 📄 License\n\nInferno is licensed under the [HelpingAI Open Source License](LICENSE). This license promotes open innovation and collaboration while ensuring responsible and ethical use of AI technology.\n\n## 📚 Full Documentation\n\n\u003cdiv align=\"center\"\u003e\n  \u003ch3\u003e\u003ca href=\"https://deepwiki.com/HelpingAI/inferno\"\u003e📖 Dive Deeper: Read the Full Documentation\u003c/a\u003e\u003c/h3\u003e\n  \u003cp\u003eFind comprehensive guides, API references, advanced configuration details, and tutorials at \u003ccode\u003edeepwiki.com/HelpingAI/inferno\u003c/code\u003e\u003c/p\u003e\n\u003c/div\u003e\n\n---\n\n\u003cdiv align=\"center\"\u003e\n  \u003cp\u003eMade with ❤️ by \u003ca href=\"https://helpingai.co\"\u003eHelpingAI\u003c/a\u003e\u003c/p\u003e\n\u003c/div\u003e","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhelpingai%2Finferno","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhelpingai%2Finferno","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhelpingai%2Finferno/lists"}