https://github.com/fluidinference/fluid-server
Local AI server for your Windows apps.
- Host: GitHub
- URL: https://github.com/fluidinference/fluid-server
- Owner: FluidInference
- License: apache-2.0
- Created: 2025-08-10T21:22:19.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-09-30T00:11:34.000Z (5 months ago)
- Last Synced: 2025-09-30T02:26:21.383Z (5 months ago)
- Language: Python
- Size: 777 KB
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Agents: AGENTS.md
README
# Fluid Server: Local AI server for your Windows apps
[Discord](https://discord.gg/WNsvaCtmDe)
[Hugging Face](https://huggingface.co/collections/FluidInference)
**THIS PROJECT IS UNDER ACTIVE DEVELOPMENT.** It is not ready for production use, but it serves as a good reference for how to run Whisper on Qualcomm and Intel NPUs.
A portable, packaged, OpenAI-compatible server for Windows desktop applications. LLM chat, transcription, embeddings, and a vector DB, all out of the box.
Note that this requires running the .exe as a separate async process, like a local sidecar server for your application, and making HTTP requests to it for inference.
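As a rough illustration, a host application can launch the server as a child process and wait for the health endpoint before sending requests. This is a minimal sketch, assuming `fluid-server.exe` sits next to the script and listens on the default host and port; the path and timeout are illustrative, not fixed by the project.
```python
import subprocess
import time
import urllib.request

# Launch the packaged server as a separate process. The path is an
# assumption; point it at wherever your app ships fluid-server.exe.
server = subprocess.Popen(
    [r".\fluid-server.exe", "--host", "127.0.0.1", "--port", "8080"]
)

try:
    # Poll the health endpoint until the server is ready for requests.
    for _ in range(60):
        try:
            with urllib.request.urlopen("http://127.0.0.1:8080/health", timeout=1):
                break
        except OSError:
            time.sleep(1)
    else:
        raise RuntimeError("fluid-server did not become healthy in time")

    # ... send inference requests here ...
finally:
    # Shut the sidecar down when your app exits.
    server.terminate()
```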
## Features
**Core Capabilities**
- **LLM Chat Completions** - OpenAI-compatible API with streaming, backed by llama.cpp and OpenVINO
- **Audio Transcription** - Whisper models with NPU acceleration, backed by OpenVINO and Qualcomm QNN
- **Text Embeddings** - Vector embeddings for search and RAG
- **Vector Database** - LanceDB integration for multimodal storage
**Hardware Acceleration**
- **Intel NPU** via OpenVINO backend
- **Qualcomm NPU** via QNN (Snapdragon X Elite)
- **Vulkan GPU** via llama.cpp
## Quick Start
### 1. Download or Build
**Option A: Download Release**
- Download `fluid-server.exe` from [releases](https://github.com/FluidInference/fluid-server/releases)
**Option B: Run from Source**
```powershell
# Install dependencies and run from source
uv sync
uv run fluid-server  # entry point name is an assumption; check pyproject.toml
```
### 2. Run the Server
```powershell
# Run with default settings
.\dist\fluid-server.exe
# Or with custom options
.\dist\fluid-server.exe --host 127.0.0.1 --port 8080
```
### 3. Test the API
- **Health Check**: http://localhost:8080/health
- **API Docs**: http://localhost:8080/docs
- **Models**: http://localhost:8080/v1/models
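For a programmatic check, the `/v1/models` endpoint can be queried with the stock OpenAI client (a sketch; the base URL and dummy API key follow the Python example below, and the returned model IDs depend on which models you have downloaded):
```python
from openai import OpenAI

# The server speaks the OpenAI API, so a dummy API key is enough.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

# List the models the server currently exposes.
for model in client.models.list():
    print(model.id)
```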
## Usage Examples
### Basic Chat Completion
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "qwen3-8b-int8-ov", "messages": [{"role": "user", "content": "Hello!"}]}'
```
### Python Integration
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

# Chat with streaming
for chunk in client.chat.completions.create(
    model="qwen3-8b-int8-ov",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")
```
### Audio Transcription
```bash
curl -X POST http://localhost:8080/v1/audio/transcriptions \
-F "file=@audio.wav" \
-F "model=whisper-large-v3-turbo-qnn"
```
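### Text Embeddings
The embeddings feature can be exercised the same way. This is a sketch assuming the server exposes the OpenAI-compatible `/v1/embeddings` route; the model name here is a placeholder for whichever embedding model you have installed (check `/v1/models`):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

# "embedding-model" is a placeholder; use an ID returned by /v1/models.
response = client.embeddings.create(
    model="embedding-model",
    input=["Fluid Server runs AI models locally on Windows."],
)
print(len(response.data[0].embedding))  # dimensionality of the vector
```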
## Documentation
📖 **Comprehensive Guides**
- [NPU Support Guide](docs/npu-support.md) - Intel & Qualcomm NPU configuration
- [Integration Guide](docs/integration-guide.md) - Python, .NET, Node.js examples
- [Development Guide](docs/development.md) - Setup, building, and contributing
- [LanceDB Integration](docs/lancedb.md) - Vector database and embeddings
- [GGUF Model Support](docs/GGUF-model-support.md) - Using any GGUF model
- [Compilation Guide](docs/compilation-guide.md) - Build system details
## FAQ
**Why Python?** Best ML ecosystem support and PyInstaller packaging.
**Why not just llama.cpp?** We support multiple runtimes and AI accelerators beyond GGML.
## Acknowledgements
Built using `ty`, `FastAPI`, `Pydantic`, `ONNX Runtime`, `OpenAI Whisper`, and various other AI libraries.
**Runtime Technologies:**
- `OpenVINO` - Intel NPU and GPU acceleration
- `Qualcomm QNN` - Snapdragon NPU optimization with HTP backend
- `ONNX Runtime` - Cross-platform AI inference