https://github.com/inftyai/puma
Aims to be a lightweight, high-performance inference engine for heterogeneous devices. Work in progress.
- Host: GitHub
- URL: https://github.com/inftyai/puma
- Owner: InftyAI
- License: apache-2.0
- Created: 2024-09-15T08:12:38.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-25T08:28:50.000Z (over 1 year ago)
- Last Synced: 2025-03-04T16:15:25.977Z (over 1 year ago)
- Topics: llm, llm-inference, rust
- Language: Rust
- Homepage:
- Size: 43.9 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README

**A lightweight, high-performance inference engine for local AI**
## ✨ Features
- 🔧 **Model Management** - Download, cache, and organize AI models from Hugging Face
- 🔍 **Advanced Filtering** - Search models with regex patterns and SQL-style queries
- 💻 **System Detection** - Automatic GPU detection and resource reporting
- 🚀 **OpenAI-Compatible API** - RESTful API with streaming support
## Installation
### Install with Cargo
```bash
cargo install puma
```
### Build from Source
```bash
# Clone the repository
git clone https://github.com/InftyAI/PUMA.git
cd PUMA
# Build the binary
make build
# The binary will be available at ./puma
./puma version
```
## Quick Start
### CLI Usage
```bash
# Download a model
puma pull inftyai/tiny-random-gpt2
# List all models
puma ls
# Inspect model details
puma inspect inftyai/tiny-random-gpt2
# Check system info
puma info
# Remove a model
puma rm inftyai/tiny-random-gpt2
```
### API Server
```bash
# Start the inference server with a model
puma serve inftyai/tiny-random-gpt2
# Server will start on http://0.0.0.0:8000
# API endpoints:
# POST /v1/chat/completions
# POST /v1/completions
# GET /v1/models
# GET /v1/models/:model
# GET /health
```
**Test the API:**
```bash
# Health check
curl http://localhost:8000/health
# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inftyai/tiny-random-gpt2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
# Or use the test script
./hack/scripts/test_api.sh
```
## Commands
| Command | Status | Description |
|---------|--------|-------------|
| `pull <model>` | ✅ | Download model from provider |
| `ls` | ✅ | List models (supports regex, label filters) |
| `inspect <model>` | ✅ | Show detailed model information |
| `rm <model>` | ✅ | Remove model and cache |
| `info` | ✅ | Display system information |
| `version` | ✅ | Show PUMA version |
| `serve <model>` | ✅ | Start OpenAI-compatible API server with a model |
| `ps` | 🚧 | List running models |
| `run` | 🚧 | Start model inference |
| `stop` | 🚧 | Stop running model |
## Advanced Usage
### Pattern Matching
```bash
# Substring match
puma ls qwen
# Prefix match
puma ls "^inftyai/"
# Alternation
puma ls "llama-(2|3)"
```
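To make the matching behavior concrete, here is a minimal sketch in Python. It assumes the pattern is applied as an unanchored regex search over each model name, which is what the three examples above suggest. PUMA itself is written in Rust, so Python's `re` module is only a stand-in here, and the model names and the `match_models` helper are hypothetical.

```python
import re

# Hypothetical model names, used only for illustration.
MODELS = [
    "inftyai/tiny-random-gpt2",
    "qwen/qwen2-0.5b",
    "meta/llama-2-7b",
    "meta/llama-3-8b",
    "meta/llama-guard",
]


def match_models(pattern: str, names: list[str]) -> list[str]:
    """Return the names in which `pattern` matches anywhere (unanchored search)."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


print(match_models("qwen", MODELS))         # substring match
print(match_models("^inftyai/", MODELS))    # prefix match via the ^ anchor
print(match_models("llama-(2|3)", MODELS))  # alternation
```

Under this assumption a bare word such as `qwen` behaves as a substring match, and anchoring is opt-in through `^` or `$`.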
### Label Filtering
```bash
# Single filter
puma ls -l author=inftyai
# Multiple filters (AND condition)
puma ls -l author=inftyai,license=mit
# Combine pattern + filter
puma ls llama -l author=meta
```
**Available filters:** `author`, `task`, `license`, `provider`, `model_series`
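The AND semantics of `-l key=value,key=value` can be sketched as follows. This is an illustration, not PUMA's implementation: `parse_labels` and `matches` are hypothetical helpers, and the case-insensitive comparison is an assumption, motivated by the README filtering on `license=mit` while `puma inspect` prints `license: MIT`.

```python
def parse_labels(spec: str) -> dict[str, str]:
    """Parse a 'key=value,key=value' string into a dict."""
    labels = {}
    for pair in spec.split(","):
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {pair!r}")
        labels[key.strip()] = value.strip()
    return labels


def matches(model: dict[str, str], labels: dict[str, str]) -> bool:
    """True only if every requested label equals the model's value (AND)."""
    # Case-insensitive comparison is an assumption, not confirmed behavior.
    return all(
        model.get(key, "").lower() == value.lower()
        for key, value in labels.items()
    )


# Hypothetical metadata record for one model.
model = {"author": "inftyai", "license": "MIT", "task": "text-generation"}
print(matches(model, parse_labels("author=inftyai,license=mit")))
print(matches(model, parse_labels("author=meta")))
```

A model passes only when all listed labels match, so adding more filters can only narrow the result.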
## API Server
PUMA provides an OpenAI-compatible API server for model inference.
### Starting the Server
```bash
# The model must be pulled before it can be served
puma pull inftyai/tiny-random-gpt2
# Start server with the model (default: 0.0.0.0:8000)
puma serve inftyai/tiny-random-gpt2
# Custom host and port
puma serve inftyai/tiny-random-gpt2 --host 127.0.0.1 --port 3000
```
### API Endpoints
#### Chat Completions (Recommended)
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inftyai/tiny-random-gpt2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
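The same request can be assembled from Python using only the standard library. The sketch below mirrors the fields of the curl example. `build_chat_request` is a hypothetical helper, and the request is built but not sent, so it can be inspected without a running server.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model: str, messages: list[dict], **params
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000",
    "inftyai/tiny-random-gpt2",
    [{"role": "user", "content": "Hello!"}],
    max_tokens=100,
    temperature=0.7,
)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Optional sampling parameters such as `max_tokens` and `temperature` are passed as keyword arguments and merged into the JSON body.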
#### Streaming (Server-Sent Events)
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inftyai/tiny-random-gpt2",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'
```
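A streamed response arrives as Server-Sent Events, one `data:` line per chunk. The sketch below shows one way to reassemble the text on the client side. It assumes PUMA follows the OpenAI chunk shape (`choices[].delta.content`) and the `data: [DONE]` end sentinel, which this README does not spell out. `iter_stream_text` is a hypothetical helper, and the sample lines are hand-written, not captured PUMA output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style `data:` lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # OpenAI-style end-of-stream sentinel (assumed)
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample in the OpenAI chunk shape.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Once"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": " upon"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

Because the function accepts any iterable of lines, it can be fed from a decoded HTTP response body as well as from a test fixture.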
#### List Models
```bash
# Returns the currently loaded model
curl http://localhost:8000/v1/models
```
#### Health Check
```bash
curl http://localhost:8000/health
# Returns: {"status":"ok"}
```
### OpenAI Python Client
PUMA is compatible with the OpenAI Python SDK:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # Not required by PUMA; any placeholder works
)

response = client.chat.completions.create(
    model="inftyai/tiny-random-gpt2",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
)
print(response.choices[0].message.content)
```
### Inspect Output
```bash
$ puma inspect inftyai/tiny-random-gpt2
name: inftyai/tiny-random-gpt2
kind: Model
spec:
  author: inftyai
  model_series: gpt2
  task: text-generation
  license: MIT
  context_window: 2.05K
  safetensors:
    total: 7.00B
    parameters:
      f32: 7.00B
  provider: huggingface
  cache:
    revision: abc123de
    size: 1.24 GB
    path: ~/.puma/cache/...
status:
  created: 2 hours ago
  updated: 2 hours ago
```
## Model Management
- **Database:** `~/.puma/models.db` (SQLite)
- **Cache:** `~/.puma/cache/` (model files)
Models are stored with lowercase names for case-insensitive matching.
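Lowercasing names on both write and lookup is what makes matching case-insensitive. The following is a minimal sketch of that idea, using an in-memory SQLite database as a stand-in for `~/.puma/models.db`. The `models` table schema and the helper names are hypothetical and are not taken from PUMA's source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for ~/.puma/models.db
conn.execute("CREATE TABLE models (name TEXT PRIMARY KEY)")  # hypothetical schema


def normalize(name: str) -> str:
    """Canonical form used for both storage and lookup."""
    return name.strip().lower()


def add_model(name: str) -> None:
    """Store the lowercase name; duplicates differing only in case are ignored."""
    conn.execute("INSERT OR IGNORE INTO models (name) VALUES (?)", (normalize(name),))


def has_model(name: str) -> bool:
    """Look up a model regardless of the casing the caller used."""
    row = conn.execute(
        "SELECT 1 FROM models WHERE name = ?", (normalize(name),)
    ).fetchone()
    return row is not None


add_model("InftyAI/Tiny-Random-GPT2")
print(has_model("inftyai/tiny-random-gpt2"))
```

Normalizing at the boundary keeps the stored data in one canonical form, so queries do not need case-folding logic of their own.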
## Development
```bash
# Build
make build
# Run all tests
make test
# Test API manually
./hack/scripts/test_api.sh
```
### Project Structure
```
puma/
├── src/
│ ├── api/ # OpenAI-compatible API
│ ├── backend/ # Inference backends (Mock, MLX)
│ ├── cli/ # Command implementations
│ ├── downloader/ # HuggingFace download logic
│ ├── registry/ # Model registry & metadata
│ ├── storage/ # SQLite storage backend
│ ├── system/ # System info detection
│ └── utils/ # Formatting & helpers
├── tests/ # Integration tests
├── hack/ # Development scripts
├── Cargo.toml # Rust dependencies
└── Makefile # Build commands
```
## License
Apache-2.0