https://github.com/inftyai/puma

Aims to be a lightweight, high-performance inference engine for heterogeneous devices. WIP.
llm llm-inference rust


**A lightweight, high-performance inference engine for local AI**

[![Stability: Active](https://img.shields.io/badge/stability-active-brightgreen.svg)](https://github.com/InftyAI/PUMA)
[![Latest Release](https://img.shields.io/github/v/release/InftyAI/PUMA)](https://github.com/InftyAI/PUMA/releases)

## ✨ Features

🔧 **Model Management** - Download, cache, and organize AI models from Hugging Face

🔍 **Advanced Filtering** - Search models with regex patterns and SQL-style queries

💻 **System Detection** - Automatic GPU detection and resource reporting

🚀 **OpenAI-Compatible API** - RESTful API with streaming support

## Installation

### Install with Cargo

```bash
cargo install puma
```

### Build from Source

```bash
# Clone the repository
git clone https://github.com/InftyAI/PUMA.git
cd PUMA

# Build the binary
make build

# The binary will be available at ./puma
./puma version
```

## Quick Start

### CLI Usage

```bash
# Download a model
puma pull inftyai/tiny-random-gpt2

# List all models
puma ls

# Inspect model details
puma inspect inftyai/tiny-random-gpt2

# Check system info
puma info

# Remove a model
puma rm inftyai/tiny-random-gpt2
```

### API Server

```bash
# Start the inference server with a model
puma serve inftyai/tiny-random-gpt2

# Server will start on http://0.0.0.0:8000
# API endpoints:
# POST /v1/chat/completions
# POST /v1/completions
# GET /v1/models
# GET /v1/models/:model
# GET /health
```

**Test the API:**

```bash
# Health check
curl http://localhost:8000/health

# Chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inftyai/tiny-random-gpt2",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Or use the test script
./hack/scripts/test_api.sh
```

## Commands

| Command | Status | Description |
|---------|--------|-------------|
| `pull <model>` | ✅ | Download model from provider |
| `ls` | ✅ | List models (supports regex, label filters) |
| `inspect <model>` | ✅ | Show detailed model information |
| `rm <model>` | ✅ | Remove model and cache |
| `info` | ✅ | Display system information |
| `version` | ✅ | Show PUMA version |
| `serve <model>` | ✅ | Start OpenAI-compatible API server with a model |
| `ps` | 🚧 | List running models |
| `run` | 🚧 | Start model inference |
| `stop` | 🚧 | Stop running model |

## Advanced Usage

### Pattern Matching

```bash
# Substring match
puma ls qwen

# Prefix match
puma ls "^inftyai/"

# Alternation
puma ls "llama-(2|3)"
```
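
The three patterns above behave like an unanchored regex search over model names. The following Python sketch illustrates that behavior under stated assumptions: search (not full-match) semantics, a hypothetical `match_models` helper, and made-up model names. PUMA's own matching is implemented in Rust and may differ in detail.

```python
import re

# Hypothetical local model names, used only for illustration.
MODELS = [
    "inftyai/tiny-random-gpt2",
    "qwen/qwen2-0.5b",
    "meta/llama-2-7b",
    "meta/llama-3-8b",
]


def match_models(pattern, names):
    """Return the names in which the regex matches anywhere."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


print(match_models("qwen", MODELS))         # substring match
print(match_models("^inftyai/", MODELS))    # prefix match
print(match_models("llama-(2|3)", MODELS))  # alternation
```

Quoting the pattern on the command line, as in the examples above, keeps the shell from interpreting characters such as `^`, `(`, and `|`.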

### Label Filtering

```bash
# Single filter
puma ls -l author=inftyai

# Multiple filters (AND condition)
puma ls -l author=inftyai,license=mit

# Combine pattern + filter
puma ls llama -l author=meta
```

**Available filters:** `author`, `task`, `license`, `provider`, `model_series`
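
To make the AND semantics concrete, here is a minimal Python sketch of how a selector such as `author=inftyai,license=mit` could be parsed and applied. The `parse_labels` and `filter_models` helpers and the metadata records are hypothetical, and the comparison here is exact; PUMA's actual implementation may normalize values differently.

```python
def parse_labels(selector):
    """Parse 'key=value,key=value' into a dict of required labels."""
    labels = {}
    for pair in selector.split(","):
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"invalid label filter: {pair!r}")
        labels[key.strip()] = value.strip()
    return labels


def filter_models(models, selector):
    """Keep models whose metadata matches every label (AND condition)."""
    wanted = parse_labels(selector)
    return [m for m in models if all(m.get(k) == v for k, v in wanted.items())]


# Hypothetical metadata records, for illustration only.
MODELS = [
    {"name": "inftyai/tiny-random-gpt2", "author": "inftyai", "license": "mit"},
    {"name": "inftyai/other-model", "author": "inftyai", "license": "apache-2.0"},
    {"name": "meta/llama-3-8b", "author": "meta", "license": "llama3"},
]

for model in filter_models(MODELS, "author=inftyai,license=mit"):
    print(model["name"])
```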

## API Server

PUMA provides an OpenAI-compatible API server for model inference.

### Starting the Server

```bash
# Pull the model first; serve needs it to be available locally
puma pull inftyai/tiny-random-gpt2

# Start server with a model (default: 0.0.0.0:8000)
puma serve inftyai/tiny-random-gpt2

# Custom host and port
puma serve inftyai/tiny-random-gpt2 --host 127.0.0.1 --port 3000

### API Endpoints

#### Chat Completions (Recommended)
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inftyai/tiny-random-gpt2",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```
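
The same request can be built without any third-party dependency. The sketch below uses only the Python standard library to construct (not send) the POST request shown above; `build_chat_request` is a hypothetical helper name, and the URL assumes the default local server address.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, **params):
    """Build, but do not send, a POST request for /v1/chat/completions."""
    payload = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000",
    "inftyai/tiny-random-gpt2",
    [{"role": "user", "content": "Hello!"}],
    max_tokens=100,
    temperature=0.7,
)
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```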

#### Streaming (Server-Sent Events)
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inftyai/tiny-random-gpt2",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'
```
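
A streamed response arrives as Server-Sent Events, one `data:` line per chunk. The following Python sketch shows one way to reassemble the text, assuming PUMA's stream follows the OpenAI chunk format (`choices[].delta.content` and a final `data: [DONE]` line). The `iter_content` helper and the sample stream are hypothetical and handwritten, not captured PUMA output.

```python
import json


def iter_content(sse_lines):
    """Yield text deltas from OpenAI-style SSE lines until [DONE]."""
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Handwritten sample in the assumed format, for illustration only.
SAMPLE_STREAM = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Once"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": " upon a time"}}]}',
    "",
    "data: [DONE]",
]

print("".join(iter_content(SAMPLE_STREAM)))
```

Because `iter_content` accepts any iterable of lines, it can be fed decoded lines from an HTTP response as they arrive.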

#### List Models
```bash
# Returns the currently loaded model
curl http://localhost:8000/v1/models
```

#### Health Check
```bash
curl http://localhost:8000/health
# Returns: {"status":"ok"}
```

### OpenAI Python Client

PUMA is compatible with the OpenAI Python SDK:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # Placeholder: the SDK expects a value, but PUMA does not require a key
)

response = client.chat.completions.create(
    model="inftyai/tiny-random-gpt2",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
)

print(response.choices[0].message.content)
```

### Inspect Output

```bash
$ puma inspect inftyai/tiny-random-gpt2

name: inftyai/tiny-random-gpt2
kind: Model
spec:
  author: inftyai
  model_series: gpt2
  task: text-generation
  license: MIT
  context_window: 2.05K
  safetensors:
    total: 7.00B
    parameters:
      f32: 7.00B
  provider: huggingface
  cache:
    revision: abc123de
    size: 1.24 GB
    path: ~/.puma/cache/...
status:
  created: 2 hours ago
  updated: 2 hours ago
```
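
Values such as `2.05K` and `7.00B` in the output above read as counts abbreviated with decimal suffixes. The Python sketch below reproduces that style, assuming base-1000 scaling and two decimal places; `format_count` is a hypothetical helper and not PUMA's own formatter.

```python
def format_count(value):
    """Abbreviate a count with decimal K/M/B suffixes and two decimals."""
    for threshold, suffix in ((1e9, "B"), (1e6, "M"), (1e3, "K")):
        if value >= threshold:
            return f"{value / threshold:.2f}{suffix}"
    return str(value)


print(format_count(2048))           # a 2048-token context window
print(format_count(7_000_000_000))  # seven billion parameters
```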

## Model Management

- **Database:** `~/.puma/models.db` (SQLite)
- **Cache:** `~/.puma/cache/` (model files)

Models are stored with lowercase names for case-insensitive matching.
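
The effect of lowercase storage can be sketched with Python's built-in `sqlite3` module. The table schema and the `add_model` and `find_model` helpers are hypothetical; PUMA's real schema in `~/.puma/models.db` is not documented here, and this example uses an in-memory database instead.

```python
import sqlite3

# In-memory database for illustration; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (name TEXT PRIMARY KEY, provider TEXT)")


def add_model(name, provider):
    """Store the model under its lowercased name."""
    conn.execute(
        "INSERT OR REPLACE INTO models (name, provider) VALUES (?, ?)",
        (name.lower(), provider),
    )


def find_model(name):
    """Look up a model by name, ignoring the caller's casing."""
    return conn.execute(
        "SELECT name, provider FROM models WHERE name = ?",
        (name.lower(),),
    ).fetchone()


add_model("InftyAI/Tiny-Random-GPT2", "huggingface")
print(find_model("inftyai/tiny-random-gpt2"))
print(find_model("INFTYAI/TINY-RANDOM-GPT2"))
```

Normalizing once at write time and once at lookup keeps queries as simple equality checks on the primary key.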

## Development

```bash
# Build
make build

# Run all tests
make test

# Test API manually
./hack/scripts/test_api.sh
```

### Project Structure

```
puma/
├── src/
│ ├── api/ # OpenAI-compatible API
│ ├── backend/ # Inference backends (Mock, MLX)
│ ├── cli/ # Command implementations
│ ├── downloader/ # HuggingFace download logic
│ ├── registry/ # Model registry & metadata
│ ├── storage/ # SQLite storage backend
│ ├── system/ # System info detection
│ └── utils/ # Formatting & helpers
├── tests/ # Integration tests
├── hack/ # Development scripts
├── Cargo.toml # Rust dependencies
└── Makefile # Build commands
```

## License

Apache-2.0

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=inftyai/puma&type=Date)](https://www.star-history.com/#inftyai/puma&Date)