{"id":26103492,"url":"https://github.com/inftyai/puma","last_synced_at":"2026-06-08T14:32:41.473Z","repository":{"id":276820997,"uuid":"857647040","full_name":"InftyAI/PUMA","owner":"InftyAI","description":"Aim to be a lightweight, high-performance inference engine for heterogeneous devices. WIP.","archived":false,"fork":false,"pushed_at":"2025-02-25T08:28:50.000Z","size":45,"stargazers_count":0,"open_issues_count":4,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-04T16:15:25.977Z","etag":null,"topics":["llm","llm-inference","rust"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/InftyAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-15T08:12:38.000Z","updated_at":"2025-02-25T08:28:54.000Z","dependencies_parsed_at":"2025-02-25T07:33:19.630Z","dependency_job_id":null,"html_url":"https://github.com/InftyAI/PUMA","commit_stats":null,"previous_names":["inftyai/puma"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InftyAI%2FPUMA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InftyAI%2FPUMA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InftyAI%2FPUMA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/InftyAI%2FPUMA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/InftyAI","download_url":"https://codeload.github.com/InftyAI/PUMA/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242744089,"owners_count":20178174,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","llm-inference","rust"],"created_at":"2025-03-09T20:07:13.608Z","updated_at":"2026-06-08T14:32:41.468Z","avatar_url":"https://github.com/InftyAI.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n\u003cpicture\u003e\n  \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"https://raw.githubusercontent.com/InftyAI/PUMA/main/site/images/logo-dark.svg\"\u003e\n  \u003csource media=\"(prefers-color-scheme: light)\" srcset=\"https://raw.githubusercontent.com/InftyAI/PUMA/main/site/images/logo-light.svg\"\u003e\n  \u003cimg alt=\"PUMA Logo\" src=\"https://raw.githubusercontent.com/InftyAI/PUMA/main/site/images/logo-light.svg\" width=\"240\"\u003e\n\u003c/picture\u003e\n\n**A lightweight, high-performance inference engine for local AI**\n\n[![Stability: Active](https://img.shields.io/badge/stability-active-brightgreen.svg)](https://github.com/InftyAI/PUMA)\n[![Latest Release](https://img.shields.io/github/v/release/InftyAI/PUMA)](https://github.com/InftyAI/PUMA/releases)\n\n\u003c/div\u003e\n\n## ✨ Features\n\n🔧 **Model Management** - Download, cache, and organize AI models from Hugging Face\n\n🔍 **Advanced Filtering** - Search models with regex patterns and SQL-style queries\n\n💻 **System Detection** - Automatic GPU detection and resource reporting\n\n🚀 **OpenAI-Compatible API** - RESTful API with streaming support\n\n## Installation\n\n### Install with Cargo\n\n```bash\ncargo install puma\n```\n\n### Build from Source\n\n```bash\n# Clone the repository\ngit clone https://github.com/InftyAI/PUMA.git\ncd PUMA\n\n# Build the binary\nmake build\n\n# The binary will be available at ./puma\n./puma version\n```\n\n## Quick Start\n\n### CLI Usage\n\n```bash\n# Download a model\npuma pull inftyai/tiny-random-gpt2\n\n# List all models\npuma ls\n\n# Inspect model details\npuma inspect inftyai/tiny-random-gpt2\n\n# Check system info\npuma info\n\n# Remove a model\npuma rm inftyai/tiny-random-gpt2\n```\n\n### API Server\n\n```bash\n# Start the inference server with a model\npuma serve inftyai/tiny-random-gpt2\n\n# Server will start on http://0.0.0.0:8000\n# API endpoints:\n#   POST /v1/chat/completions\n#   POST /v1/completions\n#   GET  /v1/models\n#   GET  /v1/models/:model\n#   GET  /health\n```\n\n**Test the API:**\n\n```bash\n# Health check\ncurl http://localhost:8000/health\n\n# Chat completion\ncurl http://localhost:8000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"inftyai/tiny-random-gpt2\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}]\n  }'\n\n# Or use the test script\n./hack/scripts/test_api.sh\n```\n\n## Commands\n\n| Command | Status | Description |\n|---------|--------|-------------|\n| `pull \u003cmodel\u003e` | ✅ | Download model from provider |\n| `ls` | ✅ | List models (supports regex, label filters) |\n| `inspect \u003cmodel\u003e` | ✅ | Show detailed model information |\n| `rm \u003cmodel\u003e` | ✅ | Remove model and cache |\n| `info` | ✅ | Display system information |\n| `version` | ✅ | Show PUMA version |\n| `serve \u003cmodel\u003e` | ✅ | Start OpenAI-compatible API server with a model |\n| `ps` | 🚧 | List running models |\n| `run` | 🚧 | Start model inference |\n| `stop` | 🚧 | Stop running model |\n\n## Advanced Usage\n\n### Pattern Matching\n\n```bash\n# Substring match\npuma ls qwen\n\n# Prefix match\npuma ls \"^inftyai/\"\n\n# Alternation\npuma ls \"llama-(2|3)\"\n```\n\n### Label Filtering\n\n```bash\n# Single filter\npuma ls -l author=inftyai\n\n# Multiple filters (AND condition)\npuma ls -l author=inftyai,license=mit\n\n# Combine pattern + filter\npuma ls llama -l author=meta\n```\n\n**Available filters:** `author`, `task`, `license`, `provider`, `model_series`\n\n## API Server\n\nPUMA provides an OpenAI-compatible API server for model inference.\n\n### Starting the Server\n\n```bash\n# Start server with a model (default: 0.0.0.0:8000)\npuma serve inftyai/tiny-random-gpt2\n\n# Custom host and port\npuma serve inftyai/tiny-random-gpt2 --host 127.0.0.1 --port 3000\n\n# Model must be pulled first\npuma pull inftyai/tiny-random-gpt2\n```\n\n### API Endpoints\n\n#### Chat Completions (Recommended)\n```bash\ncurl http://localhost:8000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"inftyai/tiny-random-gpt2\",\n    \"messages\": [\n      {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n      {\"role\": \"user\", \"content\": \"Hello!\"}\n    ],\n    \"max_tokens\": 100,\n    \"temperature\": 0.7\n  }'\n```\n\n#### Streaming (Server-Sent Events)\n```bash\ncurl http://localhost:8000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"inftyai/tiny-random-gpt2\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Tell me a story\"}],\n    \"stream\": true\n  }'\n```\n\n#### List Models\n```bash\n# Returns the currently loaded model\ncurl http://localhost:8000/v1/models\n```\n\n#### Health Check\n```bash\ncurl http://localhost:8000/health\n# Returns: {\"status\":\"ok\"}\n```\n\n### OpenAI Python Client\n\nPUMA is compatible with the OpenAI Python SDK:\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(\n    base_url=\"http://localhost:8000/v1\",\n    api_key=\"dummy\"  # Not required\n)\n\nresponse = client.chat.completions.create(\n    model=\"inftyai/tiny-random-gpt2\",\n    messages=[\n        {\"role\": \"user\", \"content\": \"Hello!\"}\n    ]\n)\n\nprint(response.choices[0].message.content)\n```\n\n### Inspect Output\n\n```bash\n$ puma inspect inftyai/tiny-random-gpt2\n\nname: inftyai/tiny-random-gpt2\nkind: Model\nspec:\n  author:         inftyai\n  model_series:   gpt2\n  task:           text-generation\n  license:        MIT\n  context_window: 2.05K\n  safetensors:\n    total:        7.00B\n    parameters:\n      f32:        7.00B\n  provider:     huggingface\n  cache:\n    revision:     abc123de\n    size:         1.24 GB\n    path:   ~/.puma/cache/...\nstatus:\n  created:      2 hours ago\n  updated:      2 hours ago\n```\n\n## Model Management\n\n- **Database:** `~/.puma/models.db` (SQLite)\n- **Cache:** `~/.puma/cache/` (model files)\n\nModels are stored with lowercase names for case-insensitive matching.\n\n## Development\n\n```bash\n# Build\nmake build\n\n# Run all tests\nmake test\n\n# Test API manually\n./hack/scripts/test_api.sh\n```\n\n### Project Structure\n\n```\npuma/\n├── src/\n│   ├── api/          # OpenAI-compatible API\n│   ├── backend/      # Inference backends (Mock, MLX)\n│   ├── cli/          # Command implementations\n│   ├── downloader/   # HuggingFace download logic\n│   ├── registry/     # Model registry \u0026 metadata\n│   ├── storage/      # SQLite storage backend\n│   ├── system/       # System info detection\n│   └── utils/        # Formatting \u0026 helpers\n├── tests/            # Integration tests\n├── hack/             # Development scripts\n├── Cargo.toml        # Rust dependencies\n└── Makefile          # Build commands\n```\n\n## License\n\nApache-2.0\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=inftyai/puma\u0026type=Date)](https://www.star-history.com/#inftyai/puma\u0026Date)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finftyai%2Fpuma","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Finftyai%2Fpuma","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finftyai%2Fpuma/lists"}