https://github.com/rahulunair/xpu_tgi
TGI server setup for Intel Data Centre GPUs
intel intelgpu llm llm-inference tgi xpu
- Host: GitHub
- URL: https://github.com/rahulunair/xpu_tgi
- Owner: rahulunair
- License: apache-2.0
- Created: 2024-11-15T21:38:08.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-11-26T00:11:01.000Z (6 months ago)
- Last Synced: 2025-01-19T21:48:25.239Z (4 months ago)
- Topics: intel, intelgpu, llm, llm-inference, tgi, xpu
- Language: Shell
- Homepage:
- Size: 452 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
- License: LICENSE
# TGI Models Collection
Welcome to `xpu_tgi`! 🚀
A curated collection of Text Generation Inference (TGI) models optimized for Intel XPU, with built-in security and traffic management.
## Quick Start
```bash
# 1. Generate authentication token
python utils/generate_token.py

# 2. Start a model
./start.sh Flan-T5-XXL

# 3. Make a request
curl -X POST http://localhost:8000/generate \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"inputs": "What is quantum computing?", "parameters": {"max_new_tokens": 50}}'
```

## Architecture & Security
```mermaid
flowchart LR
    Client([Client])
    Traefik[Traefik Proxy]
    Auth[Auth Service]
    TGI[TGI Service]

    Client --> Traefik
    Traefik --> Auth
    Auth --> Traefik
    Traefik --> TGI
    TGI --> Traefik
    Traefik --> Client

    subgraph Internal["Internal Network"]
        Traefik
        Auth
        TGI
    end

    classDef client fill:#f2d2ff,stroke:#9645b7,stroke-width:2px;
    classDef proxy fill:#bbdefb,stroke:#1976d2,stroke-width:2px;
    classDef auth fill:#c8e6c9,stroke:#388e3c,stroke-width:2px;
    classDef tgi fill:#ffccbc,stroke:#e64a19,stroke-width:2px;
    classDef network fill:#fff9c4,stroke:#fbc02d,stroke-width:1px;

    class Client client;
    class Traefik proxy;
    class Auth auth;
    class TGI tgi;
    class Internal network;
```
### Key Features
- 🔒 Token-based authentication with automatic ban after failed attempts
- 🚦 Rate limiting (global: 10 req/s, per-IP: 10 req/s)
- 🛡️ Security headers and IP protection
- 🔄 Health monitoring and automatic recovery
- 🚀 Optimized for Intel GPUs

## Available Models
### Long Context Models (>8k tokens)
- **Phi-3-mini-128k** - 128k context window
- **Hermes-3-llama3.1** - 8k context window

### Code Generation
- **CodeLlama-7b** - Specialized for code completion
- **Phi-3-mini-4k** - Efficient code generation

### General Purpose
- **Flan-T5-XXL** - Versatile text generation
- **Flan-UL2** - Advanced language understanding
- **Hermes-2-pro** - Balanced performance
- **OpenHermes-Mistral** - Fast inference

Each model includes:
- Individual configuration (`config/model.env`)
- Detailed documentation (`README.md`)
- Optimized parameters for Intel XPU

## Security & Configuration
### Authentication
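The script `utils/generate_token.py` is not reproduced in this README. As a rough sketch only, a URL-safe token of the same shape as the example output in this section could be produced and checked with Python's standard library. The names `new_token` and `token_matches` are hypothetical, and the use of `secrets.token_urlsafe(32)` is an assumption about the approach, not a description of the actual script.

```python
import hmac
import secrets


def new_token(num_bytes: int = 32) -> str:
    """Return a random URL-safe token (32 random bytes encode to 43 characters)."""
    return secrets.token_urlsafe(num_bytes)


def token_matches(presented: str, expected: str) -> bool:
    """Compare two tokens in constant time to avoid leaking timing information."""
    return hmac.compare_digest(presented.encode(), expected.encode())


if __name__ == "__main__":
    token = new_token()
    print(f"Token: {token}")
```

The repository's own script is invoked as follows: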
```bash
# Generate secure token (admin)
python utils/generate_token.py

# Example output:
# --------------------------------------------------------------------------------
# Generated at: 2024-03-22T15:30:45.123456
# Token: XcAwKq7BSbGSoJCsVhUQ2e6MZ4ZOAH_mRR0HgmMNBQg
# --------------------------------------------------------------------------------
```

### Traffic Management
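A rate paired with a burst size is commonly modelled as a token bucket: the bucket holds up to `burst` tokens, refills at `rate` tokens per second, and each request spends one token. The sketch below illustrates that behaviour for the limits in this section (10 req/s, burst 25). It is an illustration under the assumption that the proxy's limiter behaves like a token bucket, not the repository's Traefik configuration; the `TokenBucket` class and its `clock` parameter are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Spend one token if available; return False when the request should be throttled."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=10, burst=25)
    accepted = sum(bucket.allow() for _ in range(40))
    print(f"accepted {accepted} of 40 back-to-back requests")
```

Under this model a client can send a short burst above the average rate, after which requests are admitted only as fast as the bucket refills. The configured limits and headers are: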
```yaml
# Rate Limits
Global: 10 req/s (burst: 25)
Per-IP: 10 req/s (burst: 25)

# Security Headers
- XSS Protection
- Content Type Nosniff
- Frame Deny
- HSTS
```

## API Usage
### Basic Generation
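For scripts that call the service from Python rather than `curl`, a minimal sketch using only the standard library might look like this. It assumes the `/generate` endpoint, bearer-token header, and request shape used in this README, and it assumes the response is a JSON object with a `generated_text` field, as in upstream TGI. The `generate` function and its `opener` parameter are hypothetical names introduced for illustration.

```python
import json
import urllib.request
from typing import Callable


def generate(prompt: str, token: str, max_new_tokens: int = 50,
             base_url: str = "http://localhost:8000",
             opener: Callable = urllib.request.urlopen) -> str:
    """POST a prompt to /generate and return the generated text."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with opener(request, timeout=60) as response:
        body = json.load(response)
    # Assumption: the response is a JSON object with a "generated_text" field.
    return body["generated_text"]
```

With a server running, `generate("What is quantum computing?", token="YOUR_TOKEN")` would return the generated string. The `opener` argument exists so the function can be exercised without a live server. The equivalent `curl` call is: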
```bash
curl -X POST http://localhost:8000/generate \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"inputs": "What is quantum computing?",
"parameters": {"max_new_tokens": 50}
}'
```

### Advanced Parameters
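When sampling parameters are assembled in code, it can help to validate them before sending the request. The sketch below builds the JSON body for `/generate` from a small dataclass. The `SamplingParams` and `build_payload` names are hypothetical, and the accepted ranges are assumptions; check the TGI documentation for the limits your version enforces.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SamplingParams:
    max_new_tokens: int = 100
    temperature: float = 0.7
    top_p: float = 0.95

    def __post_init__(self) -> None:
        # Assumed ranges, not taken from this repository.
        if self.max_new_tokens <= 0:
            raise ValueError("max_new_tokens must be positive")
        if self.temperature <= 0:
            raise ValueError("temperature must be positive")
        if not 0 < self.top_p < 1:
            raise ValueError("top_p must be strictly between 0 and 1")


def build_payload(prompt: str, params: SamplingParams) -> str:
    """Serialize a prompt and its sampling parameters into a /generate request body."""
    return json.dumps({"inputs": prompt, "parameters": asdict(params)})


if __name__ == "__main__":
    print(build_payload("Explain AI", SamplingParams()))
```

The same parameters sent with `curl`: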
```bash
curl -X POST http://localhost:8000/generate \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"inputs": "Explain AI",
"parameters": {
"max_new_tokens": 100,
"temperature": 0.7,
"top_p": 0.95
}
}'
```

### Health Monitoring
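Model loading can take a while after `./start.sh`, so a script may want to wait for the health endpoint before sending requests. A minimal polling sketch, assuming `/health` answers with HTTP 200 once the service is ready; `http_ok` and `wait_until_healthy` are hypothetical helpers, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(probe: Callable[[], bool], attempts: int = 30,
                       delay: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `probe` up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A typical call would be `wait_until_healthy(lambda: http_ok("http://localhost:8000/health"))`. The equivalent one-off checks with `curl`: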
```bash
# System health
curl http://localhost:8000/health

# Model status
curl http://localhost:8000/v1/models
```

## Contributing
Contributions are welcome! Please read our [Contributing Guidelines](CONTRIBUTING.md) first.
## License Notes
Each model has its own license terms. Please review individual model READMEs before use.