{"id":25721362,"url":"https://github.com/oeo/llm-accelerate-gpu-flash","last_synced_at":"2026-05-16T06:02:57.814Z","repository":{"id":279153456,"uuid":"937878481","full_name":"oeo/llm-accelerate-gpu-flash","owner":"oeo","description":"A FastAPI-based server implementation for the DeepSeek R1 Distill Qwen 32B","archived":false,"fork":false,"pushed_at":"2025-02-24T03:59:38.000Z","size":0,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-24T04:30:48.699Z","etag":null,"topics":["accelerator","fastapi","huggingface","nvidia","openai-api"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oeo.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-24T03:56:47.000Z","updated_at":"2025-02-24T04:00:31.000Z","dependencies_parsed_at":"2025-02-24T04:30:52.420Z","dependency_job_id":"786f4293-cc1b-4645-889e-248d1e497722","html_url":"https://github.com/oeo/llm-accelerate-gpu-flash","commit_stats":null,"previous_names":["oeo/llm-accelerate-gpu-flash"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oeo%2Fllm-accelerate-gpu-flash","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oeo%2Fllm-accelerate-gpu-flash/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oeo%2Fllm-accelerate-gpu-flash/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oeo%2Fllm-accelerate-gpu-flash/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oeo","download_url":"https://codeload.github.com/oeo/llm-accelerate-gpu-flash/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240729184,"owners_count":19848120,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["accelerator","fastapi","huggingface","nvidia","openai-api"],"created_at":"2025-02-25T18:48:37.573Z","updated_at":"2026-05-16T06:02:53.967Z","avatar_url":"https://github.com/oeo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DeepSeek R1 LLM Server\n\nA FastAPI-based server implementation for DeepSeek R1 language models, providing OpenAI-compatible API endpoints for chat completions and text generation. This server supports multiple models with optimized GPU memory management and automatic model loading.\n\n## Features\n\n- OpenAI-compatible API endpoints (`/v1/chat/completions`)\n- Multi-GPU support with optimized memory management\n- Automatic model loading and intelligent unloading\n- 8-bit quantization for efficient memory usage\n- Flash Attention 2 support\n- Real-time monitoring and health checks\n- NUMA-aware optimizations\n- Streaming and non-streaming responses\n- Optimized for NVIDIA A2 GPUs\n- Automatic device mapping for multi-GPU setups\n- Comprehensive monitoring and diagnostics\n- Production-ready error handling\n- Thermal management and overload protection\n- Intelligent model persistence and unloading\n- Extensive testing suite with stress testing capabilities\n\n## Model Persistence Strategy\n\nThe server implements an intelligent model persistence strategy:\n\n1. **Core Models (Always Loaded)**\n   - `deepseek-r1-distil-32b`: Primary large model\n   - `deepseek-r1-distil-14b`: Balanced performance model\n   - `deepseek-r1-distil-8b`: Fast routing model\n\n2. **Specialized Models (Dynamic Loading)**\n   - `mixtral-8x7b`: Unloads after 1 hour of inactivity\n   - `idefics-80b`: Unloads after 30 minutes of inactivity\n   - `fuyu-8b`: Unloads immediately after use\n\n## Thermal Management\n\nAdvanced thermal protection features:\n\n```yaml\nthermal_management:\n  max_temperature: 87°C    # Maximum safe temperature\n  thermal_throttle: 82°C   # Start throttling\n  critical_temp: 85°C     # Pause new requests\n  throttle_steps:\n    - {temp: 82°C, max_concurrent: 3}\n    - {temp: 84°C, max_concurrent: 2}\n    - {temp: 85°C, max_concurrent: 1}\n```\n\n## Testing Capabilities\n\n### 1. Stress Testing\n```bash\n# Run comprehensive stress test\n./tests/stress_test.sh\n```\nFeatures:\n- Sequential and parallel model loading\n- Throughput measurement (tokens/sec)\n- Rate limiting validation\n- Memory usage patterns\n- Temperature monitoring\n- VRAM utilization tracking\n\n### 2. Curl Tests\n```bash\n# Run API endpoint tests\n./tests/curl_tests.sh\n```\nTests:\n- Model loading/unloading\n- Chat completions\n- Streaming responses\n- Error handling\n- Rate limiting\n\n### 3. Load Testing\n```bash\n# Run load test with custom parameters\npython tests/load_test.py --concurrent 3 --requests 5 --delay 0.1\n```\nCapabilities:\n- Concurrent request handling\n- Response time measurement\n- Success rate tracking\n- Resource utilization monitoring\n\n## Table of Contents\n- [Prerequisites](#prerequisites)\n- [Installation](#installation)\n- [Configuration](#configuration)\n- [Running Models](#running-models)\n- [API Usage](#api-usage)\n- [Monitoring](#monitoring)\n- [Performance Optimization](#performance-optimization)\n- [Troubleshooting](#troubleshooting)\n- [Contributing](#contributing)\n- [License](#license)\n\n## Prerequisites\n\nRequired software:\n- Python 3.8+\n- PyTorch 2.2.0+\n- CUDA-capable GPUs (4x NVIDIA A2 16GB recommended)\n- 64GB+ System RAM\n\nOptional but recommended:\n- NVIDIA Container Toolkit (for Docker deployment)\n- NUMA-enabled system\n- PCIe Gen4 support\n\n## Installation\n\n1. Clone the repository\n\n2. Install dependencies:\n```bash\npip install -r requirements.txt\n```\n\n3. Verify GPU setup:\n```bash\npython3 -c \"import torch; print(f'CUDA available: {torch.cuda.is_available()}, Device count: {torch.cuda.device_count()}')\"\n```\n\n4. Run system optimization:\n```bash\nsudo ./scripts/gpu-optimize.sh\n```\n\n## Docker Deployment\n\nThe server can be run in a Docker container with full GPU and NUMA support.\n\n### Prerequisites\n- Docker 20.10+\n- NVIDIA Container Toolkit\n- Docker Compose v2.x\n\n### Configuration Files\n\n1. **Dockerfile**:\n```dockerfile\nFROM nvidia/cuda:12.1.0-runtime-ubuntu22.04\n\n# Set environment variables\nENV DEBIAN_FRONTEND=noninteractive \\\n    PYTHONUNBUFFERED=1 \\\n    LANG=C.UTF-8 \\\n    NVIDIA_VISIBLE_DEVICES=all \\\n    NVIDIA_DRIVER_CAPABILITIES=compute,utility\n\n# Install system dependencies\nRUN apt-get update \u0026\u0026 apt-get install -y \\\n    python3 \\\n    python3-pip \\\n    python3-dev \\\n    numactl \\\n    nvidia-utils-535 \\\n    curl \\\n    jq \\\n    bc \\\n    \u0026\u0026 rm -rf /var/lib/apt/lists/*\n\n# Set Python aliases\nRUN ln -sf /usr/bin/python3 /usr/bin/python \u0026\u0026 \\\n    ln -sf /usr/bin/pip3 /usr/bin/pip\n\nWORKDIR /app\nCOPY requirements.txt .\nRUN pip install --no-cache-dir -r requirements.txt\nCOPY . .\nRUN chmod +x scripts/*.sh scripts/models/*.sh tests/*.sh\n\nCMD [\"./scripts/start-server.sh\"]\n```\n\n2. **docker-compose.yml**:\n```yaml\nversion: '3.8'\n\nservices:\n  llm-server:\n    build: .\n    runtime: nvidia\n    ports:\n      - \"8000:8000\"\n    volumes:\n      - /tmp/models:/app/models  # Persistent storage for downloaded models\n    environment:\n      - NVIDIA_VISIBLE_DEVICES=all\n      - NVIDIA_DRIVER_CAPABILITIES=compute,utility\n      - CUDA_VISIBLE_DEVICES=0,1,2,3\n      - CUDA_DEVICE_MAX_CONNECTIONS=1\n      - CUDA_DEVICE_ORDER=PCI_BUS_ID\n      - PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128,expandable_segments:True\n      - TORCH_DISTRIBUTED_DEBUG=INFO\n      - TORCH_SHOW_CPP_STACKTRACES=1\n      - NUMA_GPU_NODE_PREFERRED=1\n    deploy:\n      resources:\n        reservations:\n          devices:\n            - driver: nvidia\n              count: all\n              capabilities: [gpu]\n    ulimits:\n      memlock: -1\n      stack: 67108864\n    privileged: true  # Needed for NUMA control\n    network_mode: host  # Better performance for GPU communication\n```\n\n### Running with Docker\n\n1. Build the container:\n```bash\ndocker-compose build\n```\n\n2. Start the server:\n```bash\ndocker-compose up -d\n```\n\n3. View logs:\n```bash\ndocker-compose logs -f\n```\n\n4. Run tests:\n```bash\ndocker-compose exec llm-server ./tests/stress_test.sh\n```\n\n### Docker Features\n- NVIDIA GPU support with proper driver mapping\n- NUMA-aware CPU and memory allocation\n- Persistent model storage\n- Optimized network performance with host networking\n- Proper resource limits and GPU capabilities\n- Python 3 environment with all dependencies\n\n## Configuration\n\n### Directory Structure\n\n```\n.\n├── scripts/\n│   ├── gpu-optimize.sh     # GPU optimization script\n│   ├── start-server.sh     # Server startup script with NUMA optimizations\n│   ├── monitor.sh          # Standalone monitoring script\n│   └── models/\n│       ├── model-template.sh  # Base template for model scripts\n│       └── run-model.sh      # Generic model runner script\n├── lib/\n│   └── models/            # Model library and configurations\n│       ├── __init__.py\n│       ├── base_config.py\n│       ├── model_configs.py\n│       └── model_manager.py\n├── config.yml             # Main configuration file\n├── server.py             # Main server implementation\n└── requirements.txt      # Python dependencies\n```\n\n### Configuration File (config.yml)\n\nThe `config.yml` file controls all aspects of the server and models:\n\n```yaml\n# Server Configuration\nserver:\n  host: \"0.0.0.0\"\n  port: 8000\n  workers: 1\n  auto_load_enabled: true  # Enable automatic model loading\n\n# GPU Configuration\ngpu:\n  devices: [0, 1, 2, 3]  # Using all 4 GPUs\n  memory_per_gpu: \"15GB\"  # Leave 1GB headroom\n  optimization:\n    torch_compile: true\n    mixed_precision: \"bf16\"\n    attn_implementation: \"flash_attention_2\"\n\n# Model Configurations\nmodels:\n  deepseek-r1-distil-32b:\n    name: \"DeepSeek-R1-Distill-32B\"\n    auto_load: false  # Don't load on startup\n    device_map: {...}  # Distributed across GPUs\n    \n  deepseek-r1-distil-14b:\n    name: \"DeepSeek-R1-Distill-14B\"\n    auto_load: true   # Load on startup\n    device: \"auto\"    # Automatic device mapping\n    \n  deepseek-r1-distil-8b:\n    name: \"DeepSeek-R1-Distill-8B\"\n    auto_load: true   # Load on startup\n    device: 0         # Single GPU\n```\n\n### Auto-Loading Models\n\nModels can be configured to load automatically when the server starts:\n\n1. Set `auto_load_enabled: true` in server config\n2. Configure `auto_load: true` for specific models\n3. Models will load during server initialization\n\n### Environment Variables\n\nThe server respects the following environment variables:\n\n```bash\n# GPU Configuration\nCUDA_VISIBLE_DEVICES=\"0,1,2,3\"\nCUDA_DEVICE_MAX_CONNECTIONS=\"1\"\nCUDA_DEVICE_ORDER=\"PCI_BUS_ID\"\n\n# PyTorch Optimization\nPYTORCH_CUDA_ALLOC_CONF=\"max_split_size_mb:128,expandable_segments:True\"\nTORCH_DISTRIBUTED_DEBUG=\"INFO\"\nTORCH_SHOW_CPP_STACKTRACES=\"1\"\n\n# NUMA Configuration\nNUMA_GPU_NODE_PREFERRED=\"1\"\n```\n\n## Running Models\n\n### Using the Start Script\n\nThe `start-server.sh` script provides optimized startup:\n\n```bash\n# Start server with NUMA optimizations\n./scripts/start-server.sh\n```\n\n### Using the Model Runner\n\nThe `run-model.sh` script provides flexible model management:\n\n```bash\n# Basic usage\n./scripts/models/run-model.sh --model MODEL_NAME\n\n# Examples:\n# Run 8B model with custom port\n./scripts/models/run-model.sh --model deepseek-r1-distil-8b --port 8001\n\n# Run 14B model with specific thread count\n./scripts/models/run-model.sh --model deepseek-r1-distil-14b --threads 32\n```\n\n### GPU Optimization\n\nRun the GPU optimization script before starting the server:\n\n```bash\n# Optimize GPU settings\nsudo ./scripts/gpu-optimize.sh\n```\n\nThis script:\n- Sets optimal GPU clock speeds\n- Configures power limits\n- Optimizes PCIe settings\n- Sets up NUMA affinities\n\n### Model Memory Requirements\n\nRecommended GPU configurations:\n\n1. **DeepSeek-R1-Distill-32B**\n   - Distributed across 4 GPUs\n   - ~15GB per GPU\n   - 4K context length\n   - Best for high-accuracy tasks\n   - Recommended for: Complex reasoning, code generation\n\n2. **DeepSeek-R1-Distill-14B**\n   - Auto device mapping\n   - ~30GB total VRAM\n   - 8K context length\n   - Good balance of performance/memory\n   - Recommended for: General use, balanced performance\n\n3. **DeepSeek-R1-Distill-8B**\n   - Single GPU\n   - ~15GB VRAM\n   - 16K context length\n   - Fastest inference\n   - Recommended for: High-throughput, real-time applications\n\n## API Usage\n\n### List Available Models\n```bash\ncurl http://localhost:8000/v1/models\n```\n\n### Load a Model\n```bash\ncurl -X POST http://localhost:8000/v1/models/deepseek-r1-distil-8b/load\n```\n\n### Generate Response\n```bash\ncurl -X POST http://localhost:8000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"deepseek-r1-distil-8b\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],\n    \"temperature\": 0.7,\n    \"stream\": false\n  }'\n```\n\n### Stream Response\n```bash\ncurl -X POST http://localhost:8000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"deepseek-r1-distil-8b\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],\n    \"stream\": true\n  }'\n```\n\n### API Parameters\n\nThe chat completion endpoint accepts the following parameters:\n\n```json\n{\n  \"model\": \"string\",          // Model ID to use\n  \"messages\": [              // Array of messages\n    {\n      \"role\": \"string\",      // \"system\", \"user\", or \"assistant\"\n      \"content\": \"string\"    // Message content\n    }\n  ],\n  \"temperature\": 0.7,        // 0.0 to 2.0 (default: 0.7)\n  \"top_p\": 0.95,            // 0.0 to 1.0 (default: 0.95)\n  \"max_tokens\": 2048,        // Maximum tokens to generate\n  \"stream\": false,           // Enable streaming responses\n  \"stop\": [\"string\"]        // Optional array of stop sequences\n}\n```\n\n## Monitoring\n\n### Built-in Monitoring\n\nThe `/health` endpoint provides comprehensive monitoring:\n```bash\ncurl http://localhost:8000/health\n```\n\nReturns:\n- GPU memory usage per device\n- System memory status\n- Loaded models\n- Device mapping\n- Process statistics\n- NUMA topology\n\n### Monitoring Script\n\nUse the dedicated monitoring script:\n```bash\n./scripts/monitor.sh\n```\n\nFeatures:\n- Real-time GPU metrics\n- Memory usage tracking\n- NUMA status\n- PCIe bandwidth\n- Process information\n- Temperature monitoring\n- Power consumption\n\n## Performance Optimization\n\n### GPU Memory Optimization\n\n1. **Memory Allocation**\n   - Use 8-bit quantization\n   - Enable gradient checkpointing\n   - Implement proper cleanup\n\n2. **Device Mapping**\n   - Balance model layers across GPUs\n   - Consider NUMA topology\n   - Optimize for PCIe bandwidth\n\n3. **Inference Settings**\n   - Use Flash Attention 2\n   - Enable torch.compile\n   - Implement proper batching\n\n### NUMA Optimization\n\n1. **CPU Affinity**\n   - Bind processes to NUMA nodes\n   - Align with GPU placement\n   - Optimize thread count\n\n2. **Memory Access**\n   - Local memory allocation\n   - Minimize cross-node traffic\n   - Monitor bandwidth\n\n## Troubleshooting\n\n### Common Issues\n\n1. **GPU Memory Issues**\n   - Check GPU memory allocation in config\n   - Use `nvidia-smi` to monitor usage\n   - Consider using a smaller model\n   - Check device mapping configuration\n\n2. **Model Loading Failures**\n   - Verify auto-load settings\n   - Check GPU memory availability\n   - Review server logs for errors\n   - Ensure correct device mapping\n\n3. **Performance Issues**\n   - Run GPU optimization script\n   - Check NUMA configuration\n   - Monitor PCIe bandwidth\n   - Review thread settings\n\n### Best Practices\n\n1. **Memory Management**\n   - Leave 1GB headroom per GPU\n   - Use appropriate device mapping\n   - Monitor memory usage\n   - Clean up unused models\n\n2. **GPU Optimization**\n   - Run optimization script before starting\n   - Use NUMA-aware configurations\n   - Monitor GPU temperatures\n   - Check PCIe bandwidth\n\n3. **Model Selection**\n   - Use 32B for high accuracy (4K context)\n   - Use 14B for balanced performance (8K context)\n   - Use 8B for speed (16K context)\n   - Consider auto-loading frequently used models\n\n## Contributing\n\nContributions are welcome! Please read our [Contributing Guide](CONTRIBUTING.md) for details on our code of conduct and the process for submitting pull requests.\n\n### Development Setup\n\n1. Fork the repository\n2. Create a virtual environment\n3. Install development dependencies:\n```bash\npip install -r requirements-dev.txt\n```\n4. Run tests:\n```bash\npytest tests/\n```\n\n## License\n\nMIT\n\n## Acknowledgments\n\n- DeepSeek AI for the DeepSeek R1 models\n- Hugging Face Transformers library\n- FastAPI framework\n- NVIDIA for GPU optimization guidance\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foeo%2Fllm-accelerate-gpu-flash","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foeo%2Fllm-accelerate-gpu-flash","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foeo%2Fllm-accelerate-gpu-flash/lists"}