{"id":31505653,"url":"https://github.com/boltzmannentropy/vllm-5090","last_synced_at":"2025-10-08T04:35:25.758Z","repository":{"id":317687041,"uuid":"1068270139","full_name":"BoltzmannEntropy/vLLM-5090","owner":"BoltzmannEntropy","description":"vLLM-5090: Docker Container for RTX 5090 on WSL2/Windows","archived":false,"fork":false,"pushed_at":"2025-10-02T11:12:39.000Z","size":1272,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-02T13:25:45.864Z","etag":null,"topics":["5090","cuda","docker","vllm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BoltzmannEntropy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-02T05:56:26.000Z","updated_at":"2025-10-02T12:38:01.000Z","dependencies_parsed_at":"2025-10-02T13:26:04.603Z","dependency_job_id":"edc1b47c-0278-4f82-bae8-ef4b4689d205","html_url":"https://github.com/BoltzmannEntropy/vLLM-5090","commit_stats":null,"previous_names":["boltzmannentropy/vllm-5090"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/BoltzmannEntropy/vLLM-5090","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BoltzmannEntropy%2FvLLM-5090","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BoltzmannEntropy%2FvLLM-5090/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Boltz
mannEntropy%2FvLLM-5090/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BoltzmannEntropy%2FvLLM-5090/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BoltzmannEntropy","download_url":"https://codeload.github.com/BoltzmannEntropy/vLLM-5090/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BoltzmannEntropy%2FvLLM-5090/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278063119,"owners_count":25923594,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-02T02:00:08.890Z","response_time":67,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["5090","cuda","docker","vllm"],"created_at":"2025-10-02T20:09:17.755Z","updated_at":"2025-10-02T20:09:19.234Z","avatar_url":"https://github.com/BoltzmannEntropy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/light-mode-logo.svg\" alt=\"metalQwen3 Logo\" width=\"400\"/\u003e\n  \u003cbr\u003e\n\u003c/div\u003e\n\n\n# vLLM-5090: Docker Container for RTX 5090 on WSL2/Windows\n\n**👨‍💻 Author**: Shlomo Kashani\n**🏫 Affiliation**: Johns Hopkins University, Maryland U.S.A.\n**🏢 Organization**: 
QNeura.ai\n\n[![Docker](https://img.shields.io/badge/docker-%230db7ed.svg?style=for-the-badge\u0026logo=docker\u0026logoColor=white)](https://docker.com)\n[![WSL2](https://img.shields.io/badge/WSL2-0078D4?style=for-the-badge\u0026logo=windows\u0026logoColor=white)](https://docs.microsoft.com/en-us/windows/wsl/)\n[![RTX 5090](https://img.shields.io/badge/RTX_5090-76B900?style=for-the-badge\u0026logo=nvidia\u0026logoColor=white)](https://nvidia.com)\n[![Linux](https://img.shields.io/badge/Linux-FCC624?style=for-the-badge\u0026logo=linux\u0026logoColor=black)](https://linux.org)\n\nA pre-configured Docker environment specifically built for running vLLM on NVIDIA RTX 5090 GPUs within Windows WSL2. Includes demonstration applications showing vLLM's capabilities with vision-language models for video analysis and image processing.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/03.png\" alt=\"metalQwen3 Logo\" width=\"400\"/\u003e\n  \u003cbr\u003e\n\u003c/div\u003e\n\n## 🚀 What This Solves\n\nRunning large language models on cutting-edge GPUs like the RTX 5090 in WSL2 can be challenging due to dependency management and environment conflicts. 
This project provides a ready-to-use Docker container that eliminates setup headaches, allowing WSL2 users to run vLLM immediately on their 5090 GPUs.\n\n## 🎯 Key Features\n\n- **RTX 5090 Ready**: Containerized environment tuned for 32GB GDDR7 memory and the Blackwell architecture\n- **WSL2 First**: Optimized Docker configuration for Windows Subsystem for Linux 2\n- **Zero Setup**: Pre-built with all dependencies, CUDA 12.8, and PyTorch 2.7.0\n- **Multi-Platform**: Works on both Windows WSL2 and native Linux systems\n- **Vision-Language Demo**: Includes video captioning and image analysis examples to verify functionality\n- **OpenAI Compatible**: Provides an OpenAI-compatible API server for easy integration\n- **High Performance**: Memory and thread optimizations for maximum throughput\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/02.png\" alt=\"metalQwen3 Logo\" width=\"400\"/\u003e\n  \u003cbr\u003e\n\u003c/div\u003e\n\n## 📋 System Requirements\n\n### Hardware Requirements\n- **GPU**: NVIDIA RTX 5090 with 32GB GDDR7 VRAM\n- **RAM**: Minimum 32GB system RAM (64GB recommended)\n- **Storage**: 100GB+ free space for models and Docker images\n\n### Software Requirements\n\n#### Windows with WSL2\n- Windows 11 (Build 22000 or later)\n- WSL2 with Ubuntu 20.04+ or compatible distribution\n- NVIDIA GPU drivers with WSL2 support\n- Docker Desktop with WSL2 backend\n\n#### Linux (Ubuntu/Debian)\n- Ubuntu 20.04+ or compatible distribution\n- NVIDIA GPU drivers (550+ series recommended)\n- Docker CE/EE\n\n### Dependencies\n- Docker 24.0+\n- NVIDIA Container Toolkit (nvidia-docker2)\n- Python 3.11\n- CUDA 12.8\n- PyTorch 2.7.0\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/01.png\" alt=\"metalQwen3 Logo\" width=\"400\"/\u003e\n  \u003cbr\u003e\n\u003c/div\u003e\n\n## 🛠️ Quick Start\n\n### Option 1: Docker Build (Recommended)\n\n1. **Clone and navigate to project directory:**\n   ```bash\n   git clone \u003crepository-url\u003e\n   cd vllm-5090\n   ```\n\n2. 
**Build the optimized Docker image:**\n   ```bash\n   # Windows\n   build.bat\n\n   # Linux/Mac\n   docker build -t vllm-small-5090 .\n   ```\n\n3. **Run the container:**\n   ```bash\n   # Windows (run-d.bat)\n   run-d.bat\n\n   # Linux/Mac\n   docker run --gpus all --rm -it \\\n     --shm-size=8gb --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \\\n     --memory=16g --env=DISPLAY \\\n     -p 8000:8000 -p 8078:7842 -p 7861:7860 -p 8502:8501 \\\n     -v $(pwd):/root/app \\\n     -v ~/cache:/root/.cache \\\n     vllm-small-5090:latest\n   ```\n\n### Option 2: Manual Installation\n\n1. **Install vLLM from source (in container or host):**\n   ```bash\n   # Install build dependencies\n   pip install -r __vllm/requirements/build.txt\n   pip install setuptools_scm\n\n   # Build vLLM with CUDA support\n   cd __vllm\n   python use_existing_torch.py\n   pip install --no-build-isolation -v -e .\n   ```\n\n2. **Install additional dependencies:**\n   ```bash\n   pip install qwen-vl-utils accelerate gradio gradio_toggle \\\n       openai beautifulsoup4 ftfy bitsandbytes datasets optimum \\\n       auto-gptq soundfile librosa webrtcvad\n   pip install -U transformers spaces modelscope\n   ```\n\n## 🎯 Usage\n\n### Demo Applications\n\nThis project includes two demo applications for person detection:\n\n1. **Direct vLLM Integration** (`app-vllm-gradio.py`): Loads vLLM directly in Python with a Gradio interface. This approach gives you full control over the model and is ideal for custom deployments.\n\n2. **OpenAI-Compatible Server** (`app-openapi-vllm.py`): Uses the OpenAI-compatible API server. 
This approach is better for production deployments and allows easy integration with existing OpenAI-based applications.\n\n### Starting the vLLM Server\n\n#### Gemma-3 Model (Recommended for RTX 5090)\n```bash\npython3 -m vllm.entrypoints.openai.api_server \\\n    --model google/gemma-3-12b-it \\\n    --port 8000 \\\n    --max_model_len 18000 \\\n    --tensor-parallel-size 1 \\\n    --gpu_memory_utilization 0.95 \\\n    --max_num_seqs 1 \\\n    --enforce-eager\n```\n\n#### Qwen2.5-VL Model\n```bash\npython3 -m vllm.entrypoints.openai.api_server \\\n    --model Qwen/Qwen2.5-VL-7B-Instruct \\\n    --port 8000 \\\n    --max_model_len 24000 \\\n    --tensor-parallel-size 1 \\\n    --gpu_memory_utilization 0.95 \\\n    --max_num_seqs 16 \\\n    --enforce-eager\n```\n\n#### JoyCaption Model\n```bash\npython3 -m vllm.entrypoints.openai.api_server \\\n    --model fancyfeast/llama-joycaption-beta-one-hf-llava \\\n    --port 8000 \\\n    --max_model_len 24000 \\\n    --tensor-parallel-size 1 \\\n    --gpu_memory_utilization 0.95 \\\n    --max_num_seqs 16 \\\n    --enforce-eager\n```\n\n### Video Captioning Web UI\n\n1. **Start the server** as described above\n2. **Launch the Gradio interface:**\n   ```bash\n   python app-opernai-vid-cap-vllm.py\n   ```\n\n3. 
**Access the web interface** at `http://localhost:7860`\n\n### API Usage Example\n\n```python\nimport base64\n\nimport requests\n\n# Server configuration (the vLLM OpenAI-compatible server started above)\nVLLM_SERVER = \"http://localhost:8000\"\n\ndef encode_image(image_path):\n    \"\"\"Read an image file and return its contents as a base64 string.\"\"\"\n    with open(image_path, \"rb\") as image_file:\n        return base64.b64encode(image_file.read()).decode('utf-8')\n\n# Single image analysis\nimage_base64 = encode_image(\"image.jpg\")\npayload = {\n    \"model\": \"google/gemma-3-12b-it\",\n    \"messages\": [{\n        \"role\": \"user\",\n        \"content\": [\n            {\"type\": \"text\", \"text\": \"Describe this image in detail.\"},\n            {\"type\": \"image_url\", \"image_url\": {\"url\": f\"data:image/jpeg;base64,{image_base64}\"}}\n        ]\n    }],\n    \"max_tokens\": 1000\n}\n\nresponse = requests.post(f\"{VLLM_SERVER}/v1/chat/completions\", json=payload)\nresponse.raise_for_status()  # Fail early on HTTP errors\nprint(response.json()['choices'][0]['message']['content'])\n```\n\n## 📖 Supported Models\n\n| Model | Size | Context Length | Best For |\n|-------|------|----------------|----------|\n| `google/gemma-3-12b-it` | 12B | 18K | General purpose, optimal for 5090 |\n| `Qwen/Qwen2.5-VL-7B-Instruct` | 7B | 24K | Video analysis, detailed descriptions |\n| `fancyfeast/llama-joycaption-beta-one-hf-llava` | 7B | 24K | Creative image/video captioning |\n\n## ⚙️ Configuration Options\n\n### Environment Variables\n\n| Variable | Default | Description |\n|----------|---------|-------------|\n| `MAX_JOBS` | `16` | Number of parallel build jobs |\n| `NVCC_THREADS` | `4` | CUDA compilation threads |\n| `TORCH_CUDA_ARCH_LIST` | `'12.0+PTX'` | Target CUDA architectures |\n| `FLASH_ATTN_CUDA_ARCHS` | `120` | FlashAttention architectures |\n| `GPU_MEMORY_UTILIZATION` | `0.95` | GPU memory usage ratio |\n\n### Performance Tuning for RTX 5090\n\n```bash\n# Optimal settings for 32GB GDDR7 memory\n--gpu_memory_utilization 0.95 \\\n--max_num_seqs 1 \\\n--enforce-eager \\\n--tensor-parallel-size 1\n```\n\n## 🐳 Docker 
Configuration\n\nThe Dockerfile is optimized for:\n- **Base Image**: `pytorch/pytorch:2.7.0-cuda12.8-cudnn9-devel`\n- **Python Version**: 3.11\n- **Memory Management**: Shared memory and IPC configuration for video processing\n- **GPU Access**: NVIDIA Container Toolkit integration\n- **Development Tools**: Includes VS Code server, Node.js, npm\n\n### Build Customization\n\n```dockerfile\n# Add custom models or dependencies\nRUN pip install your-custom-package\n\n# Modify model configurations\nENV GPU_MEMORY_UTILIZATION=0.90\n```\n\n## 🔧 Troubleshooting\n\n### Common Issues\n\n**1. CUDA Out of Memory**\n```bash\n# Reduce memory utilization\n--gpu_memory_utilization 0.8\n\n# Use smaller batch sizes\n--max_num_seqs 1\n\n# Reduce allocator memory fragmentation\nexport PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512\n```\n\n**2. Port Conflicts**\n```bash\n# Change default ports\n-p 8001:8000  # API server\n-p 7861:7860  # Gradio UI\n```\n\n**3. WSL2 GPU Issues**\n```bash\n# Ensure NVIDIA drivers are installed in WSL2\nnvidia-smi  # Should show GPU info\n\n# Update WSL2 kernel\nwsl --update --web-download\n```\n\n**4. Model Download Issues**\n```bash\n# Set HuggingFace cache\nexport HF_HOME=/root/.cache/huggingface\n\n# Login to HuggingFace (if needed)\nhuggingface-cli login\n```\n\n### Performance Optimization\n\n- **Model Quantization**: Consider AWQ/INT4 quantization for larger models\n- **Batch Size**: Start with `max_num_seqs=1` and increase gradually\n- **Memory Pre-allocation**: Use `--enforce-eager` for predictable memory usage\n- **Cache Management**: Mount model cache volume for faster startups\n\n## 📊 Performance Benchmarks\n\n### RTX 5090 Performance (Estimated)\n\n| Model | Memory Usage | Token/s | Max Context |\n|-------|--------------|---------|-------------|\n| Gemma-3-12B | ~28GB | 50-80 | 18K |\n| Qwen2.5-VL-7B | ~16GB | 80-120 | 24K |\n| JoyCaption-7B | ~18GB | 70-100 | 24K |\n\n*Figures are estimates based on standard video analysis prompts. 
Results may vary based on specific use case.*\n\n## 🤝 Contributing\n\nWe welcome contributions! Please follow these guidelines:\n\n1. **Fork** the repository\n2. **Create** a feature branch\n3. **Test** thoroughly on both WSL2 and Linux\n4. **Submit** a pull request with detailed description\n\n### Development Setup\n\n```bash\n# Install development dependencies\npip install -r __vllm/requirements/dev.txt\n\n# Run tests\npython -m pytest tests/\n```\n\n## 📄 License\n\nThis project builds upon [vLLM](https://github.com/vllm-project/vllm) and includes various open-source components. Please refer to the individual licenses:\n- vLLM: Apache 2.0 License\n- PyTorch: BSD-style License\n- NVIDIA Components: NVIDIA Software License\n\n## 🙋 Support \u0026 Community\n\n- **Issues**: [GitHub Issues](https://github.com/your-repo/issues)\n- **Discussions**: [GitHub Discussions](https://github.com/your-repo/discussions)\n- **vLLM Community**: [Slack](https://slack.vllm.ai)\n- **Documentation**: [vLLM Docs](https://docs.vllm.ai)\n\n## 📝 Citation\n\nIf you use vLLM-5090 in your research or projects, please cite the original vLLM paper:\n\n```bibtex\n@inproceedings{kwon2023efficient,\n  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},\n  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. 
Gonzalez and Hao Zhang and Ion Stoica},\n  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},\n  year={2023}\n}\n```\n\n## 🔄 Changelog\n\n### Latest Updates\n- **v1.0 (2025-10-02)**: Docker container for RTX 5090 WSL2 deployment\n  - Ready-to-run vLLM container optimized for 5090 architecture\n  - Seamless WSL2 integration eliminating complex setup\n  - Includes video captioning demo to verify functionality\n  - CUDA 12.8, PyTorch 2.7.0, and complete dependency stack\n\n---\n\n*Built with ❤️ for high-performance AI inference on cutting-edge hardware.*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fboltzmannentropy%2Fvllm-5090","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fboltzmannentropy%2Fvllm-5090","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fboltzmannentropy%2Fvllm-5090/lists"}