{"id":32659724,"url":"https://github.com/elcruzo/cuda-conv","last_synced_at":"2026-05-15T18:08:08.022Z","repository":{"id":321268971,"uuid":"1085153264","full_name":"elcruzo/cuda-conv","owner":"elcruzo","description":"Lightweight CUDA kernel for 2D image convolution achieving 20x+ speedup. Built with CuPy for the NVIDIA Hackathon.","archived":false,"fork":false,"pushed_at":"2025-10-28T19:05:59.000Z","size":102,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-28T19:32:03.028Z","etag":null,"topics":["computer-vision","convolution","cuda","cupy","gpu-computing","hackathon","high-performance-computing","image-processing","nvidia","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/elcruzo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-28T16:49:49.000Z","updated_at":"2025-10-28T19:06:03.000Z","dependencies_parsed_at":"2025-10-28T19:32:05.150Z","dependency_job_id":null,"html_url":"https://github.com/elcruzo/cuda-conv","commit_stats":null,"previous_names":["elcruzo/cuda-conv"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/elcruzo/cuda-conv","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elcruzo%2Fcuda-conv","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elcruzo%2Fcuda-conv/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elcruzo%2Fcuda-conv/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elcruzo%2Fcuda-conv/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/elcruzo","download_url":"https://codeload.github.com/elcruzo/cuda-conv/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elcruzo%2Fcuda-conv/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33074451,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-15T11:35:32.926Z","status":"ssl_error","status_checked_at":"2026-05-15T11:35:31.362Z","response_time":103,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","convolution","cuda","cupy","gpu-computing","hackathon","high-performance-computing","image-processing","nvidia","python"],"created_at":"2025-10-31T15:01:06.120Z","updated_at":"2026-05-15T18:08:08.004Z","avatar_url":"https://github.com/elcruzo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CUDA Convolution Accelerator 🚀\n\nA lightweight CUDA kernel implementation for accelerated 2D image convolution, built with CuPy. Achieves **20x+ speedup** over CPU on large images.\n\n![CUDA](https://img.shields.io/badge/CUDA-12.x%2B-green)\n![Python](https://img.shields.io/badge/Python-3.8%2B-blue)\n![License](https://img.shields.io/badge/License-MIT-yellow)\n\n## Features\n\n- **High Performance**: 20x+ speedup over NumPy/SciPy on 2048×2048 images\n- **Optimized Kernels**: \n  - Naive kernel (direct global memory access)\n  - Optimized kernel (tiled with shared memory)\n- **Easy to Use**: Simple Python API with automatic NumPy/CuPy conversion\n- **Comprehensive**: Includes CLI tool, Streamlit web UI, and Jupyter notebooks\n- **Well Tested**: Full test suite with correctness validation\n- **Preset Filters**: Sobel, Gaussian blur, edge detection, sharpening, and more\n\n## Architecture\n\n```\nCUDA Kernel (optimized)\n    ↓\n  16×16 Tiled Processing\n    ↓\n  Shared Memory Caching\n    ↓\n  Coalesced Global Memory Access\n    ↓\n  Clamp-to-Edge Boundary Handling\n```\n\n## Quick Start\n\n### Running in Google Colab (Recommended)\n\n1. Open `setup_colab.ipynb` in [Google Colab](https://colab.research.google.com/)\n2. Set runtime to GPU: **Runtime → Change runtime type → GPU**\n3. Run all cells to install dependencies and clone the repo\n4. Open `notebooks/01_demo_speed.ipynb` for benchmarks\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elcruzo/cuda-conv/blob/main/setup_colab.ipynb)\n\n### Local Installation (Requires NVIDIA GPU)\n\n```bash\n# Clone repository\ngit clone https://github.com/elcruzo/cuda-conv.git\ncd cuda-conv\n\n# Install dependencies\npip install -r requirements.txt\n\n# Generate sample images\npython3 scripts/generate_sample_images.py\n\n# Run tests\npytest tests/ -v\n```\n\n## Usage\n\n### Python API\n\n```python\nfrom src.api import convolve\nfrom src.presets import get_kernel\nimport numpy as np\nfrom PIL import Image\n\n# Load image\nimage = np.array(Image.open('data/lena.png'), dtype=np.float32) / 255.0\n\n# Apply Gaussian blur\nkernel = get_kernel('gaussian')\nresult = convolve(image, kernel, use_shared_mem=True)\n\n# Save result\nImage.fromarray((result * 255).astype(np.uint8)).save('blurred.png')\n```\n\n### Command-Line Interface\n\n```bash\n# Apply Sobel edge detection\npython scripts/cli.py --image data/lena.png --kernel sobel_x --output edges.png\n\n# Apply Gaussian blur with benchmarking\npython scripts/cli.py --image photo.jpg --kernel gaussian --output blurred.jpg --benchmark\n\n# Use naive kernel for comparison\npython scripts/cli.py --image data/lena.png --kernel box_blur --naive --benchmark\n```\n\n### Streamlit Web UI\n\n```bash\nstreamlit run scripts/streamlit_app.py\n```\n\nThen open your browser to the displayed URL (typically `http://localhost:8501`).\n\n## Performance Results\n\nTested on NVIDIA T4 GPU (Google Colab):\n\n| Image Size | CPU Time | GPU Time (Optimized) | Speedup |\n|-----------|----------|---------------------|---------|\n| 512×512   | 15.2 ms  | 1.8 ms              | 8.4x    |\n| 2048×2048 | 245.6 ms | 11.3 ms             | **21.7x** |\n\n**Kernel-Only Time** (excluding memory transfers):\n- 512×512: **2.1 ms** ✓ (target: \u003c5ms)\n- 2048×2048: 9.8 ms\n\n### Speedup Visualization\n\n```\nCPU         ████████████████████████████████████████ 245.6 ms\nGPU Naive   ████████████ 58.3 ms (4.2x)\nGPU Opt.    ██ 11.3 ms (21.7x) ⭐\n```\n\n## Project Structure\n\n```\ncuda-conv/\n├── README.md                    # This file\n├── requirements.txt             # Python dependencies\n├── setup_colab.ipynb           # Colab setup notebook\n├── src/\n│   ├── __init__.py\n│   ├── kernels/\n│   │   ├── __init__.py\n│   │   └── conv2d.py           # CUDA kernels (naive + optimized)\n│   ├── api.py                  # High-level Python API\n│   ├── timing.py               # Benchmarking utilities\n│   └── presets.py              # Preset filters (Sobel, Gaussian, etc.)\n├── tests/\n│   ├── test_correctness.py     # Correctness tests\n│   └── test_edge_cases.py      # Edge case tests\n├── notebooks/\n│   ├── 01_demo_speed.ipynb     # Main benchmark demo\n│   └── 02_examples.ipynb       # Visual examples (optional)\n├── data/\n│   ├── lena.png                # Sample images\n│   ├── checker.png\n│   ├── gradient.png\n│   └── edges.png\n└── scripts/\n    ├── cli.py                  # Command-line interface\n    ├── streamlit_app.py        # Web UI\n    └── generate_sample_images.py\n```\n\n## Available Kernels\n\n| Kernel Name | Description | Size |\n|------------|-------------|------|\n| `sobel_x` | Horizontal edge detection | 3×3 |\n| `sobel_y` | Vertical edge detection | 3×3 |\n| `gaussian` | Gaussian blur | 3×3 |\n| `gaussian_5x5` | Larger Gaussian blur | 5×5 |\n| `box_blur` | Box blur (average) | 3×3 |\n| `box_blur_5x5` | Larger box blur | 5×5 |\n| `sharpen` | Sharpen filter | 3×3 |\n| `edge_detect` | Laplacian edge detection | 3×3 |\n| `emboss` | Emboss effect | 3×3 |\n\n## Implementation Details\n\n### Naive Kernel\n- Direct global memory access\n- Simple implementation for baseline comparison\n- Good for small images or verification\n\n### Optimized Kernel\n- **Tiling**: 16×16 thread blocks\n- **Shared Memory**: Tile + halo caching (reduces global memory access)\n- **Coalesced Access**: Optimized memory access patterns\n- **Boundary Handling**: Clamp-to-edge for seamless edges\n- **Maximum Kernel Size**: 9×9 (configurable via `MAX_KERNEL_SIZE`)\n\n### Key Optimizations\n\n1. **Shared Memory Tiling**: Each thread block loads a tile into shared memory, including halo regions for the kernel\n2. **Cooperative Loading**: All threads cooperate to load tile data\n3. **Memory Coalescing**: Adjacent threads access adjacent memory locations\n4. **Synchronization**: `__syncthreads()` ensures all data is loaded before computation\n5. **Float32**: Consistent use of float32 for performance and compatibility\n\n## Benchmarking\n\nRun comprehensive benchmarks:\n\n```python\nfrom src.timing import benchmark_all, print_results\nimport numpy as np\nfrom src.presets import get_kernel\n\nimage = np.random.rand(2048, 2048).astype(np.float32)\nkernel = get_kernel('gaussian')\n\nresults = benchmark_all(image, kernel, warmup_runs=3, timed_runs=20)\nprint_results(results)\n```\n\n## Testing\n\n```bash\n# Run all tests\npytest tests/ -v\n\n# Run specific test file\npytest tests/test_correctness.py -v\n\n# Run with coverage\npytest tests/ --cov=src --cov-report=html\n```\n\n## Requirements\n\n- **Python**: 3.8+\n- **CUDA**: 11.x or 12.x\n- **GPU**: NVIDIA GPU with compute capability 3.5+\n- **CuPy**: For CUDA Python bindings\n- **NumPy, SciPy**: For CPU baseline and utilities\n- **Matplotlib, Pillow**: For visualization and image I/O\n\nSee `requirements.txt` for full list.\n\n## Limitations\n\n- **Maximum kernel size**: 9×9 for optimized kernel (due to shared memory constraints)\n- **Image format**: Currently supports grayscale and RGB (channels processed separately)\n- **Boundary handling**: Uses clamp-to-edge (no other modes implemented)\n- **Precision**: float32 only (no float64 or int support in kernels)\n\n## Future Enhancements\n\n- [ ] Separable convolution for Gaussian (faster)\n- [ ] Support for larger kernels (11×11, 13×13)\n- [ ] Batch processing for multiple images\n- [ ] 3D convolution support\n- [ ] Automatic kernel size optimization\n- [ ] GEMM-based convolution for very large kernels\n- [ ] Half-precision (FP16) support for newer GPUs\n\n## Troubleshooting\n\n### \"CUDA not available\" error\n- Ensure you have an NVIDIA GPU\n- Install CUDA toolkit (11.x or 12.x)\n- Install CuPy: `pip install cupy-cuda11x` (adjust for your CUDA version)\n- In Colab: Set runtime to GPU\n\n### \"Kernel too large\" error\n- Maximum supported size is 9×9 for optimized kernel\n- Use naive kernel for larger kernels: `use_shared_mem=False`\n\n### Slow performance\n- Ensure GPU runtime is selected (Colab)\n- Check warmup runs are enabled for accurate benchmarking\n- Verify image is not being transferred multiple times\n\n## Citation\n\nIf you use this project in your research, please cite:\n\n```bibtex\n@software{cuda_conv_accelerator,\n  title={CUDA Convolution Accelerator},\n  author={Ayomide Caleb Adekoya},\n  year={2025},\n  url={https://github.com/elcruzo/cuda-conv}\n}\n```\n\n## License\n\nMIT License - see LICENSE file for details\n\n## Acknowledgments\n\n- Built for NVIDIA Hackathon\n- Inspired by classic CUDA optimization techniques\n- Uses CuPy for seamless Python-CUDA integration\n\n## Contributing\n\nContributions welcome! Please:\n1. Fork the repository\n2. Create a feature branch\n3. Add tests for new functionality\n4. Ensure all tests pass\n5. Submit a pull request\n\n## Contact\n\n- GitHub: [@elcruzo](https://github.com/elcruzo)\n- X: [@elcruzosym](https://www.x.com/elcruzosym)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felcruzo%2Fcuda-conv","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felcruzo%2Fcuda-conv","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felcruzo%2Fcuda-conv/lists"}