{"id":49914688,"url":"https://github.com/canreader/tensorbench","last_synced_at":"2026-05-16T15:10:45.172Z","repository":{"id":323733075,"uuid":"1094472550","full_name":"CanReader/TensorBench","owner":"CanReader","description":null,"archived":false,"fork":false,"pushed_at":"2025-11-11T19:51:56.000Z","size":39,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-11T21:04:33.711Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CanReader.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-11T18:50:36.000Z","updated_at":"2025-11-11T19:52:00.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/CanReader/TensorBench","commit_stats":null,"previous_names":["canreader/tensorbench"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/CanReader/TensorBench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CanReader%2FTensorBench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CanReader%2FTensorBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CanReader%2FTensorBench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CanReader%2FTensorBench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/owners/CanReader","download_url":"https://codeload.github.com/CanReader/TensorBench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CanReader%2FTensorBench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33107650,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-16T04:41:52.686Z","status":"ssl_error","status_checked_at":"2026-05-16T04:41:52.009Z","response_time":115,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-05-16T15:10:28.794Z","updated_at":"2026-05-16T15:10:45.164Z","avatar_url":"https://github.com/CanReader.png","language":"Cuda","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🚀 TensorBench: Advanced CUDA Tensor Operation Benchmarking Suite\n\n\u003cdiv align=\"center\"\u003e\n\n![CUDA](https://img.shields.io/badge/CUDA-12.0+-green?style=flat-square\u0026logo=nvidia)\n![C++](https://img.shields.io/badge/C++-17-blue?style=flat-square\u0026logo=cplusplus)\n![License](https://img.shields.io/badge/license-MIT-green?style=flat-square)\n![Status](https://img.shields.io/badge/status-active-brightgreen?style=flat-square)\n\n**Comprehensive benchmarking framework for GPU tensor operations with roofline model analysis, multi-algorithm comparison, and advanced performance 
characterization.**\n\n[Features](#-features) • [Quick Start](#-quick-start) • [Benchmarks](#-benchmark-suite) • [Results](#-output--analysis) • [Architecture](#-architecture)\n\n\u003c/div\u003e\n\n---\n\n## 📋 Overview\n\n**TensorBench** is a production-grade CUDA benchmarking suite designed for comprehensive performance analysis of tensor operations on NVIDIA GPUs. It provides deep insights into GPU utilization, memory hierarchy behavior, and comparative performance across multiple algorithms.\n\n### Use Cases\n- 🔍 **Performance Profiling**: Detailed analysis of tensor operation performance\n- 🏗️ **Architecture Evaluation**: Compare different implementation strategies\n- 📊 **Optimization Research**: Identify bottlenecks and optimization opportunities\n- 🎯 **GPU Capability Assessment**: Understand your hardware's strengths and limitations\n- 📈 **Comparative Studies**: Benchmark naive vs. optimized implementations\n\n---\n\n## ✨ Features\n\n### 🎯 Multi-Algorithm Comparison\n- **cuBLAS Optimized**: Highly-tuned vendor library implementation\n- **Naive Kernel**: Reference implementation for correctness validation\n- **Fused Operations**: Advanced multi-operation kernels\n\n### 📊 Advanced Performance Analysis\n- **Roofline Model**: Theoretical performance ceiling computation\n- **Arithmetic Intensity**: FLOPS/Byte analysis for memory vs. 
compute bottleneck classification\n- **Cache Behavior**: L1/L2 cache miss estimation\n- **Memory Bandwidth**: Real-time measurement and analysis\n- **Statistical Analysis**: Mean, variance, standard deviation, 95% confidence intervals\n\n### 🔬 Deep Profiling Capabilities\n- **Thermal Throttling**: Risk assessment based on sustained execution\n- **Efficiency Metrics**: Peak efficiency percentage calculation\n- **Variance Analysis**: Execution time stability tracking\n- **Power Efficiency**: GFLOPS/Watt estimation\n- **GPU Properties**: Full device capability reporting\n\n### 📁 Comprehensive Output\n- CSV exports for advanced analysis\n- Multi-phase benchmarking (single ops, fused ops, batch processing)\n- Detailed performance logs with human-readable formatting\n\n---\n\n## 🏗️ Architecture\n\n### Project Structure\n```\nTensorBench/\n├── CMakeLists.txt              # Build configuration\n├── README.md                   # This file\n├── include/\n│   ├── MatrixFP16.cuh          # FP16 matrix class definition\n│   ├── MatrixFP32.cuh          # FP32 matrix class definition\n│   ├── naive_tensor_tgemm.cuh  # Naive GEMM kernel header\n│   └── utils.cuh               # Utility functions\n├── src/\n│   ├── MatrixFP16.cu           # FP16 matrix implementation\n│   ├── MatrixFP32.cu           # FP32 matrix implementation\n│   ├── naive_tensor_tgemm.cu   # Naive GEMM kernel implementation\n│   └── utils.cu                # Utility implementations\n├── test/\n│   ├── 00_benchmark_cuBLAS.cu                 # Test 1: cuBLAS Baseline\n│   ├── 01_benchmark_naive.cu                  # Test 2: Naive Implementation\n│   ├── 02_benchmark_mixed_precision.cu        # Test 3: Mixed Precision Analysis\n│   ├── 03_benchmark_scaling.cu                # Test 4: Strong Scaling\n│   ├── 04_benchmark_stress_test.cu            # Test 5: Stress Testing\n│   └── 05_benchmark_advanced_tensor_ops.cu    # Test 6: Advanced Analysis (500+ lines)\n└── build/                      # CMake build output\n   
 └── *.out                   # Compiled executables\n```\n\n---\n\n## 🚀 Quick Start\n\n### Prerequisites\n- **NVIDIA GPU** with compute capability 6.1+ (Pascal or newer)\n- **CUDA Toolkit** 12.0 or later\n- **CMake** 3.18+\n- **GCC/G++** 9+ or **LLVM/Clang** 10+\n\n### Installation \u0026 Build\n\nFor the simplest setup, run the appropriate command below. These scripts automatically handle the CMake configuration and compilation.\n\n\u003e ### **🐧 Linux/macOS**\n\u003e Run:\n\u003e ```bash\n\u003e ./build.sh\n\u003e ```\n\n\u003e ### **🪟 Windows**\n\u003e Run:\n\u003e ```bash\n\u003e build.bat\n\u003e ```\n\n**or**\n\n\n1. **Clone and navigate to the project:**\n```bash\ncd TensorBench\n```\n\n2. **Configure with CMake:**\n```bash\ncmake -S . -B build -DCMAKE_EXPORT_COMPILE_COMMANDS=ON\n```\n\n3. **Build all benchmarks:**\n```bash\ncmake --build build -j$(nproc)\n```\n\n4. **Or build specific tests:**\n```bash\n# Build only advanced tensor ops benchmark\ncmake --build build --target bench_advanced_tensor_ops\n\n# Build only scaling analysis\ncmake --build build --target bench_scaling\n\n# List all targets\ncmake --build build --target help\n```\n\n### Running Benchmarks\n\nAfter building, run individual benchmarks from the `build/` directory:\n\n```bash\ncd build\n\n# Test 1: cuBLAS Baseline Performance\n./00_benchmark_cuBLAS.out\n\n# Test 2: Naive Kernel Comparison\n./01_benchmark_naive.out\n\n# Test 3: Mixed Precision Analysis\n./02_benchmark_mixed_precision.out\n\n# Test 4: Strong Scaling Analysis\n./03_benchmark_scaling.out\n\n# Test 5: Stress Testing\n./04_benchmark_stress_test.out\n\n# Test 6: Advanced Tensor Operations (Most Comprehensive)\n./05_benchmark_advanced_tensor_ops.out\n```\n\n---\n\n## 📊 Benchmark Suite\n\n### Test 1: cuBLAS Baseline (`00_benchmark_cuBLAS.cu`)\n**Purpose**: Establish performance baseline with vendor-optimized library\n\n| Aspect | Details |\n|--------|---------|\n| **Sizes** | 128, 256, 512, 1024, 2048, 4096 |\n| **Runs** | 10 per 
size |\n| **Algorithm** | cuBLAS GemmEx with tensor operations |\n| **Output** | GFLOPS, execution time |\n| **Use Case** | Reference performance ceiling |\n\n---\n\n### Test 2: Naive Implementation (`01_benchmark_naive.cu`)\n**Purpose**: Reference kernel for correctness validation\n\n| Aspect | Details |\n|--------|---------|\n| **Sizes** | 128, 256, 512, 1024, 2048, 4096 |\n| **Runs** | 10 per size |\n| **Algorithm** | Custom naive GEMM kernel |\n| **Validation** | Assert correctness against cuBLAS |\n| **Output** | GFLOPS comparison, error analysis |\n| **Use Case** | Correctness verification, optimization baseline |\n\n---\n\n### Test 3: Mixed Precision Analysis (`02_benchmark_mixed_precision.cu`) ⭐\n**Purpose**: Comprehensive mixed-precision performance comparison\n\n| Aspect | Details |\n|--------|---------|\n| **Sizes** | 128, 256, 512, 1024, 2048, 4096, **8192** |\n| **Precision** | FP16 (input) × FP16 (input) → FP32 (output) |\n| **Batches** | 3 independent benchmark runs |\n| **Runs** | 5 per batch |\n| **Statistics** | Mean time, GFLOPS, speedup metrics |\n| **Output** | `benchmark_results.csv` with detailed metrics |\n| **Features** | GPU device properties, warmup runs |\n| **Use Case** | Mixed-precision optimization analysis |\n\n**Key Metrics:**\n- Time per operation (ms)\n- GFLOPS achieved\n- Speedup relative to naive implementation\n- Numerical accuracy validation\n\n---\n\n### Test 4: Strong Scaling Analysis (`03_benchmark_scaling.cu`) 📈\n**Purpose**: Analyze batch processing and strong scaling behavior\n\n| Aspect | Details |\n|--------|---------|\n| **Sizes** | 256, 512, 1024, 2048 |\n| **Batch Sizes** | 1, 2, 4, 8, 16 matrices |\n| **Modes** | Sequential vs. 
queue-based execution |\n| **Metrics** | Throughput (matrices/sec), memory bandwidth (GB/s) |\n| **Output** | `benchmark_scaling_results.csv` |\n| **Analysis** | Speedup, efficiency, bottleneck identification |\n\n**Key Insights:**\n- Batch processing efficiency\n- Strong scaling characteristics\n- Memory bandwidth utilization\n- Queue vs. sequential overhead\n\n---\n\n### Test 5: Stress Testing (`04_benchmark_stress_test.cu`) 🔥\n**Purpose**: Push GPU to limits; analyze maximum problem sizes and stability\n\n| Aspect | Details |\n|--------|---------|\n| **Max Size** | **12288 × 12288** matrices |\n| **Runs** | 20-50 per size (adaptive) |\n| **Focus** | Execution time variance, stability |\n| **Metrics** | Min, max, avg GFLOPS; variance analysis |\n| **Output** | CSV with detailed statistics |\n| **Safety** | Handles out-of-memory gracefully |\n| **Use Case** | Maximum capacity planning, thermal limits |\n\n**Variance Analysis:**\n- Identifies thermal throttling\n- Detects performance degradation\n- Measures consistency across runs\n\n---\n\n### Test 6: Advanced Tensor Operations (`05_benchmark_advanced_tensor_ops.cu`) 🌟\n**Purpose**: Most comprehensive analysis with roofline model and multi-algorithm comparison\n**Lines of Code**: 650+ lines\n\n| Aspect | Details |\n|--------|---------|\n| **Sizes** | 256, 512, 1024, 2048, 4096 |\n| **Runs** | 15 per size (high statistical significance) |\n| **Phases** | 2 benchmark phases |\n| **Algorithms** | 3 implementations (cuBLAS, Naive, Fused) |\n| **Output** | Multiple CSV files for advanced analysis |\n\n#### Phase 1: Single Matrix Multiplication Comparison\n- Compares cuBLAS vs. 
naive kernel\n- Statistical analysis with 95% confidence intervals\n- Roofline model analysis\n- Cache miss estimation\n- Thermal throttling risk assessment\n\n#### Phase 2: Fused Operations Analysis\n- Tests combined operations: C = A₁B₁ + A₂B₂\n- Kernel fusion efficiency\n- Memory access pattern optimization\n\n#### Metrics Tracked:\n```\nPer Operation:\n├── Execution Time (ms) + Variance\n├── GFLOPS + Efficiency %\n├── Memory Bandwidth (GB/s)\n├── Compute Intensity (FLOPS/Byte)\n├── Cache Miss Estimation\n└── Thermal Throttle Risk (0.0-1.0)\n\nRoofline Model:\n├── Compute Intensity\n├── Achieved GFLOPS\n├── Peak Compute (Theoretical)\n├── Peak Memory Bandwidth (Theoretical)\n└── Bottleneck Classification (Compute vs. Memory)\n\nConfidence Intervals:\n├── 95% CI for mean execution time\n├── Statistical significance\n└── Accuracy bounds\n```\n\n#### GPU Specifications Reported:\n- Peak compute (GFLOPS)\n- Peak memory bandwidth (GB/s)\n- Warp size\n- Max threads per block\n- Number of SMs\n- TDP estimate\n\n#### CSV Outputs:\n- `benchmark_advanced_metrics.csv` - Per-operation detailed metrics\n- `benchmark_roofline_model.csv` - Roofline analysis data\n\n#### Expected Runtime:\n- Small GPUs: 2-3 minutes\n- Large GPUs (RTX 4090): 3-5 minutes\n\n---\n\n## 📈 Output \u0026 Analysis\n\n### CSV Export Format\n\nAll benchmarks export detailed metrics to CSV for further analysis:\n\n**benchmark_results.csv** (Mixed Precision):\n```\nMatrixSize,Batch,cuBLAS_Time_ms,Naive_Time_ms,cuBLAS_GFLOPS,Naive_GFLOPS,Speedup,MaxError,AvgError\n128,1,0.123456,0.987654,123.45,15.67,8.03,1.2e-5,3.4e-6\n256,1,0.234567,1.234567,234.56,18.90,5.25,1.5e-5,4.2e-6\n...\n```\n\n**benchmark_scaling_results.csv** (Scaling 
Analysis):\n```\nMatrixSize,BatchSize,SequentialTime_ms,BatchTime_ms,SequentialGFLOPS,BatchGFLOPS,Speedup,Throughput_matrices_per_sec,MemoryBandwidth_GB_s\n256,1,0.123,0.123,1234.5,1234.5,1.00,8130.08,987.65\n256,2,0.246,0.180,617.3,841.5,1.37,10869.57,1289.45\n...\n```\n\n**benchmark_advanced_metrics.csv** (Advanced):\n```\nMatrixSize,Algorithm,ExecutionTime_ms,GFLOPS,Efficiency_%,MemoryBandwidth_GB_s,ComputeIntensity,CacheMisses,ThrottleRisk\n256,cuBLAS,0.123,1234.56,85.3,123.45,16.78,1024,0.15\n256,Naive,0.456,333.33,23.0,45.67,16.78,4096,0.22\n...\n```\n\n### Python Analysis Example\n\n```python\nimport pandas as pd\nimport matplotlib.pyplot as plt\n\n# Load results\ndf = pd.read_csv('benchmark_advanced_metrics.csv')\n\n# Plot GFLOPS vs. Matrix Size\nplt.figure(figsize=(12, 6))\nfor algo in df['Algorithm'].unique():\n    subset = df[df['Algorithm'] == algo]\n    plt.plot(subset['MatrixSize'], subset['GFLOPS'], marker='o', label=algo)\nplt.xlabel('Matrix Size')\nplt.ylabel('GFLOPS')\nplt.legend()\nplt.xscale('log')\nplt.yscale('log')\nplt.grid(True, alpha=0.3)\nplt.title('Tensor Operation Performance Scaling')\nplt.savefig('performance_scaling.png', dpi=300)\nplt.show()\n```\n\n---\n\n## 🔧 Configuration\n\n### Adjusting GPU Architecture\n\nEdit the `CMAKE_CUDA_ARCHITECTURES` setting in `CMakeLists.txt` (line 15) to match your GPU:\n\n```cmake\n# Compute Capability Reference:\n# 61  -\u003e Pascal (GTX 1080, GTX 1070, etc.)\n# 75  -\u003e Turing (RTX 2080, RTX 2070, etc.)\n# 80  -\u003e Ampere, data center (A100)\n# 86  -\u003e Ampere (RTX 3090, RTX 3080, etc.)\n# 89  -\u003e Ada (RTX 4090, RTX 4080, RTX 4070 Ti, etc.)\n\nset(CMAKE_CUDA_ARCHITECTURES 89)  # \u003c-- CHANGE THIS\n```\n\n### Adjusting Benchmark Parameters\n\nEach test is configured by editing constants in its source file:\n\n**Matrix Sizes** (in each test file):\n```cuda\nint mat_sizes[] = {256, 512, 1024, 2048, 4096};  // Modify as needed\n```\n\n**Number of Runs**:\n```cuda\nint runs_per_batch = 5;   // Increase for more statistical precision\n```\n\n**Batch Sizes** (scaling 
test):\n```cuda\nint batch_sizes[] = {1, 2, 4, 8, 16};  // Modify batch configurations\n```\n\n---\n\n## 📊 Understanding Roofline Model\n\nThe roofline model provides a theoretical performance ceiling based on:\n\n1. **Arithmetic Intensity (AI)**: FLOPs per byte of memory transferred\n2. **Peak Compute**: Maximum GFLOPS the GPU can achieve\n3. **Peak Memory BW**: Maximum memory bandwidth available\n\n**Performance Ceiling = min(Peak Compute, AI × Peak Memory BW)**\n\n### Classification:\n- **Memory-Bound**: Performance limited by memory bandwidth (AI below the ridge point, Peak Compute / Peak Memory BW)\n- **Compute-Bound**: Performance limited by compute capacity (AI above the ridge point)\n\nTensorBench automatically classifies each operation and suggests optimization directions.\n\n---\n\n## 🎯 Performance Optimization Tips\n\nBased on TensorBench results, consider:\n\n### If Memory-Bound:\n- ✅ Increase tile size\n- ✅ Improve cache locality\n- ✅ Use mixed precision (FP16 inputs)\n- ✅ Fuse multiple operations\n\n### If Compute-Bound:\n- ✅ Increase parallelism\n- ✅ Improve instruction-level parallelism\n- ✅ Use tensor cores (Volta+)\n- ✅ Optimize register usage\n\n### General:\n- ✅ Monitor thermal throttling warnings\n- ✅ Analyze variance for stability\n- ✅ Compare against roofline ceiling\n- ✅ Validate numerical accuracy\n\n---\n\n## 🖥️ System Requirements\n\n### Minimum\n- NVIDIA GPU: Compute Capability 6.1+ (GTX 1080 or newer)\n- CUDA Toolkit: 11.0+\n- RAM: 8GB\n- Storage: 1GB\n\n### Recommended\n- NVIDIA GPU: Compute Capability 7.0+ (Volta or newer)\n- CUDA Toolkit: 12.0+\n- RAM: 16GB\n- Storage: 2GB (for large benchmark runs)\n\n### Tested Platforms\n- ✅ Arch Linux-6.17.7 (x86_64)\n- ✅ CUDA 13.0\n- ✅ RTX 4050\n\n---\n\n## 🐛 Troubleshooting\n\n### Build Errors\n\n**\"cuda_fp16.h not found\"**\n```bash\n# Update your CUDA include path in .vscode/c_cpp_properties.json\n# Or regenerate CMake configuration:\ncmake -S . 
-B build -DCMAKE_CUDA_ARCHITECTURES=89\n```\n\n**\"Cannot find cuBLAS\"**\n```bash\n# Ensure CUDA toolkit is properly installed:\nnvcc --version\n\n# If not found, install or set CUDA path:\nexport CUDA_PATH=/usr/local/cuda\ncmake -S . -B build\n```\n\n**Compilation fails on specific architecture**\n```bash\n# Check your GPU's compute capability:\nnvidia-smi --query-gpu=name,compute_cap --format=csv\n\n# Then update CMakeLists.txt with the matching value (digits only, e.g. 8.9 becomes 89):\n#   set(CMAKE_CUDA_ARCHITECTURES 89)\n```\n\n### Runtime Issues\n\n**Out of Memory**\n- Reduce matrix size in test files\n- Close other GPU applications\n- Use `nvidia-smi` to check VRAM usage\n\n**Thermal Throttling**\n- Allow GPU cooling period between runs\n- Reduce problem sizes\n- Monitor with `nvidia-smi dmon`\n\n**Inconsistent Results**\n- Run benchmarks multiple times\n- Check system background processes\n- Verify power management settings\n- Review confidence intervals in output\n\n---\n\n## 📚 Output Interpretation\n\n### GFLOPS\n- **Good**: \u003e 80% of peak GPU GFLOPS\n- **Acceptable**: 50-80% of peak\n- **Poor**: \u003c 50% indicates optimization opportunity\n\n### Memory Bandwidth\n- Check against theoretical peak\n- High utilization (\u003e90% of peak) suggests the kernel is memory-bound, so reducing memory traffic is the next optimization target\n\n### Variance\n- **Low Variance**: Stable, consistent performance\n- **High Variance**: May indicate thermal throttling or system interference\n\n### Efficiency %\n- **90-100%**: Excellent\n- **70-90%**: Good\n- **50-70%**: Fair, consider optimization\n- **\u003c50%**: Poor, significant optimization opportunity\n\n---\n\n## 📖 References \u0026 Resources\n\n- [NVIDIA CUDA Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/)\n- [cuBLAS Documentation](https://docs.nvidia.com/cuda/cublas/)\n- [Roofline Model Paper](https://people.eecs.berkeley.edu/~sameh/SC06_paper.pdf)\n- [NVIDIA GPU Architecture](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/)\n\n---\n\n## 📄 License\n\nThis project is licensed under the MIT 
License - see the LICENSE file for details.\n\n---\n\n## 🤝 Contributing\n\nContributions are welcome! Areas of interest:\n\n- Additional optimization algorithms\n- Extended GPU architecture support\n- Performance analysis tools\n- Documentation improvements\n- Bug reports and fixes\n\n---\n\n## 📝 Citation\n\nIf you use TensorBench in your research, please cite:\n\n```bibtex\n@software{tensorbench2024,\n  title={TensorBench: Advanced CUDA Tensor Operation Benchmarking Suite},\n  author={Your Name},\n  year={2024},\n  url={https://github.com/yourusername/TensorBench}\n}\n```\n\n---\n\n## ❓ FAQ\n\n**Q: What GPU do I need?**\nA: Any NVIDIA GPU with compute capability 6.1 or higher (GTX 1080 or newer). Newer GPUs (Turing+) provide better mixed-precision support.\n\n**Q: How long do benchmarks take?**\nA:\n- Quick tests (00-02): 30 seconds - 2 minutes\n- Medium tests (03-04): 1-3 minutes\n- Advanced test (05): 3-5 minutes\n\n**Q: Can I modify matrix sizes?**\nA: Yes! Edit the test files to adjust the `mat_sizes[]` array. Larger sizes require more VRAM.\n\n**Q: How do I interpret results?**\nA: Compare GFLOPS against your GPU's theoretical peak. Use the roofline model to identify bottlenecks.\n\n**Q: Why is my performance lower than expected?**\nA: Check the thermal throttling risk, compare against the roofline ceiling, and verify that no background processes are interfering.\n\n---\n\n\u003cdiv align=\"center\"\u003e\n\n**[⬆ Back to Top](#-tensorbench-advanced-cuda-tensor-operation-benchmarking-suite)**\n\nMade with ❤️ for GPU performance enthusiasts and researchers\n\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcanreader%2Ftensorbench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcanreader%2Ftensorbench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcanreader%2Ftensorbench/lists"}