https://github.com/bivex/cudahte
- Host: GitHub
- URL: https://github.com/bivex/cudahte
- Owner: bivex
- Created: 2026-05-18T21:39:09.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-05-18T21:58:15.000Z (about 1 month ago)
- Last Synced: 2026-05-18T23:55:47.254Z (about 1 month ago)
- Language: Python
- Size: 143 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
# CUDA Code Smells Analyzer
A Python-based static analysis tool designed to detect "code smells" and critical issues in NVIDIA CUDA (`.cu`, `.cuh`) files.
This project leverages [ANTLR4](https://www.antlr.org/) for accurate parsing of CUDA C++ grammar and is structured using **Clean Architecture** principles to ensure clear separation of concerns, high maintainability, and easy extensibility.
## Features
- Parse CUDA source code into a syntax tree with an ANTLR4-generated parser.
- Detect potentially unchecked CUDA API calls.
- Detect naive potential memory leaks (mismatched `cudaMalloc` / `cudaFree` counts).
- Clean Architecture (Domain, Application, Infrastructure layers).
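
The naive leak heuristic listed above can be approximated in a few lines. The sketch below is only a stand-in: it matches call sites with regular expressions instead of walking the ANTLR tree, and the names `count_alloc_free` and `potential_leak` are hypothetical, not part of this project. Because it works on raw text, it would also count calls that appear in comments or string literals.

```python
import re
from typing import Tuple

# Hypothetical stand-in for the ANTLR-based rule: match call sites by name.
# "cudaMallocManaged(" does not match because "(" must follow "cudaMalloc".
_MALLOC = re.compile(r"\bcudaMalloc\s*\(")
_FREE = re.compile(r"\bcudaFree\s*\(")


def count_alloc_free(source: str) -> Tuple[int, int]:
    """Return (number of cudaMalloc calls, number of cudaFree calls)."""
    return len(_MALLOC.findall(source)), len(_FREE.findall(source))


def potential_leak(source: str) -> bool:
    """Flag the file when allocations outnumber frees."""
    mallocs, frees = count_alloc_free(source)
    return mallocs > frees


SAMPLE = """
float *a, *b;
cudaMalloc(&a, 1024);
cudaMalloc(&b, 1024);
cudaFree(a);
"""

if __name__ == "__main__":
    print(count_alloc_free(SAMPLE), potential_leak(SAMPLE))
```

A count comparison like this cannot tell which pointer was freed, which is why the rule is described as naive and reported as a warning.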
## Detected Code Smells
| Rule Name | Description | Severity |
| :--- | :--- | :--- |
| **UncheckedCudaAPI** | CUDA API calls (e.g., `cudaMalloc`, `cudaMemcpy`, `cudaFree`, `cudaDeviceSynchronize`) should have their return values checked for errors. This rule detects if the call is not wrapped in a checking macro (like `CHECK`, `EXPECT`, `assert`) or assigned to a variable within 8 levels of the AST. | CRITICAL |
| **PotentialMemoryLeak** | A naive heuristic that triggers if the number of `cudaMalloc` calls in a file is greater than the number of `cudaFree` calls. | WARNING |
| **WarpDivergence** | Detects potential warp divergence by checking if an `if` statement's condition explicitly depends on `threadIdx` or `blockIdx` using equality or modulo operators. | CRITICAL |
| **HostDeviceTransferInLoop** | Detects if `cudaMemcpy` is called inside an iteration statement (`for`, `while`, `do-while`), which leads to massive PCIe bottlenecking. | CRITICAL |
| **SuboptimalGridBlock** | Checks kernel launch parameters `<<<grid, block>>>`. Flags a warning if the `block` size is not a multiple of 32 (warp size), is hardcoded to `< 128`, or if the `grid` is `1`. | WARNING |
| **CudaDeviceSynchronizeInHotPath** | Detects if `cudaDeviceSynchronize()` is called inside a loop, which blocks the CPU from doing any other work and breaks async pipelines. | WARNING |
| **SyncthreadsInDivergentCode** | Calling `__syncthreads()` inside a divergent branch (`if` statements depending on thread indices) can cause a deadlock. Ensures all threads reach the barrier. | CRITICAL |
| **IntegerOverflowInIndex** | Detects global thread index calculation (`blockIdx.x * blockDim.x + threadIdx.x`) assigned to an `int`. For arrays with more than `INT_MAX` (2^31 - 1) elements, the index can overflow. Recommends using `size_t`. | CRITICAL |
| **KernelLaunchInLoop** | Detects kernel launches (`<<<...>>>`) inside loops. Launch overhead multiplies by iterations; recommends batching or using CUDA Graphs. | CRITICAL |
| **DoubleUsage** | Detects usage of `double` precision. Double precision is significantly slower than float on most consumer GPUs. Recommends using `float` if precision is not critical. | WARNING |
| **SlowMathFunction** | Detects standard math functions (e.g., `sin`, `cos`, `sqrt`) that might not be optimal. Recommends using intrinsic fast functions (e.g., `__sinf`) or compiling with `--use_fast_math`. | WARNING |
| **MissingKernelErrorCheck** | Kernel launches `<<<...>>>` are asynchronous. Without calling `cudaGetLastError()` afterward, invalid grid/block configs fail silently. | CRITICAL |
| **LargeSharedMemoryAllocation** | Detects static `__shared__` array allocations that are unusually large (>8192 items) and might exceed the hardware 48KB limit without explicit dynamic opt-in. | CRITICAL |
| **VolatileUsage** | Using `volatile` to synchronize threads or blocks is unsafe under the GPU memory model and leads to undefined behavior. | CRITICAL |
| **DefaultStreamUsage** | Detects kernel launches or `cudaMemcpyAsync` calls in the default (NULL) stream, which prevents concurrent execution with other operations. | WARNING |
| **HardcodedDeviceId** | Detects `cudaSetDevice(0)` or other hardcoded IDs, which may fail on multi-GPU systems. | WARNING |
| **DeprecatedAPI** | Detects usage of legacy CUDA APIs like `cudaThreadSynchronize` and recommends modern alternatives. | WARNING |
| **UncoalescedMemoryAccess** | Detects non-coalesced memory access patterns like `ptr[threadIdx.x * stride]` where `stride > 1`, which drastically reduces global memory throughput. | CRITICAL |
| **SharedMemoryBankConflict** | Detects access to `__shared__` memory with a stride that is a multiple of 32, causing serialization of access within a warp. | WARNING |
| **ArchitecturalCudaLeak** | Detects CUDA-specific code or headers inside Domain/Application layers, enforcing Clean Architecture boundaries. | CRITICAL |
| **UnifiedMemoryWithoutPrefetch** | Detects `cudaMallocManaged` without explicit `cudaMemPrefetchAsync` in the same context, which triggers slow page faults. | WARNING |
| **MissingBoundsCheckInKernel** | Detects kernel functions that compute a thread index and access arrays without a bounds guard (`if (tid < N)`), causing out-of-bounds access when the grid is larger than the data. | CRITICAL |
| **MissingSyncthreadsAfterSharedWrite** | Detects writes to `__shared__` memory followed by reads without an intervening `__syncthreads()`, causing data races and undefined values. | CRITICAL |
| **IncorrectGridDimensionCalculation** | Detects grid dimension calculated with integer division (`N / blockSize`) instead of ceiling division, leaving trailing elements unprocessed. | WARNING |
| **SharedMemoryUninitializedForAtomics** | Detects `__shared__` arrays used with atomic operations without prior zero-initialization, producing incorrect accumulation results. | WARNING |
| **ConstantMemoryWrongCopyMethod** | Detects `cudaMemcpy` used with `__constant__` variables instead of `cudaMemcpyToSymbol`, which fails because constant memory resides in a separate address space. | WARNING |
| **GlobalAtomicWithoutSharedIntermediate** | Detects atomic operations on global memory arrays without shared memory intermediate, causing severe thread contention when many threads compete for few locations. | WARNING |
| **SynchronousMemcpyWithActiveStreams** | Detects synchronous `cudaMemcpy` when CUDA streams are active, which blocks the CPU and prevents copy/kernel overlap. Recommends `cudaMemcpyAsync`. | WARNING |
| **MissingRestrictOnKernelPointers** | Detects kernel functions with multiple pointer parameters lacking `__restrict__` qualifiers, preventing compiler optimizations due to assumed pointer aliasing. | INFO |
| **NonPowerOf2ReductionBlock** | Detects parallel reduction patterns where block size is not a power of 2, causing iterative halving (`i /= 2`) to skip elements and produce incorrect results. | WARNING |
| **CudaEventResourceLeak** | Detects `cudaEventCreate` without a matching `cudaEventDestroy`, leaking finite GPU event resources. | WARNING |
*Note: New rules can be easily added by implementing a new `CUDAParserVisitor` in `src/infrastructure/rules/` and registering it in `src/infrastructure/cli/main.py`.*
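
Two of the rules above, **IncorrectGridDimensionCalculation** and **NonPowerOf2ReductionBlock**, rest on arithmetic that is easy to check by hand. The sketch below reproduces that arithmetic in plain Python. It assumes the reduction rule targets the common loop `if (tid < s) sdata[tid] += sdata[tid + s];` with `s` halved each step. All function names are illustrative and not part of the analyzer.

```python
from typing import List


def floor_grid(n: int, block_size: int) -> int:
    """Grid size from plain integer division (the flagged pattern)."""
    return n // block_size


def ceil_grid(n: int, block_size: int) -> int:
    """Grid size from ceiling division, which covers every element."""
    return (n + block_size - 1) // block_size


def halving_reduction(values: List[int]) -> int:
    """Simulate a tree reduction over one block of len(values) threads.

    Each pass mirrors: if (tid < stride) sdata[tid] += sdata[tid + stride];
    """
    data = list(values)
    stride = len(data) // 2
    while stride > 0:
        for tid in range(stride):
            data[tid] += data[tid + stride]
        stride //= 2
    return data[0]


if __name__ == "__main__":
    n, block = 1000, 256
    # Floor division launches 3 blocks (768 threads), leaving 232 elements untouched.
    print(floor_grid(n, block), n - floor_grid(n, block) * block)
    # Ceiling division launches 4 blocks (1024 threads), so a bounds guard is needed.
    print(ceil_grid(n, block))
    # Power-of-two block: the result equals the true sum, 36.
    print(halving_reduction([1, 2, 3, 4, 5, 6, 7, 8]))
    # Block of 6: stride goes 3 -> 1, so index 2 is never folded in (12, not 21).
    print(halving_reduction([1, 2, 3, 4, 5, 6]))
```

The ceiling form launches more threads than there are elements, which is the situation the **MissingBoundsCheckInKernel** rule guards against.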
## Architecture
This project strictly adheres to Clean Architecture:
- **Domain (`src/domain/`)**: Contains core entities (`CodeSmell`, `Position`) and abstract ports (`CodeAnalyzerPort`). Contains NO dependencies on ANTLR or the CLI.
- **Application (`src/application/`)**: Contains Use Cases (`AnalyzeFileUseCase`, `AnalyzeDirectoryUseCase`) that orchestrate the analysis workflow.
- **Infrastructure (`src/infrastructure/`)**:
  - **CLI**: The command-line interface (`main.py`).
  - **Parsers**: `AntlrCudaAnalyzer`, an adapter that implements `CodeAnalyzerPort` and encapsulates the ANTLR parsing logic.
  - **Rules**: Specific ANTLR visitors that traverse the AST to find code smells.
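
The layering above can be sketched as follows. The class names `Position`, `CodeSmell`, `CodeAnalyzerPort` and `AnalyzeFileUseCase` come from this README, but every field name, method name and signature is an assumption made for illustration. `FakeAnalyzer` is a hypothetical adapter that stands in for `AntlrCudaAnalyzer`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


# --- Domain: entities and the port, with no parser or CLI imports ---
@dataclass(frozen=True)
class Position:
    line: int      # assumed field name
    column: int    # assumed field name


@dataclass(frozen=True)
class CodeSmell:
    rule: str
    severity: str
    message: str
    position: Position


class CodeAnalyzerPort(ABC):
    @abstractmethod
    def analyze(self, path: str, source: str) -> List[CodeSmell]:
        """Return the smells found in one source file (assumed signature)."""


# --- Application: orchestrates the workflow through the port only ---
class AnalyzeFileUseCase:
    def __init__(self, analyzer: CodeAnalyzerPort) -> None:
        self._analyzer = analyzer

    def execute(self, path: str, source: str) -> List[CodeSmell]:
        return self._analyzer.analyze(path, source)


# --- Infrastructure: an adapter; the real one would wrap the ANTLR parser ---
class FakeAnalyzer(CodeAnalyzerPort):
    def analyze(self, path: str, source: str) -> List[CodeSmell]:
        smells = []
        for number, text in enumerate(source.splitlines(), start=1):
            column = text.find("cudaDeviceSynchronize")
            if column >= 0:
                smells.append(
                    CodeSmell("DemoRule", "WARNING", "sync call found", Position(number, column))
                )
        return smells


if __name__ == "__main__":
    use_case = AnalyzeFileUseCase(FakeAnalyzer())
    print(use_case.execute("demo.cu", "int x;\n  cudaDeviceSynchronize();\n"))
```

Because the use case depends only on the abstract port, the parser adapter can be swapped for a test double like this one without touching the Domain or Application code.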
## Setup and Installation
1. **Clone the repository:**
Ensure you initialize the submodules to fetch the ANTLR grammar.
```bash
git clone --recursive <repository-url>
cd <repository-directory>
```
2. **Set up a Python Virtual Environment:**
```bash
python3 -m venv venv
source venv/bin/activate
```
3. **Install Dependencies:**
```bash
pip install antlr4-python3-runtime==4.13.2
```
4. **(Optional) Re-generate the ANTLR Parser:**
If you modify the grammar in `parser/*.g4`, regenerate the Python parser using the `antlr4` tool:
```bash
cd parser
antlr4 -Dlanguage=Python3 -visitor CUDALexer.g4 CUDAParser.g4
```
## Testing
The project includes a comprehensive test suite covering all detection rules.
```bash
# Activate environment
source venv/bin/activate
# Run all tests
python3 -m unittest discover tests
# Run specific test suites
python3 tests/test_new_rules.py # Unit tests for new rules
python3 tests/test_integration.py # Integration test with PMP book examples
```
## Usage
Activate your virtual environment before running the tool:
```bash
source venv/bin/activate
```
### Analyze a Single File
```bash
python src/infrastructure/cli/main.py smells-file <path/to/file.cu>
```
### Analyze a Directory
Recursively scans for `.cu` and `.cuh` files.
```bash
python src/infrastructure/cli/main.py smells-dir <path/to/directory>
```
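
A recursive scan for `.cu` and `.cuh` files might look like the sketch below. It uses only `pathlib` from the standard library. The helper names are hypothetical and the project's own directory use case may be implemented differently.

```python
import tempfile
from pathlib import Path
from typing import List

CUDA_SUFFIXES = {".cu", ".cuh"}


def is_cuda_source(path: Path) -> bool:
    """True for files whose extension marks them as CUDA sources or headers."""
    return path.suffix in CUDA_SUFFIXES


def find_cuda_files(root: str) -> List[Path]:
    """Recursively collect CUDA files under root, sorted for stable output."""
    return sorted(p for p in Path(root).rglob("*") if p.is_file() and is_cuda_source(p))


def demo() -> List[str]:
    """Build a throwaway tree and return the relative paths that were picked up."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "kernels").mkdir()
        for name in ("main.cu", "kernels/reduce.cuh", "kernels/notes.txt"):
            (base / name).write_text("// placeholder\n")
        return [p.relative_to(base).as_posix() for p in find_cuda_files(tmp)]


if __name__ == "__main__":
    print(demo())
```

The suffix comparison is case-sensitive, so a file named `KERNEL.CU` would be skipped by this sketch.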
### Example Output
```text
[CRITICAL] UncheckedCudaAPI at test.cu:6:4
CUDA API call 'cudaMalloc' is potentially unchecked. Always check the return value of CUDA API calls.
[WARNING] PotentialMemoryLeak at test.cu:0:0
Found 1 'cudaMalloc' calls but only 0 'cudaFree' calls in file.
```
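
For scripting or CI, a report in the format shown above can be parsed back into records. The sketch below assumes each finding is a header line of the form `[SEVERITY] Rule at file:line:column` followed by one or more message lines. That layout is inferred from the example output only, and `parse_report` is a hypothetical helper, not part of the tool.

```python
import re
from typing import Dict, List

# Header layout inferred from the example output; treat it as an assumption.
HEADER = re.compile(
    r"^\[(?P<severity>[A-Z]+)\] (?P<rule>\w+) at (?P<file>.+):(?P<line>\d+):(?P<column>\d+)$"
)

SAMPLE = """\
[CRITICAL] UncheckedCudaAPI at test.cu:6:4
CUDA API call 'cudaMalloc' is potentially unchecked. Always check the return value of CUDA API calls.
[WARNING] PotentialMemoryLeak at test.cu:0:0
Found 1 'cudaMalloc' calls but only 0 'cudaFree' calls in file.
"""


def parse_report(text: str) -> List[Dict[str, str]]:
    """Turn the textual report into a list of dictionaries, one per finding."""
    findings: List[Dict[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            findings.append({**match.groupdict(), "message": ""})
        elif line and findings:
            # Any non-header line is treated as message text for the latest finding.
            current = findings[-1]
            current["message"] = (current["message"] + " " + line).strip()
    return findings


if __name__ == "__main__":
    report = parse_report(SAMPLE)
    critical = [f for f in report if f["severity"] == "CRITICAL"]
    print(len(report), len(critical))
```

A CI job could, for example, fail the build whenever the list of `CRITICAL` findings is non-empty.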