https://github.com/marcramonmoreno/cuda-saxpy-kernel
CUDA parallelized SAXPY kernel defined by the BLAS (Basic Linear Algebra Subprograms)
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/marcramonmoreno/cuda-saxpy-kernel
- Owner: MarcRamonMoreno
- License: gpl-3.0
- Created: 2023-11-07T11:47:37.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-11-16T06:43:50.000Z (over 1 year ago)
- Last Synced: 2025-02-12T17:59:49.846Z (4 months ago)
- Language: Cuda
- Homepage:
- Size: 616 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# CUDA SAXPY Kernel Implementation
## Overview
This code provides a CUDA implementation of the SAXPY operation, a common computation in linear algebra. SAXPY stands for "Single-Precision A*X Plus Y": it computes Y = A*X + Y, where A is a scalar (alpha in the code) and X and Y are vectors of single-precision floats. The code performs this operation in parallel on a GPU using CUDA, NVIDIA's parallel computing platform and programming model.

## Requirements
- NVIDIA GPU with CUDA support
- CUDA Toolkit installed on your system

## Description
The code is divided into two main parts:
- CUDA Kernel (saxpy_parallel): This function runs on the GPU and performs the SAXPY computation. Each thread computes a single element of the result vector.
- Host Code (main function): This part runs on the CPU. It allocates memory on the GPU, copies data from the host to the GPU, invokes the CUDA kernel, and then copies the result back to the host.
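
The two parts described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual source: the kernel's parameter list, the bounds check, and the host wrapper name `saxpy_host` are assumptions made for this example, and error checking is left out to keep the flow visible.

```cuda
#include <cuda_runtime.h>

// Kernel: each thread computes one element, y[i] = alpha * x[i] + y[i].
// The parameter order is an assumption for this sketch.
__global__ void saxpy_parallel(int n, float alpha, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // the last block may extend past n
        y[i] = alpha * x[i] + y[i];
    }
}

// Hypothetical host-side wrapper: allocate, copy in, launch, copy out, free.
// Assumes n > 0; return codes are ignored here for brevity.
void saxpy_host(int n, float alpha, const float *x, float *y)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *d_x = nullptr;
    float *d_y = nullptr;

    // Allocate device copies of x and y.
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);

    // Copy the inputs from host to device.
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    // One thread per element; round the block count up so every element is covered.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // n = 2000 gives 8 blocks
    saxpy_parallel<<<blocks, threads>>>(n, alpha, d_x, d_y);

    // Blocking copy on the default stream: it waits for the kernel to finish.
    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);

    // Release device memory.
    cudaFree(d_x);
    cudaFree(d_y);
}
```

A main function would fill x and y on the host, call `saxpy_host(n, alpha, x, y)`, and print y afterwards. With 256 threads per block and n = 2000, the grid holds 2048 threads, so the `i < n` guard keeps the extra 48 threads from writing out of bounds.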

## Key Components
- saxpy_parallel: CUDA kernel function for the SAXPY operation.
- main: Host function to set up CUDA environment and invoke the kernel.
- n: Size of the input vectors.
- alpha: Scalar value in the SAXPY operation.
- x, y: Input vectors.
- d_x, d_y: Device (GPU) copies of x and y.

## Usage
1. Compile the code using the CUDA compiler (nvcc).
2. Run the executable. It will perform the SAXPY operation and print the results.

## Important Considerations
- The vector size is fixed at n = 2000, but this can be adjusted as needed.
- The number of threads per block is set to 256, but this value can be tuned for different GPUs.
- Each device allocation made with cudaMalloc should be released with cudaFree to avoid memory leaks.
- Error checks for CUDA operations (like cudaMalloc, cudaMemcpy) are not included but are recommended for robust code.

## Conclusion
This CUDA SAXPY implementation demonstrates basic GPU programming concepts, including memory management, kernel invocation, and parallel computation. It serves as a starting point for more complex GPU-accelerated applications.
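
As one step toward more robust code, the error checks recommended under Important Considerations could look something like the sketch below. The `CUDA_CHECK` macro and the `copy_to_device` helper are hypothetical names introduced for this example and are not part of the repository. The CUDA runtime calls themselves (`cudaMalloc`, `cudaMemcpy`, `cudaGetErrorString`) are standard.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: if a CUDA runtime call fails, print the file,
// line, and error string, then exit.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(err_));     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Hypothetical helper: allocate a device buffer of n floats and fill it
// from host memory, checking every runtime call along the way.
float *copy_to_device(const float *h_src, int n)
{
    float *d_dst = nullptr;
    size_t bytes = (size_t)n * sizeof(float);
    CUDA_CHECK(cudaMalloc((void **)&d_dst, bytes));
    CUDA_CHECK(cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice));
    return d_dst;  // the caller releases it with cudaFree
}
```

A kernel launch returns no status itself, so one common pattern is to follow it with `CUDA_CHECK(cudaGetLastError())` to catch launch-configuration errors, and `CUDA_CHECK(cudaDeviceSynchronize())` to surface errors raised while the kernel runs.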
Before running the make command, ensure that the makefile has the correct paths and instructions for your system's architecture and the intended CUDA toolkit version.