https://github.com/tenxlenx/gpudct
A library to extract DCT hashes with CUDA
- Host: GitHub
- URL: https://github.com/tenxlenx/gpudct
- Owner: tenxlenx
- License: mit
- Created: 2023-08-28T09:13:32.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2025-10-21T04:35:40.000Z (8 months ago)
- Last Synced: 2025-10-21T06:16:25.921Z (8 months ago)
- Topics: computer-vision, cpp, cuda, image-feature, image-processing, image-similarity, perceptual-hashing
- Language: Cuda
- Homepage:
- Size: 208 KB
- Stars: 4
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# GpuDct: CUDA DCT Hashing Library
GpuDct is a CUDA C++20 library that computes 64-bit perceptual hashes from square images using fused Discrete Cosine Transform (DCT) kernels. Each kernel evaluates the full T * A * T' pipeline, extracts the 8x8 low-frequency block on device, and emits a median-threshold signature without extra launches or host round trips.
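For readers who want to see the pipeline described above on the CPU, here is a minimal reference sketch. It is not the library's code: the function name `dct_hash_reference`, the orthonormal DCT-II scaling, the inclusion of the DC term in the median, the strict `>` comparison, and the bit order are all assumptions, so its output will not necessarily match GpuDct bit for bit.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>
#include <vector>

// CPU reference sketch (hypothetical helper, not part of GpuDct):
// 64-bit DCT hash of an n x n row-major float image.
// Only the 8 lowest-frequency rows of the DCT basis T are built, because
// the hash uses just the top-left 8x8 block of T * A * T'.
uint64_t dct_hash_reference(const std::vector<float>& image, int n) {
    constexpr int kBlock = 8;
    const double pi = std::acos(-1.0);

    // t[k * n + j]: orthonormal DCT-II basis, frequency k, sample j.
    std::vector<double> t(static_cast<std::size_t>(kBlock) * n);
    for (int k = 0; k < kBlock; ++k) {
        const double scale = std::sqrt((k == 0 ? 1.0 : 2.0) / n);
        for (int j = 0; j < n; ++j) {
            t[k * n + j] = scale * std::cos(pi * (2 * j + 1) * k / (2.0 * n));
        }
    }

    // tmp = T (8 x n) * A (n x n)
    std::vector<double> tmp(static_cast<std::size_t>(kBlock) * n, 0.0);
    for (int k = 0; k < kBlock; ++k) {
        for (int r = 0; r < n; ++r) {
            for (int c = 0; c < n; ++c) {
                tmp[k * n + c] += t[k * n + r] * image[r * n + c];
            }
        }
    }

    // low = tmp (8 x n) * T' (n x 8): the 8x8 low-frequency block.
    std::array<double, 64> low{};
    for (int k = 0; k < kBlock; ++k) {
        for (int l = 0; l < kBlock; ++l) {
            double acc = 0.0;
            for (int c = 0; c < n; ++c) {
                acc += tmp[k * n + c] * t[l * n + c];
            }
            low[k * kBlock + l] = acc;
        }
    }

    // Median of the 64 coefficients (mean of the two middle values).
    std::array<double, 64> sorted = low;
    std::sort(sorted.begin(), sorted.end());
    const double median = 0.5 * (sorted[31] + sorted[32]);

    // Bit i is set when coefficient i lies strictly above the median.
    uint64_t hash = 0;
    for (int i = 0; i < 64; ++i) {
        if (low[i] > median) {
            hash |= (uint64_t{1} << i);
        }
    }
    return hash;
}
```

Because the threshold is the median of the coefficients themselves, multiplying every pixel by a positive constant leaves the hash unchanged, which is part of what makes this kind of signature robust to brightness scaling.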
## Highlights
- Fused single-pass kernels for 32, 64, 128, and 256 sized images with constant-memory transforms
- Stream-ordered temporary allocations via CUDA memory pools (no hot-path malloc)
- In-kernel 8x8 hashing and median selection yielding a 64-bit binary fingerprint
- Batch and multi-stream helpers for high-throughput pipelines
- Benchmarks instrumented with CUDA events for precise GPU time attribution
- CMake package configured for CUDA + C++20, friendly with FetchContent and install exports
## Requirements
- NVIDIA GPU with compute capability 7.5 or newer (tune `CMAKE_CUDA_ARCHITECTURES` as needed)
- CUDA Toolkit 12.x (tested) with `nvcc`
- CMake 3.18 or newer
- Host compiler with full C++20 support (GCC 11+, Clang 14+, MSVC 19.30+)
- No bundled image-processing dependencies. Provide your own contiguous buffers from any loader you prefer (stb_image, OpenCV, etc.).
## Quick Start
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
```
Outputs include `libGpuDct.a` and sample binaries under `build/examples/`. Sanity check performance with:
```bash
./build/examples/gpu_dct_benchmark # defaults to 32x32
./build/examples/gpu_dct_benchmark 256 # alternate size
```
## Basic Tutorial
The primary entry point is `gpu_dct::GpuDct`. Supported image sizes are 32, 64, 128, and 256.
### 1. Single image hashing from host memory
```cpp
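#include <cstdint>
#include <iostream>
#include <vector>
#include "gpu_dct.h"  // GpuDct header; name assumed, adjust to the repository's include path

int main() {
    constexpr int N = 32;
    gpu_dct::GpuDct dct(N);

    // Synthetic N x N test pattern (element type assumed to be float).
    std::vector<float> image(N * N);
    for (size_t i = 0; i < image.size(); ++i) {
        image[i] = static_cast<float>(i % 256);
    }

    const uint64_t hash = dct.dct_host(image.data());
    std::cout << "hash: 0x" << std::hex << hash << std::dec << "\n";
    return 0;
}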
```
`dct_host` is synchronous and optionally accepts a CUDA stream to integrate with existing GPU work.
### 2. Batched host processing
```cpp
constexpr int N = 64;
constexpr int batch = 16;
gpu_dct::GpuDct dct(N);

// One contiguous buffer holding `batch` images of N x N pixels each.
std::vector<float> images(static_cast<size_t>(N) * N * batch);
std::vector<uint64_t> hashes(batch);
dct.batch_dct_host(images.data(), hashes.data(), batch);
```
The helper stages data through stream-ordered pools, launches fused kernels for the entire batch, and returns once hashes are copied back.
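If your images live in separate buffers, they need to be packed into one block before calling the batch helper. The sketch below assumes the layout implied by the `N * N * batch` buffer size above, namely images stored back to back with image `i` starting at offset `i * N * N`. The function name `pack_batch` is hypothetical and not part of GpuDct.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: copy separately stored n x n images into one
// contiguous buffer, image i occupying [i * n * n, (i + 1) * n * n).
std::vector<float> pack_batch(const std::vector<std::vector<float>>& images, int n) {
    const std::size_t pixels = static_cast<std::size_t>(n) * n;
    std::vector<float> packed(pixels * images.size());
    for (std::size_t i = 0; i < images.size(); ++i) {
        if (images[i].size() != pixels) {
            throw std::invalid_argument("image size does not match n * n");
        }
        std::copy(images[i].begin(), images[i].end(),
                  packed.begin() + static_cast<std::ptrdiff_t>(i * pixels));
    }
    return packed;
}
```

The resulting `packed.data()` pointer can then be handed to a batch call in place of `images.data()`, with `images.size()` as the batch count.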
### 3. Device-to-device workflows and multi-stream execution
```cpp
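#include <array>
#include <cuda_runtime.h>

gpu_dct::GpuDct dct(128);
constexpr int batch = 64;

// Device buffers for the packed input images and the output hashes.
float* d_images = nullptr;
uint64_t* d_hashes = nullptr;
cudaMalloc(&d_images, 128 * 128 * batch * sizeof(float));
cudaMalloc(&d_hashes, batch * sizeof(uint64_t));
// populate d_images on device...

// Single-stream batch: input and output both stay on the device.
dct.batch_dct_device(d_images, d_hashes, batch);

// Multi-stream variant. The stream count was lost in extraction; 4 is illustrative.
std::array<cudaStream_t, 4> streams{};
for (auto& s : streams) {
    cudaStreamCreate(&s);
}
dct.batch_dct_device_multistream(d_images, d_hashes, batch, streams);
for (auto s : streams) {
    cudaStreamDestroy(s);
}

cudaFree(d_images);
cudaFree(d_hashes);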
```
Hashes remain on the device, enabling additional GPU-side comparisons before any host transfer.
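Perceptual hashes are usually compared by Hamming distance, the number of bits in which two signatures differ. The host-side sketch below illustrates the idea; the helper names and the default threshold of 10 bits are assumptions chosen for illustration, not values taken from GpuDct. In a CUDA kernel the same count can be obtained with the `__popcll` intrinsic applied to `a ^ b`.

```cpp
#include <bitset>
#include <cstdint>

// Number of differing bits between two 64-bit perceptual hashes.
inline int hamming_distance(uint64_t a, uint64_t b) {
    return static_cast<int>(std::bitset<64>(a ^ b).count());
}

// Hypothetical helper: treat two images as near-duplicates when at most
// `max_bits` of the 64 hash bits differ. The default is illustrative only.
inline bool is_near_duplicate(uint64_t a, uint64_t b, int max_bits = 10) {
    return hamming_distance(a, b) <= max_bits;
}
```

A suitable threshold depends on the data set, so it is worth calibrating it against known duplicate and non-duplicate pairs.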
### 4. Hashing a real image
Download any public grayscale or RGB square image and feed it through the helper utility:
```bash
cmake --build build -j$(nproc)
./build/examples/gpu_dct_hash_image path/to/lena.jpg 256
```
The tool uses stb_image to decode the asset, converts it to grayscale, downsamples to the requested DCT size (32, 64, 128, or 256), and prints the 64-bit perceptual hash so you can cross-check against other implementations.
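If you want to reproduce the preprocessing in your own loader, the two steps can be sketched as follows. This is an assumption-laden illustration, not the tool's actual code: it uses Rec. 601 luma weights for the grayscale conversion and a plain box filter for downsampling, and it requires the source size to be an integer multiple of the target size. The function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert interleaved 8-bit RGB pixels to grayscale floats (Rec. 601 weights).
std::vector<float> rgb_to_gray(const std::vector<uint8_t>& rgb, int width, int height) {
    std::vector<float> gray(static_cast<std::size_t>(width) * height);
    for (std::size_t i = 0; i < gray.size(); ++i) {
        gray[i] = 0.299f * rgb[3 * i] + 0.587f * rgb[3 * i + 1] + 0.114f * rgb[3 * i + 2];
    }
    return gray;
}

// Box-filter downsample of a square src x src image to dst x dst.
// Assumes src is an integer multiple of dst.
std::vector<float> box_downsample(const std::vector<float>& gray, int src, int dst) {
    const int factor = src / dst;
    std::vector<float> out(static_cast<std::size_t>(dst) * dst);
    for (int y = 0; y < dst; ++y) {
        for (int x = 0; x < dst; ++x) {
            float sum = 0.0f;
            // Average the factor x factor block that maps to output pixel (x, y).
            for (int dy = 0; dy < factor; ++dy) {
                for (int dx = 0; dx < factor; ++dx) {
                    sum += gray[(y * factor + dy) * src + (x * factor + dx)];
                }
            }
            out[y * dst + x] = sum / static_cast<float>(factor * factor);
        }
    }
    return out;
}
```

Different resampling filters produce slightly different low-frequency coefficients, so hashes computed after a different downsampling step may differ in a few bits from the tool's output.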
### Feeding data from image libraries (optional)
GpuDct only expects a contiguous buffer of pixel intensities, so you can lift data from whatever host-side library you already use without additional dependencies. For example, with OpenCV:
```cpp
cv::Mat gray = cv::imread(path, cv::IMREAD_GRAYSCALE);
if (!gray.data || gray.rows != N || gray.cols != N) {
    throw std::runtime_error("unexpected image dimensions");
}

// Widen the 8-bit pixels into the float buffer passed to GpuDct.
std::vector<float> image(gray.rows * gray.cols);
std::transform(gray.begin<uint8_t>(), gray.end<uint8_t>(), image.begin(),
               [](uint8_t v) { return static_cast<float>(v); });

gpu_dct::GpuDct dct(N);
const uint64_t hash = dct.dct_host(image.data());
```
Any loader that produces a contiguous block (stb_image, libpng, custom CUDA pipelines) can be wired up the same way.
## Using GpuDct in another CMake project
```cmake
include(FetchContent)
FetchContent_Declare(
GpuDct
GIT_REPOSITORY https://github.com/tenxlenx/GpuDct.git
GIT_TAG main
)
FetchContent_MakeAvailable(GpuDct)
find_package(CUDAToolkit REQUIRED)  # defines the CUDA::cudart target in this scope
add_executable(hash_demo main.cpp)
target_link_libraries(hash_demo PRIVATE GpuDct CUDA::cudart)
set_property(TARGET hash_demo PROPERTY CXX_STANDARD 20)
```
Override `CMAKE_CUDA_ARCHITECTURES` in the parent project to match deployment hardware.
## Benchmarking
`examples/gpu_dct_benchmark` exercises single images, batched runs, and multi-stream scenarios with CUDA event profiling on every test. CLI usage:
```
./gpu_dct_benchmark # 32x32, default iterations
./gpu_dct_benchmark 128 # choose image size
./gpu_dct_benchmark 64 --streams 4 # adjust streams or iterations
```
The tool reports per-image latency, throughput, and data type comparisons for quick regression checks.
## Troubleshooting
- Mismatch between compiled and runtime GPU architectures: set `CMAKE_CUDA_ARCHITECTURES` explicitly.
- Out-of-memory during large batches: raise the CUDA malloc heap limit or reduce concurrent streams.
- Integrating with pre-existing CUDA streams: pass your stream to constructors or method overloads to preserve ordering.
## License
MIT. See `LICENSE` for details.
The repository vendors `stb_image.h` (public-domain / MIT dual licensed) in `third_party/` for sample image decoding.