https://github.com/vicen-te/tiny-nn
A tiny neural network framework for fully-connected layers with CPU and CUDA support
- Host: GitHub
- URL: https://github.com/vicen-te/tiny-nn
- Owner: Vicen-te
- License: MIT
- Created: 2025-10-06T21:59:05.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-10-28T23:39:09.000Z (5 months ago)
- Last Synced: 2025-10-29T00:34:16.406Z (5 months ago)
- Topics: backpropagation, cplusplus-20, cpu, cuda, cuda-12-8, kernel, multi-threaded, neural-network, nn
- Language: C++
- Homepage:
- Size: 3.83 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Tiny-NN — Fully Connected Neural Networks in C++20 + CUDA 12.8
Tiny-NN is a high-performance implementation of fully connected neural networks supporting both CPU and GPU execution. It's designed for easy experimentation and benchmarking, featuring:
- CPU execution (parallelized)
- CUDA execution with memory reuse (weights and biases uploaded only once per layer)
- Training with backpropagation and SGD
- Model serialization using [`json.hpp`](https://github.com/nlohmann/json) (MIT licensed) included in the repository
- Simple MNIST dataset integration and ASCII preview
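For orientation, a fully connected layer's forward pass is just `y = activation(W·x + b)`. A minimal CPU sketch of that computation (hypothetical names, not Tiny-NN's actual API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a fully connected layer's forward pass:
// y = relu(W * x + b), with W stored row-major as out_dim x in_dim.
std::vector<float> fc_forward(const std::vector<float>& W,
                              const std::vector<float>& b,
                              const std::vector<float>& x) {
    const std::size_t out_dim = b.size();
    const std::size_t in_dim  = x.size();
    assert(W.size() == out_dim * in_dim);

    std::vector<float> y(out_dim);
    for (std::size_t i = 0; i < out_dim; ++i) {
        float acc = b[i];
        for (std::size_t j = 0; j < in_dim; ++j)
            acc += W[i * in_dim + j] * x[j];
        y[i] = std::max(acc, 0.0f);  // ReLU activation
    }
    return y;
}
```

The CUDA path computes the same product with cuBLAS GEMM (see Notes below); the per-element loop here is only for clarity.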
## Requirements
- C++20 compatible compiler
- CUDA 12.8 (for GPU support)
- CMake >= 3.24
- Python 3.12 (optional, for dataset download and preview)
> Although development was done on Windows 10/11 using Visual Studio 2022, the project can be built on any OS with a compatible C++20 compiler and CUDA installation.
## Setup
1. Clone or copy the repository to your machine.
```bash
git clone https://github.com/Vicen-te/tiny-nn.git
cd tiny-nn
```
2. Download the MNIST dataset:
Using Python script (recommended):
```bash
python scripts/download_mnist.py
```
- This will download and save the MNIST dataset in `data/mnist/`.
- Alternatively, you can download the dataset manually from [Kaggle](https://www.kaggle.com/datasets/hojjatk/mnist-dataset)
3. Optional: generate a small model using Python (arguments: input, hidden, and output layer sizes):
```bash
python data/generate_model.py 128 64 10
```
4. Optional: preview MNIST digits:
- Python: `python scripts/preview.py`
- C++: `ascii_preview()` function in MNISTLoader
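An ASCII preview typically maps each pixel's intensity onto a dark-to-light character ramp. A sketch of the idea (hypothetical, not the repo's exact `ascii_preview()` implementation):

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: map each grayscale pixel (0-255) to one of ten
// characters on a dark-to-light ramp, emitting one line per image row.
std::string ascii_preview(const std::vector<unsigned char>& pixels,
                          int width, int height) {
    static const char ramp[] = " .:-=+*#%@";  // 10 intensity levels
    std::string out;
    for (int r = 0; r < height; ++r) {
        for (int c = 0; c < width; ++c) {
            int level = pixels[r * width + c] * 9 / 255;
            out += ramp[level];
        }
        out += '\n';
    }
    return out;
}
```

For MNIST, `width` and `height` would both be 28.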
## Build
### Visual Studio:
- Open Visual Studio → File → Open → Folder... and select the project folder.
- Visual Studio will detect the CMake project. For GPU builds, choose the x64 configuration.
- Build → Build All.
### PowerShell / Developer Command Prompt (recommended):
#### Option 1: Specify all options manually
```powershell
mkdir build
cd build
cmake .. -G "Visual Studio 17 2022" -A x64 -DCUDA_TOOLKIT_ROOT_DIR="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.8"
cmake --build . --config Release
```
- `-G "Visual Studio 17 2022"` selects Visual Studio 2022
- `-A x64` selects 64-bit architecture (recommended for CUDA)
- `-DCUDA_TOOLKIT_ROOT_DIR` is optional, CMake can auto-detect CUDA
> Note: The `-A x64` option is recommended if you want to use CUDA on Windows. On Linux or macOS, it is not necessary.
#### Option 2: Let CMake detect everything automatically (recommended)
```powershell
cmake -B build -S .
cmake --build build --config Release
```
- CMake will detect Visual Studio and CUDA if installed in standard locations
- `-S` is the source folder, `-B` is the build folder
> Both methods produce the same result. Use Option 2 for simplicity and fewer manual settings.
## Run
From the `build/Release` folder:
```powershell
.\tiny-nn.exe
```
Modes:
- `train` or `t` → Train model
- `inference` or `i` → Run inference on a sample
- `benchmark` or `b` → Compare CPU vs CUDA performance
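Since each mode accepts either the full word or its one-letter alias, dispatch can be a small helper; a hypothetical sketch (not the repo's actual code):

```cpp
#include <string>

enum class Mode { Train, Inference, Benchmark, Unknown };

// Hypothetical sketch: accept the full mode name or its one-letter alias.
Mode parse_mode(const std::string& arg) {
    if (arg == "train"     || arg == "t") return Mode::Train;
    if (arg == "inference" || arg == "i") return Mode::Inference;
    if (arg == "benchmark" || arg == "b") return Mode::Benchmark;
    return Mode::Unknown;
}
```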
### Expected output
#### Training (train / t)
- Training progress printed to console
- Training duration in seconds
- Saved model JSON to `./data/models/fc_digit_classification.json`
- ASCII MNIST preview of a single sample image
#### Inference (inference / i)
- Output values of selected sample
- Maximum value and its index
- ASCII preview of the sample
#### Benchmark (benchmark / b)
- CPU vs GPU inference correctness check
- Average inference timings per method
- CSV results saved to `./data/results/bench.csv`
> Currently, the benchmark measures only inference, not training. Measuring training performance would require additional instrumentation.
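Per-method averages like these are typically obtained by timing repeated inference calls with `std::chrono`; a sketch of such a loop (assumed, not the repo's exact benchmark code):

```cpp
#include <chrono>
#include <functional>

// Hypothetical sketch: run `infer` `iters` times and return the mean
// latency in milliseconds, as a benchmark loop might.
double mean_latency_ms(const std::function<void()>& infer, int iters) {
    using clock = std::chrono::steady_clock;
    auto t0 = clock::now();
    for (int i = 0; i < iters; ++i) infer();
    auto t1 = clock::now();
    std::chrono::duration<double, std::milli> total = t1 - t0;
    return total.count() / iters;
}
```

For GPU timings, a real version would synchronize the stream (e.g. `cudaStreamSynchronize`) before reading the clock, so queued kernels are not missed.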
## Notes & Improvements
- Currently, weights `W` and biases `b` are uploaded to the GPU **once per layer**. The input vector is uploaded for each inference.
- cuBLAS GEMM is already used for matrix multiplications, replacing the simple custom FC kernel.
- Intermediate GPU buffers (`dX`/`dY`) are allocated per layer and batch and are **not fully reused**, though CUDA streams enable asynchronous execution.
- For higher performance (future improvements):
- Reusing intermediate GPU buffers across layers and batches via CUDA streams.
- Implementing more efficient batching and overlapping of data transfers with computation.
- Profiling can be done with Nsight Systems / Nsight Compute.
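The buffer-reuse idea above can be sketched on the host side: size two buffers for the widest layer once, then ping-pong between them instead of allocating `dX`/`dY` per layer. This sketch uses `std::vector` for clarity; a CUDA version would allocate with `cudaMalloc` or `cudaMallocAsync` on a stream instead:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch of cross-layer buffer reuse: allocate once for the
// widest layer, then alternate the two buffers as each layer's input and
// output, so no per-layer allocation is needed.
struct Workspace {
    std::vector<float> buf[2];

    explicit Workspace(const std::vector<std::size_t>& layer_widths) {
        std::size_t widest = *std::max_element(layer_widths.begin(),
                                               layer_widths.end());
        buf[0].resize(widest);
        buf[1].resize(widest);
    }

    // Views for layer i: even layers read buf[0] and write buf[1],
    // odd layers the reverse, so one layer's output is the next's input.
    float* in(std::size_t i)  { return buf[i % 2].data(); }
    float* out(std::size_t i) { return buf[(i + 1) % 2].data(); }
};
```

With this scheme, layer *i*'s output pointer equals layer *i+1*'s input pointer, so intermediate activations never leave the preallocated workspace.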