https://github.com/vicen-te/tiny-nn

A tiny neural network framework for fully-connected layers with CPU and CUDA support

# Tiny-NN — Fully Connected Neural Networks in C++20 + CUDA 12.8

Tiny-NN is a high-performance implementation of fully connected neural networks supporting both CPU and GPU execution. It's designed for easy experimentation and benchmarking, featuring:
- CPU execution (parallelized)
- CUDA execution with memory reuse (weights and biases uploaded only once per layer)
- Training with backpropagation and SGD
- Model serialization using [`json.hpp`](https://github.com/nlohmann/json) (MIT licensed) included in the repository
- Simple MNIST dataset integration and ASCII preview
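The core of training is a fully connected forward pass plus backpropagation with plain SGD. A minimal single-layer sketch of that idea (illustrative only; `FcLayer` and its methods are not Tiny-NN's actual API):

```cpp
#include <cassert>
#include <vector>

// One fully connected layer: y = W x + b, trained with plain SGD.
// For a squared-error loss L = 0.5 * ||y - t||^2 the gradients are:
//   dL/dy = y - t,   dL/dW = (dL/dy) x^T,   dL/db = dL/dy.
struct FcLayer {
    int in, out;
    std::vector<float> W, b;  // W is row-major, out x in

    FcLayer(int in_, int out_)
        : in(in_), out(out_), W(out_ * in_, 0.01f), b(out_, 0.0f) {}

    std::vector<float> forward(const std::vector<float>& x) const {
        std::vector<float> y(out);
        for (int o = 0; o < out; ++o) {
            float acc = b[o];
            for (int i = 0; i < in; ++i) acc += W[o * in + i] * x[i];
            y[o] = acc;
        }
        return y;
    }

    // One SGD step on a single (x, target) pair; returns the loss before the update.
    float sgd_step(const std::vector<float>& x, const std::vector<float>& t, float lr) {
        std::vector<float> y = forward(x);
        float loss = 0.0f;
        for (int o = 0; o < out; ++o) {
            float dy = y[o] - t[o];  // dL/dy
            loss += 0.5f * dy * dy;
            for (int i = 0; i < in; ++i) W[o * in + i] -= lr * dy * x[i];
            b[o] -= lr * dy;
        }
        return loss;
    }
};
```

Repeated `sgd_step` calls on the same sample drive the loss toward zero; a real network stacks several such layers with nonlinear activations in between.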

## Requirements
- C++20 compatible compiler
- CUDA 12.8 (for GPU support)
- CMake >= 3.24
- Python 3.12 (optional, for dataset download and preview)

> Although development was done on Windows 10/11 using Visual Studio 2022, the project can be built on any OS with a compatible C++20 compiler and CUDA installation.

## Setup

1. Clone or copy the repository to your machine.
```bash
git clone https://github.com/Vicen-te/tiny-nn.git
cd tiny-nn
```

2. Download the MNIST dataset:
Using the Python script (recommended):
```bash
python scripts/download_mnist.py
```
- This will download and save the MNIST dataset in `data/mnist/`.
- Alternatively, you can download the dataset manually from [Kaggle](https://www.kaggle.com/datasets/hojjatk/mnist-dataset).

3. Optional: generate a small model using Python (arguments: input, hidden, and output layer sizes):
```bash
python data/generate_model.py 128 64 10
```

4. Optional: preview MNIST digits:

- Python: `python scripts/preview.py`
- C++: `ascii_preview()` function in `MNISTLoader`
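An ASCII preview maps each pixel of a 28×28 grayscale image to a character by brightness. A standalone sketch of the technique (the actual `ascii_preview()` in `MNISTLoader` may differ in details):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Render a grayscale image (0..255 per pixel, row-major) as ASCII art,
// picking a character from a brightness ramp for each pixel.
std::string ascii_preview(const std::vector<std::uint8_t>& px, int w, int h) {
    const std::string ramp = " .:-=+*#%@";  // 10 brightness levels, dark to bright
    std::string out;
    out.reserve((w + 1) * h);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int level = px[y * w + x] * (int(ramp.size()) - 1) / 255;
            out += ramp[level];
        }
        out += '\n';
    }
    return out;
}
```

For MNIST, calling this with `w = h = 28` on one sample's pixel buffer prints a recognizable digit outline in the console.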


## Build
### Visual Studio:

- Open Visual Studio -> File -> Open -> Folder... and select the project folder.
- Visual Studio will detect CMake. For GPU usage, choose x64 configuration.
- Build -> Build All.

### PowerShell / Developer Command Prompt (recommended):
#### Option 1: Specify all options manually
```powershell
mkdir build
cd build
cmake .. -G "Visual Studio 17 2022" -A x64 -DCUDA_TOOLKIT_ROOT_DIR="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.8"
cmake --build . --config Release
```

- `-G "Visual Studio 17 2022"` selects Visual Studio 2022
- `-A x64` selects 64-bit architecture (recommended for CUDA)
- `-DCUDA_TOOLKIT_ROOT_DIR` is optional, CMake can auto-detect CUDA
> Note: the `-A x64` option is recommended if you want to use CUDA on Windows. On Linux or macOS, it is not necessary.

#### Option 2: Let CMake detect everything automatically (recommended)
```powershell
cmake -B build -S .
cmake --build build --config Release
```

- CMake will detect Visual Studio and CUDA if installed in standard locations
- `-S` is the source folder, `-B` is the build folder
> Both methods produce the same result. Use Option 2 for simplicity and fewer manual settings.


## Run

From the `build/Release` folder:
```powershell
tiny-nn.exe
```

Modes:

- `train` or `t` → Train model
- `inference` or `i` → Run inference on a sample
- `benchmark` or `b` → Compare CPU vs CUDA performance
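Accepting both the long name and the one-letter shorthand comes down to a small dispatch on the argument string; a sketch of that parsing (illustrative, not Tiny-NN's actual CLI code):

```cpp
#include <cassert>
#include <string>

enum class Mode { Train, Inference, Benchmark, Unknown };

// Map either the full mode name or its one-letter shorthand to a Mode.
Mode parse_mode(const std::string& arg) {
    if (arg == "train"     || arg == "t") return Mode::Train;
    if (arg == "inference" || arg == "i") return Mode::Inference;
    if (arg == "benchmark" || arg == "b") return Mode::Benchmark;
    return Mode::Unknown;
}
```

Returning an explicit `Unknown` value lets the caller print a usage message instead of silently defaulting to one mode.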

### Expected output

#### Training (train / t)

- Training progress printed to console
- Training duration in seconds
- Model saved as JSON to `./data/models/fc_digit_classification.json`
- ASCII MNIST preview of a single sample image

#### Inference (inference / i)

- Output values of selected sample
- Maximum value and its index
- ASCII preview of the sample

#### Benchmark (benchmark / b)

- CPU vs GPU inference correctness check
- Average inference timings per method
- CSV results saved to `./data/results/bench.csv`

> Currently, the benchmark measures only inference, not training. Measuring training performance would require additional implementation.
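Averaged timings like these are typically collected with `std::chrono`; a sketch of timing a callable over repeated runs (illustrative, not the project's actual benchmark code):

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Run `fn` `iters` times and return the mean wall-clock duration in milliseconds.
double mean_ms(const std::function<void()>& fn, int iters) {
    using clock = std::chrono::steady_clock;
    auto start = clock::now();
    for (int i = 0; i < iters; ++i) fn();
    std::chrono::duration<double, std::milli> total = clock::now() - start;
    return total.count() / iters;
}
```

For a fair CPU-vs-GPU comparison, a warm-up call before timing avoids counting one-off costs (first-touch allocations, CUDA context creation) in the average.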

## Notes & Improvements

- Currently, weights `W` and biases `b` are uploaded to the GPU **once per layer**. The input vector is uploaded for each inference.
- cuBLAS GEMM is already used for matrix multiplications, replacing the simple custom FC kernel.
- Intermediate GPU buffers (`dX`/`dY`) are allocated per layer and batch and are **not fully reused**, though CUDA streams enable asynchronous execution.
- Possible future improvements for higher performance:
  - Reusing intermediate GPU buffers across layers and batches via CUDA streams.
  - Implementing more efficient batching, and overlapping data transfers with computation.
- Profiling can be done with Nsight Systems / Nsight Compute.