https://github.com/acly/vision.cpp
Computer Vision ML inference in C++
- Host: GitHub
- URL: https://github.com/acly/vision.cpp
- Owner: Acly
- License: MIT
- Created: 2025-07-10T07:32:05.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-08-17T13:10:32.000Z (10 months ago)
- Last Synced: 2025-08-17T15:10:36.251Z (10 months ago)
- Language: C++
- Size: 665 KB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# _vision_.cpp
Computer Vision ML inference in C++
* Self-contained C++ library
* Efficient inference on consumer CPU and GPUs (NVIDIA, AMD, Intel)
* Lightweight deployment on many platforms (Windows, Linux, macOS)
* Growing number of supported models behind a simple API
* Modular design for full control and implementing your own models
Based on [ggml](https://github.com/ggml-org/ggml), similar to the [llama.cpp](https://github.com/ggml-org/llama.cpp) project.
### Features
| Model | Task | Backends |
| :-------------------------- | :--------------- | :---------- |
| [**MobileSAM**](#mobilesam) | Segmentation | CPU, Vulkan |
| [**BiRefNet**](#birefnet) | Segmentation | CPU, Vulkan |
| [**MI-GAN**](#mi-gan) | Inpainting | CPU, Vulkan |
| [**ESRGAN**](#real-esrgan) | Super-resolution | CPU, Vulkan |
| [_Implement a model [**Guide**]_](docs/model-implementation-guide.md) | | |
## Get Started
Get the library and executables:
* Download a [release package](https://github.com/Acly/vision.cpp/releases) and extract it,
* or [build from source](#building).
### Example: Select an object in an image
Let's use MobileSAM to generate a segmentation mask of the plushy on the right
by passing in a box describing its approximate location.

You can download the model and input image here: [MobileSAM-F16.gguf](https://huggingface.co/Acly/MobileSAM-GGUF/resolve/main/MobileSAM-F16.gguf) | [input.jpg](docs/media/input.jpg)
#### CLI
Find the `vision-cli` executable in the `bin` folder and run it to generate the mask:
```sh
vision-cli sam -m MobileSAM-F16.gguf -i input.jpg -p 420 120 650 430 -o mask.png
```
Pass `--composite output.png` to composite input and mask. Use `--help` for more options.
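To illustrate what compositing an image with a mask involves, here is a minimal sketch that blends an RGB image over a solid background colour, using the mask as per-pixel opacity. It is not the code `vision-cli` runs: `composite_over`, the buffer layout and the choice of a solid background are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of vision.cpp.
// rgb:  interleaved 8-bit RGB, 3 bytes per pixel
// mask: 8-bit opacity, 1 byte per pixel (255 = keep the image pixel)
std::vector<uint8_t> composite_over(const std::vector<uint8_t>& rgb,
                                    const std::vector<uint8_t>& mask,
                                    uint8_t bg_r, uint8_t bg_g, uint8_t bg_b) {
    assert(rgb.size() == mask.size() * 3);
    const unsigned bg[3] = {bg_r, bg_g, bg_b};
    std::vector<uint8_t> out(rgb.size());
    for (std::size_t i = 0; i < mask.size(); ++i) {
        const unsigned a = mask[i];
        for (std::size_t c = 0; c < 3; ++c) {
            const unsigned fg = rgb[i * 3 + c];
            // Rounded integer blend: (fg * a + bg * (255 - a)) / 255
            out[i * 3 + c] = static_cast<uint8_t>((fg * a + bg[c] * (255 - a) + 127) / 255);
        }
    }
    return out;
}
```

With a mask value of 255 the image pixel is kept unchanged, with 0 it is replaced by the background colour, and values in between mix the two.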
#### API
```c++
#include <visp/vision.h>
using namespace visp;
int main() {
    // Initialise the CPU backend and load the model onto it
    backend_device cpu = backend_init(backend_type::cpu);
    sam_model sam = sam_load_model("MobileSAM-F16.gguf", cpu);
    // Load and encode the input image
    image_data input_image = image_load("input.jpg");
    sam_encode(sam, input_image);
    // Compute the mask for the object inside the box prompt
    image_data object_mask = sam_compute(sam, box_2d{{420, 120}, {650, 320}});
    image_save(object_mask, "mask.png");
}
```
This shows the high-level API. Internally it is composed of multiple smaller
functions that handle model loading, pre-processing inputs, transferring data to
backend devices, post-processing output, etc. These can be used as building
blocks for flexible functions which integrate with your existing data sources
and infrastructure.
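As a rough illustration of what such building blocks do, the sketch below shows a typical pre-processing step (8-bit RGB to normalised floats) and a typical post-processing step (raw scores to a binary mask). These are not vision.cpp functions: `normalize_rgb`, `logits_to_mask` and their parameters are hypothetical, and the library's own steps may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pre-processing step: interleaved 8-bit RGB -> float,
// scaled to [0, 1] and normalised per channel with the given mean/stddev.
std::vector<float> normalize_rgb(const std::vector<uint8_t>& rgb,
                                 const std::array<float, 3>& mean,
                                 const std::array<float, 3>& stddev) {
    assert(rgb.size() % 3 == 0);
    std::vector<float> out(rgb.size());
    for (std::size_t i = 0; i < rgb.size(); ++i) {
        const std::size_t c = i % 3;  // channel index: 0 = R, 1 = G, 2 = B
        out[i] = (static_cast<float>(rgb[i]) / 255.0f - mean[c]) / stddev[c];
    }
    return out;
}

// Hypothetical post-processing step: raw per-pixel scores -> binary 8-bit mask.
std::vector<uint8_t> logits_to_mask(const std::vector<float>& logits,
                                    float threshold = 0.0f) {
    std::vector<uint8_t> mask(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        mask[i] = logits[i] > threshold ? 255 : 0;
    }
    return mask;
}
```

Keeping each step a plain function over buffers is what makes it easy to swap in your own image source or output format.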
## Models
#### MobileSAM

[Model download](https://huggingface.co/Acly/MobileSAM-GGUF/tree/main) | [Paper (arXiv)](https://arxiv.org/pdf/2306.14289.pdf) | [Repository (GitHub)](https://github.com/ChaoningZhang/MobileSAM) | [Segment-Anything-Model](https://segment-anything.com/) | License: Apache-2.0
```sh
vision-cli sam -m MobileSAM-F16.gguf -i input.png -p 300 200 -o mask.png --composite comp.png
```
#### BiRefNet

[Model download](https://huggingface.co/Acly/BiRefNet-GGUF/tree/main) | [Paper (arXiv)](https://arxiv.org/pdf/2401.03407) | [Repository (GitHub)](https://github.com/ZhengPeng7/BiRefNet) | License: MIT
```sh
vision-cli birefnet -m BiRefNet-lite-F16.gguf -i input.png -o mask.png --composite comp.png
```
#### MI-GAN

[Model download](https://huggingface.co/Acly/MIGAN-GGUF/tree/main) | [Paper (thecvf.com)](https://openaccess.thecvf.com/content/ICCV2023/papers/Sargsyan_MI-GAN_A_Simple_Baseline_for_Image_Inpainting_on_Mobile_Devices_ICCV_2023_paper.pdf) | [Repository (GitHub)](https://github.com/Picsart-AI-Research/MI-GAN) | License: MIT
```sh
vision-cli migan -m MIGAN-512-places2-F16.gguf -i image.png mask.png -o output.png
```
#### Real-ESRGAN

[Model download](https://huggingface.co/Acly/Real-ESRGAN-GGUF) | [Paper (arXiv)](https://arxiv.org/abs/2107.10833) | [Repository (GitHub)](https://github.com/xinntao/Real-ESRGAN) | License: BSD-3-Clause
```sh
vision-cli esrgan -m ESRGAN-4x-foolhardy_Remacri-F16.gguf -i input.png -o output.png
```
### Converting models
Models need to be converted to GGUF before they can be used. The conversion also
rearranges or precomputes tensors for more efficient inference.
To convert a model, install [uv](https://docs.astral.sh/uv/) and run:
```sh
uv run scripts/convert.py <arch> MyModel.pth
```
where `<arch>` is one of `sam, birefnet, esrgan, ...`.
This will create `models/MyModel.gguf`. See `convert.py --help` for more options.
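A quick way to sanity-check a converted file is to look at its GGUF header. The sketch below assumes GGUF version 2 or later in little-endian layout: a 4-byte `GGUF` magic, a 32-bit version, then 64-bit tensor and metadata key/value counts. `gguf_header` and `parse_gguf_header` are hypothetical names for this example, not part of vision.cpp or ggml, which has its own GGUF API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical struct holding the fixed-size fields at the start of a GGUF file.
struct gguf_header {
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
};

// Reads an unsigned little-endian integer of n bytes (n <= 8).
inline uint64_t read_le(const unsigned char* p, std::size_t n) {
    assert(n <= 8);
    uint64_t v = 0;
    for (std::size_t i = 0; i < n; ++i) {
        v |= static_cast<uint64_t>(p[i]) << (8 * i);
    }
    return v;
}

// Parses the header from the first bytes of a file already read into memory.
// Returns std::nullopt if the buffer is too short or the magic does not match.
std::optional<gguf_header> parse_gguf_header(const unsigned char* data, std::size_t size) {
    constexpr std::size_t header_size = 4 + 4 + 8 + 8;
    if (data == nullptr || size < header_size) return std::nullopt;
    if (std::memcmp(data, "GGUF", 4) != 0) return std::nullopt;
    gguf_header h{};
    h.version = static_cast<uint32_t>(read_le(data + 4, 4));
    h.tensor_count = read_le(data + 8, 8);
    h.metadata_kv_count = read_le(data + 16, 8);
    return h;
}
```

Reading the first 24 bytes of `models/MyModel.gguf` into a buffer and passing them to `parse_gguf_header` is enough to tell whether the conversion produced a GGUF file and how many tensors it contains.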
## Building
Building requires CMake and a compiler with C++20 support.
**Get the sources**
```sh
git clone https://github.com/Acly/vision.cpp.git --recursive
cd vision.cpp
```
**Configure and build**
```sh
cmake . -B build -D CMAKE_BUILD_TYPE=Release
cmake --build build --config Release
```
### Vulkan _(Optional)_
Building with Vulkan GPU support requires the [Vulkan SDK](https://www.lunarg.com/vulkan-sdk/) to be installed.
```sh
cmake . -B build -D CMAKE_BUILD_TYPE=Release -D VISP_VULKAN=ON
```
### Tests _(Optional)_
Build with `-DVISP_TESTS=ON`. Run all C++ tests with the following command:
```sh
cd build
ctest -C Release
```
Some tests require a Python environment. It can be set up with [uv](https://docs.astral.sh/uv/):
```sh
# Set up venv and install dependencies (once only)
uv sync
# Run python tests
uv run pytest
```
## Performance
Performance optimization is an ongoing process. The aim is to be in the same ballpark
as other frameworks for inference speed, but with:
* much faster initialization and model loading time (<100 ms)
* lower memory overhead
* tiny deployment size (<5 MB for CPU, +30 MB for GPU)
### Inference speed
* CPU: AMD Ryzen 5 5600X (6 cores)
* GPU: NVIDIA GeForce RTX 4070
#### MobileSAM, 1024x1024
| Device | Precision | _vision.cpp_ | PyTorch | ONNX Runtime |
| :--- | :--- | -----------: | ------: | -----------: |
| cpu | f32 | 669 ms | 601 ms | 805 ms |
| gpu | f16 | 19 ms | 16 ms | |
#### BiRefNet, 1024x1024
| Model | Device | Precision | _vision.cpp_ | PyTorch | ONNX Runtime |
| :---- | :--- | :--- | -----------: | -------: | -----------: |
| Full | cpu | f32 | 16333 ms | 18800 ms | |
| Full | gpu | f16 | 243 ms | 140 ms | |
| Lite | cpu | f32 | 4505 ms | 10900 ms | 6978 ms |
| Lite | gpu | f16 | 86 ms | 59 ms | |
#### MI-GAN, 512x512
| Model | Device | Precision | _vision.cpp_ | PyTorch |
| :---------- | :--- | :--- | -----------: | ------: |
| 512-places2 | cpu | f32 | 523 ms | 637 ms |
| 512-places2 | gpu | f16 | 21 ms | 17 ms |
#### Setup
* vision.cpp: using vision-bench, GPU via Vulkan, e.g. `vision-bench -m sam -b cpu`
* PyTorch: v2.7.1+cu128, eager eval, GPU via CUDA, averaged over n iterations after warm-up
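
Averaging over n iterations after warm-up can be sketched as below. This is only an illustration of the measurement approach, not the code of `vision-bench` or of the PyTorch runs; `mean_runtime_ms` is a hypothetical helper.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: runs fn `warmup` times untimed, then returns the
// mean wall-clock time in milliseconds over `iterations` timed runs.
double mean_runtime_ms(const std::function<void()>& fn, int warmup, int iterations) {
    assert(warmup >= 0 && iterations > 0);
    for (int i = 0; i < warmup; ++i) {
        fn();  // warm-up runs let caches, allocators and drivers settle
    }
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        fn();
    }
    const std::chrono::duration<double, std::milli> total = clock::now() - start;
    return total.count() / iterations;
}
```

The warm-up runs are excluded because the first calls often include one-time costs that are not representative of steady-state inference.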
## Dependencies (integrated)
* [ggml](https://github.com/ggml-org/ggml) - ML tensor library | MIT
* [stb-image](https://github.com/nothings/stb) - Image load/save/resize | Public Domain
* [fmt](https://github.com/fmtlib/fmt) - String formatting _(only if compiler doesn't support `<format>`)_ | MIT