https://github.com/acly/vision.cpp

Computer Vision ML inference in C++
Host: GitHub
URL: https://github.com/acly/vision.cpp
Owner: Acly
License: mit
Created: 2025-07-10T07:32:05.000Z (12 months ago)
Default Branch: main
Last Pushed: 2025-08-17T13:10:32.000Z (10 months ago)
Last Synced: 2025-08-17T15:10:36.251Z (10 months ago)
Language: C++
Size: 665 KB
Stars: 2
Watchers: 0
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# _vision_.cpp

Computer Vision ML inference in C++

* Self-contained C++ library

* Efficient inference on consumer CPU and GPUs (NVIDIA, AMD, Intel)

* Lightweight deployment on many platforms (Windows, Linux, MacOS)

* Growing number of supported models behind a simple API

* Modular design for full control and implementing your own models

Based on [ggml](https://github.com/ggml-org/ggml) similar to the [llama.cpp](https://github.com/ggml-org/llama.cpp) project.

### Features

| Model                       | Task             | Backends    |
| :-------------------------- | :--------------- | :---------- |
| [**MobileSAM**](#mobilesam) | Segmentation     | CPU, Vulkan |
| [**BiRefNet**](#birefnet)   | Segmentation     | CPU, Vulkan |
| [**MI-GAN**](#mi-gan)       | Inpainting       | CPU, Vulkan |
| [**ESRGAN**](#real-esrgan)  | Super-resolution | CPU, Vulkan |
| [_Implement a model [**Guide**]_](docs/model-implementation-guide.md) | | |

## Get Started

Get the library and executables:

* Download a [release package](https://github.com/Acly/vision.cpp/releases) and extract it,

* or [build from source](#building).

### Example: Select an object in an image

Let's use MobileSAM to generate a segmentation mask of the plushy on the right
by passing in a box describing its approximate location.



You can download the model and input image here: [MobileSAM-F16.gguf](https://huggingface.co/Acly/MobileSAM-GGUF/resolve/main/MobileSAM-F16.gguf) | [input.jpg](docs/media/input.jpg)

#### CLI

Find the `vision-cli` executable in the `bin` folder and run it to generate the mask:

```sh
vision-cli sam -m MobileSAM-F16.gguf -i input.jpg -p 420 120 650 430 -o mask.png
```

Pass `--composite output.png` to composite input and mask. Use `--help` for more options.
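
What `--composite` computes per pixel is not spelled out here. As a rough illustration only, assuming the mask is used as a per-pixel alpha channel for the input image, compositing could look like the standalone sketch below. The helper `apply_mask_as_alpha` and the buffer layout are hypothetical and not part of vision.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of vision.cpp: combines an RGB image with a
// single-channel mask into an RGBA image, using the mask value as alpha.
//   rgb:  width * height * 3 bytes, interleaved R, G, B
//   mask: width * height bytes, 0 = background, 255 = selected object
std::vector<std::uint8_t> apply_mask_as_alpha(
    std::vector<std::uint8_t> const& rgb,
    std::vector<std::uint8_t> const& mask) {
  assert(rgb.size() == mask.size() * 3);
  std::vector<std::uint8_t> rgba(mask.size() * 4);
  for (std::size_t i = 0; i < mask.size(); ++i) {
    rgba[i * 4 + 0] = rgb[i * 3 + 0];
    rgba[i * 4 + 1] = rgb[i * 3 + 1];
    rgba[i * 4 + 2] = rgb[i * 3 + 2];
    rgba[i * 4 + 3] = mask[i];  // mask value becomes the alpha channel
  }
  return rgba;
}
```

With this layout, pixels outside the mask become fully transparent while the selected object keeps its original colors.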

#### API

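```c++
#include <visp/vision.h> // assumed header path; adjust to your install

using namespace visp;

int main() {
  // Select the CPU backend and load the converted model
  backend_device cpu = backend_init(backend_type::cpu);
  sam_model sam = sam_load_model("MobileSAM-F16.gguf", cpu);

  // Encode the image, then query it with the same box as in the CLI example
  image_data input_image = image_load("input.jpg");
  sam_encode(sam, input_image);
  image_data object_mask = sam_compute(sam, box_2d{{420, 120}, {650, 430}});
  image_save(object_mask, "mask.png");
}
```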

This shows the high-level API. Internally it is composed of multiple smaller
functions that handle model loading, pre-processing inputs, transferring data to
backend devices, post-processing output, etc. These can be used as building
blocks for flexible functions which integrate with your existing data sources
and infrastructure.
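
To give a feel for what one such building block might do, here is a standalone sketch of a typical pre-processing step: converting 8-bit RGB pixels to per-channel normalized floats. The function `normalize_rgb` is a hypothetical name and not vision.cpp's API, and the mean and standard deviation are left as parameters because the values each model expects are not stated here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical pre-processing helper (not vision.cpp API): converts
// interleaved 8-bit RGB pixels to floats in [0, 1], then normalizes each
// channel as (value - mean) / stddev.
std::vector<float> normalize_rgb(std::vector<std::uint8_t> const& rgb,
                                 std::array<float, 3> mean,
                                 std::array<float, 3> stddev) {
  assert(rgb.size() % 3 == 0);
  std::vector<float> out(rgb.size());
  for (std::size_t i = 0; i < rgb.size(); ++i) {
    std::size_t c = i % 3;  // channel index: 0 = R, 1 = G, 2 = B
    out[i] = (rgb[i] / 255.0f - mean[c]) / stddev[c];
  }
  return out;
}
```

A caller would pass whatever statistics the model was trained with, for example `normalize_rgb(pixels, {0.485f, 0.456f, 0.406f}, {0.229f, 0.224f, 0.225f})` for the commonly used ImageNet values.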

## Models

#### MobileSAM



[Model download](https://huggingface.co/Acly/MobileSAM-GGUF/tree/main) | [Paper (arXiv)](https://arxiv.org/pdf/2306.14289.pdf) | [Repository (GitHub)](https://github.com/ChaoningZhang/MobileSAM) | [Segment-Anything-Model](https://segment-anything.com/) | License: Apache-2

```sh
vision-cli sam -m MobileSAM-F16.gguf -i input.png -p 300 200 -o mask.png --composite comp.png
```

#### BiRefNet



[Model download](https://huggingface.co/Acly/BiRefNet-GGUF/tree/main) | [Paper (arXiv)](https://arxiv.org/pdf/2401.03407) | [Repository (GitHub)](https://github.com/ZhengPeng7/BiRefNet) | License: MIT

```sh
vision-cli birefnet -m BiRefNet-lite-F16.gguf -i input.png -o mask.png --composite comp.png
```

#### MI-GAN



[Model download](https://huggingface.co/Acly/MIGAN-GGUF/tree/main) | [Paper (thecvf.com)](https://openaccess.thecvf.com/content/ICCV2023/papers/Sargsyan_MI-GAN_A_Simple_Baseline_for_Image_Inpainting_on_Mobile_Devices_ICCV_2023_paper.pdf) | [Repository (GitHub)](https://github.com/Picsart-AI-Research/MI-GAN) | License: MIT

```sh
vision-cli migan -m MIGAN-512-places2-F16.gguf -i image.png mask.png -o output.png
```

#### Real-ESRGAN



[Model download](https://huggingface.co/Acly/Real-ESRGAN-GGUF) | [Paper (arXiv)](https://arxiv.org/abs/2107.10833) | [Repository (GitHub)](https://github.com/xinntao/Real-ESRGAN) | License: BSD-3-Clause

```sh
vision-cli esrgan -m ESRGAN-4x-foolhardy_Remacri-F16.gguf -i input.png -o output.png
```

### Converting models

Models need to be converted to GGUF before they can be used. The conversion also
rearranges or precomputes tensors for more efficient inference.

To convert a model, install [uv](https://docs.astral.sh/uv/) and run:

```sh
uv run scripts/convert.py <arch> MyModel.pth
```

where `<arch>` is one of `sam, birefnet, esrgan, ...`.

This will create `models/MyModel.gguf`. See `convert.py --help` for more options.
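
Which tensors the conversion rearranges is not detailed here. One common kind of rearrangement, shown purely as an illustration, is changing a tensor's memory layout from channel-first (CHW) to channel-last (HWC). The sketch below is standalone C++ with the hypothetical name `chw_to_hwc`; the actual conversion is done by `scripts/convert.py`, and this example makes no claim about the transformations it applies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration (not taken from convert.py): copies a dense
// tensor from channel-first layout [C][H][W] to channel-last layout [H][W][C].
std::vector<float> chw_to_hwc(std::vector<float> const& src,
                              std::size_t channels,
                              std::size_t height,
                              std::size_t width) {
  assert(src.size() == channels * height * width);
  std::vector<float> dst(src.size());
  for (std::size_t c = 0; c < channels; ++c) {
    for (std::size_t y = 0; y < height; ++y) {
      for (std::size_t x = 0; x < width; ++x) {
        dst[(y * width + x) * channels + c] = src[(c * height + y) * width + x];
      }
    }
  }
  return dst;
}
```

Doing this kind of reordering once at conversion time means it does not have to be repeated every time the model is loaded.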

## Building

Building requires CMake and a compiler with C++20 support.

**Get the sources**

```sh
git clone https://github.com/Acly/vision.cpp.git --recursive
cd vision.cpp
```

**Configure and build**

```sh
cmake . -B build -D CMAKE_BUILD_TYPE=Release
cmake --build build --config Release
```

### Vulkan _(Optional)_

Building with Vulkan GPU support requires the [Vulkan SDK](https://www.lunarg.com/vulkan-sdk/) to be installed.

```sh
cmake . -B build -D CMAKE_BUILD_TYPE=Release -D VISP_VULKAN=ON
```

### Tests _(Optional)_

Build with `-DVISP_TESTS=ON`. Run all C++ tests with the following command:

```sh
cd build
ctest -C Release
```

Some tests require a Python environment. It can be set up with [uv](https://docs.astral.sh/uv/):

```sh
# Set up venv and install dependencies (once only)
uv sync

# Run Python tests
uv run pytest
```

## Performance

Performance optimization is an ongoing process. The aim is to be in the same ballpark
as other frameworks for inference speed, but with:

* much faster initialization and model loading time (<100 ms)

* lower memory overhead

* tiny deployment size (<5 MB for CPU, +30 MB for GPU)

### Inference speed

* CPU: AMD Ryzen 5 5600X (6 cores)

* GPU: NVIDIA GeForce RTX 4070

#### MobileSAM, 1024x1024

| Device | Precision | _vision.cpp_ | PyTorch | ONNX Runtime |
| :----- | :-------- | -----------: | ------: | -----------: |
| cpu    | f32       |       669 ms |  601 ms |       805 ms |
| gpu    | f16       |        19 ms |   16 ms |              |

#### BiRefNet, 1024x1024

| Model | Device | Precision | _vision.cpp_ |  PyTorch | ONNX Runtime |
| :---- | :----- | :-------- | -----------: | -------: | -----------: |
| Full  | cpu    | f32       |     16333 ms | 18800 ms |              |
| Full  | gpu    | f16       |       243 ms |   140 ms |              |
| Lite  | cpu    | f32       |      4505 ms | 10900 ms |      6978 ms |
| Lite  | gpu    | f16       |        86 ms |    59 ms |              |

#### MI-GAN, 512x512

| Model       | Device | Precision | _vision.cpp_ | PyTorch |
| :---------- | :----- | :-------- | -----------: | ------: |
| 512-places2 | cpu    | f32       |       523 ms |  637 ms |
| 512-places2 | gpu    | f16       |        21 ms |   17 ms |

#### Setup

* vision.cpp: using vision-bench, GPU via Vulkan, e.g. `vision-bench -m sam -b cpu`

* PyTorch: v2.7.1+cu128, eager eval, GPU via CUDA, average over n iterations after warm-up
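
The "average over n iterations after warm-up" method can be sketched in standalone C++ as follows. This is not the implementation of `vision-bench` or of the PyTorch measurements; `average_ms` and its parameters are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: runs `fn` `warmup` times without timing, then returns
// the mean wall-clock duration in milliseconds over `iterations` timed runs.
double average_ms(std::function<void()> const& fn, int warmup, int iterations) {
  assert(warmup >= 0 && iterations > 0);
  for (int i = 0; i < warmup; ++i) {
    fn();  // warm-up runs: fill caches, trigger lazy initialization
  }
  using clock = std::chrono::steady_clock;
  auto start = clock::now();
  for (int i = 0; i < iterations; ++i) {
    fn();
  }
  std::chrono::duration<double, std::milli> total = clock::now() - start;
  return total.count() / iterations;
}
```

When timing work that runs asynchronously on a GPU, `fn` would need to wait for the result before returning, otherwise only the submission time is measured.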

## Dependencies (integrated)

* [ggml](https://github.com/ggml-org/ggml) - ML tensor library | MIT

* [stb-image](https://github.com/nothings/stb) - Image load/save/resize | Public Domain

* [fmt](https://github.com/fmtlib/fmt) - String formatting _(only if compiler doesn't support `<format>`)_ | MIT