https://github.com/acly/vision.cpp

Computer Vision ML inference in C++

# _vision_.cpp

Computer Vision ML inference in C++

* Self-contained C++ library
* Efficient inference on consumer CPUs and GPUs (NVIDIA, AMD, Intel)
* Lightweight deployment on many platforms (Windows, Linux, macOS)
* Growing number of supported models behind a simple API
* Modular design for full control and implementing your own models

Based on [ggml](https://github.com/ggml-org/ggml), similar to the [llama.cpp](https://github.com/ggml-org/llama.cpp) project.

### Features

| Model | Task | Backends |
| :-------------------------- | :--------------- | :---------- |
| [**MobileSAM**](#mobilesam) | Segmentation | CPU, Vulkan |
| [**BiRefNet**](#birefnet) | Segmentation | CPU, Vulkan |
| [**MI-GAN**](#mi-gan) | Inpainting | CPU, Vulkan |
| [**ESRGAN**](#real-esrgan) | Super-resolution | CPU, Vulkan |
| [_Implement a model [**Guide**]_](docs/model-implementation-guide.md) | | |

## Get Started

Get the library and executables:
* Download a [release package](https://github.com/Acly/vision.cpp/releases) and extract it,
* or [build from source](#building).

### Example: Select an object in an image

Let's use MobileSAM to generate a segmentation mask of the plushy on the right
by passing in a box describing its approximate location.

Example image showing box prompt at pixel location (420, 120) - (650, 430), and the output mask

You can download the model and input image here: [MobileSAM-F16.gguf](https://huggingface.co/Acly/MobileSAM-GGUF/resolve/main/MobileSAM-F16.gguf) | [input.jpg](docs/media/input.jpg)

#### CLI

Find the `vision-cli` executable in the `bin` folder and run it to generate the mask:

```sh
vision-cli sam -m MobileSAM-F16.gguf -i input.jpg -p 420 120 650 430 -o mask.png
```
Pass `--composite output.png` to composite input and mask. Use `--help` for more options.
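As a rough illustration of what compositing an image with a mask can mean, here is a minimal sketch that uses the mask as an alpha channel. It assumes an 8-bit RGB image and an 8-bit single-channel mask of the same size. The function `composite_rgba` is hypothetical and not part of `vision-cli` or the library, and the actual output of `--composite` may be produced differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of vision.cpp): combines an interleaved RGB
// image with a single-channel mask into an interleaved RGBA image, storing
// the mask value of each pixel as its alpha.
std::vector<std::uint8_t> composite_rgba(const std::vector<std::uint8_t>& rgb,
                                         const std::vector<std::uint8_t>& mask) {
    // One mask value per pixel, three colour values per pixel.
    assert(rgb.size() == mask.size() * 3);

    std::vector<std::uint8_t> rgba(mask.size() * 4);
    for (std::size_t i = 0; i < mask.size(); ++i) {
        rgba[4 * i + 0] = rgb[3 * i + 0]; // red
        rgba[4 * i + 1] = rgb[3 * i + 1]; // green
        rgba[4 * i + 2] = rgb[3 * i + 2]; // blue
        rgba[4 * i + 3] = mask[i];        // alpha taken from the mask
    }
    return rgba;
}
```

With this convention, pixels where the mask is 0 become fully transparent and pixels where it is 255 stay fully opaque.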

#### API

```c++
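#include <visp/vision.h> // header path is an assumption; adjust to your installation
using namespace visp;

int main() {
    // Select the CPU backend and load the GGUF model.
    backend_device cpu = backend_init(backend_type::cpu);
    sam_model sam = sam_load_model("MobileSAM-F16.gguf", cpu);

    // Load the input image and run the image encoder on it.
    image_data input_image = image_load("input.jpg");
    sam_encode(sam, input_image);

    // Box prompt: top-left (420, 120), bottom-right (650, 430).
    image_data object_mask = sam_compute(sam, box_2d{{420, 120}, {650, 430}});
    image_save(object_mask, "mask.png");
}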
```
This shows the high-level API. Internally it is composed of multiple smaller
functions that handle model loading, pre-processing inputs, transferring data to
backend devices, post-processing output, etc. These can be used as building
blocks for flexible functions which integrate with your existing data sources
and infrastructure.

## Models

#### MobileSAM


[Model download](https://huggingface.co/Acly/MobileSAM-GGUF/tree/main) | [Paper (arXiv)](https://arxiv.org/pdf/2306.14289.pdf) | [Repository (GitHub)](https://github.com/ChaoningZhang/MobileSAM) | [Segment-Anything-Model](https://segment-anything.com/) | License: Apache-2.0

```sh
vision-cli sam -m MobileSAM-F16.gguf -i input.png -p 300 200 -o mask.png --composite comp.png
```

#### BiRefNet


[Model download](https://huggingface.co/Acly/BiRefNet-GGUF/tree/main) | [Paper (arXiv)](https://arxiv.org/pdf/2401.03407) | [Repository (GitHub)](https://github.com/ZhengPeng7/BiRefNet) | License: MIT

```sh
vision-cli birefnet -m BiRefNet-lite-F16.gguf -i input.png -o mask.png --composite comp.png
```

#### MI-GAN


[Model download](https://huggingface.co/Acly/MIGAN-GGUF/tree/main) | [Paper (thecvf.com)](https://openaccess.thecvf.com/content/ICCV2023/papers/Sargsyan_MI-GAN_A_Simple_Baseline_for_Image_Inpainting_on_Mobile_Devices_ICCV_2023_paper.pdf) | [Repository (GitHub)](https://github.com/Picsart-AI-Research/MI-GAN) | License: MIT

```sh
vision-cli migan -m MIGAN-512-places2-F16.gguf -i image.png mask.png -o output.png
```

#### Real-ESRGAN


[Model download](https://huggingface.co/Acly/Real-ESRGAN-GGUF) | [Paper (arXiv)](https://arxiv.org/abs/2107.10833) | [Repository (GitHub)](https://github.com/xinntao/Real-ESRGAN) | License: BSD-3-Clause

```sh
vision-cli esrgan -m ESRGAN-4x-foolhardy_Remacri-F16.gguf -i input.png -o output.png
```

### Converting models

Models need to be converted to GGUF before they can be used. The conversion also
rearranges or precomputes tensors for more efficient inference.

To convert a model, install [uv](https://docs.astral.sh/uv/) and run:
```sh
uv run scripts/convert.py <arch> MyModel.pth
```
where `<arch>` is one of `sam, birefnet, esrgan, ...`.

This will create `models/MyModel.gguf`. See `convert.py --help` for more options.
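To give an idea of what "rearranging tensors" during conversion can look like, the sketch below permutes a tensor from channels-first (NCHW) to channels-last (NHWC) memory layout. This is only an assumed example of such a rearrangement: the function `nchw_to_nhwc` is hypothetical, and which transformations `convert.py` actually applies depends on the model.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration (not taken from convert.py): copies a dense float
// tensor from NCHW (channels-first) to NHWC (channels-last) layout.
std::vector<float> nchw_to_nhwc(const std::vector<float>& src,
                                std::size_t n, std::size_t c,
                                std::size_t h, std::size_t w) {
    assert(src.size() == n * c * h * w);

    std::vector<float> dst(src.size());
    for (std::size_t in = 0; in < n; ++in) {
        for (std::size_t ic = 0; ic < c; ++ic) {
            for (std::size_t ih = 0; ih < h; ++ih) {
                for (std::size_t iw = 0; iw < w; ++iw) {
                    // Source index: the channel varies slower than the pixel position.
                    std::size_t src_idx = ((in * c + ic) * h + ih) * w + iw;
                    // Destination index: the channel is the fastest-varying dimension.
                    std::size_t dst_idx = ((in * h + ih) * w + iw) * c + ic;
                    dst[dst_idx] = src[src_idx];
                }
            }
        }
    }
    return dst;
}
```

Doing such a permutation once at conversion time means it does not have to be repeated every time the model is loaded.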

## Building

Building requires CMake and a compiler with C++20 support.

**Get the sources**
```sh
git clone https://github.com/Acly/vision.cpp.git --recursive
cd vision.cpp
```

**Configure and build**
```sh
cmake . -B build -D CMAKE_BUILD_TYPE=Release
cmake --build build --config Release
```

### Vulkan _(Optional)_

Building with Vulkan GPU support requires the [Vulkan SDK](https://www.lunarg.com/vulkan-sdk/) to be installed.

```sh
cmake . -B build -D CMAKE_BUILD_TYPE=Release -D VISP_VULKAN=ON
```

### Tests _(Optional)_

Build with `-DVISP_TESTS=ON`. Run all C++ tests with the following command:
```sh
cd build
ctest -C Release
```

Some tests require a Python environment. It can be set up with [uv](https://docs.astral.sh/uv/):
```sh
# Set up venv and install dependencies (once only)
uv sync

# Run python tests
uv run pytest
```

## Performance

Performance optimization is an ongoing process. The aim is to be in the same ballpark
as other frameworks for inference speed, but with:
* much faster initialization and model loading time (<100 ms)
* lower memory overhead
* tiny deployment size (<5 MB for CPU, +30 MB for GPU)

### Inference speed

* CPU: AMD Ryzen 5 5600X (6 cores)
* GPU: NVIDIA GeForce RTX 4070

#### MobileSAM, 1024x1024

| Device | Precision | _vision.cpp_ | PyTorch | ONNX Runtime |
| :--- | :--- | -----------: | ------: | -----------: |
| cpu | f32 | 669 ms | 601 ms | 805 ms |
| gpu | f16 | 19 ms | 16 ms | |

#### BiRefNet, 1024x1024

| Model | Device | Precision | _vision.cpp_ | PyTorch | ONNX Runtime |
| :---- | :--- | :--- | -----------: | -------: | -----------: |
| Full | cpu | f32 | 16333 ms | 18800 ms | |
| Full | gpu | f16 | 243 ms | 140 ms | |
| Lite | cpu | f32 | 4505 ms | 10900 ms | 6978 ms |
| Lite | gpu | f16 | 86 ms | 59 ms | |

#### MI-GAN, 512x512

| Model | Device | Precision | _vision.cpp_ | PyTorch |
| :---------- | :--- | :--- | -----------: | ------: |
| 512-places2 | cpu | f32 | 523 ms | 637 ms |
| 512-places2 | gpu | f16 | 21 ms | 17 ms |

#### Setup

* vision.cpp: using vision-bench, GPU via Vulkan, e.g. `vision-bench -m sam -b cpu`
* PyTorch: v2.7.1+cu128, eager eval, GPU via CUDA, averaged over n iterations after warm-up
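
The "average over n iterations after warm-up" methodology can be sketched as follows. This is not the code used by `vision-bench` or for the PyTorch measurements. The helper `mean_time_ms` is hypothetical and only shows the general pattern of discarding warm-up runs before timing.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: runs `fn` `warmup` times without timing it, then
// returns the mean wall-clock time in milliseconds over `iterations` runs.
double mean_time_ms(const std::function<void()>& fn, int warmup, int iterations) {
    assert(iterations > 0);

    // Warm-up runs absorb one-time costs such as allocations and cache misses.
    for (int i = 0; i < warmup; ++i) {
        fn();
    }

    using clock = std::chrono::steady_clock; // monotonic, suited for intervals
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        fn();
    }
    const auto end = clock::now();

    const std::chrono::duration<double, std::milli> total = end - start;
    return total.count() / iterations;
}
```

A call such as `mean_time_ms(run_inference, 3, 20)` would then report the mean per-run time, where `run_inference` stands for whatever callable wraps a single inference pass.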

## Dependencies (integrated)

* [ggml](https://github.com/ggml-org/ggml) - ML tensor library | MIT
* [stb-image](https://github.com/nothings/stb) - Image load/save/resize | Public Domain
* [fmt](https://github.com/fmtlib/fmt) - String formatting _(only if compiler doesn't support <format>)_ | MIT
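
How vision.cpp decides whether `<format>` is available is not described here. As a sketch of one common approach, standard library support can be detected with the `__cpp_lib_format` feature-test macro. To stay dependency-free, the fallback branch below uses `std::snprintf` rather than fmt, and `format_ms` is a hypothetical function.

```cpp
#include <cassert>
#include <string>

// <version> (C++20) defines library feature-test macros such as
// __cpp_lib_format; include it only if the compiler provides it.
#if defined(__has_include)
#  if __has_include(<version>)
#    include <version>
#  endif
#endif

#if defined(__cpp_lib_format)
#  include <format>
// Standard library formatting is available.
std::string format_ms(int ms) {
    return std::format("{} ms", ms);
}
#else
#  include <cstdio>
// Fallback when <format> is unavailable. A project could call fmt::format
// here instead; snprintf keeps this sketch free of external dependencies.
std::string format_ms(int ms) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%d ms", ms);
    return std::string(buf);
}
#endif
```

Both branches produce the same string, so calling code does not need to know which implementation was selected.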