https://github.com/rifkybujana/sam3.c
Efficient SAM3 (Segment Anything Model 3) inference from scratch in pure C — Metal GPU + multithreaded CPU, no Python dependencies
- Host: GitHub
- URL: https://github.com/rifkybujana/sam3.c
- Owner: rifkybujana
- License: mit
- Created: 2026-04-03T12:17:17.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-05-10T19:32:05.000Z (29 days ago)
- Last Synced: 2026-05-22T23:28:40.377Z (17 days ago)
- Topics: apple-silicon, c, computer-vision, from-scratch, ggml, image-segmentation, inference, machine-learning, metal, pure-c, sam3, segment-anything
- Language: C
- Size: 13 MB
- Stars: 9
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
# sam3.c — Efficient SAM3 Inference From Scratch in Pure C
A lightweight, dependency-free C11 implementation of [Segment Anything Model 3 (SAM3)](https://github.com/facebookresearch/sam3) built from scratch for efficient inference on Apple Silicon and x86 CPUs.
Inspired by [ggml](https://github.com/ggerganov/ggml) and [llama.cpp](https://github.com/ggerganov/llama.cpp), sam3.c implements the full SAM3 pipeline — image encoder, prompt encoder, mask decoder — in ~57K lines of portable C with zero Python dependencies.
*Figure: four-mask segmentation output from sam3.c, with each object highlighted in a distinct color.*
## Why sam3.c?
| | sam3.c | Official SAM3 (Python) |
|---|---|---|
| **Language** | C11, no dependencies | Python + PyTorch |
| **Binary size** | Single static binary | GB-scale environment |
| **GPU support** | Apple Metal (native) | CUDA |
| **Precision** | FP32, FP16, BF16 | FP32 |
| **Memory** | Arena allocator, mmap weights | PyTorch allocator |
| **Startup** | Instant (mmap) | Seconds (model load) |
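The "Instant (mmap)" startup row relies on the operating system paging weight data in on demand instead of reading the whole file up front. A minimal POSIX sketch of that idea follows. The `mapped_file` struct and the `map_file_readonly` / `unmap_file` helpers are hypothetical names for illustration, not part of the sam3.c API, and the project's actual loader may differ.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper type, not part of the sam3.c API. */
struct mapped_file {
    const void *data; /* read-only view of the file contents */
    size_t size;      /* file size in bytes */
};

/* Map a whole file read-only. Returns 0 on success, -1 on failure. */
static int map_file_readonly(const char *path, struct mapped_file *out)
{
    assert(path != NULL && out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }

    /* No bytes are read here; pages are faulted in on first access. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;

    out->data = p;
    out->size = (size_t)st.st_size;
    return 0;
}

/* Release the mapping and clear the handle. */
static void unmap_file(struct mapped_file *mf)
{
    if (mf->data != NULL)
        munmap((void *)mf->data, mf->size);
    mf->data = NULL;
    mf->size = 0;
}
```

Because the mapping call itself does no I/O, the cost of opening a multi-gigabyte checkpoint this way is roughly independent of file size; the read cost is paid later, only for the tensors that are actually touched.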
## Features
- **Built from scratch in pure C** — no PyTorch, no ONNX, no wrappers. Every tensor op, every layer, written by hand.
- **Metal GPU backend** — hardware-accelerated inference on Apple Silicon (M1/M2/M3/M4).
- **Multithreaded CPU backend** — optimized SIMD kernels with thread pool for x86 and ARM.
- **FP16 and BF16 support** — run inference in half precision for lower memory and faster compute.
- **Custom `.sam3` weight format** — mmap-friendly binary format with O(1) tensor lookup via hash table.
- **Full SAM3 pipeline** — image encoder (Hiera, EfficientViT, TinyViT), prompt encoder (points, boxes, masks), mask decoder, text encoder, and tokenizer.
- **Video object tracking** — memory-based frame-by-frame propagation with point, box, and mask prompts. Supports MPEG video files and frame directories.
- **Multiple backbones** — Hiera (full accuracy), EfficientViT-B1 (lightweight, 512px), and TinyViT-21M (128x128 masks at 1008px input).
- **Unified CLI** — single `sam3_cli` binary with `segment`, `track`, `convert`, and `info` subcommands. Supports stdin/stdout piping, JSON output, and multi-mask color overlays.
- **48 unit tests** — comprehensive test suite covering numerical operators, memory management, and end-to-end inference.
- **Built-in profiling** — latency tracing subsystem to identify bottlenecks.
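To make the BF16 bullet above concrete: a bfloat16 value is the upper 16 bits of an IEEE-754 float32, so conversion is a shift plus rounding. The sketch below uses round-to-nearest-even. The function names are hypothetical, and this is not necessarily how sam3.c's kernels perform the conversion.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* float32 -> bfloat16 with round-to-nearest-even. */
static uint16_t f32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    /* NaN: truncate, but force a mantissa bit so it stays a NaN. */
    if ((bits & 0x7fffffffu) > 0x7f800000u)
        return (uint16_t)((bits >> 16) | 0x0040u);

    /* Add 0x7fff plus the lowest kept bit, so exact ties round to even. */
    uint32_t rounding = 0x7fffu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding) >> 16);
}

/* bfloat16 -> float32 is exact: the low 16 mantissa bits become zero. */
static float bf16_to_f32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

BF16 keeps the full 8-bit float32 exponent and only 7 mantissa bits, which is why it halves memory without changing the representable range.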
## Supported Models
| Backbone | Input Size | Mask Resolution | Parameters | Encode (ms) | Segment (ms) | Use Case |
|---|---|---|---|---:|---:|---|
| **Hiera** | 1008x1008 | 288x288 | 1.6B | 2336 | 1224 | Full accuracy |
| **TinyViT-21M** | 1008x1008 | 128x128 | 0.8B | 487 | 363 | Balanced quality/speed |
| **EfficientViT-B1** | 512x512 | 64x64 | 0.8B | 70 | 177 | Fastest, interactive |
Timings on Apple M4 (10-core GPU, Metal backend, Release build). Encode = `sam3_set_image`, Segment = `sam3_segment` with a box prompt. See [BENCHMARK.md](BENCHMARK.md) for full results.
All backbones share the same prompt encoder, mask decoder, and text encoder. The backbone is selected automatically based on the checkpoint.
## Quick Start
### Build
```bash
git clone https://github.com/rifkybujana/sam3.c.git
cd sam3.c
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --parallel
```
> **First build note:** the build fetches and statically compiles
> FFmpeg, openh264, and libvpx into `build/external/`. Expect ~10-15
> minutes on first configure; subsequent incremental builds are fast.
> The resulting binary has no runtime dependency on system ffmpeg.
### Convert Weights
Download a SAM3 checkpoint in SafeTensors format, then convert to the optimized `.sam3` format:
```bash
# Hiera (default backbone)
./sam3_cli convert -i models/sam3.safetensors -o models/sam3.sam3
# TinyViT or EfficientViT (specify backbone)
./sam3_cli convert -i models/tinyvit.safetensors -o models/tinyvit.sam3 --backbone tinyvit
./sam3_cli convert -i models/evit.safetensors -o models/evit.sam3 --backbone efficientvit
```
### SAM 3.1
SAM 3.1 ships as a PyTorch `.pt` checkpoint only
(`sam3.1_multiplex.pt`, ~3.3 GB). Convert it in two steps, then run it:
```bash
# 1. Normalize into .safetensors (unwraps {"model": ...}, remaps
#    sam3_model.* -> detector.* and sam2_predictor.* -> tracker.*)
python tools/pt_to_safetensors.py \
    models/sam3.1_multiplex.pt \
    models/sam3.1_multiplex.safetensors
# 2. Convert to .sam3 with the SAM 3.1 variant flag
./sam3_cli convert \
    -i models/sam3.1_multiplex.safetensors \
    -o models/sam3.1.sam3 \
    --variant sam3.1
# 3. Use it
./sam3_cli segment -m models/sam3.1.sam3 -i img.jpg -t "cat"
```
Only the image-detector path is wired in this release. The SAM 3.1
multiplex tracker and joint multi-object video pass are planned for
follow-up work — SAM 3 continues to handle all video tracking today.
### Run Inference
```bash
# Point prompt (foreground point at x=500, y=375)
./sam3_cli segment -m models/sam3.sam3 -i photo.jpg -p 500,375,1 --overlay
# Text prompt
./sam3_cli segment -m models/sam3.sam3 -i photo.jpg -t "person" --overlay
# Box prompt
./sam3_cli segment -m models/sam3.sam3 -i photo.jpg -b 100,100,400,400 --all
```
### Video tracking
Track an object across frames of a video:
```bash
./sam3_cli track --model models/sam3.sam3 --video clip.mp4 \
    --point 504,504,1 --frame 0 --output out/
```
Output: one binary mask per frame, written as `out/frame_NNNNN.png`.
### Inspect a Model
```bash
./sam3_cli info models/sam3.sam3
```
### Run Tests
```bash
ctest --output-on-failure
```
## Language bindings
sam3.c ships bindings for multiple languages under `bindings/`:
- **Python** — `bindings/python/`, CFFI-based. Requires Python ≥ 3.9.
- **Rust** — `bindings/rust/`. Cargo workspace with `sam3-sys` (FFI) and
`sam3` (safe API: owned `Ctx`, typed prompt enum, RAII result cleanup,
`SegmentResult::nms` matching the CLI's post-processing). See
`bindings/rust/README.md`.
> **Python cache-API gap:** the [caching API](#caching) is exposed in the
> Rust binding but not yet in Python; see
> `docs/superpowers/plans/2026-04-22-cache-api-bindings.md`.
### Python
```bash
pip install -e bindings/python
```
`setup.py` runs CMake under the hood: it configures the project with
`-DSAM3_SHARED=ON`, builds `libsam3.{dylib,so}` into `build-python/`,
and copies the shared library into the installed package. **No manual
`cmake` step, no `DYLD_LIBRARY_PATH`/`LD_LIBRARY_PATH` setup** — the
package ships its own libsam3 next to `sam3/_lib.py` and loads it by
relative path. First install takes ~10-15 minutes because CMake also
compiles FFmpeg from source; reinstalls are fast.
Requirements:
- Python ≥ 3.9
- CMake ≥ 3.20 and a C11 toolchain on `$PATH`
- `cffi>=1.15`, `numpy>=1.21` (installed automatically)
- On macOS, Xcode command-line tools; on Linux, a recent GCC/Clang
Minimal usage:
```python
import sam3
with sam3.Model("models/sam3.sam3") as model:
    model.set_image("photo.jpg")
    result = model.segment(text="person")
    print(result.masks.shape, result.iou_scores[:3])
```
Run the test suite:
```bash
pip install -e 'bindings/python[dev]'
pytest bindings/python/tests -v
```
Troubleshooting — if `import sam3` raises
`OSError: libsam3 not found at .../sam3/libsam3.dylib`, the package
was installed without the bundled shared library (usually a stale
install from before `setup.py` was updated). Force a rebuild with:
```bash
pip install --force-reinstall --no-deps -e bindings/python
```
### Rust
Build the shared `libsam3` library once, then build and test the crate against it:
```bash
# 1. Build libsam3.{dylib,so}
cmake -S . -B build -DSAM3_SHARED=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel
# 2. Build + test the Rust workspace (point the loader at build/)
cd bindings/rust
DYLD_LIBRARY_PATH=../../build cargo test --release # macOS
LD_LIBRARY_PATH=../../build cargo test --release # Linux
```
Install `libsam3` system-wide (`cmake --install build`) to skip the env
var step; the loader will then find it on the default search path. See
`bindings/rust/README.md` for the three supported runtime resolution
workflows (env var, system install, rpath-shipped).
Minimal Rust usage:
```rust
use sam3::{Ctx, Prompt};
let mut ctx = Ctx::new()?;
ctx.load_model("models/efficient.sam3")?; // auto-loads co-located BPE
ctx.set_image_file("photo.jpg")?;
ctx.set_text("person")?;
let raw = ctx.segment(&[Prompt::Text("person")])?;
let hits = raw.nms(0.5, 0.5, 0.0)?; // 200 candidates → ~N detections
println!("found {} objects, top score {:.3}",
         hits.n_masks(), hits.iou_scores()[0]);
```
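The `nms` call above collapses many overlapping candidates into a few detections. The README does not document what its three arguments mean, so the C sketch below is only an illustration of greedy mask NMS under assumed parameters: a minimum score and a maximum IoU between kept masks. `mask_iou` and `mask_nms` are hypothetical names, not sam3.c functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* IoU of two binary masks with n pixels each (non-zero = foreground). */
static float mask_iou(const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t inter = 0, uni = 0;
    for (size_t i = 0; i < n; i++) {
        int fa = a[i] != 0, fb = b[i] != 0;
        inter += (size_t)(fa && fb);
        uni += (size_t)(fa || fb);
    }
    return uni ? (float)inter / (float)uni : 0.0f;
}

/*
 * Greedy NMS over n_masks masks stored back to back (n_pixels each).
 * On return keep[i] is 1 for survivors and 0 otherwise.
 * Returns the number of masks kept.
 */
static size_t mask_nms(const uint8_t *masks, const float *scores,
                       size_t n_masks, size_t n_pixels,
                       float score_thresh, float iou_thresh, uint8_t *keep)
{
    enum { UNDECIDED = 0, KEPT = 1, DROPPED = 2 };
    size_t kept = 0;

    assert(masks != NULL && scores != NULL && keep != NULL);
    for (size_t i = 0; i < n_masks; i++)
        keep[i] = scores[i] >= score_thresh ? UNDECIDED : DROPPED;

    for (;;) {
        /* Pick the best-scoring candidate that is still undecided. */
        size_t best = n_masks;
        for (size_t i = 0; i < n_masks; i++)
            if (keep[i] == UNDECIDED &&
                (best == n_masks || scores[i] > scores[best]))
                best = i;
        if (best == n_masks)
            break;
        keep[best] = KEPT;
        kept++;

        /* Drop every remaining candidate that overlaps it too much. */
        for (size_t i = 0; i < n_masks; i++)
            if (keep[i] == UNDECIDED &&
                mask_iou(masks + best * n_pixels, masks + i * n_pixels,
                         n_pixels) > iou_thresh)
                keep[i] = DROPPED;
    }

    for (size_t i = 0; i < n_masks; i++)
        if (keep[i] == DROPPED)
            keep[i] = 0;
    return kept;
}
```

The selection loop is quadratic in the number of candidates, which is acceptable for a few hundred masks; a production version would more likely sort by score once.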
## Architecture
```
sam3.c
├── include/sam3/ Public API headers
├── src/
│ ├── core/ Tensor ops, arena allocator, compute graph, weight loader
│ ├── backend/
│ │ ├── cpu/ Multithreaded CPU kernels (SIMD-optimized)
│ │ └── metal/ Apple Metal GPU backend
│ ├── model/ SAM3 layers
│ │ ├── image_encoder Vision transformer (Hiera, EfficientViT, TinyViT)
│ │ ├── prompt_encoder Point, box, and mask prompts
│ │ ├── mask_decoder Lightweight mask prediction head
│ │ ├── text_encoder Text prompt encoding
│ │ └── tokenizer BPE tokenizer
│ └── util/ Logging, error codes
├── tools/ Unified CLI (sam3_cli: segment, track, convert, info)
└── tests/ 48 test files
```
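The `core/` directory lists an arena allocator. As a rough illustration of the technique (not the project's implementation), an arena hands out aligned slices of one pre-allocated buffer and frees them all at once by resetting an offset. The `arena` struct and function names below are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bump arena over a caller-provided buffer. */
struct arena {
    uint8_t *base; /* start of the backing buffer */
    size_t cap;    /* total bytes available */
    size_t used;   /* bytes handed out so far, including padding */
};

static void arena_init(struct arena *a, void *buf, size_t cap)
{
    a->base = buf;
    a->cap = cap;
    a->used = 0;
}

/* Allocate size bytes aligned to align (a power of two); NULL when full. */
static void *arena_alloc(struct arena *a, size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    uintptr_t cur = (uintptr_t)a->base + a->used;
    uintptr_t aligned = (cur + (align - 1)) & ~(uintptr_t)(align - 1);
    size_t pad = (size_t)(aligned - cur);
    size_t left = a->cap - a->used;

    if (pad > left || size > left - pad)
        return NULL;
    a->used += pad + size;
    return (void *)aligned;
}

/* Free everything at once, e.g. between inference calls. */
static void arena_reset(struct arena *a)
{
    a->used = 0;
}
```

The appeal for inference is that per-tensor allocation becomes a pointer bump and there is no per-tensor free, so scratch memory for one forward pass can be reclaimed with a single reset.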
## Performance
On an Apple M4 with the Metal backend, EfficientViT delivers end-to-end
image-to-mask in ~250 ms (4 FPS), making interactive point-and-click
segmentation practical. Once the image is encoded, each additional prompt on
the same image resolves in under 200 ms.
Hiera-Large trades speed for accuracy at 3.6 s per image with 5184 patches and
32 transformer blocks. Multi-prompt workflows amortize the 2.3 s encode cost.
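The amortization claim can be checked with the timings from the Supported Models table. The helper below is a hypothetical illustration, and it ignores the feature cache, which would reduce repeated encodes further.

```c
#include <assert.h>

/* Average per-prompt latency when n_prompts share one encoded image. */
static double amortized_ms(double encode_ms, double segment_ms,
                           unsigned n_prompts)
{
    assert(n_prompts > 0);
    return encode_ms / (double)n_prompts + segment_ms;
}
```

With the Hiera figures (2336 ms encode, 1224 ms segment), one prompt costs 3560 ms, while eight prompts on the same image average 2336 / 8 + 1224 = 1516 ms each.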
The Metal backend achieves 90.8% of theoretical F32 peak (3086 / 3400 GFLOPS)
on matmul microbenchmarks and up to 149x speedup over CPU for F16 workloads.
Full kernel-level and pipeline benchmark results are in
[BENCHMARK.md](BENCHMARK.md).
## Caching
sam3.c caches encoded image and text features so repeated prompts on the
same inputs resolve in microseconds instead of re-running the encoders.
The caches are enabled by default with tunable slot counts, an LRU RAM
budget, and optional disk spill.
**Defaults** — override via `sam3_cache_opts` + `sam3_init_ex()`:
| Tunable | Default | Purpose |
|---|---|---|
| `n_image_slots` | 8 | Max cached image entries |
| `n_text_slots` | 16 | Max cached text entries |
| `image_mem_budget_bytes` | 1 GiB (~4 hot slots at 256 MiB) | RAM ceiling; excess slots spill to disk |
| `image_spill_dir` | auto-created `/tmp` dir | Where spilled bundles live |
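For readers unfamiliar with slot-based LRU caching, the sketch below shows the eviction policy in isolation: a fixed number of slots, a logical clock, and replacement of the least recently used entry on a miss. It is a simplified model with hypothetical names (`lru_cache`, `lru_get`). It does not implement the RAM budget or disk spill, and it is not sam3.c's cache code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_SLOTS 8 /* mirrors the documented n_image_slots default */

struct lru_slot {
    uint64_t key;       /* e.g. a hash of the input; 0 means empty */
    uint64_t last_used; /* logical clock value of the latest access */
};

/* Zero-initialize before first use. */
struct lru_cache {
    struct lru_slot slots[N_SLOTS];
    uint64_t clock;
    uint64_t hits, misses, evictions;
};

/*
 * Look up key (must be non-zero). Returns the slot index that now holds
 * it and sets *hit to 1 on a cache hit, 0 when the slot was (re)filled.
 */
static size_t lru_get(struct lru_cache *c, uint64_t key, int *hit)
{
    size_t victim = 0;

    assert(c != NULL && hit != NULL && key != 0);
    c->clock++;
    for (size_t i = 0; i < N_SLOTS; i++) {
        if (c->slots[i].key == key) {
            c->slots[i].last_used = c->clock;
            c->hits++;
            *hit = 1;
            return i;
        }
        /* Empty slots have last_used == 0, so they are chosen first. */
        if (c->slots[i].last_used < c->slots[victim].last_used)
            victim = i;
    }

    c->misses++;
    if (c->slots[victim].key != 0)
        c->evictions++;
    c->slots[victim].key = key;
    c->slots[victim].last_used = c->clock;
    *hit = 0;
    return victim;
}
```

On a miss the caller would run the encoder and store its output in the returned slot; on a hit it reuses what is already there. The three counters correspond loosely to the hit, miss, and eviction statistics the cache API reports.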
### Pre-warm
Populate the cache while the app is idle so the user's first prompt only
pays segment latency:
```c
sam3_precache_image_file(ctx, "photo.jpg"); /* runs image encoder now */
sam3_precache_text(ctx, "person"); /* runs text encoder now */
/* Later: these hit the cache and return in microseconds */
sam3_set_image_file(ctx, "photo.jpg");
sam3_set_text(ctx, "person");
```
### Persist across runs
Serialize an encoded entry to a `.sam3cache` file and reload it on the
next startup. Files are model-signature-gated — loading into a different
model is rejected with `SAM3_EMODEL`:
```c
sam3_cache_save_image(ctx, pixels, w, h, "photo.sam3cache");
/* Next run, after sam3_load_model(): */
sam3_cache_load_image(ctx, "photo.sam3cache");
sam3_set_image(ctx, pixels, w, h); /* cache hit */
```
### Inspect and flush
```c
struct sam3_cache_stats s;
sam3_cache_stats(ctx, &s);
/* s.image_hits, s.image_misses, s.image_evictions, and text_* */
sam3_cache_clear(ctx, SAM3_CACHE_IMAGE | SAM3_CACHE_TEXT);
```
### Video frame cache
Video tracking has its own two-tier cache for per-frame backbone
features, tuned via `sam3_video_start_opts`:
- `frame_cache_backend_budget` — resident RAM (default 4 GiB)
- `frame_cache_spill_budget` — disk spill (default 16 GiB; `SIZE_MAX` disables spill)
See `include/sam3/sam3.h` for the full cache API.
## Weight Format
Model weights use the `.sam3` binary format — a compact, mmap-friendly layout designed for instant loading:
- 48-byte header + 176-byte tensor descriptors + page-aligned data blob
- FNV-1a hash table for O(1) tensor lookup by name
- Supports FP32, FP16, BF16, I32, I8, and Q8_0 (block-quantized int8)
- Converted from SafeTensors via `sam3_cli convert`
See [docs/weight-format.md](docs/weight-format.md) for the full specification.
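As an illustration of the O(1) lookup mentioned above, here is a small name-to-index table keyed by FNV-1a. Several details are assumptions made for the sketch rather than facts about the `.sam3` format: the 64-bit hash width, linear probing, the table capacity, and all struct and function names. The linked specification is the authority on the real layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_CAP 64 /* hypothetical capacity */
static_assert((TABLE_CAP & (TABLE_CAP - 1)) == 0, "capacity must be 2^k");

/* 64-bit FNV-1a: XOR each byte in, then multiply by the FNV prime. */
static uint64_t fnv1a64(const char *s)
{
    uint64_t h = 14695981039346656037ULL; /* FNV offset basis */
    for (; *s != '\0'; s++) {
        h ^= (uint8_t)*s;
        h *= 1099511628211ULL;            /* FNV prime */
    }
    return h;
}

struct entry {
    const char *name;    /* NULL marks an empty slot */
    size_t tensor_index; /* position in the descriptor array */
};

/* Zero-initialize before first use. */
struct table {
    struct entry slots[TABLE_CAP];
};

/* Insert or overwrite. Returns 0 on success, -1 if the table is full. */
static int table_put(struct table *t, const char *name, size_t index)
{
    size_t i = (size_t)(fnv1a64(name) & (TABLE_CAP - 1));
    for (size_t probes = 0; probes < TABLE_CAP; probes++) {
        struct entry *e = &t->slots[i];
        if (e->name == NULL || strcmp(e->name, name) == 0) {
            e->name = name;
            e->tensor_index = index;
            return 0;
        }
        i = (i + 1) & (TABLE_CAP - 1); /* linear probing */
    }
    return -1;
}

/* Look up by name. Returns 0 and sets *index on success, -1 if absent. */
static int table_get(const struct table *t, const char *name, size_t *index)
{
    size_t i = (size_t)(fnv1a64(name) & (TABLE_CAP - 1));
    for (size_t probes = 0; probes < TABLE_CAP; probes++) {
        const struct entry *e = &t->slots[i];
        if (e->name == NULL)
            return -1;
        if (strcmp(e->name, name) == 0) {
            *index = e->tensor_index;
            return 0;
        }
        i = (i + 1) & (TABLE_CAP - 1);
    }
    return -1;
}
```

Lookup cost depends on the probe length rather than on the number of tensors, so with a sparsely filled table it stays close to a single hash plus one string comparison.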
## License
MIT — see [LICENSE](LICENSE).