An open API service indexing awesome lists of open source software.

https://github.com/stradichenko/ocr-to-anki

A tool to ease the production of Anki cards for language learning.
https://github.com/stradichenko/ocr-to-anki

anki anki-cards anki-flashcards ankiconnect llm-inference local-llm ocr

Last synced: about 1 month ago
JSON representation

A tool to ease the production of Anki cards for language learning.

Awesome Lists containing this project

README

          


OCR to Anki

![Build Status](https://img.shields.io/github/actions/workflow/status/stradichenko/ocr-to-anki/build.yml?branch=master&label=build)
![GitHub License](https://img.shields.io/github/license/stradichenko/ocr-to-anki)
![GitHub Release](https://img.shields.io/github/v/release/stradichenko/ocr-to-anki)


Consider supporting:



Patreon


GitHub Sponsors

[![Share on X](https://img.shields.io/badge/-Share%20on%20X-gray?style=flat&logo=x)](https://x.com/intent/tweet?text=OCR%20to%20Anki!%20Extract%20vocabulary%20from%20images%20and%20create%20flashcards%20offline%20with%20local%20AI.&url=https://github.com/stradichenko/ocr-to-anki&hashtags=Anki,OCR,LLM,llama)

## About

Cross-platform application for extracting vocabulary from images and creating
[Anki](https://apps.ankiweb.net/) flashcards. Everything runs locally using
[llama.cpp](https://github.com/ggerganov/llama.cpp) and the
[Gemma 3 4B](https://ai.google.dev/gemma/docs/gemma3) model. No cloud
dependencies, no API keys, fully offline.

Supports **Linux, macOS, Windows, and Android**.

The application is composed of two layers: a Flutter GUI that provides the user
interface and a Python FastAPI backend that handles vision OCR and vocabulary
enrichment through llama.cpp.

| Layer | Technology | Purpose |
|-------|-----------|---------|
| Flutter GUI | Dart, Material 3 | Interface (Linux, macOS, Windows, Android) |
| Python API | FastAPI, llama.cpp | Vision OCR and text enrichment backend |
| Vision OCR | llama-mtmd-cli | Extract text from images (GPU accelerated) |
| Text tasks | llama-server | Definitions, examples, vocabulary enrichment |
| Model | Gemma 3 4B QAT Q4_0 | Single model for both vision and text |


Download

## Installation

### Prerequisites

**Release binaries require no pre-installed dependencies.**
On first launch the app will automatically download:

1. **Python runtime** (~30 MB) — a portable copy, cached locally
2. **AI model** (~3.2 GB) — Gemma 3 4B, one-time download

The only system requirement is **GTK 3** on Linux.

> For building from source, have [Nix](https://zero-to-nix.com/start/install)
> installed with flakes enabled.

### Download a release (Android)

#### 1. Download the APK

Grab the latest APK (`ocr-to-anki-vX.Y.Z-android-arm64.apk`) from the
[releases page](https://github.com/stradichenko/ocr-to-anki/releases)
and transfer it to your phone:

- **Option A — Direct download on phone:** Open the releases page in your
mobile browser and tap the APK file.
- **Option B — Transfer from PC:** Download on your computer, then transfer
via USB, Bluetooth, or cloud storage (Google Drive, Nextcloud, etc.).

#### 2. Install the APK

- Open the file manager on your phone, navigate to the APK, and tap it.
- If prompted, allow **"Install from unknown sources"** for your file manager
or browser. This is a standard Android security prompt for apps outside the
Play Store.
- Tap **Install** and wait for the process to complete.

> adb alternative (for developers):
>
> ```bash
> adb install ocr-to-anki-v0.2.0-android-arm64.apk
> ```

#### 3. First launch setup

On first run the app performs a one-time setup:

1. **Extracts native binaries** — The bundled `llama-server` and
`llama-mtmd-cli` are copied to the app's private storage (~100 MB).
2. **Downloads the AI model** — The Gemma 3 4B model (~2.4 GB) and vision
projector (~812 MB) are downloaded directly to your device.

> **WiFi is required for the model download** unless you disabled
> "WiFi-only downloads" in Settings. The download supports resume, so if
> interrupted it will continue from where it left off.
>
> **Requirements:** Android 9+ (API 28), ARM64 device, ~4 GB free storage.

### Download a release (Linux)

Grab the latest tarball from the
[releases page](https://github.com/stradichenko/ocr-to-anki/releases),
extract, and run:

```bash
tar xzf ocr-to-anki-v0.2.0-linux-x86_64.tar.gz
cd ocr-to-anki-v0.2.0-linux-x86_64

# GTK3 is required at runtime.
# On Ubuntu/Debian: sudo apt install libgtk-3-0
# On Fedora: sudo dnf install gtk3
# On NixOS: already available

./run.sh
```

> **First launch — fully automatic setup**
>
> On first run the app detects what's missing and guides you through two
> one-click downloads:
>
> 1. *"Python runtime needed — Download Python"* (~30 MB, if Python is not
> already installed)
> 2. *"Model download required — Download now"* (~3.2 GB)
>
> Both downloads are cached locally and only happen once. After that the
> app starts instantly.
>
> You can also download the model manually with the bundled script:
>
> ```bash
> ./scripts/setup-llama-cpp.sh
> ```

### Build from source

```bash
git clone https://github.com/stradichenko/ocr-to-anki.git
cd ocr-to-anki

# 1. Download the model and vision projector (~3 GB total, one time)
nix develop
./scripts/setup-llama-cpp.sh

# 2. Build the Flutter app
nix develop .#flutter
cd app
flutter pub get
flutter build linux --release

# The binary is at: app/build/linux/x64/release/bundle/ocr_to_anki
```

For a distributable tarball that bundles the backend source:

```bash
nix develop .#flutter --command ./scripts/build-flutter.sh linux
# Output: output/release/ocr-to-anki-v0.1.0-linux-x86_64.tar.gz
```

Or as a pure Nix derivation:

```bash
nix build .#flutter-app
./result/bin/ocr-to-anki
```

### Build for Android

Requires the Android SDK and NDK:

```bash
# 1. Install Android NDK (via Android Studio or sdkmanager)
export ANDROID_NDK=$HOME/Android/Sdk/ndk/27.0.11718014

# 2. Build llama.cpp native binaries for Android
cd ocr-to-anki
./scripts/build-llama-android.sh

# 3. Build the Flutter APK
cd app
flutter pub get
flutter build apk --release

# The APK is at: app/build/app/outputs/flutter-apk/app-release.apk
```

The build script cross-compiles `llama-server` and `llama-mtmd-cli` for ARM64
and bundles them as Flutter assets. On first launch the app copies them to the
device's private storage and sets executable permissions.

See [docs/building.md](docs/building.md) for macOS, Windows, and advanced build
options.

### Model files

| File | Size | Source |
|------|------|--------|
| gemma-3-4b-it-q4_0_s.gguf | ~2.4 GB | [stduhpf/google-gemma-3-4b-it-qat-q4_0-gguf-small](https://huggingface.co/stduhpf/google-gemma-3-4b-it-qat-q4_0-gguf-small) |
| mmproj-model-f16-4B.gguf | ~812 MB | [stduhpf/google-gemma-3-4b-it-qat-q4_0-gguf-small](https://huggingface.co/stduhpf/google-gemma-3-4b-it-qat-q4_0-gguf-small) |

Both are downloaded by `./scripts/setup-llama-cpp.sh` via direct URL. No
authentication required.

Quantization-Aware Training (QAT) produces roughly 15% better perplexity than
standard post-training Q4_0 quantization at the same size. The stduhpf repack
also fixes broken control token metadata.

## Getting Started

### Workflow

1. Select context: handwritten or printed text, or highlighted words (pick
colour)
2. Add images:
- **Desktop:** drag and drop, or use the file picker
- **Android:** tap **Camera** to take a photo, or **Gallery** to pick
existing photos (multi-select supported)
3. Vision OCR: Gemma 3 extracts words from the image
4. Enrich: the LLM generates definitions and example sentences
5. Review: edit the generated cards before export
6. Export:
- **Desktop:** send to Anki via AnkiConnect, or save as TSV/JSON
- **Android:** share TSV directly to AnkiDroid via the share sheet

### Starting the backend

The Flutter app manages the backend process automatically. No manual server
management is needed.

- **Desktop:** spawns the Python FastAPI server on startup
- **Android:** extracts bundled `llama-server` and `llama-mtmd-cli` binaries,
downloads the model on first launch, then starts `llama-server` directly

If you prefer to run the backend separately on desktop:

```bash
nix develop
PYTHONPATH=src uvicorn src.api.app:app --host 0.0.0.0 --port 8000
```

### Configuration

Edit `config/settings.yaml` to customize the backend:

```yaml
ai_backend:
type: 'llama_cpp'

llama_cpp:
host: '127.0.0.1'
port: 8090
context_size: 4096
n_gpu_layers: -1
mmproj_offload: false # set true when using OpenCL backend
```

Most settings are also available through the in-app Settings screen.

## Building for other platforms

Flutter desktop does not support cross-compilation. Each platform must be built
on its native OS. The CI/CD workflow at `.github/workflows/build.yml` handles
this using platform-specific runners.

| Build host | Linux | macOS | Windows | Android |
|------------|-------|-------|---------|---------------------|
| Linux | yes | no | no | yes (cross-compile) |
| macOS | no | yes | no | no |
| Windows | no | no | yes | no |

### macOS

Requires a Mac with Xcode installed:

```bash
nix develop .#flutter
cd app && flutter pub get && flutter build macos --release
```

### Windows

Requires Visual Studio 2022 with the "Desktop development with C++" workload:

```powershell
cd app
flutter pub get
flutter build windows --release
```

### CI/CD

Push a version tag to trigger builds for all three platforms:

```bash
git tag v0.2.0
git push origin v0.2.0
```

This creates a draft GitHub Release with Linux, macOS, Windows, and Android artifacts.
See [docs/building.md](docs/building.md) for the full reference.

## Building llama-mtmd-cli (vision)

The vision backend requires `llama-mtmd-cli` built with GPU support:

```bash
# OpenCL (recommended for Intel integrated GPUs)
nix develop .#sycl
./scripts/build-llama-mtmd-opencl.sh

# Vulkan (fallback, see note below)
./scripts/build-llama-mtmd-vulkan.sh
```

Auto-detection picks the best available backend: CUDA, Metal, OpenCL, Vulkan,
then CPU.

### Intel iGPU: OpenCL vs Vulkan

| Backend | Vision encoder | Encode time | Text gen | Binary |
|---------|---------------|-------------|----------|--------|
| OpenCL | correct | ~2 min (GPU) | 4.1 tok/s | llama-mtmd-cli-opencl |
| Vulkan | corrupted | 0.4s (garbage) | 3.6 tok/s | llama-mtmd-cli |
| CPU | correct | ~43 min | 0.7 tok/s | any binary with --no-mmproj-offload |

OpenCL is roughly 20x faster than CPU vision and produces correct output. It
requires a one-line patch for Intel work group sizes, applied automatically by
the build script. See
[patches/opencl-intel-workgroup-fix.patch](patches/opencl-intel-workgroup-fix.patch).

Vulkan corruption details

On Intel integrated GPUs (for example UHD Graphics CML GT2), the Vulkan compute
backend produces corrupted output from the SigLIP vision encoder. Text
generation works fine on Vulkan; only the vision projector is affected.

Root cause: Intel Vulkan compute shaders produce f16 underflow and overflow in
the CLIP/SigLIP transformer. Debug embeddings show 75%+ of values saturate to
exactly -1.0 (clamped NaN/inf). This is a
[known class of bug on integrated GPUs](https://github.com/ggml-org/llama.cpp/issues/15034).

If you have a discrete NVIDIA GPU, Vulkan and CUDA both work fine. Set
`mmproj_offload: true` in `config/settings.yaml`.

## API Endpoints

```
GET /health Backend status
GET /backends Detected GPU hardware
POST /ocr/vision Vision OCR (base64 image)
POST /ocr/vision/upload Vision OCR (file upload)
POST /generate Raw text generation
POST /enrich Vocabulary enrichment (definitions + examples)
POST /pipeline/image-to-cards Full pipeline: image to OCR to enrich to Anki cards
```

## Android Notes

### Architecture

On Android the app does **not** use the Python FastAPI backend. Instead, it
bundles native `llama-server` and `llama-mtmd-cli` binaries compiled for ARM64.
The Flutter app spawns these directly and communicates with `llama-server` over
HTTP on `localhost:8090`. Vision OCR runs `llama-mtmd-cli` as a subprocess.

This avoids the need for a Python runtime on Android while keeping all
inference fully local and offline.

### Camera and Gallery

The Android home screen shows two prominent buttons:

- **Camera** — opens the system camera to take a photo for OCR
- **Gallery** — opens the photo picker (multi-select supported on Android 13+)

No runtime permissions are needed on Android 13+; the photo picker uses the
system UI. Camera access is handled automatically by the `image_picker` plugin.

### Model Download

The first time you launch the Android app, it downloads the Gemma 3 4B model
(~2.4 GB) and vision projector (~812 MB) directly to the app's private storage.
Downloads support resume, so if interrupted they will continue from where they
left off.

### AnkiDroid Export

On Android, the "Export to Anki" button in the review screen generates a TSV
file and opens the Android share sheet. Select **AnkiDroid** from the share
sheet to import the cards. AnkiDroid must be installed on the device.

## Project Structure

```
app/ Flutter GUI application
android/ Android platform files
app/src/main/
AndroidManifest.xml Permissions (camera, storage, internet)
assets/ Bundled native llama.cpp binaries (ARM64)
lib/
main.dart Entry point and routing
models/ Data models (AnkiNote, AppSettings, HighlightColor)
services/ Business logic
inference_service.dart LLM inference (FastAPI or native)
highlight_detector.dart HSV highlight colour detection
anki_export_service.dart AnkiConnect / AnkiDroid / JSON export
backend_server_service.dart Python backend process lifecycle
llama_cpp_android_service.dart Native binary management (Android)
model_download_service.dart Resume-capable model downloads (Android)
database/ Drift (SQLite) local storage
providers/ Riverpod state management
screens/ Home, Processing, Review, Settings, History
src/ Python backend
api/
app.py FastAPI endpoints and lifespan hooks
models.py Pydantic request/response models
backends/
auto_detect.py GPU and backend auto detection
mtmd_cli.py llama-mtmd-cli wrapper (vision, subprocess)
llama_cpp_server.py llama-server wrapper (text, persistent HTTP)
preprocessing/
highlight_cropper.py HSV highlight detection (Python reference)
workflows/ End to end pipelines
output/ Anki export and JSON output
config/
settings.yaml All configuration
scripts/ Build and setup scripts
build-flutter.sh Build Flutter for Linux/macOS/Windows
build-llama-android.sh Cross-compile llama.cpp for Android ARM64
bundle-backend.sh Bundle Python backend with PyInstaller
setup-llama-cpp.sh Download model and vision projector
build-llama-mtmd-*.sh Build llama-mtmd-cli with various GPU backends
docs/
building.md Full build and release documentation
```

## Nix Flake Outputs

### Development shells

```bash
nix develop # Default: Python backend development
nix develop .#flutter # Flutter app build and development
nix develop .#cuda # With CUDA toolkit
nix develop .#sycl # With Intel OneAPI/SYCL and OpenCL
```

### Packages

```bash
nix build .#flutter-app # Flutter Linux desktop binary
nix build .#backend # Nix-wrapped Python backend
nix build .#bundle # Complete distribution (GUI + backend + launcher)
nix build .#dockerImage # Docker image for server deployment
```

## License

[MIT](LICENSE)