https://github.com/kisaesdevlab/vibe-glm-ocr

Host: GitHub
URL: https://github.com/kisaesdevlab/vibe-glm-ocr
Owner: KisaesDevLab
License: other
Created: 2026-04-13T15:07:20.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-04-13T15:31:13.000Z (3 months ago)
Last Synced: 2026-04-13T17:27:40.855Z (3 months ago)
Language: Dockerfile
Size: 16.6 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Kisaes OCR Server

Self-contained Docker image running [llama.cpp](https://github.com/ggml-org/llama.cpp) server with [GLM-OCR](https://huggingface.co/ggml-org/GLM-OCR-GGUF) (0.9B parameter multimodal OCR model). Provides an OpenAI-compatible `/v1/chat/completions` endpoint that accepts base64-encoded images and returns recognized text or structured Markdown tables.

**No HuggingFace downloads at runtime. No Ollama dependency. No model management. Pull the image, run it, send images.**

## Quick Start

```bash
# Pull and run
docker pull ghcr.io/kisaesdevlab/vibe-glm-ocr:latest
docker run -p 8090:8090 ghcr.io/kisaesdevlab/vibe-glm-ocr:latest

# Check health
curl http://localhost:8090/health

# OCR a document
BASE64=$(base64 -w0 document.png)
curl -s http://localhost:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"GLM-OCR\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/png;base64,$BASE64\"}},
        {\"type\": \"text\", \"text\": \"Text Recognition:\"}
      ]
    }],
    \"temperature\": 0.02
  }"
```
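
The same request can be sent from a script. The following is a minimal sketch using only the Python standard library; the function names `build_ocr_payload` and `ocr`, the default `base_url`, and the 180-second timeout are illustrative assumptions, not part of this project.

```python
import base64
import json
import urllib.request


def build_ocr_payload(image_bytes, prompt="Text Recognition:", mime="image/png"):
    """Build the chat-completions body shown in the curl example."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "GLM-OCR",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "temperature": 0.02,
    }


def ocr(image_path, base_url="http://localhost:8090",
        prompt="Text Recognition:", timeout=180):
    """POST one image to the server and return the recognized text.

    The timeout is an assumption: it is set generously because CPU
    inference can take tens of seconds per page.
    """
    with open(image_path, "rb") as f:
        payload = build_ocr_payload(f.read(), prompt)
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the container running, `ocr("document.png")` should return the recognized text as a string; pass `prompt="Table Recognition:"` for tables.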

## Build from Source

```bash
# Clone and build
git clone https://github.com/KisaesDevLab/Vibe-GLM-OCR.git
cd Vibe-GLM-OCR
docker compose -f docker-compose.dev.yml build

# Run locally
docker compose -f docker-compose.dev.yml up
```

## OCR Prompts

GLM-OCR supports two primary prompt modes:

| Prompt | Use Case |
|--------|----------|
| `Text Recognition:` | General text extraction: receipts, forms, letters, any unstructured document |
| `Table Recognition:` | Structured table extraction: returns Markdown or HTML tables |

## API

### Request

```json
{
  "model": "GLM-OCR",
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/png;base64,{base64data}"
        }
      },
      {
        "type": "text",
        "text": "Table Recognition:"
      }
    ]
  }],
  "temperature": 0.02
}
```

### Response

Standard OpenAI chat completion format. OCR text is in `choices[0].message.content`:

```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "| Date | Description | Amount | Balance |\n|---|---|---|---|\n| 01/15 | Direct Deposit | 3,500.00 | 4,200.50 |"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 1842,
    "completion_tokens": 156,
    "total_tokens": 1998
  }
}
```
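
When the content is a Markdown table, callers often want rows rather than a string. The sketch below shows one way to pull the text out of the response and split a simple pipe table into dictionaries. The helper names are hypothetical, and the parser is deliberately naive: it assumes no escaped `|` inside cells and does not handle HTML table output.

```python
def extract_text(response):
    """Return the OCR text from a parsed chat-completion response."""
    return response["choices"][0]["message"]["content"]


def parse_markdown_table(markdown):
    """Convert a simple Markdown pipe table into a list of row dicts.

    The first table row is treated as the header. Rows made up only of
    dashes and colons (the header separator) are skipped.
    """
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore any text outside the table
        cells = [c.strip() for c in line.strip("|").split("|")]
        if cells and all(c and set(c) <= set("-:") for c in cells):
            continue  # separator row such as |---|---|
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Sample response taken from the example above.
sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": "| Date | Description | Amount | Balance |\n"
                       "|---|---|---|---|\n"
                       "| 01/15 | Direct Deposit | 3,500.00 | 4,200.50 |",
        },
        "finish_reason": "stop",
    }],
}

records = parse_markdown_table(extract_text(sample))
```

For the sample response, `records` holds a single dictionary keyed by `Date`, `Description`, `Amount`, and `Balance`. Amounts stay as strings, which avoids silently reinterpreting digits the model produced.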

### Endpoints

| Path | Method | Description |
|------|--------|-------------|
| `/health` | GET | Returns `{"status":"ok"}` when model is loaded |
| `/v1/chat/completions` | POST | OpenAI-compatible chat endpoint (OCR requests) |
| `/metrics` | GET | Prometheus metrics (request count, latency, tokens) |
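
Because `/health` only reports `ok` once the model is loaded, a client that starts alongside the container may want to poll before sending work. This is a minimal sketch of such a readiness check. It assumes that any connection error, non-2xx status, or unparsable body means "not ready yet"; the function names, the 60-second deadline, and the one-second interval are illustrative choices.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_health(base_url, timeout=5):
    """Return the parsed /health body, or None if the server is not ready."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, HTTP error status, timeout, or invalid JSON.
        return None


def wait_until_ready(base_url="http://localhost:8090", deadline_s=60,
                     interval_s=1.0, fetch=fetch_health,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll /health until it reports {"status": "ok"} or the deadline passes.

    fetch, sleep, and clock are injectable so the loop can be tested
    without a running server.
    """
    end = clock() + deadline_s
    while True:
        body = fetch(base_url)
        if isinstance(body, dict) and body.get("status") == "ok":
            return True
        if clock() >= end:
            return False
        sleep(interval_s)
```

A caller might run `wait_until_ready()` once at startup and exit with an error if it returns `False`.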

## Configuration

All configuration is via environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `OCR_PORT` | `8090` | Server listen port |
| `OCR_THREADS` | `4` | CPU threads for inference |
| `OCR_CTX_SIZE` | `32768` | Context window (must be >= 16384 for GLM-OCR images) |
| `OCR_PARALLEL` | `2` | Concurrent request slots |
| `OCR_TEMPERATURE` | `0.02` | Sampling temperature (keep low for OCR) |
| `OCR_API_KEY` | *(empty)* | Bearer token for endpoint protection (optional) |

### Example with custom config

```bash
docker run -p 9090:9090 \
  -e OCR_PORT=9090 \
  -e OCR_THREADS=8 \
  -e OCR_PARALLEL=4 \
  -e OCR_API_KEY=my-secret-key \
  ghcr.io/kisaesdevlab/vibe-glm-ocr:latest
```
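
When `OCR_API_KEY` is set, clients need to present the key as a Bearer token. The sketch below assumes the conventional `Authorization: Bearer <key>` header; the function name `build_request` and the default `base_url` (matching the port in the example above) are hypothetical.

```python
import json
import urllib.request


def build_request(payload, base_url="http://localhost:9090", api_key=None):
    """Build a POST request for /v1/chat/completions.

    If api_key is given, it is sent as a Bearer token. This assumes the
    server checks the standard Authorization header.
    """
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

In practice the key would come from the caller's environment (for example `os.environ.get("OCR_API_KEY")`) rather than being hard-coded, and the request is sent with `urllib.request.urlopen(req)`.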

## Architecture

```
                POST /v1/chat/completions
                (base64 image + prompt)
                        |
                +-------v--------+
                |  ocr-server    |
                |  :8090         |
                |                |
                |  llama-server  |
                |  GLM-OCR F16   |
                |  ~1.8 GB model |
                |  ~2-3 GB RAM   |
                +----------------+
                        |
                OCR text / Markdown table
```

## Resource Requirements

| Metric | Value |
|--------|-------|
| RAM (idle, model loaded) | ~2 GB |
| RAM (peak, during inference) | ~3 GB |
| CPU (during inference) | All configured threads saturated |
| Disk (image) | ~2.1 GB |
| Startup time (model load) | ~5-10s |
| Inference time per page | ~40-60s CPU, ~2-3s with GPU |

## Why llama.cpp Instead of Ollama

- **Smaller image**: No Ollama runtime, no model registry, no Go binary

- **More control**: Direct access to `--cache-type-k`, `--flash-attn`, `--temperature` flags

- **Slight speedup**: llama.cpp direct is marginally faster than Ollama for the same model

- **Simpler healthcheck**: `curl /health` on a single-purpose server

- **Appliance model**: One image, one model, one purpose

## Model Details

| File | Size | Purpose |
|------|------|---------|
| `GLM-OCR-F16.gguf` | 1.79 GB | Language decoder (GLM-0.5B); F16 for maximum OCR accuracy |
| `mmproj-GLM-OCR-Q8_0.gguf` | ~160 MB | CogViT visual encoder + projection |

F16 is chosen for the decoder because, at only 0.9B parameters, the size penalty over Q8_0 is modest (1.79 GB vs ~950 MB), while F16 preserves full precision for financial documents, where a single misread digit matters.

### Slim variant (Q8_0 decoder)

For bandwidth-constrained deployments, a `:slim` tag using the Q8_0 decoder (~950 MB vs 1.79 GB) reduces the compressed image by roughly 700 MB. Accuracy loss is minimal for printed documents; for handwritten notes or faint scans, stick with the default F16 tag.

```bash
docker pull ghcr.io/kisaesdevlab/vibe-glm-ocr:slim
```

To build the slim variant locally, fork the repository, change the decoder filename in the Dockerfile's `model-fetcher` stage to `GLM-OCR-Q8_0.gguf`, and update the entrypoint's `--model` path to match.
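
The decoder sizes quoted above (1.79 GB for F16, ~950 MB for Q8_0) line up with a back-of-the-envelope estimate from the parameter count. The sketch below assumes every weight is stored in the same format and ignores GGUF metadata and any tensors kept at a different precision, so it is an approximation only. F16 uses 16 bits per weight; Q8_0 stores each block of 32 weights as 32 8-bit values plus one 16-bit scale, or 8.5 bits per weight.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate tensor storage, ignoring file metadata."""
    return n_params * bits_per_weight / 8


F16_BITS = 16.0
# Q8_0 block: 32 int8 weights (256 bits) + one fp16 scale (16 bits).
Q8_0_BITS = (32 * 8 + 16) / 32  # 8.5 bits per weight

N_PARAMS = 0.9e9  # parameter count quoted in this README

f16_gb = weight_bytes(N_PARAMS, F16_BITS) / 1e9
q8_gb = weight_bytes(N_PARAMS, Q8_0_BITS) / 1e9
saving_gb = f16_gb - q8_gb

print(f"F16 ~{f16_gb:.2f} GB, Q8_0 ~{q8_gb:.2f} GB, saving ~{saving_gb:.2f} GB")
```

The estimate gives roughly 1.8 GB for F16 and just under 1 GB for Q8_0, close to the published file sizes. The image-size saving quoted for `:slim` is smaller than the raw difference because it refers to the compressed image.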

## Operations

### Log rotation

`llama-server` logs request lines and token counts to stdout. Under sustained traffic, unbounded Docker logs will eventually fill the host disk. Configure the `json-file` driver with rotation, or switch to `journald` / a remote syslog sink.

Per-container (Docker CLI):

```bash
docker run -p 8090:8090 \
  --log-driver json-file \
  --log-opt max-size=50m \
  --log-opt max-file=5 \
  ghcr.io/kisaesdevlab/vibe-glm-ocr:latest
```

Compose:

```yaml
services:
  ocr-server:
    image: ghcr.io/kisaesdevlab/vibe-glm-ocr:latest
    logging:
      driver: json-file
      options:
        max-size: "50m"
        max-file: "5"
```

Host-wide default lives in `/etc/docker/daemon.json`:

```json
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "50m", "max-file": "5" }
}
```

## License

MIT (Dockerfile, entrypoint scripts, and repository code). GLM-OCR model is MIT licensed. llama.cpp is MIT licensed.