{"id":50912761,"url":"https://github.com/rapidai/rapidocrvl","last_synced_at":"2026-06-16T12:00:25.374Z","repository":{"id":365198189,"uuid":"1268178896","full_name":"RapidAI/RapidOCRvl","owner":"RapidAI","description":"Inference Service for PaddleOCR  VL 0.9B","archived":false,"fork":false,"pushed_at":"2026-06-16T10:23:21.000Z","size":517,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-16T10:27:30.645Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RapidAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-13T08:21:31.000Z","updated_at":"2026-06-16T10:23:24.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/RapidAI/RapidOCRvl","commit_stats":null,"previous_names":["rapidai/rapidocrvl"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/RapidAI/RapidOCRvl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RapidAI%2FRapidOCRvl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RapidAI%2FRapidOCRvl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RapidAI%2FRapidOCRvl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RapidAI%2FRapidOCRvl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/owners/RapidAI","download_url":"https://codeload.github.com/RapidAI/RapidOCRvl/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RapidAI%2FRapidOCRvl/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34404748,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-16T02:00:06.860Z","response_time":126,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-16T12:00:18.244Z","updated_at":"2026-06-16T12:00:25.318Z","avatar_url":"https://github.com/RapidAI.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PaddleOCR-VL 0.9B Pure Go Runtime\n\nExperimental pure Go inference runtime for the PaddleOCR-VL 0.9B Hugging Face\ncheckpoint.\n\nThis project is pure Go at runtime: no Python, no PaddlePaddle, no PyTorch, no\nTransformers, and no external inference subprocess.\n\nCurrent scope:\n\n- reads `config.json`\n- reads Hugging Face `model.safetensors` and sharded\n  `model.safetensors.index.json`\n- supports BF16/F16/F32 tensor conversion\n- implements the ERNIE text decoder path in Go\n- implements the PaddleOCR-VL image preprocessing path in Go\n- implements the vision Transformer encoder path in Go\n- projects visual features into text hidden states and replaces image tokens\n- provides a token-id based greedy 
generation CLI\n\nNot complete yet:\n\n- full SentencePiece parity\n- exact bicubic resize parity with Pillow\n- GPU/SIMD kernels\n\n## Usage\n\nDownload the model files:\n\n```powershell\ngo run ./cmd/paddleocrvl-download D:\\models\\PaddleOCR-VL\n```\n\nThe downloader also supports `-out`, `-base-url`, `-timeout`, and `-json` for\nmirrors and automated setup. JSON output includes per-file bytes and SHA256.\n\nConvert safetensors to GGUF manually:\n\n```powershell\ngo run ./cmd/paddleocrvl-convert -model-dir D:\\models\\PaddleOCR-VL\n```\n\nThe converter logs progress by default; use `-progress=false` for quiet runs\nand `-json` for a machine-readable conversion summary. Conversion summaries\ninclude output path, bytes, SHA256, source, F32, and quantized tensor counts.\nUse `-gomaxprocs` and `-gc-percent` during large conversions to tune CPU and\nGC behavior. Custom `-out` parent directories are created automatically.\n\nBuild a quantized GGUF for faster startup and smaller text weights:\n\n```powershell\ngo run ./cmd/paddleocrvl-convert -model-dir D:\\models\\PaddleOCR-VL -quant q8\ngo run ./cmd/paddleocrvl-convert -model-dir D:\\models\\PaddleOCR-VL -quant q6\ngo run ./cmd/paddleocrvl-convert -model-dir D:\\models\\PaddleOCR-VL -quant q4\n```\n\nAll runtime loaders prefer GGUF. If it is missing and safetensors weights exist\nas either `model.safetensors` or `model.safetensors.index.json` shards, the\nloader converts to GGUF automatically before inference. 
With `-quant q8`,\n`-quant q6`, or `-quant q4`, loaders prefer `model-q8.gguf`, `model-q6.gguf`,\nor `model-q4.gguf`; if missing, the loader converts directly to a quantized\nGGUF before inference.\n\nCLI, server, and local benchmark loaders log auto-conversion and text-weight\nloading progress during first load.\n\nInspect model metadata without loading all weights:\n\n```powershell\ngo run ./cmd/paddleocrvl-inspect D:\\models\\PaddleOCR-VL\ngo run ./cmd/paddleocrvl-inspect -json D:\\models\\PaddleOCR-VL\n```\n\nInspect output reports the active weight format, quantization, path, and file\nsize directly. JSON and text output also include per-weight-file SHA256 hashes.\n\nThen run token-id inference:\n\n```powershell\ngo run ./cmd/paddleocrvl-go `\n  -model-dir D:\\models\\PaddleOCR-VL `\n  -tokens 100273,1234,5678 `\n  -max-new-tokens 16\n```\n\nImage run:\n\n```powershell\ngo run ./cmd/paddleocrvl-go `\n  -model-dir D:\\models\\PaddleOCR-VL `\n  -image D:\\docs\\page.png `\n  -tokens 100273,101305,100295,101306,1234 `\n  -max-new-tokens 64\n```\n\nWhen `-image` is set, a single `100295` image placeholder is expanded to the\nnumber of projected visual tokens for that image.\n\nOfficial-style task prompt:\n\n```powershell\ngo run ./cmd/paddleocrvl-go `\n  -model-dir D:\\models\\PaddleOCR-VL `\n  -image D:\\docs\\page.png `\n  -task ocr `\n  -max-new-tokens 1024 `\n  -decode-generated-only `\n  -skip-special\n```\n\n`-task` accepts `ocr`, `table`, `formula`, and `chart`.\n\nHTTP inference service:\n\n```powershell\ngo run ./cmd/paddleocrvl-server -model-dir D:\\models\\PaddleOCR-VL -addr 127.0.0.1:8080\n```\n\nAdmin console and API documentation:\n\n- open `http://127.0.0.1:8080/admin` for first-run admin initialization,\n  login, model-path settings, API key issuance, quota management, and service\n  overview.\n- open `http://127.0.0.1:8080/doc` for human-readable API documentation.\n- open `http://127.0.0.1:8080/doc/openapi.json` for machine-readable OpenAPI\n  
3.1, useful for other AI agents or HTTP clients. Operations include stable\n  `operationId` values, `status`/`inference` tags, and request examples for\n  OCR, JSON generation, and batch calls, plus success/error response examples.\n- open `http://127.0.0.1:8080/doc/llms.txt` for a concise plaintext integration\n  guide intended for AI agents.\n\nThe first admin initialization creates a default API key and shows the key only\nonce. Additional keys are generated by the server in the admin console. Keys can\nbe named or renamed, disabled, deleted, assigned a request quota, and reset for a new quota\nperiod. Keys can also have a per-minute rate limit to protect local inference\ncapacity. Keys can be rotated when leaked; the old key is revoked immediately\nand the new plaintext key is shown once. The console shows recent usage time and\nclient IP per key. A quota or rate limit of `0` means unlimited. Inference\nendpoints require:\n\n```text\nAuthorization: Bearer \u003cAPI_KEY\u003e\n```\n\nThe admin console keeps an in-memory recent-call audit log with API key, client\nIP, path, HTTP status, latency, authentication/quota errors, and JSON error\nmessages returned by inference handlers. It stores request metadata only, not\nuploaded images or prompts. The recent audit log can be exported as CSV or\ncleared from the admin console before a focused integration test.\n\nAdmin configuration can be exported and restored from the Settings page. 
Backups\ninclude admin password hashes, API key hashes/limits, model path settings, and\npost-processing path settings, but never plaintext API keys or passwords.\nAdmin pages and admin JSON endpoints use `no-store`, frame-deny, MIME nosniff,\nsame-origin referrer, and a restrictive content security policy.\nAdmin login applies a short failed-attempt lockout per client IP.\nAdmin write endpoints reject cross-origin `Origin`/`Referer` headers.\n\nUseful service flags:\n\n- `-admin-config paddleocrvl-admin.json` sets the admin console config path.\n- `-timeout 10m` sets a per-request timeout.\n- `-shutdown-timeout 30s` controls graceful shutdown after SIGINT/SIGTERM.\n- `-request-limit 134217728` caps request body size.\n- `-multipart-memory 33554432` caps memory used while parsing multipart forms.\n- `-max-new-limit 4096` caps generated tokens per request.\n- `-max-input-tokens 0` caps prompt/input tokens per request; `0` disables the cap.\n- `-max-batch-size 0` caps `/v1/batch` item count; `0` disables the cap.\n- generation options reject negative `max_new_tokens`, `temperature`, and `top_k`.\n- `-concurrency 1` controls concurrent inference slots.\n- `-gomaxprocs 0` controls Go CPU worker threads; `0` keeps the current value.\n- `-gc-percent 0` controls Go GC target; `0` keeps current value and `-1` disables GC.\n- `-preload-vision` loads vision weights during startup.\n- `-warmup` runs one text-token warmup during startup.\n- `-quant q8` enables row-wise int8 text-weight quantization.\n- `-quant q6` enables row-wise int6 text-weight quantization.\n- `-quant q4` enables row-wise int4 text-weight quantization.\n- `-quant auto` picks an existing `model-q4.gguf`, then `model-q6.gguf`,\n  then `model-q8.gguf`, then `model.gguf`; if only safetensors exists, it\n  builds `model-q6.gguf`.\n\nWindows service control:\n\n```powershell\npaddleocrvl-server.exe service install -model-dir \"C:\\ProgramData\\PaddleOCRVL\\models\" -admin-config 
\"C:\\ProgramData\\PaddleOCRVL\\paddleocrvl-admin.json\" -addr 127.0.0.1:8080\npaddleocrvl-server.exe service start\npaddleocrvl-server.exe service stop\npaddleocrvl-server.exe service uninstall\n```\n\n## Installers\n\nGitHub Actions release workflow:\n\n- `.github/workflows/release.yml` runs tests, then builds Windows NSIS,\n  macOS PKG, and Linux AppImage artifacts.\n- Windows builds `amd64` and `arm64` installers.\n- macOS builds one universal PKG containing x86_64 and arm64 binaries.\n- Linux builds AppImage artifacts for `x86_64` and `aarch64`.\n- Push a tag like `v1.0.0`, or run the workflow manually with a version input.\n\nWindows NSIS installer:\n\n```powershell\n.\\packaging\\windows\\build-nsis.ps1 -Version 1.0.0\n```\n\nThe installer packages `paddleocrvl-client.exe` and `paddleocrvl-server.exe`.\nOn install it registers `PaddleOCRVLService` as an automatic NT service and\nstarts it. On uninstall it stops and removes the service before deleting files.\nThe service uses `C:\\ProgramData\\PaddleOCRVL\\models` as its default model\ndirectory and `C:\\ProgramData\\PaddleOCRVL\\paddleocrvl-admin.json` for admin\nstate. Put the model files in the default model directory before installing, or\nthe installer will fail when it verifies that the service can start.\n\nmacOS universal PKG installer:\n\n```sh\nVERSION=1.0.0 sh ./packaging/macos/build-pkg.sh\n```\n\nLinux AppImage:\n\n```sh\nVERSION=1.0.0 ARCH=x86_64 sh ./packaging/linux/build-appimage.sh\nVERSION=1.0.0 ARCH=aarch64 sh ./packaging/linux/build-appimage.sh\n```\n\nThe Linux script uses `linuxdeploy` with the GTK plugin so the Wails/WebKitGTK\nclient gets a relocatable AppDir before AppImage output.\n\nThe PKG contains a universal Wails client app, a universal\n`paddleocrvl-server`, and a LaunchDaemon named\n`com.znsoft.paddleocrvl.service`. Post-install scripts load and kickstart the\nservice. 
The default model directory is\n`/Library/Application Support/PaddleOCRVL/models`; it must contain model files\nbefore installation so the LaunchDaemon can start successfully. The package\nalso installs `/usr/local/paddleocrvl/uninstall.sh`, which removes the\nLaunchDaemon, server binary, and client app while keeping model/admin data.\n\nAdditional quantization and backend flags:\n\n- `-quant auto-fast` prefers/builds Q4 for speed and size.\n- `-quant auto-quality` prefers/builds Q8 for quality.\n- `-backend cpu|auto|vulkan` selects the compute backend. `vulkan` is strict for\n  loader/interface probing and fails startup if Vulkan is unavailable; `auto`\n  falls back to CPU. The current pure-Go Vulkan layer exposes loader/device/driver\n  status plus registered matvec/QKV/SwiGLU compute-kernel plans and model-shaped\n  dispatch grids/summaries/stage execution graph in `/stats`. Kernel metadata\n  includes descriptor bindings, push-constant ABI, and comparable pipeline cache\n  keys. Command plans also expose per-dispatch resource bindings, pipeline\n  layout plans, shader-module plans, descriptor-set layout/index plans,\n  push-constant payloads, descriptor write plans, command-buffer recording\n  plans, dispatch-batch bind-reuse plans, buffer barrier plans, buffer\n  allocation plans, host/device transfer plans, byte ranges, and aligned\n  storage-buffer allocation sizes, plus descriptor-pool, command-pool,\n  queue-submit, timeline, fence, pipeline-cache, pipeline-lifecycle, and\n  validation plans for future descriptor-set writes. `/stats` and\n  `-stats-only -json` expose `vulkan_command_plan_valid` and\n  `vulkan_command_plan_error`. The layer reports CPU as the active tensor\n  backend until GPU command submission is enabled.\n\nHealth and stats:\n\n```powershell\ncurl http://127.0.0.1:8080/health\ncurl http://127.0.0.1:8080/ready\ncurl http://127.0.0.1:8080/stats\n```\n\n`/ready` reports whether the model, tokenizer, and inference slots are\ninitialized. 
`/stats` includes uptime, in-flight requests, request counters, failures,\ncancel count, generated token count, average latency, average queue wait, last error, requested/effective\nquantization, loaded `weight_path`, `weight_sha256`, Go memory stats, and model\ndimensions. `weight_source` is `existing_gguf` when loading a GGUF file directly\nand `converted_safetensors` when the loader converted original safetensors.\n`load_stats` reports milliseconds spent in weight open/auto-convert, text preload,\nruntime quantization, and total load. The `cache` section includes reusable task\nprompts, tokenizer cache stats, and runtime vision position-table cache stats.\n\nBatch JSON inference:\n\n```powershell\ncurl -X POST http://127.0.0.1:8080/v1/batch ^\n  -H \"Authorization: Bearer \u003cAPI_KEY\u003e\" ^\n  -H \"Content-Type: application/json\" ^\n  -d \"{\\\"requests\\\":[{\\\"prompt\\\":\\\"\u003c|begin_of_sentence|\u003ehello\\\",\\\"max_new_tokens\\\":1},{\\\"task\\\":\\\"ocr\\\",\\\"image_path\\\":\\\"D:\\\\docs\\\\page.png\\\",\\\"decode\\\":true,\\\"decode_generated_only\\\":true,\\\"skip_special\\\":true}]}\"\n```\n\nBatch responses include `items`, aggregate `generated_tokens`, and per-request\n`responses`. 
Batch items run concurrently up to the server `-concurrency`\nslot limit while preserving response order.\n\nHTTP benchmark:\n\n```powershell\ngo run ./cmd/paddleocrvl-bench `\n  -url http://127.0.0.1:8080/v1/generate `\n  -n 20 `\n  -c 1 `\n  -prompt \"\u003c|begin_of_sentence|\u003ehello\" `\n  -max-new-tokens 1 `\n  -temperature 0 `\n  -top-k 0 `\n  -batch-size 1\n```\n\nLocal runtime benchmark, bypassing HTTP:\n\n```powershell\ngo run ./cmd/paddleocrvl-bench `\n  -mode local `\n  -model-dir D:\\models\\PaddleOCR-VL `\n  -n 5 `\n  -c 1 `\n  -prompt \"\u003c|begin_of_sentence|\u003ehello\" `\n  -max-new-tokens 1\n```\n\nKernel microbenchmarks:\n\n```powershell\ngo test ./internal/tensor -bench \"MatVec|Fused|Quantize\" -run \"^$\"\ngo test ./internal/model -bench \"Sample|TopK\" -run \"^$\"\ngo test ./internal/tokenizer -bench Encode -run \"^$\"\n```\n\nSet `-batch-size` above `1` to benchmark `/v1/batch`; the default `/v1/generate`\nURL is automatically rewritten to `/v1/batch`. Benchmark output includes\nrequest, item, and generated token throughput, plus `last_error` when failures\noccur. 
Add `-json` for machine-readable benchmark results with CPU and memory\nsnapshots plus mode/backend/quantization/weight path/source context.\n\nJSON request with an image path:\n\n```powershell\ncurl -X POST http://127.0.0.1:8080/v1/generate ^\n  -H \"Authorization: Bearer \u003cAPI_KEY\u003e\" ^\n  -H \"Content-Type: application/json\" ^\n  -d \"{\\\"task\\\":\\\"ocr\\\",\\\"image_path\\\":\\\"D:\\\\docs\\\\page.png\\\",\\\"max_new_tokens\\\":1024,\\\"decode\\\":true,\\\"decode_generated_only\\\":true,\\\"skip_special\\\":true}\"\n```\n\nJSON request with base64 image data:\n\n```powershell\ncurl -X POST http://127.0.0.1:8080/v1/generate ^\n  -H \"Authorization: Bearer \u003cAPI_KEY\u003e\" ^\n  -H \"Content-Type: application/json\" ^\n  -d \"{\\\"task\\\":\\\"ocr\\\",\\\"image_base64\\\":\\\"\u003cbase64-or-data-url\u003e\\\",\\\"max_new_tokens\\\":1024,\\\"eos_token_ids\\\":[2],\\\"decode\\\":true,\\\"decode_generated_only\\\":true,\\\"skip_special\\\":true}\"\n```\n\nBase64 and multipart image requests are decoded in memory and passed directly\nto the Go image preprocessing path. JSON generation responses include\n`tokens`, `prompt_tokens`, and `generated_tokens`; `text` is included when\ndecoding is requested.\n\nMultipart upload:\n\n```powershell\ncurl -X POST http://127.0.0.1:8080/v1/ocr ^\n  -H \"Authorization: Bearer \u003cAPI_KEY\u003e\" ^\n  -F \"task=ocr\" ^\n  -F \"image=@D:\\docs\\page.png\"\n```\n\nDesktop client:\n\n```powershell\ncd cmd\\paddleocrvl-client\nwails dev\n```\n\nThe Wails client lets you set API URL and API Key, check `/ready`, open `/doc`,\nchoose and preview one or more images, manage the selected image queue with\nper-image status, upload them to the multipart OCR endpoint, optionally continue a batch after per-image\nerrors, copy or save results, export batch results as JSON, set a request timeout, cancel in-flight\nrequests, inspect each batch result row, view the decoded text plus raw JSON, and reopen the last 10 local\nruns from history. Client settings can be imported/exported as JSON. If the API\nURL has no path, the client appends `/v1/ocr`. 
API Key is sent as both\n`Authorization: Bearer \u003ckey\u003e` and `X-API-Key`.\n\nPrompt inference:\n\n```powershell\ngo run ./cmd/paddleocrvl-go `\n  -model-dir D:\\models\\PaddleOCR-VL `\n  -prompt \"\u003c|begin_of_sentence|\u003ehello\" `\n  -max-new-tokens 16 `\n  -decode-generated-only\n```\n\nLoad, convert, quantize, and print memory stats without generating:\n\n```powershell\ngo run ./cmd/paddleocrvl-go -model-dir D:\\models\\PaddleOCR-VL -quant auto -stats-only\ngo run ./cmd/paddleocrvl-go -model-dir D:\\models\\PaddleOCR-VL -quant auto-fast -verify-only -verify-vision\n```\n\nAdd `-json` to `paddleocrvl-go` for machine-readable stats, verification, or\ngeneration output.\nStats and verification output include the loaded `weight_path` and\n`weight_source`. Stats output also includes `load_stats`.\n\n`-stats-only` also prints CPU features and backend details. On Linux, Vulkan\nbackend details include discovered ICD manifests and driver API versions; on\nWindows, the loader reports the Vulkan instance API version when available.\nCPU details include `num_cpu` and `gomaxprocs` for throughput tuning. Memory\ndetails include heap, system allocation, object, and GC counters.\n`-verify-only` exits after text weights load; add `-verify-vision` to force\nvision weight loading too.\n\nTokenizer support is based on `tokenizer.json`: added special tokens, BPE merge\nranks, byte fallback, and the model's space replacement decoder are implemented.\n\nThe runtime uses Go CPU parallelism for large linear layers and batched vision\nprojections. Set `GOMAXPROCS` to control CPU worker count.\n\nOn Windows, if a default `go build ./cmd/...` output executable is locked by a\nrunning process, build to an explicit path:\n\n```powershell\ngo build -o .\\.gocache\\bin\\paddleocrvl-server.exe ./cmd/paddleocrvl-server\n```\n\nGeneration defaults to greedy decoding. 
Set `-temperature` above `0` to sample;\ncombine with `-top-k` and `-seed` for reproducible sampled runs.\n\nAcceleration work in tree:\n\n- row-wise int8 quantized text projection/MLP path (`-quant q8`)\n- row-wise int6 quantized text projection/MLP path (`-quant q6`)\n- row-wise int4 quantized text projection/MLP path (`-quant q4`)\n- quantized GGUF conversion/loading path (`model.safetensors` -\u003e `model-q8.gguf` / `model-q6.gguf` / `model-q4.gguf`)\n- row-streamed GGUF quantized conversion to reduce peak memory during first load\n- reusable safetensors row buffers during GGUF quantized conversion to avoid\n  per-tensor block reallocations\n- GGUF quantized conversion reuses scale and quantized-row buffers across\n  tensors to reduce multi-tensor conversion churn\n- reusable GGUF F32 row buffers during runtime Q8/Q6/Q4 quantization from\n  `model.gguf`\n- lower-peak GGUF quantized tensor loading\n- single-read GGUF Q8/Q6/Q4 tensor loading with shared scale/data backing\n- existing GGUF load path opens candidate files directly and skips a separate\n  pre-open stat call\n- GGUF metadata open reuses shape backing storage and zero-copy string views to\n  reduce first-load allocation count\n- safetensors fallback probing is lazy and cached during weight selection\n- row-streamed F32 GGUF -\u003e Q8/Q6/Q4 runtime quantization when only `model.gguf`\n  is available\n- pre-sized runtime text/vision weight maps to reduce loader rehash churn\n- release text-weight map entries after caching layer pointers to reduce\n  runtime map scanning and retained references\n- release vision-weight map entries after caching vision layer pointers for the\n  same reason\n- unrolled safetensors BF16/F16 decode paths for first-load conversion\n- wider safetensors F16 decode loop for lookup-table conversion\n- fused Q8/Q6/Q4 SwiGLU MLP path\n- fused Q/K/V attention projection path for F32/Q8/Q6/Q4 weights\n- fused residual-add + RMSNorm path in the text decoder, including next-layer\n  
pre-normalization after MLP residuals\n- vision residual LayerNorm path uses the faster two-pass AddInPlace+LayerNorm\n  sequence on Go CPU kernels\n- per-token text RoPE table reuse across decoder layers to avoid repeated\n  sin/cos table builds\n- `head_dim=128` and `head_dim=64` attention-score dot-product fast paths\n- single-token `head_dim=128` and `head_dim=64` text attention-score fast paths\n- fused text KV-cache score + Softmax + value path for stable 2-token decoding\n- `head_dim=128` and `head_dim=64` attention value aggregation fast paths for\n  text KV cache and vision attention\n- short-context text KV and vision value aggregation fast paths for\n  `head_dim=64`\n- unrolled RMSNorm hot path\n- unrolled greedy Argmax path\n- unrolled Softmax path\n- specialized length-5/6/7/8 Softmax paths for early attention steps\n- wider Q4/Q6 dot-product decode loops\n- Q8/Q6/Q4 decode lookup tables for quantized dot products\n- single-pass Q8 triplet dot product for fused Q/K/V projection\n- wider Q8/Q6/Q4 row quantization loops for safetensors -\u003e GGUF conversion\n- wider F32/Q8 dot-product loops\n- parallel GELU over large vision MLP row batches\n- no-sort full-vocab sampling path and unsorted heap-backed top-k sampling\n- full-vocab sampling max pass avoids per-logit temperature multiplication\n- full-vocab `temperature=1` sampling fast path skips per-logit temperature\n  scaling\n- top-k sampling scan skips eight low-score logits per branch before heap work\n- sampled-token weighted pick loop checks eight weights per iteration\n- short EOS-id lists use direct comparisons in the generation loop\n- greedy decoding skips RNG initialization when sampling is disabled\n- zero-token text generation returns before allocating KV/scratch state\n- zero-token image generation returns before image decode/vision encoding\n- multimodal RoPE position construction tracks the current maximum position\n  incrementally instead of rescanning previous tokens for every image 
block\n- multimodal RoPE position buffers are reused from generation scratch during\n  image generation to avoid per-request position slice allocation\n- server greedy requests skip random seed generation when sampling is disabled\n- CLI greedy requests skip random seed generation when sampling is disabled\n- single-item batch requests avoid unused response/error slice allocation\n- health, ready, and stats endpoints use fixed response structs instead of\n  dynamic maps for lower monitoring overhead\n- short server error responses use a stack buffer before falling back to heap\n- read-only tokenizer encode cache path for server/CLI prompts to avoid cached\n  slice copies\n- tokenizer special-token matching keeps ordered token/id entries to avoid map\n  lookups on prefix matches\n- empty tokenizer encode/decode inputs return before cache locks or builders\n- shared RGBA resize/preprocess bilinear-index backing to reduce hot-path\n  allocations\n- exact-size RGBA preprocessing skips bilinear resize and extracts patches\n  directly in parallel\n- base 27x27 vision position tables reuse a cached row view instead of\n  interpolating or allocating\n- vision position and RoPE cache hits use read locks so concurrent image\n  requests do not serialize on monitoring/cache reads\n- BF16/F16 safetensors -\u003e quantized GGUF conversion reuses a raw decode buffer\n- direct BF16/F16 safetensors row quantization during runtime load reuses raw\n  decode storage across tensors\n- safetensors model selection opens candidate files directly and preserves bad\n  single-file errors instead of probing with extra stat calls\n- incremental vision projection block indexing to reduce per-row integer\n  division/modulo work\n- pooled attention score buffers and lower-overhead KV cache appends\n- packed per-layer text attention score buffers to reduce first-request and\n  long-context allocation count\n- generation scratch and KV cache getters have direct fallback allocation paths\n  for 
pool-miss robustness without changing the pooled hot path\n- pre-sized tokenizer decode buffers\n- tokenizer byte-fallback decode fast paths for pure byte streams and single\n  byte tokens\n- mixed tokenizer byte-fallback decode uses one pass with lazy string builder\n- tokenizer Unicode byte fallback encodes UTF-8 into a stack buffer instead of\n  allocating `[]byte` per unknown rune\n- pooled vision projection scratch buffers and compact vision scratch row storage\n- on-demand pooled vision embedding buffer in the image encoder to avoid a\n  separate large embedding allocation during `EncodeImage`\n- vision embedding rows share the main vision scratch backing block to reduce\n  first image encode allocation count\n- cached vision RoPE tables by image grid and head dimension to avoid repeated\n  per-image table allocations\n- compact vision RoPE table storage with one backing block per axis table\n- fused RGBA resize + patch extraction precompute tables to cut image\n  preprocessing allocations\n- current-goroutine first worker for RGBA resize and patch extraction to reduce\n  scheduler allocation overhead\n- adaptive vision projection worker limits to reduce scheduler overhead on small\n  image grids while keeping full parallelism for large projections\n- adaptive batched row-projection worker limits for small per-patch vision\n  embeddings\n- batch endpoint holds one inference slot across batch execution to reduce scheduler churn\n- CPU parallel matrix-vector/matrix-row kernels with unrolled dot products\n- CPU feature reporting in `/stats` (`arm64` reports NEON-capable target)\n- Vulkan loader probing on Windows and Linux exposed in `/stats`\n- Vulkan operator registry for f32/Q8/Q6/Q4 matvec and fused QKV/SwiGLU kernel\n  plans with 256-thread reductions\n- current-model Vulkan dispatch plans and expanded dispatch/weight-byte summaries\n  exposed in CLI/server JSON stats\n- current-model Vulkan execution graph groups text and vision stages for future\n  
command-buffer submission\n- Vulkan kernel ABI metadata exposes descriptor bindings, push constants, and\n  activation input/output byte estimates\n- Vulkan pipeline cache keys and unique pipeline counts are exposed for future\n  layout/pipeline reuse\n- Vulkan pipeline precreation plans expose unique pipeline keys, stage, reference\n  counts, and expanded dispatch counts\n- Vulkan pipeline plans include layout indices built from descriptor signatures\n  and push-constant sizes for future pipeline-layout reuse\n- Vulkan pipeline plans include shader-module indices and shader module plans\n  with entry point, source hash, local size, tile columns, specialization\n  constants, and pipeline refs\n- Vulkan command plans map model ops to pipeline slots with dispatch grids and\n  repeat counts for future command-buffer recording\n- Vulkan command plans include per-command input/weight/scale/output resources,\n  descriptor binding indices, access modes, and byte sizes for future\n  descriptor-set writes\n- Vulkan command plans include per-command descriptor-set layout/index plans and\n  concrete push-constant payloads for rows/columns\n- Vulkan command plans include storage-buffer descriptor write records and\n  resource/write counts for each model-shaped dispatch plan\n- Vulkan command plans expose unique pipeline-layout plans with storage-buffer\n  bindings and pipeline reference counts\n- Vulkan command plans emit future command-buffer recording records for\n  bind-pipeline, bind-descriptor-set, push-constants, and dispatch steps\n- Vulkan command plans include dispatch-batch grouping for consecutive commands\n  that can reuse pipeline binding while updating descriptor sets and push\n  constants\n- Vulkan command plans include buffer memory barrier plans for host-to-compute\n  reads and compute-write-to-compute-read handoff around planned dispatches\n- Vulkan command plans include per-resource buffer usage, memory property, and\n  aligned total buffer-byte allocation 
plans for future device-memory binding\n- Vulkan command plans include host-to-device upload and device-to-host readback\n  transfer plans with byte totals for future staging-buffer execution\n- Vulkan command plans include descriptor-pool sizing, compute command-pool\n  sizing, and single-submit queue plans for future queue submission\n- Vulkan command plans include timeline semaphore values and fence reset/wait\n  plans for future queue completion tracking\n- Vulkan command plans include compute pipeline cache keys, reuse counts, and\n  pipeline create/destroy lifecycle plans for future cache-backed pipeline setup\n- Vulkan command plans have a pure-Go validator for pipeline/layout/shader,\n  descriptor/resource, dispatch, sync, and lifecycle consistency\n- Vulkan command-plan validation status is exposed as\n  `vulkan_command_plan_valid`/`vulkan_command_plan_error` in server and CLI JSON\n  stats\n- Linux Vulkan ICD manifest discovery exposed in `/stats` and `-stats-only`\n- Linux Vulkan ICD relative `library_path` entries are resolved against the\n  manifest directory for clearer driver reporting\n- GGUF conversion/loading path (`model.safetensors` -\u003e `model.gguf`)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frapidai%2Frapidocrvl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frapidai%2Frapidocrvl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frapidai%2Frapidocrvl/lists"}