{"id":31644500,"url":"https://github.com/aatricks/llmedge","last_synced_at":"2026-04-13T04:36:27.273Z","repository":{"id":316106091,"uuid":"1059519757","full_name":"Aatricks/llmedge","owner":"Aatricks","description":"Library for using gguf models on android devices, powered by llama.cpp","archived":false,"fork":false,"pushed_at":"2025-09-22T17:43:23.000Z","size":72318,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-22T19:08:10.620Z","etag":null,"topics":["android","gguf","inference","kotlin","library","llamacpp","llm-inference"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Aatricks.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-18T14:54:01.000Z","updated_at":"2025-09-22T17:43:27.000Z","dependencies_parsed_at":"2025-09-22T19:23:56.922Z","dependency_job_id":null,"html_url":"https://github.com/Aatricks/llmedge","commit_stats":null,"previous_names":["aatricks/llmedge"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/Aatricks/llmedge","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aatricks%2Fllmedge","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aatricks%2Fllmedge/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aatricks%2Fllmedge/releases","manifests_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aatricks%2Fllmedge/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Aatricks","download_url":"https://codeload.github.com/Aatricks/llmedge/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aatricks%2Fllmedge/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278722768,"owners_count":26034461,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-07T02:00:06.786Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["android","gguf","inference","kotlin","library","llamacpp","llm-inference"],"created_at":"2025-10-07T04:53:28.172Z","updated_at":"2026-04-13T04:36:27.260Z","avatar_url":"https://github.com/Aatricks.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# llmedge\n\n**llmedge** is a lightweight Android library for running GGUF language models fully on-device, powered by [llama.cpp](https://github.com/ggerganov/llama.cpp).\n\nSee the [examples repository](https://github.com/Aatricks/llmedge-examples) for sample usage.\n\nAcknowledgments to Shubham Panchal and upstream projects are listed in [`CREDITS.md`](./CREDITS.md).\n\n\u003e [!NOTE]\n\u003e This library is in early development and may change significantly.\n\n\u003e [!IMPORTANT]\n\u003e API maturity is uneven by 
feature area. `LLMEdge`, text inference, speech inference, and model management are the most stable entry points today. OCR via `edge.vision.extractText(...)` is also reliable. Vision/VLM analysis, RAG, and some image/video-generation flows are available and tested, but should still be treated as evolving APIs.\n\n---\n\n## Features\n\n- **LLM Inference**: Run GGUF models directly on Android using llama.cpp (JNI)\n- **Model Downloads**: Download and cache models from Hugging Face Hub\n- **Optimized Inference**: Native KV cache reuse for compact chats, default batched blocking and streaming text generation, separate prompt vs generation thread tuning, and Kotlin-managed `ChatSession` replay for reasoning-heavy models\n- **Speech-to-Text (STT)**: Whisper.cpp integration with timestamp support, language detection, streaming transcription, and SRT generation\n- **Text-to-Speech (TTS)**: Bark.cpp integration with ARM optimizations\n- **Image Generation**: Stable Diffusion with EasyCache and LoRA support\n- **Video Generation**: Wan 2.1 models (4-64 frames) with sequential loading\n- **On-device RAG**: PDF indexing, embeddings, vector search, Q\u0026A\n- **OCR**: Google ML Kit text extraction\n- **Memory Metrics**: Built-in RAM usage monitoring\n- **Vision Models**: Architecture prepared for LLaVA-style models (requires specific model formats)\n- **GPU Acceleration**: Optional Android GPU backends for text, Whisper, and image/video with experimental OpenCL preferred first, Vulkan fallback second, and CPU fallback last\n\n---\n\n## Table of Contents\n\n1. [Installation](#installation)\n2. 
[Usage](#usage)\n   - [Quick Start](#quick-start)\n   - [Downloading Models](#downloading-models)\n   - [Reasoning Controls](#reasoning-controls)\n   - [Managed Chat Sessions](#managed-chat-sessions)\n   - [Tool Calling](#tool-calling)\n   - [Speech Request Objects](#speech-request-objects)\n   - [Text Generation Performance Tuning](#text-generation-performance-tuning)\n   - [Image Text Extraction (OCR)](#image-text-extraction-ocr)\n   - [Vision Models](#vision-models)\n   - [Speech-to-Text (Whisper)](#speech-to-text-whisper)\n   - [Text-to-Speech (Bark)](#text-to-speech-bark)\n   - [Stable Diffusion (image generation)](#stable-diffusion-image-generation)\n   - [Video Generation](#video-generation)\n   - [On-device RAG](#on-device-rag)\n   - [Expert APIs](#expert-apis)\n3. [Building](#building)\n4. [Architecture](#architecture)\n5. [Technologies](#technologies)\n6. [Memory Metrics](#memory-metrics)\n7. [Notes](#notes)\n8. [Testing](#testing)\n\n---\n\n## Installation\n\n\u003e [!WARNING]\n\u003e For development, Linux is strongly recommended for GPU-enabled builds. The Vulkan shader-generation path used by Stable Diffusion is still unreliable on Windows cross-builds.\n\nClone the repository along with the `llama.cpp` and `stable-diffusion.cpp` submodules:\n\n```bash\ngit clone --depth=1 https://github.com/Aatricks/llmedge\ncd llmedge\ngit submodule update --init --recursive\n```\n\nOpen the project in Android Studio. 
If it does not build automatically, use ***Build \u003e Rebuild Project.***\n\n### Consume as a dependency\n\nFor Maven Central:\n\n```kotlin\nrepositories {\n    google()\n    mavenCentral()\n}\n\ndependencies {\n    implementation(\"io.github.aatricks:llmedge:0.3.9\")\n}\n```\n\nFor GitHub Packages:\n\n```kotlin\nrepositories {\n    google()\n    mavenCentral()\n    maven {\n        url = uri(\"https://maven.pkg.github.com/Aatricks/llmedge\")\n        credentials {\n            username = providers.gradleProperty(\"gpr.user\").orNull ?: System.getenv(\"GITHUB_ACTOR\")\n            password = providers.gradleProperty(\"gpr.key\").orNull ?: System.getenv(\"GITHUB_TOKEN\")\n        }\n    }\n}\n\ndependencies {\n    implementation(\"io.github.aatricks:llmedge:0.3.9\")\n}\n```\n\n## Usage\n\n### Quick Start\n\nThe recommended entry point is the instance-based `LLMEdge` facade. It exposes domain clients for text, speech, image generation, vision, and RAG while keeping model resolution and resource ownership explicit.\n\n```kotlin\nval edge = LLMEdge.create(\n    context = context,\n    scope = viewModelScope,\n)\n\nviewModelScope.launch {\n    val reply = edge.text.generate(\n        prompt = \"Summarize on-device LLMs in one sentence.\",\n    )\n    outputView.text = reply\n}\n```\n\nLow-level wrappers like `SmolLM`, `StableDiffusion`, `Whisper`, and `BarkTTS` remain available for expert workflows, but new code should prefer `LLMEdge`.\n\nThe intended acquisition path for application code is:\n\n- `edge.models.prefetch(...)` when you want explicit downloads\n- feature clients like `edge.text`, `edge.speech`, `edge.image`, and `edge.vision` when you want inference\n\nDirect `HuggingFaceHub` calls and expert runtime `loadFromHuggingFace(...)` helpers are still supported, but they are advanced APIs for callers that need artifact-level control.\n\nBy default, `edge.text.generate(...)` uses batched native decoding for lower JNI overhead, while\n`edge.text.stream(...)` 
uses smaller batched chunks so UI updates stay responsive without paying a\nJNI crossing per token.\n\n### Downloading Models\n\nllmedge can resolve and cache model weights independently of inference:\n\n```kotlin\nval edge = LLMEdge.create(context, viewModelScope)\n\nval modelFile = edge.models.prefetch(\n    ModelSpec.huggingFace(\n        repoId = \"unsloth/Qwen3-0.6B-GGUF\",\n        filename = \"Qwen3-0.6B-Q4_K_M.gguf\",\n    ),\n)\n\nLog.d(\"llmedge\", \"Cached ${modelFile.name} at ${modelFile.parent}\")\n```\n\n#### Key points:\n\n- `edge.models.prefetch(...)` and `BoundModelRepository.resolve(...)` keep model acquisition separate from any one inference client.\n\n- Supports progress callbacks and private repositories via token through `ModelSpec.huggingFace(...)`.\n\n- Requests to old mirrors automatically resolve to up-to-date Hugging Face repos.\n\n- Automatically uses the model's declared context window (minimum 1K tokens) and caps it to a heap-aware limit (2K–8K). Override with `InferenceParams(contextSize = …)` if needed.\n\n- Large downloads use Android's DownloadManager when `preferSystemDownloader = true` to keep transfers out of the Dalvik heap.\n\n- Direct `HuggingFaceHub` downloads remain available for expert workflows, but most app code should stay on the facade/model-repository path.\n\n### Reasoning Controls\n\nReasoning-aware models can be controlled from the facade through `TextModelOptions`. The default configuration keeps thinking enabled (`ThinkingMode.DEFAULT`, reasoning budget `-1`). 
To disable thinking for a request or session, pass the options explicitly:\n\n```kotlin\nval edge = LLMEdge.create(context, viewModelScope)\n\nval reply = edge.text.generate(\n    prompt = \"Solve this step by step, then give only the final answer.\",\n    options = TextModelOptions(\n        thinkingMode = SmolLM.ThinkingMode.DISABLED,\n        reasoningBudget = 0,\n    ),\n)\n```\n\nThe same options work with `edge.text.session(...)` and `edge.text.toolAgent(...)`.\n\nSetting the budget to `0` always disables thinking, while `-1` leaves it unrestricted. If you omit `reasoningBudget`, the library chooses `0` when the mode is `DISABLED` and `-1` otherwise. The API also injects the `/no_think` tag automatically when thinking is disabled, so you do not need to modify prompts manually. If you need to flip reasoning state on a live expert runtime without reloading, see [Expert APIs](#expert-apis).\n\n### Managed Chat Sessions\n\nUse `edge.text.session(...)` when you want bounded multi-turn chat without exposing native `storeChats` state to application code.\n\n```kotlin\nval edge = LLMEdge.create(context, viewModelScope)\n\nval session = edge.text.session(\n    memory = ConversationWindow(\n        maxTurns = 6,\n        maxTokens = 4096,\n        stripThinkTags = true,\n    ),\n    systemPrompt = \"You are a concise assistant.\",\n)\n\nviewModelScope.launch {\n    session.prepare()\n    val reply = session.reply(\"Explain why context windows fill up.\")\n    session.stream(\"Now summarize that in 3 bullets.\").collect { event -\u003e\n        when (event) {\n            is TextStreamEvent.Chunk -\u003e print(event.value)\n            is TextStreamEvent.Completed -\u003e println(event.fullText)\n            else -\u003e Unit\n        }\n    }\n}\n```\n\nThe new session API keeps transcript state in Kotlin, applies sliding-window trimming, and strips replayed `\u003cthink\u003e...\u003c/think\u003e` blocks by default so reasoning-heavy models do not exhaust the context 
window as quickly.\n\n### Tool Calling\n\nUse `edge.text.toolAgent(...)` when you want the model to call app-defined tools. Read-only tools execute automatically; action tools require an explicit policy decision.\n\n```kotlin\nval edge = LLMEdge.create(context, viewModelScope)\nval factory = DeviceToolFactory(context)\n\nval agent = edge.text.toolAgent(\n    tools = factory.createDefaultTools(),\n    systemPrompt = \"Be concise and only use tools when needed.\",\n    policy = ToolPolicies.ALLOW_ALL, // or keep the default to deny action tools\n)\n\nviewModelScope.launch {\n    val result = agent.reply(\"What time is it and how much battery is left?\")\n    println(result.text)\n\n    agent.stream(\"Open https://example.com\").collect { event -\u003e\n        when (event) {\n            is ToolAgentEvent.ToolCallRequested -\u003e println(\"Tool: ${event.call.tool}\")\n            is ToolAgentEvent.TextChunk -\u003e print(event.value)\n            is ToolAgentEvent.Completed -\u003e println(\"\\nDone: ${event.result.finishReason}\")\n            else -\u003e Unit\n        }\n    }\n}\n```\n\nTool calls use a structured JSON envelope internally: `{\"tool\":\"name\",\"arguments\":{...}}`. 
The parser also accepts the legacy `tool_name` field for robustness, but new prompts only emit the `tool` shape.\n\n### Speech Request Objects\n\nSpeech APIs now support request-first calls in addition to the existing convenience overloads:\n\n```kotlin\nval result = edge.speech.transcribe(\n    SpeechToTextRequest(\n        audioSamples = samples,\n        model = edge.config.models.speechToText,\n        params = Whisper.TranscribeParams(language = \"en\"),\n        runtime = WhisperRuntimeRequest(gpuEnabled = false, flashAttention = true),\n    ),\n)\n```\n\nThis keeps new speech entrypoints aligned with the request-first style already used by text and image generation, while preserving the older parameter-list overloads for compatibility.\n\n### Text Generation Performance Tuning\n\nThe text stack now separates prompt/batch processing from single-token generation so you can tune\nthe two phases independently:\n\n```kotlin\nval edge = LLMEdge.create(\n    context = context,\n    scope = viewModelScope,\n    config = LLMEdgeConfig(\n        text = TextRuntimeConfig(\n            promptThreads = 6,            // prompt/batch phase\n            generationThreads = 2,       // token-by-token phase\n            batchSize = 8,\n            streamBatchSize = 4,\n            cache = RuntimeCacheConfig(maxEntries = 2, maxMemoryMb = 1536),\n        ),\n    ),\n)\n\nval reply = edge.text.generate(\n    prompt = \"Explain speculative decoding.\",\n    options = TextModelOptions(numThreads = 8, generationThreads = 3),\n    batchSize = 12,\n)\n```\n\nPractical defaults:\n\n- `text.promptThreads`: prompt/batch decode threads\n- `text.generationThreads`: single-token generation threads\n- `text.batchSize`: blocking text batch size (default `8`)\n- `text.streamBatchSize`: streaming batch size (default `4`)\n- `text.cache.maxMemoryMb`: upper bound for text-model cache accounting; the cache now refreshes against\n  native model/state footprint instead of only the GGUF file 
size\n\nBatch-size guidance:\n\n- `1`: lowest latency per chunk, highest JNI overhead\n- `4`: good default for streaming UI updates\n- `8`: good default for blocking text responses\n- `12+`: better throughput for longer offline generations, but can delay intermediate updates\n\n### Image Text Extraction (OCR)\n\nllmedge uses Google ML Kit Text Recognition for extracting text from images.\n\n#### Quick Start\n\n```kotlin\nval edge = LLMEdge.create(context, viewModelScope)\nval text = edge.vision.extractText(bitmap)\nprintln(\"Extracted text: $text\")\n```\n\n#### OCR Engines\n\n**Google ML Kit Text Recognition**\n- Fast and lightweight\n- No additional data files needed\n- Good for Latin scripts\n- Add dependency: `implementation(\"com.google.mlkit:text-recognition:16.0.0\")`\n\nOCR is exposed directly through `edge.vision.extractText(...)`. The older `VisionMode` convenience\nwrapper is gone; callers now choose explicitly between OCR and VLM analysis instead of routing both\nthrough a second abstraction layer.\n\n### Vision Models\n\nAnalyze images using Vision Language Models (like LLaVA or Phi-3 Vision) via `edge.vision`.\n\n\u003e [!WARNING]\n\u003e The VLM path is experimental. It requires a vision-capable GGUF and a matching mmproj/projector file. When those components are unavailable or incompatible, `edge.vision.analyze(...)` now fails fast with a clear error instead of silently falling back to text-only prompting. 
OCR remains available through `edge.vision.extractText(...)`.\n\n```kotlin\nval edge = LLMEdge.create(context, viewModelScope)\n\nval description = edge.vision.analyze(\n    image = bitmap,\n    prompt = \"Describe this image in detail.\",\n    numThreads = 4,\n    generationThreads = 2,\n) { status -\u003e\n    Log.d(\"Vision\", \"Status: $status\")\n}\n```\n\nThe current high-level vision path creates a fresh `SmolLM` runtime per request, so it favors\nisolation and predictable cleanup over pooled high-throughput reuse.\n\nThe manager handles the complex pipeline of:\n1. Preprocessing the image\n2. Loading the vision projector and model\n3. Encoding the image to embeddings\n4. Generating the textual response\n\nVision model support is currently experimental and requires specific model architectures (like LLaVA-Phi-3).\n\n### Speech-to-Text (Whisper)\n\nTranscribe audio using the new `edge.speech` client:\n\n```kotlin\nval edge = LLMEdge.create(context, viewModelScope)\n\nval text = edge.speech.transcribeToText(audioSamples)\n\nval segments = edge.speech.transcribe(\n    audioSamples = audioSamples,\n    params = Whisper.TranscribeParams(language = \"en\"),\n)\nsegments.forEach { segment -\u003e\n    println(\"[${segment.startTimeMs}ms] ${segment.text}\")\n}\n\nval lang = edge.speech.detectLanguage(audioSamples)\n```\n\n#### Real-time Streaming Transcription\n\nFor live captioning, use the streaming transcription API with a sliding window approach:\n\n```kotlin\nval edge = LLMEdge.create(context, viewModelScope)\n\nval session = edge.speech.createStreamingSession(\n    params = Whisper.StreamingParams(\n        stepMs = 3000,\n        lengthMs = 10000,\n        keepMs = 200,\n        language = \"en\",\n        useVad = true,\n    ),\n)\n\nviewModelScope.launch {\n    session.events().collect { segment -\u003e\n        updateCaptions(segment.text)\n    }\n}\n\naudioRecorder.onAudioChunk { samples -\u003e\n    viewModelScope.launch { session.feedAudio(samples) 
}\n}\n\nsession.stop()\n```\n\n**Streaming parameters:**\n- `stepMs`: How often transcription runs (default: 3000ms). Lower = faster updates, higher CPU usage.\n- `lengthMs`: Audio window size (default: 10000ms). Longer windows improve accuracy.\n- `keepMs`: Overlap with previous window (default: 200ms). Helps maintain context.\n- `useVad`: Voice Activity Detection - skips silent audio (default: true).\n\nDirect `Whisper` access remains available for expert workflows, but the namespaced speech client is the standard integration path.\n\n**Recommended models:**\n- `ggml-tiny.bin` (~75MB) - Fast, lower accuracy\n- `ggml-base.bin` (~142MB) - Good balance\n- `ggml-small.bin` (~466MB) - Higher accuracy\n\n### Text-to-Speech (Bark)\n\nGenerate speech using `edge.speech`:\n\n```kotlin\nval edge = LLMEdge.create(context, viewModelScope)\n\nval audio = edge.speech.synthesize(\"Hello, world!\")\n\nviewModelScope.launch {\n    edge.speech.synthesizeStream(\"Hello, world!\").collect { event -\u003e\n        when (event) {\n            is AudioStreamEvent.Progress -\u003e Log.d(\"Bark\", \"${event.step.name}: ${event.percent}%\")\n            is AudioStreamEvent.Result -\u003e saveAudio(event.audio)\n            else -\u003e Unit\n        }\n    }\n}\n```\n\nDirect `BarkTTS` access remains available for expert workflows, but the namespaced speech client is the standard integration path.\n\n### Stable Diffusion (image generation)\n\nGenerate images on-device using the namespaced `edge.image` client:\n\n```kotlin\nval edge = LLMEdge.create(context, viewModelScope)\n\nval bitmap = edge.image.generate(\n    ImageGenerationRequest(\n        prompt = \"a cute pastel anime cat, soft colors, high quality \u003clora:detail_tweaker:1.0\u003e\",\n        width = 512,\n        height = 512,\n        steps = 20,\n        loraModelDir = \"/path/to/loras\",\n        loraApplyMode = StableDiffusion.LoraApplyMode.AUTO,\n    ),\n)\nimageView.setImageBitmap(bitmap)\n```\n\n**Key 
Optimizations:**\n- **EasyCache**: `edge.image` automatically enables EasyCache for supported Diffusion Transformer (DiT) models such as Flux, SD3, Wan, Qwen Image, and Z-Image; it stays disabled for classic UNet pipelines.\n- **Flash Attention**: Automatically enabled for compatible image dimensions.\n- **LoRA**: Apply fine-tuned weights on the fly without merging models.\n\nFor explicit runtime ownership or custom native-load experiments, the `StableDiffusion` class remains available in the expert API layer.\n\n### Video Generation\n\nGenerate short video clips using `edge.image.generateVideo(...)`. The namespaced client surfaces progress as a `Flow` while reusing the existing Wan loading logic internally.\n\n**Hardware Requirements**:\n- **12GB+ RAM** recommended for standard loading.\n- **8GB+ RAM** supported via `forceSequentialLoad = true` (slower but memory-safe).\n\n```kotlin\nval edge = LLMEdge.create(context, viewModelScope)\n\nval params = VideoGenerationRequest(\n    prompt = \"a cat walking in a garden, high quality\",\n    videoFrames = 8,\n    width = 512,\n    height = 512,\n    steps = 20,\n    cfgScale = 7.0f,\n    flowShift = 3.0f,\n    forceSequentialLoad = true,\n)\n\nviewModelScope.launch {\n    edge.image.generateVideo(params).collect { event -\u003e\n        when (event) {\n            is GenerationStreamEvent.Progress -\u003e Log.d(\"VideoGen\", event.update.message)\n            is GenerationStreamEvent.Completed -\u003e previewImageView.setImageBitmap(event.frames.first())\n        }\n    }\n}\n```\n\n`edge.image` automatically:\n1. Downloads the necessary Wan 2.1 model files (Diffusion, VAE, T5).\n2. Sequentially loads components to minimize peak memory usage (if requested).\n3. Manages the generation loop and frame conversion.\n\nSee `llmedge-examples` for a complete UI implementation.\n\n\nRunning the example app:\n1. Build the library (from the repo root):\n\n```bash\n./gradlew :llmedge:assembleRelease\n```\n\n2. 
Build and install the example app:\n\n```bash\ncd llmedge-examples\n../gradlew :app:assembleDebug\n../gradlew :app:installDebug\n```\n\n3. Open the app on device and pick the \"Stable Diffusion\" demo from the launcher. The demo downloads any missing files from Hugging Face and runs a quick txt2img generation.\n\nNotes:\n- The example explicitly downloads a VAE safetensors file for the `Meina/MeinaMix` demo; many repos include VAE files, but some GGUF model repos bundle everything you need. If the repo lacks a GGUF model file, you'll get an IllegalArgumentException; provide a `filename` or choose a different repo in that case.\n- Use the system downloader for large safetensors/gguf files to avoid heap pressure on Android.\n\n### On-device RAG\n\nThe library includes a minimal on-device RAG pipeline, similar to Android-Doc-QA, built with:\n- Sentence embeddings (ONNX)\n- Whitespace `TextSplitter`\n- In-memory cosine `VectorStore` with JSON persistence\n- `SmolLM` for context-aware responses through the facade-managed RAG session\n\n#### Setup\n\n1. Download embeddings\n\n   Download `model.onnx` and `tokenizer.json` from the Hugging Face repository `sentence-transformers/all-MiniLM-L6-v2` and place them at:\n\n```\nllmedge/src/main/assets/embeddings/all-minilm-l6-v2/model.onnx\nllmedge/src/main/assets/embeddings/all-minilm-l6-v2/tokenizer.json\n```\n\n2. Build the library\n\n```bash\n./gradlew :llmedge:assembleRelease\n```\n\n3. 
Use in your application\n\n```kotlin\nval edge = LLMEdge.create(this, lifecycleScope)\nval rag = edge.rag.createSession()\n\nlifecycleScope.launch {\n    rag.init()\n    val count = rag.indexPdf(pdfUri)\n    val answer = rag.ask(\"What are the key points?\")\n    // render answer\n}\n```\n\nDirect `RAGEngine` construction remains available for expert workflows, but new app code should prefer `edge.rag.createSession()` so runtime ownership and teardown stay aligned with the rest of the library.\n\n### Expert APIs\n\n`SmolLM`, `StableDiffusion`, `Whisper`, `BarkTTS`, `RAGEngine`, and direct `HuggingFaceHub` access are still available when you need to hold a native runtime directly or override low-level loading behavior. They are intentionally secondary to the facade APIs.\n\nExamples:\n\n```kotlin\n// Direct model download when you need full control over artifact selection.\nval download = HuggingFaceHub.ensureModelOnDisk(\n    context = context,\n    modelId = \"unsloth/Qwen3-0.6B-GGUF\",\n    filename = \"Qwen3-0.6B-Q4_K_M.gguf\",\n)\n\n// Expert text runtime with live reasoning-state control.\nval smol = SmolLM()\nsmol.load(download.file.absolutePath)\nsmol.setThinkingEnabled(false)\n\n// Expert RAG wiring when you want to own both the runtime and the pipeline yourself.\nval ragEngine = RAGEngine(context = context, smolLM = smol)\n```\n\n## Building\n\nBuilding GPU backends on Android\n--------------------------------\n\nIf you want GPU acceleration for the native inference backends, follow these notes and requirements. On Android, llmedge now prefers `OPENCL -\u003e VULKAN -\u003e CPU` when GPU use is allowed for text, Whisper, and image/video requests. OpenCL support is experimental, Android-only, and currently limited to `arm64-v8a`. Bark remains CPU-only.\n\nPrerequisites\n- Android NDK r27 or newer (NDK r27 used in development; the NDK provides the Vulkan C headers). 
Ensure your NDK matches the version used by your build environment.\n- CMake 3.22+ and Ninja (the Android Gradle plugin will pick up CMake when configured).\n- Gradle (use the wrapper: `./gradlew`).\n- Android API (minSdk) 30 or higher. `llmedge` targets Android 11+ today, and Vulkan support still requires Vulkan 1.2.\n- (Optional) `VULKAN_SDK` set in the environment if you build shaders or use Vulkan SDK tools on the host. The build fetches a matching `vulkan.hpp` header when needed.\n\n### Host Setup for Vulkan Builds (Ubuntu/WSL)\n\nTo build the library with Vulkan support on a Linux host or WSL2, you must install the Vulkan shader compiler and development headers:\n\n1. **Install Dependencies**:\n   ```bash\n   sudo apt-get update\n   sudo apt-get install -y glslc libvulkan-dev\n   ```\n\n2. **Verify glslc**:\n   Ensure `glslc` is in your PATH:\n   ```bash\n   glslc --version\n   ```\n\n3. **Android NDK**:\n   Ensure you have Android NDK **r27** (specifically `27.2.12479018`) installed via Android Studio or the SDK manager.\n\nBuild flags\n- On Linux/macOS hosts, the Gradle build enables Vulkan by default. On Windows hosts, it defaults to `OFF` because the upstream shader-generator step is still fragile under the Android cross-build toolchain. Re-enable it explicitly only when your environment supports that path.\n- Experimental Android OpenCL is disabled by default. Enable it with `-PllmedgeAndroidOpencl=ON` or the environment variable `LLMEDGE_ANDROID_OPENCL=ON`.\n- If you want both OpenCL and Vulkan compiled in explicitly, use:\n\n```bash\n./gradlew :llmedge:assembleRelease \\\n  -PllmedgeAndroidOpencl=ON \\\n  -Pandroid.injected.build.api=30 \\\n  -Pandroid.jniCmakeArgs=\"-DSD_VULKAN=ON -DGGML_VULKAN=ON\"\n```\n\nAlternatively, set the same flags in your Android Studio CMake configuration. 
`LLMEDGE_ANDROID_OPENCL` is the library's experimental OpenCL toggle, while `-DSD_VULKAN=ON` and `-DGGML_VULKAN=ON` force Vulkan support for Stable Diffusion and ggml.\n\nNotes about headers and toolchain\n- The build fetches `Vulkan-Hpp` (`vulkan.hpp`) and pins it to the NDK's Vulkan headers to avoid API mismatch. If you have a local `VULKAN_SDK` you can point to it, otherwise the project will use the fetched headers.\n- When OpenCL is enabled, the build uses repo-managed OpenCL headers and a link-time loader shim. The packaged app still resolves the device's OpenCL implementation at runtime rather than shipping its own platform ICD.\n- The repository also builds a small host toolchain to generate SPIR-V shaders at build time; ensure your build host has a working C++ toolchain (clang/gcc) and CMake configured.\n\nRuntime verification\n- To verify GPU capability at runtime:\n    - Run the app on an Android 11+ device.\n    - Use the per-subsystem capability APIs to inspect the engines you care about, for example `LLMEdge.getTextBackendAvailability()`, `LLMEdge.getSpeechBackendAvailability()`, `LLMEdge.getImageBackendAvailability()`, and `LLMEdge.getVisionBackendAvailability()`.\n    - Inspect runtime logs for the selected backend and any fallback reason. Example:\n\n```bash\nadb logcat -s SmolSD:* | sed -n '1,200p'\n```\n\n    Look for messages indicating OpenCL or Vulkan initialization. `LLMEdgeConfig(text = TextRuntimeConfig(useVulkan = true))` means \"allow a supported GPU backend\", not \"force Vulkan\".\n\nTroubleshooting\n- If you see \"Vulkan 1.2 required\" or linker errors for Vulkan symbols, confirm `minSdk` is set to 30 or higher in `llmedge/build.gradle.kts` and that your NDK provides the expected Vulkan headers.\n- If experimental OpenCL is not available, or if a GPU backend fails to initialize or execute, llmedge falls back to Vulkan or CPU automatically. 
For text, Whisper, and image/video, a failing backend is blacklisted per subsystem for the rest of the process and the next backend is retried once.\n- If your device lacks both usable OpenCL and Vulkan support, the native code falls back to the CPU backend.\n\n#### On-device RAG notes:\n\n- Uses `com.tom-roush:pdfbox-android` for PDF parsing.\n- Embeddings library: `io.gitlab.shubham0204:sentence-embeddings:v6`.\n- Scanned PDFs require OCR (e.g., ML Kit or Tesseract) before indexing.\n- ONNX `token_type_ids` errors are automatically handled; override via `EmbeddingConfig` if required.\n\n## Architecture\n\nThe Kotlin side is now organized around a few explicit layers instead of one eager facade:\n\n1. `LLMEdge` is a thin convenience shell that lazy-creates domain clients (`text`, `speech`, `image`, `vision`, `rag`) on first access.\n2. `ModelRepository` owns model acquisition and validation for local files and Hugging Face downloads.\n3. `RuntimePool` and `RuntimeCoordinator` provide shared runtime caching, backend selection, and failure blacklisting.\n4. `RuntimePoolProfile` lets each domain describe cache sizing, keying, loading, and backend policy without duplicating pool boilerplate.\n5. `TextClient`, `SpeechClient`, `ImageClient`, `VisionClient`, and `RAGClient` remain independently constructible for advanced use, but `LLMEdge` is the canonical public entrypoint.\n6. `ConversationSessionSupport` centralizes transcript state and runtime access for chat sessions and tool agents.\n7. `VisionInputPreparer` and `VisionRuntimeExecutor` split image preprocessing/embedding from generation execution.\n8. `RAGIndexer`, `RAGRetriever`, and `RAGAnswerer` separate document ingestion, retrieval, and answer generation.\n9. 
Native libraries remain in the same Android module, but native loading is now explicit and overridable for JVM tests instead of relying on static side effects.\n\nOn the native side, the project still builds llama.cpp, stable-diffusion.cpp, whisper.cpp, bark.cpp, and the JNI bridge sources through the Android NDK.\n\n## Technologies\n\n- [llama.cpp](https://github.com/ggml-org/llama.cpp) — Core LLM backend\n- [stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp) — Image/video generation backend\n- [whisper.cpp](https://github.com/ggerganov/whisper.cpp) — Speech-to-text backend\n- [bark.cpp](https://github.com/PABannier/bark.cpp) — Text-to-speech backend\n- GGUF / GGML — Model formats\n- Android NDK / JNI — Native bindings\n- ONNX Runtime — Sentence embeddings\n- Android DownloadManager — Large file downloads\n\n## Memory Metrics\n\nYou can measure RAM usage at runtime:\n\n```kotlin\nval snapshot = MemoryMetrics.snapshot(context)\nLog.d(\"Memory\", snapshot.toPretty(context))\n```\n\nTypical measurement points:\n\n- Before model load\n- After model load\n- After blocking prompt\n- After streaming prompt\n\n#### Key fields:\n\n- `totalPssKb`: Total proportional RAM usage. Best for overall tracking.\n- `dalvikPssKb`: JVM-managed heap and runtime.\n- `nativePssKb`: Native heap (llama.cpp, ONNX, tensors, KV cache).\n- `otherPssKb`: Miscellaneous memory.\n\nMonitor `nativePssKb` closely during model loading and inference to understand LLM memory footprint.\nExpert runtimes such as `SmolLM` also expose native/state-specific memory estimates when you need lower-level instrumentation.\n\n## Notes\n\n- `VULKAN_SDK` may still be required when you are building the Vulkan path on the host.\n- Check Android GPU capability with the explicit per-subsystem helpers such as `LLMEdge.getTextBackendAvailability()` and `LLMEdge.getImageBackendAvailability()`.\n\n### ProGuard/R8 Configuration\n\nThe library includes consumer ProGuard rules. 
If you need to add custom rules:\n\n```proguard\n# Keep OCR engines\n-keep class io.aatricks.llmedge.vision.** { *; }\n-keep class org.bytedeco.** { *; }\n-keep class com.google.mlkit.** { *; }\n\n# Suppress warnings for optional dependencies\n-dontwarn org.bytedeco.**\n-dontwarn com.google.mlkit.**\n```\n\n### Licenses\n\n- **llmedge**: Apache 2.0\n- **llama.cpp**: MIT\n- **stable-diffusion.cpp**: MIT\n- **whisper.cpp**: MIT\n- **bark.cpp**: MIT\n- **Leptonica**: Custom (BSD-like)\n- **Google ML Kit**: Proprietary (see ML Kit terms)\n- **JavaCPP**: Apache 2.0\n\n## License and Credits\n\nThis project builds upon work by [Shubham Panchal](https://github.com/shubham0204), [ggerganov](https://github.com/ggerganov), and [PABannier](https://github.com/PABannier).\nSee [CREDITS.md](CREDITS.md) for full details.\n\n## Testing\n\nLooking to run unit and instrumentation tests locally, including optional native txt2img E2E checks? See the step-by-step guide in [docs/testing.md](docs/testing.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faatricks%2Fllmedge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faatricks%2Fllmedge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faatricks%2Fllmedge/lists"}