https://github.com/aatricks/llmedge-examples

Examples using the llmedge library
https://github.com/aatricks/llmedge-examples

android app-example gguf kotlin llamacpp llm-inference

Last synced: 3 months ago
JSON representation

Examples using the llmedge library

Host: GitHub
URL: https://github.com/aatricks/llmedge-examples
Owner: Aatricks
License: gpl-3.0
Created: 2025-09-18T15:04:12.000Z (10 months ago)
Default Branch: main
Last Pushed: 2025-09-24T17:13:51.000Z (10 months ago)
Last Synced: 2025-09-24T19:18:32.236Z (10 months ago)
Topics: android, app-example, gguf, kotlin, llamacpp, llm-inference
Language: Kotlin
Homepage:
Size: 113 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# llmedge Examples

Comprehensive demonstration applications for the llmedge Android library, showcasing on-device language model inference, RAG pipelines, image generation, and video synthesis capabilities.

**Main Library Repository**: https://github.com/Aatricks/llmedge

## Overview

This example application provides production-ready demonstrations of llmedge's core features. Each activity is designed to illustrate best practices for model loading, memory management, and efficient on-device inference.

## Included Demonstrations

### Language Model Inference

**Local Asset Demo** (`LocalAssetDemoActivity.kt`)
- Demonstrates loading GGUF models bundled within the APK
- Illustrates asset extraction to app-private storage
- Shows both blocking and streaming inference patterns
- Suitable for offline-first applications

**Jinja Chat Template Demo** (`JinjaTemplateDemoActivity.kt`)
- Demonstrates passing an explicit loop-based Jinja chat template through `SmolLM.InferenceParams.chatTemplate`
- Downloads a GGUF model from Hugging Face through `SmolLM.loadFromHuggingFace(...)`
- Shows the exact template string used for the request so the override path is visible in-app

**Hugging Face Demo** (`HuggingFaceDemoActivity.kt`)
- Automated model download from Hugging Face Hub
- Progress monitoring and cache management
- Demonstrates proper error handling for network operations
- Shows model reuse across application sessions

### Retrieval-Augmented Generation

**RAG Demo** (`RagActivity.kt`)
- Complete on-device RAG pipeline implementation
- Document indexing with ONNX embeddings
- Vector similarity search and context retrieval
- Integration with SmolLM for answer generation
- Demonstrates PDF parsing and text chunking strategies

### Vision and Multimodal Processing

**Image Text Extraction** (`ImageToTextActivity.kt`)
- Google ML Kit OCR integration
- Batch image processing capabilities
- Error handling for unsupported image formats
- Demonstrates preprocessing for vision models

**Vision Model Demo** (`LlavaVisionActivity.kt`)
- Vision-capable language model integration
- Image-to-text description generation
- Multimodal input preparation
- Demonstrates vision model inference patterns

### Generative Media

**Image Generation** (`StableDiffusionActivity.kt`)
- Text-to-image synthesis using Stable Diffusion
- LoRA Support: Toggle switch to apply Detail Tweaker LoRA, automatically downloaded from Hugging Face
- EasyCache: Auto-enabled acceleration for supported DiT models (Flux, SD3, Wan, Qwen Image, Z-Image)
- Memory-aware configuration options
- Progressive generation with cancellation support
- Demonstrates VAE loading and tensor offloading strategies

**Video Generation** (`VideoGenerationActivity.kt`)
- Text-to-video synthesis using Wan models
- Multi-file model loading (main + VAE + T5XXL)
- Device capability detection (12GB+ RAM required)
- Frame-by-frame progress monitoring
- Demonstrates proper resource cleanup

### Speech Processing

**Speech-to-Text (STT)** (`STTActivity.kt`)
- Whisper model download from Hugging Face
- Audio recording and transcription
- Real-time streaming transcription support
- Timestamp and SRT generation

**Text-to-Speech (TTS)** (`TTSActivity.kt`)
- Bark model download from Hugging Face via `LLMEdge`
- Text input for speech synthesis
- Progress tracking during generation
- Audio playback and WAV file saving
- ARM-optimized native inference with OpenMP

## System Requirements

### Minimum Requirements
- Android 11+ (API 30)
- 3GB RAM for basic LLM inference
- 500MB free storage for model caching
- 1GB+ free storage for speech models

### Recommended Configuration
- Android 11+ (API 30) with GPU backends enabled
- 8GB RAM for Stable Diffusion
- 12GB+ RAM for video generation (Wan models)
- 5GB free storage for video model pipeline

### Speech Model Requirements
- **Whisper STT**: 75MB-500MB depending on model size (tiny to small)
- **Bark TTS**: 843MB for f16 models

### Development Environment
- Android SDK with NDK r27+
- CMake 3.22+
- Java 17+
- Gradle 8.0+ (wrapper included)

## Building the Application

### Standard Build Process

From the repository root directory:

1. Build the llmedge library:
```bash
./gradlew :llmedge:assembleRelease
```

2. Build the example application:
```bash
cd llmedge-examples
./gradlew :app:assembleDebug
```

3. Install to device:
```bash
./gradlew :app:installDebug
```

### GPU-Enabled Build

For Android GPU builds with OpenCL-first, Vulkan-fallback runtime selection:

```bash
./gradlew :llmedge:assembleRelease \
-PllmedgeAndroidOpencl=ON \
-Pandroid.jniCmakeArgs="-DGGML_VULKAN=ON -DSD_VULKAN=ON"

cd llmedge-examples
./gradlew :app:assembleDebug :app:installDebug
```

**Notes**:
- Experimental OpenCL support is Android-only and currently limited to `arm64-v8a`.
- At runtime, `llmedge` prefers OpenCL first, then Vulkan, then CPU for text, Whisper, and image/video.
- Bark remains CPU-only.

## Asset Configuration

### Bundled GGUF Models

Place small GGUF models in `app/src/main/assets/` for offline-first demos:

```
app/src/main/assets/
└── models/
└── smolm2-360M-instruct.gguf
```

Recommended models for bundling:
- SmolLM2-360M-Instruct (~200MB)
- Qwen2-0.5B-Instruct (~300MB)
- TinyLlama-1.1B (~600MB)

### RAG Embeddings

The RAG demo requires ONNX embedding models:

```
app/src/main/assets/
└── embeddings/
└── all-minilm-l6-v2/
├── model.onnx
└── tokenizer.json
```

Download from: `sentence-transformers/all-MiniLM-L6-v2` on Hugging Face

### Runtime Model Cache

Models downloaded via Hugging Face are cached at:
```
/files/hf-models///
```

Cache persists across app restarts and is reused automatically.

## Usage Examples

### Basic LLM Inference

```kotlin
val edge = LLMEdge.create(context, lifecycleScope)

CoroutineScope(Dispatchers.IO).launch {
val response = edge.text.generate(
prompt = "Explain quantum computing concisely.",
model = ModelSpec.huggingFace(
repoId = "unsloth/Qwen3-0.6B-GGUF",
filename = "Qwen3-0.6B-Q4_K_M.gguf",
),
)

withContext(Dispatchers.Main) {
textView.text = response
}
}
```

### RAG Pipeline

```kotlin
val edge = LLMEdge.create(context, lifecycleScope)
val rag = edge.rag.createSession()
rag.init()

CoroutineScope(Dispatchers.IO).launch {
val chunks = rag.indexPdf(pdfUri)
val answer = rag.ask("What are the main conclusions?")

withContext(Dispatchers.Main) {
resultView.text = answer
}
}
```

### Speech-to-Text (Whisper)

```kotlin
val edge = LLMEdge.create(context, lifecycleScope)

CoroutineScope(Dispatchers.IO).launch {
// Simple transcription
val text = edge.speech.transcribeToText(audioSamples)

// Full transcription with timing
val segments = edge.speech.transcribe(
audioSamples = audioSamples,
params = Whisper.TranscribeParams(language = "en"),
)

withContext(Dispatchers.Main) {
segments.forEach { segment ->
textView.append("[${segment.startTimeMs}ms] ${segment.text}\n")
}
}
}
```

### Real-time Streaming Transcription

For live captioning from a microphone:

```kotlin
class LiveCaptionActivity : AppCompatActivity() {
private var transcriber: StreamingTranscriptionSession? = null

fun startLiveCaptions() {
lifecycleScope.launch(Dispatchers.IO) {
// Create streaming transcriber with sliding window
transcriber = LLMEdge.create(this@LiveCaptionActivity, lifecycleScope).speech.createStreamingSession(
params = Whisper.StreamingParams(
stepMs = 3000, // Process every 3 seconds
lengthMs = 10000, // 10-second windows
language = "en",
useVad = true // Skip silent audio
)
)

// Collect transcription results
transcriber?.events()?.collect { segment ->
withContext(Dispatchers.Main) {
captionTextView.text = segment.text
}
}
}
}

// Feed audio from microphone (called by AudioRecord callback)
fun onAudioData(samples: FloatArray) {
lifecycleScope.launch(Dispatchers.IO) {
transcriber?.feedAudio(samples)
}
}

fun stopLiveCaptions() {
transcriber?.stop()
}
}
```

### Text-to-Speech (Bark)

```kotlin
val edge = LLMEdge.create(context, lifecycleScope)

CoroutineScope(Dispatchers.IO).launch {
// Generate speech (model auto-downloads on first use)
val audio = edge.speech.synthesize("Hello, world!")
audioPlayer.play(audio.samples, audio.sampleRate)
}
```

### Image Generation

```kotlin
val edge = LLMEdge.create(this, lifecycleScope)

val bitmap = edge.image.generate(
ImageGenerationRequest(
prompt = "serene mountain landscape, sunset",
width = 512,
height = 512,
steps = 20
),
)

imageView.setImageBitmap(bitmap)
```

### Video Generation

```kotlin
val edge = LLMEdge.create(this, lifecycleScope)

// Automatic memory management and sequential loading
edge.image.generateVideo(
VideoGenerationRequest(
prompt = "cat walking through garden",
videoFrames = 8,
width = 512,
height = 512,
steps = 20,
cfgScale = 7.0f,
flowShift = 3.0f,
forceSequentialLoad = true // Safe for most devices
)
).collect { event ->
Log.d("VideoGen", event.toString())
}
```

## Performance Optimization

### Memory Management

**Monitor Memory Usage**:
```kotlin
val snapshot = MemoryMetrics.snapshot(context)
Log.d("Memory", "Native heap: ${snapshot.nativePssKb / 1024}MB")
```

**Optimization Strategies**:
- Use quantized models (Q4_K_M) for lower memory footprint
- Enable CPU offloading for large models
- Close model instances when not in use
- Process images/video in batches with intermediate cleanup

### Thread Configuration

```kotlin
val edge = LLMEdge.create(
context = context,
scope = lifecycleScope,
config = LLMEdgeConfig(
text = TextRuntimeConfig(
promptThreads = Runtime.getRuntime().availableProcessors(),
contextSize = 2048,
),
),
)
```

### GPU Backends

Verify Android GPU capability:
```kotlin
val textBackends = LLMEdge.getTextBackendAvailability()
val imageBackends = LLMEdge.getImageBackendAvailability()

Log.i("Performance", "Text backends: $textBackends")
Log.i("Performance", "Image backends: $imageBackends")
```

Check logcat for initialization:
```bash
adb logcat -s SmolLM:* SmolSD:* | grep -Ei "opencl|vulkan|backend"
```

## Troubleshooting

### Model Loading Failures

**Symptoms**: `FileNotFoundException`, `IllegalStateException` during load

**Solutions**:
- Verify model file exists in expected location
- Check available storage space
- Ensure network connectivity for Hugging Face downloads
- Validate model file integrity (not corrupted)

### Out of Memory Errors

**Symptoms**: App crashes with OOM during inference or generation

**Solutions**:
- Use smaller models or quantized variants
- Reduce image/video resolution
- Enable CPU offloading: `offloadToCpu = true`
- Lower context window size
- Close unused model instances

### Slow Inference Performance

**Symptoms**: Generation takes excessive time per token/frame

**Solutions**:
- Use quantized models (Q4_K_M, Q3_K_S)
- Reduce inference steps (15-20 is usually sufficient)
- Enable Android GPU backends on compatible devices
- Adjust thread count to match device cores
- Use smaller resolutions for media generation

### Video Generation Failures

**Symptoms**: Crashes or errors when loading Wan models

**Solutions**:
- Verify device has 12GB+ RAM
- Ensure all three files downloaded (main + VAE + T5XXL)
- Use explicit file paths (not modelId shorthand)
- Check stable-diffusion.cpp logs in logcat
- Verify sufficient storage for 6GB+ model files

### Native Library Issues

**Symptoms**: `UnsatisfiedLinkError`, native crashes

**Solutions**:
- Rebuild AAR and reinstall app
- Verify NDK version matches (r27+)
- Check device ABI compatibility
- Inspect logcat for native stack traces
- Clean build: `./gradlew clean`

### Speech Processing Issues

**Symptoms**: Whisper transcription crashing or producing garbled output

**Solutions**:
- Ensure audio is 16kHz mono PCM float32 format
- Use smaller models (tiny/base) for faster processing
- Check that model file downloaded completely

## Testing Infrastructure

### Speech E2E Testing

Run speech tests via adb:
```bash
adb shell am instrument -w -e class com.example.llmedgeexample.SpeechE2ETest \
com.example.llmedgeexample.test/androidx.test.runner.AndroidJUnitRunner
```

### Headless E2E Testing

Run automated video generation tests:

```bash
adb shell am start -n com.example.llmedgeexample/.HeadlessVideoTestActivity
```

Monitor test execution:
```bash
adb logcat -s VideoE2E:*
```

Test results are logged to logcat with detailed timing and validation metrics.

## Architecture Notes

### Memory Architecture
- Native models allocated via JNI in native heap
- Dalvik heap used only for Java objects and bitmaps
- Large file downloads use system DownloadManager
- Tensor operations execute in native memory space

### Threading Model
- All model operations run on background threads (Dispatchers.IO)
- UI updates dispatched to Main thread
- Blocking calls avoided on UI thread
- Coroutines used for structured concurrency

### Resource Lifecycle
- Models implement `AutoCloseable` for automatic cleanup
- Native resources freed via `close()` method
- File handles managed with try-with-resources pattern
- Memory mapped files used for large model loading

## License

Apache 2.0 - See LICENSE file for details

## Contributing

Contributions are welcome. Please review the main repository's contributing guidelines before submitting pull requests.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/aatricks/llmedge-examples

Awesome Lists containing this project

README