https://github.com/aget-framework/template-document-processor-aget
Production-ready template for creating document processing agents with LLM pipelines, security protocols, and multi-provider support
https://github.com/aget-framework/template-document-processor-aget
aget-framework document-processing llm python template
Last synced: 27 days ago
JSON representation
Production-ready template for creating document processing agents with LLM pipelines, security protocols, and multi-provider support
- Host: GitHub
- URL: https://github.com/aget-framework/template-document-processor-aget
- Owner: aget-framework
- License: apache-2.0
- Created: 2025-10-26T21:28:58.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2026-04-26T05:20:13.000Z (about 1 month ago)
- Last Synced: 2026-04-26T07:18:18.656Z (about 1 month ago)
- Topics: aget-framework, document-processing, llm, python, template
- Language: Python
- Size: 407 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Agents: AGENTS.md
Awesome Lists containing this project
README
# template-document-processor-AGET
> **Status**: 🟡 DORMANT (stable template - v2.8.0 released 2025-11-03)
>
> Production-ready template. Activated when new document processing agents are needed.
**Template Version**: 2.8.0
**AGET Framework**: v2.7.0
**Type**: AGET Template (specialized from worker template)
**Domain**: Document Processing
A production-ready template for creating document processing agents with LLM pipelines, security protocols, format preservation, and multi-provider support.
**Note**: This template is based on `template-worker-aget` v2.7.0 with specialized document processing capabilities. Template version (v2.8.0) tracks template-specific features independently from AGET framework version (v2.7.0).
## Overview
This template provides a complete foundation for agents that:
- ✅ Process documents using LLM assistance (OpenAI, Anthropic, Google)
- ✅ Preserve DOCX formatting (Track Changes, comments, annotations)
- ✅ Prevent catastrophic format loss (L245-type failures)
- ✅ Support batch operations with validation pipelines
- ✅ Implement security protocols (injection prevention, content filtering)
- ✅ Provide caching, metrics, and observability
- ✅ Enable task decomposition for large documents
- ✅ Support rollback and version management
## Based on L208
This template is based on **L208: Document Processing Agent Template Pattern**, which analyzed production document processing agents to extract common patterns and best practices.
**Time Savings**: 60-70% reduction in new agent setup (3-5 hours → 1-2 hours)
## Quick Start
### 1. Clone Template
```bash
gh repo clone aget-framework/template-document-processor-AGET my-document-agent
cd my-document-agent
```
### 2. Customize Configuration
Update these files:
**`.aget/version.json`**:
```json
{
"agent_name": "my-document-agent",
"domain": "your-domain"
}
```
**`configs/validation_rules.yaml`**:
```yaml
max_file_size_mb: 10
allowed_extensions: [".pdf", ".docx", ".txt", ".md"]
required_validations:
- file_size
- file_format
- content_safety
```
**`configs/llm_providers.yaml`**:
```yaml
providers:
openai:
api_key_env: OPENAI_API_KEY
enabled: true
anthropic:
api_key_env: ANTHROPIC_API_KEY
enabled: false
budget:
monthly_limit_usd: 300.0
```
### 3. Update Mission
Edit `AGENTS.md` to describe your agent's specific purpose and domain.
### 4. Run Tests
```bash
python3 -m pytest tests/ -v
```
### 5. Deploy
```bash
git remote set-url origin
git add .
git commit -m "feat: Initialize from template-document-processor-AGET v2.7.0"
git push -u origin main
```
## Architecture
### Core Components
- **`src/ingestion/`** - Queue management, validation, batch processing
- **`src/processing/`** - LLM providers, model routing, caching, schema validation
- **`src/output/`** - Publishing, version management, rollback
- **`src/security/`** - Input sanitization, content filtering, resource limiting
- **`src/pipeline/`** - Task decomposition, orchestration, metrics
- **`src/wikitext/`** - Document format support (docx, wikitext, extensible for additional formats)
### Configuration Files (9 YAMLs)
- **`configs/validation_rules.yaml`** - Document validation criteria
- **`configs/llm_providers.yaml`** - LLM provider configuration
- **`configs/model_routing.yaml`** - Model selection strategy
- **`configs/models.yaml`** - Model definitions and capabilities
- **`configs/security_policy.yaml`** - Security and content filtering
- **`configs/processing_limits.yaml`** - Resource limits (tokens, time, cost)
- **`configs/caching.yaml`** - Cache settings and TTL
- **`configs/metrics.yaml`** - Metrics collection and alerts
- **`configs/orchestration.yaml`** - Task decomposition and pipeline
## Customization Points
### 1. Document Validation
Customize `configs/validation_rules.yaml` for your document format:
- File extensions
- Size limits
- Format-specific validation rules
### 2. LLM Providers
Set up providers in `configs/llm_providers.yaml`:
- API keys (use environment variables)
- Model selection (cost vs quality)
- Fallback chain
- Budget limits
### 3. Security
Configure security in `configs/security_policy.yaml`:
- Input sanitization rules
- Content filtering (PII detection)
- Resource limits (tokens, time, cost)
### 4. Metrics
Define metrics in `configs/metrics.yaml`:
- Accuracy tracking
- Latency monitoring (p50/p95/p99)
- Cost tracking
- Alert thresholds
## Protocols
The template includes 10 operational protocols in `.aget/docs/protocols/`:
1. **Format Preservation** - Prevent L245-type catastrophic failures (Track Changes loss)
2. **Queue Management** - Managing document queues
3. **Processing Authorization** - Approval gates and STOP protocol
4. **Validation Pipeline** - Pre/post validation
5. **Rollback** - Version management and recovery
6. **Security Validation** - Input/output sanitization
7. **Task Decomposition** - Breaking large documents into subtasks
8. **Model Routing** - Selecting optimal LLM for each task
9. **Caching** - LLM response caching for cost/speed
10. **Metrics Collection** - Tracking accuracy/latency/cost
Each protocol includes bash commands and code examples.
**Format Preservation Guide**: See `.aget/docs/FORMAT_PRESERVATION_GUIDE.md` for complete usage guide.
## Scripts
The template provides 17 operational tools (15 scripts + 2 helper tools):
**Session Management**:
- `session_protocol.py` - Wake up/wind down/sign off
- `queue_status.py` - Queue management CLI
- `health_check.py` - System diagnostics
**Core Operations** (`scripts/`):
- `validate.py` - Document validation CLI
- `process.py` - End-to-end processing pipeline
- `queue_status.py` - Queue status and management
- `rollback.py` - Version rollback operations
- `cache_setup.py` - Cache initialization
- `metrics.py` - Metrics display and export
**Supporting Operations** (`scripts/`):
- `health_check.py` - System health diagnostics
- `security_check.py` - Security validation
- `audit.py` - Audit trail viewer
- `model_router.py` - Model routing recommendations
- `cache_stats.py` - Cache statistics
- `cache_clear.py` - Cache clearing
**Specialized Tools** (`scripts/` and `.aget/tools/`):
- `session_protocol.py` - Session lifecycle (wake/wind-down/sign-off)
- `checkpoint.py` - Checkpoint save/load/list
- `task_planner.py` - Task decomposition planning
- `.aget/tools/analyze_agent_fit.py` - Use case fit analysis
- `.aget/tools/instantiate_template.py` - Template instantiation helper
## Testing
Template includes 30 contract tests (100% passing):
```bash
# Run all tests
python3 -m pytest tests/ -v
# Run specific test category
python3 -m pytest tests/test_processing.py -v
```
Test coverage:
- **Smoke tests** (20 tests): All 20 modules tested
- **Integration tests** (10 tests): End-to-end workflows, script integration, contract validation
```bash
# Run smoke tests only
python3 tests/smoke_test.py
# Run integration tests only
python3 tests/test_integration.py
```
## Helper Tools
Two helper tools are provided for template users:
**Analyze Agent Fit**:
```bash
# Check if use case fits this template
python3 .aget/tools/analyze_agent_fit.py "Process legal contracts and extract structured data"
# Interactive mode
python3 .aget/tools/analyze_agent_fit.py --interactive
```
**Instantiate Template**:
```bash
# Create new agent from template
python3 .aget/tools/instantiate_template.py invoice-processor ~/github/invoice-processor-AGET
# Verify instantiation
python3 .aget/tools/instantiate_template.py --check ~/github/invoice-processor-AGET
```
## Version History
### Template Versioning
This template uses dual versioning:
- **Template Version** (v2.8.0): Template-specific features and enhancements
- **AGET Framework** (v2.7.0): Framework compliance version
**Template v2.8.0** tracks format preservation capabilities added to this specialized template.
**AGET v2.7.0** indicates compliance with AGET framework standards and base worker template.
---
**Template v2.8.0** (2025-11-02) - AGET Framework v2.7.0
- **Format Preservation Framework**: Prevent L245-type catastrophic failures
- **New capabilities**:
- OOXML verification for Track Changes, comments, annotations
- Round-trip validation (before → process → after)
- Multi-stage checkpoint system for pipeline verification
- L245 failure detection (100% format loss prevention)
- **Documentation**:
- FORMAT_PRESERVING_DECISION_PROTOCOL.md (5-question architecture checklist)
- FORMAT_PRESERVATION_GUIDE.md (implementation guide)
- **Test coverage**: 17 tests, 38% coverage (critical paths validated)
- **API**: 5-module framework with simple and advanced usage patterns
**Template v2.7.0** (2025-10-26) - AGET Framework v2.7.0
- Initial template release
- Based on L208 document processing pattern analysis
- **20 source modules** (Gate 2A: 8, Gate 2B: 7, Gate 2C: 5)
- **9 configuration files** (YAML-based)
- **9 operational protocols** (tested and validated)
- **3 formal specifications** (168 EARS requirements)
- **17 operational tools** (15 scripts + 2 helper tools)
- **30 contract tests** (20 smoke + 10 integration, 100% passing)
- Multi-provider LLM support (OpenAI, Anthropic, Google)
- Security protocols (input sanitization, content filtering, resource limits)
- Document format support (docx, wikitext, extensible to additional formats)
## Support
For template issues or questions:
- Issue tracker: https://github.com/aget-framework/aget/issues
- Template repository: https://github.com/aget-framework/template-document-processor-AGET
- AGET framework: https://github.com/aget-framework
## License
Licensed under the Apache License, Version 2.0. See [LICENSE](LICENSE) file for details.
Copyright 2025 AGET Framework Contributors
---
*Generated by AGET v2.7.0*