https://github.com/legionio/lex-knowledge
- Host: GitHub
- URL: https://github.com/legionio/lex-knowledge
- Owner: LegionIO
- Created: 2026-03-24T19:13:13.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-04-15T15:04:56.000Z (2 months ago)
- Last Synced: 2026-04-15T17:09:01.387Z (2 months ago)
- Language: Ruby
- Size: 135 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
# lex-knowledge
Document corpus ingestion and knowledge query pipeline for LegionIO.
`lex-knowledge` walks a directory of documents, parses them into sections, splits sections into token-aware chunks, and writes each chunk to Apollo as a searchable knowledge entry. A query runner retrieves relevant chunks via semantic search and optionally synthesizes an answer through the LLM pipeline.
## Phase A: Corpus Ingestion
This gem implements Phase A of the knowledge pipeline:
- **Manifest**: file walker with SHA256 fingerprinting and incremental diff support
- **Parser**: section-aware extraction for Markdown and plain text
- **Chunker**: paragraph-respecting splits with configurable token budget and overlap
- **Ingest runners**: full corpus or single-file ingestion, writing chunks to Apollo
- **Query runners**: retrieval-only or retrieval + LLM synthesis
Parsing of `.docx` and `.pdf` files is deferred to a later phase.
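The Manifest step listed above (SHA256 fingerprinting plus an incremental diff) can be illustrated with a minimal sketch. This is not the gem's actual Manifest API: `build_manifest`, `diff_manifests`, and the `extensions` parameter are hypothetical names chosen for illustration, and only Ruby's standard `digest` and `find` libraries are used.

```ruby
require 'digest'
require 'find'

# Hypothetical helper, not part of lex-knowledge.
# Walks `root` and maps each relative file path to its SHA256 hex digest.
def build_manifest(root, extensions: %w[.md .txt])
  root = File.expand_path(root)
  manifest = {}
  Find.find(root) do |path|
    next unless File.file?(path) && extensions.include?(File.extname(path).downcase)

    manifest[path.delete_prefix("#{root}/")] = Digest::SHA256.file(path).hexdigest
  end
  manifest
end

# Hypothetical helper: compares two manifests and reports what changed,
# so that only new or modified files need to be re-ingested.
def diff_manifests(previous, current)
  {
    added: current.keys - previous.keys,
    removed: previous.keys - current.keys,
    changed: current.keys.select { |k| previous.key?(k) && previous[k] != current[k] }
  }
end
```

Persisting the manifest between runs and diffing it against a fresh walk is one plausible way to support incremental ingestion; how the gem stores its manifest is not stated here.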
## Usage
```ruby
require 'legion/extensions/knowledge'

# Ingest an entire directory
Legion::Extensions::Knowledge::Runners::Ingest.ingest_corpus(
  path: '/path/to/docs',
  dry_run: false,
  force: false
)
# => { success: true, files_scanned: 12, chunks_created: 84, chunks_skipped: 0, chunks_updated: 0 }

# Ingest a single file
Legion::Extensions::Knowledge::Runners::Ingest.ingest_file(
  file_path: '/path/to/docs/guide.md'
)
# => { success: true, file: '...', chunks_created: 7, chunks_skipped: 0, chunks_updated: 0 }

# Query with LLM synthesis
Legion::Extensions::Knowledge::Runners::Query.query(
  question: 'How does Legion route tasks?',
  top_k: 5,
  synthesize: true
)
# => { success: true, answer: '...', sources: [...], metadata: { retrieval_score: 0.87, chunk_count: 5, latency_ms: 312 } }

# Retrieval only (no LLM)
Legion::Extensions::Knowledge::Runners::Query.retrieve(
  question: 'What is a LEX extension?',
  top_k: 3
)
# => { success: true, sources: [...], metadata: { chunk_count: 3 } }
```
## Configuration
Settings are read from `Legion::Settings` under the `:knowledge` key:
```yaml
knowledge:
  chunker:
    max_tokens: 512      # default 512
    overlap_tokens: 128  # default 128
  query:
    top_k: 5             # default 5
```
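To show how `max_tokens` and `overlap_tokens` might interact, here is a rough sketch of paragraph-respecting chunking. It rests on assumptions: `chunk_paragraphs` is a hypothetical function, tokens are approximated by whitespace-separated words, and the gem's real Chunker may count tokens and handle edge cases differently.

```ruby
# Hypothetical sketch, not the gem's Chunker.
# Packs whole paragraphs into chunks of at most `max_tokens` words where possible,
# and starts each new chunk with the last `overlap_tokens` words of the previous one.
def chunk_paragraphs(text, max_tokens: 512, overlap_tokens: 128)
  paragraphs = text.split(/\n{2,}/).map(&:strip).reject(&:empty?)
  chunks = []
  current = [] # words in the chunk being built (may begin with carried-over overlap)
  fresh = 0    # words in `current` that are new, i.e. not overlap

  paragraphs.each do |paragraph|
    words = paragraph.split
    # Close the current chunk if this paragraph would push it over budget.
    if fresh.positive? && current.size + words.size > max_tokens
      chunks << current.join(' ')
      current = current.last(overlap_tokens)
      fresh = 0
    end
    current.concat(words)
    fresh += words.size
  end

  chunks << current.join(' ') if fresh.positive?
  chunks
end
```

In this simplified version paragraphs are never split, so a single paragraph longer than the budget produces an oversized chunk, and paragraph breaks inside a chunk are flattened to spaces. A larger overlap preserves more context across chunk boundaries at the cost of storing more duplicated text.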
## Dependencies
- `legion-cache`, `legion-crypt`, `legion-data`, `legion-json`, `legion-logging`, `legion-settings`, `legion-transport`
- `lex-apollo` (optional): chunk storage and vector retrieval
- `legion-llm` (optional): answer synthesis
Both optional dependencies are guarded with `defined?()` checks, so the gem degrades gracefully when they are absent.
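The guard pattern mentioned above can be sketched as follows. The constant name `Legion::LLM` and the methods `llm_available?` and `answer_or_sources` are assumptions made for illustration; only the behaviour of Ruby's `defined?` keyword is standard.

```ruby
# Hypothetical sketch of an optional-dependency guard.
# `defined?` returns nil (without raising) when the constant, or any
# parent namespace in its path, has not been loaded.
def llm_available?
  !defined?(Legion::LLM).nil?
end

# Falls back to returning the retrieved sources when no LLM is loaded.
def answer_or_sources(sources)
  if llm_available?
    # Real code would call into the LLM pipeline here.
    { success: true, sources: sources, synthesized: true }
  else
    { success: true, sources: sources, synthesized: false }
  end
end
```

Because `defined?` is evaluated on each call, the check reflects whatever has been loaded at that moment rather than at require time.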
## License
MIT