https://github.com/stevenbtw/deriva
Deriva is a research project aimed at enabling digital architects by automating the derivation of architecture models. Deriva uses a combination of knowledge graphs, heuristics and LLM's.
https://github.com/stevenbtw/deriva
archimate archimate-models architecture ast cli deriva derivation enterprise-architecture graph-algorigthms graph-algorithms heuristics knowledge-graph llm marimo neo4j ocel2 pydantic python
Last synced: 4 months ago
JSON representation
Deriva is a research project aimed at enabling digital architects by automating the derivation of architecture models. Deriva uses a combination of knowledge graphs, heuristics and LLM's.
- Host: GitHub
- URL: https://github.com/stevenbtw/deriva
- Owner: StevenBtw
- License: other
- Created: 2025-12-31T13:54:17.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2026-01-24T20:41:14.000Z (5 months ago)
- Last Synced: 2026-01-24T21:19:21.011Z (5 months ago)
- Topics: archimate, archimate-models, architecture, ast, cli, deriva, derivation, enterprise-architecture, graph-algorigthms, graph-algorithms, heuristics, knowledge-graph, llm, marimo, neo4j, ocel2, pydantic, python
- Language: Python
- Homepage: https://deriva.dev
- Size: 2.5 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Codeowners: .github/CODEOWNERS
- Security: .github/SECURITY.md
- Support: .github/SUPPORT.md
Awesome Lists containing this project
README
# Deriva
[](#)
[](https://github.com/StevenBtw/Deriva/actions/workflows/ci.yml)
[](LICENSE)
[](https://www.python.org/downloads/)
[](https://neo4j.com/)
[](https://www.docker.com/)
[](https://marimo.io/)
**Automatically generate ArchiMate enterprise architecture models from software repositories.**
Deriva analyzes code repositories and transforms them into [ArchiMate](https://www.opengroup.org/archimate-forum) models that can be opened in the [Archi modeling tool](https://www.archimatetool.com/).
## How It Works
1. **Clone** a Git repository
2. **Extraction** - Build a graph representation in Neo4j:
- **Classify phase**: Categorize files by type and subtype using registry
- **Parse phase**: Extract semantic nodes (TypeDefinitions, Methods, BusinessConcepts, etc.)
- Python files use fast AST parsing; other languages use LLM
3. **Derivation** - Generate ArchiMate elements using a hybrid approach:
- **Prep phase**: Graph enrichment (PageRank, Louvain communities, k-core)
- **Generate phase**: LLM-based element derivation with graph metrics
- **Refine phase**: Relationship derivation and quality assurance
4. **Export** to `.xml` file (ArchiMate format)
## Quick Setup
### Prerequisites
- **Python 3.14+**
- **Docker** (for Neo4j)
- **uv** (Python package manager)
### 1. Install uv
```bash
# Windows (PowerShell)
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
```
### 2. Clone and Configure
```bash
git clone https://github.com/StevenBtw/Deriva.git
cd Deriva
# Create environment configuration
cp .env.example .env
# Edit .env with your settings (Neo4j, LLM API keys, etc.)
```
### 3. Create Python Environment
```bash
uv venv --python 3.14
```
Activate the virtual environment:
```bash
# Windows PowerShell
.venv\Scripts\Activate.ps1
# Windows Command Prompt
.venv\Scripts\activate.bat
# macOS/Linux
source .venv/bin/activate
```
### 4. Install Dependencies
```bash
uv sync
```
### 5. Start Neo4j
```bash
cd deriva/adapters/neo4j
docker-compose up -d
```
Neo4j will be available at:
- **Browser UI**: http://localhost:7474 (no authentication)
- **Bolt Protocol**: `bolt://localhost:7687`
Verify Neo4j is running:
```bash
docker ps # Should show deriva_neo4j container
```
### 6. Launch Deriva
```bash
cd ../../.. # Back to Deriva root
uv run marimo edit deriva/app/app.py
```
The marimo notebook opens in your browser at: http://127.0.0.1:2718
---
## First Time Setup
When you first open Deriva, you need to seed the configuration database.
### 1. Seed File Type Registry
Navigate to **Column 2: Manage Extraction** → **File Type Registry**
1. Click **"Seed from JSON"**
2. This loads default file type mappings from `extraction_config.json`
3. Categories include: Source, Config, Docs, Test, Build, Asset, Data, Exclude
### 2. Enable Extraction Steps
Navigate to **Column 2: Manage Extraction** → **Extraction Step Configuration**
Enable the extraction steps you need:
| Step | Purpose | Recommended |
|------|---------|-------------|
| Repository | Creates root node for the repo | Always |
| Directory | Creates directory structure nodes | Always |
| File | Creates file nodes with classification | Always |
| TypeDefinition | Extracts classes, functions (AST for Python) | Yes |
| Method | Extracts methods from type definitions | Optional |
| Edge | Extracts relationships (IMPORTS, USES, CALLS, DECORATED_BY, REFERENCES) | Yes |
| Technology | Detects frameworks and libraries | Optional |
| ExternalDependency | Maps external dependencies | Optional |
| Test | Extracts test definitions | Optional |
### 3. Configure LLM (Optional)
If using LLM-assisted extraction, configure your provider in `.env`:
```bash
# Set default model to use
LLM_DEFAULT_MODEL=mistral-devstral
# Configure the model (naming: LLM_{NAME}_*)
LLM_MISTRAL_DEVSTRAL_PROVIDER=mistral
LLM_MISTRAL_DEVSTRAL_MODEL=devstral-2512
LLM_MISTRAL_DEVSTRAL_URL=https://api.mistral.ai/v1/chat/completions
LLM_MISTRAL_DEVSTRAL_KEY=your-key-here
LLM_MISTRAL_DEVSTRAL_STRUCTURED_OUTPUT=true
```
---
## Using Deriva
### Basic Workflow
#### 1. Clone a Repository
**Column 1: Configuration → Repositories**
1. Enter repository URL (e.g., `https://github.com/user/repo.git`)
2. Optionally specify a target name
3. Click **"Clone"**
#### 2. Run the Pipeline
**Column 0: Run Deriva**
- Click **"Run Deriva"** to run the full pipeline (extraction → derivation)
- Or use individual step buttons: **Extraction**, **Derivation**
Results display in a status callout showing nodes/elements created and any errors.
#### 3. View Results
**Column 1: Configuration**
- **Graph Statistics**: Node counts by type (Repository, Directory, File, etc.)
- **ArchiMate Model**: Element and relationship counts by type
#### 4. Export to Archi
**Column 1: Configuration → ArchiMate Model**
1. Set export path (default: `workspace/output/model.xml`)
2. Click **"Export Model"**
3. Open the file with [Archi](https://www.archimatetool.com/)
**Via CLI:**
```bash
deriva export -o workspace/output/model.xml
```
---
## Configuration
### Environment Variables (.env)
All configuration lives in `.env`. Key settings:
```bash
# Neo4j (default docker-compose has auth disabled)
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=
NEO4J_PASSWORD=
# LLM Provider (mistral, openai, azure, anthropic, ollama, lmstudio)
LLM_MISTRAL_DEVSTRAL_PROVIDER=mistral
LLM_MISTRAL_DEVSTRAL_MODEL=devstral-2512
LLM_MISTRAL_DEVSTRAL_URL=https://api.mistral.ai/v1/chat/completions
LLM_MISTRAL_DEVSTRAL_KEY=your-mistral-api-key
LLM_MISTRAL_DEVSTRAL_STRUCTURED_OUTPUT=true
# Namespaces
NEO4J_GRAPH_NAMESPACE=Graph
ARCHIMATE_NAMESPACE=Model
```
See `.env.example` for all available options.
### Rate Limiting
The LLM adapter includes built-in rate limiting to prevent API throttling:
```bash
# Requests per minute (0 = use provider default: 60 RPM for cloud, unlimited for local)
LLM_RATE_LIMIT_RPM=0
# Minimum delay between requests in seconds
LLM_RATE_LIMIT_DELAY=0.0
# Max retries on rate limit (429) errors
LLM_RATE_LIMIT_RETRIES=3
# Adaptive throttling (reduces RPM when hitting rate limits)
LLM_THROTTLE_ENABLED=true
LLM_THROTTLE_MIN_FACTOR=0.25 # Minimum 25% of configured RPM
LLM_THROTTLE_RECOVERY_TIME=60 # Seconds before trying to increase RPM
# Circuit breaker (stops requests when provider is failing)
LLM_CIRCUIT_BREAKER_ENABLED=true
LLM_CIRCUIT_FAILURE_THRESHOLD=5 # Consecutive failures to open circuit
LLM_CIRCUIT_RECOVERY_TIME=30 # Seconds before testing recovery
```
Default rate limits by provider:
| Provider | Default RPM |
|----------|-------------|
| OpenAI | 30 |
| Anthropic | 30 |
| Mistral | 24 |
| Ollama | Unlimited |
| LM Studio | Unlimited |
The rate limiter automatically:
- Throttles requests to stay within limits
- Applies exponential backoff on rate limit errors (HTTP 429)
- Respects Retry-After headers from providers
- Adaptively reduces RPM when hitting rate limits (recovers over time)
- Opens circuit breaker after consecutive failures to prevent cascading errors
### Managing File Types
If you encounter **undefined extensions** during extraction:
**Via UI (Marimo):**
1. Navigate to **Column 2** → **Undefined Extensions**
2. Add them to the registry:
- Extension (e.g., `.tsx`, `Dockerfile`)
- Type (source, config, docs, test, build, asset, data, exclude)
- Subtype (e.g., `typescript`, `docker`)
**Via CLI:**
```bash
# List all registered file types
deriva config filetype list
# Add a new file type
deriva config filetype add ".tsx" source typescript
# Delete a file type
deriva config filetype delete ".tsx"
# Show file type statistics by category
deriva config filetype stats
```
> **Note:** Files with unrecognized extensions are automatically classified as `file_type="unknown"` with their extension as the subtype. This ensures all files get proper classification even without explicit registry entries.
### Updating Configurations (Versioning)
Deriva uses a **versioning system** for configurations. When you update a config, a new version is created while preserving previous versions for rollback.
**Correct ways to update configs:**
1. **Via UI (Marimo)**: Navigate to the config section, edit, and click **"Save Config"**
2. **Via CLI**: Use the `config update` command
```bash
# Update extraction config instruction
deriva config update extraction BusinessConcept \
-i "New instruction text..."
# Update extraction config with batch size for multi-file LLM calls
deriva config update extraction BusinessConcept \
--batch-size 5
# Update derivation config from file
deriva config update derivation ApplicationComponent \
--instruction-file prompts/app_component.txt
# View all versions
deriva config versions
```
**Do NOT use JSON import/export for config updates.** The `db_tool import` command is only for backup restoration or migration - it overwrites version history. See [BENCHMARKS.md](BENCHMARKS.md) for the optimization workflow.
### Customizing Extraction Prompts
For LLM-assisted extraction steps:
1. Navigate to **Column 2** → **Extraction Step Configuration**
2. Expand a node type (e.g., TypeDefinition)
3. Edit: Input File Types, Input Graph Elements, Instruction, Example
4. Click **"Save Config"** (this creates a new version)
All prompts follow the **Input + Instruction + Example** pattern.
---
## UI Layout
Deriva uses a multi-column marimo notebook layout:
| Column | Purpose |
|--------|---------|
| **0** | **Run Deriva**: Pipeline execution buttons, status display |
| **1** | **Configuration**: Runs, repositories, Neo4j, graph stats, ArchiMate, LLM |
| **2** | **Extraction Settings**: File type registry, extraction step configuration |
| **3** | **Derivation Settings**: Element type configuration (13 types across Business/Application/Technology layers), relationship derivation |
The UI is powered by `PipelineSession` from the services layer, providing a clean separation between presentation and business logic.
---
## Data Storage
- **Neo4j Graph Database**:
- **Graph namespace**: Intermediate representation (Modules, Files, Dependencies)
- **Model namespace**: ArchiMate elements and relationships
- **DuckDB** (`deriva/adapters/database/sql.db`): File type registry, extraction configs, settings
### Clearing Data
**Column 0: Run Overview**
- **Clear Graph**: Removes all nodes/edges from Graph namespace
- **Clear Model**: Removes all ArchiMate elements and relationships
---
## Querying Neo4j Directly
Access the Neo4j browser at http://localhost:7474 and run Cypher queries:
```cypher
// See all repositories
MATCH (r:Graph:Repository) RETURN r
// See files in a repo
MATCH (repo:Graph:Repository)-[:Graph:CONTAINS*]->(f:Graph:File)
WHERE repo.name = 'my-repo'
RETURN f.name, f.file_type
// See type definitions
MATCH (td:Graph:TypeDefinition) RETURN td.name, td.type_category
```
---
## CLI (Headless Mode)
Deriva includes a full CLI for headless operation and automation:
```bash
# Help
deriva --help
# View configuration
deriva config list extraction
deriva config show extraction BusinessConcept
deriva status
# Manage file types
deriva config filetype list
deriva config filetype add ".lock" dependency lock
deriva config filetype stats
# Run pipeline stages
deriva run extraction --repo flask_invoice_generator -v
deriva run derivation -v
deriva run derivation --phase generate -v # Run specific phase (prep, generate, refine)
deriva run all --repo myrepo
# Export ArchiMate model
deriva export -o workspace/output/model.xml
```
**CLI Options:**
| Option | Description |
|--------|-------------|
| `--repo NAME` | Process specific repository (default: all) |
| `--phase PHASE` | Run specific derivation phase: prep, generate, or refine |
| `-v, --verbose` | Print detailed progress |
| `--no-llm` | Skip LLM-based steps (structural extraction only) |
| `-o, --output PATH` | Output file path for export |
---
## Benchmarking
Deriva includes a multi-model benchmarking system for comparing LLM performance across different providers and models. See [BENCHMARKS.md](BENCHMARKS.md) for the full guide and [OPTIMIZATION.md](OPTIMIZATION.md) for detailed case studies.
### Running Benchmarks
```bash
# List available benchmark models
deriva benchmark models
# Run a benchmark with specific models
deriva benchmark run \
--repos flask_invoice_generator \
--models openai-gptx,ollama-devstral \
-n 3 \
-d "Comparing gptx with devstral" \
-v
# List benchmark sessions
deriva benchmark list
# Analyze a benchmark session
deriva benchmark analyze bench_20260101_150724
```
### Configuring Benchmark Models
Add models to `.env` using the pattern:
```bash
# Azure GPT-4o-mini
LLM_AZURE_GPT4MINI_PROVIDER=azure
LLM_AZURE_GPT4MINI_MODEL=gpt-4
LLM_AZURE_GPT4MINI_URL=https://your-resource.openai.azure.com/...
LLM_AZURE_GPT4MINI_KEY=your-api-key
# Ollama local model
LLM_OLLAMA_LLAMA_PROVIDER=ollama
LLM_OLLAMA_LLAMA_MODEL=devstral
LLM_OLLAMA_LLAMA_URL=http://localhost:11434/api/chat
```
### OCEL Event Logging
Benchmark runs are logged in **OCEL 2.0** (Object-Centric Event Log) format for process mining analysis:
- Events capture pipeline stages, LLM calls, and results
- Object types: `BenchmarkSession`, `BenchmarkRun`, `Repository`, `Model`
- Logs are saved to `workspace/benchmarks/{session_id}/events.ocel.json`
OCEL files can be analyzed with process mining tools like PM4Py, Celonis, or custom analysis scripts.
---
## Troubleshooting
### Neo4j Connection Issues
```bash
# Check if running
docker ps
# View logs
cd deriva/adapters/neo4j && docker-compose logs
# Restart
docker-compose restart
# Clear all data (destructive!)
docker-compose down -v
```
### Port Conflicts
If ports 7687/7474 are in use, edit `deriva/adapters/neo4j/docker-compose.yml`:
```yaml
ports:
- "7688:7687"
- "7475:7474"
```
Update `.env` accordingly:
```bash
NEO4J_URI=bolt://localhost:7688
```
### Marimo Issues
```bash
# Check Python version
python --version # Should be 3.14+
# Reinstall dependencies
uv sync --reinstall
# Run without watch mode
uv run marimo edit deriva/app/app.py
```
---
## Contributing
For development setup, architecture details, and contribution guidelines, see [CONTRIBUTING.md](CONTRIBUTING.md).
---
## License
This project is licensed under the **GNU Affero General Public License v3.0 (AGPL-3.0)**.
This means you can freely use, modify, and distribute this software, but if you run a modified version as a network service, you must make the source code available to users of that service.
See [LICENSE](LICENSE) for the full license text.
## Acknowledgments
- [Marimo](https://marimo.io) - Reactive Python notebooks
- [Neo4j](https://neo4j.com) - Graph database
- [ArchiMate](https://www.opengroup.org/archimate-forum) - Enterprise architecture standard
- [Archi](https://www.archimatetool.com) - Open source ArchiMate modeling tool
- [Tree-sitter](https://tree-sitter.github.io/tree-sitter/) - Multi-language AST parsing
---
**Status**: Active Development