https://github.com/tachyon-beep/esper-lite

deep-reinforcement-learning morphogenesis neural-architecture-optimization neural-architecture-search neural-networks pytorch
Last synced: 4 months ago
JSON representation
Host: GitHub
URL: https://github.com/tachyon-beep/esper-lite
Owner: tachyon-beep
License: mit
Created: 2025-09-20T09:39:55.000Z (9 months ago)
Default Branch: main
Last Pushed: 2026-01-12T21:40:11.000Z (5 months ago)
Last Synced: 2026-01-13T01:31:43.609Z (5 months ago)
Topics: deep-reinforcement-learning, morphogenesis, neural-architecture-optimization, neural-architecture-search, neural-networks, pytorch
Language: Python
Homepage:
Size: 18.6 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Roadmap: ROADMAP.md
- Agents: AGENTS.md
Awesome Lists containing this project

README

          # Esper: Morphogenetic Neural Networks

**Grow capabilities, don’t just train weights.**

Esper is a framework for **morphogenetic AI**: neural networks that can **grow, prune, and adapt their own topology during training**. Instead of committing to a static architecture up front, Esper uses a lifecycle-driven approach where “seed” modules are germinated, trained safely, and only blended into the host when they earn their keep.

---

## What exists today

Esper is built as a set of decoupled subsystems. The ones you will see in the codebase and tooling right now are:

* **Kasmina** (host + slots): the morphogenetic model and the mechanics for inserting, training, blending, and fossilising seeds.

* **Tamiyo** (decision-maker): chooses lifecycle actions (germinate, blend, prune, fossilise, etc.) using either heuristics or a learned policy.

* **Simic** (selection pressure): reward, accounting, PPO training loop, and the “economy” (rent, churn, contribution signals).

* **Tolaria** (execution engine): high-throughput, deterministic training/evaluation substrate; governs safety rollback.

* **Nissa** (telemetry backends): emits structured diagnostics and run artefacts for analysis.

* **Karn** (operator UI + analytics): Sanctum TUI, Overwatch dashboard, logs, aggregation, and “flight recorder” style visibility.

Planned subsystems (designed but not yet fully shipped as first-class controllers):

* **Emrakul** (maintenance/decay policy): evidence-led probe/sedate/lyse of committed structure; long-horizon efficiency.

* **Narset** (allocator): slow-timescale budget allocator coordinating multiple Tamiyo/Emrakul pairs.

* **Esika** (host superstructure): container/ruleset that deconflicts Kasmina “cells” and hosts Narset at scale.

---

## Current architecture baseline

As of the current “Tamiyo Next” baseline, the RL-controlled stack supports long-horizon, multi-seed behaviour:

* **Obs V3**: reduced redundancy; blueprint identity moved to learned embeddings

  * Non-blueprint obs: **116 dims** (23 base + 31 per-slot × 3 slots)

  * Blueprint embedding: **4 × slots** (e.g. 12 dims for 3 slots)

  * Total policy input: **128 dims**

* **Policy V2**: **512-dim feature net + 512 hidden LSTM**, designed for ~150-step horizons

* **Critic**: action-conditioned baseline (**Q(s, op)** style), reducing value aliasing

* **Default episode length**: **150** steps (epochs) per rollout horizon

---

## 📚 Terminology: Nested Training Loops

Esper uses PPO to control *another neural network's training process*—an unusual meta-learning setup that creates terminology overlap. Here's how the nested loops work:

```

Outer loop: 200 batches (PPO updates)

├── Each batch: n_envs environments run in parallel

│   └── Each environment: 1 episode = 150 steps

│       └── Each step: host trains 1 epoch, Tamiyo takes 1 action

└── After batch completes: PPO update on collected trajectories

```

| Term | Meaning |

|------|---------|

| **Episode** | One complete host training run (e.g., 150 steps). The RL trajectory. |

| **Step** | One policy decision point. Tamiyo observes host state and chooses an action (WAIT, GERMINATE, PRUNE, etc.). |

| **Batch** | Collection of parallel episodes used for one PPO update. With `n_envs=10`, one batch = 10 episodes. |

| **Host epoch** | One training iteration of the host neural network (domain term, not RL term). |

> **Why the confusion?** In standard RL, "step" is unambiguous (one game frame, one physics tick). Here, each RL step is synchronized with one host training epoch—the "environment" is literally a neural network mid-training. So: *"At step 75 of episode 3, the host finished its 75th training epoch and Tamiyo decided to GERMINATE."*

---

## ⚡ Quick Start

## Key ideas

### Seed lifecycle (Kasmina)

Seeds are introduced and evaluated under a controlled state machine:

* **Germinated**: module exists, influence is isolated

* **Training**: learns “behind the host” (safe, no destabilising contribution)

* **Blending**: alpha ramps in under controlled schedules

* **Holding**: stabilisation window before committing

* **Fossilised**: accepted as part of the model’s committed structure

### Gradient isolation

Seeds can learn from task gradients without immediately altering the host’s forward behaviour, reducing destabilisation during early growth.

### Vectorised training (Tolaria + Simic)

Esper’s training loop is designed for high throughput:

* parallel environments

* GPU-first execution

* deterministic replay principles

* “inverted control flow” (batches drive environments, not the other way around)

### Telemetry as an API (Nissa + Karn)

Telemetry is treated like a contract:

* typed payloads

* schema validation

* explicit provenance

* UI panels reflect real signals (no silent fallback)

---

## 🛠️ Development

**Project Structure:**

## Quick start

### Installation

Requires Python 3.11+ and PyTorch.

```bash

git clone https://github.com/yourusername/esper.git

cd esper

uv sync

```

### Heuristic baseline (Tamiyo heuristic)

```bash

PYTHONPATH=src uv run python -m esper.scripts.train heuristic \

  --task cifar_baseline --episodes 1

```

### PPO training (Simic RL)

```bash

PYTHONPATH=src uv run python -m esper.scripts.train ppo \

  --task cifar_baseline \

  --rounds 100 \

  --envs 4 \

  --episode-length 150 \

  --device cuda:0

```

---

## System map

| Domain      | Role                                      | Exists today | Notes                                                     |

| ----------- | ----------------------------------------- | ------------ | --------------------------------------------------------- |

| **Kasmina** | Host + SeedSlots + lifecycle mechanics    | ✅            | “Where topology changes happen” under strict contracts    |

| **Leyline** | Shared enums/contracts/schemas            | ✅            | Source of truth for types and ordering invariants         |

| **Tamiyo**  | Growth policy (heuristic or learned)      | ✅            | Manages seeds pre-commit                                  |

| **Simic**   | Reward + PPO + accounting                 | ✅            | Selection pressure, credit signals, and training loop     |

| **Tolaria** | Execution engine + determinism + rollback | ✅            | High-throughput substrate and safety governor             |

| **Nissa**   | Telemetry backends                        | ✅            | Structured emission and artefact routing                  |

| **Karn**    | Operator UI + analytics                   | ✅            | Sanctum/Overwatch and aggregation                         |

| **Emrakul** | Decay/maintenance policy                  | 🧭 Planned   | Probe/sedate/lyse for efficiency and consolidation        |

| **Narset**  | Budget allocator over regions             | 🧭 Planned   | Slow coordinator based on coarse health signals           |

| **Esika**   | Host superstructure                       | 🧭 Planned   | Deconfliction + safe-boundary scheduling + Narset hosting |

---

## CLI overview

### PPO training (`esper.scripts.train ppo`)

Core scaling knobs:

| Flag                 | Default | Meaning                                     |

| -------------------- | ------- | ------------------------------------------- |

| `--rounds N`         | 100     | PPO update rounds                           |

| `--envs K`           | 4       | Parallel environments per round             |

| `--episode-length L` | 150     | Steps per env per round (also LSTM horizon) |

| `--ppo-epochs E`     | 1       | PPO update passes over rollout data         |

| `--memory-size H`    | 512     | LSTM hidden size                            |

Config/presets:

| Flag                 | Default          | Meaning                                |

| -------------------- | ---------------- | -------------------------------------- |

| `--task`             | `cifar_baseline` | Host + dataloaders + topology preset   |

| `--preset`           | `cifar_baseline` | Hyperparameter preset                  |

| `--config-json PATH` | (none)           | Strict config file (unknown keys fail) |

| `--seed N`           | (config)         | Run seed override                      |

Hardware/perf:

| Flag              | Default   | Meaning                                     |

| ----------------- | --------- | ------------------------------------------- |

| `--device`        | `cuda:0`  | Policy device                               |

| `--devices`       | (none)    | Multi-GPU env devices (`cuda:0 cuda:1 ...`) |

| `--num-workers`   | (task)    | DataLoader workers                          |

| `--gpu-preload`   | off       | CIFAR GPU preload (VRAM trade)              |

| `--compile-mode`  | `default` | torch.compile mode                          |

| `--force-compile` | off       | Compile even in TUI mode                    |

Telemetry/monitoring:

| Flag                   | Meaning                                  |

| ---------------------- | ---------------------------------------- |

| `--sanctum`            | Textual TUI for debugging                |

| `--overwatch`          | Web dashboard                            |

| `--telemetry-dir PATH` | Write telemetry artefacts                |

| `--wandb`              | Enable Weights & Biases (optional extra) |

---

## Seed lifecycle diagram

```mermaid

stateDiagram-v2

    [*] --> DORMANT

    DORMANT --> GERMINATED: Germinate

    GERMINATED --> TRAINING: Advance (G1)

    TRAINING --> BLENDING: Advance (G2)

    BLENDING --> HOLDING: Advance (G3)

    HOLDING --> FOSSILIZED: Fossilise

    TRAINING --> PRUNED: Prune

    BLENDING --> PRUNED: Prune

    PRUNED --> EMBARGOED: Cleanup

    EMBARGOED --> RESETTING: Cooldown

    RESETTING --> DORMANT: Recycle

```

---

## Results notes

Esper’s performance is best evaluated as a **frontier**: quality vs cost vs stability. Peak accuracy matters, but reliability and growth ratio matter more. The system is designed to support:

* “capable host augmentation” (baseline CIFAR)

* “rescue a broken host” (impaired/minimal CIFAR)

* scaling pressure tests (deep/multi-slot hosts)

---

## Development

Run tests:

```bash

uv run pytest -q

```

#### Reward Configuration

| Parameter              | Default          | Description                                                                   |

| ---------------------- | ---------------- | ----------------------------------------------------------------------------- |

| `reward_mode`          | `"shaped"`       | Reward signal strategy (see modes below)                                      |

| `reward_family`        | `"contribution"` | `"contribution"` (counterfactual) or `"loss"` (direct loss delta)             |

| `param_budget`         | `500000`         | Parameter budget for seeds (penalty if exceeded)                              |

| `param_penalty_weight` | `0.1`            | Weight of parameter budget penalty in reward                                  |

| `rent_host_params_floor` | `200`          | Host-size normalization floor for rent/alpha-shock (prevents tiny hosts being crushed) |

**Reward Modes:**

| Mode | Description | Use Case |

|------|-------------|----------|

| `shaped` | 7-component dense signals (PBRS + attribution + rent + warnings) | Default; rich feedback but potential Goodhart risk |

| `simplified` | 3-component (PBRS + intervention cost + terminal bonus) | Cleaner gradients for temporal credit assignment |

| `basic` | Minimal viable reward: contribution + rent | Baseline for ablation |

| `basic_plus` | BASIC + post-fossilization drip accountability | Prevents early-fossilization gaming |

| `sparse` | Terminal-only: final accuracy minus rent | Hard mode; theoretically perfect alignment |

| `minimal` | Sparse + early-cull penalty | Penalises wasted compute |

| `escrow` | Delayed attribution via escrow accounts | Experimental |

> **Drip Mechanism (BASIC_PLUS):** After a seed fossilizes, rewards continue to "drip" based on ongoing contribution. Controlled by `drip_fraction` (default 0.7 = 70% drip, 30% immediate). Prevents gaming where seeds are fossilized to lock in short-term gains before regression.

#### A/B Testing (True)

Use `--dual-ab` to train separate policies on separate GPUs (e.g. `shaped-vs-simplified`).

Project structure:

```text

src/esper/

├── kasmina/      # Host + slots + seed mechanics

├── leyline/      # Shared contracts and schemas

├── tamiyo/       # Policies and action masks

├── tolaria/      # Execution engine + safety governor

├── simic/        # PPO + reward/accounting

├── nissa/        # Telemetry backends and outputs

├── karn/         # UI, dashboards, analytics

└── scripts/      # CLI entry points

```
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/tachyon-beep/esper-lite

Awesome Lists containing this project

README