https://github.com/frankbria/semantic-gui-control
A cross-platform, text-first control layer that exposes GUIs to agents as structured affordances instead of pixels. Discovers the interface, normalizes it, executes through a small command vocabulary, and verifies state changes. Vision is the spare tire, not the steering wheel.
- Host: GitHub
- URL: https://github.com/frankbria/semantic-gui-control
- Owner: frankbria
- Created: 2026-05-19T17:20:17.000Z (25 days ago)
- Default Branch: main
- Last Pushed: 2026-05-19T21:02:23.000Z (25 days ago)
- Last Synced: 2026-05-19T21:06:23.353Z (25 days ago)
- Topics: accessibility, agent-tools, agentic-ai, ai, ai-agents, automation, computer-use, cross-platform, desktop-automation, gui-automation, llm, llm-agents, llm-tools, mcp, python, rpa, ui-automation, uiautomation, windows
- Language: Python
- Homepage:
- Size: 158 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 5
Metadata Files:
- Readme: README.md
- Roadmap: docs/roadmap-blunt-wins.md
README
# Semantic GUI Control Layer (SGCL)
A text-first, cross-platform control layer for agentic interaction with graphical user interfaces.
## Thesis
Agents should not primarily operate GUIs through screenshots and coordinate clicks. SGCL should:
1. **Discover** the usable interface layer from the environment (accessibility trees, DOM, OS automation APIs, keyboard traversal, app APIs).
2. **Normalize** it into structured affordances.
3. **Expose** a small standard command vocabulary.
4. **Execute** actions through platform adapters.
5. **Verify** state changes.
6. **Fall back** to vision/OCR only when semantic paths are broken or incomplete.
> Vision is the spare tire, not the steering wheel.
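The ordering in steps 1 and 6 can be sketched as a small selection routine. This is an illustration only, not SGCL code: `Source`, `discover`, and the record shapes are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Source:
    """Hypothetical description of one way to read the interface."""
    name: str                        # e.g. "accessibility", "dom", "vision_ocr"
    semantic: bool                   # False only for the vision/OCR fallback
    probe: Callable[[], list[dict]]  # raw element records, or [] if unusable


def discover(sources: list[Source]) -> tuple[Optional[str], list[dict]]:
    """Return (source name, elements) from the first source that yields anything.

    Semantic sources are always tried before non-semantic ones, whatever
    order the caller passed them in.
    """
    ordered = sorted(sources, key=lambda s: not s.semantic)  # stable sort
    for source in ordered:
        try:
            elements = source.probe()
        except Exception:
            continue  # a broken source is treated like an empty one
        if elements:
            return source.name, elements
    return None, []


def broken_tree() -> list[dict]:
    raise RuntimeError("accessibility tree unavailable")


sources = [
    Source("vision_ocr", False, lambda: [{"text": "Save", "bbox": [10, 10, 60, 30]}]),
    Source("accessibility", True, broken_tree),
    Source("dom", True, lambda: []),
]
used, elements = discover(sources)
print(used)
```

Here both semantic sources fail (one raises, one is empty), so the vision source is the one that ends up being used. When any semantic source returns elements, the vision probe is never called.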
## Current status
**Discovery / spike phase.** No production code yet. Planning and architecture only.
The first executable milestone (Phase 0) targets a Windows UIA observer that can list windows and dump an active window's control tree as JSON. Windows is a convenient first spike; the core model is intentionally cross-platform.
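As a rough picture of the "dump a control tree as JSON" half of that milestone, the sketch below serializes an in-memory tree. It assumes the platform adapter has already produced the nodes; `Node`, `to_dict`, and the `max_depth` guard are hypothetical and do not reflect the project's actual schema.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one UI element reported by an adapter."""
    role: str
    name: str = ""
    children: list["Node"] = field(default_factory=list)


def to_dict(node: Node, max_depth: int = 25) -> dict:
    """Convert a node and its descendants into plain dicts, depth-limited."""
    out = {"role": node.role, "name": node.name}
    if node.children:
        if max_depth > 0:
            out["children"] = [to_dict(c, max_depth - 1) for c in node.children]
        else:
            out["truncated"] = True  # mark the cut instead of dropping it silently
    return out


# A made-up window; a real observer would build this from the live UI.
window = Node("window", "Untitled - Notepad", [
    Node("menubar", "Application", [Node("menuitem", "File"), Node("menuitem", "Edit")]),
    Node("edit", "Text editor"),
])

dump = json.dumps(to_dict(window), ensure_ascii=False, indent=2)
print(dump)
```

The depth limit is a design guess: some applications expose very deep or cyclic-looking trees, and an explicit `truncated` marker keeps the dump honest about what was left out.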
## Blunt-win roadmap
Coarse learning milestones. Each one must produce a working capability, a documented constraint, or a killed assumption. See [`docs/roadmap-blunt-wins.md`](docs/roadmap-blunt-wins.md) for detail.
| # | Win | Question it answers |
|---|-----|---------------------|
| 1 | Observe | Can we expose a real desktop GUI as structured text without screenshots? |
| 2 | Normalize | Can we hide UIA/AX/AT-SPI/DOM differences behind a common schema? |
| 3 | Find | Can an agent find the thing it means without knowing screen coordinates? |
| 4 | Read | Can the system read enough state to support agent reasoning and verification? |
| 5 | Act | Can we perform basic actions through the affordance layer rather than pixels? |
| 6 | Verify | Can every action return evidence, not just "I clicked it"? |
| 7 | Risk | Can the system avoid becoming a blind automation monkey when performing committing actions? |
| 8 | Repair & Fallback | Can the system recover from broken accessibility trees? |
| 9 | Cross-Platform Adapter Contract | Did we build a real abstraction, or just rename Windows UIA? |
| 10 | Agent Loop | Can an LLM use SGCL to complete a tiny task through structured state only? |
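Wins 3, 5, and 6 (Find, Act, Verify) can be illustrated together with a toy in-memory model. Everything here is an assumption made for the example: the `Affordance` and `Evidence` fields, the `find` matching rule, and `set_value` are hypothetical, and the "UI" is just a list of objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Affordance:
    id: str
    role: str
    name: str
    value: str = ""


@dataclass
class Evidence:
    ok: bool      # did the observed state match the intent?
    target: str   # id of the affordance acted on
    before: str   # value read before the action
    after: str    # value read after the action


def find(affordances: list[Affordance], role: str, name: str) -> Optional[Affordance]:
    """Match on role and case-insensitive name; no screen coordinates involved."""
    wanted = name.casefold()
    for a in affordances:
        if a.role == role and a.name.casefold() == wanted:
            return a
    return None


def set_value(target: Affordance, text: str) -> Evidence:
    """Write a value, then re-read it and report what actually changed."""
    before = target.value
    target.value = text   # a real adapter would call the platform API here
    after = target.value  # ...and re-read the value from the live UI
    return Evidence(ok=(after == text), target=target.id, before=before, after=after)


ui = [
    Affordance("a1", "button", "Save"),
    Affordance("a2", "edit", "File name", "draft.txt"),
]
box = find(ui, "edit", "file name")
evidence = set_value(box, "report.txt")
print(evidence)
```

The point of returning `Evidence` rather than a bare success flag is that the caller can check the before and after values itself instead of trusting the action.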
## Documentation
| Doc | Purpose |
|-----|---------|
| [`docs/project-thesis.md`](docs/project-thesis.md) | Problem, thesis, non-goals, guiding principles |
| [`docs/roadmap-blunt-wins.md`](docs/roadmap-blunt-wins.md) | The 10 blunt wins, with exit criteria |
| [`docs/architecture-overview.md`](docs/architecture-overview.md) | Conceptual architecture and adapter model |
| [`docs/command-vocabulary.md`](docs/command-vocabulary.md) | Standard agent-facing commands |
| [`docs/affordance-model.md`](docs/affordance-model.md) | Normalized affordance schema |
| [`docs/risk-model.md`](docs/risk-model.md) | Risk classes and default policy |
| [`docs/use-cases.md`](docs/use-cases.md) | Initial target use cases |
| [`docs/phase-0-observe-spike.md`](docs/phase-0-observe-spike.md) | Detailed plan for the first spike |
| [`docs/phase-1-normalize-spike.md`](docs/phase-1-normalize-spike.md) | Normalize planning |
| [`docs/phase-2-find-read-spike.md`](docs/phase-2-find-read-spike.md) | Find + Read planning |
| [`docs/phase-3-act-verify-risk-spike.md`](docs/phase-3-act-verify-risk-spike.md) | Act + Verify + Risk planning |
| [`docs/open-questions.md`](docs/open-questions.md) | Unresolved questions |
| [`docs/decisions/`](docs/decisions/) | Architecture Decision Records |
| [`docs/github-issues-seed.md`](docs/github-issues-seed.md) | Copy-paste GitHub issue bodies for the first 7 wins |
| [`spikes/`](spikes/) | Results of each exploratory spike |
Legacy reference docs (kept for context, superseded by the above):
- [`docs/level-1-spec.md`](docs/level-1-spec.md) — early system spec
- [`docs/cross-platform-strategy.md`](docs/cross-platform-strategy.md) — adapter strategy notes
- [`docs/development-sequence.md`](docs/development-sequence.md) — earlier phase sequence
## Local development
Nothing to run yet. The proposed package shape is:
```
sgcl/
  core/       # platform-neutral schemas, vocabulary, verifier, risk
  adapters/   # windows_uia, macos_ax, linux_atspi, browser_dom, vision_ocr
  cli.py      # `sgcl` entry point
```
The first spike (Phase 0) will likely use Python with `pywinauto` or `uiautomation` on Windows. Setup steps will be documented once they exist.
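One way the `core/` and `adapters/` split could look is a small adapter contract, sketched below with fake adapters so it runs anywhere. The `Adapter` protocol, the `observe` method, the raw record keys, and the normalized role vocabulary are all assumptions for illustration. The platform role names on the left of each map (`Button`, `AXButton`, and so on) are real UIA and macOS AX names.

```python
from typing import Protocol


class Adapter(Protocol):
    """Hypothetical contract: each adapter returns already-normalized records."""
    def observe(self) -> list[dict]: ...


class FakeWindowsUIA:
    # UIA control type names mapped to an invented common vocabulary.
    ROLES = {"Button": "button", "Edit": "textbox", "CheckBox": "checkbox"}

    def __init__(self, raw: list[dict]) -> None:
        self.raw = raw

    def observe(self) -> list[dict]:
        return [
            {"role": self.ROLES.get(r["ControlType"], "unknown"), "name": r["Name"]}
            for r in self.raw
        ]


class FakeMacAX:
    # macOS accessibility roles mapped to the same invented vocabulary.
    ROLES = {"AXButton": "button", "AXTextField": "textbox", "AXCheckBox": "checkbox"}

    def __init__(self, raw: list[dict]) -> None:
        self.raw = raw

    def observe(self) -> list[dict]:
        return [
            {"role": self.ROLES.get(r["AXRole"], "unknown"), "name": r["AXTitle"]}
            for r in self.raw
        ]


def describe(adapter: Adapter) -> list[str]:
    """Core-side code: sees only normalized records, never platform fields."""
    return [f"{a['role']}: {a['name']}" for a in adapter.observe()]


win = FakeWindowsUIA([{"ControlType": "Button", "Name": "OK"},
                      {"ControlType": "Slider", "Name": "Volume"}])
mac = FakeMacAX([{"AXRole": "AXButton", "AXTitle": "OK"}])
print(describe(win))
print(describe(mac))
```

Unmapped roles come back as `unknown` rather than raising, which keeps a single unfamiliar control from breaking the whole observation.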
## Recommended invocation on Windows
Always use `sgcl --output PATH ...` (or pipe to `Out-File -Encoding utf8`) instead of `> file.json` or `| Tee-Object file.json`. Phase 1 confirmed that PowerShell's default `[Console]::OutputEncoding` mangles non-ASCII characters when sgcl's UTF-8 stdout flows through the pipe. `--output` writes the file directly from Python in UTF-8, so the data never makes that round-trip. See `docs/windows-claude-setup.md` for the optional one-time PowerShell profile additions that also fix interactive command output.
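The reason `--output` sidesteps the problem can be shown with a minimal sketch of such a flag. This is not the actual `sgcl` implementation: `build_parser` and `emit` are hypothetical, and only the `--output` flag itself comes from this README.

```python
import argparse
import json
import sys
import tempfile
from pathlib import Path
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    # Only the flag named in this README; the real CLI surface may differ.
    parser = argparse.ArgumentParser(prog="sgcl")
    parser.add_argument("--output", metavar="PATH", default=None,
                        help="write JSON to PATH as UTF-8 instead of stdout")
    return parser


def emit(payload: dict, output: Optional[str]) -> None:
    # ensure_ascii=False keeps non-ASCII names as real characters, not \uXXXX.
    text = json.dumps(payload, ensure_ascii=False, indent=2) + "\n"
    if output:
        # Explicit encoding: no dependence on the locale or console code page.
        Path(output).write_text(text, encoding="utf-8")
    else:
        sys.stdout.write(text)


with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "tree.json")
    args = build_parser().parse_args(["--output", target])
    emit({"role": "button", "name": "Datei öffnen"}, args.output)
    raw = Path(target).read_bytes()

# The file holds the UTF-8 bytes for "ö", untouched by any shell.
print("ö".encode("utf-8") in raw)
```

Because the file is opened by Python with an explicit encoding, the shell never decodes or re-encodes the bytes.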
## Working metaphor
A terminal for the visual operating environment. Not because everything becomes text, but because the GUI becomes inspectable, commandable, and verifiable.