https://github.com/benmarte/autoimprove

Autonomous codebase improvement loop for Claude Code
ai ai-skill claude claude-code claude-code-plugin claude-skills

Host: GitHub
URL: https://github.com/benmarte/autoimprove
Owner: benmarte
License: mit
Created: 2026-03-11T12:40:57.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-03-13T20:17:59.000Z (3 months ago)
Last Synced: 2026-03-15T09:55:57.164Z (3 months ago)
Topics: ai, ai-skill, claude, claude-code, claude-code-plugin, claude-skills
Language: Shell
Homepage:
Size: 472 KB
Stars: 5
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# 🔁 autoimprove

### Autonomous codebase improvement loop for Claude Code

[![Claude Code](https://img.shields.io/badge/Claude%20Code-Plugin-blueviolet?logo=anthropic)](https://code.claude.com)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Languages](https://img.shields.io/badge/languages-10%2B-blue)](#supported-languages)

*Inspired by [karpathy/autoresearch](https://github.com/karpathy/autoresearch) — but for any codebase, not just ML training loops.*

---

## What is this?

Karpathy's `autoresearch` lets an AI agent run ML experiments overnight: modify `train.py` → measure `val_bpb` → keep if better, discard if worse → repeat. You wake up to a log of experiments and a better model.

**autoimprove does the same thing for your codebase.**

Give Claude Code your project, run `/autoimprove:improve`, and let it iterate autonomously. It proposes a targeted change, scores your codebase before and after using your own tooling (TypeScript, `cargo clippy`, `pytest`, `golangci-lint` — whatever you already have), keeps the changes that improve the score, reverts the ones that don't, and logs everything. You wake up to a readable log of what worked, what didn't, and a cleaner codebase.

```
propose → measure BEFORE → implement → measure AFTER → keep ✅ or discard ❌ → log → repeat
```

A real autoimprove session: 5 iterations, 3 wins, score 50 → 59 (+9 pts) in under 6 minutes

---

## Quick start

```bash
# 1. Add the marketplace and install the plugin
/plugin marketplace add benmarte/autoimprove
/plugin install autoimprove@autoimprove

# 2. Auto-detect your stack and see your codebase report
/autoimprove:setup

# 3. The audit shows what's wrong and offers to start fixing
# Or run the audit anytime for a fresh check
/autoimprove:audit

# 4. For unattended runs (e.g. overnight), use improve directly
/autoimprove:improve 20

# Or focus on a specific task
/autoimprove:improve 10 "Replace all any types with proper interfaces"

# 5. Review in the morning
cat .claude/autoimprove/log.md
git log --oneline # one commit per winning experiment
git show HEAD # inspect the latest win
```

That's it. No config required upfront — `/autoimprove:setup` fingerprints your project, writes `.claude/autoimprove/config.md`, and immediately runs an audit showing your codebase's deficiencies ranked by efficiency.

### Upgrading

#### If you already have the upgrade command

```bash
/autoimprove:upgrade
```

#### If you don't have the upgrade command (older installs)

The plugin system caches marketplace clones locally. If your install predates the upgrade command, you need to update the marketplace clone first:

```bash
# 1. Update the marketplace clone
cd ~/.claude/plugins/marketplaces/autoimprove && git pull origin main

# 2. Update the plugin
/plugin update autoimprove@autoimprove
```

If `/plugin update` still shows "already at the latest version", uninstall and reinstall:

```bash
/plugin uninstall autoimprove@autoimprove
/plugin install autoimprove@autoimprove
```

After this, `/autoimprove:upgrade` will be available for all future updates.

### Auto-update check

autoimprove checks for new releases once per day on session start. If an update is available, you'll see:

```
Update available: v1.2.0 → v1.3.0
Run /autoimprove:upgrade to update.
```

The check is lightweight (single GitHub API call, 3s timeout, cached for 24 hours) and never blocks startup.
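The caching behaviour described above could be sketched as follows. This is a hedged illustration, not the plugin's actual `sessionstart.sh`: the function names and the cache file argument are hypothetical, and the use of GitHub's "latest release" endpoint is an assumption about which API call is made.

```bash
# Hypothetical sketch of a once-per-day update check.
# ASSUMPTION: the releases endpoint below is a guess at the single API call.
RELEASES_URL="https://api.github.com/repos/benmarte/autoimprove/releases/latest"

# cache_is_fresh FILE: succeed if FILE exists and was modified under 24 hours ago.
cache_is_fresh() {
  [ -f "$1" ] && [ -n "$(find "$1" -mmin -1440 2>/dev/null)" ]
}

# check_for_update CACHE_FILE: refresh the cache at most once per day.
# Always returns 0 so a slow or failed request cannot block startup.
check_for_update() {
  local cache="$1"
  if cache_is_fresh "$cache"; then
    return 0
  fi
  # --max-time 3 enforces the 3-second timeout; failures are ignored.
  if curl -fsS --max-time 3 "$RELEASES_URL" -o "$cache.tmp" 2>/dev/null; then
    mv "$cache.tmp" "$cache"
  else
    rm -f "$cache.tmp"
  fi
  return 0
}
```

Because a failed request leaves the old cache untouched, the check is simply retried on the next session start.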

---

## How it works

### 1. Setup (once per project)

`/autoimprove:setup` scans your project root to detect:
- Language and framework
- Package manager (`npm`, `cargo`, `poetry`, `uv`, etc.)
- Test runner (`pytest`, `jest`, `go test`, `rspec`, etc.)
- Type checker (`tsc`, `mypy`, `pyright`, etc.)
- Linter (`eslint`, `ruff`, `golangci-lint`, `rubocop`, etc.)

It writes a `.claude/autoimprove/config.md` file in your project root: a plain Markdown config that maps your specific tools to a **0–100 composite quality score**. You can edit this file to customise the loop for your project.

### 2. Isolated experiments via git worktrees

Every experiment runs in a **separate git worktree** — its own directory, its own branch, completely isolated from your main codebase:

```
your-project/                      ← main branch (never touched during experiments)
.claude/autoimprove/worktrees/     ← gitignored, auto-created
  experiment-001/                  ← branch: autoimprove/experiment-001
  experiment-002/                  ← branch: autoimprove/experiment-002
  experiment-003/                  ← branch: autoimprove/experiment-003
```

- ✅ Winning experiments get **squash-merged** back to main as a clean commit
- ❌ Losing experiments have the **worktree and branch deleted** — nothing touches main
- 🔒 Your working directory is **read-only** for the entire session
- 🧹 All worktrees are cleaned up automatically at session end

No more `git checkout -- .` rollbacks. No risk of a broken experiment corrupting your codebase.
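
The worktree lifecycle maps onto a handful of standard git commands. The following is a minimal sketch rather than the plugin's actual implementation: the function names and the `WORKTREE_ROOT` variable are hypothetical, while the branch naming and directory layout follow the tree shown above.

```bash
# Hypothetical helpers sketching the experiment lifecycle with plain git.
WORKTREE_ROOT=".claude/autoimprove/worktrees"

# create_experiment ID: new branch plus an isolated checkout for one experiment.
create_experiment() {
  local id="$1"
  mkdir -p "$WORKTREE_ROOT" &&
    git worktree add -b "autoimprove/experiment-$id" "$WORKTREE_ROOT/experiment-$id"
}

# discard_experiment ID: delete the worktree and its branch; the current branch is untouched.
discard_experiment() {
  local id="$1"
  git worktree remove --force "$WORKTREE_ROOT/experiment-$id" &&
    git branch -q -D "autoimprove/experiment-$id"
}

# keep_experiment ID MESSAGE: squash the experiment into one commit on the
# current branch, then clean up exactly as for a discarded experiment.
keep_experiment() {
  local id="$1" message="$2"
  git merge --squash --quiet "autoimprove/experiment-$id" &&
    git commit --quiet -m "autoimprove($id): $message" &&
    discard_experiment "$id"
}
```

`git branch -D` (rather than `-d`) is needed in both paths because a squash-merge does not record the experiment branch as merged.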

### 3. The audit

Before diving into fixes, `/autoimprove:audit` scans your codebase and shows exactly what needs work:

```
━━━ Codebase Audit ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Current Score: 61/100

Type safety: 24/40 ██████░░░░ (16 pts to max)
Build: 20/20 ██████████ ✓ maxed
Tests: 10/30 ███░░░░░░░ (20 pts to max)
Lint: 7/10 ███████░░░ (3 pts to max)

━━━ Fastest Path to 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

#  Area         Gap    Issues       Est. iterations  Efficiency
1  Type safety  16pts  8 errors     3 iterations     5.3 pts/iter  ← best
2  Lint         3pts   2 warnings   1 iteration      3.0 pts/iter
3  Tests        20pts  0/4 covered  7 iterations     2.9 pts/iter

Total: ~11 iterations to reach 100/100
⚡ Estimated token usage: ~250K tokens (rough estimate, actual usage varies)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

The audit ranks areas by **efficiency** — points gained per iteration — so you fix the highest-impact issues first. It then offers to start fixing interactively, area by area, or you can run `/autoimprove:improve` directly.
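
The efficiency ranking is simple arithmetic: gap in points divided by estimated iterations, sorted in descending order. A rough sketch, assuming a hypothetical tab-separated input of `area`, `gap`, and `iterations` (the function name is also an assumption):

```bash
# rank_by_efficiency: read "area<TAB>gap<TAB>iterations" lines on stdin and
# print "area<TAB>points-per-iteration", highest efficiency first.
rank_by_efficiency() {
  local tab
  tab="$(printf '\t')"
  awk -F'\t' '$3 > 0 { printf "%s\t%.1f\n", $1, $2 / $3 }' |
    LC_ALL=C sort -t "$tab" -k2,2nr
}

# Usage with the figures from the audit report above:
# printf 'Type safety\t16\t3\nLint\t3\t1\nTests\t20\t7\n' | rank_by_efficiency
```

With the audit figures above this puts Type safety first (5.3), then Lint (3.0), then Tests (2.9), matching the report.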

Setup auto-runs the audit after generating your config, so first-time users see this report immediately.

### 4. The score

Every iteration, the loop measures your codebase on four axes:

| Metric | Weight | What it checks |
|---|---|---|
| **Type / compile errors** | 40 pts | `tsc --noEmit`, `cargo check`, `go build`, `mypy`, etc. |
| **Build success** | 20 pts | Does the project build without errors? |
| **Test pass rate** | 30 pts | `(passing / total) × 30` |
| **Lint errors** | 10 pts | `eslint`, `ruff`, `clippy`, `golangci-lint`, etc. |

If a metric doesn't apply (no tests yet, no linter configured), its weight is redistributed across the others.
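
As a worked illustration of the weighting, the sketch below computes a composite score from four per-metric ratios. It assumes that the weight of an inapplicable metric is redistributed proportionally across the remaining ones; the README does not spell out the exact redistribution rule, and the function name and `na` marker are hypothetical.

```bash
# composite_score TYPE BUILD TESTS LINT
# Each argument is a ratio between 0 and 1, or "na" if the metric does not apply.
# ASSUMPTION: weights of "na" metrics are redistributed proportionally.
composite_score() {
  awk -v t="$1" -v b="$2" -v s="$3" -v l="$4" 'BEGIN {
    split("40 20 30 10", w, " ")          # weights: type, build, tests, lint
    r[1] = t; r[2] = b; r[3] = s; r[4] = l
    active = 0; earned = 0
    for (i = 1; i <= 4; i++) {
      if (r[i] == "na") continue          # skip metrics that do not apply
      active += w[i]
      earned += w[i] * r[i]
    }
    if (active == 0) { print 0; exit }
    printf "%d\n", (earned * 100 / active) + 0.5   # round to nearest integer
  }'
}

# Usage: the audit example (24/40, 20/20, 10/30, 7/10) gives 61.
# composite_score 0.6 1 0.3333 0.7
```

Under this assumption, dropping the test metric from the same project (`composite_score 0.6 1 na 0.7`) rescales the remaining 51 of 70 points to 73.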

### 5. The loop

Each iteration prints visible progress so you always know what's happening:

```
━━━ Iteration 1/5 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔬 PROPOSE: Targeting error handling in src/api/client.ts
🔬 SNAPSHOT: Measuring BEFORE score...
🔬 IMPLEMENT: Adding try/catch to unhandled async calls
🔬 MEASURE: Measuring AFTER score...
🔬 DECIDE: 85 → 89 (+4 pts) — KEPT ✅
🔬 LOG: Recorded to .claude/autoimprove/log.md
```

Steps per iteration:

1. **Creates** a fresh git worktree + branch (`autoimprove/experiment-NNN`)
2. **Proposes** one bounded improvement with an explicit hypothesis — *"I will fix the three unhandled promise rejections in `api/invoices.ts` because I expect it to reduce TypeScript errors and improve the type score by ~8 points"*
3. **Measures** the score inside the worktree (BEFORE)
4. **Implements** the change inside the worktree (surgical — 1–3 files at most)
5. **Measures** again (AFTER)
6. **Keeps** — squash-merges to main and deletes the worktree — if AFTER ≥ BEFORE
7. **Discards** — deletes the worktree and branch, main untouched — if AFTER < BEFORE
8. **Logs** the result to `.claude/autoimprove/log.md`
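
The keep-or-discard rule in steps 6 and 7 reduces to a single comparison. A minimal sketch, using a hypothetical `decide` function and a plain-text output format that only loosely imitates the progress line shown earlier:

```bash
# decide BEFORE AFTER: apply the rule from steps 6 and 7.
# A tie counts as a keep, because the condition is AFTER >= BEFORE.
decide() {
  local before="$1" after="$2" delta
  delta=$(( after - before ))
  if [ "$after" -ge "$before" ]; then
    printf '%d -> %d (%+d pts) KEPT\n' "$before" "$after" "$delta"
  else
    printf '%d -> %d (%+d pts) DISCARDED\n' "$before" "$after" "$delta"
  fi
}
```

For example, `decide 85 89` reports a kept experiment with `+4 pts`, while `decide 71 68` reports a discarded one with `-3 pts`.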

### 6. The log

After each iteration, `.claude/autoimprove/log.md` gets an entry like:

```
## Iteration 4 — 2026-03-11 02:14
**Hypothesis:** Replace 3 `any` types in convex/invoices.ts with proper TypeScript interfaces
**Branch:** autoimprove/experiment-004
**Files changed:** convex/invoices.ts
**Before:** 74/100 — type: 28, build: 20, tests: 18, lint: 8
**After:** 82/100 — type: 36, build: 20, tests: 18, lint: 8
**Decision:** KEPT ✅ (squash-merged to main, worktree deleted)
**Reason:** Eliminated 2 TS errors by typing the invoice mutation arguments properly
```
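
Because each entry uses a fixed `**Decision:**` line, the log is easy to summarise with standard tools. The helper below is a hypothetical sketch and not how `/autoimprove:status` is necessarily implemented:

```bash
# summarize_log LOG: count kept and discarded experiments in a log file
# that follows the entry format shown above.
summarize_log() {
  local log="$1" kept discarded
  kept=$(grep -c '^\*\*Decision:\*\* KEPT' "$log")
  discarded=$(grep -c '^\*\*Decision:\*\* DISCARDED' "$log")
  echo "kept=$kept discarded=$discarded"
}

# Usage:
# summarize_log .claude/autoimprove/log.md
```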

---

## Commands

| Command | Description |
|---|---|
| `/autoimprove:setup` | Detect stack, generate config, and run initial audit |
| `/autoimprove:audit` | Scan codebase for deficiencies and get a prioritized fix plan |
| `/autoimprove:improve [N] ["focus"]` | Run N iterations of the loop (default: 5), optionally focused on a specific task |
| `/autoimprove:continue [N] ["focus"]` | Resume an interrupted session — inherits remaining iterations and focus from the log |
| `/autoimprove:status` | Show a summary of all runs from `.claude/autoimprove/log.md` |
| `/autoimprove:upgrade` | Check for and install the latest version |

---

## Supported languages

| Language | Type check | Build | Tests | Lint |
|---|---|---|---|---|
| **TypeScript / JavaScript** | `tsc --noEmit` | `npm/pnpm/yarn/bun build` | jest / vitest / mocha | eslint |
| **Next.js / Nuxt / Remix / Astro** | `tsc --noEmit` | framework build cmd | jest / vitest | eslint |
| **Python** | mypy / pyright | — | pytest | ruff / flake8 / pylint |
| **Go** | `go build ./...` | `go build` | `go test ./...` | golangci-lint / `go vet` |
| **Rust** | `cargo check` | `cargo build` | `cargo test` | `cargo clippy` |
| **Ruby** | sorbet (if configured) | — | rspec / minitest | rubocop |
| **Java / Kotlin** | `mvn compile` / `./gradlew build` | same | `mvn test` / `./gradlew test` | checkstyle / ktlint |
| **C# / .NET** | `dotnet build` | `dotnet build` | `dotnet test` | `dotnet format --verify-no-changes` |
| **PHP** | phpstan | — | phpunit | phpcs |
| **Swift** | `swift build` | `swift build` | `swift test` | swiftlint |
| **Any Makefile project** | `make check` / `make typecheck` | `make build` | `make test` | `make lint` |

Don't see your stack? Edit `.claude/autoimprove/config.md` after setup to add your own commands.

---

## Customising .claude/autoimprove/config.md

After running `/autoimprove:setup`, edit the generated `.claude/autoimprove/config.md` to tailor the loop to your project:

```markdown
## Improvement Areas
- Check all Convex mutations have auth guards
- Replace fetch() calls with our internal apiClient wrapper
- Ensure every page component has a loading.tsx sibling

## Files to Never Modify
- convex/schema.ts
- src/generated/
- migrations/
- .env.local
```

You can also override any auto-detected command, change scoring weights, or add custom shell commands as additional metrics.
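
To make the "Files to Never Modify" section concrete, here is a hedged sketch of how a path could be checked against it. The `is_protected` function is hypothetical, and the matching rule (exact match, or anything under a listed directory) is an assumption about how the entries are interpreted.

```bash
# is_protected CONFIG PATH: succeed if PATH is listed under
# "## Files to Never Modify" in CONFIG, or lives under a listed directory.
# ASSUMPTION: entries are plain paths, one per "- " bullet, with no globs.
is_protected() {
  awk -v path="$2" '
    /^## / { in_section = ($0 == "## Files to Never Modify"); next }
    in_section && /^- / {
      entry = substr($0, 3)
      sub(/\/$/, "", entry)                 # treat "dir/" and "dir" alike
      if (path == entry || index(path, entry "/") == 1) found = 1
    }
    END { exit found ? 0 : 1 }
  ' "$1"
}

# Usage:
# is_protected .claude/autoimprove/config.md src/generated/api.ts && echo "skip this file"
```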

---

## Focused improvements

You can focus the loop on a specific task **directly from the command** — no config editing needed. Just pass a quoted string:

```bash
# Focus on type safety
/autoimprove:improve 10 "Replace all any types with proper TypeScript interfaces"

# Focus on a specific directory
/autoimprove:improve 5 "Fix all lint warnings in src/components/dashboard/"

# Focus on tests
/autoimprove:improve 10 "Add unit tests for every exported function in lib/billing/"

# Focus on a migration
/autoimprove:improve 20 "Replace all raw fetch() calls with the apiClient wrapper from lib/api-client.ts"
```

When a focus string is provided, **every iteration targets that task**. The loop breaks it into file-by-file sub-tasks and chips away one per iteration until the focus is fully addressed or iterations run out.

Without a focus string, the loop rotates through all areas listed in your `.claude/autoimprove/config.md` as usual.

### Alternative: edit the config

For recurring focus areas, you can also edit the `Improvement Areas` section in `.claude/autoimprove/config.md` directly:

```markdown
## Improvement Areas
- Replace every `any` type with a proper TypeScript interface or type alias
```

This is useful when you want the focus to persist across multiple sessions without re-typing it.

### Tips for focused runs

- **Be specific.** `"Fix type errors"` is vague. `"Replace any with proper types in convex/ mutations"` gives the loop a clear target.
- **One concern at a time** works best. The loop makes surgical 1–3 file changes per iteration — a narrow focus means every iteration chips away at the same problem.
- **Match iteration count to scope.** If you have ~20 files to fix, run `/autoimprove:improve 20 "..."` so each iteration can tackle one file.
- **Use "Files to Never Modify"** in the config to protect areas you don't want touched during a focused run.

---

## Resuming interrupted sessions

If your session gets interrupted (Ctrl+C, context limit, crash), you can pick up where you left off:

```bash
# Resume with remaining iterations and same focus
/autoimprove:continue

# Resume but only run 3 more iterations
/autoimprove:continue 3

# Resume with a different focus
/autoimprove:continue "New focus area"

# Override both
/autoimprove:continue 5 "Fix error handling in api/"
```

The continue command reads `.claude/autoimprove/log.md` to find the interrupted session, inherits its settings, and picks up from the next iteration. Iteration numbering continues seamlessly (e.g., if you completed 4/10, it resumes at 5/10).
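
Finding the resume point amounts to reading the last `## Iteration N` heading from the log. A small sketch of that idea, with a hypothetical `next_iteration` helper (the real command also recovers the focus string and the remaining count, which this does not attempt):

```bash
# next_iteration LOG: print the number of the iteration to run next,
# based on the last "## Iteration N" heading found in LOG.
next_iteration() {
  awk '/^## Iteration [0-9]+/ { last = $3 + 0 } END { print last + 1 }' "$1"
}

# Usage:
# next_iteration .claude/autoimprove/log.md
```

An empty log yields 1, so the same helper also covers a fresh session.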

If the codebase has changed since the interrupted session (you made manual commits), autoimprove will warn you and re-measure the baseline.

Check `/autoimprove:status` to see if you have an interrupted session to resume.

---

## What the loop improves

The loop rotates through these universal improvement areas (and adds language-specific ones based on your stack):

- **Type safety** — fix type errors, replace `any`/`interface{}`/untyped constructs
- **Error handling** — unhandled promises, bare `catch {}`, swallowed errors
- **Dead code** — unused imports, variables, unreachable branches
- **Code duplication** — extract repeated logic (3+ occurrences) into shared utilities
- **Naming & readability** — cryptic names, functions over ~50 lines
- **Performance** — N+1 query patterns, missing memoization, unnecessary allocations
- **Security** — hardcoded secrets, missing input validation, unguarded auth routes
- **Tests** — add a test for the most critical untested function, fix flaky tests

---

## Safety

The loop is designed to be safe to run unattended:

| Rule | Detail |
|---|---|
| 🔒 Never touches lock files | `package-lock.json`, `Cargo.lock`, `go.sum`, `Gemfile.lock`, etc. |
| 🔒 Never touches generated files | Migrations, protobuf output, OpenAPI generated code |
| 🔒 Never touches secrets | `.env`, `.env.local`, any secrets file |
| 🔒 Never deploys or publishes | No `git push`, `npm publish`, `cargo publish`, etc. |
| 🔒 Requires clean git state | Won't start if `git status` shows uncommitted changes |
| 🔒 Experiments in isolated worktrees | Each experiment is on its own branch; main changes only when a winning experiment is squash-merged |
| 🔒 Losers deleted, not rolled back | Failed experiments: worktree deleted, branch deleted, main untouched |
| 🔒 Winners squash-merged | One clean commit per winning experiment — easy to review with `git log` |
| 🔒 Pauses every 10 iterations | Cleans up worktrees, writes summary, waits for human review |

You always review and push. The loop commits winning experiments to your local main branch but never pushes on your behalf.
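
The clean-state rule from the table can be checked with `git status --porcelain`, which prints nothing when there are no uncommitted or untracked changes. A minimal sketch with a hypothetical function name and error message:

```bash
# require_clean_tree: fail unless the current repository has no uncommitted
# or untracked changes (empty `git status --porcelain` output).
require_clean_tree() {
  if [ -n "$(git status --porcelain)" ]; then
    echo "autoimprove: uncommitted changes found; commit or stash them first" >&2
    return 1
  fi
}

# Usage:
# require_clean_tree || exit 1
```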

---

## Plugin structure

```
autoimprove/
├── .claude-plugin/
│   ├── plugin.json             # Plugin manifest
│   └── hooks/
│       └── hooks.json          # SessionStart hook registration
├── hooks/
│   └── sessionstart.sh         # Update check on startup (once per day)
├── skills/
│   ├── audit/
│   │   └── SKILL.md            # Codebase deficiency scan, prioritized report, interactive fix loop
│   ├── detect-stack/
│   │   └── SKILL.md            # Fingerprints project, writes .claude/autoimprove/config.md
│   ├── worktree/
│   │   └── SKILL.md            # Creates/manages/cleans up git worktrees per experiment
│   ├── improve-loop/
│   │   └── SKILL.md            # Core loop: worktree → propose → implement → measure → merge/delete
│   ├── measure/
│   │   └── SKILL.md            # Internal scoring utility (used by audit and improve-loop)
│   └── rollback/
│       └── SKILL.md            # Emergency cleanup of all experiment worktrees
└── commands/
    ├── audit.md                # /autoimprove:audit
    ├── continue.md             # /autoimprove:continue [N] ["focus"]
    ├── setup.md                # /autoimprove:setup
    ├── improve.md              # /autoimprove:improve [N] ["focus"]
    ├── status.md               # /autoimprove:status
    └── upgrade.md              # /autoimprove:upgrade (check for updates)
```

---

## Example run

Here's what a real overnight session looks like. This is from a Next.js + Convex project starting at a score of 61/100:

```
## Iteration 1 — 23:04
**Hypothesis:** Replace 4 implicit `any` types in `convex/invoices.ts` with proper interfaces
**Files changed:** convex/invoices.ts
**Before:** 61/100 — type: 24, build: 20, tests: 10, lint: 7
**After:** 69/100 — type: 32, build: 20, tests: 10, lint: 7
**Decision:** KEPT ✅
**Reason:** Removed 4 TS7006 implicit-any errors by typing mutation arguments

## Iteration 5 — 23:37
**Hypothesis:** Move ExpenseList to a server component — it only reads data, no interactivity
**Branch:** autoimprove/experiment-005
**Files changed:** components/ExpenseList.tsx
**Before:** 71/100 — type: 32, build: 20, tests: 10, lint: 9
**After:** 68/100 — type: 26, build: 20, tests: 10, lint: 12
**Decision:** DISCARDED ❌ (worktree deleted, main untouched)
**Reason:** Removing "use client" broke useQuery hook — must stay client component.

## Iteration 8 — 00:02
**Hypothesis:** Add unit tests for calculateTaxEstimate() — most complex function, zero coverage
**Files changed:** lib/tax.test.ts (new)
**Before:** 78/100 — type: 36, build: 20, tests: 10, lint: 10
**After:** 84/100 — type: 36, build: 20, tests: 16, lint: 10
**Decision:** KEPT ✅
**Reason:** 2 new tests passing, covers basic and edge-case tax bracket logic

━━━ Session Complete ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Score: 61 → 84 (+23 pts)
🔁 Iterations: 10 total — 9 kept ✅, 1 discarded ❌
📝 Merged commits:
• abc1234 autoimprove(001): Replace 4 implicit any types
• def5678 autoimprove(002): Add error boundaries
• ...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

See [`autoimprove-log.example.md`](autoimprove-log.example.md) for the full 10-iteration session with summary table.

---

## Contributing

PRs welcome! Especially:
- New language profiles in `skills/detect-stack/SKILL.md`
- Better improvement area prompts for specific frameworks
- Example `.claude/autoimprove/config.md` files for common stacks

---

## License

MIT
