https://github.com/benmarte/autoimprove

# 🔁 autoimprove

### Autonomous codebase improvement loop for Claude Code

[![Claude Code](https://img.shields.io/badge/Claude%20Code-Plugin-blueviolet?logo=anthropic)](https://code.claude.com)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Languages](https://img.shields.io/badge/languages-10%2B-blue)](#supported-languages)

*Inspired by [karpathy/autoresearch](https://github.com/karpathy/autoresearch), but for any codebase, not just ML training loops.*

---

## What is this?

Karpathy's `autoresearch` lets an AI agent run ML experiments overnight: modify `train.py` → measure `val_bpb` → keep if better, discard if worse → repeat. You wake up to a log of experiments and a better model.

**autoimprove does the same thing for your codebase.**

Give Claude Code your project, run `/autoimprove:improve`, and let it iterate autonomously. It proposes a targeted change, scores your codebase before and after using your own tooling (TypeScript, `cargo clippy`, `pytest`, `golangci-lint`, or whatever you already have), keeps the changes that improve the score, reverts the ones that don't, and logs everything. You wake up to a readable log of what worked, what didn't, and a cleaner codebase.

```
propose → measure BEFORE → implement → measure AFTER → keep ✅ or discard ❌ → log → repeat
```


A real autoimprove session: 5 iterations, 3 wins, score 50 → 59 (+9 pts) in under 6 minutes

---

## Quick start

```bash
# 1. Add the marketplace and install the plugin
/plugin marketplace add benmarte/autoimprove
/plugin install autoimprove@autoimprove

# 2. Auto-detect your stack and see your codebase report
/autoimprove:setup

# 3. The audit shows what's wrong and offers to start fixing
# Or run the audit anytime for a fresh check
/autoimprove:audit

# 4. For unattended runs (e.g. overnight), use improve directly
/autoimprove:improve 20

# Or focus on a specific task
/autoimprove:improve 10 "Replace all any types with proper interfaces"

# 5. Review in the morning
cat .claude/autoimprove/log.md
git log --oneline # one commit per winning experiment
git show HEAD # inspect the latest win
```

That's it. No config required upfront: `/autoimprove:setup` fingerprints your project, writes `.claude/autoimprove/config.md`, and immediately runs an audit showing your codebase's deficiencies ranked by efficiency.

### Upgrading

#### If you already have the upgrade command

```bash
/autoimprove:upgrade
```

#### If you don't have the upgrade command (older installs)

The plugin system caches marketplace clones locally. If your install predates the upgrade command, you need to update the marketplace clone first:

```bash
# 1. Update the marketplace clone
cd ~/.claude/plugins/marketplaces/autoimprove && git pull origin main

# 2. Reinstall the plugin
/plugin update autoimprove@autoimprove
```

If `/plugin update` still shows "already at the latest version", uninstall and reinstall:

```bash
/plugin uninstall autoimprove@autoimprove
/plugin install autoimprove@autoimprove
```

After this, `/autoimprove:upgrade` will be available for all future updates.

### Auto-update check

autoimprove checks for new releases once per day on session start. If an update is available, you'll see:

```
Update available: v1.2.0 → v1.3.0
Run /autoimprove:upgrade to update.
```

The check is lightweight (single GitHub API call, 3s timeout, cached for 24 hours) and never blocks startup.
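
The caching behaviour described above can be sketched in a few lines. This is an illustration only, not the plugin's actual hook (which is a shell script); the function names and the way the last-check timestamp is passed in are assumptions.

```python
import time

CHECK_INTERVAL_SECONDS = 24 * 60 * 60  # "cached for 24 hours"
REQUEST_TIMEOUT_SECONDS = 3            # "3s timeout" (a real request would use this)


def should_check_for_update(last_check_ts, now_ts=None):
    """Return True when no check is cached or the cached one is at least 24h old."""
    if now_ts is None:
        now_ts = time.time()
    if last_check_ts is None:
        return True
    return (now_ts - last_check_ts) >= CHECK_INTERVAL_SECONDS


def parse_version(tag):
    """Turn 'v1.3.0' into (1, 3, 0) so versions compare numerically."""
    return tuple(int(part) for part in tag.lstrip("v").split("."))


def update_message(current, latest):
    """Build the session-start notice, or None when already up to date."""
    if parse_version(latest) <= parse_version(current):
        return None
    return f"Update available: {current} → {latest}\nRun /autoimprove:upgrade to update."


print(update_message("v1.2.0", "v1.3.0"))
```

Comparing parsed tuples rather than raw strings avoids treating `v1.10.0` as older than `v1.9.0`.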

---

## How it works

### 1. Setup (once per project)

`/autoimprove:setup` scans your project root to detect:
- Language and framework
- Package manager (`npm`, `cargo`, `poetry`, `uv`, etc.)
- Test runner (`pytest`, `jest`, `go test`, `rspec`, etc.)
- Type checker (`tsc`, `mypy`, `pyright`, etc.)
- Linter (`eslint`, `ruff`, `golangci-lint`, `rubocop`, etc.)

It writes a `.claude/autoimprove/config.md` file under your project root: a plain Markdown config that maps your specific tools to a **0–100 composite quality score**. You can edit this file to customise the loop for your project.
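
As a rough idea of how this kind of detection can work, the sketch below looks for well-known marker files in the project root. The marker table and the `detect_stack` helper are hypothetical and cover only a small subset; the real detection lives in the `detect-stack` skill and may work differently.

```python
from pathlib import Path

# Marker file -> (language, test command). A hypothetical subset for illustration.
MARKERS = {
    "Cargo.toml": ("Rust", "cargo test"),
    "go.mod": ("Go", "go test ./..."),
    "pyproject.toml": ("Python", "pytest"),
    "package.json": ("TypeScript / JavaScript", "npm test"),
    "Gemfile": ("Ruby", "rspec"),
}


def detect_stack(root):
    """Return one entry per marker file found directly under `root`."""
    root = Path(root)
    return [
        {"marker": marker, "language": language, "test_command": command}
        for marker, (language, command) in MARKERS.items()
        if (root / marker).is_file()
    ]


if __name__ == "__main__":
    for hit in detect_stack("."):
        print(f"{hit['language']}: {hit['test_command']} (found {hit['marker']})")
```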

### 2. Isolated experiments via git worktrees

Every experiment runs in a **separate git worktree** with its own directory and its own branch, completely isolated from your main codebase:

```
your-project/                        ← main branch (never touched during experiments)
  .claude/autoimprove/worktrees/     ← gitignored, auto-created
    experiment-001/                  ← branch: autoimprove/experiment-001
    experiment-002/                  ← branch: autoimprove/experiment-002
    experiment-003/                  ← branch: autoimprove/experiment-003
```

- ✅ Winning experiments get **squash-merged** back to main as a clean commit
- ❌ Losing experiments have the **worktree and branch deleted**, so nothing touches main
- 🔒 Your working directory is **read-only** for the entire session
- 🧹 All worktrees are cleaned up automatically at session end

No more `git checkout -- .` rollbacks. No risk of a broken experiment corrupting your codebase.
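
The worktree lifecycle maps onto a handful of standard git commands. The sketch below only builds the command lists, so it can be read without touching a repository; the helper names are hypothetical, and the exact commands the plugin runs are an assumption based on the description above.

```python
import subprocess

WORKTREE_ROOT = ".claude/autoimprove/worktrees"


def experiment_names(n):
    """Return (worktree path, branch name) for experiment number n."""
    name = f"experiment-{n:03d}"
    return f"{WORKTREE_ROOT}/{name}", f"autoimprove/{name}"


def create_cmd(n):
    """Create a new branch and check it out in its own worktree directory."""
    path, branch = experiment_names(n)
    return ["git", "worktree", "add", "-b", branch, path]


def discard_cmds(n):
    """Delete the worktree and its branch; main is not involved."""
    path, branch = experiment_names(n)
    return [
        ["git", "worktree", "remove", "--force", path],
        ["git", "branch", "-D", branch],
    ]


def keep_cmds(n, message):
    """Squash-merge the experiment into the current branch, then clean up."""
    _, branch = experiment_names(n)
    return [
        ["git", "merge", "--squash", branch],  # stages the changes, no commit yet
        ["git", "commit", "-m", message],      # one clean commit per win
        *discard_cmds(n),
    ]


def run_all(cmds, cwd="."):
    """Execute the commands in order, stopping at the first failure."""
    for cmd in cmds:
        subprocess.run(cmd, cwd=cwd, check=True)
```

For example, `run_all([create_cmd(1)])` would create `experiment-001`, and `run_all(discard_cmds(1))` would remove it again.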

### 3. The audit

Before diving into fixes, `/autoimprove:audit` scans your codebase and shows exactly what needs work:

```
━━━ Codebase Audit ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Current Score: 61/100

  Type safety: 24/40  ██████░░░░  (16 pts to max)
  Build:       20/20  ██████████  ✓ maxed
  Tests:       10/30  ███░░░░░░░  (20 pts to max)
  Lint:         7/10  ███████░░░  (3 pts to max)

━━━ Fastest Path to 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  #  Area         Gap    Issues       Est. iterations  Efficiency
  1  Type safety  16pts  8 errors     3 iterations     5.3 pts/iter  ← best
  2  Lint         3pts   2 warnings   1 iteration      3.0 pts/iter
  3  Tests        20pts  0/4 covered  7 iterations     2.9 pts/iter

  Total: ~11 iterations to reach 100/100
  ⚡ Estimated token usage: ~250K tokens (rough estimate, actual usage varies)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

The audit ranks areas by **efficiency** (points gained per iteration) so you fix the highest-impact issues first. It then offers to start fixing interactively, area by area, or you can run `/autoimprove:improve` directly.

Setup auto-runs the audit after generating your config, so first-time users see this report immediately.
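
The efficiency ranking in the sample report is simply the point gap divided by the estimated iterations. A minimal sketch using the numbers from the report above (the `rank_by_efficiency` helper is hypothetical, and how the audit estimates iterations is not shown here):

```python
def rank_by_efficiency(areas):
    """areas: list of (name, gap_points, estimated_iterations) tuples.

    Returns (name, gap, iterations, points_per_iteration) rows, best first.
    Areas that are already maxed (gap of 0) are skipped.
    """
    ranked = [
        (name, gap, iters, gap / iters)
        for name, gap, iters in areas
        if gap > 0 and iters > 0
    ]
    return sorted(ranked, key=lambda row: row[3], reverse=True)


areas = [
    ("Type safety", 16, 3),
    ("Tests", 20, 7),
    ("Lint", 3, 1),
    ("Build", 0, 0),  # already maxed, so it drops out of the plan
]

for name, gap, iters, efficiency in rank_by_efficiency(areas):
    print(f"{name:<12} {gap:>2} pts  {iters} iter  {efficiency:.1f} pts/iter")
```

Tests has the largest gap but ranks last, because each iteration is expected to recover fewer points there than in the other areas.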

### 4. The score

Every iteration, the loop measures your codebase on four axes:

| Metric | Weight | What it checks |
|---|---|---|
| **Type / compile errors** | 40 pts | `tsc --noEmit`, `cargo check`, `go build`, `mypy`, etc. |
| **Build success** | 20 pts | Does the project build without errors? |
| **Test pass rate** | 30 pts | `(passing / total) × 30` |
| **Lint errors** | 10 pts | `eslint`, `ruff`, `clippy`, `golangci-lint`, etc. |

If a metric doesn't apply (no tests yet, no linter configured), its weight is redistributed across the others.
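
One way to picture the composite score and the weight redistribution is the sketch below. It assumes each metric is first reduced to a fraction between 0 and 1, and that missing weights are redistributed proportionally; both are assumptions, since the exact formula lives in the `measure` skill.

```python
WEIGHTS = {"type": 40, "build": 20, "tests": 30, "lint": 10}


def composite_score(fractions):
    """Combine per-metric fractions (0.0 to 1.0) into a 0-100 score.

    A value of None means the metric does not apply; its weight is spread
    proportionally over the remaining metrics (an assumed policy).
    """
    active = {metric: value for metric, value in fractions.items() if value is not None}
    if not active:
        return 0.0
    scale = 100 / sum(WEIGHTS[metric] for metric in active)
    return sum(WEIGHTS[metric] * scale * value for metric, value in active.items())


# All four metrics apply: 24 + 20 + 10 + 7 points.
full = composite_score({"type": 0.6, "build": 1.0, "tests": 10 / 30, "lint": 0.7})

# No tests yet: the 30 test points are shared out across the other three.
no_tests = composite_score({"type": 0.6, "build": 1.0, "tests": None, "lint": 0.7})

print(round(full), round(no_tests, 1))
```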

### 5. The loop

Each iteration prints visible progress so you always know what's happening:

```
━━━ Iteration 1/5 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔬 PROPOSE: Targeting error handling in src/api/client.ts
🔬 SNAPSHOT: Measuring BEFORE score...
🔬 IMPLEMENT: Adding try/catch to unhandled async calls
🔬 MEASURE: Measuring AFTER score...
🔬 DECIDE: 85 → 89 (+4 pts) - KEPT ✅
🔬 LOG: Recorded to .claude/autoimprove/log.md
```

Steps per iteration:

1. **Creates** a fresh git worktree + branch (`autoimprove/experiment-NNN`)
2. **Proposes** one bounded improvement with an explicit hypothesis: *"I will fix the three unhandled promise rejections in `api/invoices.ts` because I expect it to reduce TypeScript errors and improve the type score by ~8 points"*
3. **Measures** the score inside the worktree (BEFORE)
4. **Implements** the change inside the worktree (surgical: 1–3 files at most)
5. **Measures** again (AFTER)
6. **Keeps** (squash-merges to main and deletes the worktree) if AFTER ≥ BEFORE
7. **Discards** (deletes the worktree and branch, main untouched) if AFTER < BEFORE
8. **Logs** the result to `.claude/autoimprove/log.md`
8. **Logs** the result to `.claude/autoimprove/log.md`
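
The measure, implement, measure, decide core of the steps above can be sketched as a small function. `measure` and `implement` are hypothetical callables standing in for the scoring run and the agent's edit; the keep rule (AFTER ≥ BEFORE, so ties are kept) follows steps 6 and 7.

```python
def run_iteration(measure, implement):
    """Run one experiment and decide whether to keep it.

    measure:   callable returning the current composite score
    implement: callable applying the proposed change inside the worktree
    """
    before = measure()
    implement()
    after = measure()
    decision = "KEPT" if after >= before else "DISCARDED"
    return {"before": before, "after": after, "delta": after - before, "decision": decision}


# Usage with stand-in callables that replay the scores from the sample output.
scores = iter([85, 89])
result = run_iteration(lambda: next(scores), lambda: None)
print(f"{result['before']} → {result['after']} ({result['delta']:+d} pts): {result['decision']}")
```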

### 6. The log

After each iteration, `.claude/autoimprove/log.md` gets an entry like:

```
## Iteration 4 - 2026-03-11 02:14
**Hypothesis:** Replace 3 `any` types in convex/invoices.ts with proper TypeScript interfaces
**Branch:** autoimprove/experiment-004
**Files changed:** convex/invoices.ts
**Before:** 74/100 - type: 28, build: 20, tests: 18, lint: 8
**After:** 82/100 - type: 36, build: 20, tests: 18, lint: 8
**Decision:** KEPT ✅ (squash-merged to main, worktree deleted)
**Reason:** Eliminated 2 TS errors by typing the invoice mutation arguments properly
```
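
For readers who want to post-process or generate entries in the same shape, here is a hedged sketch of a formatter. The `format_log_entry` function is hypothetical; the field names mirror the sample entry, and the separator characters may differ from what the plugin actually writes.

```python
def format_log_entry(iteration, timestamp, hypothesis, branch, files, before, after, reason):
    """Render one Markdown log entry; `before` and `after` map metric name -> points."""

    def score_line(scores):
        parts = ", ".join(f"{name}: {pts}" for name, pts in scores.items())
        return f"{sum(scores.values())}/100 - {parts}"

    kept = sum(after.values()) >= sum(before.values())
    decision = (
        "KEPT ✅ (squash-merged to main, worktree deleted)"
        if kept
        else "DISCARDED ❌ (worktree deleted, main untouched)"
    )
    return "\n".join([
        f"## Iteration {iteration} - {timestamp}",
        f"**Hypothesis:** {hypothesis}",
        f"**Branch:** {branch}",
        f"**Files changed:** {', '.join(files)}",
        f"**Before:** {score_line(before)}",
        f"**After:** {score_line(after)}",
        f"**Decision:** {decision}",
        f"**Reason:** {reason}",
    ])


print(format_log_entry(
    4, "2026-03-11 02:14",
    "Replace 3 `any` types in convex/invoices.ts with proper TypeScript interfaces",
    "autoimprove/experiment-004",
    ["convex/invoices.ts"],
    {"type": 28, "build": 20, "tests": 18, "lint": 8},
    {"type": 36, "build": 20, "tests": 18, "lint": 8},
    "Eliminated 2 TS errors by typing the invoice mutation arguments properly",
))
```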

---

## Commands

| Command | Description |
|---|---|
| `/autoimprove:setup` | Detect stack, generate config, and run initial audit |
| `/autoimprove:audit` | Scan codebase for deficiencies and get a prioritized fix plan |
| `/autoimprove:improve [N] ["focus"]` | Run N iterations of the loop (default: 5), optionally focused on a specific task |
| `/autoimprove:continue [N] ["focus"]` | Resume an interrupted session โ€” inherits remaining iterations and focus from the log |
| `/autoimprove:status` | Show a summary of all runs from `.claude/autoimprove/log.md` |
| `/autoimprove:upgrade` | Check for and install the latest version |

---

## Supported languages

| Language | Type check | Build | Tests | Lint |
|---|---|---|---|---|
| **TypeScript / JavaScript** | `tsc --noEmit` | `npm/pnpm/yarn/bun build` | jest / vitest / mocha | eslint |
| **Next.js / Nuxt / Remix / Astro** | `tsc --noEmit` | framework build cmd | jest / vitest | eslint |
| **Python** | mypy / pyright | n/a | pytest | ruff / flake8 / pylint |
| **Go** | `go build ./...` | `go build` | `go test ./...` | golangci-lint / `go vet` |
| **Rust** | `cargo check` | `cargo build` | `cargo test` | `cargo clippy` |
| **Ruby** | sorbet (if configured) | n/a | rspec / minitest | rubocop |
| **Java / Kotlin** | `mvn compile` / `./gradlew build` | same | `mvn test` / `./gradlew test` | checkstyle / ktlint |
| **C# / .NET** | `dotnet build` | `dotnet build` | `dotnet test` | `dotnet format --verify-no-changes` |
| **PHP** | phpstan | n/a | phpunit | phpcs |
| **Swift** | `swift build` | `swift build` | `swift test` | swiftlint |
| **Any Makefile project** | `make check` / `make typecheck` | `make build` | `make test` | `make lint` |

Don't see your stack? Edit `.claude/autoimprove/config.md` after setup to add your own commands.

---

## Customising .claude/autoimprove/config.md

After running `/autoimprove:setup`, edit the generated `.claude/autoimprove/config.md` to tailor the loop to your project:

```markdown
## Improvement Areas
- Check all Convex mutations have auth guards
- Replace fetch() calls with our internal apiClient wrapper
- Ensure every page component has a loading.tsx sibling

## Files to Never Modify
- convex/schema.ts
- src/generated/
- migrations/
- .env.local
```

You can also override any auto-detected command, change scoring weights, or add custom shell commands as additional metrics.
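
To illustrate how a plain Markdown config like the one above can drive behaviour, the sketch below reads the `Files to Never Modify` section and checks a path against it. Both helpers are hypothetical, and treating entries ending in `/` as directory prefixes is an assumption about how the list is interpreted.

```python
def parse_never_modify(config_text):
    """Collect the bullet entries under the '## Files to Never Modify' heading."""
    entries, in_section = [], False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            in_section = stripped == "## Files to Never Modify"
        elif in_section and stripped.startswith("- "):
            entries.append(stripped[2:].strip())
    return entries


def is_protected(path, entries):
    """Entries ending in '/' protect a whole directory; others match one file."""
    for entry in entries:
        if entry.endswith("/"):
            if path.startswith(entry):
                return True
        elif path == entry:
            return True
    return False


CONFIG = """\
## Improvement Areas
- Check all Convex mutations have auth guards

## Files to Never Modify
- convex/schema.ts
- src/generated/
- migrations/
- .env.local
"""

protected = parse_never_modify(CONFIG)
print(is_protected("src/generated/api.ts", protected))  # inside a protected directory
print(is_protected("src/app/page.tsx", protected))      # not listed
```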

---

## Focused improvements

You can focus the loop on a specific task **directly from the command** โ€” no config editing needed. Just pass a quoted string:

```bash
# Focus on type safety
/autoimprove:improve 10 "Replace all any types with proper TypeScript interfaces"

# Focus on a specific directory
/autoimprove:improve 5 "Fix all lint warnings in src/components/dashboard/"

# Focus on tests
/autoimprove:improve 10 "Add unit tests for every exported function in lib/billing/"

# Focus on a migration
/autoimprove:improve 20 "Replace all raw fetch() calls with the apiClient wrapper from lib/api-client.ts"
```

When a focus string is provided, **every iteration targets that task**. The loop breaks it into file-by-file sub-tasks and chips away one per iteration until the focus is fully addressed or iterations run out.

Without a focus string, the loop rotates through all areas listed in your `.claude/autoimprove/config.md` as usual.
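
The `[N] ["focus"]` argument shape is easy to model. The following sketch shows one plausible way to split the arguments, with the default of 5 iterations taken from the Commands table; `parse_improve_args` is a hypothetical helper, not the plugin's parser.

```python
import shlex

DEFAULT_ITERATIONS = 5  # default documented for /autoimprove:improve


def parse_improve_args(raw):
    """Split '[N] ["focus"]' into (iterations, focus); either part may be omitted."""
    tokens = shlex.split(raw)
    iterations, focus = DEFAULT_ITERATIONS, None
    if tokens and tokens[0].isdigit():
        iterations = int(tokens.pop(0))
    if tokens:
        focus = " ".join(tokens)
    return iterations, focus


print(parse_improve_args('10 "Replace all any types with proper interfaces"'))
print(parse_improve_args(""))
```

With no focus, the second element is `None`, which corresponds to the rotate-through-all-areas behaviour.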

### Alternative: edit the config

For recurring focus areas, you can also edit the `Improvement Areas` section in `.claude/autoimprove/config.md` directly:

```markdown
## Improvement Areas
- Replace every `any` type with a proper TypeScript interface or type alias
```

This is useful when you want the focus to persist across multiple sessions without re-typing it.

### Tips for focused runs

- **Be specific.** `"Fix type errors"` is vague. `"Replace any with proper types in convex/ mutations"` gives the loop a clear target.
- **One concern at a time** works best. The loop makes surgical 1–3 file changes per iteration, so a narrow focus means every iteration chips away at the same problem.
- **Match iteration count to scope.** If you have ~20 files to fix, run `/autoimprove:improve 20 "..."` so each iteration can tackle one file.
- **Use "Files to Never Modify"** in the config to protect areas you don't want touched during a focused run.

---

## Resuming interrupted sessions

If your session gets interrupted (Ctrl+C, context limit, crash), you can pick up where you left off:

```bash
# Resume with remaining iterations and same focus
/autoimprove:continue

# Resume but only run 3 more iterations
/autoimprove:continue 3

# Resume with a different focus
/autoimprove:continue "New focus area"

# Override both
/autoimprove:continue 5 "Fix error handling in api/"
```

The continue command reads `.claude/autoimprove/log.md` to find the interrupted session, inherits its settings, and picks up from the next iteration. Iteration numbering continues seamlessly (e.g., if you completed 4/10, it resumes at 5/10).
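
The resume arithmetic is simple enough to show directly. This is a sketch under the assumption that an explicit `N` replaces the inherited remaining count; `resume_plan` is a hypothetical helper.

```python
def resume_plan(completed, total, override_iterations=None):
    """Work out where an interrupted session picks up.

    completed: iterations finished before the interruption
    total:     iterations originally requested
    override_iterations: optional N passed to /autoimprove:continue
    """
    remaining = total - completed
    if remaining <= 0:
        raise ValueError("session already finished; nothing to resume")
    to_run = override_iterations if override_iterations is not None else remaining
    return {"next_iteration": completed + 1, "to_run": to_run}


print(resume_plan(4, 10))     # inherits the remaining iterations
print(resume_plan(4, 10, 3))  # explicit N takes precedence
```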

If the codebase has changed since the interrupted session (you made manual commits), autoimprove will warn you and re-measure the baseline.

Check `/autoimprove:status` to see if you have an interrupted session to resume.

---

## What the loop improves

The loop rotates through these universal improvement areas (and adds language-specific ones based on your stack):

- **Type safety**: fix type errors, replace `any`/`interface{}`/untyped constructs
- **Error handling**: unhandled promises, bare `catch {}`, swallowed errors
- **Dead code**: unused imports, variables, unreachable branches
- **Code duplication**: extract repeated logic (3+ occurrences) into shared utilities
- **Naming & readability**: cryptic names, functions over ~50 lines
- **Performance**: N+1 query patterns, missing memoization, unnecessary allocations
- **Security**: hardcoded secrets, missing input validation, unguarded auth routes
- **Tests**: add a test for the most critical untested function, fix flaky tests

---

## Safety

The loop is designed to be safe to run unattended:

| Rule | Detail |
|---|---|
| 🔒 Never touches lock files | `package-lock.json`, `Cargo.lock`, `go.sum`, `Gemfile.lock`, etc. |
| 🔒 Never touches generated files | Migrations, protobuf output, OpenAPI generated code |
| 🔒 Never touches secrets | `.env`, `.env.local`, any secrets file |
| 🔒 Never deploys or publishes | No `git push`, `npm publish`, `cargo publish`, etc. |
| 🔒 Requires clean git state | Won't start if `git status` shows uncommitted changes |
| 🔒 Experiments in isolated worktrees | Each experiment is on its own branch; main is never modified mid-session |
| 🔒 Losers deleted, not rolled back | Failed experiments: worktree deleted, branch deleted, main untouched |
| 🔒 Winners squash-merged | One clean commit per winning experiment, easy to review with `git log` |
| 🔒 Pauses every 10 iterations | Cleans up worktrees, writes summary, waits for human review |

You always review and push: the loop commits winning experiments locally as squash merges, but it never pushes on your behalf.
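
Two of the rules above, the clean-state check and the pause every 10 iterations, are straightforward to sketch. The helper names are hypothetical, and counting untracked files as "uncommitted changes" is an assumption.

```python
import subprocess

PAUSE_EVERY = 10  # "Pauses every 10 iterations"


def is_clean(porcelain_output):
    """`git status --porcelain` prints nothing when the working tree is clean."""
    return porcelain_output.strip() == ""


def working_tree_is_clean(repo="."):
    """Ask git for machine-readable status and check that it is empty."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo, check=True, capture_output=True, text=True,
    )
    return is_clean(result.stdout)


def should_pause(iteration):
    """True after iterations 10, 20, 30, and so on."""
    return iteration > 0 and iteration % PAUSE_EVERY == 0
```

A guard such as `if not working_tree_is_clean(): raise SystemExit("commit or stash first")` at the start of a session would implement the "requires clean git state" rule.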

---

## Plugin structure

```
autoimprove/
├── .claude-plugin/
│   ├── plugin.json          # Plugin manifest
│   └── hooks/
│       └── hooks.json       # SessionStart hook registration
├── hooks/
│   └── sessionstart.sh      # Update check on startup (once per day)
├── skills/
│   ├── audit/
│   │   └── SKILL.md         # Codebase deficiency scan, prioritized report, interactive fix loop
│   ├── detect-stack/
│   │   └── SKILL.md         # Fingerprints project, writes .claude/autoimprove/config.md
│   ├── worktree/
│   │   └── SKILL.md         # Creates/manages/cleans up git worktrees per experiment
│   ├── improve-loop/
│   │   └── SKILL.md         # Core loop: worktree → propose → implement → measure → merge/delete
│   ├── measure/
│   │   └── SKILL.md         # Internal scoring utility (used by audit and improve-loop)
│   └── rollback/
│       └── SKILL.md         # Emergency cleanup of all experiment worktrees
└── commands/
    ├── audit.md             # /autoimprove:audit
    ├── continue.md          # /autoimprove:continue [N] ["focus"]
    ├── setup.md             # /autoimprove:setup
    ├── improve.md           # /autoimprove:improve [N] ["focus"]
    ├── status.md            # /autoimprove:status
    └── upgrade.md           # /autoimprove:upgrade (check for updates)
```

---

## Example run

Here's what a real overnight session looks like. This is from a Next.js + Convex project starting at a score of 61/100:

```
## Iteration 1 - 23:04
**Hypothesis:** Replace 4 implicit `any` types in `convex/invoices.ts` with proper interfaces
**Files changed:** convex/invoices.ts
**Before:** 61/100 - type: 24, build: 20, tests: 10, lint: 7
**After:** 69/100 - type: 32, build: 20, tests: 10, lint: 7
**Decision:** KEPT ✅
**Reason:** Removed 4 TS7006 implicit-any errors by typing mutation arguments

## Iteration 5 - 23:37
**Hypothesis:** Move ExpenseList to a server component (it only reads data, no interactivity)
**Branch:** autoimprove/experiment-005
**Files changed:** components/ExpenseList.tsx
**Before:** 71/100 - type: 32, build: 20, tests: 10, lint: 9
**After:** 68/100 - type: 26, build: 20, tests: 10, lint: 12
**Decision:** DISCARDED ❌ (worktree deleted, main untouched)
**Reason:** Removing "use client" broke useQuery hook; must stay client component.

## Iteration 8 - 00:02
**Hypothesis:** Add unit tests for calculateTaxEstimate() (most complex function, zero coverage)
**Files changed:** lib/tax.test.ts (new)
**Before:** 78/100 - type: 36, build: 20, tests: 10, lint: 10
**After:** 84/100 - type: 36, build: 20, tests: 16, lint: 10
**Decision:** KEPT ✅
**Reason:** 2 new tests passing, covers basic and edge-case tax bracket logic

━━━ Session Complete ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Score: 61 → 84 (+23 pts)
🔁 Iterations: 10 total - 9 kept ✅, 1 discarded ❌
📝 Merged commits:
  • abc1234 autoimprove(001): Replace 4 implicit any types
  • def5678 autoimprove(002): Add error boundaries
  • ...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

See [`autoimprove-log.example.md`](autoimprove-log.example.md) for the full 10-iteration session with summary table.

---

## Contributing

PRs welcome! Especially:
- New language profiles in `detect-stack/SKILL.md`
- Better improvement area prompts for specific frameworks
- Example `.claude/autoimprove/config.md` files for common stacks

---

## License

MIT