https://github.com/benmarte/autoimprove
Autonomous codebase improvement loop for Claude Code
- Host: GitHub
- URL: https://github.com/benmarte/autoimprove
- Owner: benmarte
- License: mit
- Created: 2026-03-11T12:40:57.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-03-13T20:17:59.000Z (about 1 month ago)
- Last Synced: 2026-03-15T09:55:57.164Z (about 1 month ago)
- Topics: ai, ai-skill, claude, claude-code, claude-code-plugin, claude-skills
- Language: Shell
- Homepage:
- Size: 472 KB
- Stars: 5
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# autoimprove
### Autonomous codebase improvement loop for Claude Code
*Inspired by [karpathy/autoresearch](https://github.com/karpathy/autoresearch), but for any codebase, not just ML training loops.*
---
## What is this?
Karpathy's `autoresearch` lets an AI agent run ML experiments overnight: modify `train.py` → measure `val_bpb` → keep if better, discard if worse → repeat. You wake up to a log of experiments and a better model.
**autoimprove does the same thing for your codebase.**
Give Claude Code your project, run `/autoimprove:improve`, and let it iterate autonomously. It proposes a targeted change, scores your codebase before and after using your own tooling (TypeScript, `cargo clippy`, `pytest`, `golangci-lint`, or whatever you already have), keeps the changes that improve the score, reverts the ones that don't, and logs everything. You wake up to a readable log of what worked, what didn't, and a cleaner codebase.
```
propose → measure BEFORE → implement → measure AFTER → keep ✅ or discard ❌ → log → repeat
```
A real autoimprove session: 5 iterations, 3 wins, score 50 → 59 (+9 pts) in under 6 minutes
---
## Quick start
```bash
# 1. Add the marketplace and install the plugin
/plugin marketplace add benmarte/autoimprove
/plugin install autoimprove@autoimprove
# 2. Auto-detect your stack and see your codebase report
/autoimprove:setup
# 3. The audit shows what's wrong and offers to start fixing
# Or run the audit anytime for a fresh check
/autoimprove:audit
# 4. For unattended runs (e.g. overnight), use improve directly
/autoimprove:improve 20
# Or focus on a specific task
/autoimprove:improve 10 "Replace all any types with proper interfaces"
# 5. Review in the morning
cat .claude/autoimprove/log.md
git log --oneline # one commit per winning experiment
git show HEAD # inspect the latest win
```
That's it. No config is required upfront: `/autoimprove:setup` fingerprints your project, writes `.claude/autoimprove/config.md`, and immediately runs an audit that lists your codebase's deficiencies ranked by efficiency (points gained per iteration).
### Upgrading
#### If you already have the upgrade command
```bash
/autoimprove:upgrade
```
#### If you don't have the upgrade command (older installs)
The plugin system caches marketplace clones locally. If your install predates the upgrade command, you need to update the marketplace clone first:
```bash
# 1. Update the marketplace clone
cd ~/.claude/plugins/marketplaces/autoimprove && git pull origin main
# 2. Reinstall the plugin
/plugin update autoimprove@autoimprove
```
If `/plugin update` still shows "already at the latest version", uninstall and reinstall:
```bash
/plugin uninstall autoimprove@autoimprove
/plugin install autoimprove@autoimprove
```
After this, `/autoimprove:upgrade` will be available for all future updates.
### Auto-update check
autoimprove checks for new releases once per day on session start. If an update is available, you'll see:
```
Update available: v1.2.0 → v1.3.0
Run /autoimprove:upgrade to update.
```
The check is lightweight (single GitHub API call, 3s timeout, cached for 24 hours) and never blocks startup.
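The once-per-day throttle described above can be sketched as follows. This is a Python illustration of the logic only, not the plugin's actual hook (which is a shell script). The function names and the idea of storing a last-check timestamp are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

CHECK_INTERVAL = timedelta(hours=24)  # "cached for 24 hours" per the text above


def should_check_for_update(last_check: Optional[datetime], now: datetime) -> bool:
    """Return True when nothing is cached or the cached check is at least 24 hours old."""
    if last_check is None:
        return True
    return now - last_check >= CHECK_INTERVAL


def is_newer(latest: str, installed: str) -> bool:
    """Compare dotted versions such as 'v1.3.0' and 'v1.2.0' numerically, not as strings."""
    def parse(version: str) -> tuple:
        return tuple(int(part) for part in version.lstrip("v").split("."))
    return parse(latest) > parse(installed)
```

Comparing parsed integer tuples rather than raw strings avoids ranking `v1.10.0` below `v1.9.0`.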
---
## How it works
### 1. Setup (once per project)
`/autoimprove:setup` scans your project root to detect:
- Language and framework
- Package manager (`npm`, `cargo`, `poetry`, `uv`, etc.)
- Test runner (`pytest`, `jest`, `go test`, `rspec`, etc.)
- Type checker (`tsc`, `mypy`, `pyright`, etc.)
- Linter (`eslint`, `ruff`, `golangci-lint`, `rubocop`, etc.)
It writes a `.claude/autoimprove/config.md` file under your project root: a plain Markdown config that maps your specific tools to a **0–100 composite quality score**. You can edit this file to customise the loop for your project.
### 2. Isolated experiments via git worktrees
Every experiment runs in a **separate git worktree**, with its own directory and its own branch, completely isolated from your main codebase:
```
your-project/                        ← main branch (never touched during experiments)
  .claude/autoimprove/worktrees/     ← gitignored, auto-created
    experiment-001/                  ← branch: autoimprove/experiment-001
    experiment-002/                  ← branch: autoimprove/experiment-002
    experiment-003/                  ← branch: autoimprove/experiment-003
```
- ✅ Winning experiments get **squash-merged** back to main as a clean commit
- ❌ Losing experiments have the **worktree and branch deleted**, so nothing touches main
- 🔒 Your working directory is **read-only** for the entire session
- 🧹 All worktrees are cleaned up automatically at session end
No more `git checkout -- .` rollbacks. No risk of a broken experiment corrupting your codebase.
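For readers who want to see the worktree lifecycle in concrete terms, here is a minimal sketch that builds the git command lines for creating, keeping, and discarding an experiment. The git subcommands are standard, but the exact sequence the plugin runs is an assumption, and the helper names are hypothetical.

```python
from typing import List

WORKTREE_ROOT = ".claude/autoimprove/worktrees"  # directory layout shown above


def experiment_names(n: int) -> tuple:
    """Return (worktree path, branch name) for experiment n, zero-padded to 3 digits."""
    name = f"experiment-{n:03d}"
    return f"{WORKTREE_ROOT}/{name}", f"autoimprove/{name}"


def create_commands(n: int) -> List[List[str]]:
    path, branch = experiment_names(n)
    # `git worktree add -b <branch> <path>` creates the branch and checks it out in <path>.
    return [["git", "worktree", "add", "-b", branch, path]]


def discard_commands(n: int) -> List[List[str]]:
    path, branch = experiment_names(n)
    return [
        ["git", "worktree", "remove", "--force", path],  # drop the experiment directory
        ["git", "branch", "-D", branch],                 # drop the experiment branch
    ]


def keep_commands(n: int, message: str) -> List[List[str]]:
    _, branch = experiment_names(n)
    return [
        ["git", "merge", "--squash", branch],  # stage the experiment's changes on main
        ["git", "commit", "-m", message],      # one commit per winning experiment
    ] + discard_commands(n)                    # then clean up the worktree and branch
```

Each inner list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Returning the commands instead of running them keeps the sketch easy to inspect and dry-run.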
### 3. The audit
Before diving into fixes, `/autoimprove:audit` scans your codebase and shows exactly what needs work:
```
━━━ Codebase Audit ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Current Score: 61/100
  Type safety: 24/40  ██████░░░░  (16 pts to max)
  Build:       20/20  ██████████  maxed
  Tests:       10/30  ███░░░░░░░  (20 pts to max)
  Lint:         7/10  ███████░░░  (3 pts to max)
━━━ Fastest Path to 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
#  Area         Gap    Issues       Est. iterations  Efficiency
1  Type safety  16pts  8 errors     3 iterations     5.3 pts/iter  ← best
2  Lint         3pts   2 warnings   1 iteration      3.0 pts/iter
3  Tests        20pts  0/4 covered  7 iterations     2.9 pts/iter
Total: ~11 iterations to reach 100/100
⚡ Estimated token usage: ~250K tokens (rough estimate, actual usage varies)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
The audit ranks areas by **efficiency** (points gained per iteration) so you fix the highest-impact issues first. It then offers to start fixing interactively, area by area, or you can run `/autoimprove:improve` directly.
Setup auto-runs the audit after generating your config, so first-time users see this report immediately.
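The efficiency ranking in the audit report is simple arithmetic: gap in points divided by estimated iterations. A minimal sketch, using the numbers from the sample report above (the `Area` class and function name are hypothetical, not part of the plugin):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Area:
    name: str
    gap_points: int      # points still missing in this area
    est_iterations: int  # estimated iterations needed to close the gap

    @property
    def efficiency(self) -> float:
        return self.gap_points / self.est_iterations


def rank_by_efficiency(areas: List[Area]) -> List[Area]:
    """Skip areas with nothing left to gain, then sort by points per iteration, best first."""
    open_areas = [a for a in areas if a.gap_points > 0 and a.est_iterations > 0]
    return sorted(open_areas, key=lambda a: a.efficiency, reverse=True)


areas = [
    Area("Type safety", 16, 3),
    Area("Build", 0, 0),  # already maxed, so it is skipped
    Area("Tests", 20, 7),
    Area("Lint", 3, 1),
]
for rank, area in enumerate(rank_by_efficiency(areas), start=1):
    print(f"{rank} {area.name:<12} {area.efficiency:.1f} pts/iter")
```

The resulting order (Type safety, Lint, Tests) matches the sample report, and the estimated iterations of the ranked areas sum to the report's total of 11.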
### 4. The score
Every iteration, the loop measures your codebase on four axes:
| Metric | Weight | What it checks |
|---|---|---|
| **Type / compile errors** | 40 pts | `tsc --noEmit`, `cargo check`, `go build`, `mypy`, etc. |
| **Build success** | 20 pts | Does the project build without errors? |
| **Test pass rate** | 30 pts | `(passing / total) × 30` |
| **Lint errors** | 10 pts | `eslint`, `ruff`, `clippy`, `golangci-lint`, etc. |
If a metric doesn't apply (no tests yet, no linter configured), its weight is redistributed across the others.
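As a rough illustration of how the four weights combine, the sketch below computes a composite score from per-metric fractions. The README does not say how a missing metric's weight is shared out, so the proportional split used here is an assumption, as are the function names.

```python
from typing import Dict, Optional

WEIGHTS = {"type": 40, "build": 20, "tests": 30, "lint": 10}  # from the table above


def pass_rate(passing: int, total: int) -> Optional[float]:
    """Fraction of passing tests, or None when the project has no tests yet."""
    return passing / total if total else None


def composite_score(fractions: Dict[str, Optional[float]]) -> float:
    """Combine per-metric fractions (0.0 to 1.0) into a 0-100 score.

    A metric mapped to None (or missing) counts as not applicable. Its weight
    is shared among the remaining metrics in proportion to their own weights,
    which is an assumption of this sketch.
    """
    active = {m: w for m, w in WEIGHTS.items() if fractions.get(m) is not None}
    total_weight = sum(active.values())
    if total_weight == 0:
        return 0.0
    scale = 100 / total_weight
    return sum(fractions[m] * w * scale for m, w in active.items())
```

With fractions of 0.6 (type), 1.0 (build), 1/3 (tests), and 0.7 (lint), this gives the 61/100 shown in the sample audit. If the same project had no tests, the remaining 70 points of weight would be rescaled to 100.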
### 5. The loop
Each iteration prints visible progress so you always know what's happening:
```
━━━ Iteration 1/5 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PROPOSE: Targeting error handling in src/api/client.ts
SNAPSHOT: Measuring BEFORE score...
IMPLEMENT: Adding try/catch to unhandled async calls
MEASURE: Measuring AFTER score...
DECIDE: 85 → 89 (+4 pts) → KEPT ✅
LOG: Recorded to .claude/autoimprove/log.md
```
Steps per iteration:
1. **Creates** a fresh git worktree + branch (`autoimprove/experiment-NNN`)
2. **Proposes** one bounded improvement with an explicit hypothesis, for example *"I will fix the three unhandled promise rejections in `api/invoices.ts` because I expect it to reduce TypeScript errors and improve the type score by ~8 points"*
3. **Measures** the score inside the worktree (BEFORE)
4. **Implements** the change inside the worktree (surgical: 1–3 files at most)
5. **Measures** again (AFTER)
6. **Keeps** (squash-merges to main and deletes the worktree) if AFTER ≥ BEFORE
7. **Discards** (deletes the worktree and branch, main untouched) if AFTER < BEFORE
8. **Logs** the result to `.claude/autoimprove/log.md`
8. **Logs** the result to `.claude/autoimprove/log.md`
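The keep-or-discard rule in the steps above can be sketched as a small loop. This is a toy model, not the plugin's implementation: `measure`, `implement`, and `revert` are hypothetical callables standing in for the scoring run, the code change, and the worktree deletion.

```python
from typing import Callable, List, Tuple


def run_loop(
    iterations: int,
    measure: Callable[[], int],
    implement: Callable[[int], None],
    revert: Callable[[int], None],
) -> List[Tuple[int, int, int, str]]:
    """Run the measure/implement/measure/decide cycle and return one
    (iteration, before, after, decision) tuple per iteration."""
    results = []
    for i in range(1, iterations + 1):
        before = measure()
        implement(i)
        after = measure()
        if after >= before:  # step 6: keep when AFTER >= BEFORE
            decision = "KEPT"
        else:                # step 7: discard when AFTER < BEFORE
            revert(i)
            decision = "DISCARDED"
        results.append((i, before, after, decision))
    return results


# Toy state: each "experiment" shifts the score by a fixed delta.
state = {"score": 61}
deltas = {1: 8, 2: -3, 3: 0}
history = run_loop(
    3,
    measure=lambda: state["score"],
    implement=lambda i: state.update(score=state["score"] + deltas[i]),
    revert=lambda i: state.update(score=state["score"] - deltas[i]),
)
```

Note that under the `≥` rule an experiment that leaves the score unchanged is still kept, as iteration 3 in the toy data shows.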
### 6. The log
After each iteration, `.claude/autoimprove/log.md` gets an entry like:
```
## Iteration 4 - 2026-03-11 02:14
**Hypothesis:** Replace 3 `any` types in convex/invoices.ts with proper TypeScript interfaces
**Branch:** autoimprove/experiment-004
**Files changed:** convex/invoices.ts
**Before:** 74/100 - type: 28, build: 20, tests: 18, lint: 8
**After:** 82/100 - type: 36, build: 20, tests: 18, lint: 8
**Decision:** KEPT ✅ (squash-merged to main, worktree deleted)
**Reason:** Eliminated 2 TS errors by typing the invoice mutation arguments properly
```
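Because the log is plain Markdown, its score lines are easy to post-process. The following is a minimal sketch of a parser for the `**Before:**` and `**After:**` lines, assuming the field layout shown in the entry above. The function name is hypothetical, and the parser ignores whatever separator sits between the total and the breakdown.

```python
import re
from typing import Dict

SCORE_LINE = re.compile(r"\*\*(Before|After):\*\*\s*(\d+)/100(.*)")


def parse_score_line(line: str) -> Dict[str, object]:
    """Extract the label, total, and per-metric points from one score line."""
    match = SCORE_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a score line: {line!r}")
    label, total, rest = match.groups()
    # Collect every "name: number" pair in the breakdown part of the line.
    parts = {name: int(points) for name, points in re.findall(r"(\w+):\s*(\d+)", rest)}
    return {"label": label, "total": int(total), "parts": parts}


before = parse_score_line("**Before:** 74/100 - type: 28, build: 20, tests: 18, lint: 8")
after = parse_score_line("**After:** 82/100 - type: 36, build: 20, tests: 18, lint: 8")
delta = after["total"] - before["total"]
```

Summing `parts` and comparing it with `total` is a cheap consistency check when reviewing a long log.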
---
## Commands
| Command | Description |
|---|---|
| `/autoimprove:setup` | Detect stack, generate config, and run initial audit |
| `/autoimprove:audit` | Scan codebase for deficiencies and get a prioritized fix plan |
| `/autoimprove:improve [N] ["focus"]` | Run N iterations of the loop (default: 5), optionally focused on a specific task |
| `/autoimprove:continue [N] ["focus"]` | Resume an interrupted session; inherits remaining iterations and focus from the log |
| `/autoimprove:status` | Show a summary of all runs from `.claude/autoimprove/log.md` |
| `/autoimprove:upgrade` | Check for and install the latest version |
---
## Supported languages
| Language | Type check | Build | Tests | Lint |
|---|---|---|---|---|
| **TypeScript / JavaScript** | `tsc --noEmit` | `npm/pnpm/yarn/bun build` | jest / vitest / mocha | eslint |
| **Next.js / Nuxt / Remix / Astro** | `tsc --noEmit` | framework build cmd | jest / vitest | eslint |
| **Python** | mypy / pyright | n/a | pytest | ruff / flake8 / pylint |
| **Go** | `go build ./...` | `go build` | `go test ./...` | golangci-lint / `go vet` |
| **Rust** | `cargo check` | `cargo build` | `cargo test` | `cargo clippy` |
| **Ruby** | sorbet (if configured) | n/a | rspec / minitest | rubocop |
| **Java / Kotlin** | `mvn compile` / `./gradlew build` | same | `mvn test` / `./gradlew test` | checkstyle / ktlint |
| **C# / .NET** | `dotnet build` | `dotnet build` | `dotnet test` | `dotnet format --verify-no-changes` |
| **PHP** | phpstan | n/a | phpunit | phpcs |
| **Swift** | `swift build` | `swift build` | `swift test` | swiftlint |
| **Any Makefile project** | `make check` / `make typecheck` | `make build` | `make test` | `make lint` |
Don't see your stack? Edit `.claude/autoimprove/config.md` after setup to add your own commands.
---
## Customising .claude/autoimprove/config.md
After running `/autoimprove:setup`, edit the generated `.claude/autoimprove/config.md` to tailor the loop to your project:
```markdown
## Improvement Areas
- Check all Convex mutations have auth guards
- Replace fetch() calls with our internal apiClient wrapper
- Ensure every page component has a loading.tsx sibling
## Files to Never Modify
- convex/schema.ts
- src/generated/
- migrations/
- .env.local
```
You can also override any auto-detected command, change scoring weights, or add custom shell commands as additional metrics.
---
## Focused improvements
You can focus the loop on a specific task **directly from the command**, with no config editing needed. Just pass a quoted string:
```bash
# Focus on type safety
/autoimprove:improve 10 "Replace all any types with proper TypeScript interfaces"
# Focus on a specific directory
/autoimprove:improve 5 "Fix all lint warnings in src/components/dashboard/"
# Focus on tests
/autoimprove:improve 10 "Add unit tests for every exported function in lib/billing/"
# Focus on a migration
/autoimprove:improve 20 "Replace all raw fetch() calls with the apiClient wrapper from lib/api-client.ts"
```
When a focus string is provided, **every iteration targets that task**. The loop breaks it into file-by-file sub-tasks and chips away one per iteration until the focus is fully addressed or iterations run out.
Without a focus string, the loop rotates through all areas listed in your `.claude/autoimprove/config.md` as usual.
### Alternative: edit the config
For recurring focus areas, you can also edit the `Improvement Areas` section in `.claude/autoimprove/config.md` directly:
```markdown
## Improvement Areas
- Replace every `any` type with a proper TypeScript interface or type alias
```
This is useful when you want the focus to persist across multiple sessions without re-typing it.
### Tips for focused runs
- **Be specific.** `"Fix type errors"` is vague. `"Replace any with proper types in convex/ mutations"` gives the loop a clear target.
- **One concern at a time** works best. The loop makes surgical 1–3 file changes per iteration, so a narrow focus means every iteration chips away at the same problem.
- **Match iteration count to scope.** If you have ~20 files to fix, run `/autoimprove:improve 20 "..."` so each iteration can tackle one file.
- **Use "Files to Never Modify"** in the config to protect areas you don't want touched during a focused run.
---
## Resuming interrupted sessions
If your session gets interrupted (Ctrl+C, context limit, crash), you can pick up where you left off:
```bash
# Resume with remaining iterations and same focus
/autoimprove:continue
# Resume but only run 3 more iterations
/autoimprove:continue 3
# Resume with a different focus
/autoimprove:continue "New focus area"
# Override both
/autoimprove:continue 5 "Fix error handling in api/"
```
The continue command reads `.claude/autoimprove/log.md` to find the interrupted session, inherits its settings, and picks up from the next iteration. Iteration numbering continues seamlessly (e.g., if you completed 4/10, it resumes at 5/10).
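The numbering rule can be written down in a few lines. This sketch assumes that an explicit count passed to `/autoimprove:continue` replaces the inherited remainder while numbering still continues from the last completed iteration. That behaviour, and the function name, are assumptions rather than documented facts.

```python
from typing import Optional, Tuple


def resume_plan(completed: int, planned: int, override: Optional[int] = None) -> Tuple[int, int]:
    """Return (first iteration to run, last iteration to run) for a resumed session."""
    if not 0 <= completed <= planned:
        raise ValueError("completed must be between 0 and planned")
    # Inherit the remaining iterations unless the user passed an explicit count.
    remaining = planned - completed if override is None else override
    return completed + 1, completed + remaining
```

For a session interrupted after 4 of 10 iterations, `resume_plan(4, 10)` covers iterations 5 through 10, and `resume_plan(4, 10, 3)` covers 5 through 7.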
If the codebase has changed since the interrupted session (you made manual commits), autoimprove will warn you and re-measure the baseline.
Check `/autoimprove:status` to see if you have an interrupted session to resume.
---
## What the loop improves
The loop rotates through these universal improvement areas (and adds language-specific ones based on your stack):
- **Type safety**: fix type errors, replace `any`/`interface{}`/untyped constructs
- **Error handling**: unhandled promises, bare `catch {}`, swallowed errors
- **Dead code**: unused imports, variables, unreachable branches
- **Code duplication**: extract repeated logic (3+ occurrences) into shared utilities
- **Naming & readability**: cryptic names, functions over ~50 lines
- **Performance**: N+1 query patterns, missing memoization, unnecessary allocations
- **Security**: hardcoded secrets, missing input validation, unguarded auth routes
- **Tests**: add a test for the most critical untested function, fix flaky tests
---
## Safety
The loop is designed to be safe to run unattended:
| Rule | Detail |
|---|---|
| Never touches lock files | `package-lock.json`, `Cargo.lock`, `go.sum`, `Gemfile.lock`, etc. |
| Never touches generated files | Migrations, protobuf output, OpenAPI generated code |
| Never touches secrets | `.env`, `.env.local`, any secrets file |
| Never deploys or publishes | No `git push`, `npm publish`, `cargo publish`, etc. |
| Requires clean git state | Won't start if `git status` shows uncommitted changes |
| Experiments in isolated worktrees | Each experiment is on its own branch; main changes only when a winner is squash-merged |
| Losers deleted, not rolled back | Failed experiments: worktree deleted, branch deleted, main untouched |
| Winners squash-merged | One clean commit per winning experiment, easy to review with `git log` |
| Pauses every 10 iterations | Cleans up worktrees, writes summary, waits for human review |
You always review and push: the loop commits winning experiments to your local main branch but never pushes on your behalf.
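A guard for the "never touches" rules could look like the sketch below. The file names come from the table and from the "Files to Never Modify" example earlier. The pattern lists and the `is_protected` helper are illustrative assumptions, not the plugin's actual mechanism.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# File names that must never be edited, wherever they live in the tree.
PROTECTED_NAMES = ["package-lock.json", "Cargo.lock", "go.sum", "Gemfile.lock", ".env", ".env.*"]
# Directories (relative to the project root) whose contents are off limits.
PROTECTED_DIRS = ["migrations", "src/generated"]


def is_protected(path: str) -> bool:
    """Return True if a project-relative path falls under any protection rule."""
    candidate = PurePosixPath(path)
    if any(fnmatch(candidate.name, pattern) for pattern in PROTECTED_NAMES):
        return True
    return any(
        candidate == PurePosixPath(directory) or PurePosixPath(directory) in candidate.parents
        for directory in PROTECTED_DIRS
    )
```

Checking every file in a proposed change against such a guard before implementing it lets an experiment be rejected early instead of being discarded after measurement.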
---
## Plugin structure
```
autoimprove/
├── .claude-plugin/
│   ├── plugin.json          # Plugin manifest
│   └── hooks/
│       └── hooks.json       # SessionStart hook registration
├── hooks/
│   └── sessionstart.sh      # update check on startup (once per day)
├── skills/
│   ├── audit/
│   │   └── SKILL.md         # Codebase deficiency scan, prioritized report, interactive fix loop
│   ├── detect-stack/
│   │   └── SKILL.md         # Fingerprints project, writes .claude/autoimprove/config.md
│   ├── worktree/
│   │   └── SKILL.md         # Creates/manages/cleans up git worktrees per experiment
│   ├── improve-loop/
│   │   └── SKILL.md         # Core loop: worktree → propose → implement → measure → merge/delete
│   ├── measure/
│   │   └── SKILL.md         # Internal scoring utility (used by audit and improve-loop)
│   └── rollback/
│       └── SKILL.md         # Emergency cleanup of all experiment worktrees
└── commands/
    ├── audit.md             # /autoimprove:audit
    ├── continue.md          # /autoimprove:continue [N] ["focus"]
    ├── setup.md             # /autoimprove:setup
    ├── improve.md           # /autoimprove:improve [N] ["focus"]
    ├── status.md            # /autoimprove:status
    └── upgrade.md           # /autoimprove:upgrade (check for updates)
```
---
## Example run
Here's what a real overnight session looks like. This is from a Next.js + Convex project starting at a score of 61/100:
```
## Iteration 1 - 23:04
**Hypothesis:** Replace 4 implicit `any` types in `convex/invoices.ts` with proper interfaces
**Files changed:** convex/invoices.ts
**Before:** 61/100 - type: 24, build: 20, tests: 10, lint: 7
**After:** 69/100 - type: 32, build: 20, tests: 10, lint: 7
**Decision:** KEPT ✅
**Reason:** Removed 4 TS7006 implicit-any errors by typing mutation arguments
## Iteration 5 - 23:37
**Hypothesis:** Move ExpenseList to a server component: it only reads data, no interactivity
**Branch:** autoimprove/experiment-005
**Files changed:** components/ExpenseList.tsx
**Before:** 71/100 - type: 32, build: 20, tests: 10, lint: 9
**After:** 68/100 - type: 26, build: 20, tests: 10, lint: 12
**Decision:** DISCARDED ❌ (worktree deleted, main untouched)
**Reason:** Removing "use client" broke the useQuery hook, so it must stay a client component.
## Iteration 8 - 00:02
**Hypothesis:** Add unit tests for calculateTaxEstimate(): most complex function, zero coverage
**Files changed:** lib/tax.test.ts (new)
**Before:** 78/100 - type: 36, build: 20, tests: 10, lint: 10
**After:** 84/100 - type: 36, build: 20, tests: 16, lint: 10
**Decision:** KEPT ✅
**Reason:** 2 new tests passing, covers basic and edge-case tax bracket logic
━━━ Session Complete ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Score: 61 → 84 (+23 pts)
Iterations: 10 total - 9 kept ✅, 1 discarded ❌
Merged commits:
  • abc1234 autoimprove(001): Replace 4 implicit any types
  • def5678 autoimprove(002): Add error boundaries
  • ...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
See [`autoimprove-log.example.md`](autoimprove-log.example.md) for the full 10-iteration session with summary table.
---
## Contributing
PRs welcome! Especially:
- New language profiles in `detect-stack/SKILL.md`
- Better improvement area prompts for specific frameworks
- Example `.claude/autoimprove/config.md` files for common stacks
---
## License
MIT