{"id":49009654,"url":"https://github.com/lex0c/gitcortex","last_synced_at":"2026-06-27T22:00:35.337Z","repository":{"id":351883652,"uuid":"1212854034","full_name":"lex0c/gitcortex","owner":"lex0c","description":"CLI for git repository metrics — extract commit data, generate stats and HTML reports. Single binary, 100% local.","archived":false,"fork":false,"pushed_at":"2026-06-27T18:44:48.000Z","size":1107,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-27T19:19:19.605Z","etag":null,"topics":["engineering-intelligence","git","git-analytics","git-metrics","git-stats"],"latest_commit_sha":null,"homepage":"https://lex0c.github.io/gitcortex/","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lex0c.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-16T19:53:11.000Z","updated_at":"2026-06-27T18:44:51.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/lex0c/gitcortex","commit_stats":null,"previous_names":["lex0c/gitcortexv2","lex0c/gitcortex"],"tags_count":21,"template":false,"template_full_name":null,"purl":"pkg:github/lex0c/gitcortex","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lex0c%2Fgitcortex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lex0c%2Fgitcortex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lex0c%2Fgitcortex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lex0c%2Fgitcortex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lex0c","download_url":"https://codeload.github.com/lex0c/gitcortex/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lex0c%2Fgitcortex/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34869004,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-27T02:00:06.362Z","response_time":126,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["engineering-intelligence","git","git-analytics","git-metrics","git-stats"],"created_at":"2026-04-18T22:13:59.374Z","updated_at":"2026-06-27T22:00:35.324Z","avatar_url":"https://github.com/lex0c.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# gitcortex\n\n**Repository behavior analyzer.** Reads git history — commits, authors, dates, file paths, line counts — and surfaces signals about how people and processes interact with a codebase: hotspots, bus factor, coupling, churn risk, working patterns, collaboration networks. It analyzes the behavior recorded in git, not the source code itself — every metric is derived from who touched what, when, and with whom.\n\n## Performance\n\nSee [`docs/PERF.md`](docs/PERF.md) for extended benchmarks.\n\nBenchmarked on open-source repositories. `extract` reads bare clones; `stats` and `report` read the resulting JSONL. Measurements taken on a single NVMe-SSD machine with a development build after v2.11.0 — blob-size resolution is now opt-in (`--blob-sizes`), and `stats`/`report` fan their independent metric passes across CPU cores (12 here). Not a controlled lab benchmark — directional, not absolute; `extract` is I/O-bound and carries ~10% run-to-run variance.\n\n| Repository | Commits | Devs | Extract | Stats (JSON) | Report (HTML) | JSONL size |\n|------------|---------|------|---------|-------------|--------------|------------|\n| [Pi-hole](https://github.com/pi-hole/pi-hole) | 7,077 | 281 | 0.9s | 0.11s | 0.14s | 23K lines / 6.2 MB |\n| [Praat](https://github.com/praat/praat) | 10,221 | 19 | 20s | 0.6s | 0.6s | 95K lines / 27 MB |\n| [WordPress](https://github.com/WordPress/WordPress) | 52,466 | 131 | 40s | 1.6s | 1.8s | 298K lines / 90 MB |\n| [Kubernetes](https://github.com/kubernetes/kubernetes) | 137,016 | 5,295 | 1m 51s | 8.1s | 8.5s | 943K lines / 295 MB |\n| [Linux kernel](https://github.com/torvalds/linux) | 1,438,634 | 38,832 | 12m 46s | 47.6s | 48.8s | 6.1M lines / 1.74 GB |\n\n`extract`, `stats`, and `report` scale roughly linearly with dataset size. Since the post-v2.11.0 work, `stats` and `report` run their independent metric passes concurrently across cores — on this 12-core machine that cut their wall time ~30–49% versus v2.11.0 (e.g. WordPress report 3.2s → 1.8s, Linux 1m29s → 49s) and brings the two within a few percent of each other (`report` does a little more — repo tree, dev network — but it overlaps the other passes). `extract` is unchanged within run-to-run variance: it's bound by the `git log` stream, not the now-optional blob-size lookup. `stats --format json` is the leanest path when you only need aggregate data; reach for `report` when you want the HTML dashboard.\n\n## Privacy and reliability\n\nAll processing is **100% local**. No external services, no network calls, **no AI**, no telemetry. gitcortex reads only git metadata (commits, authors, dates, file paths, line counts) — it never reads source code content. Commit messages are excluded by default and only included with `--include-commit-messages`. Data stays on your machine as a JSONL file that you control.\n\n## Install\n\n### Download binary (no Go required)\n\nPre-built binaries for Linux, macOS, and Windows are available on [GitHub Releases](https://github.com/lex0c/gitcortex/releases/latest):\n\n```bash\n# Linux (x64)\ncurl -L https://github.com/lex0c/gitcortex/releases/latest/download/gitcortex-linux-amd64 -o gitcortex\nchmod +x gitcortex\nsudo mv gitcortex /usr/local/bin/\n\n# macOS (Apple Silicon)\ncurl -L https://github.com/lex0c/gitcortex/releases/latest/download/gitcortex-darwin-arm64 -o gitcortex\nchmod +x gitcortex\nsudo mv gitcortex /usr/local/bin/\n\n# macOS (Intel)\ncurl -L https://github.com/lex0c/gitcortex/releases/latest/download/gitcortex-darwin-amd64 -o gitcortex\nchmod +x gitcortex\nsudo mv gitcortex /usr/local/bin/\n```\n\n### Go install\n\n```bash\ngo install github.com/lex0c/gitcortex/cmd/gitcortex@latest\n```\n\n### Build from source\n\n```bash\ngit clone https://github.com/lex0c/gitcortex.git\ncd gitcortex\nmake build\n```\n\nOther targets: `make test`, `make vet`, `make check` (vet + test), `make install`, `make clean`.\n\nCheck version: `gitcortex --version`\n\nRequires Git 2.31+ and Go 1.21+. CI runs automatically on push/PR via GitHub Actions.\n\n### Release\n\n```bash\ngit tag v0.1.0\ngit push origin main --tags\n```\n\nThe version is injected at build time from `git describe --tags`. After tagging, `make build \u0026\u0026 gitcortex --version` shows `v0.1.0`.\n\n## Usage\n\n### Extract\n\n```bash\n# Extract from current directory\ngitcortex extract\n\n# Extract from a specific repo and branch\ngitcortex extract --repo /path/to/repo --branch main\n\n# Include commit messages in output\ngitcortex extract --repo /path/to/repo --include-commit-messages\n\n# Custom output path\ngitcortex extract --repo /path/to/repo --output data.jsonl\n\n# Normalize author identities via .mailmap\ngitcortex extract --repo /path/to/repo --mailmap\n\n# Exclude files from extraction\ngitcortex extract --repo /path/to/repo --ignore package-lock.json --ignore \"*.min.js\"\n\n# Exclude entire directories\ngitcortex extract --repo /path/to/repo --ignore \"dist/*\" --ignore \"vendor/*\"\n```\n\nThe default branch is auto-detected from `origin/HEAD`, falling back to `main`, `master`, or `HEAD`.\n\nThe `--mailmap` flag uses git's built-in `.mailmap` support to unify developer identities. Without it, the same person with different emails (e.g., `alice@work.com` and `alice@personal.com`) appears as separate contributors.\n\n### What gitcortex collects from git\n\nExtraction runs one git command against the local repository by default (a\nsecond only when `--blob-sizes` is set) and streams the output. No source-code\nbytes are read.\n\n```\ngit log -M --raw --numstat --format=\u003cmetadata\u003e \u003cbranch\u003e    → commits, parents, per-file diffs (counts only)\ngit cat-file --batch-check   (only with --blob-sizes)      → blob sizes (old/new) for each file change\n```\n\nPer-commit metadata (populates the `commit` record):\n\n| Field | Source | Used by |\n|---|---|---|\n| `sha`, `tree`, `parents` | `git log --format` | commit graph, merge detection |\n| `author_name`, `author_email`, `author_date` | `git log --format` | contributors, activity, working patterns, bus factor |\n| `committer_name`, `committer_email`, `committer_date` | `git log --format` | committer identity feeds the `dev` registry (so a committer who is never an author still appears as a known developer); no other stat consumes these fields |\n| `additions`, `deletions`, `files_changed` | summed from `--numstat` | summary totals, hotspots, churn-risk |\n| `message` | `git log --format` | opt-in only (`--include-commit-messages`); truncated to 80 chars in `top-commits` when present |\n\nPer-file-change metadata (populates the `commit_file` record):\n\n| Field | Source | Used by |\n|---|---|---|\n| `path_current`, `path_previous`, `status` | `git log --raw` | hotspots, directories, extensions, rename tracking (`R100` / `C075` trigger merges) |\n| `additions`, `deletions` | `git log --numstat` | per-file churn, recent churn, coupling |\n| `old_hash`, `new_hash` | `git log --raw` | emitted but not consumed by any stat |\n| `old_size`, `new_size` | `git cat-file --batch-check` (opt-in: `--blob-sizes`) | blob byte sizes; **off by default** (the lookup dominated extract time and no stat reads them), so absent from the JSONL unless `--blob-sizes` is passed |\n\n**Not collected:**\n- File contents / diff hunks — only line counts from `--numstat`.\n- Commit messages (unless `--include-commit-messages` is passed).\n- Tags, refs other than the traversed branch, reflog, notes.\n- Any network traffic — extraction is 100% local to the git directory.\n\n**Opt-ins that change what ships in the JSONL:**\n- `--include-commit-messages` — adds the commit subject to each `commit` record (off by default).\n- `--mailmap` — normalizes author/committer names+emails via git's `.mailmap` before recording (off by default; warned when a `.mailmap` exists but the flag is omitted).\n- `--ignore \u003cglob\u003e` — drops matching `commit_file` records entirely at extract time (counts in the `commit` record are recomputed so totals remain consistent).\n- `--first-parent` — traverses only the first-parent chain, skipping merged branch history.\n- `--blob-sizes` — resolves per-blob byte sizes via `git cat-file --batch-check` and emits `old_size`/`new_size` (off by default). The lookup is the bulk of any cat-file cost and no gitcortex stat consumes the sizes, so enable this only when an external consumer of the JSONL needs them.\n\nFull per-record schema (every field, types, enums): see [`docs/RUNBOOK.md`](docs/RUNBOOK.md#jsonl-format).\n\nOutput is a JSONL file with one record per line. Four record types:\n\n```jsonl\n{\"type\":\"commit\",\"sha\":\"abc...\",\"tree\":\"def...\",\"parents\":[\"ghi...\"],\"author_name\":\"Alice\",\"author_email\":\"alice@example.com\",\"author_date\":\"2024-01-15T10:30:00Z\",\"committer_name\":\"Alice\",\"committer_email\":\"alice@example.com\",\"committer_date\":\"2024-01-15T10:30:00Z\",\"message\":\"\",\"additions\":42,\"deletions\":7,\"files_changed\":3}\n{\"type\":\"commit_parent\",\"sha\":\"abc...\",\"parent_sha\":\"ghi...\"}\n{\"type\":\"commit_file\",\"commit\":\"abc...\",\"path_current\":\"src/main.go\",\"path_previous\":\"src/main.go\",\"status\":\"M\",\"old_hash\":\"111...\",\"new_hash\":\"222...\",\"additions\":10,\"deletions\":3}\n{\"type\":\"dev\",\"dev_id\":\"sha256hash...\",\"name\":\"Alice\",\"email\":\"alice@example.com\"}\n```\n\n### Resume\n\nExtraction is resumable. State is saved to a file (default `git_state`) at every checkpoint:\n\n```bash\n# First run (interrupted or completed)\ngitcortex extract --repo /path/to/repo --output data.jsonl\n\n# Resume from where it left off\ngitcortex extract --repo /path/to/repo --output data.jsonl\n```\n\nThe checkpoint interval is controlled by `--batch-size` (default 1000 commits).\n\n### Stats\n\n```bash\n# All stats at once (table format)\ngitcortex stats --input data.jsonl\n\n# Individual stat\ngitcortex stats --input data.jsonl --stat contributors --top 20\n\n# Multi-repo: aggregate stats across repositories\ngitcortex stats --input svc-auth.jsonl --input svc-payments.jsonl --input svc-gateway.jsonl\n\n# Export as CSV or JSON\ngitcortex stats --input data.jsonl --stat hotspots --format csv \u003e hotspots.csv\ngitcortex stats --input data.jsonl --format json \u003e report.json\n\n# Full dataset (no truncation) — useful for scripting\ngitcortex stats --input data.jsonl --stat churn-risk --top 0 --format csv\n\n# Activity by week\ngitcortex stats --input data.jsonl --stat activity --granularity week\n\n# Test-to-source ratio (history-based proxy, not coverage)\ngitcortex stats --input data.jsonl --stat tests\n# Mark a non-standard test layout so it counts (repeatable)\ngitcortex stats --input data.jsonl --stat tests --test-glob 'tools/testing/selftests/*' --test-glob '*_kunit.c'\n\n# Filter to recent period\ngitcortex stats --since 7d                    # last 7 days\ngitcortex stats --since 3m --stat contributors # last 3 months\ngitcortex report --since 30d --output monthly.html\n\n# Closed window (arbitrary start/end — e.g. past quarter)\ngitcortex stats  --from 2026-01-01 --to 2026-03-31 --stat contributors\ngitcortex report --from 2026-01-01 --to 2026-03-31 --output q1.html\ngitcortex report --from 2025-06-01 --output post-release.html  # open-ended forward\ngitcortex report --to 2024-12-31 --output pre-2025.html        # open-ended backward\n```\n\nAvailable stats:\n\n| Stat | Description |\n|------|-------------|\n| `summary` | Total commits, devs, files, additions/deletions, merge count, averages, date range |\n| `contributors` | Ranked by commit count with additions/deletions per developer |\n| `hotspots` | Most frequently changed files with churn and unique developer count |\n| `activity` | Commits and line changes bucketed by day, week, month, or year |\n| `busfactor` | Files with lowest bus factor (fewest developers owning 80%+ of changes) |\n| `coupling` | Files that frequently change together, revealing hidden architectural dependencies |\n| `churn-risk` | Files ranked by recent churn, classified into `cold` / `active` / `active-core` / `silo` / `fading-silo` |\n| `working-patterns` | Commit heatmap by hour and day of week |\n| `dev-network` | Developer collaboration graph based on shared file ownership |\n| `profile` | Per-developer report: scope, specialization index, contribution type, pace, collaboration, top files |\n| `top-commits` | Largest commits ranked by lines changed (includes message if extracted with `--include-commit-messages`) |\n| `pareto` | Concentration (80% threshold) across files, devs (two lenses: commits and churn), and directories |\n| `structure` | Repo layout as a `tree(1)`-style view, dirs sorted by aggregate churn, capped by `--tree-depth` (default 3) |\n| `extensions` | File extensions ranked by recent churn, with file count, unique devs, and first/last-seen — the historical lens on language distribution |\n| `tests` | History-based test-investment proxy: test-to-source ratio (files and churn) overall and per language, plus a ratio-over-time trend. Files are classified by path convention (not coverage); customize with `--test-glob` |\n\nOutput formats: `table` (default, human-readable), `csv` (single clean table per `--stat`, header row on line 1), `json` (unified object with all sections).\n\n`--top 0` disables the truncation and returns every row — useful for driving downstream scripts. Prefer `--format json` piped into `jq` for reliable filtering:\n\n```bash\ngitcortex stats --input data.jsonl --stat churn-risk --top 0 --format json \\\n  | jq '.churn_risk[] | select(.Label == \"fading-silo\")'\n```\n\nCSV output also carries a stable header on line 1, but paths containing commas (font filenames, generated assets) are standard-quoted — a naive `awk -F','` will mis-split on those rows. For CSV pipelines use a proper parser (`csvkit`, `mlr`) or stick with the JSON path above.\n\nSee [`docs/METRICS.md`](docs/METRICS.md) for how each metric is calculated, including timezone handling (UTC for aggregation buckets, author-local for working patterns) and rename tracking (history merged across git-detected renames).\n\n### Developer profile\n\nManager-facing report per developer showing scope, specialization, contribution type, pace, collaboration, and top files.\n\n```bash\n# All developers, ranked by commits\ngitcortex stats --input data.jsonl --stat profile\n\n# Single developer\ngitcortex stats --input data.jsonl --stat profile --email alice@company.com\n\n# JSON export\ngitcortex stats --input data.jsonl --stat profile --format json\n```\n\nEach profile includes:\n- **Scope**: top directories where the dev works (by unique files, %)\n- **Specialization**: Herfindahl concentration over the dev's full directory distribution; 1 = all files in one dir (narrow specialist), approaches 0 for broad generalists. Labelled `broad generalist` / `balanced` / `focused specialist` / `narrow specialist`. *Measures file distribution on disk, not domain expertise — a security engineer who refactored auth across four dirs looks like a generalist even though they are a domain specialist. See METRICS.md for the caveat in full.*\n- **Contribution**: growth (add \u003e\u003e del), balanced, or refactor (del \u003e\u003e add)\n- **Pace**: commits per active day\n- **Collaboration**: top devs sharing the same files (ranked by `shared_lines` = Σ min(linesA, linesB))\n- **Weekend %**: off-hours work ratio\n- **Top files**: most impacted files by churn\n- **Top commits**: the dev's largest individual commits by lines changed (additions + deletions); surfaces vendored drops and bulk rewrites that can skew the totals\n\n### Coupling analysis\n\nFile coupling detects files that co-change in the same commits, revealing architectural coupling invisible in the code structure. Based on Adam Tornhill's [\"Your Code as a Crime Scene\"](https://pragprog.com/titles/atcrime/your-code-as-a-crime-scene/) methodology.\n\n```bash\ngitcortex stats --input data.jsonl --stat coupling --top 20\ngitcortex stats --input data.jsonl --stat coupling --coupling-min-changes 10 --coupling-max-files 30\n```\n\n```\nFILE A                              FILE B                              CO-CHANGES  COUPLING  CHANGES A  CHANGES B\nApplicationDbContext.cs              ApplicationDbContextModelSnapshot.cs 54          61%       100        89\nGuardianPortalControllerTests.cs    GuardianPortalController.cs          40          91%       44         61\nIWorkspaceRepository.cs             WorkspaceRepository.cs               19          100%      19         29\n```\n\n- **Coupling %**: co-changes / min(changes A, changes B) — how tightly linked the pair is\n- **100% coupling**: every time the less-active file changes, the other changes too\n\n### Churn risk\n\nRanks files by recency-weighted churn and classifies each into an actionable label, so you can tell a healthy core module apart from a legacy bottleneck without eyeballing five columns.\n\n```bash\ngitcortex stats --input data.jsonl --stat churn-risk --top 15\ngitcortex stats --input data.jsonl --stat churn-risk --churn-half-life 60   # faster decay\n```\n\nReal output:\n\n```\nPATH                                   LABEL                                       RECENT CHURN  BF   AGE    TREND\nautomated install/basic-install.sh     active (age P90, trend P87)                 115.3         15   4121d  0.00\n.github/workflows/codeql-analysis.yml  active-core (age P30, trend P95)            66.2          2    1640d  0.26\nadvanced/Scripts/utils.sh              active-core (age P27, trend P94)            53.3          2    1523d  0.10\n```\n\n| Label | Meaning |\n|-------|---------|\n| `cold` | Low recent churn — ignore. |\n| `active` | Shared ownership (bus factor ≥ 3). Healthy. |\n| `active-core` | New code (younger than most of the repo), single author. Usually fine. |\n| `silo` | Old + concentrated + stable/growing. Knowledge bottleneck — plan transfer. |\n| `fading-silo` | **Urgent.** Old + concentrated + declining. A silo whose owner is drifting away. |\n\nSort order is **label priority** (fading-silo → silo → active-core → active → cold), then `recent_churn` descending within the same label. The label answers \"is this activity a problem?\" and leads the table so the actionable classifications surface at the top — without this, a mature repo's `--top 20` would be dominated by unremarkable active files and the flagged risks would scroll off. The composite `risk_score` field (`recent_churn / bus_factor`) is still emitted for CI gate back-compat.\n\n**The `(age PXX, trend PYY)` suffix** reports where the file sits in this repo's distribution: `age P90` = older than 90% of tracked files, `trend P08` = declining more sharply than 92%. Classification thresholds are not absolute — they adapt to each dataset (P75 age and P25 trend, with a fallback to fixed constants for repos under 8 files). A `fading-silo` with `(age P76, trend P24)` barely qualifies; one at `(age P98, trend P03)` is the real alarm. Distance from the boundary is now visible instead of hidden. See `docs/METRICS.md` for the adaptive-thresholds section.\n\n`--churn-half-life` controls how fast old changes lose weight (default 90 days = changes lose half their weight every 90 days).\n\nThe HTML report precedes the Churn Risk table with a colored distribution strip — `48 fading-silo · 1 silo · 2,330 active-core · 1,404 active · 4,585 cold` — counted over the full classified set. The truncated table below shows only the top N by label priority, so a reader glancing at \"all 20 rows are fading-silo\" can still tell whether the repo has 20 legacy files or 20,000 before drawing a conclusion. To inspect the full list, use `--top 0 --format json` from the CLI and filter with `jq`.\n\n### Working patterns\n\nCommit distribution heatmap by hour and day of week. Reveals timezones, overwork patterns, and deploy habits.\n\n```bash\ngitcortex stats --input data.jsonl --stat working-patterns\ngitcortex stats --input data.jsonl --stat working-patterns --format csv \u003e patterns.csv\n```\n\n```\nHOUR  Mon Tue Wed Thu Fri Sat Sun\n09:00 1   1   3   .   .   .   .\n10:00 7   4   2   2   1   6   1\n11:00 10  13  3   1   2   14  7\n...\n19:00 35  15  7   10  12  16  13\n22:00 26  9   .   1   13  9   8\n```\n\n### Developer network\n\nCollaboration graph where edges connect developers who modify the same files. Weight reflects overlap percentage.\n\n```bash\ngitcortex stats --input data.jsonl --stat dev-network --top 20\ngitcortex stats --input data.jsonl --stat dev-network --network-min-files 10\ngitcortex stats --input data.jsonl --stat dev-network --format csv \u003e network.csv\n```\n\n```\nDEV A                          DEV B            SHARED FILES  WEIGHT\nalice@company.com              bob@company.com  142           34.5%\ncarol@company.com              alice@company.com 87           21.2%\n```\n\n### Multi-repo\n\nAggregate stats across multiple repositories. File paths are automatically prefixed with the filename to avoid collisions.\n\n```bash\n# Extract each repo\ngitcortex extract --repo ./svc-auth --output auth.jsonl\ngitcortex extract --repo ./svc-payments --output payments.jsonl\n\n# Aggregate stats\ngitcortex stats --input auth.jsonl --input payments.jsonl\ngitcortex stats --input auth.jsonl --input payments.jsonl --stat coupling --top 20\n```\n\nPaths appear as `auth:src/main.go` and `payments:src/main.go`. Contributors are deduped by email across repos — the same developer contributing to both repos is counted once.\n\nFor workspaces containing many repos (an engineer's `~/work`, a platform team's service folder), `gitcortex scan` discovers every `.git` under one or more roots and extracts them in parallel — see below.\n\n### Scan: discover and aggregate every repo under a root\n\nWalk one or more directories, find every git repository (working trees and bare clones both detected), extract them in parallel, and optionally render HTML. Two output modes:\n\n- `--report-dir \u003cdir\u003e` — one standalone HTML per repo plus an `index.html` landing page linking them. Each per-repo report is equivalent to running `gitcortex report` against that repo alone; no metric mixing across unrelated codebases.\n- `--report \u003cfile\u003e --email \u003caddress\u003e` — a **single** consolidated profile report for one developer across every scanned repo. The only cross-repo aggregation in the feature, because \"where did this person spend their time?\" is the only question that genuinely benefits from pooling signal across projects.\n\nThere is no third mode. Cross-repo consolidation at the team/codebase level inflates hotspots, bus factor, and coupling with noise from unrelated codebases; if that's what you want, inspect `manifest.json` or run `gitcortex report` per JSONL.\n\n```bash\n# Discover and extract every repo under ~/work (JSONLs + manifest, no HTML)\ngitcortex scan --root ~/work --output ./scan-out\n\n# Per-repo HTML reports + index landing page\ngitcortex scan --root ~/work --output ./scan-out --report-dir ./reports\n# opens ./reports/index.html → click through to each repo\n\n# Personal cross-repo profile: only MY commits, consolidated into one HTML\ngitcortex scan --root ~/work --output ./scan-out \\\n  --report ./me.html --email me@company.com --since 1y \\\n  --include-commit-messages\n\n# Multiple roots, higher parallelism, pre-set ignore patterns\ngitcortex scan --root ~/work --root ~/personal --root ~/oss \\\n  --parallel 8 --max-depth 4 \\\n  --output ./scan-out --report-dir ./reports\n```\n\nThe scan output directory holds:\n\n| file | purpose |\n|---|---|\n| `\u003cslug\u003e.jsonl` | per-repo JSONL, one per discovered repo |\n| `\u003cslug\u003e.state` | resume checkpoint (safe to re-run scan to continue) |\n| `manifest.json` | discovery results, per-repo status (ok/failed/pending), timing |\n\nEach repo's slug is derived from its directory basename; colliding basenames get a short SHA-1 suffix (the suffix lengthens automatically on the rare truncation collision, so `\u003cslug\u003e.state` is stable across runs).\n\n**Filtering discovery with `.gitcortex-ignore`.** Create a gitignore-style file at the scan root:\n\n```\n# skip heavy clones we don't want in the report\nnode_modules\nchromium.git\nlinux.git\n\n# skip vendored repos except the one we own\nvendor/\n!vendor/in-house-fork\n```\n\nDirectory rules, globs, `**/foo`, and `!path` negations all work. Globbed negations like `!vendor*/keep` are honored — discovery descends into any dir where a negation rule could match a descendant. If `--ignore-file` is not set, scan looks for `.gitcortex-ignore` in the first `--root`.\n\n**Consolidated profile report.** When `scan --email me@company.com --report path.html` runs against a multi-repo dataset, the profile report renders a *Per-Repository Breakdown* section: commits, churn, files, active days, and share-of-total — all filtered to that developer's contributions (files count reflects only files the dev touched). The **Devs** column is the deliberate exception: it counts every author of the repo (repo-wide), so the dev can see how crowded each repo they work in is. This is the one report that legitimately aggregates across repos; team-level views live in `--report-dir` (one HTML per repo, never mixed).\n\n**Flags worth knowing:**\n\n- `--parallel N` — repos extracted concurrently (default 4). Git is I/O-bound, so values past NumCPU give diminishing returns.\n- `--max-depth N` — stop descending past N levels. Useful when a root contains a monorepo with deeply nested internal repos you don't want enumerated.\n- `--extract-ignore \u003cglob\u003e` (repeatable) — forwarded to each per-repo `extract --ignore`, e.g. `--extract-ignore 'package-lock.json' --extract-ignore 'dist/*'`.\n- `--from / --to / --since` — time window applied to the consolidated report (same semantics as `report`).\n- `--churn-half-life`, `--coupling-max-files`, `--coupling-min-changes`, `--network-min-files` — pass tuning to the consolidated report identical to `gitcortex report`.\n\nPartial failures are non-fatal: the manifest records which repos failed, and the report is built from whichever JSONLs completed. `Ctrl+C` aborts both the discovery walk and any in-flight extracts; re-running picks up from each repo's state file.\n\n### Diff: compare time periods\n\nCompare stats between two time periods, or filter to a single period.\n\n```bash\n# Compare Q1 vs Q2\ngitcortex diff --input data.jsonl \\\n  --from 2024-01-01 --to 2024-03-31 \\\n  --vs-from 2024-04-01 --vs-to 2024-06-30\n\n# Filter to a single month (runs all stats for that period)\ngitcortex diff --input data.jsonl --from 2024-03-01 --to 2024-03-31\n\n# JSON export\ngitcortex diff --input data.jsonl \\\n  --from 2024-01-01 --to 2024-06-30 \\\n  --vs-from 2024-07-01 --vs-to 2024-12-31 \\\n  --format json \u003e comparison.json\n```\n\n```\n=== Summary: 2024-01-01 to 2024-03-31 vs 2024-04-01 to 2024-06-30 ===\nCommits                        812  →       945  (+133)\nAdditions                   45420  →     62830  (+17410)\nDeletions                   12300  →     18900  (+6600)\nFiles touched                  320  →       410  (+90)\nMerge commits                   45  →        38  (-7)\n```\n\n### HTML report\n\nGenerate a self-contained HTML dashboard with all stats visualized. Pure HTML+CSS, zero external dependencies, opens in any browser.\n\n```bash\ngitcortex report --input data.jsonl --output report.html\ngitcortex report --input data.jsonl --output report.html --top 30\n\n# Per-developer profile report (shareable with managers)\ngitcortex report --input data.jsonl --email alice@company.com --output alice.html\n```\n\nIncludes: summary cards, activity heatmap (with table toggle), top contributors, file hotspots, churn risk (with full-dataset label distribution strip above the truncated table), bus factor, file coupling, extensions, the tests section (test-to-source ratio, per-language breakdown, and ratio-over-time trend), working patterns heatmap, top commits, developer network, and developer profiles (each carrying a test-share figure). A collapsible glossary at the top defines the terms (bus factor, churn, fading-silo, specialization, etc.) for readers who are not already familiar. Typical size: 50-500KB depending on number of contributors.\n\nWhen the input is multi-repo (from `gitcortex scan` or multiple `--input` files) AND `--email` is set, the profile report renders a *Per-Repository Breakdown* with commit/churn/files/active-days per repo, filtered to that developer's contributions (the Devs column is repo-wide, counting all of each repo's authors). The team-view report intentionally omits this section — per-repo aggregates on a consolidated dataset reduce to raw git-history distribution, which is more usefully inspected via `manifest.json` or `stats --input X.jsonl` per repo.\n\n\u003e The HTML activity heatmap is always monthly (year × 12 months grid). For day/week/year buckets, use `gitcortex stats --stat activity --granularity \u003cunit\u003e`.\n\n### CI: quality gates for pipelines\n\nRun automated checks and fail the build when thresholds are exceeded.\n\n```bash\n# Fail if any file has bus factor of 1\ngitcortex ci --input data.jsonl --fail-on-busfactor 1\n\n# Fail if any file has churn risk \u003e= 500 (legacy composite: recent_churn / bus_factor)\ngitcortex ci --input data.jsonl --fail-on-churn-risk 500\n\n# Both rules, GitHub Actions format\ngitcortex ci --input data.jsonl \\\n  --fail-on-busfactor 1 \\\n  --fail-on-churn-risk 500 \\\n  --format github-actions\n```\n\nOutput formats: `text` (default), `github-actions` (annotations), `gitlab` (Code Quality JSON), `json`.\n\nExit code 1 when violations are found, 0 when clean.\n\n\u003e `--fail-on-churn-risk` evaluates the legacy `risk_score = recent_churn / bus_factor` field, not the new label classification surfaced by `stats --stat churn-risk`. The two can disagree — a file might have `risk_score` below the threshold yet still classify as `fading-silo`. Use the stat command for triage; use the CI gate as a coarse threshold alarm.\n\n## Architecture\n\n```\ncmd/gitcortex/main.go          CLI entry point (cobra)\ninternal/\n  model/model.go               JSONL output types\n  git/\n    stream.go                  Single git log streaming parser\n    catfile.go                 Long-running cat-file blob size resolver\n    commands.go                Utility functions (branch detection, SHA validation)\n    parse.go                   Shared types (RawEntry, NumstatEntry)\n    discard.go                 Malformed entry tracking\n  extract/extract.go           Extraction orchestration, state, JSONL writing\n  scan/\n    scan.go                    Multi-repo orchestration (worker pool over extract)\n    discovery.go               Directory walk, bare-repo detection, slug uniqueness\n    ignore.go                  Gitignore-style matcher with negation support\n  stats/\n    reader.go                  Streaming JSONL aggregator (single-pass, multi-JSONL)\n    stats.go                   Stat computations (9 stats)\n    repo_breakdown.go          Per-repository aggregate (scan consolidated report)\n    format.go                  Table/CSV/JSON output formatting\n```\n\n### Extraction pipeline\n\nOne long-running git process for the entire extraction (a second only with\n`--blob-sizes`), regardless of repository size:\n\n```\ngit log --raw --numstat -M --- single stream ---- parse ---- emit JSONL\n                                                    |\ngit cat-file --batch-check -- long-running ---- resolve blob sizes  (only with --blob-sizes)\n```\n\n### Stats pipeline\n\nSingle-pass streaming aggregation. The JSONL file is read once, line by line, aggregating into compact maps. Raw records are never stored — only pre-computed aggregation state is kept in memory.\n\n```\nJSONL file ---- line by line ----\u003e aggregate ----\u003e lean Dataset ----\u003e stat functions\n                (no raw storage)    commits: SHA → {email, date, add, del}\n                                    files:   path → {commits, devs, churn}\n                                    coupling: computed on-the-fly\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flex0c%2Fgitcortex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flex0c%2Fgitcortex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flex0c%2Fgitcortex/lists"}