{"id":46573523,"url":"https://github.com/cesarandreslopez/occ","last_synced_at":"2026-04-14T08:02:52.769Z","repository":{"id":342758558,"uuid":"1175050683","full_name":"cesarandreslopez/occ","owner":"cesarandreslopez","description":"Document metrics, structure extraction, and code   exploration for real repositories","archived":false,"fork":false,"pushed_at":"2026-03-11T10:53:04.000Z","size":213,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-11T13:43:11.374Z","etag":null,"topics":["agents","cli","command-line-tool","document-analysis","document-metrics","docx","odf","office-documents","page-count","pdf","pptx","productivity","word-count","xlsx"],"latest_commit_sha":null,"homepage":"https://cesarandreslopez.github.io/occ/","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cesarandreslopez.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-07T06:44:11.000Z","updated_at":"2026-03-11T10:53:11.000Z","dependencies_parsed_at":"2026-03-11T09:01:22.004Z","dependency_job_id":null,"html_url":"https://github.com/cesarandreslopez/occ","commit_stats":null,"previous_names":["cesarandreslopez/occ"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/cesarandreslopez/occ","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cesarandreslopez%2Focc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cesarandreslopez%2Focc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cesarandreslopez%2Focc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cesarandreslopez%2Focc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cesarandreslopez","download_url":"https://codeload.github.com/cesarandreslopez/occ/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cesarandreslopez%2Focc/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30421085,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-12T09:20:56.688Z","status":"ssl_error","status_checked_at":"2026-03-12T09:20:13.792Z","response_time":114,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agents","cli","command-line-tool","document-analysis","document-metrics","docx","odf","office-documents","page-count","pdf","pptx","productivity","word-count","xlsx"],"created_at":"2026-03-07T09:15:15.026Z","updated_at":"2026-04-14T08:02:52.763Z","avatar_url":"https://github.com/cesarandreslopez.png","language":"TypeScript","readme":"\u003ch1 align=\"center\"\u003eOCC\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://www.npmjs.com/package/@cesarandreslopez/occ\"\u003e\u003cimg src=\"https://img.shields.io/npm/v/@cesarandreslopez/occ?label=npm\" alt=\"npm\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.npmjs.com/package/@cesarandreslopez/occ\"\u003e\u003cimg src=\"https://img.shields.io/npm/dt/@cesarandreslopez/occ?label=npm%20Downloads\" alt=\"npm Downloads\"\u003e\u003c/a\u003e\n  \u003ca href=\"LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-MIT-yellow.svg\" alt=\"License: MIT\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/cesarandreslopez/occ/actions/workflows/ci.yml\"\u003e\u003cimg src=\"https://github.com/cesarandreslopez/occ/actions/workflows/ci.yml/badge.svg\" alt=\"CI\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://deepwiki.com/cesarandreslopez/occ\"\u003e\u003cimg src=\"https://deepwiki.com/badge.svg\" alt=\"Ask DeepWiki\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eOffice Cloc and Count\u003c/strong\u003e — document metrics, structure extraction, content inspection, and code exploration for real repositories.\n\u003c/p\u003e\n\n\u003e **Experimental:** All features in OCC are currently experimental. This project cannot be considered stable software yet. APIs, output formats, and command interfaces may change between minor versions.\n\n## What is this?\n\nOCC started as a way to make office documents visible in the same workflows that already work well for code metrics tools like `scc` and `cloc`. It has since grown into a multi-purpose CLI that can:\n\n- scan office documents for word/page/sheet/slide metrics\n- extract document heading structure for navigation and RAG-style use cases\n- inspect documents (`occ doc inspect`), spreadsheets (`occ sheet inspect`), and presentations (`occ slide inspect`) for metadata, risk flags, and content previews\n- extract structured table content from documents (`occ table inspect`)\n- analyze workspaces for combined code, document, and structure metrics (`occ workspace analyze`) and cross-document references (`occ workspace documents`)\n- summarize code metrics through `scc`\n- explore JavaScript, TypeScript, and Python repositories with symbol search, call analysis, dependency inspection, and inheritance queries (`occ code`)\n\n## Features\n\n- **Office document metrics** — words, pages, paragraphs, slides, sheets, rows, cells\n- **Seven formats supported** — DOCX, XLSX, PPTX, PDF, ODT, ODS, ODP\n- **Document structure extraction** — `--structure` parses heading hierarchy into a navigable tree with dotted section codes (1, 1.1, 1.2, ...)\n- **Document inspection via `occ doc inspect`** — metadata, risk flags, content stats, heading structure, and content preview for DOCX and ODT\n- **Spreadsheet inspection via `occ sheet inspect`** — workbook properties, hidden sheets, names, formulas, links, comments, schema preview, and token estimates for XLSX\n- **Presentation inspection via `occ slide inspect`** — metadata, risk flags, per-slide inventory, and content preview for PPTX and ODP\n- **Table extraction via `occ table inspect`** — structured table content from DOCX, XLSX, PPTX, ODT, and ODP with auto-detected headers, sample row limits, and merged cell support\n- **Code metrics via scc** — auto-detects code files and integrates scc output\n- **Code exploration via `occ code`** — JS/TS and Python-first symbol lookup, content search, callers/callees, dependency categories, inheritance, module coupling, and ambiguity-aware chains\n- **Workspace analysis via `occ workspace`** — combined code, document, and structure analysis with versioned JSON contracts, per-document summaries, and cross-reference detection\n- **Multiple output modes** — grouped by type, per-file breakdown, or JSON\n- **CI-friendly** — ASCII-only, no-color mode for pipelines\n- **Flexible filtering** — include/exclude extensions, exclude directories, .gitignore-aware\n- **Progress bar** — with ETA for large scans\n- **Zero config** — auto-downloads scc binary on install, works out of the box\n\n## Quick Start\n\n**Global install:**\n\n```bash\nnpm i -g @cesarandreslopez/occ\nocc\n```\n\n**No-install usage:**\n\n```bash\nnpx @cesarandreslopez/occ docs/ reports/\n```\n\n**From source:**\n\n```bash\ngit clone https://github.com/cesarandreslopez/occ.git \u0026\u0026 cd occ\nnpm install\nnpm run build\nnpm test\nnpm start\n```\n\n## Usage\n\n```bash\n# Scan current directory\nocc\n\n# Scan specific directories\nocc docs/ reports/\n\n# Per-file breakdown\nocc --by-file docs/\n\n# JSON output\nocc --format json docs/\n\n# Extract document structure (heading hierarchy)\nocc --structure docs/\n\n# Structure as JSON\nocc --structure --format json docs/\n\n# Inspect a document for metadata, risk flags, and content preview\nocc doc inspect report.docx\nocc doc inspect report.docx --format json\n\n# Inspect an XLSX workbook before reading its contents deeply\nocc sheet inspect finance.xlsx\nocc sheet inspect finance.xlsx --format json --sample-rows 3 --max-columns 12\n\n# Inspect a presentation for slide inventory and content preview\nocc slide inspect deck.pptx\nocc slide inspect deck.pptx --format json --slide 3\n\n# Extract structured table data from documents\nocc table inspect report.docx --format json\nocc table inspect finance.xlsx --table 1 --sample-rows 10\n\n# Explore JS/TS and Python code\nocc code find name UserService --path .\nocc code analyze callers createUser --path .\nocc code analyze deps src/deps --path .\nocc code analyze chain ambiguousCaller duplicate --path .\n\n# Module coupling metrics\nocc code analyze coupling src/code --path .\n\n# Dump full codebase index as JSON\nocc code index --path . --format json\n\n# Workspace-level analysis (code + documents + structures)\nocc workspace analyze --format json\n\n# Document summaries with cross-references\nocc workspace documents --format json\n\n# Only specific formats\nocc --include-ext pdf,docx docs/\n\n# Skip code analysis\nocc --no-code docs/\n\n# CI-friendly (ASCII, no color)\nocc --ci docs/\n```\n\n## Example Output\n\n```\n-- Documents ---------------------------------------------------------------\n  Format    Files    Words    Pages                  Details      Size\n----------------------------------------------------------------------------\n  Word         12   34,210      137              1,203 paras    1.2 MB\n  PDF           8   22,540       64                             4.5 MB\n  Excel         3                                12 sheets      890 KB\n----------------------------------------------------------------------------\n  Total        23   56,750      201              1,203 paras    6.5 MB\n\n-- Code (via scc) ----------------------------------------------------------\n  Language    Files    Lines   Blanks  Comments     Code\n----------------------------------------------------------------------------\n  JavaScript     15     2340      180       320     1840\n  Python          8     1200       90       150      960\n----------------------------------------------------------------------------\n  Total          23     3540      270       470     2800\n\nScanned 23 documents (56,750 words, 201 pages) in 120ms\n```\n\n### Structure Output (`--structure`)\n\n```\n-- Structure: report.docx --------------------------------------------------\n1   Executive Summary\n  1.1   Background ......................................... p.1\n  1.2   Key Findings ....................................... p.1-2\n2   Methodology\n  2.1   Data Collection .................................... p.3\n  2.2   Analysis Framework ................................. p.4\n    2.2.1   Quantitative Methods ........................... p.4\n    2.2.2   Qualitative Methods ............................ p.5\n3   Results ................................................ p.6-8\n4   Conclusions ............................................ p.9\n\n4 sections, 10 nodes, max depth 3\n```\n\n## Supported Formats\n\n| Format | Extension | Metrics | Structure |\n|--------|-----------|---------|-----------|\n| Word | `.docx` | words, pages*, paragraphs | Yes |\n| PDF | `.pdf` | words, pages | Yes (with page mapping) |\n| Excel | `.xlsx` | sheets, rows, cells | — |\n| PowerPoint | `.pptx` | words, slides | Yes (slide headers) |\n| ODT | `.odt` | words, pages*, paragraphs | Yes (best-effort) |\n| ODS | `.ods` | sheets, rows, cells | — |\n| ODP | `.odp` | words, slides | Yes (slide headers) |\n\n\\* Pages for Word/ODT are estimated at 250 words/page.\n\n## CLI Flags\n\n| Flag | Description | Default |\n|------|-------------|---------|\n| `--by-file` / `-f` | Row per file | grouped by type |\n| `--format \u003ctype\u003e` | `tabular` or `json` | `tabular` |\n| `--structure` | Extract and display document heading hierarchy | off |\n| `--include-ext \u003cexts\u003e` | Comma-separated extensions | all supported |\n| `--exclude-ext \u003cexts\u003e` | Comma-separated to skip | none |\n| `--exclude-dir \u003cdirs\u003e` | Directories to skip | `node_modules,.git` |\n| `--ignore-pattern \u003cpattern\u003e` | Gitignore-style pattern to ignore (repeatable) | none |\n| `--no-gitignore` | Disable .gitignore respect | enabled |\n| `--sort \u003ccol\u003e` | Sort by: files, name, words, size | `files` |\n| `--output \u003cfile\u003e` / `-o` | Write to file | stdout |\n| `--ci` | ASCII-only, no color | off |\n| `--large-file-limit \u003cmb\u003e` | Skip files over this size | `50` |\n| `--no-code` | Skip scc code analysis | off |\n| `--show-confidence` | Show confidence levels for each metric | off |\n\n## Code Exploration\n\n`occ code` adds on-demand code exploration without changing the existing document-scan workflow. It builds an in-memory repository graph for each command and does not require a database, daemon, or background indexer.\n\nThe first-class support path is **JavaScript, TypeScript, and Python**. Other languages may be discovered and partially parsed, but the current resolver, fixtures, and output contracts are intentionally optimized around JS/TS and Python behavior.\n\n```bash\n# Exact symbol lookup\nocc code find name Greeter --path test/fixtures/code-explore\n\n# Substring search\nocc code find pattern service --path .\n\n# Full-text content search\nocc code find content normalize_name --path .\n\n# Outgoing and incoming call analysis\nocc code analyze calls bootstrap --path test/fixtures/code-explore\nocc code analyze callers createUser --path test/fixtures/code-explore\n\n# Dependency and inheritance inspection\nocc code analyze deps src/service --path test/fixtures/code-explore\nocc code analyze tree UserService --path test/fixtures/code-explore\n\n# Module coupling analysis\nocc code analyze coupling src/code --path test/fixtures/code-explore\n\n# Ambiguity-aware chain analysis\nocc code analyze chain ambiguousCaller duplicate --path test/fixtures/code-explore\n```\n\nHighlights of the current code exploration behavior:\n\n- **Full index export** via `occ code index` — dump the complete graph (files, symbols, edges, language capabilities) as JSON or a summary line\n- **Exact, pattern, type, and content search** over the repository graph\n- **Call analysis** with explicit `resolved`, `ambiguous`, and `unresolved` states\n- **Receiver-aware method resolution** for `this`, `super`, `self`, and `cls`\n- **Dependency analysis** grouped into local, external, and unresolved imports\n- **Module coupling analysis** with afferent/efferent coupling, instability, and key classes\n- **Chain analysis** that reports when a path is blocked by ambiguity instead of silently returning nothing\n- **Shared CLI ergonomics** with `--path`, `--format`, `--output`, `--exclude-dir`, and `.gitignore` support\n\nAll `occ code` commands support `--format tabular|json`. Most symbol-targeted commands also support `--file` for disambiguation, and JSON output includes repository metadata, query metadata, results, repository stats, and per-language capability flags.\n\n## Programmatic Usage\n\nThe code exploration module is available as a library via subpath exports:\n\n```ts\nimport { buildCodebaseIndex } from '@cesarandreslopez/occ/code/build';\nimport { discoverCodeFiles } from '@cesarandreslopez/occ/code/discover';\nimport { findByName, analyzeCalls } from '@cesarandreslopez/occ/code/query';\nimport type { CodebaseIndex, CodeNode } from '@cesarandreslopez/occ/code/types';\n\nconst index = await buildCodebaseIndex({ repoRoot: './my-repo' });\nconst results = findByName(index, 'UserService');\n```\n\nFor a stateful session that caches the index across queries:\n\n```ts\nimport { createCodeQuerySession } from '@cesarandreslopez/occ/code/session';\n\nconst session = await createCodeQuerySession({ repoRoot: './my-repo' });\nsession.findByName('UserService');\nsession.analyzeCalls('bootstrap');\nsession.chunk({ maxChunkWords: 200 });\nawait session.refresh(); // rebuild index when files change\n```\n\nFor persistent caching across sessions with automatic freshness checks:\n\n```ts\nimport { openCodeIndexStore } from '@cesarandreslopez/occ/code/store';\n\nconst store = openCodeIndexStore({\n  repoRoot: './my-repo',\n  cacheDir: '.occ-cache',\n});\n\n// First call builds + caches; subsequent calls load from cache\nconst session = await store.getSession({ strategy: 'prefer-cache' });\nsession.findByName('UserService');\n\n// Check freshness via file manifests before returning cache\nawait store.getSession({ strategy: 'ensure-fresh' });\n\n// Force a full rebuild\nawait store.refresh();\n```\n\nOr use the unified facade for all OCC APIs from a single import:\n\n```ts\nimport { createOcc } from '@cesarandreslopez/occ';\n\nconst occ = createOcc();\nconst session = await occ.code.createSession({ repoRoot: './my-repo' });\nconst analysis = await occ.workspace.analyze('./my-project', { includeCode: true });\nconst doc = await occ.doc.inspect('report.docx', {});\n```\n\nFor workspace-level analysis:\n\n```ts\nimport { analyzeWorkspace } from '@cesarandreslopez/occ/workspace/analyze';\nimport { inspectWorkspaceDocumentSet } from '@cesarandreslopez/occ/workspace/documents';\n\nconst analysis = await analyzeWorkspace('./my-project', { includeCode: true });\nconst docs = await inspectWorkspaceDocumentSet('./my-project', { maxFiles: 20 });\n```\n\nAvailable subpath exports:\n\n| Import path | Description |\n|-------------|-------------|\n| `@cesarandreslopez/occ/code/build` | `buildCodebaseIndex` — graph construction |\n| `@cesarandreslopez/occ/code/types` | TypeScript types (`CodebaseIndex`, `CodeNode`, `CodeEdge`, etc.) |\n| `@cesarandreslopez/occ/code/query` | Query functions (`findByName`, `analyzeCalls`, `analyzeDeps`, etc.) |\n| `@cesarandreslopez/occ/code/discover` | `discoverCodeFiles` — file discovery |\n| `@cesarandreslopez/occ/code/chunk` | `chunkCodebase`, `chunkFromIndex` — semantic code chunking |\n| `@cesarandreslopez/occ/code/session` | `createCodeQuerySession` — stateful code query session |\n| `@cesarandreslopez/occ/code/store` | `openCodeIndexStore` — persistent index store with cache strategies |\n| `@cesarandreslopez/occ/code/cache` | Index caching utilities (legacy — prefer `./code/store`) |\n| `@cesarandreslopez/occ/doc/inspect` | `inspectDocument` — document metadata and content extraction |\n| `@cesarandreslopez/occ/doc/types` | Document inspection types |\n| `@cesarandreslopez/occ/doc/discover` | Document file discovery |\n| `@cesarandreslopez/occ/doc/batch` | Batch document inspection |\n| `@cesarandreslopez/occ/doc/entities` | Entity and keyword extraction |\n| `@cesarandreslopez/occ/doc/references` | Cross-reference detection |\n| `@cesarandreslopez/occ/workspace/analyze` | `analyzeWorkspace` — workspace-level analysis |\n| `@cesarandreslopez/occ/workspace/documents` | `inspectWorkspaceDocumentSet` — document summaries and cross-references |\n| `@cesarandreslopez/occ/workspace/types` | Workspace analysis types |\n| `@cesarandreslopez/occ/workspace/prepare` | `prepareWorkspaceContext` — combined code indexing + document inspection |\n| `@cesarandreslopez/occ/workspace/prepare-types` | Workspace preparation types (`WorkspacePrepareOptions`, `WorkspacePreparedContext`, etc.) |\n| `@cesarandreslopez/occ/markdown/convert` | `documentToMarkdown` — document-to-markdown conversion |\n| `@cesarandreslopez/occ/structure/extract` | `extractFromMarkdown` — heading tree extraction |\n| `@cesarandreslopez/occ/structure/types` | Structure types and helpers |\n| `@cesarandreslopez/occ/sheet/inspect` | `inspectWorkbook` — XLSX workbook inspection |\n| `@cesarandreslopez/occ/sheet/types` | Sheet inspection types |\n| `@cesarandreslopez/occ/slide/inspect` | `inspectPresentation` — presentation inspection |\n| `@cesarandreslopez/occ/table/inspect` | Table extraction from documents |\n| `@cesarandreslopez/occ/types` | Shared types (`ConfidenceLevel`, `ParseResult`, `ParserOutput`, etc.) |\n| `@cesarandreslopez/occ/tokens` | Token estimation utilities |\n| `@cesarandreslopez/occ/progress-event` | Progress event types |\n| `@cesarandreslopez/occ/stats` | Stats types (`StatsRow`, `AggregateResult`) and `aggregate()` |\n\nTypeScript ships with OCC as a direct dependency, so the code exploration module works after a normal install. You only need a separate TypeScript setup if your own project uses `tsc`.\n\n## Document Inspection\n\n`occ doc inspect` extracts metadata, risk flags, content stats, heading structure, and a content preview from DOCX and ODT documents.\n\n```bash\n# Document overview with content preview\nocc doc inspect report.docx\n\n# Machine-readable payload\nocc doc inspect report.docx --format json\n\n# More paragraphs in the preview\nocc doc inspect report.docx --sample-paragraphs 10\n```\n\nCurrent document inspection surfaces:\n\n- **Document properties** — title, author, dates, keywords\n- **Risk flags** — comments, tracked changes, hyperlinks, embedded objects, macros, tables, encryption\n- **Content stats** — words, pages, paragraphs, characters, tables, images\n- **Heading structure** — tree with section codes and depth\n- **Content preview** — first N paragraphs with heading detection\n- **Token estimates** — preview and full-document token estimates\n\n## Spreadsheet Inspection\n\n`occ sheet inspect` is a lightweight XLSX preflight command aimed at both humans and agents. It helps answer \"is this workbook worth reading in depth?\" before spending tokens serializing cells or opening the file in Excel.\n\n```bash\n# Workbook-level summary + per-sheet schema/sample preview\nocc sheet inspect finance.xlsx\n\n# Machine-readable inspection payload\nocc sheet inspect finance.xlsx --format json\n\n# Narrow to one sheet and reduce preview width\nocc sheet inspect finance.xlsx --sheet Revenue --sample-rows 3 --max-columns 8\n```\n\nCurrent XLSX inspection highlights:\n\n- **Workbook metadata** — file size, workbook properties, custom properties, workbook-scoped names\n- **Sheet inventory** — visible / hidden / very hidden sheets, used ranges, cell counts, formula/comment/link counts\n- **Schema preview** — detected header row, inferred column types, coverage ratios, example values\n- **Lightweight sampling** — small row previews designed for preflight rather than full extraction\n- **Token estimates** — sample and full-sheet token estimates to guide downstream agent reads\n\n## Presentation Inspection\n\n`occ slide inspect` provides presentation metadata, risk flags, per-slide inventory, and content previews for PPTX and ODP files.\n\n```bash\n# Presentation overview with slide preview\nocc slide inspect deck.pptx\n\n# Machine-readable payload\nocc slide inspect deck.pptx --format json\n\n# Inspect a specific slide\nocc slide inspect deck.pptx --slide 3\n```\n\nCurrent presentation inspection surfaces:\n\n- **Presentation properties** — title, author, dates\n- **Risk flags** — comments, speaker notes, hyperlinks, embedded media, animations, macros, charts, tables\n- **Slide inventory** — per-slide title, word count, notes, images, tables, charts\n- **Content preview** — text preview for sample slides\n- **Token estimates** — preview and full-presentation token estimates\n\n## Table Extraction\n\n`occ table inspect` extracts structured table content from DOCX, XLSX, PPTX, ODT, and ODP documents. For AI agents, this is the primary way to read financial summaries, comparison matrices, and data tables without parsing raw XML.\n\n```bash\n# Extract all tables as JSON\nocc table inspect report.docx --format json\n\n# Tabular preview of table content\nocc table inspect finance.xlsx\n\n# Extract a specific table\nocc table inspect finance.xlsx --table 1\n\n# Limit sample rows\nocc table inspect report.docx --sample-rows 5\n```\n\nCurrent table extraction highlights:\n\n- **Multi-format support** — DOCX (via mammoth HTML), XLSX (via SheetJS), PPTX (from slide XML), ODT and ODP (from content.xml)\n- **Auto-detected headers** — first row is treated as headers when values are unique strings\n- **Merged cell support** — colspan and rowspan are preserved in the output\n- **Sample row limits** — configurable maximum rows per table (default: 20)\n- **Table filtering** — extract a specific table by index with `--table N`\n- **Token estimates** — per-table and total token estimates\n- **PDF graceful degradation** — returns empty tables with an informative note instead of unreliable heuristic output\n\n## Workspace Analysis\n\n`occ workspace` provides combined analysis of code, documents, and structures in a single versioned JSON payload — useful for AI agents that need a complete workspace overview.\n\n```bash\n# Full workspace analysis (code + documents + structures)\nocc workspace analyze --format json\n\n# Skip code analysis\nocc workspace analyze --no-code --format json\n\n# Document summaries with cross-reference detection\nocc workspace documents --format json\n\n# Limit documents and include markdown content\nocc workspace documents --max-files 20 --include-markdown --format json\n```\n\n`occ workspace analyze` returns a `schemaVersion: 1` JSON envelope containing code metrics (via scc), document aggregates, heading structures, skipped files, and errors. `occ workspace documents` returns per-document summaries with cross-references (filename mentions, hyperlinks, citations) and unresolved mentions detected across the document set.\n\nFor combined code indexing and document inspection with progress tracking:\n\n```ts\nimport { prepareWorkspaceContext } from '@cesarandreslopez/occ/workspace/prepare';\n\nconst context = await prepareWorkspaceContext('./my-project', {\n  includeCode: true,\n  includeDocuments: true,\n  executionMode: 'auto', // 'auto' | 'inline' | 'subprocess'\n}, (event) =\u003e {\n  console.log(`[${event.scope}] ${event.stage}: ${event.completed}/${event.total}`);\n});\n\n// context.code?.index  — full CodebaseIndex\n// context.documents    — WorkspaceDocumentSet with cross-references\n// context.elapsedMs    — total wall time\n// context.errors       — collected errors from both phases\n```\n\n## Documentation\n\nFull documentation is available at [cesarandreslopez.github.io/occ](https://cesarandreslopez.github.io/occ/), including:\n\n- [Installation](https://cesarandreslopez.github.io/occ/getting-started/installation/)\n- [Quick Start](https://cesarandreslopez.github.io/occ/getting-started/quick-start/)\n- [CLI Reference](https://cesarandreslopez.github.io/occ/usage/cli-reference/)\n- [Output Formats](https://cesarandreslopez.github.io/occ/usage/output-formats/)\n- [Architecture](https://cesarandreslopez.github.io/occ/architecture/overview/)\n- [Changelog](https://cesarandreslopez.github.io/occ/changelog/)\n\n## Why OCC?\n\nTools like `scc`, `cloc`, and `tokei` give you instant visibility into codebases — lines, languages, complexity. But most projects also contain Word documents, PDFs, spreadsheets, and presentations that are invisible to these tools. OCC fills that gap.\n\n### For Humans\n\n- **Project audits** — instantly see how much documentation lives alongside your code: total word counts, page counts, spreadsheet sizes, and presentation lengths\n- **Tracking documentation growth** — run OCC in CI to monitor how documentation scales over time, catch bloat early, or enforce minimums\n- **Onboarding** — new team members get a quick sense of a project's documentation footprint before diving in\n- **Migration planning** — when moving to a new platform, know exactly what you're dealing with across hundreds of files and formats\n\n### For AI Agents\n\n- **Context budgeting** — LLMs have finite context windows. OCC's word and page counts let agents estimate how much of a document set they can ingest before hitting token limits\n- **Prioritization** — an agent deciding which documents to read can use OCC's JSON output to rank files by size, word count, or type, focusing on the most relevant content first\n- **RAG chunk mapping** — `--structure --format json` outputs heading trees with character offsets, enabling chunk-to-section mapping, scoped retrieval, and citation paths in RAG pipelines\n- **Document triage** — `occ doc inspect --format json` surfaces risk flags, content stats, structure, and token estimates before an agent reads the full document\n- **Spreadsheet triage** — `occ sheet inspect --format json` exposes sheet visibility, formulas, links, comments, schema hints, and token estimates before an agent expands workbook contents\n- **Presentation triage** — `occ slide inspect --format json` provides slide inventory, risk flags, and content previews for quick assessment\n- **Table extraction** — `occ table inspect --format json` extracts structured table data (headers, rows, cells) from documents, giving agents direct access to tabular content without parsing raw XML\n- **Repository mapping** — agents exploring an unfamiliar codebase can combine `occ --format json` for document inventory with `occ code ... --format json` for symbol and relationship data\n- **Pipeline integration** — JSON output pipes directly into agent toolchains for automated document analysis, summarization, or compliance checking\n\n## How It Works\n\nOCC is written in TypeScript and uses [fast-glob](https://github.com/mrmlnc/fast-glob) for file discovery, dispatches to format-specific parsers (mammoth for DOCX, pdf-parse for PDF, SheetJS for XLSX, JSZip + officeparser for PPTX/ODF), aggregates metrics, and renders output via cli-table3. For code metrics, it shells out to a vendored [scc](https://github.com/boyter/scc) binary (auto-downloaded during `npm install`, with PATH fallback).\n\nFor structure extraction (`--structure`), documents are first converted to markdown (mammoth + [turndown](https://github.com/mixmark-io/turndown) for DOCX, pdf-parse with page markers for PDF), then headers are extracted and assembled into a tree with dotted section codes.\n\nFor `occ code`, OCC builds an in-memory code graph on demand. JavaScript and TypeScript are parsed with the TypeScript compiler API, Python uses a language-specific parser, and the query engine resolves symbols, imports, calls, inheritance, ambiguities, and dependency categories without a persistent database.\n\n## Contributing\n\nContributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for setup instructions and guidelines.\n\n## License\n\n[MIT](LICENSE)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcesarandreslopez%2Focc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcesarandreslopez%2Focc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcesarandreslopez%2Focc/lists"}