An open API service indexing awesome lists of open source software.

https://github.com/jsv4/docxodus

Office XML Redline Engine Based on OpenXML SDK (Forked from OpenXMLTools)
https://github.com/jsv4/docxodus

Last synced: 6 days ago
JSON representation

Office XML Redline Engine Based on OpenXML SDK (Forked from OpenXMLTools)

Awesome Lists containing this project

README

          


Docxodus


A powerful Typescript, Python and .NET library for manipulating Open XML documents (DOCX, XLSX, PPTX).


CI
License: MIT

---

Docxodus is a fork of [Open-Xml-PowerTools](https://github.com/OfficeDev/Open-Xml-PowerTools) upgraded to .NET 8.0. It provides tools for comparing Word documents, converting between DOCX and HTML, projecting DOCX to anchor-addressed markdown for LLM pipelines, programmatically editing DOCX content via a stateful session API, merging documents, and more.

## Quick Start

### Install the Library

```bash
# Install from NuGet
dotnet add package Docxodus
```

### Using as a Library

```csharp
using Docxodus;

// Compare documents
var original = new WmlDocument("original.docx");
var modified = new WmlDocument("modified.docx");

var settings = new WmlComparerSettings
{
AuthorForRevisions = "Redline",
DetailThreshold = 0
};

var result = WmlComparer.Compare(original, modified, settings);

// Get list of revisions (with move detection)
var revisions = WmlComparer.GetRevisions(result, settings);
foreach (var rev in revisions)
{
if (rev.RevisionType == WmlComparer.WmlComparerRevisionType.Moved)
Console.WriteLine($"Moved (group {rev.MoveGroupId}): {rev.Text}");
else
Console.WriteLine($"{rev.RevisionType}: {rev.Text}");
}

// Save the redlined document
result.SaveAs("redline.docx");
```

## CLI Tools

Docxodus includes two command-line tools:

### Redline (Document Comparison)

```bash
# Install globally
dotnet tool install -g Redline

# Usage
redline original.docx modified.docx output.docx

# With custom author tag
redline original.docx modified.docx output.docx --author="Legal Review"
```

| Option | Description |
|--------|-------------|
| `--author=` | Author name for tracked changes (default: "Redline") |
| `-h, --help` | Show help message |
| `-v, --version` | Show version information |

### docx2html (HTML Conversion)

```bash
# Install globally
dotnet tool install -g Docx2Html

# Basic conversion
docx2html document.docx

# Specify output file
docx2html document.docx output.html

# Extract images to files instead of embedding as base64
docx2html document.docx --extract-images

# Use inline styles instead of CSS classes
docx2html document.docx --inline-styles
```

| Option | Description |
|--------|-------------|
| `--title=` | Page title (default: document title or filename) |
| `--css-prefix=` | CSS class prefix (default: "pt-") |
| `--inline-styles` | Use inline styles instead of CSS classes |
| `--extract-images` | Save images to separate files instead of embedding |
| `-h, --help` | Show help message |
| `-v, --version` | Show version information |

## Download Standalone Binaries

Pre-built binaries are available on the [Releases](https://github.com/JSv4/Docxodus/releases) page:

**redline** (Document Comparison):

| Platform | Download |
|----------|----------|
| Windows (x64) | `redline-win-x64.exe` |
| Linux (x64) | `redline-linux-x64` |
| macOS (x64) | `redline-osx-x64` |
| macOS (ARM) | `redline-osx-arm64` |

**docx2html** (HTML Conversion):

| Platform | Download |
|----------|----------|
| Windows (x64) | `docx2html-win-x64.exe` |
| Linux (x64) | `docx2html-linux-x64` |
| macOS (x64) | `docx2html-osx-x64` |
| macOS (ARM) | `docx2html-osx-arm64` |

## Build from Source

```bash
# Clone the repository
git clone https://github.com/JSv4/Docxodus.git
cd Docxodus

# Build
dotnet build Docxodus.sln

# Run the CLI
dotnet run --project tools/redline/redline.csproj -- --help
```

## Testing

### .NET Unit Tests

```bash
# Run all tests (~1,100 tests)
dotnet test Docxodus.Tests/Docxodus.Tests.csproj

# Run specific test by name
dotnet test --filter "FullyQualifiedName~WC001"

# Run tests for a specific class
dotnet test --filter "FullyQualifiedName~WmlComparerTests"
```

### npm/WASM Browser Tests (Playwright)

```bash
# Need to be in npm subdirectory
cd npm

# Install dependencies (first time only)
npm install
npx playwright install chromium

# Build WASM and TypeScript (required before tests)
npm run build

# Run all Playwright tests (~62 tests)
npm test

# Run specific test by name pattern
npx playwright test --grep "Document Structure"

# Run tests with browser visible
npx playwright test --headed

# TypeScript type checking
npx tsc --noEmit
```

## Features

- **WmlComparer** - Compare two DOCX files and generate redlines with tracked changes
- **Move Detection** - Automatically detects when content is relocated (not just deleted and re-inserted)
- **Format Change Detection** - Detects formatting-only changes (bold, italic, font size, etc.)
- Configurable similarity threshold and minimum word count
- Links move pairs via `MoveGroupId` for easy tracking
- **WmlToHtmlConverter** / **HtmlToWmlConverter** - Bidirectional DOCX ↔ HTML conversion
- Comment rendering (endnote-style, inline, or margin)
- Paginated output mode for PDF-like viewing
- Headers, footers, footnotes, and endnotes support
- Custom annotation rendering
- **WmlToMarkdownConverter** - Anchor-addressed markdown projection of a DOCX with stable per-block IDs - a text view suitable for LLM editing pipelines, structured search indexers, and diff/review UIs
- **DocxSession** - Stateful in-memory DOCX editor keyed by markdown-projection anchor ids - the write-side counterpart to WmlToMarkdownConverter for agentic editing pipelines
- Text and structural edits: replace paragraph text, delete blocks, insert/split/merge paragraphs, change paragraph style, adjust list level, replace table-cell content
- Character-range formatting (bold, italic, underline, strike, code, color, run style) addressed by substring, span, or a prior search match
- Surgical text replacement that preserves per-run formatting on either side of the rewritten slice - including a span-addressed variant that lets identical placeholders in one paragraph each receive distinct values
- Cross-run text search with per-fragment run breakdown (`Grep`), plus a cross-block variant (`GrepCrossBlock`) that lets a single match span adjacent paragraphs, headings, and list items
- Template-slot enumeration that classifies bracketed regions as value blanks, alternative clauses, or drafter hints
- Anchor discovery by text, regex, kind, bookmark, annotation id, or shared label - so an agent told to "edit the indemnification clause" can resolve intent to anchors without re-walking the document
- NBSP / smart-quote handling so common Word whitespace and punctuation don't sabotage literal find/replace
- Tracked-change mode that lands every mutation as `w:ins` / `w:del` instead of accepted edits
- Bounded snapshot undo/redo
- Raw OOXML escape hatch for content the markdown subset can't express (charts, equations, content controls)
- Typed result envelope on every call - no exceptions across the API boundary
- Available in .NET, WASM, and an npm TypeScript wrapper
- **DocumentBuilder** - Merge and split DOCX files
- **DocumentAssembler** - Template population from XML data
- **PresentationBuilder** - Merge and split PPTX files
- **SpreadsheetWriter** - Simplified XLSX creation API
- **OpenXmlRegex** - Search/replace in DOCX/PPTX using regular expressions
- **OpenContractExporter** - Export documents to OpenContracts format for NLP/document analysis
- Supporting utilities for document manipulation

## Browser/JavaScript Usage (npm)

Docxodus is also available as an npm package for client-side usage via WebAssembly:

```bash
npm install docxodus
```

```javascript
import {
initialize,
convertDocxToHtml,
compareDocuments,
getRevisions,
getDocumentMetadata,
isMove,
isMoveSource,
isFormatChange,
findMovePair,
CommentRenderMode,
PaginationMode
} from 'docxodus';

await initialize();

// Convert DOCX to HTML with comments and pagination
const html = await convertDocxToHtml(docxFile, {
commentRenderMode: CommentRenderMode.EndnoteStyle,
paginationMode: PaginationMode.Paginated,
renderHeadersAndFooters: true
});

// Compare two documents
const redlinedDocx = await compareDocuments(originalFile, modifiedFile);

// Get revisions with move and format change detection
const revisions = await getRevisions(redlinedDocx);
for (const rev of revisions) {
if (isMove(rev)) {
const pair = findMovePair(rev, revisions);
if (isMoveSource(rev)) {
console.log(`Content moved from: "${rev.text}" to: "${pair?.text}"`);
}
} else if (isFormatChange(rev)) {
console.log(`Format changed: ${rev.formatChange?.changedPropertyNames?.join(', ')}`);
}
}

// Get document metadata for lazy loading
const metadata = await getDocumentMetadata(docxFile);
console.log(`${metadata.totalParagraphs} paragraphs, ${metadata.estimatedPageCount} pages`);
```

See the [npm package documentation](docs/npm-package.md) for full API reference, React hooks, and usage examples.

## Requirements

- .NET 8.0 or later

## License

MIT License - see [LICENSE](LICENSE) for details.

---

*Built on the shoulders of [Open-Xml-PowerTools](https://github.com/OfficeDev/Open-Xml-PowerTools). Thanks to Eric White, Thomas Barnekow, and all original contributors.*