https://github.com/qrcommunication/gigapdf-lib

Zero-dependency Rust→WASM PDF engine + TypeScript SDK — read/edit/render/OCR/convert PDFs with no third-party libs (@qrcommunication/gigapdf-lib)
Host: GitHub
URL: https://github.com/qrcommunication/gigapdf-lib
Owner: QrCommunication
License: other
Created: 2026-06-14T22:33:49.000Z (18 days ago)
Default Branch: main
Last Pushed: 2026-06-14T22:56:11.000Z (18 days ago)
Last Synced: 2026-06-15T00:20:04.856Z (18 days ago)
Language: Rust
Size: 699 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# gigapdf-lib

A **zero-dependency** PDF engine, written from scratch in Rust and compiled to
WebAssembly: read, edit, render, secure, **and convert** PDFs with no third-party
crates and no native libraries.

The TypeScript SDK is published as **[`@qrcommunication/gigapdf-lib`](https://www.npmjs.com/package/@qrcommunication/gigapdf-lib)**
(see [`sdk/`](sdk/)); the self-contained `.wasm` ships inside it.

> Copyright 2025 Rony Licha / QR Communication.

> Licensed under the **PolyForm Noncommercial License 1.0.0** — see [`LICENSE`](LICENSE).

> Required Notice: Copyright 2025 Rony Licha / QR Communication.

## Why it exists

The previous editor used a Fabric.js **overlay + cosmetic mask**, which cannot
reconstruct a complex background (gradient, image, pattern) under edited text.
This engine edits the **real PDF content stream**: it physically removes/edits/adds
the page operators, so the background is preserved *by construction* and the
original glyphs never leak. It then grew into a self-contained PDF toolkit so the
product depends on **no** external PDF/Office/font library (no MuPDF, no
LibreOffice, no fontkit) for its core flows.
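To make the "edit the operators, keep the background" idea concrete, here is a minimal sketch on an already-tokenized content stream. The `Op` type and the `op`, `remove_shown_text` and `serialize` helpers are hypothetical names invented for this illustration; they are not the engine's API, and a real editor also has to handle `TJ` arrays, encodings and graphics state.

```rust
/// One content-stream instruction: operands followed by an operator.
/// (Hypothetical type for illustration, not part of gigapdf-core.)
#[derive(Debug, Clone, PartialEq)]
pub struct Op {
    pub operands: Vec<String>,
    pub operator: String,
}

/// Convenience constructor for the sketch.
pub fn op(operands: &[&str], operator: &str) -> Op {
    Op {
        operands: operands.iter().map(|s| s.to_string()).collect(),
        operator: operator.to_string(),
    }
}

/// Drop every `Tj` operator that shows exactly `literal`, e.g. "(Secret)".
/// All other operators, including the ones that paint the background,
/// are passed through untouched.
pub fn remove_shown_text(ops: &[Op], literal: &str) -> Vec<Op> {
    ops.iter()
        .filter(|o| !(o.operator == "Tj" && o.operands.len() == 1 && o.operands[0] == literal))
        .cloned()
        .collect()
}

/// Write the operators back out, one instruction per line.
pub fn serialize(ops: &[Op]) -> String {
    ops.iter()
        .map(|o| {
            let mut parts = o.operands.clone();
            parts.push(o.operator.clone());
            parts.join(" ")
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Because the text operator is deleted rather than covered, the `re`/`f` pair that paints the background survives unchanged and the removed string is no longer present in the stream.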

## Zero dependencies

**None.** Everything is pure `std` and compiles straight to `wasm32`:

- Lexer, object parser, xref-streams, object-streams.
- `FlateDecode`/zlib **inflate *and* deflate** (RFC 1950/1951) from scratch.
- Content-stream interpreter + editor; renumbering serializer.
- Crypto from scratch: MD5, RC4, AES-128/256, SHA-256/384/512, big-integer
  modular arithmetic (Montgomery), RSA, ASN.1 DER, X.509, CMS/PKCS#7.
- Rasterizer: scanline fill (AA), PNG encoder, TrueType `glyf` + CFF Type2 glyph
  outlines, image XObject blit.
- ZIP reader/writer, OOXML/ODF builders, a from-scratch PDF page builder.

The WebAssembly sandbox has **no network and no entropy**; those come from the
host through a tiny port (the host supplies `crypto.getRandomValues` bytes and
performs Google-Fonts downloads). Everything else is in the engine.
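As an illustration of how little is needed for a valid zlib stream using only `std`, the sketch below implements the Adler-32 checksum from RFC 1950 and wraps data in stored (uncompressed) deflate blocks from RFC 1951. The function names are assumptions made for this example, and this is not the engine's compressor: a real deflate implementation adds LZ77 matching and Huffman coding on top of the same container.

```rust
/// Adler-32 checksum as defined in RFC 1950.
pub fn adler32(data: &[u8]) -> u32 {
    const MOD: u32 = 65_521; // largest prime below 2^16
    let (mut a, mut b) = (1u32, 0u32);
    for &byte in data {
        a = (a + byte as u32) % MOD;
        b = (b + a) % MOD;
    }
    (b << 16) | a
}

/// Wrap `data` in a zlib stream made of stored deflate blocks only.
/// Any conforming inflater can read the result; nothing is compressed.
pub fn zlib_stored(data: &[u8]) -> Vec<u8> {
    // CMF = 0x78 (deflate, 32 KiB window), FLG = 0x01 (makes CMF*256+FLG divisible by 31).
    let mut out = vec![0x78, 0x01];
    let mut chunks = data.chunks(65_535).peekable();
    if chunks.peek().is_none() {
        // Empty input still needs one final, empty stored block.
        out.extend_from_slice(&[0x01, 0x00, 0x00, 0xFF, 0xFF]);
    }
    while let Some(chunk) = chunks.next() {
        let is_last = chunks.peek().is_none();
        out.push(is_last as u8); // BFINAL in bit 0, BTYPE = 00 (stored)
        let len = chunk.len() as u16;
        out.extend_from_slice(&len.to_le_bytes()); // LEN
        out.extend_from_slice(&(!len).to_le_bytes()); // NLEN = one's complement of LEN
        out.extend_from_slice(chunk);
    }
    // The trailer is the Adler-32 of the uncompressed data, big-endian.
    out.extend_from_slice(&adler32(data).to_be_bytes());
    out
}
```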

## Feature matrix

| Area | Capabilities |
|------|--------------|
| **Read** | PDF 1.7, xref + object streams, FlateDecode, encrypted (RC4/AESV2/AESV3) |
| **Write** | Renumbering serializer, `save`, `save_compressed` (Flate streams) |
| **Edit content** | Text edit/remove, elements (text/image/shape) list/remove/move/duplicate/add; draw text/rect/line/ellipse/polygon/SVG-path/image (opacity + PNG alpha); hit-test |
| **Text extraction** | Font-aware, zero-tofu via WinAnsi + `/ToUnicode` CMap (CID/Type0) |
| **Annotations** | Highlight, underline, strike-out, free-text, square, line, ink, stamp, link; **flatten** |
| **Forms (AcroForm)** | Text/checkbox/radio/combo/list/signature fields: **read · fill · create** (build widgets from scratch with appearance streams + `NeedAppearances`) |
| **Pages** | Rotate, delete, move, extract, merge; bookmarks/outline; metadata |
| **Security** | Encrypt/permissions, **self-signed digital signature** (RSA/X.509/CMS), **PKCS#12 signing** (import a user `.p12`/`.pfx` natively: PBES2 AES + PBES1 3DES/RC2, MAC-verified; no node-forge/@signpdf), **true redaction** (delete from stream, no opaque cover) |
| **Render** | Rasterize a page to PNG (vector + TrueType/CFF glyphs + images) |
| **Text intelligence** | Font-aware extraction, **structured text** (reading-order lines + boxes), **full-text search** with highlight boxes |
| **OCR** | Built-in recognizer: Otsu → connected components → line/word segmentation → MLP trained on **EMNIST handwriting + synthetic font glyphs** (Latin + accents). No Tesseract, no model download at runtime |
| **Convert →** | PDF → **TXT, HTML, DOCX, PPTX, ODP, ODT, XLSX, ODS, RTF** (real editable elements, not a page image) |
| **Convert ←** | **TXT, HTML, RTF, DOCX, ODT, ODP, PPTX, XLSX, ODS** → PDF (ODF `.odt`/`.ods`/`.odp` are fully bidirectional) |
| **HTML rendering** | Native **HTML + CSS → PDF** engine (parser, selector cascade, block / inline / table / **flex** (direction · justify-content · grow) / **grid** layout, pagination, **`page-break-*` + ``**), with no headless browser. Text set in **embedded Google fonts** (real glyphs + metrics, identical or nearest match) |
| **JavaScript** | Built-in zero-dependency **JS engine** that runs a document's inline `<script>`s before layout, with **no Chromium/Playwright**. Lexer → parser → tree-walking interpreter with **classes + `super`**, closures, destructuring, generators (`function*`/`yield`), **`async`/`await` + `Promise`** (microtask queue + `setTimeout`), and built-ins: `Object`/`Array`/`String`/`Number`/`Math`/`JSON`/`console`/`Map`/`Set`/**`RegExp`** + a backtracking regex engine. **DOM bindings**: `getElementById`, `querySelector(All)` (`#id`/`.class`/`tag`/`>`/`+`/`~`/`[attr]`), `textContent`, `innerHTML`, `createElement`/`appendChild`, `classList`, `style`, … |
| **Archival** | **PDF/A-2b** metadata (XMP + sRGB OutputIntent + ID) |
| **Fonts** | Draw **and edit** real text in **every font source & any font file**: built-in **base-14 standard fonts** (no embedding), any family / **Google Font** (1951-family catalog + URL builder + **TrueType *and* OpenType-CFF embedding**: glyf→Type0/CIDFontType2+FontFile2, `.otf`/`OTTO`→Type0/CIDFontType0+FontFile3, Identity-H + full widths + ToUnicode), and the **document's own embedded faces** (`embeddedFonts` + `extractFont` → re-embed). `addText` **and** font-aware `replaceText` resolve any face's char→glyph map (`FontFile2`/`FontFile3`); needed-font detection |

All of it is exercised by `cargo test` (**284 tests**, including 100 for the
pure-Rust JavaScript engine: lexer, parser, interpreter, built-ins, regex, DOM,
and a suspendable bytecode VM with lazy generators, spec-ordered async, and full
control flow such as `try`/`catch`/`finally`, `switch`, labels, destructuring and
spread) and by a Node WASM smoke test (end-to-end, all green). It is also
**validated externally**: generated Office files (DOCX/PPTX/XLSX **and
ODT/ODS/ODP**) open and round-trip in LibreOffice, and embedded fonts verify as
`emb=yes` under poppler's `pdffonts`.

## Honest scope

Conversions are **content-and-layout faithful**, not pixel-perfect re-typesetting.
PDF→Office reconstructs **real, editable objects** (positioned text boxes,
re-embedded images, table cells) the way an office suite's PDF import does, not a
rendered page image. Office→PDF is **text-faithful** (all content, reading order,
pagination) using the standard-14 fonts; pixel-perfect re-layout of an arbitrary,
richly-styled document stays the job of a full layout engine. Full PDF/A
conformance additionally requires every font embedded (the engine can do that).

The **JavaScript engine** targets the language used by templating/report scripts:
classes/`super`, closures, destructuring/spread, `RegExp`, `Map`/`Set`, `Symbol`
(real, with the iterator protocol), `eval`/`Function`, tagged templates, and
`import`/`export` (parsed transparently). `function*`/`async` bodies compile to a
**suspendable bytecode VM**, so generators are **truly lazy** (infinite
`while (true) { yield … }` works, `.next(v)` is bidirectional, `yield*` delegates
lazily) and `await` **yields to the event loop** with spec microtask ordering.
The VM covers the full statement/expression language used by templates
(`try`/`catch`/`finally`, `for…of`/`for…in`, `switch`, labelled `break`/`continue`,
destructuring, compound assignment, and `...spread`), all able to span a
`yield`/`await`. A handful of corner cases (a `return`/`break` *through* a
`finally`, a logical `&&=`/`||=`/`??=` with an awaited right-hand side, sparse
array holes) transparently fall back to the eager generator / synchronous-await
model: same results, just not lazy.
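What "suspendable" and "bidirectional `.next(v)`" mean can be sketched by hand-compiling one small generator into a state machine. The `Acc` type below is a hypothetical illustration and not the engine's bytecode VM; it only shows why an infinite `while (true)` body is harmless when each resume runs up to the next `yield` and then stops.

```rust
/// Hand-written state machine for this JavaScript generator:
///
///   function* acc() {
///     let total = 0;
///     while (true) {
///       const v = yield total;
///       total += v ?? 0;
///     }
///   }
///
/// (Hypothetical type for illustration only.)
pub struct Acc {
    total: i64,
    started: bool,
}

impl Acc {
    pub fn new() -> Self {
        Acc { total: 0, started: false }
    }

    /// Equivalent of `gen.next(sent)`: run until the next `yield`
    /// and return the yielded value. State survives between calls.
    pub fn resume(&mut self, sent: Option<i64>) -> i64 {
        if self.started {
            // We are suspended at `yield total`; `sent` becomes its value.
            self.total += sent.unwrap_or(0);
        } else {
            // First call: JavaScript discards the argument of the first next().
            self.started = true;
        }
        self.total
    }
}
```

Each call does a bounded amount of work, so the caller decides how many values are ever produced.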

By design the sandbox has **no network and no real timers** (`setTimeout`
resolves on the microtask queue). CSS **flex** supports `flex-direction`,
`justify-content` and `flex-grow`; **grid** lays out `grid-template-columns`;
**float** maps to inline-block.
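For readers unfamiliar with `flex-grow`, the sketch below shows the basic distribution rule: leftover space on the main axis is shared in proportion to each item's grow factor. `FlexItem` and `main_sizes` are hypothetical names, and the sketch deliberately ignores shrinking, min/max constraints, gaps and the special case where grow factors sum to less than 1; how the engine resolves those cases is not stated here.

```rust
/// A flex item reduced to the two numbers this sketch needs.
/// (Hypothetical type for illustration only.)
pub struct FlexItem {
    /// Size along the main axis before free space is distributed.
    pub basis: f64,
    /// The `flex-grow` factor.
    pub grow: f64,
}

/// Resolve main-axis sizes by sharing positive free space
/// proportionally to the grow factors.
pub fn main_sizes(container: f64, items: &[FlexItem]) -> Vec<f64> {
    let used: f64 = items.iter().map(|i| i.basis).sum();
    // This sketch never shrinks items, so negative free space is clamped to zero.
    let free = (container - used).max(0.0);
    let total_grow: f64 = items.iter().map(|i| i.grow).sum();
    items
        .iter()
        .map(|i| {
            if total_grow > 0.0 {
                i.basis + free * i.grow / total_grow
            } else {
                i.basis
            }
        })
        .collect()
}
```

With a 300-unit container and two 50-unit items whose grow factors are 1 and 3, the 200 units of free space split 50/150, giving sizes of 100 and 200.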

## OCR & text intelligence

Text already in a PDF is extracted **font-aware** (zero tofu) with reading-order
lines and bounding boxes, and is searchable with highlight boxes. For **scanned,
image-only pages** the engine has a built-in OCR following the classic Tesseract
pipeline (Otsu binarization → connected-component blobs → line/word segmentation
→ per-glyph classification), but with a from-scratch, dependency-free classifier:

- The classifier is a small MLP **trained offline** on two public sources:
  **EMNIST** (NIST handwritten digits + letters, public domain) for **handwriting**,
  and **synthetic glyphs rendered from ~220 system fonts** (the Tesseract
  `text2image` approach) for **printed text, punctuation and accented Latin**.
- Training is build-time only (`tools/train_ocr.py`); the engine ships the
  **int8-quantized weights** and runs a pure-`std` forward pass, with no ML
  library and no model download at runtime.
- **Scripts/languages:** Latin (`0-9 A-Z a-z`), common punctuation, and accented
  Latin (`é è à ç ñ ü …`) for French, Spanish, German, Portuguese, etc. Both
  **printed and handwritten** Latin are recognized. Other scripts (Cyrillic,
  Greek, CJK, Arabic) are not covered yet; they're a matter of adding classes +
  data to the trainer, with **no runtime change**.
- **Honest accuracy:** strong on clean machine print, decent on tidy handwriting
  (EMNIST-grade); noisy scans and dense layouts are harder. Retrain with more data
  to improve; the runtime never changes.
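The first stage of the pipeline above, Otsu binarization, fits in a few lines of pure `std` Rust. This is a generic textbook sketch under two assumptions: the input is 8-bit grayscale, and ink is darker than paper. The function names are made up for the example and the engine's own implementation may differ.

```rust
/// Otsu's method: pick the threshold `t` that maximizes the between-class
/// variance of the two groups `pixel <= t` and `pixel > t`.
pub fn otsu_threshold(pixels: &[u8]) -> u8 {
    let mut hist = [0u64; 256];
    for &p in pixels {
        hist[p as usize] += 1;
    }
    let total = pixels.len() as f64;
    let sum_all: f64 = hist
        .iter()
        .enumerate()
        .map(|(value, &count)| value as f64 * count as f64)
        .sum();

    let (mut w0, mut sum0) = (0.0f64, 0.0f64); // weight and intensity sum of the dark class
    let (mut best_t, mut best_var) = (0u8, -1.0f64);
    for t in 0..256usize {
        w0 += hist[t] as f64;
        sum0 += t as f64 * hist[t] as f64;
        if w0 == 0.0 {
            continue; // no pixels in the dark class yet
        }
        let w1 = total - w0;
        if w1 == 0.0 {
            break; // every pixel is already in the dark class
        }
        let mean0 = sum0 / w0;
        let mean1 = (sum_all - sum0) / w1;
        let between = w0 * w1 * (mean0 - mean1) * (mean0 - mean1);
        if between > best_var {
            best_var = between;
            best_t = t as u8;
        }
    }
    best_t
}

/// Mark each pixel as ink (`true`) when it is at or below the threshold.
pub fn binarize(pixels: &[u8]) -> Vec<bool> {
    let t = otsu_threshold(pixels);
    pixels.iter().map(|&p| p <= t).collect()
}
```

The resulting ink mask is what a connected-component pass would then group into glyph blobs.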

## Layout

```
crates/core   gigapdf-core  - the whole engine (parse, inflate, edit, render, crypto, convert)
crates/wasm   gigapdf-wasm  - extern "C" WebAssembly bindings (zero-dep ABI)
fixtures/     test PDFs
test/         wasm-smoke.mjs - end-to-end Node harness
tools/        catalog/ICC generators + snapshots
docs/         API.md · USAGE.md · INSTALL.md
```

## Quickstart

### Rust

```rust
use gigapdf_core::Document;

// `bytes`, `ttf` and `signer` are supplied by the caller; run this inside a
// function that returns `Result` so that `?` can propagate errors.
let mut doc = Document::open(&bytes)?;
let docx = doc.to_docx();            // PDF → editable Word
let pdf  = gigapdf_core::convert::reverse::txt_to_pdf("Hello\nWorld"); // text → PDF
doc.embed_truetype_font("Roboto", &ttf)?; // host-downloaded font
let signed = doc.sign(&signer, "Me", "Approval", "D:20260614120000Z")?;
let out = doc.save();
```
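The last argument to `sign` in the snippet has the shape of a PDF date string (`D:YYYYMMDDHHmmSS` followed by a timezone designator, as defined in ISO 32000). Assuming that is what the parameter expects, a small helper like the one below can build it from UTC components. `pdf_date_utc` is a hypothetical function written for this example and is not part of `gigapdf_core`; it checks field ranges only, not calendar validity.

```rust
/// Format a UTC timestamp as a PDF date string, e.g. "D:20260614120000Z".
/// Returns `None` when a field is outside its basic range.
/// (Hypothetical helper for illustration only.)
pub fn pdf_date_utc(
    year: u16,
    month: u8,
    day: u8,
    hour: u8,
    minute: u8,
    second: u8,
) -> Option<String> {
    let valid = year <= 9999
        && (1..=12).contains(&month)
        && (1..=31).contains(&day)
        && hour < 24
        && minute < 60
        && second < 60;
    if !valid {
        return None;
    }
    // "Z" marks the time as UTC, so no offset fields follow it.
    Some(format!(
        "D:{:04}{:02}{:02}{:02}{:02}{:02}Z",
        year, month, day, hour, minute, second
    ))
}
```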

### Browser / Node (WebAssembly)

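```js
// `wasmBytes`, `ptr`, `len`, `lenPtr` and `callBuffer` are not defined in this
// snippet; docs/USAGE.md covers the buffer ABI they belong to.
const { instance } = await WebAssembly.instantiate(wasmBytes, {});
const ex = instance.exports;
const handle = ex.gp_open(ptr, len);     // returns an opaque handle
const docx = callBuffer(() => ex.gp_to_docx(handle, lenPtr)); // → Uint8Array
ex.gp_close(handle);
```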
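The calls above pass a pointer plus a length in, and get a buffer back through a length out-pointer. The sketch below shows one common way to write the guest side of such a convention in Rust. It is an assumption-laden illustration: `demo_upper` and `demo_free` are invented names, and the real `gp_*` exports may allocate, return and free buffers differently (see `docs/API.md`).

```rust
/// Hypothetical export: upper-case `len` bytes at `ptr`, return a pointer to a
/// freshly allocated result buffer and write its length to `len_out`.
///
/// Safety: `ptr` must be valid for `len` bytes and `len_out` must be writable.
pub unsafe extern "C" fn demo_upper(ptr: *const u8, len: usize, len_out: *mut usize) -> *mut u8 {
    let input = unsafe { std::slice::from_raw_parts(ptr, len) };
    let output: Vec<u8> = input.iter().map(|b| b.to_ascii_uppercase()).collect();
    // A boxed slice has no spare capacity, so pointer + length fully describe it.
    let boxed: Box<[u8]> = output.into_boxed_slice();
    unsafe { *len_out = boxed.len() };
    // Ownership moves to the caller until it hands the buffer to `demo_free`.
    Box::into_raw(boxed) as *mut u8
}

/// Hypothetical export: release a buffer previously returned by `demo_upper`.
///
/// Safety: `ptr` and `len` must be exactly the values produced by `demo_upper`.
pub unsafe extern "C" fn demo_free(ptr: *mut u8, len: usize) {
    let raw = std::ptr::slice_from_raw_parts_mut(ptr, len);
    drop(unsafe { Box::from_raw(raw) });
}
```

On the host, a helper in the role of `callBuffer` would read the length back from the out-pointer, copy that many bytes out of WebAssembly linear memory into a `Uint8Array`, and then free the guest buffer.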

### Documentation

| Doc | What's in it |
|-----|--------------|
| [`docs/SDK.md`](docs/SDK.md) | **Complete TypeScript SDK reference**: every `GigaPdfEngine`/`GigaPdfDoc` method, grouped by domain, with parameters, returns and notes. |
| [`docs/USAGE.md`](docs/USAGE.md) | Cookbook: the buffer ABI plus a worked example for every feature area. |
| [`docs/API.md`](docs/API.md) | The Rust ↔ WASM ABI mapping (every `gp_*` export and its Rust method). |
| [`docs/HTML-CSS.md`](docs/HTML-CSS.md) | The **exhaustive** list of supported HTML elements, CSS properties, units, colours, selectors and JS in the HTML→PDF renderer. |
| [`docs/INSTALL.md`](docs/INSTALL.md) | Install, build-from-source, and Next.js (`output: "standalone"`) wiring. |

## Build

```bash
cargo test -p gigapdf-core   # native tests (real fixtures)
cargo wasm                   # build the WASM engine (alias, see .cargo/config.toml)
node test/wasm-smoke.mjs     # end-to-end WASM smoke test
```

`cargo wasm` is a repo alias for the full target build, so you never type the
target triple by hand (use `cargo wasm-dev` for a debug build).

The release `.wasm` is ~540 KB with **zero dependencies**, versus ~14 MB for MuPDF.

## License & provenance

PolyForm Noncommercial 1.0.0. Built clean-room from the ISO 32000 specification;
**no AGPL code (e.g. MuPDF) was ever read or copied.** See [`LICENSE`](LICENSE).