{"id":50390996,"url":"https://github.com/hallelx2/pdftable","last_synced_at":"2026-05-30T18:01:38.717Z","repository":{"id":361215492,"uuid":"1250690491","full_name":"hallelx2/pdftable","owner":"hallelx2","description":"Go-native port of pdfplumber — PDF text + table extraction primitives","archived":false,"fork":false,"pushed_at":"2026-05-29T16:03:16.000Z","size":259,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-29T18:05:02.054Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hallelx2.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-26T22:01:38.000Z","updated_at":"2026-05-29T16:03:03.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/hallelx2/pdftable","commit_stats":null,"previous_names":["hallelx2/pdftable"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/hallelx2/pdftable","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hallelx2%2Fpdftable","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hallelx2%2Fpdftable/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hallelx2%2Fpdftable/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hallelx2%2Fpdftable/manifests","owner_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hallelx2","download_url":"https://codeload.github.com/hallelx2/pdftable/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hallelx2%2Fpdftable/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33703065,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-30T02:00:06.278Z","response_time":92,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-05-30T18:01:33.495Z","updated_at":"2026-05-30T18:01:38.711Z","avatar_url":"https://github.com/hallelx2.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pdftable\n\nA Go-native port of Python's [pdfplumber](https://github.com/jsvine/pdfplumber).\n\n`pdftable` reads PDF documents, walks the content streams, and surfaces\nthe positioned primitives — characters, lines, rectangles, curves — that\nhigher-level layout algorithms (text extraction, word grouping, table\ndetection) operate on. 
It is built on top of\n[pdfcpu](https://github.com/pdfcpu/pdfcpu) for low-level object parsing,\nxref handling, and FlateDecode decompression; everything above that\n(operator dispatch, text state, glyph positioning, ToUnicode CMaps,\nfont encodings) is implemented here.\n\nThe library targets the gap in the Go PDF ecosystem: existing libraries\neither render PDFs to images, manipulate metadata, or extract bag-of-\nwords text. None of them give you what pdfplumber gives Python users —\na structured per-page object model you can run table-detection\nheuristics on. This is that.\n\n## Status\n\n`v0.3.0` — full pdfplumber parity for table-finding strategies. All four\ncanonical strategies are implemented: `lines`, `lines_strict`, `text`,\nand `explicit`. Mix and match per-axis (e.g. `vertical=\"text\"` +\n`horizontal=\"lines\"`) works as expected. Also ships the `pdftable`\nCLI for extracting text and tables without writing Go.\n\n[![Go Reference](https://pkg.go.dev/badge/github.com/hallelx2/pdftable.svg)](https://pkg.go.dev/github.com/hallelx2/pdftable)\n[![CI](https://github.com/hallelx2/pdftable/actions/workflows/test.yml/badge.svg)](https://github.com/hallelx2/pdftable/actions/workflows/test.yml)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n\n## Install\n\n```sh\ngo get github.com/hallelx2/pdftable@v0.3.0\n```\n\nRequires Go 1.25+ (uses the standard-library `iter` package for the `Pages()` range-over-func iterator, and pdfcpu v0.12+).\n\n## Quickstart\n\n```go\npackage main\n\nimport (\n    \"fmt\"\n    \"log\"\n\n    \"github.com/hallelx2/pdftable\"\n)\n\nfunc main() {\n    doc, err := pdftable.OpenFile(\"report.pdf\")\n    if err != nil {\n        log.Fatal(err)\n    }\n    defer doc.Close()\n\n    for n, page := range doc.Pages() {\n        // Primitives (v0.0.1).\n        chars, _ := page.Chars()\n        rects, _ := page.Rects()\n        lines, _ := page.Lines()\n        fmt.Printf(\"page %d: %d chars, %d rects, %d 
lines\\n\",\n            n, len(chars), len(rects), len(lines))\n\n        // Words and text extraction (v0.1.0).\n        words, _ := page.Words(pdftable.DefaultWordOpts())\n        text, _ := page.ExtractText(pdftable.DefaultTextOpts())\n        fmt.Printf(\"  %d words; first line: %q\\n\",\n            len(words), firstLine(text))\n    }\n}\n\nfunc firstLine(s string) string {\n    for i, r := range s {\n        if r == '\\n' {\n            return s[:i]\n        }\n    }\n    return s\n}\n```\n\n## API surface\n\n```go\n// Constructors.\nfunc Open(r io.Reader) (Document, error)\nfunc OpenBytes(b []byte) (Document, error)\nfunc OpenFile(path string) (Document, error)\n\n// Document.\ntype Document interface {\n    NumPages() int\n    Page(n int) (Page, error)              // 1-indexed\n    Pages() iter.Seq2[int, Page]           // Go 1.23+ range-over-func\n    Close() error\n}\n\n// Page.\ntype Page interface {\n    Number() int\n    Width() float64\n    Height() float64\n    Chars() ([]Char, error)\n    Lines() ([]Line, error)\n    Rects() ([]Rect, error)\n    Curves() ([]Curve, error)\n    Objects() (Objects, error)\n\n    // New in v0.1.0: word + text extraction.\n    Words(opts WordOpts) ([]Word, error)\n    ExtractText(opts TextOpts) (string, error)\n    ExtractTextSimple(xTolerance, yTolerance float64) (string, error)\n\n    // Table finding: lines + lines_strict (v0.2.0); text + explicit (v0.3.0).\n    FindTables(settings TableSettings) ([]TableFinder, error)\n    ExtractTables(settings TableSettings) ([]*Table, error)\n}\n\n// Primitives.\ntype Char struct {\n    Text                  string\n    X0, Y0, X1, Y1        float64\n    FontName              string\n    FontSize              float64\n    Upright               bool\n    Advance               float64\n}\n\ntype Line struct { X0, Y0, X1, Y1 float64; Stroke bool; Width float64 }\n\ntype Rect struct { X0, Y0, X1, Y1 float64; Stroke, Fill bool; Width float64 }\n\ntype Curve struct { Points 
[][2]float64; Stroke, Fill bool; Width float64 }\n\ntype Objects struct { Chars []Char; Lines []Line; Rects []Rect; Curves []Curve }\n\n// Word (new in v0.1.0).\ntype Word struct {\n    Text                string\n    X0, Y0, X1, Y1      float64\n    Upright             bool\n    Direction           string // \"ltr\" | \"rtl\" | \"ttb\" | \"btt\"\n    FontName            string\n    FontSize            float64\n    Chars               []Char // populated when WordOpts.KeepChars=true\n}\n\n// WordOpts: configure Page.Words. Use DefaultWordOpts() for pdfplumber-matching defaults.\ntype WordOpts struct {\n    XTolerance         float64 // default 3\n    YTolerance         float64 // default 3\n    KeepBlankChars     bool\n    UseTextFlow        bool\n    HorizontalLTR      bool   // default true\n    VerticalTTB        bool   // default true\n    ExtraAttrs         []string\n    SplitAtPunctuation bool\n    Expand             bool   // ligature expansion; default true\n    KeepChars          bool\n}\n\n// TextOpts: configure Page.ExtractText. 
Use DefaultTextOpts() for defaults.\ntype TextOpts struct {\n    XTolerance, YTolerance       float64\n    Layout                       bool\n    LayoutWidthChars             int\n    LayoutHeightChars            int\n    XDensity, YDensity           float64 // PDF points per character / per line\n    UseTextFlow                  bool\n    HorizontalLTR                bool\n    VerticalTTB                  bool\n    ExtraAttrs                   []string\n    Expand                       bool\n}\n\n// Sentinel errors.\nvar (\n    ErrInvalidPDF     = errors.New(\"pdftable: invalid PDF\")\n    ErrPageOutOfRange = errors.New(\"pdftable: page out of range\")\n    ErrUnsupported    = errors.New(\"pdftable: unsupported feature\")\n    ErrEncrypted      = errors.New(\"pdftable: encrypted PDF (decryption not yet supported)\")\n)\n```\n\n## Text extraction\n\n```go\ndoc, _ := pdftable.OpenFile(\"report.pdf\")\ndefer doc.Close()\npage, _ := doc.Page(1)\n\n// Words: each Word is a contiguous text run.\nwords, _ := page.Words(pdftable.DefaultWordOpts())\nfor _, w := range words {\n    fmt.Printf(\"%-20s @ (%.1f, %.1f) %s %.1fpt\\n\",\n        w.Text, w.X0, w.Y0, w.FontName, w.FontSize)\n}\n\n// ExtractText: all text on the page as one string. Dense (no layout)\n// joins words with spaces and lines with \"\\n\".\ntext, _ := page.ExtractText(pdftable.DefaultTextOpts())\nfmt.Println(text)\n\n// Layout-preserving extraction emulates `pdftotext -layout` / pdfplumber's\n// extract_text(layout=True) — column-aligned output suitable for forms.\nopts := pdftable.DefaultTextOpts()\nopts.Layout = true\nlaid, _ := page.ExtractText(opts)\nfmt.Println(laid)\n```\n\n## Tables\n\n`Page.ExtractTables` is the table-detection entry point. 
It runs the\nedges → intersections → cells → tables pipeline (a direct port of\npdfplumber's `TableFinder`) and returns one `*Table` per detected\ntable, with cell text already extracted.\n\n```go\ndoc, _ := pdftable.OpenFile(\"invoice.pdf\")\ndefer doc.Close()\npage, _ := doc.Page(1)\n\nsettings := pdftable.DefaultTableSettings()\n// settings.VerticalStrategy = pdftable.StrategyLinesStrict  // ignore rect outlines\n\ntables, _ := page.ExtractTables(settings)\nfor ti, t := range tables {\n    fmt.Printf(\"table %d: %d rows × %d cols at %+v\\n\",\n        ti, len(t.Rows), len(t.Rows[0]), t.BBox)\n    for _, row := range t.Rows {\n        fmt.Println(row)\n    }\n}\n```\n\n`TableSettings` defaults match pdfplumber's\n(`snap_tolerance=3`, `join_tolerance=3`, `edge_min_length=3`,\n`intersection_tolerance=3`, `text_tolerance=3`, `min_words_vertical=3`,\n`min_words_horizontal=1`). Override any field on the value returned\nfrom `DefaultTableSettings()` to tighten or loosen the heuristics.\n\nThe four implemented strategies (one per axis, chosen independently):\n\n- `StrategyLines` — edges come from drawn `Line` segments, `Rect`\n  outlines (all four sides), and axis-aligned `Curve` segments.\n  Default. Best for typical PDFs whose tables have rule lines.\n- `StrategyLinesStrict` — only drawn `Line` segments are used. Use\n  this when your PDF draws cell BACKGROUNDS as filled rectangles\n  that you do NOT want treated as row boundaries.\n- `StrategyText` — edges inferred from word alignment. Vertical\n  edges come from clusters of words sharing X0 / X1 / centre;\n  horizontal edges from clusters sharing top-Y. Tunable via\n  `MinWordsVertical` (default 3) and `MinWordsHorizontal` (default 1).\n- `StrategyExplicit` — caller-supplied edges via\n  `ExplicitVerticalLines` / `ExplicitHorizontalLines`. 
Required when\n  table boundaries are known from layout analysis or manual\n  annotation.\n\n### Side-by-side: pdfplumber → pdftable (lines strategy)\n\n```python\n# Python (pdfplumber)\nimport pdfplumber\n\nwith pdfplumber.open(\"invoice.pdf\") as pdf:\n    page = pdf.pages[0]\n    for table in page.find_tables({\"vertical_strategy\": \"lines\",\n                                    \"horizontal_strategy\": \"lines\"}):\n        for row in table.extract():\n            print(row)\n```\n\n```go\n// Go (pdftable)\nimport \"github.com/hallelx2/pdftable\"\n\ndoc, _ := pdftable.OpenFile(\"invoice.pdf\")\ndefer doc.Close()\npage, _ := doc.Page(1)\n\nsettings := pdftable.DefaultTableSettings()\nsettings.VerticalStrategy = pdftable.StrategyLines\nsettings.HorizontalStrategy = pdftable.StrategyLines\n\ntables, _ := page.ExtractTables(settings)\nfor _, t := range tables {\n    for _, row := range t.Rows {\n        fmt.Println(row)\n    }\n}\n```\n\n### Side-by-side: pdfplumber → pdftable (text strategy)\n\n```python\n# Python (pdfplumber) — borderless tables\nimport pdfplumber\n\nwith pdfplumber.open(\"10k-filing.pdf\") as pdf:\n    page = pdf.pages[3]\n    for table in page.find_tables({\"vertical_strategy\": \"text\",\n                                    \"horizontal_strategy\": \"text\",\n                                    \"min_words_vertical\": 3}):\n        for row in table.extract():\n            print(row)\n```\n\n```go\n// Go (pdftable)\ndoc, _ := pdftable.OpenFile(\"10k-filing.pdf\")\ndefer doc.Close()\npage, _ := doc.Page(4)\n\nsettings := pdftable.DefaultTableSettings()\nsettings.VerticalStrategy = pdftable.StrategyText\nsettings.HorizontalStrategy = pdftable.StrategyText\nsettings.MinWordsVertical = 3\n\ntables, _ := page.ExtractTables(settings)\nfor _, t := range tables {\n    for _, row := range t.Rows {\n        fmt.Println(row)\n    }\n}\n```\n\n### Side-by-side: pdfplumber → pdftable (explicit strategy)\n\n```python\n# Python (pdfplumber) — caller-supplied 
edges\nimport pdfplumber\n\nwith pdfplumber.open(\"statement.pdf\") as pdf:\n    page = pdf.pages[0]\n    table = page.find_tables({\n        \"vertical_strategy\": \"explicit\",\n        \"horizontal_strategy\": \"explicit\",\n        \"explicit_vertical_lines\":   [100, 200, 300, 400],\n        \"explicit_horizontal_lines\": [600, 650, 700, 720],\n    })[0]\n    for row in table.extract():\n        print(row)\n```\n\n```go\n// Go (pdftable)\ndoc, _ := pdftable.OpenFile(\"statement.pdf\")\ndefer doc.Close()\npage, _ := doc.Page(1)\n\nsettings := pdftable.DefaultTableSettings()\nsettings.VerticalStrategy = pdftable.StrategyExplicit\nsettings.HorizontalStrategy = pdftable.StrategyExplicit\nsettings.ExplicitVerticalLines   = []float64{100, 200, 300, 400}\nsettings.ExplicitHorizontalLines = []float64{600, 650, 700, 720}\n\ntables, _ := page.ExtractTables(settings)\nfor _, row := range tables[0].Rows {\n    fmt.Println(row)\n}\n```\n\n### Mixed strategies\n\nEach axis picks its strategy independently. Combinations like\n`vertical=text` + `horizontal=lines` (common for tables with drawn\nrow separators but borderless columns) work out of the box:\n\n```go\nsettings := pdftable.DefaultTableSettings()\nsettings.VerticalStrategy   = pdftable.StrategyText\nsettings.HorizontalStrategy = pdftable.StrategyLines\ntables, _ := page.ExtractTables(settings)\n```\n\nThe two outputs match cell-for-cell on the parity fixtures (see\n`testdata/golden/*.tables-text.expected.json` and\n`*.tables.expected.json` for the regression goldens). 
Field naming\ndiffers in the obvious places: pdftable returns a slice of `*Table`\ninstead of `Table` objects you have to call `.extract()` on; rows are\n`[]string` instead of `list[Optional[str]]` (missing cells produce\n`\"\"` rather than `nil`); and table bboxes use `(X0, Y0, X1, Y1)` PDF\nuser space rather than pdfplumber's image-space\n`(x0, top, x1, bottom)`.\n\n## CLI\n\n`pdftable` ships a command-line interface that mirrors pdfplumber's\nCLI surface for the operations the library implements:\n\n```sh\ngo install github.com/hallelx2/pdftable/cmd/pdftable@v0.3.0\n```\n\nUsage:\n\n```sh\n# Extract every table on every page as JSON.\npdftable extract invoice.pdf --tables --format json\n\n# Borderless tables: use the text strategy.\npdftable extract 10k.pdf --tables \\\n    --vertical-strategy text --horizontal-strategy text \\\n    --min-words-vertical 4\n\n# Extract text only (no table detection).\npdftable extract report.pdf --text --format text\n\n# Subset of pages, pretty-printed JSON.\npdftable extract report.pdf --tables --pages 1,3-5 --indent 2\n\n# Caller-supplied edges.\npdftable extract statement.pdf --tables \\\n    --vertical-strategy explicit --horizontal-strategy explicit \\\n    --explicit-vertical-lines 100,200,300,400 \\\n    --explicit-horizontal-lines 600,650,700,720\n```\n\nFlags:\n\n| Flag | Default | Description |\n| --- | --- | --- |\n| `--pages` | all | Pages: `1,3-5` syntax. |\n| `--tables` | off | Output detected tables. |\n| `--text` | off | Output extracted text. |\n| `--format` | `json` | `json` \\| `text`. |\n| `--vertical-strategy` | `lines` | `lines` \\| `lines_strict` \\| `text` \\| `explicit`. |\n| `--horizontal-strategy` | `lines` | same set. |\n| `--snap-tolerance` | 3 | snap_tolerance (PDF pts). |\n| `--join-tolerance` | 3 | join_tolerance (PDF pts). |\n| `--edge-min-length` | 3 | drop merged edges shorter than this. |\n| `--intersection-tolerance` | 3 | slack on edge crossings. 
|\n| `--text-tolerance` | 3 | per-cell text-extraction tolerance. |\n| `--min-words-vertical` | 3 | text strategy column threshold. |\n| `--min-words-horizontal` | 1 | text strategy row threshold. |\n| `--explicit-vertical-lines` | (none) | comma list of X coords. |\n| `--explicit-horizontal-lines` | (none) | comma list of Y coords. |\n| `--indent` | 0 | JSON indent (0 = compact). |\n\n## Side-by-side comparison with pdfplumber\n\n```python\n# Python (pdfplumber)\nimport pdfplumber\n\nwith pdfplumber.open(\"report.pdf\") as pdf:\n    page = pdf.pages[0]\n    for word in page.extract_words(x_tolerance=3, y_tolerance=3):\n        print(word[\"text\"], word[\"x0\"], word[\"top\"])\n    print(page.extract_text())\n```\n\n```go\n// Go (pdftable)\nimport \"github.com/hallelx2/pdftable\"\n\ndoc, _ := pdftable.OpenFile(\"report.pdf\")\ndefer doc.Close()\npage, _ := doc.Page(1)\n\nwords, _ := page.Words(pdftable.DefaultWordOpts())\nfor _, w := range words {\n    // pdftable's Y is PDF user-space (origin bottom-left). The\n    // pdfplumber-equivalent \"top\" is page.Height() - w.Y1.\n    fmt.Println(w.Text, w.X0, page.Height()-w.Y1)\n}\ntext, _ := page.ExtractText(pdftable.DefaultTextOpts())\nfmt.Println(text)\n```\n\nThree differences worth noting:\n\n1. **Page indexing is 1-based**, like pdfplumber's `page_number`.\n   pdfplumber's `pdf.pages[0]` is the first page only because Python\n   lists are 0-indexed; our `Page(1)` is that same first page.\n2. **Coordinates are in PDF user space with origin at bottom-left**.\n   pdfplumber by default reports `top` (origin top-left, Y growing down)\n   on its chars and words; we report `Y0` / `Y1` in PDF native\n   coordinates. The conversion is `top = page.Height() - Y1`.\n3. **Options are explicit Go structs, not `**kwargs`**. Build a\n   `WordOpts` / `TextOpts`, override the fields you care about, pass\n   it through. 
`DefaultWordOpts()` / `DefaultTextOpts()` return\n   pdfplumber-matching defaults.\n\n## Parity with pdfplumber\n\nThe word-grouping and text-extraction algorithms are direct ports of\npdfplumber's `WordExtractor` and `extract_text` (see\n[`pdfplumber/utils/text.py`](https://github.com/jsvine/pdfplumber/blob/main/pdfplumber/utils/text.py)).\nTests in [`golden_test.go`](golden_test.go) compare the Go output\nagainst pdfplumber's reference output on shared fixture PDFs.\n\nBehaviours that match exactly:\n\n- Word grouping: same line-cluster-then-merge-by-gap algorithm, same\n  defaults (XTolerance=3, YTolerance=3), same handling of blank-char\n  filtering, ligature expansion (ﬁ→fi, etc.), and split-at-punctuation.\n- Ordering: words returned in pdfplumber's order (top-to-bottom, then\n  left-to-right within each line) when UseTextFlow is false.\n- Direction handling: ltr / rtl / ttb / btt mapping from\n  upright + HorizontalLTR + VerticalTTB.\n\nBehaviours that intentionally differ:\n\n- **Position precision drifts when font metrics aren't bundled**.\n  pdfplumber uses pdfminer.six's AFM tables for the standard 14 fonts;\n  we use a default-width fallback for now. Word text and order match\n  exactly; word bboxes drift by up to ~10 PDF points on glyphs whose\n  width isn't in the PDF's /Widths array. 
Golden tests assert text\n  parity exactly and position parity within a 15-point envelope; the\n  envelope tightens to \u003c1pt once the AFM bundle lands (planned for\n  v0.4.x).\n- **`Layout=true` output is structurally similar but not byte-equal**.\n  Pdfplumber's layout algorithm has version-to-version drift; we\n  produce a column-aligned grid with the same density defaults but\n  don't promise byte-equal output across pdfplumber releases.\n\nBehaviours not yet ported:\n\n- `extract_text_lines` (regex-based line extraction).\n- `search` on TextMap (regex over assembled page text with char-level\n  match back-references).\n- Per-character extra_attrs hooks beyond `fontname` and `size`.\n\n## Architecture\n\n```\npdftable/\n├── pdftable.go        // Open / OpenBytes / OpenFile entry points\n├── pdf.go             // Document interface + implementation\n├── page.go            // Page interface + implementation\n├── char.go            // Public Char / Line / Rect / Curve / Objects\n├── text.go            // Word + ExtractText + ExtractTextSimple (v0.1.0)\n├── table.go           // TableStrategy / TableSettings / Table types (v0.2.0)\n├── finder.go          // Cells-from-edges algorithm (v0.2.0)\n├── finder_text.go     // Text + explicit edge derivation (v0.3.0)\n├── clustering.go      // 1-D clusterObjects, groupObjectsByAttr, dedupeChars\n├── geometry.go        // BBox helpers: Union, Intersect, Contains, Snap\n├── errors.go          // Sentinel errors\n├── cmd/\n│   └── pdftable/      // Command-line interface (v0.3.0)\n│       └── main.go\n└── internal/\n    ├── layout/\n    │   └── lines.go   // Edge type + snap/join/filter pipeline (v0.2.0)\n    └── pdf/\n        ├── reader.go      // pdfcpu bridge\n        ├── content.go     // Content-stream interpreter\n        ├── ops.go         // Operator dispatch table\n        ├── state.go       // Graphics + text state, matrix math\n        ├── font.go        // Font + encoding tables + glyph-name resolution\n        
└── cmap.go        // ToUnicode CMap parser\n```\n\nThe public `pdftable` package is small and stable. The `internal/pdf`\npackage owns the interpreter — its types are not exposed because they\nwill evolve as more PDF features are added (Type 3 fonts, vertical\nwriting, more exotic CMaps).\n\n## Why pdfcpu and not write a PDF parser from scratch?\n\nPDF object parsing — xref tables, indirect-object resolution, stream\ndecompression (FlateDecode, LZWDecode, ASCII85Decode), encryption — is\na large amount of mostly-uninteresting code. pdfcpu is mature, well-\ntested, and gives us a parsed `*model.Context` to work with. We layer\nthe content-stream interpreter (which pdfcpu doesn't have) on top.\n\nIf pdfcpu's dependency footprint becomes a problem (it pulls in image\ncodecs we don't strictly need), the blast radius of swapping it out is\nlimited to `internal/pdf/reader.go`. The rest of the package is\nstdlib-only.\n\n## Roadmap\n\n- `v0.0.x` — content-stream primitives.\n- `v0.1.x` — text extraction: `Page.ExtractText`, `Page.Words`,\n  `Page.ExtractTextSimple`.\n- `v0.2.x` — table finding via ruling lines: `Page.FindTables` /\n  `Page.ExtractTables` covering the `lines` and `lines_strict`\n  strategies.\n- `v0.3.x` — remaining table strategies and CLI (this release):\n  `text` (word-alignment edges), `explicit` (caller-supplied edges),\n  and a `pdftable` CLI mirroring pdfplumber's surface.\n- `v0.4.x` — bundle the standard-14 AFM metrics so word bboxes (and\n  therefore cell text) match pdfplumber to within 1 PDF point on\n  standard fonts.\n- `v0.5.x` — performance pass: parser benchmarking against\n  pdfminer.six and pdfplumber on a representative document corpus.\n\n## License\n\nMIT. See [LICENSE](LICENSE).\n\n## Acknowledgements\n\nThis library is a direct port of the algorithms in\n[pdfminer.six](https://github.com/pdfminer/pdfminer.six) and\n[pdfplumber](https://github.com/jsvine/pdfplumber). 
Their authors did\nthe hard work of figuring out how to robustly recover structure from\nthe PDF wire format; this is that work translated into Go.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhallelx2%2Fpdftable","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhallelx2%2Fpdftable","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhallelx2%2Fpdftable/lists"}