https://github.com/hallelx2/pdftable
Go-native port of pdfplumber — PDF text + table extraction primitives
https://github.com/hallelx2/pdftable
Last synced: 19 days ago
JSON representation
Go-native port of pdfplumber — PDF text + table extraction primitives
- Host: GitHub
- URL: https://github.com/hallelx2/pdftable
- Owner: hallelx2
- License: mit
- Created: 2026-05-26T22:01:38.000Z (23 days ago)
- Default Branch: main
- Last Pushed: 2026-05-29T16:03:16.000Z (20 days ago)
- Last Synced: 2026-05-29T18:05:02.054Z (20 days ago)
- Language: Go
- Size: 253 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
# pdftable
A Go-native port of Python's [pdfplumber](https://github.com/jsvine/pdfplumber).
`pdftable` reads PDF documents, walks the content streams, and surfaces
the positioned primitives — characters, lines, rectangles, curves — that
higher-level layout algorithms (text extraction, word grouping, table
detection) operate on. It is built on top of
[pdfcpu](https://github.com/pdfcpu/pdfcpu) for low-level object parsing,
xref handling, and FlateDecode decompression; everything above that
(operator dispatch, text state, glyph positioning, ToUnicode CMaps,
font encodings) is implemented here.
The library targets the gap in the Go PDF ecosystem: existing libraries
either render PDFs to images, manipulate metadata, or extract bag-of-
words text. None of them give you what pdfplumber gives Python users —
a structured per-page object model you can run table-detection
heuristics on. This is that.
## Status
`v0.3.0` — full pdfplumber parity for table-finding strategies. All four
canonical strategies are implemented: `lines`, `lines_strict`, `text`,
and `explicit`. Mix and match per-axis (e.g. `vertical="text"` +
`horizontal="lines"`) works as expected. Also ships the `pdftable`
CLI for extracting text and tables without writing Go.
[](https://pkg.go.dev/github.com/hallelx2/pdftable)
[](https://github.com/hallelx2/pdftable/actions/workflows/test.yml)
[](LICENSE)
## Install
```sh
go get github.com/hallelx2/pdftable@v0.3.0
```
Requires Go 1.25+ (uses the standard-library `iter` package for the `Pages()` range-over-func iterator, and pdfcpu v0.12+).
## Quickstart
```go
package main
import (
"fmt"
"log"
"github.com/hallelx2/pdftable"
)
func main() {
doc, err := pdftable.OpenFile("report.pdf")
if err != nil {
log.Fatal(err)
}
defer doc.Close()
for n, page := range doc.Pages() {
// Primitives (v0.0.1).
chars, _ := page.Chars()
rects, _ := page.Rects()
lines, _ := page.Lines()
fmt.Printf("page %d: %d chars, %d rects, %d lines\n",
n, len(chars), len(rects), len(lines))
// Words and text extraction (v0.1.0).
words, _ := page.Words(pdftable.DefaultWordOpts())
text, _ := page.ExtractText(pdftable.DefaultTextOpts())
fmt.Printf(" %d words; first line: %q\n",
len(words), firstLine(text))
}
}
func firstLine(s string) string {
for i, r := range s {
if r == '\n' {
return s[:i]
}
}
return s
}
```
## API surface
```go
// Constructors.
func Open(r io.Reader) (Document, error)
func OpenBytes(b []byte) (Document, error)
func OpenFile(path string) (Document, error)
// Document.
type Document interface {
NumPages() int
Page(n int) (Page, error) // 1-indexed
Pages() iter.Seq2[int, Page] // Go 1.23+ range-over-func
Close() error
}
// Page.
type Page interface {
Number() int
Width() float64
Height() float64
Chars() ([]Char, error)
Lines() ([]Line, error)
Rects() ([]Rect, error)
Curves() ([]Curve, error)
Objects() (Objects, error)
// New in v0.1.0: word + text extraction.
Words(opts WordOpts) ([]Word, error)
ExtractText(opts TextOpts) (string, error)
ExtractTextSimple(xTolerance, yTolerance float64) (string, error)
// Table finding: lines + lines_strict (v0.2.0); text + explicit (v0.3.0).
FindTables(settings TableSettings) ([]TableFinder, error)
ExtractTables(settings TableSettings) ([]*Table, error)
}
// Primitives.
type Char struct {
Text string
X0, Y0, X1, Y1 float64
FontName string
FontSize float64
Upright bool
Advance float64
}
type Line struct { X0, Y0, X1, Y1 float64; Stroke bool; Width float64 }
type Rect struct { X0, Y0, X1, Y1 float64; Stroke, Fill bool; Width float64 }
type Curve struct { Points [][2]float64; Stroke, Fill bool; Width float64 }
type Objects struct { Chars []Char; Lines []Line; Rects []Rect; Curves []Curve }
// Word (new in v0.1.0).
type Word struct {
Text string
X0, Y0, X1, Y1 float64
Upright bool
Direction string // "ltr" | "rtl" | "ttb" | "btt"
FontName string
FontSize float64
Chars []Char // populated when WordOpts.KeepChars=true
}
// WordOpts: configure Page.Words. Use DefaultWordOpts() for pdfplumber-matching defaults.
type WordOpts struct {
XTolerance float64 // default 3
YTolerance float64 // default 3
KeepBlankChars bool
UseTextFlow bool
HorizontalLTR bool // default true
VerticalTTB bool // default true
ExtraAttrs []string
SplitAtPunctuation bool
Expand bool // ligature expansion; default true
KeepChars bool
}
// TextOpts: configure Page.ExtractText. Use DefaultTextOpts() for defaults.
type TextOpts struct {
XTolerance, YTolerance float64
Layout bool
LayoutWidthChars int
LayoutHeightChars int
XDensity, YDensity float64 // PDF points per character / per line
UseTextFlow bool
HorizontalLTR bool
VerticalTTB bool
ExtraAttrs []string
Expand bool
}
// Sentinel errors.
var (
ErrInvalidPDF = errors.New("pdftable: invalid PDF")
ErrPageOutOfRange = errors.New("pdftable: page out of range")
ErrUnsupported = errors.New("pdftable: unsupported feature")
ErrEncrypted = errors.New("pdftable: encrypted PDF (decryption not yet supported)")
)
```
## Text extraction
```go
doc, _ := pdftable.OpenFile("report.pdf")
defer doc.Close()
page, _ := doc.Page(1)
// Words: each Word is a contiguous text run.
words, _ := page.Words(pdftable.DefaultWordOpts())
for _, w := range words {
fmt.Printf("%-20s @ (%.1f, %.1f) %s %.1fpt\n",
w.Text, w.X0, w.Y0, w.FontName, w.FontSize)
}
// ExtractText: all text on the page as one string. Dense (no layout)
// joins words with spaces and lines with "\n".
text, _ := page.ExtractText(pdftable.DefaultTextOpts())
fmt.Println(text)
// Layout-preserving extraction emulates `pdftotext -layout` / pdfplumber's
// extract_text(layout=True) — column-aligned output suitable for forms.
opts := pdftable.DefaultTextOpts()
opts.Layout = true
laid, _ := page.ExtractText(opts)
fmt.Println(laid)
```
## Tables
`Page.ExtractTables` is the table-detection entry point. It runs the
edges → intersections → cells → tables pipeline (a direct port of
pdfplumber's `TableFinder`) and returns one `*Table` per detected
table, with cell text already extracted.
```go
doc, _ := pdftable.OpenFile("invoice.pdf")
defer doc.Close()
page, _ := doc.Page(1)
settings := pdftable.DefaultTableSettings()
// settings.VerticalStrategy = pdftable.StrategyLinesStrict // ignore rect outlines
tables, _ := page.ExtractTables(settings)
for ti, t := range tables {
fmt.Printf("table %d: %d rows × %d cols at %+v\n",
ti, len(t.Rows), len(t.Rows[0]), t.BBox)
for _, row := range t.Rows {
fmt.Println(row)
}
}
```
`TableSettings` defaults match pdfplumber's
(`snap_tolerance=3`, `join_tolerance=3`, `edge_min_length=3`,
`intersection_tolerance=3`, `text_tolerance=3`, `min_words_vertical=3`,
`min_words_horizontal=1`). Override any field on the value returned
from `DefaultTableSettings()` to tighten or loosen the heuristics.
The four implemented strategies (one per axis, chosen independently):
- `StrategyLines` — edges come from drawn `Line` segments, `Rect`
outlines (all four sides), and axis-aligned `Curve` segments.
Default. Best for typical PDFs whose tables have rule lines.
- `StrategyLinesStrict` — only drawn `Line` segments are used. Use
this when your PDF draws cell BACKGROUNDS as filled rectangles
that you do NOT want treated as row boundaries.
- `StrategyText` — edges inferred from word alignment. Vertical
edges come from clusters of words sharing X0 / X1 / centre;
horizontal edges from clusters sharing top-Y. Tunable via
`MinWordsVertical` (default 3) and `MinWordsHorizontal` (default 1).
- `StrategyExplicit` — caller-supplied edges via
`ExplicitVerticalLines` / `ExplicitHorizontalLines`. Required when
table boundaries are known from layout analysis or manual
annotation.
### Side-by-side: pdfplumber → pdftable (lines strategy)
```python
# Python (pdfplumber)
import pdfplumber
with pdfplumber.open("invoice.pdf") as pdf:
page = pdf.pages[0]
for table in page.find_tables({"vertical_strategy": "lines",
"horizontal_strategy": "lines"}):
for row in table.extract():
print(row)
```
```go
// Go (pdftable)
import "github.com/hallelx2/pdftable"
doc, _ := pdftable.OpenFile("invoice.pdf")
defer doc.Close()
page, _ := doc.Page(1)
settings := pdftable.DefaultTableSettings()
settings.VerticalStrategy = pdftable.StrategyLines
settings.HorizontalStrategy = pdftable.StrategyLines
tables, _ := page.ExtractTables(settings)
for _, t := range tables {
for _, row := range t.Rows {
fmt.Println(row)
}
}
```
### Side-by-side: pdfplumber → pdftable (text strategy)
```python
# Python (pdfplumber) — borderless tables
import pdfplumber
with pdfplumber.open("10k-filing.pdf") as pdf:
page = pdf.pages[3]
for table in page.find_tables({"vertical_strategy": "text",
"horizontal_strategy": "text",
"min_words_vertical": 3}):
for row in table.extract():
print(row)
```
```go
// Go (pdftable)
doc, _ := pdftable.OpenFile("10k-filing.pdf")
defer doc.Close()
page, _ := doc.Page(4)
settings := pdftable.DefaultTableSettings()
settings.VerticalStrategy = pdftable.StrategyText
settings.HorizontalStrategy = pdftable.StrategyText
settings.MinWordsVertical = 3
tables, _ := page.ExtractTables(settings)
for _, t := range tables {
for _, row := range t.Rows {
fmt.Println(row)
}
}
```
### Side-by-side: pdfplumber → pdftable (explicit strategy)
```python
# Python (pdfplumber) — caller-supplied edges
import pdfplumber
with pdfplumber.open("statement.pdf") as pdf:
page = pdf.pages[0]
table = page.find_tables({
"vertical_strategy": "explicit",
"horizontal_strategy": "explicit",
"explicit_vertical_lines": [100, 200, 300, 400],
"explicit_horizontal_lines": [600, 650, 700, 720],
})[0]
for row in table.extract():
print(row)
```
```go
// Go (pdftable)
doc, _ := pdftable.OpenFile("statement.pdf")
defer doc.Close()
page, _ := doc.Page(1)
settings := pdftable.DefaultTableSettings()
settings.VerticalStrategy = pdftable.StrategyExplicit
settings.HorizontalStrategy = pdftable.StrategyExplicit
settings.ExplicitVerticalLines = []float64{100, 200, 300, 400}
settings.ExplicitHorizontalLines = []float64{600, 650, 700, 720}
tables, _ := page.ExtractTables(settings)
for _, row := range tables[0].Rows {
fmt.Println(row)
}
```
### Mixed strategies
Each axis picks its strategy independently. Combinations like
`vertical=text` + `horizontal=lines` (common for tables with drawn
row separators but borderless columns) work out of the box:
```go
settings := pdftable.DefaultTableSettings()
settings.VerticalStrategy = pdftable.StrategyText
settings.HorizontalStrategy = pdftable.StrategyLines
tables, _ := page.ExtractTables(settings)
```
The two outputs match cell-for-cell on the parity fixtures (see
`testdata/golden/*.tables-text.expected.json` and
`*.tables.expected.json` for the regression goldens). Field naming
differs in the obvious places: pdftable returns a slice of `*Table`
instead of `Table` objects you have to call `.extract()` on; rows are
`[]string` instead of `list[Optional[str]]` (missing cells produce
`""` rather than `nil`); and table bboxes use `(X0, Y0, X1, Y1)` PDF
user space rather than pdfplumber's image-space
`(x0, top, x1, bottom)`.
## CLI
`pdftable` ships a command-line interface that mirrors pdfplumber's
CLI surface for the operations the library implements:
```sh
go install github.com/hallelx2/pdftable/cmd/pdftable@v0.3.0
```
Usage:
```sh
# Extract every table on every page as JSON.
pdftable extract invoice.pdf --tables --format json
# Borderless tables: use the text strategy.
pdftable extract 10k.pdf --tables \
--vertical-strategy text --horizontal-strategy text \
--min-words-vertical 4
# Extract text only (no table detection).
pdftable extract report.pdf --text --format text
# Subset of pages, pretty-printed JSON.
pdftable extract report.pdf --tables --pages 1,3-5 --indent 2
# Caller-supplied edges.
pdftable extract statement.pdf --tables \
--vertical-strategy explicit --horizontal-strategy explicit \
--explicit-vertical-lines 100,200,300,400 \
--explicit-horizontal-lines 600,650,700,720
```
Flags:
| Flag | Default | Description |
| --- | --- | --- |
| `--pages` | all | Pages: `1,3-5` syntax. |
| `--tables` | off | Output detected tables. |
| `--text` | off | Output extracted text. |
| `--format` | `json` | `json` \| `text`. |
| `--vertical-strategy` | `lines` | `lines` \| `lines_strict` \| `text` \| `explicit`. |
| `--horizontal-strategy` | `lines` | same set. |
| `--snap-tolerance` | 3 | snap_tolerance (PDF pts). |
| `--join-tolerance` | 3 | join_tolerance (PDF pts). |
| `--edge-min-length` | 3 | drop merged edges shorter than this. |
| `--intersection-tolerance` | 3 | slack on edge crossings. |
| `--text-tolerance` | 3 | per-cell text-extraction tolerance. |
| `--min-words-vertical` | 3 | text strategy column threshold. |
| `--min-words-horizontal` | 1 | text strategy row threshold. |
| `--explicit-vertical-lines` | (none) | comma list of X coords. |
| `--explicit-horizontal-lines` | (none) | comma list of Y coords. |
| `--indent` | 0 | JSON indent (0 = compact). |
## Side-by-side comparison with pdfplumber
```python
# Python (pdfplumber)
import pdfplumber
with pdfplumber.open("report.pdf") as pdf:
page = pdf.pages[0]
for word in page.extract_words(x_tolerance=3, y_tolerance=3):
print(word["text"], word["x0"], word["top"])
print(page.extract_text())
```
```go
// Go (pdftable)
import "github.com/hallelx2/pdftable"
doc, _ := pdftable.OpenFile("report.pdf")
defer doc.Close()
page, _ := doc.Page(1)
words, _ := page.Words(pdftable.DefaultWordOpts())
for _, w := range words {
// pdftable's Y is PDF user-space (origin bottom-left). The
// pdfplumber-equivalent "top" is page.Height() - w.Y1.
fmt.Println(w.Text, w.X0, page.Height()-w.Y1)
}
fmt.Println(must(page.ExtractText(pdftable.DefaultTextOpts())))
```
Three differences worth noting:
1. **Page indexing is 1-based**, matching the PDF spec and pdfplumber's
`pdf.pages[0]` is actually the first page (Python is 0-indexed,
pdfplumber compensates). Our `Page(1)` is the same first page.
2. **Coordinates are in PDF user space with origin at bottom-left**.
pdfplumber by default reports `top` (origin top-left, Y growing down)
on its chars and words; we report `Y0` / `Y1` in PDF native
coordinates. The conversion is `top = page.Height() - Y1`.
3. **Options are explicit Go structs, not `**kwargs`**. Build a
`WordOpts` / `TextOpts`, override the fields you care about, pass
it through. `DefaultWordOpts()` / `DefaultTextOpts()` return
pdfplumber-matching defaults.
## Parity with pdfplumber
The word-grouping and text-extraction algorithms are direct ports of
pdfplumber's `WordExtractor` and `extract_text` (see
[`pdfplumber/utils/text.py`](https://github.com/jsvine/pdfplumber/blob/main/pdfplumber/utils/text.py)).
Tests in [`golden_test.go`](golden_test.go) compare the Go output
against pdfplumber's reference output on shared fixture PDFs.
Behaviours that match exactly:
- Word grouping: same line-cluster-then-merge-by-gap algorithm, same
defaults (XTolerance=3, YTolerance=3), same handling of blank-char
filtering, ligature expansion (fi→fi, etc.), and split-at-punctuation.
- Ordering: words returned in pdfplumber's order (top-to-bottom, then
left-to-right within each line) when UseTextFlow is false.
- Direction handling: ltr / rtl / ttb / btt mapping from
upright + HorizontalLTR + VerticalTTB.
Behaviours that intentionally differ:
- **Position precision drifts when font metrics aren't bundled**.
pdfplumber uses pdfminer.six's AFM tables for the standard 14 fonts;
we use a default-width fallback for now. Word text and order match
exactly; word bboxes drift by up to ~10 PDF points on glyphs whose
width isn't in the PDF's /Widths array. Golden tests assert text
parity exactly and position parity within a 15-point envelope; the
envelope tightens to <1pt once the AFM bundle lands (planned for
v0.2.x).
- **`Layout=true` output is structurally similar but not byte-equal**.
Pdfplumber's layout algorithm has version-to-version drift; we
produce a column-aligned grid with the same density defaults but
don't promise byte-equal output across pdfplumber releases.
Behaviours not yet ported:
- `extract_text_lines` (regex-based line extraction).
- `search` on TextMap (regex over assembled page text with char-level
match back-references).
- Per-character extra_attrs hooks beyond `fontname` and `size`.
## Architecture
```
pdftable/
├── pdftable.go // Open / OpenBytes / OpenFile entry points
├── pdf.go // Document interface + implementation
├── page.go // Page interface + implementation
├── char.go // Public Char / Line / Rect / Curve / Objects
├── text.go // Word + ExtractText + ExtractTextSimple (v0.1.0)
├── table.go // TableStrategy / TableSettings / Table types (v0.2.0)
├── finder.go // Cells-from-edges algorithm (v0.2.0)
├── finder_text.go // Text + explicit edge derivation (v0.3.0)
├── clustering.go // 1-D clusterObjects, groupObjectsByAttr, dedupeChars
├── geometry.go // BBox helpers: Union, Intersect, Contains, Snap
├── errors.go // Sentinel errors
├── cmd/
│ └── pdftable/ // Command-line interface (v0.3.0)
│ └── main.go
└── internal/
├── layout/
│ └── lines.go // Edge type + snap/join/filter pipeline (v0.2.0)
└── pdf/
├── reader.go // pdfcpu bridge
├── content.go // Content-stream interpreter
├── ops.go // Operator dispatch table
├── state.go // Graphics + text state, matrix math
├── font.go // Font + encoding tables + glyph-name resolution
└── cmap.go // ToUnicode CMap parser
```
The public `pdftable` package is small and stable. The `internal/pdf`
package owns the interpreter — its types are not exposed because they
will evolve as more PDF features are added (Type 3 fonts, vertical
writing, more exotic CMaps).
## Why pdfcpu and not write a PDF parser from scratch?
PDF object parsing — xref tables, indirect-object resolution, stream
decompression (FlateDecode, LZWDecode, ASCII85Decode), encryption — is
a large amount of mostly-uninteresting code. pdfcpu is mature, well-
tested, and gives us a parsed `*model.Context` to work with. We layer
the content-stream interpreter (which pdfcpu doesn't have) on top.
If pdfcpu's dependency footprint becomes a problem (it pulls in image
codecs we don't strictly need), the blast radius of swapping it out is
limited to `internal/pdf/reader.go`. The rest of the package is
stdlib-only.
## Roadmap
- `v0.0.x` — content-stream primitives.
- `v0.1.x` — text extraction: `Page.ExtractText`, `Page.Words`,
`Page.ExtractTextSimple`.
- `v0.2.x` — table finding via ruling lines: `Page.FindTables` /
`Page.ExtractTables` covering the `lines` and `lines_strict`
strategies.
- `v0.3.x` — remaining table strategies and CLI (this release):
`text` (word-alignment edges), `explicit` (caller-supplied edges),
and a `pdftable` CLI mirroring pdfplumber's surface.
- `v0.4.x` — bundle the standard-14 AFM metrics so word bboxes (and
therefore cell text) match pdfplumber to within 1 PDF point on
standard fonts.
- `v0.5.x` — performance pass: parser benchmarking against
pdfminer.six and pdfplumber on a representative document corpus.
## License
MIT. See [LICENSE](LICENSE).
## Acknowledgements
This library is a direct port of the algorithms in
[pdfminer.six](https://github.com/pdfminer/pdfminer.six) and
[pdfplumber](https://github.com/jsvine/pdfplumber). Their authors did
the hard work of figuring out how to robustly recover structure from
the PDF wire format; this is that work translated into Go.