https://github.com/bbqsrc/pdf-strings
- Host: GitHub
- URL: https://github.com/bbqsrc/pdf-strings
- Owner: bbqsrc
- License: MIT
- Created: 2025-10-26T23:34:09.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-11-19T17:41:51.000Z (6 months ago)
- Last Synced: 2026-02-02T07:50:57.119Z (4 months ago)
- Language: Rust
- Size: 505 KB
- Stars: 1
- Watchers: 0
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## pdf-strings
[crates.io](https://crates.io/crates/pdf-strings)
[docs.rs](https://docs.rs/pdf-strings)
Extract text from PDFs with position data.
## Usage
```rust
// Simple extraction
let output = pdf_strings::from_path("file.pdf")?;
println!("{}", output); // Plain text

// With password
let output = pdf_strings::PdfExtractor::builder()
    .password("secret")
    .build()
    .from_path("encrypted.pdf")?;

// Preserve spatial layout
println!("{}", output.to_string_pretty());

// Access structured data with bounding boxes
for line in output.lines() {
    for span in line {
        println!("{} at {:?}", span.text, span.bbox);
    }
}
```
## Features
- Plain text extraction
- Spatial layout preservation
- Bounding box coordinates for every text span
- Font encoding resolution (ToUnicode, Type1, TrueType, CID, Type3)
- Password-protected PDF support
- Handles complex fonts, rotated text, and multi-column layouts
## API
Three output formats:
- `to_string()` - Plain text
- `to_string_pretty()` - Character grid rendering that preserves spatial layout
- `lines()` - Structured data with `TextSpan` objects containing text, bounding boxes, and font sizes
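
To make the idea of a character grid rendering concrete, here is a minimal sketch of how positioned spans could be placed onto a text grid. It does not use the crate and is not its implementation: the `Span` struct, the `render_grid` function, and the `cell_w` / `cell_h` parameters are all hypothetical names introduced for illustration, and the sketch assumes `y` is measured downward from the top of the page.

```rust
/// Hypothetical stand-in for a positioned text span (not the crate's `TextSpan`).
#[derive(Debug, Clone)]
struct Span {
    text: String,
    x: f32, // left edge, in page units
    y: f32, // distance from the top of the page, in page units (assumption)
}

/// Place each span on a character grid where one cell covers
/// `cell_w` x `cell_h` page units, then join the rows with newlines.
fn render_grid(spans: &[Span], cell_w: f32, cell_h: f32) -> String {
    let mut rows: Vec<Vec<char>> = Vec::new();

    for span in spans {
        // Map page coordinates to a grid row and starting column.
        let row = (span.y / cell_h).floor().max(0.0) as usize;
        let col = (span.x / cell_w).floor().max(0.0) as usize;

        // Grow the grid downward if this span sits below every row so far.
        if rows.len() <= row {
            rows.resize(row + 1, Vec::new());
        }

        let line = &mut rows[row];
        for (i, ch) in span.text.chars().enumerate() {
            let c = col + i;
            // Pad with spaces up to the target column; later spans overwrite.
            if line.len() <= c {
                line.resize(c + 1, ' ');
            }
            line[c] = ch;
        }
    }

    rows.iter()
        .map(|r| r.iter().collect::<String>())
        .collect::<Vec<_>>()
        .join("\n")
}
```

With a 6 x 12 unit cell, spans at `x = 0` and `x = 60` on the same baseline land in columns 0 and 10, so two-column content keeps its horizontal alignment in plain text. A real renderer would also need to handle overlapping spans, variable glyph widths, and rotated text, which this sketch ignores.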
## Acknowledgements
This is a fork of [pdf-extract](https://github.com/jrmuizel/pdf-extract). Thanks for laying the groundwork. PDFs are ... something else.