Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/joeychilson/pdftotext
A Go library for converting PDF files to text using the pdftotext utility.
- Host: GitHub
- URL: https://github.com/joeychilson/pdftotext
- Owner: joeychilson
- License: mit
- Created: 2024-10-31T05:35:09.000Z (17 days ago)
- Default Branch: main
- Last Pushed: 2024-11-01T05:43:37.000Z (16 days ago)
- Last Synced: 2024-11-02T15:08:38.835Z (14 days ago)
- Topics: go, pdf, pdftotext
- Language: Go
- Homepage:
- Size: 15.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# pdftotext
A Go library for converting PDF files to text using the `pdftotext` utility.
## Prerequisites
- `pdftotext` utility installed on your system (usually part of the `poppler-utils` package)
### Installing pdftotext
**Ubuntu/Debian:**
```bash
sudo apt-get install poppler-utils
```
**macOS:**
```bash
brew install poppler
```
## Installation
```bash
go get github.com/joeychilson/pdftotext
```
## Quick Start
```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/joeychilson/pdftotext"
)

func main() {
	ctx := context.Background()

	converter, err := pdftotext.New()
	if err != nil {
		log.Fatal(err)
	}

	text, err := converter.Convert(ctx, "input.pdf", &pdftotext.Options{
		Layout:   true,
		Encoding: "UTF-8",
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(text)
}
```
## Converting to File
```go
// Reuses ctx and converter from the Quick Start example.
err = converter.ConvertToFile(ctx, "input.pdf", "output.txt", &pdftotext.Options{
	Layout:   true,
	Encoding: "UTF-8",
})
if err != nil {
	log.Fatal(err)
}
```
## Available Options
```go
type Options struct {
	// FirstPage is the first page to convert
	FirstPage int
	// LastPage is the last page to convert
	LastPage int
	// Resolution is the resolution in DPI (default 72)
	Resolution int
	// CropX is the X-coordinate of crop area
	CropX int
	// CropY is the Y-coordinate of crop area
	CropY int
	// CropWidth is the width of crop area
	CropWidth int
	// CropHeight is the height of crop area
	CropHeight int
	// Layout maintains the original layout
	Layout bool
	// FixedPitch keeps the text in a fixed-pitch font
	FixedPitch float64
	// Raw keeps text in content stream order
	Raw bool
	// NoDiagonal discards diagonal text
	NoDiagonal bool
	// HTMLMeta generates HTML with meta information
	HTMLMeta bool
	// BBox generates XHTML with word bounding boxes
	BBox bool
	// BBoxLayout generates XHTML with block/line/word bounding boxes
	BBoxLayout bool
	// TSV generates TSV with bounding box information
	TSV bool
	// CropBox uses crop box instead of media box
	CropBox bool
	// ColSpacing is the column spacing (default 0.7)
	ColSpacing float64
	// Encoding is the text output encoding (default UTF-8)
	Encoding string
	// EOL is the end-of-line convention (default Unix)
	EOL EOLType
	// NoPageBreaks don't insert page breaks
	NoPageBreaks bool
	// OwnerPassword is the PDF owner password
	OwnerPassword string
	// UserPassword is the PDF user password
	UserPassword string
	// Quiet suppresses messages and errors
	Quiet bool
}
```
## Error Handling
The library exports sentinel error values for common failure cases:
```go
var (
	ErrPDFOpen        = errors.New("error opening PDF file")
	ErrOutputFile     = errors.New("error opening output file")
	ErrPermissions    = errors.New("error related to PDF permissions")
	ErrInvalidPage    = errors.New("invalid page number")
	ErrInvalidRange   = errors.New("invalid page range")
	ErrCommandFailed  = errors.New("pdftotext command failed")
	ErrBinaryNotFound = errors.New("pdftotext binary not found")
)
```