https://github.com/joeychilson/pdftotext

A Go library for converting PDF files to text using the pdftotext utility.
https://github.com/joeychilson/pdftotext

go pdf pdftotext

Last synced: 3 months ago
JSON representation

A Go library for converting PDF files to text using the pdftotext utility.

Host: GitHub
URL: https://github.com/joeychilson/pdftotext
Owner: joeychilson
License: mit
Created: 2024-10-31T05:35:09.000Z (9 months ago)
Default Branch: main
Last Pushed: 2024-11-01T05:43:37.000Z (9 months ago)
Last Synced: 2025-02-13T12:18:29.198Z (5 months ago)
Topics: go, pdf, pdftotext
Language: Go
Homepage:
Size: 15.6 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        # pdftotext

A Go library for converting PDF files to text using the `pdftotext` utility.

## Prerequisites

- `pdftotext` utility installed on your system (usually part of the `poppler-utils` package)

### Installing pdftotext

**Ubuntu/Debian:**

```bash

sudo apt-get install poppler-utils

```

**macOS:**

```bash

brew install poppler

```

## Installation

```bash

go get github.com/joeychilson/pdftotext

```

## Quick Start

```go

package main

import (

    "context"

    "fmt"

    "log"

    "github.com/joeychilson/pdftotext"

)

func main() {

   	ctx := context.Background()

    converter, err := pdftotext.New()

    if err != nil {

        log.Fatal(err)

    }

    text, err := converter.Convert(ctx, "input.pdf", &pdftotext.Options{

        Layout:   true,

        Encoding: "UTF-8",

    })

    if err != nil {

        log.Fatal(err)

    }

    fmt.Println(text)

}

```

## Converting to File

```go

err = converter.ConvertToFile(ctx, "input.pdf", "output.txt", &pdftotext.Options{

	Layout:   true,

	Encoding: "UTF-8",

})

if err != nil {

    log.Fatal(err)

}

```

## Available Options

```go

type Options struct {

	// FirstPage is the first page to convert

	FirstPage int

	// LastPage is the last page to convert

	LastPage int

	// Resolution is the resolution in DPI (default 72)

	Resolution int

	// CropX is the X-coordinate of crop area

	CropX int

	// CropY is the Y-coordinate of crop area

	CropY int

	// CropWidth is the width of crop area

	CropWidth int

	// CropHeight is the height of crop area

	CropHeight int

	// Layout maintains the original layout

	Layout bool

	// FixedPitch keeps the text in a fixed-pitch font

	FixedPitch float64

	// Raw keeps text in content stream order

	Raw bool

	// NoDiagonal discards diagonal text

	NoDiagonal bool

	// HTMLMeta generates HTML with meta information

	HTMLMeta bool

	// BBox generates XHTML with word bounding boxes

	BBox bool

	// BBoxLayout generates XHTML with block/line/word bounding boxes

	BBoxLayout bool

	// TSV generates TSV with bounding box information

	TSV bool

	// CropBox uses crop box instead of media box

	CropBox bool

	// ColSpacing is the column spacing (default 0.7)

	ColSpacing float64

	// Encoding is the text output encoding (default UTF-8)

	Encoding string

	// EOL is the end-of-line convention (default Unix)

	EOL EOLType

	// NoPageBreaks don't insert page breaks

	NoPageBreaks bool

	// OwnerPassword is the PDF owner password

	OwnerPassword string

	// UserPassword is the PDF user password

	UserPassword string

	// Quiet suppresses messages and errors

	Quiet bool

}

```

## Error Handling

The library provides specific error types for common failure cases:

```go

var (

    ErrPDFOpen        = errors.New("error opening PDF file")

    ErrOutputFile     = errors.New("error opening output file")

    ErrPermissions    = errors.New("error related to PDF permissions")

    ErrInvalidPage    = errors.New("invalid page number")

    ErrInvalidRange   = errors.New("invalid page range")

    ErrCommandFailed  = errors.New("pdftotext command failed")

    ErrBinaryNotFound = errors.New("pdftotext binary not found")

)

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/joeychilson/pdftotext

Awesome Lists containing this project

README