https://github.com/nisaacson/pdf-text-extract

Extract text from pdfs that contain searchable pdf text

Host: GitHub
URL: https://github.com/nisaacson/pdf-text-extract
Owner: nisaacson
License: bsd-3-clause
Created: 2013-03-20T01:50:52.000Z (over 12 years ago)
Default Branch: master
Last Pushed: 2019-01-01T13:31:12.000Z (over 6 years ago)
Last Synced: 2025-03-30T03:05:56.688Z (4 months ago)
Language: JavaScript
Size: 4.39 MB
Stars: 116
Watchers: 5
Forks: 31
Open Issues: 12
Metadata Files:
- Readme: README.md
- License: LICENSE

# PDF Text Extract

Extract text from PDFs that contain searchable text. The module is a wrapper that calls the `pdftotext` command to perform the actual extraction.

[![Build Status](https://travis-ci.org/nisaacson/pdf-text-extract.png?branch=master)](https://travis-ci.org/nisaacson/pdf-text-extract) [![Dependency Status](https://david-dm.org/nisaacson/pdf-text-extract.png)](https://david-dm.org/nisaacson/pdf-text-extract)

# Installation

```bash
npm install --save pdf-text-extract
```

You will need the `pdftotext` binary available on your `PATH`. Packages are available for many operating systems.

See [https://github.com/nisaacson/pdf-extract#osx](https://github.com/nisaacson/pdf-extract#osx) for instructions on installing the `pdftotext` command.
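Before calling the module, it can help to confirm that the binary is reachable. The following is a minimal sketch, not part of this module: `isCommandAvailable` is a hypothetical helper, and it assumes that launching the command with `-v` is harmless. It only checks whether the process could be started, not what the command printed or returned.

```javascript
var spawnSync = require('child_process').spawnSync

// Hypothetical helper: returns true when `command` could be launched and
// false when the operating system could not start it (for example ENOENT).
function isCommandAvailable (command) {
  // Assumption: `-v` is a harmless flag; its output and exit code are ignored
  var result = spawnSync(command, ['-v'], { stdio: 'ignore' })
  return !result.error
}

if (!isCommandAvailable('pdftotext')) {
  console.error('pdftotext was not found on your PATH')
}
```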

# Usage

## As a module

`extract(filePath, [options], [pdftotextcommand], callback)`

The `options` and `pdftotextcommand` arguments are optional.

```javascript
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
extract(filePath, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  console.dir(pages)
})
```

The output will be an array in which each entry is the text of one page. If you want a single string containing all pages, set the option `splitPages: false`.

```javascript
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
extract(filePath, { splitPages: false }, function (err, text) {
  if (err) {
    console.dir(err)
    return
  }
  console.dir(text)
})
```
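Keeping the default page array makes per-page processing straightforward. As an illustration only, the sketch below searches the array for a term and reports the matching page numbers. `findPagesContaining` is a hypothetical helper, not part of this module, and `samplePages` stands in for real extraction results.

```javascript
// Hypothetical helper: returns the 1-based numbers of the pages whose text
// contains `term`, ignoring case. `pages` is an array of page strings.
function findPagesContaining (pages, term) {
  var needle = term.toLowerCase()
  var matches = []
  pages.forEach(function (text, index) {
    if (text.toLowerCase().indexOf(needle) !== -1) {
      matches.push(index + 1)
    }
  })
  return matches
}

// Sample data standing in for the `pages` array passed to the callback
var samplePages = ['Invoice 2019', 'Terms and conditions', 'Invoice total']
console.dir(findPagesContaining(samplePages, 'invoice')) // prints [ 1, 3 ]
```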

You can set the following options:

- `firstPage`: First page to extract

- `lastPage`: Last page to extract

- `resolution`: Resolution in DPI, as specified by `pdftotext -r`

- `crop`: Should be an object of the form `{ x: x, y: y, w: w, h: h }`

- `layout`: Should be either `layout`, `raw` or `htmlmeta`. Default: `layout`

- `encoding`: Should be either `UCS-2`, `ASCII7`, `Latin1`, `UTF-8`, `ZapfDingbats` or `Symbol`. Default: `UTF-8`

- `eol`: End of line convention. One of either: `unix`, `dos` or `mac`

- `ownerPassword`: Owner password (for encrypted files)

- `userPassword`: User password (for encrypted files)

- `splitPages`: If true, the result will be an array of pages. Default: true.
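
To show how these options combine, here is a small sketch that requests a page range. `extractPageRange` is a hypothetical helper, not part of this module. It takes the `extract` function as its first argument, which is a choice made for this example so that the helper can be tried without the module installed. It uses only the `firstPage`, `lastPage` and `splitPages` options listed above.

```javascript
// Hypothetical helper: asks `extract` for the pages from `firstPage` to
// `lastPage` and hands the result to a Node-style callback.
// `extract` is expected to be the function exported by pdf-text-extract.
function extractPageRange (extract, filePath, firstPage, lastPage, callback) {
  var options = {
    firstPage: firstPage,
    lastPage: lastPage,
    splitPages: true // keep one array entry per page
  }
  extract(filePath, options, callback)
}
```

With the module installed, a call such as `extractPageRange(require('pdf-text-extract'), filePath, 2, 3, callback)` would request only pages 2 and 3.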

If needed, you can pass additional options to the extract function. These will be passed on to the `child_process.spawn` call, so spawn options such as `cwd` can be set here.

```javascript
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
var options = {
  cwd: "./"
}
extract(filePath, options, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  // console.log prints both the label and the pages array
  console.log('extracted pages', pages)
})
```

You can also override the command for `pdftotext` if it is installed in a location that is not on the `PATH` environment variable.

```javascript
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var pdfToTextCommand = '/opt/bin/pdftotext'
var extract = require('pdf-text-extract')
var options = {
  cwd: "./"
}
extract(filePath, options, pdfToTextCommand, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  // console.log prints both the label and the pages array
  console.log('extracted pages', pages)
})
```

ES6 promises are supported. You can now call `.then(onFulfilled[, onRejected])`:

```javascript
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var Extract = require('../index.js')
var extract = new Extract(filePath)
extract.then(function (pages) {
  // console.log prints both the label and the pages array
  console.log('extracted pages', pages)
}).catch(function (err) {
  console.error('error:', err)
})
```
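An alternative route to a promise is to wrap the callback form with Node's `util.promisify`. This is a sketch under assumptions rather than the module's own promise API: `extractAsString` is a hypothetical helper, and it relies only on the `extract(filePath, options, callback)` signature and the `splitPages` option documented above. The `extract` function is passed in as an argument so the helper can be tried without the module installed.

```javascript
var util = require('util')

// Hypothetical helper: returns a promise for the full document text.
// `extract` is expected to be the function exported by pdf-text-extract.
function extractAsString (extract, filePath) {
  // util.promisify appends the Node-style callback as the last argument
  var extractAsync = util.promisify(extract)
  return extractAsync(filePath, { splitPages: false })
}
```

With the module installed, `extractAsString(require('pdf-text-extract'), filePath)` would return a promise that can be used with `.then` or `await`.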

## As a command line tool

```bash
npm install -g pdf-text-extract
```

Execute with the file path as an argument. The output will be a JSON-formatted array of pages.

```bash
pdf-text-extract ./test/data/multipage.pdf
# outputs
# ['', '']
```

# Test

```bash
# install dev dependencies
npm install
# run tests
npm test
```