Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/nisaacson/pdf-text-extract
Extract text from pdfs that contain searchable pdf text
- Host: GitHub
- URL: https://github.com/nisaacson/pdf-text-extract
- Owner: nisaacson
- License: bsd-3-clause
- Created: 2013-03-20T01:50:52.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2019-01-01T13:31:12.000Z (almost 6 years ago)
- Last Synced: 2024-11-07T00:49:27.535Z (7 days ago)
- Language: JavaScript
- Size: 4.39 MB
- Stars: 115
- Watchers: 6
- Forks: 31
- Open Issues: 12
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# PDF Text Extract
Extract text from PDFs that contain searchable text. The module is a wrapper that calls the `pdftotext` command to perform the actual extraction.
[![Build Status](https://travis-ci.org/nisaacson/pdf-text-extract.png?branch=master)](https://travis-ci.org/nisaacson/pdf-text-extract) [![Dependency Status](https://david-dm.org/nisaacson/pdf-text-extract.png)](https://david-dm.org/nisaacson/pdf-text-extract)
# Installation
```bash
npm install --save pdf-text-extract
```
You will need the `pdftotext` binary available on your path. Packages are available for many different operating systems.
See [https://github.com/nisaacson/pdf-extract#osx](https://github.com/nisaacson/pdf-extract#osx) for how to install the `pdftotext` command.
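
Since the module depends on an external binary, it can help to check for it before calling the library. The following is a minimal sketch, not part of pdf-text-extract: the helper name `isCommandAvailable` is hypothetical, and it only tests whether the operating system can start the command. It ignores the exit code, because different `pdftotext` builds may exit differently for a version query.

```javascript
var spawnSync = require('child_process').spawnSync

// Hypothetical helper: returns true when `command` can be started,
// false when spawning fails (for example because it is not on PATH).
function isCommandAvailable (command, args) {
  var result = spawnSync(command, args || [], { stdio: 'ignore' })
  // result.error is only set when the process could not be spawned
  return !result.error
}

// `-v` asks pdftotext to print its version information
if (!isCommandAvailable('pdftotext', ['-v'])) {
  console.error('pdftotext was not found on PATH; install it before using pdf-text-extract')
}
```

If the check fails, the binary can still be used by passing its full path as the `pdftotextcommand` argument described under Usage.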
# Usage
## As a module
`extract(filePath, [options], [pdftotextcommand], callback)`
Options and pdftotextcommand are not required.
```javascript
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
extract(filePath, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  console.dir(pages)
})
```
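
Because `pages` in the example above holds one string per page, ordinary array methods can be used on the result. As an illustration only, the sketch below finds which pages mention a search term. `findPagesContaining` and the sample data are hypothetical and not part of the module.

```javascript
// Hypothetical helper: returns the 1-based page numbers whose text
// contains `term`, ignoring case.
function findPagesContaining (pages, term) {
  var needle = term.toLowerCase()
  var matches = []
  pages.forEach(function (pageText, index) {
    if (pageText.toLowerCase().indexOf(needle) !== -1) {
      matches.push(index + 1)
    }
  })
  return matches
}

// Stand-in data shaped like the `pages` array passed to the callback
var samplePages = [
  'Introduction to the report',
  'Quarterly totals',
  'Appendix: totals by region'
]
// Logs the page numbers 2 and 3
console.log(findPagesContaining(samplePages, 'totals'))
```

Inside a real callback, `samplePages` would be replaced by the `pages` argument.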
The output will be an array where each entry is a page of text. If you want a single string containing all pages, set the option `splitPages: false`.
```javascript
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
extract(filePath, { splitPages: false }, function (err, text) {
  if (err) {
    console.dir(err)
    return
  }
  console.dir(text)
})
```
You can set the following options:
- `firstPage`: First page to extract
- `lastPage`: Last page to extract
- `resolution`: in dpi, as is specified by pdftotext -r
- `crop`: Should be an object { x:x, y:y, w:w, h:h }
- `layout`: Should be either `layout`, `raw` or `htmlmeta`. Default: `layout`
- `encoding`: Should be either `UCS-2`, `ASCII7`, `Latin1`, `UTF-8`, `ZapfDingbats` or `Symbol`. Default: `UTF-8`
- `eol`: End-of-line convention. One of `unix`, `dos` or `mac`
- `ownerPassword`: Owner password (for encrypted files)
- `userPassword`: User password (for encrypted files)
- `splitPages`: If true, the result will be an array of pages. Default: `true`.

If needed, you can pass optional arguments to the extract function. These will be passed to the `child_process.spawn` call.
```javascript
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
var options = {
  cwd: "./"
}
extract(filePath, options, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  // console.log accepts a label followed by a value
  console.log('extracted pages', pages)
})
```
You can also override the command for `pdftotext` if it is installed in a location that is not available in the `PATH` environment variable:
```javascript
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var pdfToTextCommand = '/opt/bin/pdftotext'
var extract = require('pdf-text-extract')
var options = {
  cwd: "./"
}
extract(filePath, options, pdfToTextCommand, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  console.log('extracted pages', pages)
})
```
ES6 promises are supported. You can call `.then(onFulfilled[, onRejected])` on the returned object:
```javascript
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
// path used inside this repository; use require('pdf-text-extract') when installed from npm
var Extract = require('../index.js')
var extract = new Extract(filePath)
extract.then(function (pages) {
  console.log('extracted pages', pages)
}).catch(function (err) {
  console.error('error:', err)
})
```
## As a command line tool
```bash
npm install -g pdf-text-extract
```
Execute with the file path as an argument. The output will be a JSON-formatted array of pages.
```bash
pdf-text-extract ./test/data/multipage.pdf
# outputs
# ['', '']
```
# Test
```bash
# install dev dependencies
npm install
# run tests
npm test
```