https://github.com/elsehow/corpus-from-pdfs

# corpus-from-pdfs

this makes a text corpus (for machine learning) from a batch of PDFs

## usage

**TODO this api**

``` bash
corpus-from-pdfs my-pdf-dir/*.pdf
```
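
The CLI is still a TODO, but the intended pipeline is to run each PDF through [pdf-extract](https://www.npmjs.com/package/pdf-extract) and concatenate the per-page text into one corpus. A minimal sketch of the final assembly step, assuming each PDF has already been reduced to an array of page strings (`buildCorpus` is a hypothetical helper, not part of this module):

``` javascript
// buildCorpus: join per-PDF page texts (as pdf-extract's 'complete'
// event yields them in data.text_pages) into a single plain-text
// corpus, with a blank line between documents.
function buildCorpus(pdfPages) {
  return pdfPages
    .map(function (pages) { return pages.join('\n'); }) // pages of one pdf
    .join('\n\n');                                      // separate pdfs
}

var corpus = buildCorpus([
  ['page one text', 'page two text'],
  ['a second pdf with a single page']
]);
console.log(corpus.split('\n\n').length); // → 2 documents in the corpus
```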

## installation

**TODO publish on npm**

``` bash
npm install -g corpus-from-pdfs
```

unfortunately, [pdf-extract](https://www.npmjs.com/package/pdf-extract) has a few native dependencies you'll need to install on your platform:

- pdftk
- pdftotext
- ghostscript
- tesseract
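
Before installing anything, you can check which of these binaries are already on your `PATH`; a small sketch (the `check_deps` function name is just for illustration):

``` bash
# check_deps: report which of pdf-extract's native dependencies
# are already available on this machine's PATH.
check_deps() {
  for tool in pdftk pdftotext gs tesseract; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
    fi
  done
}

check_deps
```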

### OSX
To begin on OSX, first make sure you have the homebrew package manager installed.

**pdftk** is not available in Homebrew, but a GUI installer is available from pdflabs:
[http://www.pdflabs.com/docs/install-pdftk/](http://www.pdflabs.com/docs/install-pdftk/)

**pdftotext** is included as part of the **poppler** utilities library. **poppler** can be installed via homebrew

``` bash
brew install poppler
```

**ghostscript** can be installed via homebrew
``` bash
brew install gs
```

**tesseract** can be installed via homebrew as well

``` bash
brew install tesseract
```

After tesseract is installed, you need to install the alphanumeric config and an updated trained-data file (adjust the Cellar version in the paths below to match your installed tesseract):
``` bash
cd
npm install
cp "./node_modules/share/eng.traineddata" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/eng.traineddata"
cp "./node_modules/share/dia.traineddata" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/dia.traineddata"
cp "./node_modules/share/configs/alphanumeric" "/usr/local/Cellar/tesseract/3.02.02_3/share/tessdata/configs/alphanumeric"
```

### Ubuntu
**pdftk** can be installed directly via apt-get
```bash
apt-get install pdftk
```

**pdftotext** is included in the **poppler-utils** library. To install poppler-utils, execute
``` bash
apt-get install poppler-utils
```

**ghostscript** can be installed via apt-get
``` bash
apt-get install ghostscript
```

**tesseract** can be installed via apt-get. Note that unlike the OSX install, the package is called **tesseract-ocr** on Ubuntu, not **tesseract**
``` bash
apt-get install tesseract-ocr
```

For the OCR to work, you need to have the tesseract-ocr binaries available on your path. If you only need to handle ASCII characters, the accuracy of the OCR process can be increased by limiting the tesseract output. To do this, copy the *alphanumeric* file included with the pdf-extract module into the *tessdata* folder on your system. Also, the eng.traineddata included with the standard tesseract-ocr package is out of date; pdf-extract provides an up-to-date version, which you should copy into the appropriate location on your system
``` bash
cd
npm install
cp "./node_modules/share/eng.traineddata" "/usr/share/tesseract-ocr/tessdata/eng.traineddata"
cp "./node_modules/share/configs/alphanumeric" "/usr/share/tesseract-ocr/tessdata/configs/alphanumeric"
```

### SmartOS
**pdftk** is not available through apt-get on SmartOS; it can be installed via pkgin, if your pkgsrc repository provides it
```bash
pkgin install pdftk
```

**pdftotext** is included in the **poppler-utils** library, which can likewise be installed via pkgin
``` bash
pkgin install poppler-utils
```

**ghostscript** can be installed via pkgin. Note you may need to update the pkgin repo to include the additional sources provided by Joyent. Check [http://www.perkin.org.uk/posts/9000-packages-for-smartos-and-illumos.html](http://www.perkin.org.uk/posts/9000-packages-for-smartos-and-illumos.html) for details
``` bash
pkgin install ghostscript
```

**tesseract** must be manually downloaded and compiled. You must also install leptonica before installing tesseract. At the time of this writing, leptonica is available from [http://www.leptonica.com/download.html](http://www.leptonica.com/download.html), with the latest version tarball available from [http://www.leptonica.com/source/leptonica-1.69.tar.gz](http://www.leptonica.com/source/leptonica-1.69.tar.gz)
``` bash
pkgin install autoconf
wget http://www.leptonica.com/source/leptonica-1.69.tar.gz
tar -xvzf leptonica-1.69.tar.gz
cd leptonica-1.69
./configure
make
[sudo] make install
```
After installing leptonica, move on to tesseract. Tesseract is available from [https://code.google.com/p/tesseract-ocr/downloads/list](https://code.google.com/p/tesseract-ocr/downloads/list) with the latest version available from [https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.02.tar.gz&can=2&q=](https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.02.tar.gz&can=2&q=)
``` bash
wget -O tesseract-ocr-3.02.02.tar.gz "https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-3.02.02.tar.gz&can=2&q="
tar -xvzf tesseract-ocr-3.02.02.tar.gz
cd tesseract-ocr
./configure
make
[sudo] make install
```

### Windows
Not yet tested. If you figure out how to use pdf-extract on Windows, send me a pull request and I will update the README accordingly.

## Usage

### OCR Extract from scanned image
Extract from a pdf file which contains a scanned image and no searchable text
``` javascript
var inspect = require('eyes').inspector({maxLength: 20000});
var pdf_extract = require('pdf-extract');

var absolute_path_to_pdf = '~/Downloads/sample.pdf';
var options = {
  type: 'ocr' // perform ocr to get the text within the scanned image
};

// `callback(err, text_pages)` is assumed to be defined by the surrounding application
var processor = pdf_extract(absolute_path_to_pdf, options, function (err) {
  if (err) {
    return callback(err);
  }
});
processor.on('complete', function (data) {
  inspect(data.text_pages, 'extracted text pages');
  callback(null, data.text_pages);
});
processor.on('error', function (err) {
  inspect(err, 'error while extracting pages');
  return callback(err);
});
```

### Text extract from searchable pdf
Extract from a pdf file which contains actual searchable text
``` javascript
var inspect = require('eyes').inspector({maxLength: 20000});
var pdf_extract = require('pdf-extract');

var absolute_path_to_pdf = '~/Downloads/electronic.pdf';
var options = {
  type: 'text' // extract the actual text in the pdf file
};

// `callback(err, text_pages)` is assumed to be defined by the surrounding application
var processor = pdf_extract(absolute_path_to_pdf, options, function (err) {
  if (err) {
    return callback(err);
  }
});
processor.on('complete', function (data) {
  inspect(data.text_pages, 'extracted text pages');
  callback(null, data.text_pages);
});
processor.on('error', function (err) {
  inspect(err, 'error while extracting pages');
  return callback(err);
});
```
#### Options
At a minimum you must specify the type of extract you wish to perform

**clean**
When the module extracts text from a multi-page pdf, it first splits the pdf into single pages, which are written to disk before the ocr occurs. For some applications these single-page files can be useful. If you need to work with them after the ocr is complete, set the **clean** option to **false** as shown below. Note that the single-page pdf files are written to the system's temp directory, so you must copy them to a more permanent location yourself after the ocr process completes
``` javascript
var options = {
  type: 'ocr',   // (required) perform ocr to get the text within the scanned image
  clean: false,  // keep the single page pdfs created during the ocr process
  ocr_flags: [
    '-psm 1',       // automatically detect page orientation
    '-l dia',       // use a custom language file
    'alphanumeric'  // only output ascii characters
  ]
};
```

### Events
While processing, the module will emit various events as they occur

**page**
Emitted when a page has completed processing. The data passed with this event looks like
``` javascript
var data = {
  hash: '...',
  text: '...',  // text extracted from this page
  index: 2,
  num_pages: 4,
  pdf_path: "~/Downloads/input_pdf_file.pdf",
  single_page_pdf_path: "/tmp/temp_pdf_file2.pdf"
}
```

**error**
Emitted when an error occurs during processing. After this event is emitted processing will stop.
The data passed with this event looks like
``` javascript
var data = {
  error: 'no file exists at the path you specified',
  pdf_path: "~/Downloads/input_pdf_file.pdf"
}
```

**complete**
Emitted when all pages have completed processing and the pdf extraction is complete
``` javascript
var data = {
  hash: '...',
  text_pages: ['...'],  // one string per page
  pdf_path: "~/Downloads/input_pdf_file.pdf",
  single_page_pdf_file_paths: [
    "/tmp/temp_pdf_file1.pdf",
    "/tmp/temp_pdf_file2.pdf",
    "/tmp/temp_pdf_file3.pdf",
    "/tmp/temp_pdf_file4.pdf"
  ]
}
```

**log**
To avoid spamming process.stdout, log events are emitted instead.

## Tests
To test that your system satisfies the needed dependencies and that the module is functioning correctly, execute the following in the pdf-extract module folder
``` bash
cd node_modules/pdf-extract
npm test
```