https://github.com/axsaucedo/pdftovideo

Host: GitHub
URL: https://github.com/axsaucedo/pdftovideo
Owner: axsaucedo
Created: 2013-07-20T14:14:50.000Z (over 11 years ago)
Default Branch: master
Last Pushed: 2013-07-21T08:45:58.000Z (over 11 years ago)
Last Synced: 2024-04-15T03:22:14.543Z (7 months ago)
Language: JavaScript
Size: 168 KB
Stars: 0
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README

README

To run this, you will first need to make one change:

> In nlp.js, change every 'module.exports = someexport' to 'exports.someexport = someexport'.
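
The edit above can be applied by hand, but a minimal sketch of the same rewrite in Node.js may make the intent clearer. The function name rewriteExports and the sample input are hypothetical, and the sketch assumes each export is a plain identifier, as in the 'someexport' pattern. In Node.js, each 'module.exports = x' assignment replaces the whole export object, so only the last one survives, whereas 'exports.x = x' adds a named property; that is presumably why the change is needed.

```javascript
// Hypothetical helper: rewrite `module.exports = name` as `exports.name = name`.
// Only plain identifiers are rewritten; member expressions, calls, and
// object literals on the right-hand side are left untouched.
function rewriteExports(source) {
  return source.replace(
    /module\.exports\s*=\s*([A-Za-z_$][\w$]*)(?![\w$.(])/g,
    'exports.$1 = $1'
  );
}

// Illustrative input, not the real contents of nlp.js.
const before = 'function tokenize() {}\nmodule.exports = tokenize;\n';
console.log(rewriteExports(before));
```

To apply it to the real file, the source could be read with fs.readFileSync, passed through rewriteExports, and written back with fs.writeFileSync after reviewing the result.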

To install on Ubuntu, you will need the following:

pdftk can be installed directly via apt-get

apt-get install pdftk

pdftotext is included in the poppler-utils package. To install poppler-utils, run

apt-get install poppler-utils

ghostscript can be installed via apt-get

apt-get install ghostscript

tesseract can be installed via apt-get. Note that, unlike the OS X install, the package is called tesseract-ocr on Ubuntu, not tesseract

apt-get install tesseract-ocr

For the OCR to work, the tesseract-ocr binaries must be available on your PATH. If you only need to handle ASCII characters, you can increase the accuracy of the OCR process by limiting the tesseract output. To do this, copy the alphanumeric file included with this pdf-extract module into the tessdata/configs folder on your system. Also, the eng.traineddata file included with the standard tesseract-ocr package is out of date. This pdf-extract module provides an up-to-date version, which you should copy into the appropriate location on your system:

cd 

cp "./share/eng.traineddata" "/usr/share/tesseract-ocr/tessdata/eng.traineddata"

cp "./share/alphanumeric" "/usr/share/tesseract-ocr/tessdata/configs/alphanumeric"
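
With both files copied, the alphanumeric config can be selected by passing its name as a trailing argument to the tesseract command line, which resolves it from the tessdata/configs folder. The sketch below shows one way to do this from Node.js. The function names and the page.tif and page file names are hypothetical, and whether this project invokes tesseract in exactly this way is an assumption.

```javascript
const { execFile } = require('child_process');

// Build the argument list for the tesseract CLI:
//   tesseract <image> <outputbase> -l eng [configfile]
// When useAlphanumeric is true, the 'alphanumeric' config copied above
// is appended so that tesseract restricts its output accordingly.
function tesseractArgs(imagePath, outputBase, useAlphanumeric) {
  const args = [imagePath, outputBase, '-l', 'eng'];
  if (useAlphanumeric) args.push('alphanumeric');
  return args;
}

// Run tesseract; the recognized text is written to <outputBase>.txt.
function runOcr(imagePath, outputBase, useAlphanumeric, callback) {
  execFile('tesseract', tesseractArgs(imagePath, outputBase, useAlphanumeric), callback);
}

// Example call (requires tesseract on the PATH and an existing image):
// runOcr('page.tif', 'page', true, (err) => { if (err) throw err; });
```

Keeping the argument list in its own function makes it easy to inspect or log the exact command before running it.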