https://github.com/axsaucedo/pdftovideo
- Host: GitHub
- URL: https://github.com/axsaucedo/pdftovideo
- Owner: axsaucedo
- Created: 2013-07-20T14:14:50.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2013-07-21T08:45:58.000Z (over 11 years ago)
- Last Synced: 2024-04-15T03:22:14.543Z (7 months ago)
- Language: JavaScript
- Size: 168 KB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README
README
To run this, you will first need to:
> Change every 'module.exports = someexport' to 'exports.someexport = someexport' in nlp.js
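The effect of that edit can be illustrated with a small sketch. It does not use the real nlp.js: `someexport` is the placeholder name from the instruction above, and `loadModule` is a hypothetical stand-in that imitates how Node's CommonJS loader passes `module` and `exports` to a file.

```javascript
// Hypothetical stand-in for Node's CommonJS loader: the file body receives
// `module` and `exports`, where `exports` starts out as `module.exports`.
function loadModule(source) {
  const module = { exports: {} };
  new Function('module', 'exports', source)(module, module.exports);
  return module.exports;
}

// Placeholder for a definition inside nlp.js (not the real file contents).
const body = "function someexport() { return 'ok'; }\n";

// Original style: the assignment replaces the whole exports object.
const before = loadModule(body + 'module.exports = someexport;');

// Edited style: the assignment adds one named property to the exports object.
const after = loadModule(body + 'exports.someexport = someexport;');

console.log(typeof before);           // the module IS the function
console.log(Object.keys(after));      // the module is an object with named members
console.log(after.someexport());      // members are reached by name
```

With the first style, a file containing several such assignments keeps only the last one. With the second style, every export stays reachable by name, which is presumably why the change is required here.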
To install on Ubuntu, you will need the following:
pdftk can be installed directly via apt-get
apt-get install pdftk
pdftotext is included in the poppler-utils package. To install poppler-utils, execute
apt-get install poppler-utils
ghostscript can be installed via apt-get
apt-get install ghostscript
tesseract can be installed via apt-get. Note that, unlike the OS X install, the package is called tesseract-ocr on Ubuntu, not tesseract
apt-get install tesseract-ocr
For the OCR to work, the tesseract-ocr binaries must be available on your path. If you only need to handle ASCII characters, the accuracy of the OCR process can be increased by limiting the tesseract output. To do this, copy the alphanumeric file included with this pdf-extract module into the tessdata configs folder on your system. Also, the eng.traineddata file included with the standard tesseract-ocr package is out of date. This pdf-extract module provides an up-to-date version, which you should copy into the appropriate location on your system:
cd
cp "./share/eng.traineddata" "/usr/share/tesseract-ocr/tessdata/eng.traineddata"
cp "./share/alphanumeric" "/usr/share/tesseract-ocr/tessdata/configs/alphanumeric"
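After installing, it can help to confirm that the tools are actually available on your path. The following is a minimal Node.js sketch, not part of this project. The helper `findOnPath` is a hypothetical name, and the binary names (`pdftk`, `pdftotext`, `gs`, `tesseract`) are the executables that the packages listed above are expected to provide.

```javascript
const fs = require('fs');
const path = require('path');

// Return the first executable file called `name` found in a PATH-style
// string, or null if no directory contains one.
function findOnPath(name, pathVar = process.env.PATH || '') {
  for (const dir of pathVar.split(path.delimiter)) {
    if (!dir) continue;
    const candidate = path.join(dir, name);
    try {
      fs.accessSync(candidate, fs.constants.X_OK);
      if (fs.statSync(candidate).isFile()) return candidate;
    } catch (err) {
      // Missing or not executable in this directory; keep looking.
    }
  }
  return null;
}

// Executables assumed to come from pdftk, poppler-utils, ghostscript
// and tesseract-ocr respectively.
const required = ['pdftk', 'pdftotext', 'gs', 'tesseract'];
for (const name of required) {
  const found = findOnPath(name);
  console.log(`${name}: ${found || 'NOT FOUND on path'}`);
}
```

Any tool reported as not found probably needs its package installed, or its install directory added to your path, before the OCR step will work.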