https://github.com/thisisparker/bookscanning

:books: some scripts and things for processing scanned books

Host: GitHub
URL: https://github.com/thisisparker/bookscanning
Owner: thisisparker
License: unlicense
Created: 2013-11-28T20:08:47.000Z (over 12 years ago)
Default Branch: master
Last Pushed: 2013-11-28T20:27:01.000Z (over 12 years ago)
Last Synced: 2025-03-25T06:34:00.247Z (about 1 year ago)
Language: Ruby
Homepage:
Size: 117 KB
Stars: 11
Watchers: 4
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

bookscanning
============

Some scripts and things for processing scanned books. These will probably only be useful to people who have a huge directory of TIFF files of the sort that ScanTailor puts out after processing a huge directory of JPEG files from a DIY Book Scanner. I wrote these to help process a book scanned at Noisebridge.

These steps assume you've got `imagemagick`, `tesseract`, `pdftk`, and, like, Perl and Ruby installed.

`looptifftopdf.rb` converts all those TIFFs to PDF files in a subdirectory called `/pdf/`. Then I just went into that directory and used pdftk like `pdftk *.pdf cat output full-book.pdf`. That made a very large PDF, and I'm working on making it smaller.
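For anyone who wants the general shape of that conversion loop without opening the script, here's a minimal sketch. It is not the contents of `looptifftopdf.rb`: the method name `tiff_to_pdf_commands`, the `*.tif`/`*.tiff` glob, and the relative `pdf` output directory are all assumptions made for illustration. The only external piece it relies on is ImageMagick's `convert input output` form.

```ruby
require "fileutils"

# Build one ImageMagick command (as an argv array) per TIFF.
# Hypothetical helper: the name and the output layout are assumptions,
# not taken from looptifftopdf.rb.
def tiff_to_pdf_commands(tiff_paths, out_dir)
  tiff_paths.sort.map do |tiff|
    # page-001.tif -> <out_dir>/page-001.pdf
    pdf = File.join(out_dir, File.basename(tiff, ".*") + ".pdf")
    ["convert", tiff, pdf]
  end
end

if __FILE__ == $PROGRAM_NAME
  out_dir = "pdf" # assumed to be relative to the TIFF directory
  FileUtils.mkdir_p(out_dir)
  tiffs = Dir.glob("*.{tif,tiff}")
  tiff_to_pdf_commands(tiffs, out_dir).each do |cmd|
    # Passing an argv array to system avoids shell quoting issues.
    system(*cmd) or warn "failed: #{cmd.join(' ')}"
  end
end
```

Sorting the paths keeps the pages in filename order, which matters later because `pdftk *.pdf` concatenates in whatever order the shell expands the glob.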

`ocrthethings.rb` runs through the same TIFFs with Tesseract and produces a final OCRed output that is pretty good. One weird thing is it had page numbers in it, which I don't think I need. The difficult part is that they're surrounded by newlines, and I couldn't just strip out all the lines at once with grep or sed or whatever. So I turned to Perl, and used this one-liner:

`perl -0777 -pe 's/\n\d{1,3}\n\n//g' finaltext.txt > finaltext-nonums.txt`

That'll work so long as the page numbers are surrounded by newlines, are between 1 and 3 digits long, have nothing else on the same line, and you want to yank out all three lines each time.
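If you'd rather stay in Ruby, the same substitution can be sketched like this. The method name `strip_page_numbers` and the two-argument command-line form are made up for illustration; the regex is the one from the Perl one-liner.

```ruby
# Matches: a newline, a 1-3 digit page number alone on its line,
# the newline ending that line, and the blank line after it.
PAGE_NUMBER = /\n\d{1,3}\n\n/

# Hypothetical helper: returns the text with every match removed.
def strip_page_numbers(text)
  text.gsub(PAGE_NUMBER, "")
end

# Assumed usage: ruby strip.rb finaltext.txt finaltext-nonums.txt
if __FILE__ == $PROGRAM_NAME && ARGV.length == 2
  File.write(ARGV[1], strip_page_numbers(File.read(ARGV[0])))
end
```

One thing to keep in mind: the pattern also eats the newline before the page number, so the text on either side of a removed number ends up joined with no line break between them.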

TODO
====

* Make PDF much smaller, with some kinda optimization along the way
* (perhaps at cross purposes) add the OCR to the PDF with hOCR or something.
