An open API service indexing awesome lists of open source software.

https://github.com/thisisparker/bookscanning

:books: some scripts and things for processing scanned books
https://github.com/thisisparker/bookscanning

Last synced: about 1 year ago
JSON representation

:books: some scripts and things for processing scanned books

Awesome Lists containing this project

README

          

bookscanning
============

Some scripts and things for processing scanned books. These will probably only be useful to people that have a huge directory of TIFF files of the sort that ScanTailor puts out after processing a huge directory of JPEG files that a DIY Book Scanner produces. I wrote these to help process a book scanned at Noisebridge.

These steps assume you've got `imagemagick`, `tesseract`, `pdftk`, and, like, Perl and Ruby installed.

`looptifftopdf.rb` converts all those TIFFs to PDF files in a subdirectory called `/pdf/`. Then I just went into that directory and used pdftk like `pdftk *.pdf cat full-book.pdf`. It's made a very large PDF, and I'm working on making that smaller.

`ocrthethings.rb` runs through the same TIFFs with Tesseract and produces a final OCRed output that is pretty good. One weird thing is it had page numbers in it, which I don't think I need. The difficult part is that they're surrounded by newlines, and I couldn't just strip out all the lines at once with grep or sed or whatever. So I turned to Perl, and used this one-liner:

`perl -pe 'undef $/; s/\n\d{1,3}\n\n//g' finaltext.txt > finaltext-nonums.txt`

That'll work so long as the page numbers are surrounded by newlines, are between 1 and 3 digits long, have nothing else on the same line, and you want to yank out all three lines each time.

TODO
====

* Make PDF much smaller, with some kinda optimization along the way
* (perhaps at cross purposes) add the OCR to the PDF with hOCR or something.