{"id":21620998,"url":"https://github.com/thisisparker/bookscanning","last_synced_at":"2025-04-11T09:19:21.937Z","repository":{"id":12184037,"uuid":"14785561","full_name":"thisisparker/bookscanning","owner":"thisisparker","description":":books: some scripts and things for processing scanned books","archived":false,"fork":false,"pushed_at":"2013-11-28T20:27:01.000Z","size":120,"stargazers_count":11,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-25T06:34:00.247Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"unlicense","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thisisparker.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-11-28T20:08:47.000Z","updated_at":"2024-02-26T16:51:33.000Z","dependencies_parsed_at":"2022-08-20T10:01:04.443Z","dependency_job_id":null,"html_url":"https://github.com/thisisparker/bookscanning","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thisisparker%2Fbookscanning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thisisparker%2Fbookscanning/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thisisparker%2Fbookscanning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thisisparker%2Fbookscanning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thisisparker","download_url":"https://codeload.github.com/thisisparker/bookscanning/tar.gz/refs/heads/master","host":{"n
ame":"GitHub","url":"https://github.com","kind":"github","repositories_count":248366947,"owners_count":21092074,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-24T23:13:33.711Z","updated_at":"2025-04-11T09:19:21.904Z","avatar_url":"https://github.com/thisisparker.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"bookscanning\n============\n\nSome scripts and things for processing scanned books. These will probably only be useful to people that have a huge directory of TIFF files of the sort that ScanTailor puts out after processing a huge directory of JPEG files that a DIY Book Scanner produces. I wrote these to help process a book scanned at Noisebridge.\n\nThese steps assume you've got `imagemagick`, `tesseract`, `pdftk`, and, like, Perl and Ruby installed.\n\n`looptifftopdf.rb` converts all those TIFFs to PDF files in a subdirectory called `/pdf/`. Then I just went into that directory and used pdftk like `pdftk *.pdf cat output full-book.pdf`. It's made a very large PDF, and I'm working on making that smaller.\n\n`ocrthethings.rb` runs through the same TIFFs with Tesseract and produces a final OCRed output that is pretty good. One weird thing is it had page numbers in it, which I don't think I need. The difficult part is that they're surrounded by newlines, and I couldn't just strip out all the lines at once with grep or sed or whatever. 
So I turned to Perl, and used this one-liner:\n\n`perl -pe 'undef $/; s/\\n\\d{1,3}\\n\\n//g' finaltext.txt \u003e finaltext-nonums.txt`\n\nThat'll work so long as the page numbers are surrounded by newlines, are between 1 and 3 digits long, have nothing else on the same line, and you want to yank out all three lines each time.\n\nTODO\n====\n\n* Make PDF much smaller, with some kinda optimization along the way\n* (perhaps at cross purposes) add the OCR to the PDF with hOCR or something.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthisisparker%2Fbookscanning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthisisparker%2Fbookscanning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthisisparker%2Fbookscanning/lists"}