{"id":16958480,"url":"https://github.com/dohliam/ebook-corpus","last_synced_at":"2025-04-14T09:19:57.479Z","repository":{"id":88991959,"uuid":"200760139","full_name":"dohliam/ebook-corpus","owner":"dohliam","description":"Ebook Corpus - A parser and extractor for electronic books","archived":false,"fork":false,"pushed_at":"2019-08-06T02:27:03.000Z","size":40,"stargazers_count":7,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-14T09:19:50.531Z","etag":null,"topics":["corpus","corpus-builder","corpus-linguistics","ebook-parsing","ebooks","epub","fb2","mobi"],"latest_commit_sha":null,"homepage":null,"language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dohliam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-08-06T02:16:45.000Z","updated_at":"2024-10-10T05:29:36.000Z","dependencies_parsed_at":"2023-06-13T11:15:10.563Z","dependency_job_id":null,"html_url":"https://github.com/dohliam/ebook-corpus","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dohliam%2Febook-corpus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dohliam%2Febook-corpus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dohliam%2Febook-corpus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dohliam%2Febook-corpus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/owners/dohliam","download_url":"https://codeload.github.com/dohliam/ebook-corpus/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248852186,"owners_count":21171843,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["corpus","corpus-builder","corpus-linguistics","ebook-parsing","ebooks","epub","fb2","mobi"],"created_at":"2024-10-13T22:42:42.150Z","updated_at":"2025-04-14T09:19:57.471Z","avatar_url":"https://github.com/dohliam.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Ebook Corpus - A parser and extractor for electronic books\n\nEbook Corpus is a set of tools for parsing and extracting the text of ebooks in various formats, designed for the purpose of creating large multilingual ebook-based text corpora.\n\nMany people have amassed enormous collections of ebooks, often containing millions of lines of text when taken as a whole, so it is always surprising to find that there aren't more tools and libraries available to work with ebooks as a corpus source. It seems that almost all the existing tools are focused on consuming (reading) ebooks, while the remaining few provide the functionality to create ebooks to be thus consumed.\n\nAs wonderful as ebooks are, they are often packaged in formats that are incredibly underspecified, or worse, that don't follow the specifications that do exist. 
A remarkable number of parsing libraries choke on very simple books even in presumably well-supported formats like EPUB3.\n\nThere are many ways for an ebook to defy the expectations of the parser -- perhaps it has been written in Unicode and the parser only handles US-ASCII, or the parser expects Unicode and it's written in KOI-8. Maybe the ebook contains an OPF file called `content.opf` in the root directory, or maybe it's in a separate `CONTENT` subfolder -- or called something completely different, like `mytoc.opf` or `目录.opf`.\n\nThe Ebook Corpus tools won't solve all of these problems, but they nevertheless provide a number of options to make it easier to work with large, multilingual collections of ebooks as a raw text source.\n\n## Usage\n\nInvoking the program on the command-line is straightforward:\n\n    ./ebook.rb [options] [filename]\n\nWhere `[filename]` is the path to the ebook file that you want to work with. If the file has a standard extension (`*.epub`, `*.mobi`, `*.fb2`) it should be detected automatically.\n\n### Options\n\n* `-a` or `--all`: _Extract all contents of epub_\n* `-c` or `--cover`: _Extract cover image_\n* `-f` or `--flatten-dir`: _Save all files to the current folder rather than an individual directory_\n* `-h` or `--html`: _Extract raw html_\n* `-i` or `--images`: _Extract images to a separate folder_\n* `-m` or `--metadata`: _Print metadata_\n  * `-T` or `--title`: _Print title metadata only_\n  * `-A` or `--author`: _Print author metadata only_\n  * `-I` or `--isbn`: _Print ISBN metadata only_\n  * `-L` or `--language`: _Print language metadata only_\n  * `-P` or `--publisher`: _Print publisher metadata only_\n  * `-D` or `--description`: _Print description metadata only_\n* `-o` or `--output-dir DIR`: _Save output to specified directory_\n* `-s` or `--save`: _Save (text or html) to file instead of printing_\n* `-t` or `--text`: _Extract plain text_\n* `-T` or `--tests`: _Run test suite_\n* `-p` or `--pager`: _View text in 
pager_\n* `-v` or `--view`: _Open images in viewer_\n\n## Supported formats\n\nFormat | File extension\n------ | --------------\nEPUB | `.epub`\n[FictionBook](https://en.wikipedia.org/wiki/FictionBook) | `.fb2`\n[Mobipocket](http://wiki.mobileread.com/wiki/MOBI) | `.mobi`, `.prc`, `.azw`\n\nSupport for Mobipocket files is provided via a wrapper for the Python script [mobiunpack.py](http://www.mobileread.com/forums/showthread.php?t=61986) by [@kevinhendricks](https://github.com/kevinhendricks) (released as [GPL3](https://github.com/kevinhendricks/KindleUnpack/blob/master/COPYING.txt)). If you know of a drop-in replacement library in Ruby for parsing MOBI files (or are interested in writing one), please let me know!\n\nNote that only ebooks without DRM will work with this script.\n\n## Contributing\n\nPRs, suggestions, examples of ebooks that don't parse properly, and other contributions are always welcome! Providing support for additional formats or opening issues for bugs are examples of ways to help.\n\nMOBI support has only been tested against files with the `.mobi` extension. It should in theory also work for other extensions. If you have access to ebooks with a `.prc` or `.azw` file extension and can confirm this, that would be appreciated!\n\n## To do\n\nCode is pretty ad hoc at the moment and generally in need of a cleanup. 
Different formats are handled separately but should probably be merged.\n\nOther things:\n\n* Guess alternately-named `content.opf` files\n* Figure out cross-platform way of opening images in default viewer (current kludge is hard-coded to open image folder in Gwenview since `xdg-open` doesn't play nicely with cleaning up temporary files after viewing)\n\n## License\n\nMIT.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdohliam%2Febook-corpus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdohliam%2Febook-corpus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdohliam%2Febook-corpus/lists"}