{"id":13760669,"url":"https://github.com/DaRealFreak/epub-scraper","last_synced_at":"2025-05-10T11:30:32.625Z","repository":{"id":38859698,"uuid":"206810406","full_name":"DaRealFreak/epub-scraper","owner":"DaRealFreak","description":"scrapes novels dynamically with YAML configuration files","archived":false,"fork":false,"pushed_at":"2022-10-03T17:16:40.000Z","size":313,"stargazers_count":13,"open_issues_count":12,"forks_count":4,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-08-03T13:05:16.316Z","etag":null,"topics":["epub","epub-scraper","novels"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DaRealFreak.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-09-06T14:24:03.000Z","updated_at":"2024-03-16T20:00:53.000Z","dependencies_parsed_at":"2022-09-18T12:41:39.870Z","dependency_job_id":null,"html_url":"https://github.com/DaRealFreak/epub-scraper","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DaRealFreak%2Fepub-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DaRealFreak%2Fepub-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DaRealFreak%2Fepub-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DaRealFreak%2Fepub-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DaRealFreak","download_url":"https://codeload.github.com/DaRealFreak/epub-scraper/tar.gz/refs/heads/master","
host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224949904,"owners_count":17397250,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["epub","epub-scraper","novels"],"created_at":"2024-08-03T13:01:16.236Z","updated_at":"2024-11-16T17:31:40.128Z","avatar_url":"https://github.com/DaRealFreak.png","language":"Go","funding_links":[],"categories":["Go"],"sub_categories":[],"readme":"# Epub-Scraper\n[![Go Report Card](https://goreportcard.com/badge/github.com/DaRealFreak/epub-scraper)](https://goreportcard.com/report/github.com/DaRealFreak/epub-scraper)  ![GitHub](https://img.shields.io/github/license/DaRealFreak/epub-scraper)  \nApplication to scrape novels and convert them into EPUB files based on YAML configuration files.\n\n## Dependencies\n- [Calibre](https://calibre-ebook.com/) - cross-platform open-source suite of e-book software.\n\nThis dependency is required to fix encoding errors, compress images and keep the output up to date with the latest standards (its `ebook-polish` tool has to be callable).\n\n## Usage\nYou can process configuration files either by dropping them onto the binary\nor by passing them on the command line.  \nWhen a folder is passed to the binary, it processes all .yaml files within that folder.\n\n```\nUsage:\n  scraper [file 1] [file 2] ... 
[flags]\n  scraper [command]\n\nAvailable Commands:\n  help        Help about any command\n  update      update the application\n\nFlags:\n  -h, --help               help for scraper\n  -v, --verbosity string   log level (debug, info, warn, error, fatal, panic) (default \"info\")\n      --version            version for scraper\n```\n\n## Configuration\nTo be compatible with most use cases a lot of configurations are possible for the extraction of the e-book source.\nOnly a few keys are actually required though, so you can generate valid Epub files with a minimal configuration already.\n\nMost minimal configuration with at least 1 chapter would be:\n```yaml\ngeneral:\n  title: [string][required]\n  author: [string][required]\nchapters:\n  - chapter:\n      url: [string][required]\n      title-content:\n        title-selector: [string][required]\n      chapter-content:\n        content-selector: [string][required]\n```\n\nYou can also find multiple real usage example configurations in the [examples](examples) folder.\n\n### General\nMetadata and Table of Content related information for the generated Epub file.\n\nAll available configuration options:\n```yaml\ngeneral:\n  # title of the generated Epub\n  title: [string][required]\n  # sub title of the generated Epub\n  alt-title: [string]\n  # author of the generated Epub\n  author: [string][required]\n  # description of the generated Epub\n  description: [string]\n  # cover image, can be either a file path or an URL to an image\n  cover: [string]\n  # language of the generated Epub\n  language: [string]\n  # link to the original novel\n  raw: [string]\n  # translators to be mentioned and linked in the Table of Content page\n  translators:\n    - # displayed name of the translator\n      name: [string]\n      # URL to link the displayed name to\n      url: [string]\n```\n\n### Sites\nOptional section with the intention to single out the chapter title and content settings by the domain.\nEspecially useful in case 
single chapters are getting added in the chapters section.  \nRedirects are only configurable in this section. Each redirect configuration is only used if the chapter host matches the site configuration host.\nIf the request is redirected to a different host, the site configuration of the new host is used.\n\nAll available configuration options:\n```yaml\nsites:\n  - # host of site\n    host: [string][required]\n    # possible redirects, it'll try to follow them as deep as possible, else it'll use the next closest URL\n    redirects: [list of strings]\n    # configurations related to the wayback machine in case the website doesn't exist anymore\n    wayback-machine:\n      # to enable or disable the usage of the wayback machine, default value is false\n      use: [boolean]\n      # version of the wayback machine to use:\n      # 0 is the oldest entry\n      # 2 is the newest entry\n      version: [integer]\n    # optional configuration in case the Table of Content has multiple pages\n    pagination:\n      # should extracted chapters be reversed?\n      # allows newest -\u003e oldest navigation to work with unknown amount of pages\n      reverse-posts: [boolean]\n      # CSS selector to the next page, has to point to an element with an \"href\" attribute\n      next-page-selector: [string]\n    # required configurations to extract the chapter titles\n    title-content:\n      # will add a \"Chapter [index+1] - \" to the title if true\n      add-prefix: [boolean]\n      # CSS selector for the title\n      title-selector: [string][required]\n      # possibility to narrow down title selection by cutting off the prefix\n      # cut off will only occur at first match, so use 2x same prefix if you want to select after the 2nd occurrence\n      prefix-selectors: [list of strings]\n      # possibility to narrow down title selection by cutting off the suffix\n      # cut off will only occur after first match, so use 2x same suffix if you want to select before 2nd last occurrence\n   
   suffix-selectors: [list of strings]\n      # option to further strip the extracted title from unwanted content using regular expressions\n      # requires the capture group \"Title\"\n      strip-regex: [string]\n      # option to remove content from title using regular expressions\n      # everything matching will be replaced with empty string\n      cleanup-regex: [string]\n    # required configuration to extract the chapter content\n    chapter-content:\n      # CSS selector for the chapter content\n      content-selector: [string][required]\n      # option to further strip the extracted chapter from unwanted content using regular expressions\n      # requires the capture group \"Content\"\n      strip-regex: [string]\n      # option to remove content from chapter content using regular expressions\n      # everything matching will be replaced with empty string\n      cleanup-regex: [string]\n      # possibility to narrow down content selection by cutting off the prefix\n      # cut off will only occur at first match, so use 2x same prefix if you want to select after the 2nd occurrence\n      prefix-selectors: [list of strings]\n      # possibility to narrow down content selection by cutting off the suffix\n      # cut off will only occur after first match, so use 2x same suffix if you want to select before 2nd last occurrence\n      suffix-selectors: [list of strings]\n```\n\n### Chapters\nContains the configuration of where to extract chapters from. Either direct links to chapters (chapter) or links to\nTable of Content (toc) pages are available.  \nOne element can't have both chapter and toc at the same time since the order of the chapters would be unknown.\nInstead, add each as a separate chapter source.  
\nIf `title-content` and `chapter-content` are not set, the related site configuration is used, if available.\nIf both the chapter source and the related site are configured, the chapter source configuration takes precedence over the site configuration.\n\nAll available configuration options:\n```yaml\nchapters:\n  # table of content element where we can extract chapters from\n  - toc:\n      # URL to extract chapters from (and starting point of the navigation if set)\n      url: [string][required]\n      # CSS selector to the chapter link, has to point to an element with an \"href\" attribute\n      # redirects are possible with the site configuration (e.g. for blog post -\u003e chapter links)\n      chapter-selector: [string][required]\n      # optional configuration in case the Table of Content has multiple pages\n      pagination:\n        # should extracted chapters be reversed?\n        # allows newest -\u003e oldest navigation to work with unknown amount of pages\n        reverse-posts: [boolean]\n        # CSS selector to the next page, has to point to an element with an \"href\" attribute\n        next-page-selector: [string]\n      # required configurations to extract the chapter titles\n      title-content:\n        # will add a \"Chapter [index+1] - \" to the title if true\n        add-prefix: [boolean]\n        # CSS selector for the title\n        title-selector: [string][required]\n        # possibility to narrow down title selection by cutting off the prefix\n        # cut off will only occur at first match, so use 2x same prefix if you want to select after the 2nd occurrence\n        prefix-selectors: [list of strings]\n        # possibility to narrow down title selection by cutting off the suffix\n        # cut off will only occur after first match, so use 2x same suffix if you want to select before 2nd last occurrence\n        suffix-selectors: [list of strings]\n        # option to further strip the extracted title from unwanted content using regular 
expressions\n        # requires the capture group \"Title\"\n        strip-regex: [string]\n        # option to remove content from title using regular expressions\n        # everything matching will be replaced with empty string\n        cleanup-regex: [string]\n      # required configuration to extract the chapter content\n      chapter-content:\n        # CSS selector for the chapter content\n        content-selector: [string][required]\n        # option to further strip the extracted chapter from unwanted content using regular expressions\n        # requires the capture group \"Content\"\n        strip-regex: [string]\n        # option to remove content from chapter content using regular expressions\n        # everything matching will be replaced with empty string\n        cleanup-regex: [string]\n        # possibility to narrow down content selection by cutting off the prefix\n        # cut off will only occur at first match, so use 2x same prefix if you want to select after the 2nd occurrence\n        prefix-selectors: [list of strings]\n        # possibility to narrow down content selection by cutting off the suffix\n        # cut off will only occur after first match, so use 2x same suffix if you want to select before 2nd last occurrence\n        suffix-selectors: [list of strings]\n\n  # chapter element, direct link to the chapter\n  - chapter:\n      # direct link to the chapter, redirects possible with the site configuration (e.g. 
blog post -\u003e chapter links)\n      url: [string][required]\n      # required configurations to extract the chapter titles\n      title-content:\n        # will add a \"Chapter [index+1] - \" to the title if true\n        add-prefix: [boolean]\n        # CSS selector for the title\n        title-selector: [string][required]\n        # possibility to narrow down title selection by cutting off the prefix\n        # cut off will only occur at first match, so use 2x same prefix if you want to select after the 2nd occurrence\n        prefix-selectors: [list of strings]\n        # possibility to narrow down title selection by cutting off the suffix\n        # cut off will only occur after first match, so use 2x same suffix if you want to select before 2nd last occurrence\n        suffix-selectors: [list of strings]\n        # option to further strip the extracted title from unwanted content using regular expressions\n        # requires the capture group \"Title\"\n        strip-regex: [string]\n        # option to remove content from title using regular expressions\n        # everything matching will be replaced with empty string\n        cleanup-regex: [string]\n      # required configuration to extract the chapter content\n      chapter-content:\n        # CSS selector for the chapter content\n        content-selector: [string][required]\n        # option to further strip the extracted chapter from unwanted content using regular expressions\n        # requires the capture group \"Content\"\n        strip-regex: [string]\n        # option to remove content from chapter content using regular expressions\n        # everything matching will be replaced with empty string\n        cleanup-regex: [string]\n        # possibility to narrow down content selection by cutting off the prefix\n        # cut off will only occur at first match, so use 2x same prefix if you want to select after the 2nd occurrence\n        prefix-selectors: [list of strings]\n        # possibility to narrow down 
content selection by cutting off the suffix\n        # cut off will only occur after first match, so use 2x same suffix if you want to select before 2nd last occurrence\n        suffix-selectors: [list of strings]\n```\n\n### Blacklist\nYou can blacklist URLs from which no chapter data should be extracted. This is useful if you use multiple hosts\nto extract chapters which may overlap with each other. The blacklist is also checked while following redirects.\n\nConfiguration:\n```yaml\nblacklist: [list of strings]\n```\n\n\n### Assets\nThe assets section contains information about the assets included in the generated .epub file.\nAdded assets will be included in every added chapter automatically.\n```yaml\nassets:\n  css:\n    # path relative to YAML file to the CSS file used in the generated Epub\n    path: [string]\n  font:\n    # path relative to YAML file to the font file used in the generated Epub\n    path: [string]\n```\n\n### Replacements\nIf a domain was renamed or something similar happened, you can replace found URIs.\nThis also applies to redirects and can be configured in the replacements section of the YAML configuration:\n```yaml\nreplacements:\n  # list of replacements\n    # url is the found URI to be redirected\n  - url: [string]\n    # replacement is the URI to replace the found URI with\n    replacement: [string]\n```\n\n### Templates\nAside from the CSS and font files you can also modify the used templates to create your own individually styled epub.\nThese can be configured in the templates section of the YAML configuration:\n```yaml\ntemplates:\n  # all configurations related to the table of content\n  toc:\n    # this is the full HTML page template of the table of content page\n    content: [string]\n    # alt title template used as sub headline\n    alt-title: [string]\n    # this is the HTML string of the chained list of translators\n    translator: [string]\n  # all configurations related to the HTML content of the extracted 
chapters\n  chapter:\n    # this is the full HTML page template of the extracted chapter pages\n    content: [string]\n    # chapter title used in chapter displays (title/headline/optional ToC content)\n    title: [string]\n```\n\nEvery template can use multiple variables using the template Syntax `{{.variableName}}`.  \n\n---\n**toc.content**:  \n\n| Name | Description | Related Configuration |\n|:---|:---|:---|\n|title|Title of the generated Epub|general.title|\n|altTitle|Alternative Title/Subtitle generated from the templates.toc.alt-title template|-|\n|rawUrl|URL to the untranslated chapters|general.raw|\n|author|Author name|general.author|\n|toc|Table of Contents, generated from the chapter list, **this variable is not used by default**|-|\n|translators|List of translators using the templates.toc.translator template|-|\n|epubScraperCredits|Credit for the Epub Scraper project including link to the repository|-|\n\n*default*:\n```html\n\u003cdiv\u003e\n    \u003ch3\u003e{{.title}}\u003c/h3\u003e\n    {{.altTitle}}\n    \u003cdiv class=\"center\"\u003e\n        \u003cp\u003e\u003ca href=\"{{.rawUrl}}\"\u003eOriginal Webnovel\u003c/a\u003e by {{.author}}\u003c/p\u003e\n        {{.toc}}\n    \u003c/div\u003e\n    \u003cdiv class=\"small-font bottom-align center\"\u003e\n        \u003cp\u003eVisit the translators at:\u003cbr/\u003e\n            {{.translators}}\n        \u003c/p\u003e\n        \u003cp\u003e\n            {{.epubScraperCredits}}\n        \u003c/p\u003e\n    \u003c/div\u003e\n\u003c/div\u003e\n```\n\n---\n**toc.alt-title**\n\n| Name | Description | Related Configuration |\n|:---|:---|:---|\n|altTitle|Alternative Title|general.alt-title|\n\n*default*\n```html\n\u003ch4\u003e\n    \u003ci\u003e- {{.altTitle}} -\u003c/i\u003e\n\u003c/h4\u003e\n```\n\n---\n**toc.translator**\n\n| Name | Description | Related Configuration |\n|:---|:---|:---|\n|translatorURL|URL to Website of the Translator|general.translators.[i].url|\n|translatorName|Name of the 
Translator/Translator Group|general.translators.[i].name|\n\n*default*\n```html\n\u003ca href=\"{{.translatorURL}}\"\u003e{{.translatorName}}\u003c/a\u003e\n\u003cbr/\u003e\n```\n\n---\n**chapter.content**\n\n| Name | Description |\n|:---|:---|\n|chapterTitle|Title Text of the Chapter generated with the chapter.title template|\n|content|HTML Content of the Chapter|\n\n*default*\n```html\n\u003cdiv class=\"left\" style=\"text-align:left;text-indent:0;\"\u003e\n    \u003ch3\u003e{{.chapterTitle}}\u003c/h3\u003e\n    \u003chr/\u003e\n    {{.content}}\n\u003c/div\u003e\n```\n\n---\n**chapter.title**\n\n| Name | Description |\n|:---|:---|\n|chapterIndex|Numeric index of the chapter starting with 1|\n|chapterTitle|Title Text extracted from the chapter|\n\n*default*\n```html\nChapter {{.chapterIndex}} - {{.chapterTitle}}\n```\n\n## License\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDaRealFreak%2Fepub-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FDaRealFreak%2Fepub-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDaRealFreak%2Fepub-scraper/lists"}