{"id":13393344,"url":"https://github.com/danburzo/percollate","last_synced_at":"2025-05-14T02:04:57.884Z","repository":{"id":39636901,"uuid":"150767713","full_name":"danburzo/percollate","owner":"danburzo","description":"A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.","archived":false,"fork":false,"pushed_at":"2024-09-02T09:51:13.000Z","size":1817,"stargazers_count":4304,"open_issues_count":16,"forks_count":165,"subscribers_count":42,"default_branch":"main","last_synced_at":"2024-10-29T09:28:26.630Z","etag":null,"topics":["cli","epub","html","markdown","pdf","puppeteer","readability"],"latest_commit_sha":null,"homepage":"https://danburzo.ro/projects/percollate/","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/danburzo.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-28T16:32:56.000Z","updated_at":"2024-10-29T00:45:22.000Z","dependencies_parsed_at":"2022-07-18T05:46:10.372Z","dependency_job_id":"63e91f6e-51ca-439b-9208-3a95714b8b68","html_url":"https://github.com/danburzo/percollate","commit_stats":{"total_commits":299,"total_committers":19,"mean_commits":"15.736842105263158","dds":0.4782608695652174,"last_synced_commit":"436cc0ab600264319bfd0d8607b55bbf8b867f85"},"previous_names":[],"tags_count":67,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danburzo%2Fpercollate","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danburzo%2Fperc
ollate/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danburzo%2Fpercollate/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/danburzo%2Fpercollate/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/danburzo","download_url":"https://codeload.github.com/danburzo/percollate/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254052692,"owners_count":22006716,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","epub","html","markdown","pdf","puppeteer","readability"],"created_at":"2024-07-30T17:00:50.781Z","updated_at":"2025-05-14T02:04:57.849Z","avatar_url":"https://github.com/danburzo.png","language":"JavaScript","readme":"\u003cimg src='./.github/percollate.svg' alt='percollate' width='200'/\u003e\n\n\u003ca href=\"https://www.npmjs.org/package/percollate\"\u003e\u003cimg src=\"https://img.shields.io/npm/v/percollate.svg?style=flat-square\u0026labelColor=324A97\u0026color=black\" alt=\"npm version\"\u003e\u003c/a\u003e\n\nPercollate is a command-line tool that turns web pages into beautifully formatted PDF, EPUB, HTML or Markdown files.\n\n\u003cfigure style='margin: 1rem 0'\u003e\n\t\u003cimg alt=\"Sample Output\" src=\"./.github/dimensions-of-colour.png\"\u003e\n\t\u003cfigcaption style='font-style: italic'\u003eSample spread from the generated PDF of \u003ca href='http://www.huevaluechroma.com/072.php'\u003ea chapter in Dimensions of Colour\u003c/a\u003e; rendered here in black \u0026 white for a smaller image file 
size.\u003c/figcaption\u003e\n\u003c/figure\u003e\n\n-   [Installation](#installation)\n-   [Usage](#usage)\n    -   [Available commands](#available-commands)\n    -   [Available options](#available-options)\n-   [Recipes](#recipes)\n    -   [Basic bundling](#basic-bundling)\n    -   [The `--css` option](#the---css-option)\n    -   [The `--style` option](#the---style-option)\n    -   [The `--template` option](#the---template-option)\n-   [How it works](#how-it-works)\n-   [Updating](#updating)\n-   [Limitations](#limitations)\n-   [Troubleshooting](#troubleshooting)\n-   [Contributing](#contributing)\n-   [See also](#see-also)\n\n## Installation\n\n`percollate` is a Node.js command-line tool which you can install globally from npm:\n\n```bash\nnpm install -g percollate\n```\n\nPercollate and its dependencies **require Node.js 14.17.0** or later.\n\n#### Community-maintained packages\n\nThere's [a packaged version](https://aur.archlinux.org/packages/nodejs-percollate/) available on [Arch User Repository](https://wiki.archlinux.org/index.php/Arch_User_Repository), which you can install using your local [AUR helper](https://wiki.archlinux.org/index.php/AUR_helpers) (`yay`, `pacaur`, or similar):\n\n```\nyay -S nodejs-percollate\n```\n\nSome Docker images are available in this [tracking issue](https://github.com/danburzo/percollate/issues/95).\n\n## Usage\n\n\u003e Run `percollate --help` for a list of available commands and options.\n\nPercollate is invoked on one or more operands (usually URLs):\n\n```bash\npercollate \u003ccommand\u003e [options] url [url]...\n```\n\nThe following commands are available:\n\n-   `percollate pdf` produces a PDF file;\n-   `percollate epub` produces an EPUB file;\n-   `percollate html` produces an HTML file;\n-   `percollate md` produces a Markdown file.\n\nThe operands can be URLs, paths to local files, or the `-` character which stands for `stdin` (the standard input).\n\n### Available options\n\nUnless otherwise stated, these 
options apply to all four commands.\n\n#### `-o, --output`\n\nSpecify the path of the resulting bundle relative to the current folder.\n\n```bash\npercollate pdf https://example.com -o my-example.pdf\n```\n\n#### `-u, --url`\n\nUsing the `-` operand you can read the HTML content from `stdin`, as fetched by a separate command, such as `curl`. In this sort of setup, `percollate` does not know the URL from which the content has been fetched, and relative paths on images, anchors, et cetera won't resolve correctly.\n\nUse the `--url` option to supply the source's original URL.\n\n```bash\ncurl https://example.com | percollate pdf - --url=https://example.com\n```\n\n#### `-w, --wait`\n\nBy default, percollate processes URLs in parallel. Use the `--wait` option to process them sequentially instead, with a pause between items. The delay is specified in seconds, and can be zero.\n\n```bash\npercollate epub --wait=1 url1 url2 url3\n```\n\n#### `--individual`\n\nBy default, percollate bundles all web pages in a single file. Use the `--individual` flag to export each source to a separate file.\n\n```bash\npercollate pdf --individual http://example.com/page1 http://example.com/page2\n```\n\n#### `--template`\n\nPath to a custom HTML template. Applies to `pdf`, `html`, and `md`.\n\n#### `--style`\n\nPath to a custom CSS stylesheet, relative to the current folder.\n\n#### `--css`\n\nAdditional CSS styles you can pass from the command-line to override styles specified by the default/custom stylesheet.\n\n#### `--no-amp`\n\nDon't prefer the AMP version of the web page.\n\n#### `--debug`\n\nPrint more detailed information.\n\n#### `-t, --title`\n\nProvide a title for the bundle.\n\n```bash\npercollate epub http://example.com/page-1 http://example.com/page-2 --title=\"Best Of Example\"\n```\n\n#### `-a, --author`\n\nProvide an author for the bundle.\n\n```bash\npercollate pdf --author=\"Ella Example\" http://example.com\n```\n\n#### `--cover`\n\nGenerate a cover. 
The option is implicitly enabled when the `--title` option is provided, or when bundling more than one web page to a single file. Disable this implicit behavior by passing the `--no-cover` flag.\n\n#### `--toc`\n\nGenerate a hyperlinked table of contents. The option is implicitly enabled when bundling more than one web page to a single file. Disable this implicit behavior by passing the `--no-toc` flag.\n\nApplies to `pdf`, `html`, and `md`.\n\n#### `--toc-level=\u003clevel\u003e`\n\nBy default, the table of contents is a flat list of article titles. With the `--toc-level` option the table of contents will include headings under each article title (`\u003ch2\u003e`, `\u003ch3\u003e`, etc.), up to the specified heading depth. A number between `1` and `6` is expected.\n\nUsing `--toc-level` with a value greater than `1` implies `--toc`.\n\n#### `--hyphenate`\n\nHyphenation is enabled by default for `pdf`, and disabled for `epub`, `html`, and `md`. You can opt into hyphenation with the `--hyphenate` flag, or disable it with the `--no-hyphenate` flag.\n\nSee also the [Hyphenation and justification](#hyphenation-and-justification) recipe.\n\n#### `--inline`\n\nEmbed images inline with the document. Images are fetched and converted to Base64-encoded `data` URLs.\n\nThis option is particularly useful for `html` to produce self-contained HTML files.\n\n#### `--md.\u003coption\u003e=\u003cvalue\u003e`\n\nPass options to the underlying Markdown stringifier, [`mdast-util-to-markdown`](https://github.com/syntax-tree/mdast-util-to-markdown#options). 
These are the default Markdown options:\n\n```js\nconst DEFAULT_MARKDOWN_OPTIONS = {\n\tfences: true,\n\temphasis: '_',\n\tstrong: '_',\n\tresourceLink: true,\n\trule: '-'\n};\n```\n\n#### `--unsafe`\n\nDisables some [JSDOM validations](https://github.com/jsdom/jsdom/blob/main/lib/jsdom/living/helpers/validate-names.js) that may throw an error when parsing invalid HTML pages (See [#177](https://github.com/danburzo/percollate/issues/177)).\n\n## Recipes\n\n### Basic bundling\n\nTo turn a single web page into a PDF:\n\n```bash\npercollate pdf --output=some.pdf https://example.com\n```\n\nTo bundle _several_ web pages into a single PDF, specify them as separate arguments to the command:\n\n```bash\npercollate pdf --output=some.pdf https://example.com/page1 https://example.com/page2\n```\n\nYou can use common Unix commands and keep the list of URLs in a newline-delimited text file:\n\n```bash\ncat urls.txt | xargs percollate pdf --output=some.pdf\n```\n\nTo transform several web pages into individual PDF files at once, use the `--individual` flag:\n\n```bash\npercollate pdf --individual https://example.com/page1 https://example.com/page2\n```\n\nIf you'd like to fetch the HTML with an external command, you can use `-` as an operand, which stands for `stdin` (the standard input):\n\n```bash\ncurl https://example.com/page1 | percollate pdf --url=https://example.com/page1 -\n```\n\nNotice we're using the `url` option to tell percollate the source of our (now-anonymous) HTML it gets on stdin, so that relative URLs on links and images resolve correctly.\n\n### The `--css` option\n\nThe `--css` option lets you pass a small snippet of CSS to percollate. Here are some common use-cases:\n\n#### Custom page size / margins\n\nThe default page size is A5 (portrait). 
You can use the `--css` option to override it using [any supported CSS `size`](https://www.w3.org/TR/css3-page/#page-size):\n\n```bash\npercollate pdf --css \"@page { size: A3 landscape }\" http://example.com\n```\n\nSimilarly, you can define:\n\n-   custom margins, e.g. `@page { margin: 0 }`\n-   the base font size: `html { font-size: 10pt }`\n\n#### Changing the font stacks\n\nThe default stylesheet includes CSS variables for the fonts used in the PDF:\n\n```css\n:root {\n\t--main-font: Palatino, 'Palatino Linotype', 'Times New Roman',\n\t\t'Droid Serif', Times, 'Source Serif Pro', serif, 'Apple Color Emoji',\n\t\t'Segoe UI Emoji', 'Segoe UI Symbol';\n\t--alt-font: 'helvetica neue', ubuntu, roboto, noto, 'segoe ui', arial,\n\t\tsans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol';\n\t--code-font: Menlo, Consolas, monospace;\n}\n```\n\n| CSS variable  | What it does                          |\n| ------------- | ------------------------------------- |\n| `--main-font` | The font stack used for body text     |\n| `--alt-font`  | Used in headings, captions, et cetera |\n| `--code-font` | Used for code snippets                |\n\nTo override them, use the `--css` option:\n\n```bash\npercollate pdf --css \":root { --main-font: 'PT Serif';  --alt-font: Roboto; }\" http://example.com\n```\n\n\u003e 💡 To work correctly, you must have the fonts installed on your machine. Custom web fonts currently require you to use a custom CSS stylesheet / HTML template.\n\n#### Remove the appended `href`s from hyperlinks\n\nThe idea with percollate is to make PDFs that can be printed without losing where the hyperlinks point to. However, for some link-heavy pages, the appended `href`s can become bothersome. 
You can remove them using:\n\n```bash\npercollate pdf --css \"a:after { display: none }\" http://example.com\n```\n\n#### Hyphenation and justification\n\nHyphenation is only enabled by default for PDFs, but you can opt in or out of it for any output format with [a flag](#--hyphenate).\n\nWhen hyphenation is enabled, paragraphs will be justified:\n\n```css\n.article__content p {\n\ttext-align: justify;\n}\n```\n\nIf you prefer left-aligned text:\n\n```bash\npercollate pdf --css \".article__content p { text-align: left }\" http://example.com\n```\n\n### The `--style` option\n\nThe `--style` option lets you use your own CSS stylesheet instead of the default one. Here are some common use-cases for this option:\n\n\u003e ⚠️ TODO add examples here\n\n### The `--template` option\n\nThe `--template` option lets you use a custom HTML template for the PDF.\n\n\u003e 💡 The HTML template is parsed with [nunjucks](https://mozilla.github.io/nunjucks/), which is a close JavaScript relative of Twig for PHP, Jinja2 for Python and Liquid for Ruby.\n\nHere are some common use-cases:\n\n#### Customizing the page header / footer\n\nPuppeteer can print some basic information about the page in the PDF. 
The following CSS class names are available for the header / footer, into which the appropriate content will be injected:\n\n-   `date` — The formatted print date\n-   `title` — The document title\n-   `url` — document location (**Note:** this will print the path of the _temporary html_, not the original web page URL)\n-   `pageNumber` — the current page number\n-   `totalPages` — total pages in the document\n\n\u003e 👉 See the [Chromium source code](https://cs.chromium.org/chromium/src/components/printing/resources/print_header_footer_template_page.html) for details.\n\nYou place your header / footer template in a `template` element in your HTML:\n\n```html\n\u003ctemplate class=\"header-template\"\u003e My header \u003c/template\u003e\n\n\u003ctemplate class=\"footer-template\"\u003e\n\t\u003cdiv class=\"text center\"\u003e\n\t\t\u003cspan class=\"pageNumber\"\u003e\u003c/span\u003e\n\t\u003c/div\u003e\n\u003c/template\u003e\n```\n\nSee the [default HTML](./templates/default.html) for example usage.\n\nYou can add CSS styles to the header / footer with either the `--css` option or a separate CSS stylesheet (the `--style` option).\n\n\u003e 💡 The header / footer template [do not inherit their styles](https://github.com/puppeteer/puppeteer/issues/1853) from the rest of the page (i.e. they are not part of the cascade), so you'll have to write the full CSS you want to apply to them.\n\nAn example from the [default stylesheet](./templates/default.css):\n\n```css\n.footer-template {\n\tfont-size: 10pt;\n\tfont-weight: bold;\n}\n```\n\n## Updating\n\nTo keep the tool up-to-date, you can run:\n\n```bash\nnpm install -g percollate\n```\n\nOccasionally, an upgrade might not go according to plan; in this case, you can uninstall and re-install `percollate`:\n\n```bash\nnpm uninstall -g percollate \u0026\u0026 npm install -g percollate\n```\n\n## How it works\n\nAll export formats follow a common pipeline:\n\n1. 
Fetch the page(s) using [`node-fetch`](https://github.com/node-fetch/node-fetch)\n2. If an AMP version of the page exists, use that instead (disable with `--no-amp` flag)\n3. [Enhance](./src/enhancements.js) the DOM using [`jsdom`](https://github.com/jsdom/jsdom)\n4. Pass the DOM through [`mozilla/readability`](https://github.com/mozilla/readability) to strip unnecessary elements\n5. Apply the [HTML template](./templates/default.html) and the [stylesheet](./templates/default.css) to the resulting HTML\n\nDifferent formats then use different tools to produce the final file.\n\nPDFs are rendered with [`puppeteer`](https://github.com/puppeteer/puppeteer).\n\nEPUBs have external images fetched and bundled together with the HTML of each article. When the `--inline` option is used, images are instead converted to `data` URLs and embedded into the HTML.\n\nHTMLs are saved without any further changes. When the `--inline` option is used, images are converted to `data` URLs and embedded into the HTML. External images are not otherwise fetched.\n\nMarkdown files are produced the same way as HTMLs, then processed with a series of utilities from the [unified.js](https://unifiedjs.com/) umbrella.\n\n## Limitations\n\nPercollate inherits the limitations of two of its main components, Readability and Puppeteer (headless Chrome).\n\nThe imperative approach Readability takes will not be perfect in each case, especially on HTML pages with atypical markup; you may occasionally notice that it either leaves in superfluous content, or that it strips out parts of the content. You can confirm the problem against [Firefox's Reader View](https://blog.mozilla.org/firefox/reader-view/). In this case, consider [filing an issue on `mozilla/readability`](https://github.com/mozilla/readability/issues).\n\nUsing a browser to generate the PDF is a double-edged sword. On the one hand, you get excellent support for web platform features. 
On the other hand, [print CSS](https://www.smashingmagazine.com/2018/05/print-stylesheets-in-2018/) as defined by W3C specifications is only partially implemented, and it seems unlikely that support will be improved any time soon. However, even with modest print support, I think Chrome is the best (free) tool for the job.\n\n## Troubleshooting\n\nOn some Linux machines you'll need to [install a few more Chrome dependencies](https://github.com/puppeteer/puppeteer/blob/master/docs/troubleshooting.md#chrome-headless-doesnt-launch) before `percollate` works correctly. (_Thanks to @ptica for [sorting it out](https://github.com/danburzo/percollate/issues/19#issuecomment-428496041)_)\n\nThe `percollate pdf` command supports the `--no-sandbox` Puppeteer flag, but make sure you're [aware of the implications](https://github.com/puppeteer/puppeteer/blob/master/docs/troubleshooting.md#chrome-headless-fails-due-to-sandbox-issues) before disabling the sandbox.\n\n## Using Firefox to render PDFs\n\n\u003e This feature is experimental. Please log an issue if you notice anything wrong.\n\nStarting with `percollate` 3.x, it's possible to use Firefox Nightly as an alternative browser with which to render PDFs. To make Firefox available to Percollate, use the following install command:\n\n```bash\nPUPPETEER_PRODUCT=firefox npm install percollate\n```\n\nAfter installation, `percollate pdf` commands can be run with the `--browser=firefox` option.\n\n### Limitations of Firefox PDF rendering\n\nAt the moment, rendering PDFs with Firefox has the following limitations:\n\n-   The pages can't have headers and footers, so there are no page numbers.\n\n## Contributing\n\nContributions of all kinds are welcome! 
See [CONTRIBUTING.md](./CONTRIBUTING.md) for details.\n\n## See also\n\nHere are some other projects to check out if you're interested in building books using the browser:\n\n-   [weasyprint](https://github.com/Kozea/WeasyPrint) ([website](https://weasyprint.org/))\n-   [bindery.js](https://github.com/evnbr/bindery) ([website](https://evanbrooks.info/bindery/))\n-   [HummusJS](https://github.com/galkahana/HummusJS)\n-   [Editoria](https://gitlab.coko.foundation/editoria/editoria) ([website](https://editoria.pub/))\n-   [pagedjs](https://gitlab.pagedmedia.org/tools/pagedjs) ([article](https://www.pagedmedia.org/pagedjs-sneak-peeks/))\n-   [Mercury](https://mercury.postlight.com/)\n-   [Foliojs](https://github.com/foliojs)\n-   [Magicbook](https://github.com/magicbookproject/magicbook)\n-   [monolith](https://github.com/Y2Z/monolith)\n-   [SaraVieira/starter-book](https://github.com/SaraVieira/starter-book)\n-   [SingleFileZ](https://github.com/gildas-lormeau/SingleFileZ)\n","funding_links":[],"categories":["Opensource projects","JavaScript","Repository","Tools","cli","前端常用","工具集"],"sub_categories":["Office","Node","小工具集"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanburzo%2Fpercollate","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdanburzo%2Fpercollate","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanburzo%2Fpercollate/lists"}