https://github.com/ourongxing/web-printer

A printer that can print multiple web pages as one pretty PDF

Web Printer



A printer that can print multiple web pages as one pretty PDF



with outlines, without distractions


and learn in depth





> **Warning**
>
> Respect the copyright please! Do not share non-public content on the Internet, especially paid content!

## Features

[Playwright](https://github.com/microsoft/playwright) is used to print PDFs, similar to printing in Chrome, but with the added ability to print multiple web pages into one seamless PDF automatically.

- Fully customizable as it is a Node.js library.
- Universal compatibility with any website through plugins.
- Unique feature to replace internal website links with internal PDF links, supporting hash positioning.
- Automatically generates PDF outlines, with support for different levels and collapsed statuses.
- Easy to remove distracting elements, leaving only pure knowledge.

## Installation

> **Warning**
>
> Web Printer is a Node.js library, not an application. If you're new to Node.js/TypeScript/JavaScript, Web Printer might be challenging to use. An app is currently being developed for general use. Please follow [@pbkapp](https://github.com/pbkapp) for updates.

If you're not a beginner, feel free to proceed as you would with any npm package installation.

```bash
pnpm i playwright @web-printer/core
# Web Printer uses Chrome by default. Other supported browsers are listed in PrinterOption.channel.
# If Chrome is already installed, you can skip this step.
pnpm exec playwright install chrome
# Install the plugins you need
pnpm i @web-printer/vitepress
```

Then create a `.ts` file with the following content:

```ts
import { Printer } from "@web-printer/core"
// Import the plugins you have installed
import vitepress from "@web-printer/vitepress"

// Opens a browser so you can log in, if the site requires it.
// new Printer().login(url)

new Printer()
  .use(
    vitepress({
      url: {
        Guide: "https://vuejs.org/guide/introduction.html",
        API: "https://vuejs.org/api/application.html"
      }
    })
  )
  .print("Vue 3.2 Documentation")
```

Run it with [tsx](https://github.com/esbuild-kit/tsx). Running it in other ways may throw errors, which have not been fixed yet.

---

If you are a novice, the following steps may be easier.

First, install [pnpm (with Node.js)](https://pnpm.io/installation) and [VS Code (with TypeScript support)](https://code.visualstudio.com/).

```bash
pnpm create printer@latest

# Or do everything in one step. https://github.com/ourongxing/web-printer/tree/main/packages/create-printer
pnpm create printer@latest web-printer -p vitepress -c chrome
```

Then follow the prompts. After customizing, run `pnpm print` to print. A pretty PDF will appear in `./output`.

## Options

[@web-printer/core](https://github.com/ourongxing/web-printer/tree/main/packages/core) provides a `Printer` object, some types, and some utilities.

```ts
import { Printer } from "@web-printer/core"
import type { Plugin, PrinterOption, PrinterPrintOption } from "@web-printer/core"

// Opens a browser so you can log in, if the site requires it.
// new Printer().login(url)

new Printer({} as PrinterOption)
  .use({} as Plugin)
  .print("PDF name", {} as PrinterPrintOption)
```

`PrinterOption` extends Playwright `browserType.launchPersistentContext` [options](https://playwright.dev/docs/api/class-browsertype#browser-type-launch).

```ts
{
  /**
   * Chromium distribution channel. Choose one that you have installed.
   * @default "chrome"
   */
  channel?: "chromium" | "chrome" | "chrome-beta" | "chrome-dev" | "chrome-canary" | "msedge" | "msedge-beta" | "msedge-dev" | "msedge-canary"
  /**
   * Directory for Chrome user data. Using your system's Chrome user data directory is not recommended.
   * @default "./userData"
   */
  userDataDir?: string
  /**
   * Directory for output PDFs.
   * @default "./output"
   */
  outputDir?: string
  /**
   * Number of threads used for printing. More threads speed up printing.
   * @default 1
   */
  threads?: number
}
```

`PrinterPrintOption` extends Playwright `page.pdf()` [options](https://playwright.dev/docs/api/class-page#page-pdf).

```ts
{
  /**
   * Used for the outline. If given, Printer fetches titles and sets them as part of the outline.
   * @default 0, which means subtitles are not added to the outline.
   */
  subTitleOutline?: number
  /**
   * Make a test print: only two pages are printed and "test: " is added to the name.
   * @default false
   */
  test?: boolean
  /**
   * Filter the pages you want.
   */
  filter?: PageFilter
  /**
   * Reverse the printing order.
   * If the outline has different levels, the outline may become confused.
   */
  reverse?: boolean
  /**
   * Path to a local cover PDF.
   * You may be able to use it to merge an existing PDF, but outlines cannot be merged.
   */
  coverPath?: string
  /**
   * Inject additional CSS.
   */
  style?: string | (false | undefined | string)[]
  /**
   * Set the top and bottom margins of all pages, except the first page of each article, to zero.
   * @default false
   */
  continuous?: boolean
  /**
   * Replace website links with PDF links.
   * @default false
   */
  replaceLink?: boolean
  /**
   * Add page numbers to the bottom center of the page.
   * @default false
   * @requires PrinterPrintOption.continuous = false
   */
  addPageNumber?: boolean
  /**
   * Margins of each page.
   * @default
   * {
   *   top: 60,
   *   right: 55,
   *   bottom: 60,
   *   left: 55,
   * }
   */
  margin?: {
    /**
     * @default 60
     */
    top?: string | number
    /**
     * @default 55
     */
    right?: string | number
    /**
     * @default 60
     */
    bottom?: string | number
    /**
     * @default 55
     */
    left?: string | number
  }
  /**
   * Paper format. If set, takes priority over the `width` and `height` options.
   * @default "A4"
   */
  format?: "A0" | "A1" | "A2" | "A3" | "A4" | "A5" | "Legal" | "Letter" | "Tabloid"
}
```
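
The `style` option accepts either a string or an array that may contain `false` and `undefined`, which makes conditional styles convenient to write. The sketch below shows one way such a value could be collapsed into a single CSS string. `StyleOption`, `normalizeStyle`, and `buildStyle` are hypothetical names used only for illustration; they are not part of `@web-printer/core`, and the library may handle this differently.

```typescript
// Same shape as the `style` option documented above.
type StyleOption = string | (false | undefined | string)[]

// Hypothetical helper (not a Web Printer API): collapse a StyleOption
// into one CSS string, skipping false, undefined, and blank entries.
function normalizeStyle(style?: StyleOption): string {
  if (style === undefined) return ""
  const parts = Array.isArray(style) ? style : [style]
  return parts
    .filter((part): part is string => typeof part === "string" && part.trim() !== "")
    .join("\n")
}

// Conditional entries evaluate to `false` when the flag is off,
// which is why the array form allows `false` as an element.
function buildStyle(darkCodeBlocks: boolean): StyleOption {
  return [
    "body { font-family: serif; }",
    darkCodeBlocks && "pre { background: #222; color: #eee; }"
  ]
}

console.log(normalizeStyle(buildStyle(true)))
```

An array built this way could be passed as `style` in the second argument of `.print()`.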

## Plugins

Plugins in Web Printer are only used to adapt it to different websites.

A plugin has five methods:

- `fetchPagesInfo`: Fetches a list of page URLs and titles, and must return that list.
- `injectStyle`: Used to remove distracting elements and make web pages more PDF-friendly.
- `onPageLoaded`: Runs after the page has loaded.
- `onPageWillPrint`: Runs right before the page is printed.
- `otherParams`: Used to provide other useful parameters.

### Official plugins

- Content Site
  - [@web-printer/javascript-info](https://github.com/ourongxing/web-printer/tree/main/packages/javascript-info)
  - [@web-printer/juejin](https://github.com/ourongxing/web-printer/tree/main/packages/juejin)
  - [@web-printer/xiaobot](https://github.com/ourongxing/web-printer/tree/main/packages/xiaobot)
  - [@web-printer/zhihu](https://github.com/ourongxing/web-printer/tree/main/packages/zhihu)
  - [@web-printer/zhubai](https://github.com/ourongxing/web-printer/tree/main/packages/zhubai)
  - [@web-printer/wikipedia](https://github.com/ourongxing/web-printer/tree/main/packages/wikipedia)
- Amazing Blog
  - [@web-printer/ruanyifeng](https://github.com/ourongxing/web-printer/tree/main/packages/ruanyifeng)
- Documentation Site Generator
  - [@web-printer/vitepress](https://github.com/ourongxing/web-printer/tree/main/packages/vitepress)
  - [@web-printer/mdbook](https://github.com/ourongxing/web-printer/tree/main/packages/mdbook)

### How to write a plugin

In fact, a plugin just uses [Playwright](https://playwright.dev/docs/library) to inject JS and CSS into the page. You can read the code of the official plugins to learn how to write one. It's pretty simple most of the time.

*Let's make some rules*

- Use a function to return a plugin.
- The function parameter is an options object.
- If there are many pages to fetch and fetching is slow, provide a `maxPages` option. This matters most for sites with endless loading.
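
The rules above can be sketched as a plugin factory. This is a minimal, self-contained illustration rather than a real plugin: `PageInfoSketch` and `PluginSketch` are local stand-ins for the types exported by `@web-printer/core`, `fetchBatch` is a hypothetical fake that imitates an endlessly loading feed, and the default of 50 pages is an arbitrary assumption.

```typescript
// Local stand-ins so the sketch compiles on its own. A real plugin would
// use the types from @web-printer/core and a Playwright BrowserContext.
interface PageInfoSketch {
  url: string
  title: string
}
interface PluginSketch {
  fetchPagesInfo(params: { context: unknown }): Promise<PageInfoSketch[]>
  injectStyle(): { contentSelector?: string; style?: string }
}

// Rule 2: the factory takes a single options object.
interface BlogOptions {
  url: string
  // Rule 3: cap the number of pages for slow or endless sources.
  maxPages?: number
}

// Hypothetical fake source: every call returns 10 more posts, forever.
async function fetchBatch(baseUrl: string, batch: number): Promise<PageInfoSketch[]> {
  return Array.from({ length: 10 }, (_, i) => {
    const n = batch * 10 + i + 1
    return { url: `${baseUrl}/post/${n}`, title: `Post ${n}` }
  })
}

// Rule 1: a function returns the plugin.
function blog(options: BlogOptions): PluginSketch {
  const { url, maxPages = 50 } = options // 50 is an assumed default
  return {
    async fetchPagesInfo() {
      const pages: PageInfoSketch[] = []
      // Without the maxPages bound this loop would never end.
      for (let batch = 0; pages.length < maxPages; batch++) {
        pages.push(...(await fetchBatch(url, batch)))
      }
      return pages.slice(0, maxPages)
    },
    injectStyle() {
      return { contentSelector: "article" }
    }
  }
}
```

With real types in place, such a factory would be passed to `new Printer().use(...)` in the same way as `vitepress(...)` in the installation example.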

#### fetchPagesInfo

Fetches a list of page URLs and titles, and must return that list. This usually means parsing the sidebar outline. Web Printer can then restore the hierarchy and collapsed state of the original outline.

```typescript
type fetchPagesInfo = (params: { context: BrowserContext }) => MaybePromise<PageInfoWithoutIndex[]>
interface PageInfoWithoutIndex {
  url: string
  title: string
  /**
   * Outer ... Inner
   */
  groups?: (
    | {
        name: string
        collapsed?: boolean
      }
    | string
  )[]
  /**
   * Set when this item is a group but also has its own link and content.
   */
  selfGroup?: boolean
  collapsed?: boolean
}
```

The returned page info should look like this:

```ts
// https://javascript.info/
[
  {
    title: "Manuals and specifications",
    url: "https://javascript.info/manuals-specifications",
    groups: [
      {
        name: "The JavaScript language"
      },
      {
        name: "An introduction"
      }
    ]
  },
  ...
]
```
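
To illustrate how the `groups` array (outer to inner) relates to a nested sidebar, here is a small sketch that flattens a tree into that shape. `SidebarNode` and `flattenSidebar` are hypothetical and assume the sidebar has already been parsed into plain objects; real plugins read this structure from the page with Playwright. Sections that also carry their own link (`selfGroup`) are ignored to keep the sketch short.

```typescript
interface Group {
  name: string
  collapsed?: boolean
}
// Reduced copy of the interface documented above.
interface PageInfoWithoutIndex {
  url: string
  title: string
  groups?: (Group | string)[]
}

// Hypothetical parsed sidebar: a node is either a section with children
// or a leaf link with a url.
interface SidebarNode {
  title: string
  url?: string
  collapsed?: boolean
  children?: SidebarNode[]
}

// Walk the tree depth-first, carrying the chain of enclosing sections
// so that each leaf gets its groups ordered from outer to inner.
function flattenSidebar(nodes: SidebarNode[], parents: Group[] = []): PageInfoWithoutIndex[] {
  const pages: PageInfoWithoutIndex[] = []
  for (const node of nodes) {
    if (node.children && node.children.length > 0) {
      const group: Group = { name: node.title }
      if (node.collapsed !== undefined) group.collapsed = node.collapsed
      pages.push(...flattenSidebar(node.children, [...parents, group]))
    } else if (node.url) {
      pages.push({ url: node.url, title: node.title, groups: [...parents] })
    }
  }
  return pages
}

const sidebar: SidebarNode[] = [
  {
    title: "The JavaScript language",
    children: [
      {
        title: "An introduction",
        children: [
          {
            title: "Manuals and specifications",
            url: "https://javascript.info/manuals-specifications"
          }
        ]
      }
    ]
  }
]

const pages = flattenSidebar(sidebar)
console.log(JSON.stringify(pages, null, 2))
```

For this input the single resulting entry has the same title, URL, and two groups as the javascript.info example above.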

*Examples*

- simple outline: [javascript-info/src/index.ts](https://github.com/ourongxing/web-printer/blob/main/packages/javascript-info/src/index.ts#L18-L52)
- complex outline: [mdbook/src/index.ts](https://github.com/ourongxing/web-printer/blob/main/packages/mdbook/src/index.ts#L17-L93)
- scroll loading: [juejin/src/index.ts](https://github.com/ourongxing/web-printer/blob/main/packages/juejin/src/index.ts#L31-L54)
- pagination: [zhihu/src/index.ts](https://github.com/ourongxing/web-printer/blob/main/packages/zhihu/src/index.ts#L183-L245)

#### injectStyle

Used to remove distracting elements and make web pages more PDF-friendly.

```ts
type injectStyle = (params: { url: string; printOption: PrinterPrintOption }) => MaybePromise<{
  style?: string
  contentSelector?: string
  titleSelector?: string
  avoidBreakSelector?: string
}>
```

*Let's make some rules*:

- Hide all elements but content.
- Set the margin of the content element and its ancestor elements to zero.

That way, everyone can set the same margin for any website.

Don't worry, it's easy. You only need to provide a `contentSelector`, which supports a [selector list](https://developer.mozilla.org/en-US/docs/Web/CSS/Selector_list). Web Printer then hides every element except the content and automatically sets the margin of the content and its ancestor elements to zero.

This does not work for every website. Sometimes you still need to write the CSS yourself and return it in the `style` property.

When you set `PrinterPrintOption.continuous` to `true`, Web Printer sets the top and bottom margins of all pages, except the first page of each article, to zero.

The `titleSelector` marks the title element so that the top margin is applied to it only. It defaults to `contentSelector` if `contentSelector` is not empty. If `contentSelector` contains `,`, Printer uses the first selector. If both `titleSelector` and `contentSelector` are empty, the default is `body`, but setting a top margin on the body may sometimes result in extra white space.
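
The fallback order described above can be sketched as a small function. `resolveTitleSelector` is a hypothetical name, not a Web Printer API, and the naive split on `,` is an assumption that would mis-handle selectors containing commas inside functions such as `:is(a, b)`.

```typescript
// Hypothetical helper illustrating the documented defaults:
// 1. an explicit titleSelector wins,
// 2. otherwise the first selector of contentSelector,
// 3. otherwise "body".
function resolveTitleSelector(titleSelector?: string, contentSelector?: string): string {
  const explicit = titleSelector?.trim()
  if (explicit) return explicit
  // Naive split: assumes no commas nested inside the selector itself.
  const first = contentSelector?.split(",")[0]?.trim()
  return first || "body"
}

console.log(resolveTitleSelector(undefined, "article, main"))
console.log(resolveTitleSelector("h1", "article"))
console.log(resolveTitleSelector())
```

The three calls exercise the selector-list, explicit, and empty cases in turn.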

The `avoidBreakSelector` is used to avoid page breaks inside certain elements. The default value is `pre,blockquote,tbody tr`.

#### onPageLoaded

Runs after the page has loaded. Usually used to wait for images to load, especially lazy-loaded images.

```ts
type onPageLoaded = (params: { page: Page; pageInfo: PageInfo; printOption: PrinterPrintOption }) => MaybePromise<void>
```

Web Printer provides two methods to handle image loading:

- ```ts
  // imgSelector defaults to "img"
  type evaluateWaitForImgLoad = (page: Page, imgSelector?: string) => Promise<void>
  ```

- ```ts
  // imgSelector defaults to "img", waitingTime defaults to 200
  type evaluateWaitForImgLoadLazy = (page: Page, imgSelector?: string, waitingTime?: number) => Promise<void>
  ```

#### onPageWillPrint

Runs right before the page is printed.

```ts
type onPageWillPrint = (params: { page: Page; pageInfo: PageInfo; printOption: PrinterPrintOption }) => MaybePromise<void>
```

#### otherParams

Used to provide other useful parameters.

```ts
type otherParams = (params: { page: Page; pageInfo: PageInfo; printOption: PrinterPrintOption }) => MaybePromise<{
  hashIDSelector: string
}>
```

Some sites, such as Wikipedia, use a hash ID to jump to a specific element. If you provide `hashIDSelector` and `PrinterPrintOption.replaceLink` is `true`, Printer can replace the URL hash with the corresponding position in the PDF. The default value is `h2[id],h3[id],h4[id],h5[id]`.
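
As a rough illustration of the idea, and not of how Web Printer implements it, the sketch below decides whether a link points at a page that was printed and splits off the hash so it could later be matched against elements found by `hashIDSelector`. `PrintedPage`, `firstPdfPage`, and `resolveInternalLink` are hypothetical names.

```typescript
// Hypothetical record of a printed page and where it starts in the PDF.
interface PrintedPage {
  url: string
  firstPdfPage: number
}

interface LinkTarget {
  pdfPage: number
  // Element id taken from the URL hash, if any.
  hash?: string
}

// Return the PDF target for a link, or undefined if the link points
// outside the set of printed pages and should stay a web link.
function resolveInternalLink(href: string, printed: PrintedPage[]): LinkTarget | undefined {
  const target = new URL(href)
  const hash = target.hash ? decodeURIComponent(target.hash.slice(1)) : undefined
  target.hash = "" // compare pages without their fragment
  const match = printed.find(page => {
    const pageUrl = new URL(page.url)
    pageUrl.hash = ""
    return pageUrl.href === target.href
  })
  if (!match) return undefined
  return hash ? { pdfPage: match.firstPdfPage, hash } : { pdfPage: match.firstPdfPage }
}

const printed: PrintedPage[] = [{ url: "https://en.wikipedia.org/wiki/PDF", firstPdfPage: 3 }]
console.log(resolveInternalLink("https://en.wikipedia.org/wiki/PDF#History", printed))
console.log(resolveInternalLink("https://example.com/", printed))
```

A link with a hash resolves to the page plus an element id, while a link to a page that was not printed is left alone.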

## Shrink PDF

PDFs generated by Web Printer can be large, so you may need to shrink them yourself.

## Acknowledgements

- [microsoft/playwright](https://github.com/microsoft/playwright)
- [Hopding/pdf-lib](https://github.com/Hopding/pdf-lib)
- [lillallol/outline-pdf](https://github.com/lillallol/outline-pdf)

## License

MIT ©