Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/unjs/unpdf
📄 Utilities to work with PDFs in Node.js, browser and workers
https://github.com/unjs/unpdf
pdf pdfjs serverless
Last synced: 3 months ago
JSON representation
📄 Utilities to work with PDFs in Node.js, browser and workers
- Host: GitHub
- URL: https://github.com/unjs/unpdf
- Owner: unjs
- License: mit
- Created: 2023-08-11T21:01:06.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-05-17T06:03:26.000Z (6 months ago)
- Last Synced: 2024-05-17T07:25:11.342Z (6 months ago)
- Topics: pdf, pdfjs, serverless
- Language: TypeScript
- Homepage:
- Size: 521 KB
- Stars: 317
- Watchers: 3
- Forks: 7
- Open Issues: 4
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# unpdf
A collection of utilities to work with PDFs. Designed specifically for Deno, workers and other nodeless environments.
`unpdf` ships with a serverless build/redistribution of Mozilla's [PDF.js](https://github.com/mozilla/pdf.js) for serverless environments. Apart from some string replacements and mocks, [`unenv`](https://github.com/unjs/unenv) does the heavy lifting by converting Node.js specific code to be platform-agnostic. See [`pdfjs.rollup.config.ts`](./pdfjs.rollup.config.ts) for all the details.
This library is also intended as a modern alternative to the unmaintained but still popular [`pdf-parse`](https://www.npmjs.com/package/pdf-parse).
## Features
- 🏗️ Works in Node.js, browser and workers
- 🪭 Includes serverless build of PDF.js ([`unpdf/pdfjs`](./package.json#L34))
- 💬 Extract text and images from PDFs
- 🧱 Opt-in to legacy PDF.js build
- 💨 Zero dependencies## PDF.js Compatibility
The serverless build of PDF.js provided by `unpdf` is based on PDF.js v4.3.136. If you need a different version, you can [use another PDF.js build](#use-official-or-legacy-pdfjs-build).
## Installation
Run the following command to add `unpdf` to your project.
```bash
# pnpm
pnpm add unpdf# npm
npm install unpdf# yarn
yarn add unpdf
```## Usage
### Extract Text From PDF
```ts
import { extractText, getDocumentProxy } from "unpdf";// Fetch a PDF file from the web
const buffer = await fetch(
"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
).then((res) => res.arrayBuffer());// Or load it from the filesystem
const buffer = await readFile("./dummy.pdf");// Load PDF from buffer
const pdf = await getDocumentProxy(new Uint8Array(buffer));
// Extract text from PDF
const { totalPages, text } = await extractText(pdf, { mergePages: true });
```### Access the PDF.js API
This will return the resolved PDF.js module and gives full access to the PDF.js API, like:
- `getDocument`
- `version`
- … and all other methodsEspecially useful for platforms like 🦕 Deno or if you want to use the PDF.js API directly. If no custom build was defined beforehand, the serverless build bundled with `unpdf` will be initialized.
```ts
import { getResolvedPDFJS } from "unpdf";const { getDocument } = await getResolvedPDFJS();
const data = Deno.readFileSync("dummy.pdf");
const doc = await getDocument(data).promise;console.log(await doc.getMetadata());
for (let i = 1; i <= doc.numPages; i++) {
const page = await doc.getPage(i);
const textContent = await page.getTextContent();
const contents = textContent.items.map((item) => item.str).join(" ");
console.log(contents);
}
```### Use Official or Legacy PDF.js Build
Generally speaking, you don't need to worry about the PDF.js build. `unpdf` ships with a serverless build of the latest PDF.js version. However, if you want to use the official PDF.js version or the legacy build, you can define a custom PDF.js module.
> [!WARNING]
> The latest PDF.js v4.3.136 uses `Promise.withResolvers`, which may not be supported in all environments, such as Node < 22. Consider to use the bundled serverless build, which includes a polyfill, or use an older version of PDF.js.```ts
// Before using any other method, define the PDF.js module
// if you need another PDF.js build
import { configureUnPDF } from "unpdf";await configureUnPDF({
// Use the official PDF.js build (make sure to install it first)
pdfjs: () => import("pdfjs-dist"),
});// Now, you can use the other methods
// …
```## Config
```ts
interface UnPDFConfiguration {
/**
* By default, UnPDF will use the latest version of PDF.js compiled for
* serverless environments. If you want to use a different version, you can
* provide a custom resolver function.
*
* @example
* // Use the official PDF.js build (make sure to install it first)
* () => import('pdfjs-dist')
*/
pdfjs?: () => Promise;
}
```## Methods
### `configureUnPDF`
Define a custom PDF.js module, like the legacy build. Make sure to call this method before using any other methods.
```ts
function configureUnPDF(config: UnPDFConfiguration): Promise;
```### `getResolvedPDFJS`
Returns the resolved PDF.js module. If no build is defined, the latest version will be initialized.
```ts
function getResolvedPDFJS(): Promise;
```### `getMeta`
```ts
function getMeta(
data: DocumentInitParameters["data"] | PDFDocumentProxy,
): Promise<{
info: Record;
metadata: Record;
}>;
```### `extractText`
Extracts all text from a PDF. If `mergePages` is set to `true`, the text of all pages will be merged into a single string. Otherwise, an array of strings for each page will be returned.
```ts
function extractText(
data: DocumentInitParameters["data"] | PDFDocumentProxy,
{ mergePages }?: { mergePages?: boolean },
): Promise<{
totalPages: number;
text: string | string[];
}>;
```### `renderPageAsImage`
> [!NOTE]
> This method will only work in Node.js and browser environments.To render a PDF page as an image, you can use the `renderPageAsImage` method. This method will return an `ArrayBuffer` of the rendered image.
In order to use this method, you have to meet the following requirements:
- Use the official PDF.js build
- Install the [`canvas`](https://www.npmjs.com/package/canvas) package in Node.js environments**Example**
```ts
import { configureUnPDF, renderPageAsImage } from "unpdf";await configureUnPDF({
// Use the official PDF.js build
pdfjs: () => import("pdfjs-dist"),
});const pdf = await readFile("./dummy.pdf");
const buffer = new Uint8Array(pdf);
const pageNumber = 1;const result = await renderPageAsImage(buffer, pageNumber, {
canvas: () => import("canvas"),
});
await writeFile("dummy-page-1.png", Buffer.from(result));
```**Type Declaration**
```ts
declare function renderPageAsImage(
data: DocumentInitParameters["data"],
pageNumber: number,
options?: {
canvas?: () => Promise;
/** @default 1 */
scale?: number;
width?: number;
height?: number;
},
): Promise;
```## FAQ
### Why Is `canvas` An Optional Dependency?
The official PDF.js library depends on the `canvas` module for Node.js environments, which [doesn't work inside worker threads](https://github.com/Automattic/node-canvas/issues/1394). That's why `unpdf` ships with a serverless build of PDF.js that mocks the `canvas` module.
However, to render PDF pages as images in Node.js environments, you need to install the `canvas` module. That's why it is a peer dependency.
## License
[MIT](./LICENSE) License © 2023-PRESENT [Johann Schopplich](https://github.com/johannschopplich)