{"id":13990771,"url":"https://github.com/unjs/unpdf","last_synced_at":"2026-04-28T06:04:31.170Z","repository":{"id":187723283,"uuid":"677539733","full_name":"unjs/unpdf","owner":"unjs","description":"📄 Utilities to work with PDFs in Node.js, browser and workers","archived":false,"fork":false,"pushed_at":"2024-05-17T06:03:26.000Z","size":534,"stargazers_count":317,"open_issues_count":4,"forks_count":7,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-05-17T07:25:11.342Z","etag":null,"topics":["pdf","pdfjs","serverless"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/unjs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-11T21:01:06.000Z","updated_at":"2024-05-17T07:25:27.404Z","dependencies_parsed_at":"2023-11-29T11:31:16.026Z","dependency_job_id":"d3fe5fc4-9d78-4a17-a50d-503dceac1160","html_url":"https://github.com/unjs/unpdf","commit_stats":null,"previous_names":["johannschopplich/unpdf","unjs/unpdf"],"tags_count":21,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/unjs%2Funpdf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/unjs%2Funpdf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/unjs%2Funpdf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/unjs%2Funpdf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/unjs","download_url":"https://codeload.github.co
m/unjs/unpdf/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227098990,"owners_count":17730690,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pdf","pdfjs","serverless"],"created_at":"2024-08-09T13:03:12.575Z","updated_at":"2026-04-28T06:04:31.165Z","avatar_url":"https://github.com/unjs.png","language":"TypeScript","funding_links":[],"categories":["Creation and production","TypeScript"],"sub_categories":[],"readme":"# unpdf\n\nUtilities for PDF extraction and rendering across all JavaScript runtimes – Node.js, Deno, Bun, the browser, and serverless environments like Cloudflare Workers. Especially useful for AI applications that need to summarize or analyze PDF documents.\n\nShips with a serverless build of Mozilla's [PDF.js](https://github.com/mozilla/pdf.js), optimized for edge environments. 
If you're coming from [`pdf-parse`](https://www.npmjs.com/package/pdf-parse), `unpdf` is a modern, actively maintained alternative with broader runtime support.\n\n## Features\n\n- 🏗️ Works in Node.js, browser and serverless environments\n- 🪭 Includes serverless build of PDF.js ([`unpdf/pdfjs`](./package.json#L34))\n- 💬 Extract [text](#extract-text-from-pdf), [links](#extractlinks), and [images](#extractimages) from PDF files\n- 🧠 Perfect for AI applications and PDF summarization\n- 🧱 Opt-in to official or legacy PDF.js build\n\n## Installation\n\n```bash\n# pnpm\npnpm add unpdf\n\n# npm\nnpm install unpdf\n```\n\n## Usage\n\n### Extract Text From PDF\n\n```ts\nimport { extractText, getDocumentProxy } from 'unpdf'\n\n// Fetch a PDF from the web or load it from the file system\nconst buffer = await fetch('https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf')\n  .then(res =\u003e res.arrayBuffer())\n\nconst pdf = await getDocumentProxy(new Uint8Array(buffer))\nconst { totalPages, text } = await extractText(pdf, { mergePages: true })\n\nconsole.log(`Total pages: ${totalPages}`)\nconsole.log(text)\n```\n\n### Official or Legacy PDF.js Build\n\nUsually you don't need to worry about the PDF.js build. `unpdf` ships with a serverless build of the latest PDF.js version. However, if you want to use the official PDF.js version or the legacy build, you can define a custom PDF.js module.\n\n\u003e [!WARNING]\n\u003e PDF.js v5.x uses `Promise.withResolvers`, which may not be supported in all environments, such as Node \u003c 22. 
Consider using the bundled serverless build, which includes a polyfill, or use an older version of PDF.js.\n\nFor example, if you want to use the official PDF.js build:\n\n```ts\nimport { definePDFJSModule, extractText, getDocumentProxy } from 'unpdf'\n\n// Define the PDF.js build before using any other unpdf method\nawait definePDFJSModule(() =\u003e import('pdfjs-dist'))\n\n// Now, you can use all unpdf methods with the official PDF.js build\nconst pdf = await getDocumentProxy(/* … */)\nconst { text } = await extractText(pdf)\n```\n\n### PDF.js API\n\n`unpdf` provides helpful [methods](#api) to work with PDF files, such as `extractText` and `extractImages`, which should cover most use cases. However, if you need more control over the PDF.js API, you can use the `getResolvedPDFJS` method to get the resolved PDF.js module.\n\nAccess the PDF.js API directly by calling `getResolvedPDFJS`:\n\n```ts\nimport { getResolvedPDFJS } from 'unpdf'\n\nconst { version } = await getResolvedPDFJS()\n```\n\n\u003e [!NOTE]\n\u003e If no other PDF.js build was defined, the serverless build will always be used.\n\nFor example, you can use the `getDocument` method to load a PDF file and then use the `getMetadata` method to get the metadata of the PDF file:\n\n```ts\nimport { readFile } from 'node:fs/promises'\nimport { getResolvedPDFJS } from 'unpdf'\n\nconst { getDocument } = await getResolvedPDFJS()\nconst data = await readFile('./dummy.pdf')\nconst document = await getDocument(new Uint8Array(data)).promise\n\nconsole.log(await document.getMetadata())\n```\n\n## How It Works\n\n\u003e [!NOTE]\n\u003e The serverless PDF.js bundle is built from PDF.js v5.6.205.\n\nThe heart and soul of this package is the [`pdfjs.rollup.config.ts`](./pdfjs.rollup.config.ts) file. It uses [Rollup](https://rollupjs.org/) to bundle PDF.js into a single file for serverless environments. 
The key techniques:\n\n- **String replacements** strip browser-specific references from the PDF.js source.\n- **Worker inlining** embeds the PDF.js worker directly into the main bundle, since serverless runtimes can't load separate worker files.\n- **Global polyfills** provide missing APIs like `FinalizationRegistry` (unavailable in Cloudflare Workers).\n\n## API\n\n### `definePDFJSModule`\n\nDefines a custom PDF.js build. This method should be called before using any other method. If no custom build is defined, the serverless build will be used.\n\n**Type Declaration**\n\n```ts\nfunction definePDFJSModule(pdfjs: () =\u003e Promise\u003cPDFJS\u003e): Promise\u003cvoid\u003e\n```\n\n### `getResolvedPDFJS`\n\nReturns the resolved PDF.js module. If no other PDF.js build was defined, the serverless build will be used. This method is useful if you want to use the PDF.js API directly.\n\n**Type Declaration**\n\n```ts\nfunction getResolvedPDFJS(): Promise\u003cPDFJS\u003e\n```\n\n### `getMeta`\n\nExtracts metadata from a PDF. If `parseDates` is set to `true`, the date properties will be parsed into `Date` objects.\n\n**Type Declaration**\n\n```ts\nfunction getMeta(\n  data: DocumentInitParameters['data'] | PDFDocumentProxy,\n  options?: {\n    parseDates?: boolean\n  },\n): Promise\u003c{\n  info: Record\u003cstring, any\u003e\n  metadata: Record\u003cstring, any\u003e\n}\u003e\n```\n\n### `extractText`\n\nExtracts all text from a PDF. If `mergePages` is set to `true`, the text of all pages will be merged into a single string. 
Otherwise, an array of strings for each page will be returned.\n\n**Type Declaration**\n\n```ts\nfunction extractText(\n  data: DocumentInitParameters['data'] | PDFDocumentProxy,\n  options?: {\n    mergePages?: false\n  }\n): Promise\u003c{\n  totalPages: number\n  text: string[]\n}\u003e\nfunction extractText(\n  data: DocumentInitParameters['data'] | PDFDocumentProxy,\n  options: {\n    mergePages: true\n  }\n): Promise\u003c{\n  totalPages: number\n  text: string\n}\u003e\n```\n\n### `extractLinks`\n\nExtracts all links from a PDF document, including hyperlinks and external URLs.\n\n**Type Declaration**\n\n```ts\nfunction extractLinks(\n  data: DocumentInitParameters['data'] | PDFDocumentProxy,\n): Promise\u003c{\n  totalPages: number\n  links: string[]\n}\u003e\n```\n\n**Example**\n\n```ts\nimport { readFile } from 'node:fs/promises'\nimport { extractLinks, getDocumentProxy } from 'unpdf'\n\n// Load a PDF file\nconst buffer = await readFile('./document.pdf')\nconst pdf = await getDocumentProxy(new Uint8Array(buffer))\n\n// Extract all links from the PDF\nconst { totalPages, links } = await extractLinks(pdf)\n\nconsole.log(`Total pages: ${totalPages}`)\nconsole.log(`Found ${links.length} links:`)\nfor (const link of links) console.log(link)\n```\n\n### `extractImages`\n\nExtracts images from a specific page of a PDF document, including necessary metadata such as width, height, and calculated color channels. Works with both the serverless and official PDF.js build.\n\n**Type Declaration**\n\n```ts\ninterface ExtractedImageObject {\n  data: Uint8ClampedArray\n  width: number\n  height: number\n  channels: 1 | 3 | 4\n  key: string\n}\n\nfunction extractImages(\n  data: DocumentInitParameters['data'] | PDFDocumentProxy,\n  pageNumber: number,\n): Promise\u003cExtractedImageObject[]\u003e\n```\n\n**Example**\n\n\u003e [!NOTE]\n\u003e The following example uses the [sharp](https://github.com/lovell/sharp) library to process and save the extracted images. 
You will need to install it with your preferred package manager.\n\n```ts\nimport { readFile, writeFile } from 'node:fs/promises'\nimport sharp from 'sharp'\nimport { extractImages, getDocumentProxy } from 'unpdf'\n\nasync function extractPdfImages() {\n  const buffer = await readFile('./document.pdf')\n  const pdf = await getDocumentProxy(new Uint8Array(buffer))\n\n  // Extract images from page 1\n  const imagesData = await extractImages(pdf, 1)\n  console.log(`Found ${imagesData.length} images on page 1`)\n\n  // Process each image with sharp (optional)\n  let totalImagesProcessed = 0\n  for (const imgData of imagesData) {\n    const imageIndex = ++totalImagesProcessed\n\n    await sharp(imgData.data, {\n      raw: {\n        width: imgData.width,\n        height: imgData.height,\n        channels: imgData.channels\n      }\n    })\n      .png()\n      .toFile(`image-${imageIndex}.png`)\n\n    console.log(`Saved image ${imageIndex} (${imgData.width}x${imgData.height}, ${imgData.channels} channels)`)\n  }\n}\n\nextractPdfImages().catch(console.error)\n```\n\n### `renderPageAsImage`\n\nTo render a PDF page as an image, you can use the `renderPageAsImage` method. This method will return an `ArrayBuffer` of the rendered image. It can also return a data URL (`string`) if `toDataURL` option is set to `true`.\n\n\u003e [!NOTE]\n\u003e This method will only work in Node.js and browser environments.\n\nIn order to use this method, make sure to meet the following requirements:\n\n- Use the official PDF.js build (see [Official or Legacy PDF.js Build](#official-or-legacy-pdfjs-build)).\n- Install the [`@napi-rs/canvas`](https://github.com/Brooooooklyn/canvas) package if you are using Node.js. This package is required to render the PDF page as an image.\n\n\u003e [!TIP]\n\u003e In Node.js, `getDocumentProxy` automatically sets `disableFontFace: true` and resolves `standardFontDataUrl` from your local `pdfjs-dist` package for correct font rendering. 
To customize this behavior, pass your own options:\n\u003e\n\u003e ```ts\n\u003e const pdf = await getDocumentProxy(buffer, {\n\u003e   disableFontFace: false,\n\u003e   standardFontDataUrl: 'https://unpkg.com/pdfjs-dist@latest/standard_fonts/',\n\u003e })\n\u003e ```\n\n**Type Declaration**\n\n```ts\nfunction renderPageAsImage(\n  data: DocumentInitParameters['data'] | PDFDocumentProxy,\n  pageNumber: number,\n  options?: {\n    canvasImport?: () =\u003e Promise\u003ctypeof import('@napi-rs/canvas')\u003e\n    /** @default 1.0 */\n    scale?: number\n    width?: number\n    height?: number\n    toDataURL?: false\n  },\n): Promise\u003cArrayBuffer\u003e\nfunction renderPageAsImage(\n  data: DocumentInitParameters['data'] | PDFDocumentProxy,\n  pageNumber: number,\n  options: {\n    canvasImport?: () =\u003e Promise\u003ctypeof import('@napi-rs/canvas')\u003e\n    /** @default 1.0 */\n    scale?: number\n    width?: number\n    height?: number\n    toDataURL: true\n  },\n): Promise\u003cstring\u003e\n```\n\n**Examples**\n\n```ts\nimport { readFile, writeFile } from 'node:fs/promises'\nimport { definePDFJSModule, renderPageAsImage } from 'unpdf'\n\n// Use the official PDF.js build\nawait definePDFJSModule(() =\u003e import('pdfjs-dist'))\n\nconst pdf = await readFile('./dummy.pdf')\nconst buffer = new Uint8Array(pdf)\nconst pageNumber = 1\n\nconst result = await renderPageAsImage(buffer, pageNumber, {\n  canvasImport: () =\u003e import('@napi-rs/canvas'),\n  scale: 2,\n})\nawait writeFile('dummy-page-1.png', new Uint8Array(result))\n```\n\n```ts\nimport { readFile, writeFile } from 'node:fs/promises'\nimport { definePDFJSModule, renderPageAsImage } from 'unpdf'\n\nawait definePDFJSModule(() =\u003e import('pdfjs-dist'))\n\nconst pdf = await readFile('./dummy.pdf')\nconst buffer = new Uint8Array(pdf)\nconst pageNumber = 1\n\nconst result = await renderPageAsImage(buffer, pageNumber, {\n  canvasImport: () =\u003e import('@napi-rs/canvas'),\n  scale: 2,\n  toDataURL: true,\n})\n\nconst html = `\u003c!DOCTYPE html\u003e\n\u003chtml lang=\"en\"\u003e\n  \u003chead\u003e\n    
\u003cmeta charset=\"UTF-8\"\u003e\n    \u003cmeta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\"\u003e\n    \u003ctitle\u003eDummy Page\u003c/title\u003e\n  \u003c/head\u003e\n  \u003cbody\u003e\n    \u003cimg alt=\"Example Page\" src=\"${result}\"\u003e\n  \u003c/body\u003e\n\u003c/html\u003e`\n\nawait writeFile('dummy-page-1.html', html)\n```\n\n## License\n\n[MIT](./LICENSE) License © 2023-PRESENT [Johann Schopplich](https://github.com/johannschopplich)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funjs%2Funpdf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Funjs%2Funpdf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Funjs%2Funpdf/lists"}