https://github.com/johannschopplich/pdfjs-serverless

🪭 Serverless build of PDF.js for Deno, workers, and other nodeless environments
https://github.com/johannschopplich/pdfjs-serverless

Last synced: 6 months ago
JSON representation

🪭 Serverless build of PDF.js for Deno, workers, and other nodeless environments

Host: GitHub
URL: https://github.com/johannschopplich/pdfjs-serverless
Owner: johannschopplich
License: mit
Created: 2023-08-26T18:46:54.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2025-03-02T19:29:06.000Z (7 months ago)
Last Synced: 2025-03-29T07:02:13.820Z (6 months ago)
Language: TypeScript
Homepage:
Size: 357 KB
Stars: 75
Watchers: 1
Forks: 4
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # pdfjs-serverless

A redistribution of Mozilla's [PDF.js](https://github.com/mozilla/pdf.js) for serverless environments, like Deno Deploy and Cloudflare Workers with zero dependencies. The whole export is about 1.4 MB (minified).

## PDF.js Compatibility

> [!NOTE]

> This package is currently using PDF.js v4.10.38.

## Installation

Run the following command to add `pdfjs-serverless` to your project.

```bash

# pnpm

pnpm add pdfjs-serverless

# npm

npm install pdfjs-serverless

# yarn

yarn add pdfjs-serverless

```

## Usage

Since PDF.js v4, the library migrated to ESM. Which is great. However, it also uses a top-level await, which is not supported by Cloudflare workers yet. Therefore, we have to wrap all named exports in a function that resolves the PDF.js library:

```ts

declare function resolvePDFJS(): Promise

```

So, instead of importing the named exports directly:

```ts

// This will NOT work at the moment

import { getDocument } from 'pdfjs-serverless'

```

We have to use the `resolvePDFJS` function to get the named exports:

```ts

import { resolvePDFJS } from 'pdfjs-serverless'

const { getDocument } = await resolvePDFJS()

```

> [!NOTE]

> Once Cloudflare workers support top-level await, we can remove this wrapper and re-export all PDF.js named exports directly again.

### 🦕 Deno

```ts

import { resolvePDFJS } from 'https://esm.sh/pdfjs-serverless'

// Initialize PDF.js

const { getDocument } = await resolvePDFJS()

const data = Deno.readFileSync('sample.pdf')

const doc = await getDocument({

  data,

  useSystemFonts: true,

}).promise

console.log(await doc.getMetadata())

// Iterate through each page and fetch the text content

for (let i = 1; i <= doc.numPages; i++) {

  const page = await doc.getPage(i)

  const textContent = await page.getTextContent()

  const contents = textContent.items.map(item => item.str).join(' ')

  console.log(contents)

}

```

### 🌩 Cloudflare Workers

```ts

import { resolvePDFJS } from 'pdfjs-serverless'

addEventListener('fetch', (event) => {

  event.respondWith(handleRequest(event.request))

})

async function handleRequest(request) {

  if (request.method !== 'POST')

    return new Response('Method Not Allowed', { status: 405 })

  // Get the PDF file from the POST request body as a buffer

  const data = await request.arrayBuffer()

  // Initialize PDF.js

  const { getDocument } = await resolvePDFJS()

  const doc = await getDocument({

    data,

    useSystemFonts: true,

  }).promise

  // Get metadata and initialize output object

  const metadata = await doc.getMetadata()

  const output = {

    metadata,

    pages: []

  }

  // Iterate through each page and fetch the text content

  for (let i = 1; i <= doc.numPages; i++) {

    const page = await doc.getPage(i)

    const textContent = await page.getTextContent()

    const contents = textContent.items.map(item => item.str).join(' ')

    // Add page content to output

    output.pages.push({

      pageNumber: i,

      content: contents

    })

  }

  // Return the results as JSON

  return new Response(JSON.stringify(output), {

    headers: { 'Content-Type': 'application/json' }

  })

}

```

## How It Works

First, some string replacements of the `PDF.js` library is necessary, i.e. removing browser context references and checks like `typeof window`. Additionally, we enforce Node.js compatibility (might sound paradox at first, bear with me), i.e. mocking the `canvas` module and setting the `isNodeJS` flag to `true`.

PDF.js uses a worker to parse and work with PDF documents. This worker is a separate file that is loaded by the main library. For the serverless build, we need to inline the worker code into the main library.

To achieve the final nodeless build, [`unenv`](https://github.com/unjs/unenv) does the heavy lifting by converting Node.js specific code to be platform-agnostic. This ensures that Node.js built-in modules like `fs` are mocked.

See the [`rollup.config.ts`](./rollup.config.ts) file for more information.

## Inspiration

- [`pdf.mjs`](https://github.com/bru02/pdf.mjs), a nodeless build of PDF.js v2.

## License

[MIT](./LICENSE) License © 2023-PRESENT [Johann Schopplich](https://github.com/johannschopplich)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/johannschopplich/pdfjs-serverless

Awesome Lists containing this project

README