https://github.com/adrienjoly/npm-pdfreader

🚜 Parse text and tables from PDF files.

data-extraction javascript parse-tables parsing pdf-converter pdf-reader rule-based-parsing tabular-data

Last synced: 5 months ago

Host: GitHub
URL: https://github.com/adrienjoly/npm-pdfreader
Owner: adrienjoly
License: mit
Created: 2015-03-05T18:02:23.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2025-01-22T15:20:47.000Z (9 months ago)
Last Synced: 2025-05-05T19:15:05.379Z (5 months ago)
Topics: data-extraction, javascript, parse-tables, parsing, pdf-converter, pdf-reader, rule-based-parsing, tabular-data
Language: HTML
Homepage: https://www.npmjs.com/package/pdfreader
Size: 1.77 MB
Stars: 674
Watchers: 9
Forks: 85
Open Issues: 3
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Funding: .github/FUNDING.yml
- License: LICENSE

README

# pdfreader ![Node CI](https://github.com/adrienjoly/npm-pdfreader/workflows/Node%20CI/badge.svg) [![Code Quality](https://api.codacy.com/project/badge/Grade/73d37dbb0ff84795acf65a55c5936d83)](https://app.codacy.com/gh/adrienjoly/npm-pdfreader?utm_source=github.com&utm_medium=referral&utm_content=adrienjoly/npm-pdfreader&utm_campaign=Badge_Grade)

Read text and parse tables from PDF files.

Supports **tabular data** with automatic column detection, and **rule-based parsing**.

Dependencies: it is based on [pdf2json](https://www.npmjs.com/package/pdf2json), which itself relies on Mozilla's [pdf.js](https://github.com/mozilla/pdf.js/).

🆕 Now includes TypeScript type definitions!

ℹ️ Important notes:

- This module is meant to be run using Node.js only. **It does not work from a web browser.**

- This module extracts text entries from PDF files. It does not support photographed text. If you cannot select text from the PDF file, **you may need to use OCR software first**.

Summary:

- [Installation, tests and CLI usage](#installation-tests-and-cli-usage)

- [Raw PDF reading](#raw-pdf-reading) (incl. examples)

- [Rule-based data extraction](#rule-based-data-extraction)

- [Troubleshooting & FAQ](#troubleshooting--faq)

## Installation, tests and CLI usage

After installing [Node.js](https://nodejs.org/):

```sh
git clone https://github.com/adrienjoly/npm-pdfreader.git
cd npm-pdfreader
npm install
npm test
node parse.js test/sample.pdf
```

## Installation into an existing project

To install `pdfreader` as a dependency of your Node.js project:

```sh
npm install pdfreader
```

Then, see below for examples of use.

## Raw PDF reading

This module exposes the `PdfReader` class, which you instantiate. To log debugging information (useful for troubleshooting), pass `{ debug: true }` to the constructor.

Your instance has two methods for parsing a PDF. They return the same output and differ only in input: `parseFileItems` (as below) takes a filename, and `parseBuffer` (see "Raw PDF reading from a PDF buffer") takes data that is already in memory rather than a path on the filesystem.

Both methods take a callback, which is called each time the instance finds what it denotes as a PDF item.

An item object can match one of the following objects:

- `null`, when the parsing is over, or an error occurred.

- File metadata, `{file:{path:string}}`, when a PDF file is being opened. This is always the first item.

- Page metadata, `{page:integer, width:float, height:float}`, when a new page is being parsed. It provides the page number, starting at 1, and basically acts as a carriage return for the coordinates of the text items that follow.

- Text items, `{text:string, x:float, y:float, w:float, ...}`, which you can think of as simple objects with a text property and floating-point coordinates of a 2D axis-aligned bounding box (AABB) on the page.

It's up to your callback to process these items into a data structure of your choice, and also to handle any errors passed to it.

For example:

```javascript
import { PdfReader } from "pdfreader";

new PdfReader().parseFileItems("test/sample.pdf", (err, item) => {
  if (err) console.error("error:", err);
  else if (!item) console.warn("end of file");
  else if (item.text) console.log(item.text);
});
```
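The item types listed above can also be accumulated into a structure instead of being printed. The sketch below groups text items into rows, page by page. `makeRowCollector` is a hypothetical helper, not part of `pdfreader`, and it assumes that each page item arrives before the text items of that page and that items on the same line share exactly the same `y` value (real documents may need rounding). The sample items are made up so that the block runs without a PDF file.

```javascript
// Hypothetical helper (not part of pdfreader): groups text items into rows per page.
function makeRowCollector() {
  const pages = []; // one entry per page: { page, rows }
  let current = null; // page being filled: { page, byY: Map<y, items[]> }

  // Convert the items of the current page into rows, then store them.
  const flush = () => {
    if (!current) return;
    const rows = [...current.byY.entries()]
      .sort(([y1], [y2]) => y1 - y2) // rows from top to bottom
      .map(([, items]) => items.sort((a, b) => a.x - b.x).map((it) => it.text)); // left to right
    pages.push({ page: current.page, rows });
    current = null;
  };

  const onItem = (item) => {
    if (!item) flush(); // null item: parsing is over
    else if (item.page) {
      flush(); // a new page starts: close the previous one
      current = { page: item.page, byY: new Map() };
    } else if (item.text && current) {
      const row = current.byY.get(item.y) ?? [];
      row.push(item);
      current.byY.set(item.y, row);
    }
    // file metadata items are ignored
  };

  return { onItem, pages };
}

// Made-up items, in the shapes described above.
const collector = makeRowCollector();
[
  { file: { path: "example.pdf" } },
  { page: 1, width: 10, height: 10 },
  { text: "b", x: 5, y: 1, w: 1 },
  { text: "a", x: 1, y: 1, w: 1 },
  { text: "c", x: 1, y: 2, w: 1 },
  { page: 2, width: 10, height: 10 },
  { text: "d", x: 0, y: 0, w: 1 },
  null,
].forEach(collector.onItem);

console.log(JSON.stringify(collector.pages));
```

With a real file, `collector.onItem(item)` would be called from the callback given to `parseFileItems`, after checking `err`.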

### Parsing a password-protected PDF file

```javascript
import { PdfReader } from "pdfreader";

new PdfReader({ password: "YOUR_PASSWORD" }).parseFileItems(
  "test/sample-with-password.pdf",
  function (err, item) {
    if (err) console.error(err);
    else if (!item) console.warn("end of file");
    else if (item.text) console.log(item.text);
  }
);
```

### Raw PDF reading from a PDF buffer

As above, but reading from a buffer in memory rather than from a file referenced by path. For example:

```javascript
import fs from "fs";
import { PdfReader } from "pdfreader";

fs.readFile("test/sample.pdf", (readErr, pdfBuffer) => {
  // stop here if the file could not be read
  if (readErr) return console.error("error:", readErr);
  // pdfBuffer contains the file content
  new PdfReader().parseBuffer(pdfBuffer, (err, item) => {
    if (err) console.error("error:", err);
    else if (!item) console.warn("end of buffer");
    else if (item.text) console.log(item.text);
  });
});
```
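Because the callback is called once per item and receives a falsy item at the end, it can be wrapped into a promise that resolves with all the items. This is only a sketch: `collectItems` and `fakeParse` are hypothetical names that do not exist in `pdfreader`, and `fakeParse` stands in for the real parser so that the block runs without a PDF file.

```javascript
// Hypothetical helper (not part of pdfreader).
// `startParsing` receives the (err, item) callback and must start the parsing.
function collectItems(startParsing) {
  return new Promise((resolve, reject) => {
    const items = [];
    startParsing((err, item) => {
      if (err) reject(err);
      else if (!item) resolve(items); // falsy item: parsing is over
      else items.push(item);
    });
  });
}

// Stand-in for a parser: emits a page item, a text item, then the end marker.
const fakeParse = (callback) => {
  callback(null, { page: 1, width: 10, height: 10 });
  callback(null, { text: "Hello", x: 1, y: 1, w: 5 });
  callback(null, null);
};

const collected = collectItems(fakeParse);
collected.then((items) => console.log("collected", items.length, "items"));
```

With the module itself, the same helper could be called as `collectItems((cb) => new PdfReader().parseBuffer(pdfBuffer, cb))`.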

### Other examples of use

![example cv resume parse convert pdf to text](https://github.com/adrienjoly/npm-pdfreader-example/raw/master/parseRows.png)

![example cv resume parse convert pdf table to text](https://github.com/adrienjoly/npm-pdfreader-example/raw/master/parseTable.png)

Source code of the examples above: [parsing a CV/résumé](https://github.com/adrienjoly/npm-pdfreader-example).

For more, see [Examples of use](https://github.com/adrienjoly/npm-pdfreader/discussions/categories/examples-of-use).

## Rule-based data extraction

The `Rule` class can be used to define and process data extraction rules, while parsing a PDF document.

`Rule` instances expose "accumulators": methods that define the data extraction strategy to be used for each rule.

Example:

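```javascript
import { PdfReader, Rule } from "pdfreader";

// Placeholder handlers: replace them with your own processing.
const displayValue = (value) => console.log("extracted value:", value);
const displayTable = (table) => console.log("extracted table:", table);

const processItem = Rule.makeItemProcessor([
  Rule.on(/^Hello \"(.*)\"$/)
    .extractRegexpValues()
    .then(displayValue),
  Rule.on(/^Value\:/)
    .parseNextItemValue()
    .then(displayValue),
  Rule.on(/^c1$/).parseTable(3).then(displayTable),
  Rule.on(/^Values\:/)
    .accumulateAfterHeading()
    .then(displayValue),
]);

new PdfReader().parseFileItems("test/sample.pdf", (err, item) => {
  if (err) console.error(err);
  else processItem(item);
});
```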
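To make the idea of a rule more concrete, here is a simplified, standalone illustration of what a "next item value" strategy could look like: when a text item matches a heading pattern, the text of the following item is reported as the value. This is an assumption made for illustration, not the implementation of the `Rule` class; `makeNextValueProcessor` is a hypothetical name and the items are made up.

```javascript
// Hypothetical stand-in (pdfreader's own Rule class is not used here).
// Reports the text of the item that follows a heading matching `headingRegexp`.
function makeNextValueProcessor(headingRegexp, onValue) {
  let expectingValue = false;
  return (item) => {
    if (!item || !item.text) return; // skip the end marker, file and page items
    if (expectingValue) {
      expectingValue = false;
      onValue(item.text);
    } else if (headingRegexp.test(item.text)) {
      expectingValue = true;
    }
  };
}

// Made-up text items: only the item after "Value:" is reported.
const found = [];
const processValue = makeNextValueProcessor(/^Value:/, (value) => found.push(value));
[{ text: "Value:" }, { text: "4" }, { text: "other" }, null].forEach(processValue);

console.log(found);
```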

## Troubleshooting & FAQ

### Is it possible to parse a PDF document from a web application?

Solutions exist, but this module cannot be run directly by a web browser. If you really want to use this module, you will have to integrate it into your back-end so that PDF files can be read from your server.

### `Cannot read property 'userAgent' of undefined` error from an express-based node.js app

Dmitry found out that you may need to run these instructions before including the `pdfreader` module:

```js
global.navigator = {
  userAgent: "node",
};

window.navigator = {
  userAgent: "node",
};
```

Source: [express - TypeError: Cannot read property 'userAgent' of undefined error on node.js app run - Stack Overflow](https://stackoverflow.com/questions/49208414/typeerror-cannot-read-property-useragent-of-undefined-error-on-node-js-app-ru)
