https://github.com/adrienjoly/npm-pdfreader
🚜 Parse text and tables from PDF files.
data-extraction javascript parse-tables parsing pdf-converter pdf-reader rule-based-parsing tabular-data
- Host: GitHub
- URL: https://github.com/adrienjoly/npm-pdfreader
- Owner: adrienjoly
- License: mit
- Created: 2015-03-05T18:02:23.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2025-01-22T15:20:47.000Z (30 days ago)
- Last Synced: 2025-02-07T04:41:43.762Z (14 days ago)
- Topics: data-extraction, javascript, parse-tables, parsing, pdf-converter, pdf-reader, rule-based-parsing, tabular-data
- Language: HTML
- Homepage: https://www.npmjs.com/package/pdfreader
- Size: 1.77 MB
- Stars: 659
- Watchers: 10
- Forks: 85
- Open Issues: 3
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Funding: .github/FUNDING.yml
- License: LICENSE
# pdfreader
Read text and parse tables from PDF files.
Supports **tabular data** with automatic column detection, and **rule-based parsing**.
Dependencies: it is based on [pdf2json](https://www.npmjs.com/package/pdf2json), which itself relies on Mozilla's [pdf.js](https://github.com/mozilla/pdf.js/).
🆕 Now includes TypeScript type definitions!
ℹ️ Important notes:
- This module is meant to be run using Node.js only. **It does not work from a web browser.**
- This module extracts text entries from PDF files. It does not support photographed text. If you cannot select text from the PDF file, **you may need to use OCR software first**.

Summary:
- [Installation, tests and CLI usage](#installation-tests-and-cli-usage)
- [Raw PDF reading](#raw-pdf-reading) (incl. examples)
- [Rule-based data extraction](#rule-based-data-extraction)
- [Troubleshooting & FAQ](#troubleshooting--faq)

## Installation, tests and CLI usage
After installing [Node.js](https://nodejs.org/):
```sh
git clone https://github.com/adrienjoly/npm-pdfreader.git
cd npm-pdfreader
npm install
npm test
node parse.js test/sample.pdf
```

## Installation into an existing project
To install `pdfreader` as a dependency of your Node.js project:
```sh
npm install pdfreader
```

Then, see below for examples of use.
## Raw PDF reading
This module exposes the `PdfReader` class, which you instantiate. Passing `{ debug: true }` to the constructor logs debugging information, which is useful for troubleshooting.
An instance has two methods for parsing a PDF. They produce the same output and differ only in their input: `PdfReader.parseFileItems` (as below) takes a filename, and `PdfReader.parseBuffer` (see "Raw PDF reading from a PDF buffer") takes data already in memory that you don't want to read from the filesystem.
Both methods take a callback, which is called each time the instance finds what it considers a PDF item.
An item can be one of the following:
- `null`, when the parsing is over or an error occurred.
- File metadata, `{file:{path:string}}`, emitted when a PDF file is being opened. It is always the first item.
- Page metadata, `{page:integer, width:float, height:float}`, emitted when a new page is being parsed. It provides the page number, starting at 1, and basically acts as a carriage return for the coordinates of the text items to be processed.
- Text items, `{text:string, x:float, y:float, w:float, ...}`, which you can think of as simple objects with a text property and floating-point 2D AABB (axis-aligned bounding box) coordinates on the page.

It's up to your callback to process these items into a data structure of your choice, and also to handle any errors passed to it.
For example:
```javascript
import { PdfReader } from "pdfreader";

new PdfReader().parseFileItems("test/sample.pdf", (err, item) => {
  if (err) console.error("error:", err);
  else if (!item) console.warn("end of file");
  else if (item.text) console.log(item.text);
});
```

### Parsing a password-protected PDF file
```javascript
import { PdfReader } from "pdfreader";

// The password is passed as a constructor option.
new PdfReader({ password: "YOUR_PASSWORD" }).parseFileItems(
  "test/sample-with-password.pdf",
  function (err, item) {
    if (err) console.error(err);
    else if (!item) console.warn("end of file");
    else if (item.text) console.log(item.text);
  }
);
```

### Raw PDF reading from a PDF buffer
As above, but reading from a buffer in memory rather than from a file referenced by path. For example:
```javascript
import fs from "fs";
import { PdfReader } from "pdfreader";

fs.readFile("test/sample.pdf", (readErr, pdfBuffer) => {
  // Stop early if the file could not be read
  if (readErr) return console.error("error:", readErr);
  // pdfBuffer contains the file content
  new PdfReader().parseBuffer(pdfBuffer, (err, item) => {
    if (err) console.error("error:", err);
    else if (!item) console.warn("end of buffer");
    else if (item.text) console.log(item.text);
  });
});
```

### Other examples of use


Source code of a further example: [parsing a CV/résumé](https://github.com/adrienjoly/npm-pdfreader-example).
For more, see [Examples of use](https://github.com/adrienjoly/npm-pdfreader/discussions/categories/examples-of-use).
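As one more illustration of turning raw items into a data structure, the sketch below groups text items into rows, using the page and text item shapes described under "Raw PDF reading". It is a minimal sketch, not part of `pdfreader`: `groupItemsIntoRows` and the sample items are hypothetical. It assumes that items on the same line share exactly the same `y` value and that `y` grows down the page.

```javascript
// Groups parsed items into rows of text, per page.
// `items` holds page metadata ({ page, width, height }) and
// text items ({ text, x, y, ... }), in the order they were emitted.
// Hypothetical helper for illustration; not part of pdfreader.
function groupItemsIntoRows(items) {
  const pages = []; // one entry per page: { page, rows: Map<y, item[]> }
  let current = null;
  for (const item of items) {
    if (!item) continue; // a null item marks the end of parsing
    if (item.page) {
      // A page item starts a new, empty set of rows
      current = { page: item.page, rows: new Map() };
      pages.push(current);
    } else if (item.text && current) {
      // Assumption: items on the same line share the same y value
      const cells = current.rows.get(item.y) || [];
      cells.push(item);
      current.rows.set(item.y, cells);
    }
  }
  return pages.map(({ page, rows }) => ({
    page,
    rows: [...rows.entries()]
      .sort(([y1], [y2]) => y1 - y2) // assumed: smaller y is higher on the page
      .map(([, cells]) =>
        cells.sort((a, b) => a.x - b.x).map((cell) => cell.text)
      ),
  }));
}

// Hand-written sample items (not taken from a real PDF)
const sampleItems = [
  { page: 1, width: 39.75, height: 56.25 },
  { text: "Name", x: 2, y: 3 },
  { text: "Price", x: 10, y: 3 },
  { text: "1.50", x: 10, y: 5 },
  { text: "Apple", x: 2, y: 5 },
  null,
];
console.log(JSON.stringify(groupItemsIntoRows(sampleItems)));
```

With a real document, you could push every item received by the parsing callback into an array and call the helper once the `null` item arrives. Real PDFs may need a tolerance when comparing `y` values.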
## Rule-based data extraction
The `Rule` class can be used to define and process data extraction rules, while parsing a PDF document.
`Rule` instances expose "accumulators": methods that define the data extraction strategy to be used for each rule.
Example:
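```javascript
import { PdfReader, Rule } from "pdfreader";

// Illustrative handlers; replace them with your own processing.
const displayValue = (value) => console.log("extracted value:", value);
const displayTable = (table) => console.log("extracted table:", table);

const processItem = Rule.makeItemProcessor([
  Rule.on(/^Hello \"(.*)\"$/)
    .extractRegexpValues()
    .then(displayValue),
  Rule.on(/^Value\:/)
    .parseNextItemValue()
    .then(displayValue),
  Rule.on(/^c1$/).parseTable(3).then(displayTable),
  Rule.on(/^Values\:/)
    .accumulateAfterHeading()
    .then(displayValue),
]);

new PdfReader().parseFileItems("test/sample.pdf", (err, item) => {
  if (err) console.error(err);
  else processItem(item);
});
```

## Troubleshooting & FAQ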
### Is it possible to parse a PDF document from a web application?
Solutions exist, but this module cannot be run directly by a web browser. If you really want to use this module, you will have to integrate it into your back-end so that PDF files can be read from your server.
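On a back-end, it is often convenient to wait for the whole document before sending a response. The sketch below wraps a callback-style parser in a Promise that resolves with all the text found. It is a minimal sketch, not part of `pdfreader`: `collectText`, its `parse` argument and `fakeParse` are hypothetical names, and `fakeParse` only emits hand-written items so that the snippet runs without a PDF file.

```javascript
// Wraps a callback-style item parser in a Promise that resolves with
// every text string found. Hypothetical helper; not part of pdfreader.
// `parse` is any function that accepts an (err, item) callback.
function collectText(parse) {
  return new Promise((resolve, reject) => {
    const texts = [];
    parse((err, item) => {
      if (err) reject(err); // parsing failed
      else if (!item) resolve(texts); // a null item marks the end
      else if (item.text) texts.push(item.text);
    });
  });
}

// Stand-in parser emitting hand-written items (not from a real PDF)
function fakeParse(callback) {
  callback(null, { file: { path: "upload.pdf" } });
  callback(null, { page: 1, width: 39.75, height: 56.25 });
  callback(null, { text: "Hello", x: 1, y: 1 });
  callback(null, { text: "world", x: 5, y: 1 });
  callback(null, null);
}

collectText(fakeParse).then((texts) => console.log(texts.join(" ")));
```

In a server-side request handler, with `PdfReader` imported as in the examples above, this could be called as `await collectText((cb) => new PdfReader().parseBuffer(uploadedBuffer, cb))`, where `uploadedBuffer` is a hypothetical `Buffer` holding the uploaded file.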
### `Cannot read property 'userAgent' of undefined` error from an express-based node.js app
Dmitry found out that you may need to run these instructions before including the `pdfreader` module:
```js
global.navigator = {
  userAgent: "node",
};

// `window` is not defined in plain Node.js, so only set it when it exists
if (typeof window !== "undefined") {
  window.navigator = {
    userAgent: "node",
  };
}
```

Source: [express - TypeError: Cannot read property 'userAgent' of undefined error on node.js app run - Stack Overflow](https://stackoverflow.com/questions/49208414/typeerror-cannot-read-property-useragent-of-undefined-error-on-node-js-app-ru)