{"id":13467893,"url":"https://github.com/adrienjoly/npm-pdfreader","last_synced_at":"2025-05-14T04:07:03.262Z","repository":{"id":28223512,"uuid":"31727906","full_name":"adrienjoly/npm-pdfreader","owner":"adrienjoly","description":"🚜 Parse text and tables from PDF files.","archived":false,"fork":false,"pushed_at":"2025-01-22T15:20:47.000Z","size":1859,"stargazers_count":674,"open_issues_count":3,"forks_count":85,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-05-05T19:15:05.379Z","etag":null,"topics":["data-extraction","javascript","parse-tables","parsing","pdf-converter","pdf-reader","rule-based-parsing","tabular-data"],"latest_commit_sha":null,"homepage":"https://www.npmjs.com/package/pdfreader","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/adrienjoly.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["adrienjoly"],"custom":["https://adrienjoly.com/donate/"]}},"created_at":"2015-03-05T18:02:23.000Z","updated_at":"2025-05-01T23:36:03.000Z","dependencies_parsed_at":"2023-12-07T00:23:57.128Z","dependency_job_id":"7391d775-b21b-4043-9e54-4cb088783fbc","html_url":"https://github.com/adrienjoly/npm-pdfreader","commit_stats":{"total_commits":150,"total_committers":22,"mean_commits":6.818181818181818,"dds":0.52,"last_synced_commit":"300351cc771027fe2b56d6dd728ee273c57ebe7e"},"previous_names":[],"tags_count":46,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adrienjoly%2Fnpm-pdfreader","tags_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adrienjoly%2Fnpm-pdfreader/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adrienjoly%2Fnpm-pdfreader/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adrienjoly%2Fnpm-pdfreader/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/adrienjoly","download_url":"https://codeload.github.com/adrienjoly/npm-pdfreader/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252823577,"owners_count":21809705,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-extraction","javascript","parse-tables","parsing","pdf-converter","pdf-reader","rule-based-parsing","tabular-data"],"created_at":"2024-07-31T15:01:02.132Z","updated_at":"2025-05-14T04:06:58.222Z","avatar_url":"https://github.com/adrienjoly.png","language":"HTML","funding_links":["https://github.com/sponsors/adrienjoly","https://adrienjoly.com/donate/"],"categories":["HTML"],"sub_categories":[],"readme":"# pdfreader ![Node CI](https://github.com/adrienjoly/npm-pdfreader/workflows/Node%20CI/badge.svg) [![Code Quality](https://api.codacy.com/project/badge/Grade/73d37dbb0ff84795acf65a55c5936d83)](https://app.codacy.com/gh/adrienjoly/npm-pdfreader?utm_source=github.com\u0026utm_medium=referral\u0026utm_content=adrienjoly/npm-pdfreader\u0026utm_campaign=Badge_Grade)\n\nRead text and parse tables from PDF files.\n\nSupports **tabular data** with automatic column detection, and **rule-based parsing**.\n\nDependencies: it is 
based on [pdf2json](https://www.npmjs.com/package/pdf2json), which itself relies on Mozilla's [pdf.js](https://github.com/mozilla/pdf.js/).\n\n🆕 Now includes TypeScript type definitions!\n\nℹ️ Important notes:\n\n- This module is meant to be run using Node.js only. **It does not work from a web browser.**\n- This module extracts text entries from PDF files. It does not support photographed text. If you cannot select text from the PDF file, **you may need to use OCR software first**.\n\nSummary:\n\n- [Installation, tests and CLI usage](#installation-tests-and-cli-usage)\n- [Raw PDF reading](#raw-pdf-reading) (incl. examples)\n- [Rule-based data extraction](#rule-based-data-extraction)\n- [Troubleshooting \u0026 FAQ](#troubleshooting--faq)\n\n## Installation, tests and CLI usage\n\nAfter installing [Node.js](https://nodejs.org/):\n\n```sh\ngit clone https://github.com/adrienjoly/npm-pdfreader.git\ncd npm-pdfreader\nnpm install\nnpm test\nnode parse.js test/sample.pdf\n```\n\n## Installation into an existing project\n\nTo install `pdfreader` as a dependency of your Node.js project:\n\n```sh\nnpm install pdfreader\n```\n\nThen, see below for examples of use.\n\n## Raw PDF reading\n\nThis module exposes the `PdfReader` class, to be instantiated. You can pass `{ debug: true }` to the constructor, in order to log debugging information. (useful for troubleshooting)\n\nYour instance has two methods for parsing a PDF. 
They return the same output and differ only in input: `PdfReader.parseFileItems` (as below) for a filename, and `PdfReader.parseBuffer` (see \"Raw PDF reading from a PDF buffer\") for data that you don't want to reference from the filesystem.\n\nWhichever method you choose, it asks for a callback, which gets called each time the instance finds what it denotes as a PDF item.\n\nAn item object can match one of the following objects:\n\n- `null`, when the parsing is over, or an error occurred.\n- File metadata, `{file:{path:string}}`, when a PDF file is being opened; this is always the first item.\n- Page metadata, `{page:integer, width:float, height:float}`, when a new page is being parsed; it provides the page number, starting at 1. This basically acts as a carriage return for the coordinates of text items to be processed.\n- Text items, `{text:string, x:float, y:float, w:float, ...}`, which you can think of as simple objects with a text property, and floating-point 2D AABB (axis-aligned bounding box) coordinates on the page.\n\nIt's up to your callback to process these items into a data structure of your choice, and also to handle any errors passed to it.\n\nFor example:\n\n```javascript\nimport { PdfReader } from \"pdfreader\";\n\nnew PdfReader().parseFileItems(\"test/sample.pdf\", (err, item) =\u003e {\n  if (err) console.error(\"error:\", err);\n  else if (!item) console.warn(\"end of file\");\n  else if (item.text) console.log(item.text);\n});\n```\n\n### Parsing a password-protected PDF file\n\n```javascript\nimport { PdfReader } from \"pdfreader\";\n\nnew PdfReader({ password: \"YOUR_PASSWORD\" }).parseFileItems(\n  \"test/sample-with-password.pdf\",\n  function (err, item) {\n    if (err) console.error(err);\n    else if (!item) console.warn(\"end of file\");\n    else if (item.text) console.log(item.text);\n  }\n);\n```\n\n### Raw PDF reading from a PDF buffer\n\nAs above, but reading from a buffer in memory rather than from a file referenced by path. 
For example:\n\n```javascript\nimport fs from \"fs\";\nimport { PdfReader } from \"pdfreader\";\n\nfs.readFile(\"test/sample.pdf\", (readErr, pdfBuffer) =\u003e {\n  // stop early if the file could not be read\n  if (readErr) return console.error(\"error:\", readErr);\n  // pdfBuffer contains the file content\n  new PdfReader().parseBuffer(pdfBuffer, (err, item) =\u003e {\n    if (err) console.error(\"error:\", err);\n    else if (!item) console.warn(\"end of buffer\");\n    else if (item.text) console.log(item.text);\n  });\n});\n```\n\n### Other examples of use\n\n![example cv resume parse convert pdf to text](https://github.com/adrienjoly/npm-pdfreader-example/raw/master/parseRows.png)\n\n![example cv resume parse convert pdf table to text](https://github.com/adrienjoly/npm-pdfreader-example/raw/master/parseTable.png)\n\nSource code of the examples above: [parsing a CV/résumé](https://github.com/adrienjoly/npm-pdfreader-example).\n\nFor more, see [Examples of use](https://github.com/adrienjoly/npm-pdfreader/discussions/categories/examples-of-use).\n\n## Rule-based data extraction\n\nThe `Rule` class can be used to define and process data extraction rules while parsing a PDF document.\n\n`Rule` instances expose \"accumulators\": methods that define the data extraction strategy to be used for each rule.\n\nExample:\n\n```javascript\nimport { PdfReader, Rule } from \"pdfreader\";\n\n// displayValue and displayTable are placeholder callbacks (not part of pdfreader)\nconst displayValue = (value) =\u003e console.log(\"extracted value:\", value);\nconst displayTable = (table) =\u003e console.log(\"extracted table:\", table);\n\nconst processItem = Rule.makeItemProcessor([\n  Rule.on(/^Hello \\\"(.*)\\\"$/)\n    .extractRegexpValues()\n    .then(displayValue),\n  Rule.on(/^Value\\:/)\n    .parseNextItemValue()\n    .then(displayValue),\n  Rule.on(/^c1$/).parseTable(3).then(displayTable),\n  Rule.on(/^Values\\:/)\n    .accumulateAfterHeading()\n    .then(displayValue),\n]);\nnew PdfReader().parseFileItems(\"test/sample.pdf\", (err, item) =\u003e {\n  if (err) console.error(err);\n  else processItem(item);\n});\n```\n\n## Troubleshooting \u0026 FAQ\n\n### Is it possible to parse a PDF document from a web application?\n\nSolutions exist, but this module cannot be run directly by a web browser. 
If you really want to use this module, you will have to integrate it into your back-end so that PDF files can be read from your server.\n\n### `Cannot read property 'userAgent' of undefined` error from an express-based node.js app\n\nDmitry found out that you may need to run these instructions before including the `pdfreader` module:\n\n```js\nglobal.navigator = {\n  userAgent: \"node\",\n};\n\nwindow.navigator = {\n  userAgent: \"node\",\n};\n```\n\nSource: [express - TypeError: Cannot read property 'userAgent' of undefined error on node.js app run - Stack Overflow](https://stackoverflow.com/questions/49208414/typeerror-cannot-read-property-useragent-of-undefined-error-on-node-js-app-ru)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadrienjoly%2Fnpm-pdfreader","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadrienjoly%2Fnpm-pdfreader","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadrienjoly%2Fnpm-pdfreader/lists"}