{"id":20597051,"url":"https://github.com/tradle/pdf-parse","last_synced_at":"2026-06-04T17:31:43.626Z","repository":{"id":152402892,"uuid":"624115464","full_name":"tradle/pdf-parse","owner":"tradle","description":null,"archived":false,"fork":false,"pushed_at":"2023-04-25T15:52:07.000Z","size":7834,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-01-17T00:53:10.485Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tradle.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-05T19:23:20.000Z","updated_at":"2024-07-10T13:36:32.000Z","dependencies_parsed_at":null,"dependency_job_id":"c06ec6ad-eaf5-4259-8880-dadc92e0f9a7","html_url":"https://github.com/tradle/pdf-parse","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tradle%2Fpdf-parse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tradle%2Fpdf-parse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tradle%2Fpdf-parse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tradle%2Fpdf-parse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tradle","download_url":"https://codeload.github.com/tradle/pdf-parse/tar.gz/refs/heads/master","host":{"name":"GitHub","url"
:"https://github.com","kind":"github","repositories_count":242231440,"owners_count":20093636,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T08:19:59.477Z","updated_at":"2025-03-06T15:18:44.589Z","avatar_url":"https://github.com/tradle.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pdf-parse\n\n_This is the forked repository from [pdf-parse](https://gitlab.com/autokent/pdf-parse)._\n\n**Pure javascript cross-platform module to extract texts from PDFs.**\n\n[![version](https://img.shields.io/npm/v/pdf-parse.svg)](https://www.npmjs.org/package/pdf-parse)\n[![downloads](https://img.shields.io/npm/dt/pdf-parse.svg)](https://www.npmjs.org/package/pdf-parse)\n[![node](https://img.shields.io/node/v/pdf-parse.svg)](https://nodejs.org/)\n[![status](https://gitlab.com/autokent/pdf-parse/badges/master/pipeline.svg)](https://gitlab.com/autokent/pdf-parse/pipelines)\n\n## Similar Packages\n* [pdf2json](https://www.npmjs.com/package/pdf2json) buggy, no support anymore, memory leak, throws non-catchable fatal errors\n* [j-pdfjson](https://www.npmjs.com/package/j-pdfjson) fork of pdf2json\n* [pdf-parser](https://github.com/dunso/pdf-parse) buggy, no tests\n* [pdfreader](https://www.npmjs.com/package/pdfreader) using pdf2json\n* [pdf-extract](https://www.npmjs.com/package/pdf-extract) not cross-platform using xpdf\n\n## Installation\n`npm install pdf-parse`\n \n## Basic Usage - Local Files\n\n```js\nconst fs = require('fs');\nconst pdf = require('pdf-parse');\n\nlet dataBuffer = fs.readFileSync('path to PDF 
file...');\n\npdf(dataBuffer).then(function(data) {\n\n\t// number of pages\n\tconsole.log(data.numpages);\n\t// number of rendered pages\n\tconsole.log(data.numrender);\n\t// PDF info\n\tconsole.log(data.info);\n\t// PDF metadata\n\tconsole.log(data.metadata); \n\t// PDF.js version\n\t// check https://mozilla.github.io/pdf.js/getting_started/\n\tconsole.log(data.version);\n\t// PDF text\n\tconsole.log(data.text); \n        \n});\n```\n\n## Basic Usage - HTTP\nYou can use [crawler-request](https://www.npmjs.com/package/crawler-request) which uses the `pdf-parse`\n\n## Exception Handling\n\n```js\nconst fs = require('fs');\nconst pdf = require('pdf-parse');\n\nlet dataBuffer = fs.readFileSync('path to PDF file...');\n\npdf(dataBuffer).then(function(data) {\n\t// use data\n})\n.catch(function(error){\n\t// handle exceptions\n})\n```\n\n## Extend\n* v1.0.9 and above break pagerender callback [changelog](https://gitlab.com/autokent/pdf-parse/blob/master/CHANGELOG)\n* If you need another format like json, you can change page render behaviour with a callback\n* Check out https://mozilla.github.io/pdf.js/\n\n```js\n// default render callback\nfunction render_page(pageData) {\n    //check documents https://mozilla.github.io/pdf.js/\n    let render_options = {\n        //replaces all occurrences of whitespace with standard spaces (0x20). The default value is `false`.\n        normalizeWhitespace: false,\n        //do not attempt to combine same line TextItem's. 
The default value is `false`.\n        disableCombineTextItems: false\n    }\n\n    return pageData.getTextContent(render_options)\n\t.then(function(textContent) {\n\t\tlet lastY, text = '';\n\t\tfor (let item of textContent.items) {\n\t\t\tif (lastY == item.transform[5] || !lastY){\n\t\t\t\ttext += item.str;\n\t\t\t}  \n\t\t\telse{\n\t\t\t\ttext += '\\n' + item.str;\n\t\t\t}    \n\t\t\tlastY = item.transform[5];\n\t\t}\n\t\treturn text;\n\t});\n}\n\nlet options = {\n    pagerender: render_page\n}\n\nlet dataBuffer = fs.readFileSync('path to PDF file...');\n\npdf(dataBuffer,options).then(function(data) {\n\t//use new format\n});\n```\n\n## Options\n\n```js\nconst DEFAULT_OPTIONS = {\n\t// internal page parser callback\n\t// you can set this option, if you need another format except raw text\n\tpagerender: render_page,\n\t\n\t// max page number to parse\n\tmax: 0,\n\t\n\t//check https://mozilla.github.io/pdf.js/getting_started/\n\tversion: 'v1.10.100'\n}\n```\n### *pagerender* (callback)\nIf you need another format except raw text.  \n\n### *max* (number)\nMax number of page to parse. If the value is less than or equal to 0, parser renders all pages.  \n\n### *version* (string, pdf.js version)\ncheck [pdf.js](https://mozilla.github.io/pdf.js/getting_started/)\n\n* `'default'`\n* `'v1.9.426'`\n* `'v1.10.100'`\n* `'v1.10.88'`\n* `'v2.0.550'`\n\n\u003e*default* version is *v1.10.100*   \n\u003e[mozilla.github.io/pdf.js](https://mozilla.github.io/pdf.js/getting_started/#download)\n\n## Test\n* `mocha` or `npm test`\n* Check [test folder](https://gitlab.com/autokent/pdf-parse/tree/master/test) and [quickstart.js](https://gitlab.com/autokent/pdf-parse/blob/master/quickstart.js) for extra usages.\n\n## Support\nI use this package actively myself, so it has my top priority. 
You can chat on WhatsApp about any information, ideas, and suggestions.\n\n[![WhatsApp](https://img.shields.io/badge/style-chat-green.svg?style=flat\u0026label=whatsapp)](https://api.whatsapp.com/send?phone=905063042480\u0026text=Hi%2C%0ALet%27s%20talk%20about%20pdf-parse)\n\n### Submitting an Issue\nIf you find a bug or a mistake, you can help by submitting an issue to the [GitLab Repository](https://gitlab.com/autokent/pdf-parse/issues).\n\n### Creating a Merge Request\nGitLab calls it a merge request instead of a pull request.  \n\n* [A Guide for First-Timers](https://about.gitlab.com/2016/06/16/fearless-contribution-a-guide-for-first-timers/)\n* [How to create a merge request](https://docs.gitlab.com/ee/gitlab-basics/add-merge-request.html)\n* Check the [Contributing Guide](https://gitlab.com/autokent/pdf-parse/blob/master/CONTRIBUTING.md) \n\n## License\n[MIT licensed](https://gitlab.com/autokent/pdf-parse/blob/master/LICENSE), and all its dependencies are MIT or BSD licensed.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftradle%2Fpdf-parse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftradle%2Fpdf-parse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftradle%2Fpdf-parse/lists"}