{"id":43064385,"url":"https://github.com/mehmet-kozan/pdf-parse","last_synced_at":"2026-01-31T12:23:38.492Z","repository":{"id":315513074,"uuid":"1052760396","full_name":"mehmet-kozan/pdf-parse","owner":"mehmet-kozan","description":"Pure TypeScript, cross-platform module for extracting text, images, and tabular data from PDFs. Run 🤗 directly in your browser or in Node.js","archived":false,"fork":false,"pushed_at":"2025-12-17T00:44:02.000Z","size":49901,"stargazers_count":108,"open_issues_count":8,"forks_count":11,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-20T05:44:14.358Z","etag":null,"topics":["pdf","pdf-parse","pdf-parser","pdf-screenshot","pdf-table","pdf-thumbnail","pdf-to-image","pdf-to-text","pdf-tools","pdf-utils","pdf-viewer","pdf2image","pdf2json","pdf2pic","pdf2text","pdfjs","pdfjs-dist","turkey"],"latest_commit_sha":null,"homepage":"https://mehmet-kozan.github.io/pdf-parse/","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mehmet-kozan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":".github/SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"mehmet-kozan"}},"created_at":"2025-09-08T14:06:57.000Z","updated_at":"2025-12-18T06:51:28.000Z","dependencies_parsed_at":null,"dependency_job_id":"db131c84-13c9-46d5-88d1-8e337abdcdca","html_url":"https://github.com/mehmet-kozan/pdf-parse","commit_stats":null,"previous_names":["mehmet-kozan/pdf-parse"],"tags_count":41,
"template":false,"template_full_name":null,"purl":"pkg:github/mehmet-kozan/pdf-parse","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mehmet-kozan%2Fpdf-parse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mehmet-kozan%2Fpdf-parse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mehmet-kozan%2Fpdf-parse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mehmet-kozan%2Fpdf-parse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mehmet-kozan","download_url":"https://codeload.github.com/mehmet-kozan/pdf-parse/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mehmet-kozan%2Fpdf-parse/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28942300,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-31T12:10:04.904Z","status":"ssl_error","status_checked_at":"2026-01-31T12:09:58.894Z","response_time":128,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pdf","pdf-parse","pdf-parser","pdf-screenshot","pdf-table","pdf-thumbnail","pdf-to-image","pdf-to-text","pdf-tools","pdf-utils","pdf-viewer","pdf2image","pdf2json","pdf2pic","pdf2text","pdfjs","pdfjs-dist","turkey"],"created_at":"2026-01-31T12:23:37.182Z","updated_at":"2026-01-31T12:23:38.487Z","avatar_url":"https://github.com/mehmet-kozan.png","language":"TypeScript","funding_links":["https://github.com/sponsors/mehmet-kozan"],"categories":["TypeScript"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e \n\n# pdf-parse\n**Pure TypeScript, cross-platform module for extracting text, images, and tables from PDFs.**  \n**Run 🤗 directly in your browser or in Node!** \n\n\u003c/div\u003e \n\n\u003cdiv align=\"center\"\u003e \n\n[![npm version](https://img.shields.io/npm/v/pdf-parse.svg)](https://www.npmjs.com/package/pdf-parse) \n[![npm downloads](https://img.shields.io/npm/dm/pdf-parse.svg)](https://www.npmjs.com/package/pdf-parse) \n[![node version](https://img.shields.io/node/v/pdf-parse.svg)](https://www.npmjs.com/package/pdf-parse) \n[![tests](https://github.com/mehmet-kozan/pdf-parse/actions/workflows/test.yml/badge.svg)](https://github.com/mehmet-kozan/pdf-parse/actions/workflows/test.yml) \n[![tests](https://github.com/mehmet-kozan/pdf-parse/actions/workflows/test_integration.yml/badge.svg)](https://github.com/mehmet-kozan/pdf-parse/actions/workflows/test_integration.yml) \n[![biome](https://img.shields.io/badge/code_style-biome-60a5fa?logo=biome)](https://biomejs.dev) 
\n[![vitest](https://img.shields.io/badge/tested_with-vitest-6E9F18?logo=vitest)](https://vitest.dev) \n[![codecov](https://codecov.io/github/mehmet-kozan/pdf-parse/graph/badge.svg?token=FZL3G8KNZ8)](https://codecov.io/github/mehmet-kozan/pdf-parse) \n[![test \u0026 coverage reports](https://img.shields.io/badge/reports-view-brightgreen.svg)](https://mehmet-kozan.github.io/pdf-parse/)  \n\n\u003c/div\u003e\n\u003cbr /\u003e\n\n## Getting Started with v2/v3 (coming from v1)\n\n```js\n// v1\n// const pdf = require('pdf-parse');\n// pdf(buffer).then(result =\u003e console.log(result.text));\n\n// v2\nconst { PDFParse } = require('pdf-parse');\n// import { PDFParse } from 'pdf-parse';\n\nasync function run() {\n\tconst parser = new PDFParse({ url: 'https://bitcoin.org/bitcoin.pdf' });\n\n\tconst result = await parser.getText();\n\t// or use getRaw() for v1 compatibility\n\tconsole.log(result.text);\n}\n\nrun();\n```  \n\n## Features \u003ca href=\"https://mehmet-kozan.github.io/pdf-parse/\" target=\"_blank\"\u003e\u003cimg align=\"right\" src=\"https://img.shields.io/badge/live-demo-brightgreen.svg\" alt=\"demo\"\u003e\u003c/a\u003e\n\n- CJS, ESM, Node.js, and browser support.\n- Can be integrated with `React`, `Vue`, `Angular`, or any other web framework.\n- **Command-line interface** for quick PDF processing: [`CLI Documentation`](./docs/command-line.md)\n- [`Security Policy`](https://github.com/mehmet-kozan/pdf-parse?tab=security-ov-file#security-policy)\n- Retrieve headers and validate PDF: [`getHeader()`](#getheader--node-utility-pdf-header-retrieval-and-validation)\n- Extract document info: [`getInfo()`](#getinfo--extract-metadata-and-document-information)\n- Extract page text: [`getRaw() getText() getParagraph()`](#gettext--extract-text) \n- Render pages as PNG: [`getScreenshot()`](#getscreenshot--render-pages-as-png)\n- Extract embedded images: [`getImage()`](#getimage--extract-embedded-images)\n- Detect and extract tabular data: 
[`getTable()`](#gettable--extract-tabular-data) \n- See [LoadParameters](./docs/options.md#load-parameters) and [ParseParameters](./docs/options.md#parse-parameters) for all available options.\n- Examples: [`live demo`](./reports/demo/), [`examples`](./examples/), [`tests`](./tests/unit/) and [`tests example`](./tests/unit/test-example/) folders.\n- Supports: [`Next.js + Vercel`](https://github.com/mehmet-kozan/vercel-next-app-demo), Netlify, AWS Lambda, Cloudflare Workers.\n\n\n## Installation\n\n```sh\nnpm install pdf-parse\n# or\npnpm add pdf-parse\n# or\nyarn add pdf-parse\n# or\nbun add pdf-parse\n```\n\n### CLI Installation\n\nFor command-line usage, install the package globally:\n\n```sh\n# installation\nnpm install -g pdf-parse\n\n# updating\nnpm update -g pdf-parse\n\n# uninstallation\nnpm uninstall -g pdf-parse\n\n# help\npdf-parse -h\n```\n\nFor detailed CLI documentation and usage examples, see: [CLI Documentation](./docs/command-line.md)\n\n## Usage\n\n### `getHeader` — Node Utility: PDF Header Retrieval and Validation  \n\n```js\n// Important: getHeader is available from the 'pdf-parse/node' submodule\nimport { getHeader } from 'pdf-parse/node';\n\n// Retrieve HTTP headers and file size without downloading the full file.\n// Pass `true` to check PDF magic bytes via range request.\n// Optionally validates PDFs by fetching the first 4 bytes (magic bytes).\n// Useful for checking file existence, size, and type before full parsing.\n// Node only, will not work in browser environments.\nconst result = await getHeader('https://bitcoin.org/bitcoin.pdf', true);\n\nconsole.log(`Status: ${result.status}`);\nconsole.log(`Content-Length: ${result.size}`);\nconsole.log(`Is PDF: ${result.isPdf}`);\nconsole.log(`Headers:`, result.headers);\n```\n\n### `getInfo` — Extract Metadata and Document Information  \n\n```js\nimport { readFile } from 'node:fs/promises';\nimport { PDFParse } from 'pdf-parse';\n\nconst link = 
'https://mehmet-kozan.github.io/pdf-parse/pdf/climate.pdf';\n// const buffer = await readFile('reports/pdf/climate.pdf');\n// const parser = new PDFParse({ data: buffer });\n\nconst parser = new PDFParse({ url: link });\nconst result = await parser.getInfo({ parsePageInfo: true });\nawait parser.destroy();\n\nconsole.log(`Total pages: ${result.total}`);\nconsole.log(`Title: ${result.infoData?.Title}`);\nconsole.log(`Author: ${result.infoData?.Author}`);\nconsole.log(`Creator: ${result.infoData?.Creator}`);\nconsole.log(`Producer: ${result.infoData?.Producer}`);\nconsole.log(`Creation Date: ${result.infoData?.CreationDate}`);\nconsole.log(`Modification Date: ${result.infoData?.ModDate}`);\n\n// Links, pageLabel, width, height (when `parsePageInfo` is true)\nconsole.log('Per-Page information:');\nconsole.log(JSON.stringify(result.pages, null, 2));\n\nconsole.log('full information:');\nconsole.log(JSON.stringify(result.toJSON(), null, 2));\n```\n\n### `getText` — Extract Text  \n\n```js\nimport { PDFParse } from 'pdf-parse';\n\nconst parser = new PDFParse({ url: 'https://bitcoin.org/bitcoin.pdf' });\nconst result = await parser.getText();\n// to extract text from page 3 only:\n// const result = await parser.getText({ partial: [3] });\nawait parser.destroy();\nconsole.log(result.text);\n```\nFor a complete list of configuration options, see:\n\n- [LoadParameters](./docs/options.md#load-parameters)\n- [ParseParameters](./docs/options.md#parse-parameters)\n\n\nUsage Examples:\n- Parse password protected PDF:  [`password.test.ts`](tests/unit/test-example/password.test.ts)\n- Parse only specific pages: [`specific-pages.test.ts`](tests/unit/test-example/specific-pages.test.ts)\n- Parse embedded hyperlinks: [`hyperlink.test.ts`](tests/unit/test-example/hyperlink.test.ts)\n- Set verbosity level example: [`password.test.ts`](tests/unit/test-example/password.test.ts)\n- Load PDF from URL: [`url.test.ts`](tests/unit/test-example/url.test.ts)\n- Load PDF from base64 data: 
[`base64.test.ts`](tests/unit/test-example/base64.test.ts)\n- Loading large files (\u003e 5 MB): [`large-file.test.ts`](tests/unit/test-example/large-file.test.ts)\n\n### `getScreenshot` — Render Pages as PNG  \n\n```js\nimport { readFile, writeFile } from 'node:fs/promises';\nimport { PDFParse } from 'pdf-parse';\n\nconst link = 'https://bitcoin.org/bitcoin.pdf';\n// const buffer = await readFile('reports/pdf/bitcoin.pdf');\n// const parser = new PDFParse({ data: buffer });\n\nconst parser = new PDFParse({ url: link });\n\n// scale:1 for original page size.\n// scale:1.5 50% bigger.\nconst result = await parser.getScreenshot({ scale: 1.5 });\n\nawait parser.destroy();\nawait writeFile('bitcoin.png', result.pages[0].data);\n```\n\nUsage Examples:\n- Limit output resolution or specific pages using [ParseParameters](./docs/options.md#parse-parameters)\n- `getScreenshot({scale:1.5})` — Increase rendering scale (higher DPI / larger image)\n- `getScreenshot({desiredWidth:1024})` — Request a target width in pixels; height scales to keep aspect ratio\n- `imageDataUrl` (default: `true`) — include base64 data URL string in the result.\n- `imageBuffer` (default: `true`) — include a binary buffer for each image.\n- Select specific pages with `partial` (e.g. `getScreenshot({ partial: [1,3] })`) \n- `partial` overrides `first`/`last`.\n- Use `first` to render the first N pages (e.g. `getScreenshot({ first: 3 })`).\n- Use `last` to render the last N pages (e.g. 
`getScreenshot({ last: 2 })`).\n- When both `first` and `last` are provided they form an inclusive range (`first..last`).\n\n### `getImage` — Extract Embedded Images  \n\n```js\nimport { readFile, writeFile } from 'node:fs/promises';\nimport { PDFParse } from 'pdf-parse';\n\nconst link = new URL('https://mehmet-kozan.github.io/pdf-parse/pdf/image-test.pdf');\n// const buffer = await readFile('reports/pdf/image-test.pdf');\n// const parser = new PDFParse({ data: buffer });\n\nconst parser = new PDFParse({ url: link });\nconst result = await parser.getImage();\nawait parser.destroy();\n\nawait writeFile('adobe.png', result.pages[0].images[0].data);\n```\n\nUsage Examples:\n- Exclude images with width or height \u003c= 50 px: `getImage({ imageThreshold: 50 })`\n- Default `imageThreshold` is `80` (pixels)\n- Useful for excluding tiny decorative or tracking images.\n- To disable size-based filtering and include all images, set `imageThreshold: 0`.\n- `imageDataUrl` (default: `true`) — include base64 data URL string in the result.\n- `imageBuffer` (default: `true`) — include a binary buffer for each image.\n- Extract images from specific pages: `getImage({ partial: [2,4] })`\n\n\n\n### `getTable` — Extract Tabular Data  \n\n```js\nimport { readFile } from 'node:fs/promises';\nimport { PDFParse } from 'pdf-parse';\n\nconst link = new URL('https://mehmet-kozan.github.io/pdf-parse/pdf/simple-table.pdf');\n// const buffer = await readFile('reports/pdf/simple-table.pdf');\n// const parser = new PDFParse({ data: buffer });\n\nconst parser = new PDFParse({ url: link });\nconst result = await parser.getTable();\nawait parser.destroy();\n\n// Pretty-print each row of the first table\nfor (const row of result.pages[0].tables[0]) {\n\tconsole.log(JSON.stringify(row));\n}\n```\n\n## Exception Handling \u0026 Type Usage\n\n```ts\nimport type { LoadParameters, ParseParameters, TextResult } from 'pdf-parse';\nimport { PasswordException, PDFParse, VerbosityLevel } from 
'pdf-parse';\n\nconst loadParams: LoadParameters = {\n\turl: 'https://mehmet-kozan.github.io/pdf-parse/pdf/password-123456.pdf',\n\tverbosity: VerbosityLevel.WARNINGS,\n\tpassword: 'abcdef',\n};\n\nconst parseParams: ParseParameters = {\n\tfirst: 1,\n};\n\n// Initialize the parser class without executing any code yet\nconst parser = new PDFParse(loadParams);\n\nfunction handleResult(result: TextResult) {\n\tconsole.log(result.text);\n}\n\ntry {\n\tconst result = await parser.getText(parseParams);\n\thandleResult(result);\n} catch (error) {\n\t// InvalidPDFException\n\t// PasswordException\n\t// FormatError\n\t// ResponseException\n\t// AbortException\n\t// UnknownErrorException\n\tif (error instanceof PasswordException) {\n\t\tconsole.error('Password must be 123456\\n', error);\n\t} else {\n\t\tthrow error;\n\t}\n} finally {\n\t// Always call destroy() to free memory\n\tawait parser.destroy();\n}\n``` \n\n## Web / Browser \u003ca href=\"https://www.jsdelivr.com/package/npm/pdf-parse\" target=\"_blank\"\u003e\u003cimg align=\"right\" src=\"https://img.shields.io/jsdelivr/npm/hm/pdf-parse\"\u003e\u003c/a\u003e\n\n- Can be integrated into `React`, `Vue`, `Angular`, or any other web framework.\n- **Live Demo:** [`https://mehmet-kozan.github.io/pdf-parse/`](https://mehmet-kozan.github.io/pdf-parse/)\n- **Demo Source:** [`reports/demo`](reports/demo)\n- **ES Module**:  `pdf-parse.es.js` **UMD/Global**: `pdf-parse.umd.js`\n- For browser build, set the `web worker` explicitly.\n\n### CDN Usage\n\n```html\n\u003c!-- ES Module --\u003e\n\u003cscript type=\"module\"\u003e\n\n  import {PDFParse} from 'https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf-parse.es.js';\n  //// Available Worker Files\n  // pdf.worker.mjs\n  // pdf.worker.min.mjs\n  // If you use a custom build or host pdf.worker.mjs yourself, configure worker accordingly.\n  PDFParse.setWorker('https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf.worker.mjs');\n\n  const parser = 
new PDFParse({url:'https://mehmet-kozan.github.io/pdf-parse/pdf/bitcoin.pdf'});\n  const result = await parser.getText();\n\n  console.log(result.text);\n\u003c/script\u003e\n```\n\n**CDN Options: https://www.jsdelivr.com/package/npm/pdf-parse**\n\n- `https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf-parse.es.js`\n- `https://cdn.jsdelivr.net/npm/pdf-parse@2.4.5/dist/pdf-parse/web/pdf-parse.es.js`\n- `https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf-parse.umd.js`\n- `https://cdn.jsdelivr.net/npm/pdf-parse@2.4.5/dist/pdf-parse/web/pdf-parse.umd.js`\n\n**Worker Options:**\n\n- `https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf.worker.mjs`\n- `https://cdn.jsdelivr.net/npm/pdf-parse@latest/dist/pdf-parse/web/pdf.worker.min.mjs`\n\n\n## Similar Packages\n\n* [pdf2json](https://www.npmjs.com/package/pdf2json) — Buggy, memory leaks, uncatchable errors in some PDF files.\n* [pdfdataextract](https://www.npmjs.com/package/pdfdataextract) — `pdf-parse`-based\n* [unpdf](https://www.npmjs.com/package/unpdf) — `pdf-parse`-based\n* [pdf-extract](https://www.npmjs.com/package/pdf-extract) — Non-cross-platform, depends on xpdf\n* [j-pdfjson](https://www.npmjs.com/package/j-pdfjson) — Fork of pdf2json\n* [pdfreader](https://www.npmjs.com/package/pdfreader) — Uses pdf2json\n\n\u003e **Benchmark Note:** The benchmark currently runs only against `pdf2json`. I don't know the current state of `pdf2json` — the original reason for creating `pdf-parse` was to work around stability issues with `pdf2json`. I deliberately did not include `pdf-parse` or other `pdf.js`-based packages in the benchmark because their dependencies conflict. 
If you have recommendations for additional packages to include, please open an issue, see [`benchmark results`](https://mehmet-kozan.github.io/pdf-parse/benchmark.html).\n\n## Supported Node.js Versions (20.x, 22.x, 23.x, 24.x)\n\n- Supported: Node.js 20 (\u003e= 20.16.0), Node.js 22 (\u003e= 22.3.0), Node.js 23 (\u003e= 23.0.0), and Node.js 24 (\u003e= 24.0.0).\n- Not supported: Node.js 21.x, and Node.js 19.x and earlier.\n\nIntegration tests run on Node.js 20–24, see [`test_integration.yml`](./.github/workflows/test_integration.yml).\n\n### Unsupported Node.js Versions (18.x, 19.x, 21.x)\n\nFor these versions, extra configuration is required; see [docs/troubleshooting.md](./docs/troubleshooting.md).\n\n## Worker Configuration \u0026 Troubleshooting\n\nSee [docs/troubleshooting.md](./docs/troubleshooting.md) for detailed troubleshooting steps and worker configuration for Node.js and serverless environments.\n\n- Worker setup for Node.js, Next.js, Vercel, AWS Lambda, Netlify, Cloudflare Workers.\n- Common error messages and solutions.\n- Manual worker configuration for custom builds and Electron/NW.js.\n- Node.js version compatibility.\n\nIf you encounter issues, please refer to the [Troubleshooting Guide](./docs/troubleshooting.md).\n\n## Contributing\n\nWhen opening an issue, please attach the relevant PDF file if possible. Providing the file will help us reproduce and resolve your issue more efficiently. For detailed guidelines on how to contribute, report bugs, or submit pull requests, see: [`contributing to pdf-parse`](https://github.com/mehmet-kozan/pdf-parse?tab=contributing-ov-file#contributing-to-pdf-parse)\n\n\n\n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmehmet-kozan%2Fpdf-parse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmehmet-kozan%2Fpdf-parse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmehmet-kozan%2Fpdf-parse/lists"}