{"id":15369766,"url":"https://github.com/electrovir/pdf-text-reader","last_synced_at":"2025-04-11T04:14:54.370Z","repository":{"id":42467709,"uuid":"234206522","full_name":"electrovir/pdf-text-reader","owner":"electrovir","description":"Dead simple pdf text reader","archived":false,"fork":false,"pushed_at":"2024-05-08T14:27:55.000Z","size":1330,"stargazers_count":39,"open_issues_count":2,"forks_count":5,"subscribers_count":1,"default_branch":"dev","last_synced_at":"2025-04-11T02:12:56.155Z","etag":null,"topics":["npm","pdf","pdf-reader"],"latest_commit_sha":null,"homepage":"https://electrovir.github.io/pdf-text-reader/","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/electrovir.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE-CC0","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-16T01:16:45.000Z","updated_at":"2025-04-08T22:52:26.000Z","dependencies_parsed_at":"2024-06-19T03:56:10.593Z","dependency_job_id":"2ca3e3ed-b203-44b1-9328-2e8faaaef047","html_url":"https://github.com/electrovir/pdf-text-reader","commit_stats":{"total_commits":19,"total_committers":3,"mean_commits":6.333333333333333,"dds":"0.10526315789473684","last_synced_commit":"a459f889ea694d5cb026770dcf0d5e2204caa6c3"},"previous_names":[],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/electrovir%2Fpdf-text-reader","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/electrovir%2Fpdf-text-reader/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elect
rovir%2Fpdf-text-reader/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/electrovir%2Fpdf-text-reader/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/electrovir","download_url":"https://codeload.github.com/electrovir/pdf-text-reader/tar.gz/refs/heads/dev","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248328163,"owners_count":21085261,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["npm","pdf","pdf-reader"],"created_at":"2024-10-01T13:38:17.086Z","updated_at":"2025-04-11T04:14:54.340Z","avatar_url":"https://github.com/electrovir.png","language":"TypeScript","readme":"# PDF Text Reader\n\nDead simple PDF text reader for Node.js. Uses Mozilla's [`pdfjs-dist`](https://www.npmjs.com/package/pdfjs-dist) package.\n\nRequires ESM and Node.js v22 or greater. 
(These are requirements from Mozilla's `pdfjs-dist` package itself.)\n\n# Install\n\n```\nnpm install pdf-text-reader\n```\n\n# Usage\n\n-   Read all pages into a single string with `readPdfText`:\n\n    \u003c!-- example-link: src/readme-examples/read-pdf-text.example.ts --\u003e\n\n    ```TypeScript\n    import {readPdfText} from 'pdf-text-reader';\n\n    async function main() {\n        const pdfText: string = await readPdfText({url: 'path/to/pdf/file.pdf'});\n        console.info(pdfText);\n    }\n\n    main();\n    ```\n\n-   Read a PDF into individual pages with `readPdfPages`:\n    \u003c!-- example-link: src/readme-examples/read-pdf-pages.example.ts --\u003e\n\n    ```TypeScript\n    import {readPdfPages} from 'pdf-text-reader';\n\n    async function main() {\n        const pages = await readPdfPages({url: 'path/to/pdf/file.pdf'});\n        console.info(pages[0]?.lines);\n    }\n\n    main();\n    ```\n\nSee [the types](https://github.com/electrovir/pdf-text-reader/tree/master/src/read-pdf.ts) for detailed argument and return value types.\n\n# Details\n\nThis package simply reads the output of `pdfjs.getDocument` and sorts it into lines based on text position in the document. It also inserts spaces between pieces of text on the same line that are far apart horizontally, and new lines between lines that are far apart vertically.\n\nExample:\n\nThe cells below will be read with spaces between them even if the PDF itself contains no space characters.\n\n```\ncell 1               cell 2                 cell 3\n```\n\nThe number of spaces to insert is calculated by an extremely naive but very simple calculation of `Math.ceil(distance-between-text/text-height)`.\n\n# Low Level Control\n\nIf you need lower level parsing control, you can also use the exported `parsePageItems` function. This only reads one page at a time as seen below. 
This function is used by `readPdfPages` so the output will be identical for the same pdf page.\n\nYou may need to independently install the [`pdfjs-dist`](https://www.npmjs.com/package/pdfjs-dist) npm package for this to work.\n\n\u003c!-- example-link: src/readme-examples/lower-level-controls.example.ts --\u003e\n\n```TypeScript\nimport * as pdfjs from 'pdfjs-dist';\nimport type {TextItem} from 'pdfjs-dist/types/src/display/api';\nimport {parsePageItems} from 'pdf-text-reader';\n\nasync function main() {\n    const doc = await pdfjs.getDocument('myDocument.pdf').promise;\n    const page = await doc.getPage(1); // 1-indexed\n    const content = await page.getTextContent();\n    const items: TextItem[] = content.items.filter((item): item is TextItem =\u003e 'str' in item);\n    const parsedPage = parsePageItems(items);\n    console.info(parsedPage.lines);\n}\n\nmain();\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felectrovir%2Fpdf-text-reader","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felectrovir%2Fpdf-text-reader","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felectrovir%2Fpdf-text-reader/lists"}