{"id":23159211,"url":"https://github.com/jogemu/pdf2tree","last_synced_at":"2025-10-13T23:26:32.967Z","repository":{"id":164988078,"uuid":"640391646","full_name":"jogemu/pdf2tree","owner":"jogemu","description":"Parse PDF and group elements based on enclosing lines. A node.js module that promisifies the pdf2json parser and structures the data in a way that is suitable for tables with merged cells.","archived":false,"fork":false,"pushed_at":"2023-06-11T01:02:46.000Z","size":13,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-13T23:26:29.737Z","etag":null,"topics":["data-table","hierarchical-data","merged-table-cells","pdf-parser","tree-structure"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"unlicense","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jogemu.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-13T23:41:41.000Z","updated_at":"2023-06-15T12:38:55.000Z","dependencies_parsed_at":null,"dependency_job_id":"463f3c8e-78e4-4d8e-bae0-181d3a022fc4","html_url":"https://github.com/jogemu/pdf2tree","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jogemu/pdf2tree","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jogemu%2Fpdf2tree","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jogemu%2Fpdf2tree/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jogemu%2Fpdf2tree/releases","manifests_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jogemu%2Fpdf2tree/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jogemu","download_url":"https://codeload.github.com/jogemu/pdf2tree/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jogemu%2Fpdf2tree/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279017243,"owners_count":26086015,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-13T02:00:06.723Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-table","hierarchical-data","merged-table-cells","pdf-parser","tree-structure"],"created_at":"2024-12-17T22:33:29.177Z","updated_at":"2025-10-13T23:26:32.950Z","avatar_url":"https://github.com/jogemu.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pdf2tree\nParse PDF and group elements based on enclosing lines. 
A Node.js module that promisifies the pdf2json parser and structures the data in a way that is suitable for tables with merged cells.\n\n## How to use\nAfter installing [Node.js](https://nodejs.org), you can use npm to add pdf2tree to your project folder.\n\n    npm install pdf2tree\n\nWhen you create a new parser object as shown below, any constructor parameters are passed through to the [pdf2json](https://github.com/modesty/pdf2json) parser.\n\n    import PDF2Tree from 'pdf2tree'\n    let pdf2tree = new PDF2Tree()\n\nThen you can set the following pdf2tree-specific parameters.\n\n    pdf2tree.maxStrokeWidth = 1\n    pdf2tree.maxGapWidth = 0.1\n\nFinally, start parsing with either a file path or a buffer.\n\n    pdf2tree.loadPDF(PDFpath)\n    pdf2tree.parseBuffer(PDFbuffer)\n\nThe promise resolves to a JSON object as documented in [pdf2json](https://github.com/modesty/pdf2json), with an additional `Tree` property. For readability, `\u003cstr\u003e` stands for an object like the ones pdf2json provides for every page, except that each object contains only the elements enclosed by the lines, i.e. `{ ..., Texts: [ { x, y, ..., R: [ { T: 'str', ... } ] } ], ... 
}`.\n\n    {\n      ...\n      Tree: [\n        [\n          \u003cPage 1\u003e,\n          [\n            [ \u003cA\u003e, \u003cB\u003e, \u003cC\u003e, \u003cD\u003e ],\n            [ \u003cX\u003e, \u003c1\u003e, \u003c2\u003e, \u003c3\u003e ],\n            [ \n              \u003cY\u003e,\n              [\n                [ \u003c5\u003e, \u003c6\u003e, \u003c7\u003e ],\n                [ \u003c8\u003e, \u003c9\u003e ]\n              ]\n            ]\n          ]\n        ],\n        [\n          \u003cPage 2\u003e,\n          [\n            [ \u003cTITLE\u003e ],\n            [\n              \u003cZ\u003e, \n              [\n                [\n                  \u003cF\u003e,\n                  \u003cG\u003e,\n                  [\n                    [ \u003cH\u003e ],\n                    [ \u003cI\u003e ]\n                  ]\n                ],\n                [ \u003cJ\u003e, \u003c?\u003e ],\n                [ \u003cK\u003e, \u003c?\u003e ]\n              ]\n            ]\n          ]\n        ]\n      ]\n    }\n\nFor content structured like this:\n\n    Page 1\n\n    +---+---+---+---+\n    | A | B | C | D |\n    +---+---+---+---+\n    | X | 1 | 2 | 3 |\n    +---+---+---+---+\n    |   | 5 | 6 | 7 |\n    | Y +---+---+---+\n    |   | 8 |   9   |\n    +---+---+-------+\n\n    Page 2\n    \n    +---+---+---+---+\n    |     TITLE     |\n    +---+---+---+---+\n    |   |   |   | H |\n    |   | F | G +---+\n    |   |   |   | I |\n    | Z +---+---+---+\n    |   | J |       |\n    |   +---+   ?   |\n    |   | K |       |\n    +---+---+-------+\n\nIf a cell is not rectangular, or merges rows that the cell to its left did not also merge, the resulting tree might contain errors. Representing such cases would require a data structure that allows traversing the neighborhood with `.right` or `.below` and can include loops for non-rectangular areas. 
It should be easier to fix those special cases after parsing.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjogemu%2Fpdf2tree","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjogemu%2Fpdf2tree","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjogemu%2Fpdf2tree/lists"}