https://github.com/jogemu/pdf2tree
Parse PDF and group elements based on enclosing lines. A node.js module that promisifies the pdf2json parser and structures the data in a way that is suitable for tables with merged cells.
data-table hierarchical-data merged-table-cells pdf-parser tree-structure
- Host: GitHub
- URL: https://github.com/jogemu/pdf2tree
- Owner: jogemu
- License: unlicense
- Created: 2023-05-13T23:41:41.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-06-11T01:02:46.000Z (almost 2 years ago)
- Last Synced: 2024-03-15T11:04:47.949Z (about 1 year ago)
- Topics: data-table, hierarchical-data, merged-table-cells, pdf-parser, tree-structure
- Language: JavaScript
- Homepage:
- Size: 12.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: LICENSE
README
# pdf2tree
Parse PDF and group elements based on enclosing lines. A node.js module that promisifies the pdf2json parser and structures the data in a way that is suitable for tables with merged cells.

## How to use
After installing [node.js](https://nodejs.org) you can use npm to add pdf2tree to your project folder.

```sh
npm install pdf2tree
```

When you create a new parser object as shown below, any constructor parameters are passed on to the [pdf2json](https://github.com/modesty/pdf2json) parser.

```javascript
import PDF2Tree from 'pdf2tree'

// Constructor arguments, if any, are forwarded to pdf2json
let pdf2tree = new PDF2Tree()
```

Then you can set the following pdf2tree-specific parameters.

```javascript
pdf2tree.maxStrokeWidth = 1
pdf2tree.maxGapWidth = 0.1
```

Finally, parsing can start either with a filepath or a buffer.

```javascript
// Both calls return a promise
pdf2tree.loadPDF(PDFpath)
pdf2tree.parseBuffer(PDFbuffer)
```

The promise resolves to a JSON object as documented in [pdf2json](https://github.com/modesty/pdf2json), with an additional `Tree` property. To keep the example readable, a placeholder such as `<A>` represents an object like the ones pdf2json provides for every Page, except that each object contains only the elements enclosed by the surrounding lines, i.e. `{ ..., Texts: [ { x, y, ..., R: [ { T: 'str', ... } ] } ], ... }`.

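Because both entry points return a promise, the two calls above can be wrapped in one small helper. This is only a sketch: `parseToTree` is a hypothetical name that is not part of pdf2tree, and it assumes, as described above, that `loadPDF` takes a file path and `parseBuffer` takes a buffer.

```javascript
// Hypothetical helper (not part of pdf2tree): resolves to the Tree property
// for either a file path (string) or a buffer.
async function parseToTree(parser, source) {
  const result = typeof source === 'string'
    ? await parser.loadPDF(source)      // treated as a file path
    : await parser.parseBuffer(source)  // treated as a buffer
  return result.Tree
}

// Possible usage with a real parser instance ('table.pdf' is a placeholder path):
// const tree = await parseToTree(new PDF2Tree(), 'table.pdf')
```

Passing the parser in as an argument keeps the helper independent of how the parser was configured.
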
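The resolved object has the following shape, using the placeholder notation described above:

```javascript
{
  ...
  Tree: [
    [
      <Page 1>,
      [
        [ <A>, <B>, <C>, <D> ],
        [ <X>, <1>, <2>, <3> ],
        [
          <Y>,
          [
            [ <5>, <6>, <7> ],
            [ <8>, <9> ]
          ]
        ]
      ]
    ],
    [
      <Page 2>,
      [
        [ <TITLE> ],
        [
          <Z>,
          [
            [
              <F>,
              <G>,
              [
                [ <H> ],
                [ <I> ]
              ]
            ],
            [ <J>, <?> ],
            [ <K>, <?> ]
          ]
        ]
      ]
    ]
  ]
}
```

For content structured like this:
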
```
Page 1
+---+---+---+---+
| A | B | C | D |
+---+---+---+---+
| X | 1 | 2 | 3 |
+---+---+---+---+
|   | 5 | 6 | 7 |
| Y +---+---+---+
|   | 8 |   9   |
+---+---+-------+

Page 2
+---+---+---+---+
|     TITLE     |
+---+---+---+---+
|   |   |   | H |
|   | F | G +---+
|   |   |   | I |
| Z +---+---+---+
|   | J |       |
|   +---+   ?   |
|   | K |       |
+---+---+-------+
```

If a cell is not rectangular, or merges rows that the cell to its left did not also merge, the resulting tree might contain errors. Representing such cells correctly would require a data structure that allows traversing the neighborhood with `.right` or `.below` and that can include loops for non-rectangular areas. It should be easier to fix those special cases after parsing.
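
Once a file is parsed, the nested arrays in `Tree` can be walked recursively. As a rough sketch, the helpers below replace every placeholder object with its plain text while keeping the nesting intact. The names `cellText`, `treeToText` and `cell` are hypothetical, and the sample data is hand-written rather than real parser output. The code also assumes that pdf2json URI-encodes the `T` strings, which may not hold for every pdf2json version; drop the `decodeURIComponent` call if yours does not.

```javascript
// Collect the text of one placeholder object such as <A>.
// Assumption: Texts[].R[].T holds URI-encoded strings, as in pdf2json output.
function cellText(cell) {
  return (cell.Texts || [])
    .map(text => (text.R || []).map(run => decodeURIComponent(run.T)).join(''))
    .join(' ')
}

// Recursively replace every object in a Tree (or subtree) with its text,
// leaving the array nesting unchanged.
function treeToText(node) {
  return Array.isArray(node) ? node.map(treeToText) : cellText(node)
}

// Hand-written stand-in for some rows of page 1 (not real parser output).
const cell = str => ({ Texts: [{ x: 0, y: 0, R: [{ T: encodeURIComponent(str) }] }] })
const sample = [
  [cell('A'), cell('B'), cell('C'), cell('D')],
  [cell('Y'), [[cell('5'), cell('6'), cell('7')], [cell('8'), cell('9')]]]
]

console.log(JSON.stringify(treeToText(sample)))
// → [["A","B","C","D"],["Y",[["5","6","7"],["8","9"]]]]
```

Because the nesting is preserved, a merged cell such as `Y` still appears next to the array of rows it spans, so the special cases mentioned above can be patched on the plain-text result.
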