https://github.com/jogemu/pdf2tree

Parse PDF and group elements based on enclosing lines. A node.js module that promisifies the pdf2json parser and structures the data in a way that is suitable for tables with merged cells.

Host: GitHub
URL: https://github.com/jogemu/pdf2tree
Owner: jogemu
License: unlicense
Created: 2023-05-13T23:41:41.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2023-06-11T01:02:46.000Z (about 2 years ago)
Last Synced: 2025-05-07T19:19:43.152Z (2 months ago)
Topics: data-table, hierarchical-data, merged-table-cells, pdf-parser, tree-structure
Language: JavaScript
Homepage:
Size: 12.7 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: LICENSE

# pdf2tree

Parse PDF and group elements based on enclosing lines. A node.js module that promisifies the pdf2json parser and structures the data in a way that is suitable for tables with merged cells.

## How to use

After installing [Node.js](https://nodejs.org), use npm to add pdf2tree to your project folder.

    npm install pdf2tree

Create a new parser object as shown below. Any constructor parameters are passed on to the [pdf2json](https://github.com/modesty/pdf2json) parser.

    import PDF2Tree from 'pdf2tree'

    let pdf2tree = new PDF2Tree()

Then you can set the following pdf2tree-specific parameters.

    pdf2tree.maxStrokeWidth = 1

    pdf2tree.maxGapWidth = 0.1

Finally, start parsing from either a file path or a buffer.

    pdf2tree.loadPDF(PDFpath)

    pdf2tree.parseBuffer(PDFbuffer)
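
Since both calls are described as returning a promise, they can be awaited. The following is a minimal sketch, not part of pdf2tree: `parsePdf` is a hypothetical helper that takes an already configured parser and chooses the entry point based on the input type. It assumes only that `loadPDF` and `parseBuffer` exist and return promises, as described here.

```javascript
// Hypothetical helper (not part of pdf2tree).
// Assumes parser.loadPDF(path) and parser.parseBuffer(buffer) return promises.
async function parsePdf(parser, source) {
  // Buffers go to parseBuffer, anything else is treated as a file path.
  return Buffer.isBuffer(source)
    ? parser.parseBuffer(source)
    : parser.loadPDF(source)
}
```

Inside an async function or an ES module this could be used as `const result = await parsePdf(pdf2tree, PDFpath)`.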

The promise resolves to a JSON object as documented in [pdf2json](https://github.com/modesty/pdf2json), with an additional `Tree` property. For readability, a placeholder such as `<A>` stands for an object like the one pdf2json provides for every page, except that it contains only the elements enclosed by the same lines, i.e. `{ ..., Texts: [ { x, y, ..., R: [ { T: 'str', ... } ] } ], ... }`.

    {
      ...
      Tree: [
        [
          <Page 1>,
          [
            [ <A>, <B>, <C>, <D> ],
            [ <X>, <1>, <2>, <3> ],
            [
              <Y>,
              [
                [ <5>, <6>, <7> ],
                [ <8>, <9> ]
              ]
            ]
          ]
        ],
        [
          <Page 2>,
          [
            [ <TITLE> ],
            [
              <Z>,
              [
                [
                  <F>,
                  <G>,
                  [
                    [ <H> ],
                    [ <I> ]
                  ]
                ],
                [ <J>, <?> ],
                [ <K>, <?> ]
              ]
            ]
          ]
        ]
      ]
    }


The tree above corresponds to content structured like this:

    Page 1
    +---+---+---+---+
    | A | B | C | D |
    +---+---+---+---+
    | X | 1 | 2 | 3 |
    +---+---+---+---+
    |   | 5 | 6 | 7 |
    | Y +---+---+---+
    |   | 8 |   9   |
    +---+---+-------+

    Page 2
    +---+---+---+---+
    |     TITLE     |
    +---+---+---+---+
    |   |   |   | H |
    |   | F | G +---+
    |   |   |   | I |
    | Z +---+---+---+
    |   | J |       |
    |   +---+   ?   |
    |   | K |       |
    +---+---+-------+
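
To inspect the nesting, it can help to reduce such a tree to plain strings. The sketch below is not part of pdf2tree. It assumes that every non-array node is a page-like object with a `Texts` array as outlined above, and that `T` may be URI-encoded, which depends on the pdf2json version. The names `decodeRun`, `leafText` and `treeToText` are made up for this example.

```javascript
// Decode one text run. Falls back to the raw value if it is not valid
// URI-encoded text (assumption: encoding depends on the pdf2json version).
function decodeRun(t) {
  try {
    return decodeURIComponent(t)
  } catch {
    return t
  }
}

// Join all text runs of one leaf object ({ Texts: [ { R: [ { T } ] } ] }).
function leafText(leaf) {
  return (leaf.Texts || [])
    .flatMap(text => text.R || [])
    .map(run => decodeRun(run.T))
    .join(' ')
}

// Replace every leaf object with its text and keep the array nesting.
function treeToText(node) {
  return Array.isArray(node) ? node.map(treeToText) : leafText(node)
}

// Hand-built stand-in for the "Y" row of Page 1, not real parser output.
const cell = s => ({ Texts: [{ R: [{ T: encodeURIComponent(s) }] }] })
const row = [cell('Y'), [[cell('5'), cell('6'), cell('7')], [cell('8'), cell('9')]]]
console.log(JSON.stringify(treeToText(row)))
// ["Y",[["5","6","7"],["8","9"]]]
```

Keeping the arrays untouched means the output can be compared directly with the nesting shown in the `Tree` example.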

If a cell is not rectangular, or merges rows that the cell to its left does not also merge, the resulting tree might contain errors. Representing such layouts correctly would require a data structure that allows traversing the neighborhood with `.right` or `.below` and that can include loops for non-rectangular areas. It should be easier to fix those special cases after parsing.
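
pdf2tree does not provide such a neighborhood structure. Purely as an illustration of the idea, the following sketch links the cells of a plain rectangular grid without merged cells, so that each cell can be traversed with `.right` and `.below`. The function `linkGrid` and the `value` property are hypothetical, and handling merged or non-rectangular cells would need additional logic.

```javascript
// Illustration only: link a grid of values into cells with .right and .below.
// Assumes rows is an array of arrays without merged cells.
function linkGrid(rows) {
  const cells = rows.map(row =>
    row.map(value => ({ value, right: null, below: null }))
  )
  cells.forEach((row, r) => {
    row.forEach((cell, c) => {
      const nextRow = cells[r + 1]
      cell.right = row[c + 1] || null
      cell.below = (nextRow && nextRow[c]) || null
    })
  })
  // Return the top-left cell as the entry point, or null for an empty grid.
  return cells.length > 0 && cells[0].length > 0 ? cells[0][0] : null
}
```

Starting from the returned top-left cell, `start.right.below` and `start.below.right` then refer to the same cell object.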
