https://github.com/hubgit/pdf-data
- Host: GitHub
- URL: https://github.com/hubgit/pdf-data
- Owner: hubgit
- Created: 2018-01-20T07:59:04.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2019-01-15T11:50:15.000Z (about 7 years ago)
- Last Synced: 2025-03-11T11:08:50.017Z (11 months ago)
- Topics: pdf, pdfjs
- Language: JavaScript
- Size: 185 KB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
1. Collect the text divs from each page in reading order. This is done as the divs render, but they could also be selected afterwards if they do not render in the right order.
1. Calculate the most frequent left and right edges, and exclude divs that start or end outside those boundaries.
1. Group divs into blocks by vertical position and font size changes. Open question: detect paragraphs using vertical gaps (other than those at the start or end of a page), lines that finish before the right edge, and/or leading indentation? TODO: handle tables, lists, etc.
1. (optional, for diffing) Split each block into sentences and combine into a single piece of text with newline separators between blocks.
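
The steps above could be sketched roughly as follows. This is only an illustration under assumptions, not the code in this repository: the item shape (`text`, `left`, `right`, `top`, `fontSize`), the function names, the `tolerance` and `maxGapRatio` parameters, and the naive sentence splitter are all hypothetical. It also assumes items arrive in reading order (step 1), that `top` grows downward, and that "newline separators between blocks" means a blank line between blocks.

```javascript
// Hypothetical item shape (not the pdf.js API):
// { text, left, right, top, fontSize }, with `top` increasing down the page.

// Return the most common value in a list, after rounding to whole units.
// Ties go to the value seen first.
function mostFrequent(values) {
  const counts = new Map();
  for (const value of values) {
    const key = Math.round(value);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  let best = null;
  let bestCount = 0;
  for (const [key, count] of counts) {
    if (count > bestCount) {
      best = key;
      bestCount = count;
    }
  }
  return best;
}

// Step 2: drop items that start left of the most frequent left edge
// or end right of the most frequent right edge.
function filterByEdges(items, tolerance = 1) {
  const left = mostFrequent(items.map((item) => item.left));
  const right = mostFrequent(items.map((item) => item.right));
  return items.filter(
    (item) => item.left >= left - tolerance && item.right <= right + tolerance
  );
}

// Step 3: start a new block when the font size changes or when the
// vertical gap to the previous item is larger than a normal line step.
function groupIntoBlocks(items, maxGapRatio = 1.5) {
  const blocks = [];
  let previous = null;
  for (const item of items) {
    const continues =
      previous !== null &&
      item.fontSize === previous.fontSize &&
      item.top - previous.top >= 0 &&
      item.top - previous.top <= previous.fontSize * maxGapRatio;
    if (continues) {
      blocks[blocks.length - 1].push(item);
    } else {
      blocks.push([item]);
    }
    previous = item;
  }
  return blocks;
}

// Step 4: one sentence per line, with a blank line between blocks.
// The sentence split is naive: it breaks after ., ! or ? followed by whitespace.
function blocksToText(blocks) {
  return blocks
    .map((block) => {
      const text = block.map((item) => item.text.trim()).join(' ');
      return text.split(/(?<=[.!?])\s+/).join('\n');
    })
    .join('\n\n');
}
```

Chaining the three functions, `blocksToText(groupIntoBlocks(filterByEdges(items)))`, gives text that a line-based diff tool can compare sentence by sentence. Tables and lists are not handled, in line with the TODO above.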