Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/SimpleApp/PDFParser
Swift PDFParser for PDF parsing and text mining. Includes a TrueType font parser
pdf-parser swift truetype
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/SimpleApp/PDFParser
- Owner: SimpleApp
- Created: 2018-07-03T10:03:39.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2019-08-05T18:34:21.000Z (over 5 years ago)
- Last Synced: 2024-08-09T17:30:08.145Z (5 months ago)
- Topics: pdf-parser, swift, truetype
- Language: Swift
- Homepage:
- Size: 146 KB
- Stars: 35
- Watchers: 5
- Forks: 10
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# PDFParser
A pure Swift library for extracting text information from PDF files, such as text blocks with coordinates and font information. Also includes a TrueType font parser for glyph width computation.

Parsing code based on PDFKitten: https://github.com/KurtCode/PDFKitten

TrueType parser based on http://stevehanov.ca/blog/index.php?id=143

Parsing is done very simply and returns TextBlocks structs that can later be indexed by custom code.
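
As a rough illustration of what "glyph width computation" involves, the sketch below converts TrueType advance widths (stored in font design units) into points for a given font size. The `GlyphMetrics` type, its fields, and the sample numbers are assumptions made for this example; they are not part of PDFParser's API.

```Swift
// Hypothetical container for the few font metrics needed here; not a PDFParser type.
struct GlyphMetrics {
    /// Font design units per em, as stored in the TrueType `head` table.
    let unitsPerEm: Int
    /// Advance width per character in font design units (from `hmtx`, resolved via `cmap`).
    let advanceWidths: [Character: Int]
    /// Fallback advance for characters missing from the table.
    let defaultAdvance: Int

    /// Width of `text` in points when set at `fontSize` points.
    func width(of text: String, fontSize: Double) -> Double {
        // Sum the advances in design units, then scale by fontSize / unitsPerEm.
        let totalUnits = text.reduce(0) { sum, character in
            sum + (advanceWidths[character] ?? defaultAdvance)
        }
        return Double(totalUnits) / Double(unitsPerEm) * fontSize
    }
}

// Made-up metrics, for illustration only.
let metrics = GlyphMetrics(
    unitsPerEm: 2048,
    advanceWidths: ["A": 1366, "B": 1366, " ": 569],
    defaultAdvance: 1024
)
let width = metrics.width(of: "AB A", fontSize: 12)
```

The sketch ignores kerning, character spacing, and horizontal scaling, all of which a real text-extraction pass may need to account for.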
A simple indexer is provided; it assumes a single-column layout and aggregates words.

```Swift
var documentIndexer = SimpleDocumentIndexer()
let documentPath = Bundle.main.path(forResource: "Kurt the Cat", ofType: "pdf", inDirectory: nil, forLocalization: nil)

// `self` is passed as the delegate, so this snippet is meant to run inside the type that acts as the parser's delegate.
let parser = try! Parser(documentURL: URL(fileURLWithPath: documentPath!), delegate: self, indexer: documentIndexer)
parser.parse()

print("All Text Blocks Raw dump : \n")
print(documentIndexer.pageIndexes[1]!.textBlocks)

print("\nWords per lines : \n")
print(documentIndexer.pageIndexes[1]!.allLinesDescription())
```

ViewController in the DemoApp displays a UILabel for each text block. This lets you check whether the frames of the text blocks returned by the parser are correct.
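
For readers writing their own indexer, here is a minimal sketch of one way positioned text blocks could be grouped into lines under a single-column assumption. It is not the library's `SimpleDocumentIndexer` implementation; the `Block` type, the `groupIntoLines` function, and the tolerance value are hypothetical, and coordinates are assumed to be in PDF's default user space, where y grows upwards.

```Swift
// Hypothetical stand-in for a parsed text block; PDFParser's own struct may differ.
struct Block {
    let text: String
    let x: Double  // left edge, in page coordinates
    let y: Double  // baseline position; y grows upwards
}

/// Groups blocks whose baselines lie within `tolerance` of the first block of a line,
/// returning lines top to bottom with each line's blocks ordered left to right.
func groupIntoLines(_ blocks: [Block], tolerance: Double = 2.0) -> [[Block]] {
    // Highest baseline first, because the top of the page has the largest y.
    let sorted = blocks.sorted { $0.y > $1.y }
    var lines: [[Block]] = []
    for block in sorted {
        if let last = lines.last, let reference = last.first,
           abs(reference.y - block.y) <= tolerance {
            // Close enough to the current line's baseline: same line.
            lines[lines.count - 1].append(block)
        } else {
            // Otherwise start a new line.
            lines.append([block])
        }
    }
    // Within each line, read left to right.
    return lines.map { line in line.sorted { $0.x < $1.x } }
}

// Made-up blocks, deliberately out of reading order.
let blocks = [
    Block(text: "Cat", x: 140, y: 700.5),
    Block(text: "Kurt", x: 72, y: 700),
    Block(text: "sleeps.", x: 72, y: 684),
    Block(text: "the", x: 110, y: 700),
]
let lines = groupIntoLines(blocks).map { line in
    line.map { $0.text }.joined(separator: " ")
}
```

A fixed tolerance is the simplest choice; scaling it with the font size of each block would likely cope better with mixed text sizes.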
> This code is not ready for production. Use at your own risk.
> This code is probably way too unoptimized to be used for anything latency-sensitive. It was meant to be easy to understand and correct first and foremost.