https://github.com/SimpleApp/PDFParser

Swift PDFParser for PDF parsing and text mining. Includes a TrueType font parser
pdf-parser swift truetype


Host: GitHub
URL: https://github.com/SimpleApp/PDFParser
Owner: SimpleApp
Created: 2018-07-03T10:03:39.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2019-08-05T18:34:21.000Z (almost 6 years ago)
Last Synced: 2024-08-09T17:30:08.145Z (11 months ago)
Topics: pdf-parser, swift, truetype
Language: Swift
Homepage:
Size: 146 KB
Stars: 35
Watchers: 5
Forks: 10
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# PDFParser

A pure Swift library for extracting text information from PDF files, such as text blocks with their coordinates and font information. It also includes a TrueType font parser for glyph width computation.

Parsing code based on PDFKitten https://github.com/KurtCode/PDFKitten

TrueType parser based on http://stevehanov.ca/blog/index.php?id=143
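As a rough illustration of what glyph width computation involves: TrueType stores each glyph's advance width in font units (the `hmtx` table) and the number of font units per em in the `head` table, so a width in points is the advance width scaled by font size over units per em. The sketch below assumes hypothetical names (`FontMetrics`, `glyphWidth`) that are not taken from this library, and it simplifies by assuming one advance width entry per glyph.

```Swift
// Hypothetical types for illustration only; this is not PDFParser's API.
struct FontMetrics {
    let unitsPerEm: Int       // from the TrueType 'head' table
    let advanceWidths: [Int]  // from the 'hmtx' table, indexed by glyph ID (simplified)
}

/// Returns the glyph's advance width in points at the given font size,
/// or nil if the glyph ID is out of range or the metrics are invalid.
func glyphWidth(glyphID: Int, fontSize: Double, metrics: FontMetrics) -> Double? {
    guard metrics.unitsPerEm > 0,
          metrics.advanceWidths.indices.contains(glyphID) else {
        return nil
    }
    // Scale from font units to points: advance * size / unitsPerEm.
    return Double(metrics.advanceWidths[glyphID]) * fontSize / Double(metrics.unitsPerEm)
}

// Made-up sample metrics, not read from a real font file.
let sampleMetrics = FontMetrics(unitsPerEm: 2048, advanceWidths: [1024, 512, 1366])
let sampleWidth = glyphWidth(glyphID: 0, fontSize: 12, metrics: sampleMetrics)
```

With these made-up numbers, glyph 0 is half an em wide, so at 12 pt it advances 6 points.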

Parsing is kept deliberately simple: it returns text blocks as structs, which custom code can index later.

A simple indexer is provided. It assumes a single-column layout and aggregates words.

```Swift
// The indexer receives the text blocks found by the parser.
var documentIndexer = SimpleDocumentIndexer()

// Locate the sample PDF bundled with the app.
let documentPath = Bundle.main.path(forResource: "Kurt the Cat", ofType: "pdf", inDirectory: nil, forLocalization: nil)

// try! and the force-unwraps keep the sample short; handle failures properly in real code.
let parser = try! Parser(documentURL: URL(fileURLWithPath: documentPath!), delegate: self, indexer: documentIndexer)
parser.parse()

print("All text blocks, raw dump:\n")
print(documentIndexer.pageIndexes[1]!.textBlocks)

print("\nWords per line:\n")
print(documentIndexer.pageIndexes[1]!.allLinesDescription())
```
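The single-column assumption can be sketched independently of the library. The example below is not the `SimpleDocumentIndexer` implementation; it is a minimal sketch with hypothetical names (`Block`, `groupIntoLines`, `tolerance`) that groups positioned text fragments into lines by baseline and orders each line left to right, assuming PDF-style coordinates where y grows upward.

```Swift
// Hypothetical stand-in for a parsed text block; PDFParser's own type may differ.
struct Block {
    let text: String
    let x: Double  // left edge
    let y: Double  // baseline, with y growing upward as in PDF page space
}

/// Groups blocks whose baselines lie within `tolerance` of the line's first block,
/// returning lines from top to bottom with blocks ordered left to right.
func groupIntoLines(_ blocks: [Block], tolerance: Double = 2.0) -> [[Block]] {
    // Larger y comes first, so the result reads top to bottom.
    let sorted = blocks.sorted { $0.y > $1.y }
    var lines: [[Block]] = []
    for block in sorted {
        if let first = lines.last?.first, abs(first.y - block.y) <= tolerance {
            lines[lines.count - 1].append(block)
        } else {
            lines.append([block])
        }
    }
    return lines.map { line in line.sorted { $0.x < $1.x } }
}

// Made-up sample data for illustration.
let sampleBlocks = [
    Block(text: "world", x: 60, y: 700),
    Block(text: "Hello", x: 10, y: 700.5),
    Block(text: "Second", x: 10, y: 680),
]
let sampleLines = groupIntoLines(sampleBlocks).map { line in
    line.map { $0.text }.joined(separator: " ")
}
```

A multi-column page would defeat this approach, because fragments from different columns share baselines and would be merged into a single line.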

The ViewController in the DemoApp displays a UILabel for each text block. This lets you check whether the frames of the text blocks returned by the parser are correct.

> This code is not ready for production. Use at your own risk.

> This code is probably way too unoptimized to be used for anything latency-sensitive. It was meant to be easy to understand and correct first and foremost.
