Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/nichtich/marginalia
Extract Annotations from PDF files
https://github.com/nichtich/marginalia
Last synced: 2 months ago
JSON representation
Extract Annotations from PDF files
- Host: GitHub
- URL: https://github.com/nichtich/marginalia
- Owner: nichtich
- Created: 2010-11-04T00:07:48.000Z (about 14 years ago)
- Default Branch: master
- Last Pushed: 2010-11-17T22:53:52.000Z (about 14 years ago)
- Last Synced: 2024-10-15T13:24:31.169Z (3 months ago)
- Language: Java
- Homepage: https://github.com/nichtich/marginalia/wiki/Support-of-PDF-annotations
- Size: 331 KB
- Stars: 19
- Watchers: 7
- Forks: 1
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Name
Marginalia - extract annotations from PDF files# Description
This is just an experiment with PDF file format, iText, and maven# Background
> We are using the computer as a paper simulator, which is like tearing
> the wings off a 747 and driving it as a bus on the highway.
> -- Ted NelsonPDF files are crap, but we somehow have to live with them for a while.
This application contains some experiment in at least making use of
annotations in PDF files. Annotations should be publications of their
own, but they are hidden in PDF or other proprietary software.There is free software, open source PDF reader that supports annotations.
There is no eBook-reader or similar device that is a convenient to use
to make annotations, as a physical piece of paper and a pen. So far.To get into the PDF file format, which in fact is more like a
database or a file system, you need PDF parser libraries, like
[iText](http://www.itextpdf.com/).## PDF Reference
[Adobe PDF Reference Archives](http://www.adobe.com/devnet/pdf/pdf_reference_archive.html).
PDF annotations are defined in chapter 8.4 of the PDF reference (2006),
page 604-647.
[see here](http://www.verypdf.com/document/pdf-format-reference/pg_0604.htm).# Usage of the developer snapshot
First download, clone or fork the project, e.g.
$ git clone git://github.com/nichtich/marginalia.git
$ cd marginaliaYou need maven2 to compile and run this application. Try
$ mvn compile
If you are lucky, maven will install all required dependencies and compile.
You can then run marginalia in the development environment:$ mvn exec:java -Dexec.args="yourfile.pdf"
To create a single jar that includes all dependencies, call:
$ mvn assembly:single
You can then copy the jar file to a place of your choice and run it via:
$ java -jar target/marginalia-0.0.1dev-jar-with-dependencies.jar yourfile.pdf
## Extracting text
To extract annotated text, you can use the pdftotext command line tool from
poppler (maybe I better move from iText to poppler). For instance if you have
an annotation on page 1 with:rect="52.559917,437.8619,286.3729,528.27844"
and page size is 595 x 842 pts (A4). Then the crop area can be calculated
with `x = 52`, `y = 842 - 528.27 = 313`, `W = 233`, `H = 91`.$ pdftotext -layout -nopgbrk -f 1 -l 1 -x 52 -y 313 -W 233 -H 91 your.pdf && cat your.txt
Sure this should be automized, and it does not cover details.
## Converting annotations to SVG
The XML output contains some custom Marginalia elements. With these and the
script `marginalia2svg.xsl` you can convert extracted annotations to SVG.$ xsltproc marginalia2svg.xsl youroutput.xml
# Author
Jakob Voss