Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/rostrovsky/pdf-table

Java utility for parsing PDF tabular data using Apache PDFBox and OpenCV
https://github.com/rostrovsky/pdf-table

java-library java8 opencv opencv3 pdf-parsing pdfbox table tables

Last synced: about 1 month ago
JSON representation

Java utility for parsing PDF tabular data using Apache PDFBox and OpenCV

Awesome Lists containing this project

README

        

= PDF-table
:toc:

== What is PDF-table?
PDF-table is Java utility library that can be used for parsing tabular data in PDF documents. +
Core processing of PDF documents is performed with utilization of *Apache PDFBox* and *OpenCV*.

== Prerequisites

=== JDK

JAVA 8 is required.

=== External dependencies

pdf-table requires compiled *OpenCV 3.4.2* to work properly:

. Download OpenCV v3.4.2 from https://github.com/opencv/opencv/releases/tag/3.4.2
. Unpack it and add to your system PATH:
* Windows: `\build\java\x64`
* Linux: `TODO`

== Installation
[source, xml]
----

com.github.rostrovsky
pdf-table
1.0.0

----

== Usage

=== Parsing PDFs
When PDF document page is being parsed, following operations are performed:

. Page is converted to grayscale image [OpenCV].
. Binary Inverted Threshold (BIT) is applied to grayscaled image [OpenCV].
. Contours are detected on BIT image and contour mask is created (additional Canny filtering can be turned on in this step) [OpenCV].
. Contour mask is XORed with BIT image [OpenCV].
. Contours are detected once again on XORed image (additional Canny filtering can be turned on in this step) [OpenCV].
. Final contours are drawn [OpenCV].
. Bounding rectangles are detected from final contours [OpenCV].
. PDF is being parsed region-by-region using bounding rectangles coordinates [Apache PDFBox].

Above algorithm is mostly derived from http://stackoverflow.com/a/23106594.

For more information about parsed output, refer to <>

==== single-threaded example
[source, java]
----
class SingleThreadParser {
public static void main(String[] args) throws IOException {
PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();
List parsed = reader.parsePdfTablePages(pdfDoc, 1, pdfDoc.getNumberOfPages());
}
}
----

==== multi-threaded example
[source, java]
----
class MultiThreadParser {
public static void main(String[] args) throws IOException {
final int THREAD_COUNT = 8;
PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();

// parse pages simultaneously
ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
List> futures = new ArrayList<>();
for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
Callable callable = () -> {
ParsedTablePage page = reader.parsePdfTablePage(pdfDoc, pageNum);
return page;
};
futures.add(executor.submit(callable));
}

// collect parsed pages
List unsortedParsedPages = new ArrayList<>(pdfDoc.getNumberOfPages());
try {
for (Future f : futures) {
ParsedTablePage page = f.get();
unsortedParsedPages.add(page.getPageNum() - 1, page);
}
} catch (Exception e) {
throw new RuntimeException(e);
}

// sort pages by pageNum
List sortedParsedPages = unsortedParsedPages.stream()
.sorted((p1, p2) -> Integer.compare(p1.getPageNum(), p2.getPageNum())).collect(Collectors.toList());
}
}
----

=== Saving PDF pages as PNG images
PDF-Table provides methods for saving PDF pages as PNG images. +
Rendering DPI can be modified in `PdfTableSettings` (see: <>).

==== single-threaded example
[source, java]
----
class SingleThreadPNGDump {
public static void main(String[] args) throws IOException {
PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
Path outputPath = Paths.get("C:", "some_directory");
PdfTableReader reader = new PdfTableReader();
reader.savePdfPagesAsPNG(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
}
}
----

==== multi-threaded example
[source, java]
----
class MultiThreadPNGDump {
public static void main(String[] args) throws IOException {
final int THREAD_COUNT = 8;
Path outputPath = Paths.get("C:", "some_directory");
PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();

ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
List> futures = new ArrayList<>();
for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
Callable callable = () -> {
reader.savePdfPageAsPNG(pdfDoc, pageNum, outputPath);
return true;
};
futures.add(executor.submit(callable));
}

try {
for (Future f : futures) {
f.get();
}
} catch (Exception e) {
throw new RuntimeException(e);
}
}
}
----

=== Saving debug PNG images
When tables in PDF document cannot be parsed correctly with default settings, user can save debug images that show page
at various stages of processing. +
Using these images, user can adjust `PdfTableSettings` accordingly to achieve desired results
(see: <>).

==== single-threaded example
[source, java]
----
class SingleThreadDebugImgsDump {
public static void main(String[] args) throws IOException {
PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
Path outputPath = Paths.get("C:", "some_directory");
PdfTableReader reader = new PdfTableReader();
reader.savePdfTablePagesDebugImages(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
}
}
----

==== multi-threaded example
[source, java]
----
class MultiThreadDebugImgsDump {
public static void main(String[] args) throws IOException {
final int THREAD_COUNT = 8;
Path outputPath = Paths.get("C:", "some_directory");
PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();

ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
List> futures = new ArrayList<>();
for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
Callable callable = () -> {
reader.savePdfTablePagesDebugImage(pdfDoc, pageNum, outputPath);
return true;
};
futures.add(executor.submit(callable));
}

try {
for (Future f : futures) {
f.get();
}
} catch (Exception e) {
throw new RuntimeException(e);
}
}
}
----

=== Parsing settings

PDF rendering and OpenCV filtering settings are stored in `PdfTableSettings` object.

Custom settings instance can be passed to `PdfTableReader` constructor when non-default values are needed:

[source, java]
----
(...)

// build settings object
PdfTableSettings settings = PdfTableSettings.getBuilder()
.setCannyFiltering(true)
.setCannyApertureSize(5)
.setCannyThreshold1(40)
.setCannyThreshold2(190.5)
.setPdfRenderingDpi(160)
.build();

// pass settings to reader
PdfTableReader reader = new PdfTableReader(settings);
----

=== Output format
Each parsed PDF page is being returned as `ParsedTablePage` object:
[source, java]
----
(...)

PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();

// first page in document has index == 1, not 0 !
ParsedTablePage firstPage = reader.parsePdfTablePage(pdfDoc, 1);

// getting page number
assert firstPage.getPageNum() == 1;

// rows and cells are zero-indexed just like elements of the List
// getting first row
ParsedTablePage.ParsedTableRow firstRow = firstPage.getRow(0);

// getting third cell in second row
String thirdCellContent = firstPage.getRow(1).getCell(2);

// cell content usually contain characters,
// so it is recommended to trim them before processing
double thirdCellNumericValue = Double.valueOf(thirdCellContent.trim());
----