Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/rostrovsky/pdf-table
Java utility for parsing PDF tabular data using Apache PDFBox and OpenCV
java-library java8 opencv opencv3 pdf-parsing pdfbox table tables
- Host: GitHub
- URL: https://github.com/rostrovsky/pdf-table
- Owner: rostrovsky
- License: mit
- Created: 2017-02-19T21:44:09.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2023-05-09T18:44:27.000Z (over 1 year ago)
- Last Synced: 2024-09-30T14:09:55.938Z (about 1 month ago)
- Topics: java-library, java8, opencv, opencv3, pdf-parsing, pdfbox, table, tables
- Language: Java
- Size: 145 KB
- Stars: 69
- Watchers: 7
- Forks: 12
- Open Issues: 2
Metadata Files:
- Readme: README.adoc
- License: LICENSE
README
= PDF-table
:toc:

== What is PDF-table?
PDF-table is a Java utility library for parsing tabular data in PDF documents. +
Core processing of PDF documents is performed with *Apache PDFBox* and *OpenCV*.

== Prerequisites
=== JDK
Java 8 is required.
=== External dependencies
pdf-table requires a compiled build of *OpenCV 3.4.2* to work properly:

. Download OpenCV v3.4.2 from https://github.com/opencv/opencv/releases/tag/3.4.2
. Unpack it and add the following directory to your system PATH:
* Windows: `\build\java\x64`
* Linux: `TODO`

== Installation
[source, xml]
----
<dependency>
    <groupId>com.github.rostrovsky</groupId>
    <artifactId>pdf-table</artifactId>
    <version>1.0.0</version>
</dependency>
----

== Usage
=== Parsing PDFs
When a PDF document page is parsed, the following operations are performed:

. The page is converted to a grayscale image [OpenCV].
. A Binary Inverted Threshold (BIT) is applied to the grayscale image [OpenCV].
. Contours are detected on the BIT image and a contour mask is created (additional Canny filtering can be turned on in this step) [OpenCV].
. The contour mask is XORed with the BIT image [OpenCV].
. Contours are detected once again on the XORed image (additional Canny filtering can be turned on in this step) [OpenCV].
. Final contours are drawn [OpenCV].
. Bounding rectangles are detected from the final contours [OpenCV].
. The PDF is parsed region by region using the coordinates of the bounding rectangles [Apache PDFBox].

The above algorithm is mostly derived from http://stackoverflow.com/a/23106594.

For more information about parsed output, refer to <<Output format>>.
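
To make the first steps above more concrete, here is a minimal plain-Java sketch of the grayscale conversion, the binary inverted threshold and the XOR of two masks, run on a tiny hand-made "page". It is only an illustration: pdf-table performs these steps with OpenCV, and the `ThresholdSketch` class, its method names, the threshold value of 128 and the example contour mask are all assumptions made for this example, not part of the library.

[source, java]
----
import java.util.Arrays;

public class ThresholdSketch {

    // Converts a pixel packed as 0xRRGGBB to a gray level in 0..255 (BT.601 luma weights).
    static int toGray(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
    }

    // Binary inverted threshold: values above the threshold become 0, all others 255.
    static int[][] binaryInvertedThreshold(int[][] gray, int threshold) {
        int[][] out = new int[gray.length][];
        for (int y = 0; y < gray.length; y++) {
            out[y] = new int[gray[y].length];
            for (int x = 0; x < gray[y].length; x++) {
                out[y][x] = gray[y][x] > threshold ? 0 : 255;
            }
        }
        return out;
    }

    // Pixel-wise XOR of two masks with identical dimensions.
    static int[][] xor(int[][] a, int[][] b) {
        int[][] out = new int[a.length][];
        for (int y = 0; y < a.length; y++) {
            out[y] = new int[a[y].length];
            for (int x = 0; x < a[y].length; x++) {
                out[y][x] = a[y][x] ^ b[y][x];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A 2x3 "page": white background with one dark pixel in each row.
        int[][] rgb = {
            {0xFFFFFF, 0x000000, 0xFFFFFF},
            {0xFFFFFF, 0xFFFFFF, 0x202020}
        };
        int[][] gray = new int[rgb.length][rgb[0].length];
        for (int y = 0; y < rgb.length; y++) {
            for (int x = 0; x < rgb[y].length; x++) {
                gray[y][x] = toGray(rgb[y][x]);
            }
        }

        int[][] bit = binaryInvertedThreshold(gray, 128);
        // Hypothetical contour mask that simply marks the first column.
        int[][] mask = {{255, 0, 0}, {255, 0, 0}};

        System.out.println("BIT image:  " + Arrays.deepToString(bit));
        System.out.println("mask ^ BIT: " + Arrays.deepToString(xor(mask, bit)));
    }
}
----

After the inverted threshold, dark ink becomes foreground (255) and the light background becomes 0, which is the form contour detection expects. The XOR keeps only the pixels where the mask and the thresholded image differ.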
==== single-threaded example
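[source, java]
----
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader and ParsedTablePage are provided by pdf-table

class SingleThreadParser {
    public static void main(String[] args) throws IOException {
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();
        List<ParsedTablePage> parsed = reader.parsePdfTablePages(pdfDoc, 1, pdfDoc.getNumberOfPages());
    }
}
----

==== multi-threaded example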
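[source, java]
----
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader and ParsedTablePage are provided by pdf-table

class MultiThreadParser {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        // parse pages simultaneously
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<ParsedTablePage>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<ParsedTablePage> callable = () -> {
                ParsedTablePage page = reader.parsePdfTablePage(pdfDoc, pageNum);
                return page;
            };
            futures.add(executor.submit(callable));
        }

        // collect parsed pages
        List<ParsedTablePage> unsortedParsedPages = new ArrayList<>(pdfDoc.getNumberOfPages());
        try {
            for (Future<ParsedTablePage> f : futures) {
                ParsedTablePage page = f.get();
                unsortedParsedPages.add(page.getPageNum() - 1, page);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // stop the worker threads so the JVM can exit
            executor.shutdown();
        }

        // sort pages by pageNum
        List<ParsedTablePage> sortedParsedPages = unsortedParsedPages.stream()
                .sorted((p1, p2) -> Integer.compare(p1.getPageNum(), p2.getPageNum()))
                .collect(Collectors.toList());
    }
}
----

=== Saving PDF pages as PNG images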
PDF-table provides methods for saving PDF pages as PNG images. +
The rendering DPI can be modified in `PdfTableSettings` (see: <<Parsing settings>>).

==== single-threaded example
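[source, java]
----
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader is provided by pdf-table

class SingleThreadPNGDump {
    public static void main(String[] args) throws IOException {
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        Path outputPath = Paths.get("C:", "some_directory");
        PdfTableReader reader = new PdfTableReader();
        reader.savePdfPagesAsPNG(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
    }
}
----

==== multi-threaded example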
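[source, java]
----
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader is provided by pdf-table

class MultiThreadPNGDump {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        Path outputPath = Paths.get("C:", "some_directory");
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        // submit one rendering task per page
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<Boolean>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<Boolean> callable = () -> {
                reader.savePdfPageAsPNG(pdfDoc, pageNum, outputPath);
                return true;
            };
            futures.add(executor.submit(callable));
        }

        // wait for all tasks to finish
        try {
            for (Future<Boolean> f : futures) {
                f.get();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // stop the worker threads so the JVM can exit
            executor.shutdown();
        }
    }
}
----

=== Saving debug PNG images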
When tables in a PDF document cannot be parsed correctly with the default settings, the user can save debug images that show the page
at various stages of processing. +
Using these images, the user can adjust `PdfTableSettings` to achieve the desired results
(see: <<Parsing settings>>).

==== single-threaded example
[source, java]
----
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader is provided by pdf-table

class SingleThreadDebugImgsDump {
    public static void main(String[] args) throws IOException {
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        Path outputPath = Paths.get("C:", "some_directory");
        PdfTableReader reader = new PdfTableReader();
        reader.savePdfTablePagesDebugImages(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
    }
}
----

==== multi-threaded example
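[source, java]
----
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader is provided by pdf-table

class MultiThreadDebugImgsDump {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        Path outputPath = Paths.get("C:", "some_directory");
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        // submit one debug-image task per page
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<Boolean>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<Boolean> callable = () -> {
                reader.savePdfTablePagesDebugImage(pdfDoc, pageNum, outputPath);
                return true;
            };
            futures.add(executor.submit(callable));
        }

        // wait for all tasks to finish
        try {
            for (Future<Boolean> f : futures) {
                f.get();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // stop the worker threads so the JVM can exit
            executor.shutdown();
        }
    }
}
----

=== Parsing settings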
PDF rendering and OpenCV filtering settings are stored in a `PdfTableSettings` object.
A custom settings instance can be passed to the `PdfTableReader` constructor when non-default values are needed:

[source, java]
----
(...)

// build settings object
PdfTableSettings settings = PdfTableSettings.getBuilder()
        .setCannyFiltering(true)
        .setCannyApertureSize(5)
        .setCannyThreshold1(40)
        .setCannyThreshold2(190.5)
        .setPdfRenderingDpi(160)
        .build();

// pass settings to reader
PdfTableReader reader = new PdfTableReader(settings);
----

=== Output format
Each parsed PDF page is returned as a `ParsedTablePage` object:

[source, java]
----
(...)

PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();

// first page in document has index == 1, not 0 !
ParsedTablePage firstPage = reader.parsePdfTablePage(pdfDoc, 1);

// getting page number
assert firstPage.getPageNum() == 1;

// rows and cells are zero-indexed just like elements of the List
// getting first row
ParsedTablePage.ParsedTableRow firstRow = firstPage.getRow(0);

// getting third cell in second row
String thirdCellContent = firstPage.getRow(1).getCell(2);

// cell content usually contains extra whitespace characters,
// so it is recommended to trim it before processing
double thirdCellNumericValue = Double.parseDouble(thirdCellContent.trim());
----
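
As a follow-up to the trimming advice above, the sketch below shows one possible way to turn raw cell strings into numbers while tolerating cells that are empty or not numeric. To stay self-contained it works on plain `List<String>` rows instead of `ParsedTablePage` objects. The `CellParsingSketch` class and its `parseCell` and `parseRow` helpers are hypothetical names and are not part of pdf-table.

[source, java]
----
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.OptionalDouble;

public class CellParsingSketch {

    // Trims the raw cell text and parses it; returns empty when the cell holds no number.
    static OptionalDouble parseCell(String raw) {
        if (raw == null) {
            return OptionalDouble.empty();
        }
        String trimmed = raw.trim();
        if (trimmed.isEmpty()) {
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Double.parseDouble(trimmed));
        } catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }

    // Parses every cell of a row, keeping one result per cell so column positions are preserved.
    static List<OptionalDouble> parseRow(List<String> cells) {
        List<OptionalDouble> values = new ArrayList<>(cells.size());
        for (String cell : cells) {
            values.add(parseCell(cell));
        }
        return values;
    }

    public static void main(String[] args) {
        // Stand-in for the cells of one parsed row, including stray line breaks and a text cell.
        List<String> row = Arrays.asList(" 12.5\r\n", "total", "  ", "7");
        for (OptionalDouble value : parseRow(row)) {
            System.out.println(value.isPresent() ? String.valueOf(value.getAsDouble()) : "(not numeric)");
        }
    }
}
----

Returning an `OptionalDouble` per cell, rather than throwing on the first bad cell, keeps header or label cells from aborting the processing of a whole row. Whether that is the right policy depends on the tables being parsed.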