https://github.com/rostrovsky/pdf-table

Java utility for parsing PDF tabular data using Apache PDFBox and OpenCV
https://github.com/rostrovsky/pdf-table

java-library java8 opencv opencv3 pdf-parsing pdfbox table tables

Last synced: 4 months ago
JSON representation

Java utility for parsing PDF tabular data using Apache PDFBox and OpenCV

Host: GitHub
URL: https://github.com/rostrovsky/pdf-table
Owner: rostrovsky
License: mit
Created: 2017-02-19T21:44:09.000Z (almost 9 years ago)
Default Branch: master
Last Pushed: 2023-05-09T18:44:27.000Z (over 2 years ago)
Last Synced: 2025-03-24T15:21:58.796Z (9 months ago)
Topics: java-library, java8, opencv, opencv3, pdf-parsing, pdfbox, table, tables
Language: Java
Size: 145 KB
Stars: 72
Watchers: 7
Forks: 13
Open Issues: 2
Metadata Files:
- Readme: README.adoc
- License: LICENSE

Awesome Lists containing this project

README

          = PDF-table

:toc:

== What is PDF-table?

PDF-table is Java utility library that can be used for parsing tabular data in PDF documents. +

Core processing of PDF documents is performed with utilization of *Apache PDFBox* and *OpenCV*.

== Prerequisites

=== JDK

JAVA 8 is required.

=== External dependencies

pdf-table requires compiled *OpenCV 3.4.2* to work properly:

. Download OpenCV v3.4.2 from https://github.com/opencv/opencv/releases/tag/3.4.2

. Unpack it and add to your system PATH:

    * Windows: `\build\java\x64`

    * Linux: `TODO`

== Installation

[source, xml]

----

  com.github.rostrovsky

  pdf-table

  1.0.0

----

== Usage

=== Parsing PDFs

When PDF document page is being parsed, following operations are performed:

. Page is converted to grayscale image [OpenCV].

. Binary Inverted Threshold (BIT) is applied to grayscaled image [OpenCV].

. Contours are detected on BIT image and contour mask is created (additional Canny filtering can be turned on in this step) [OpenCV].

. Contour mask is XORed with BIT image [OpenCV].

. Contours are detected once again on XORed image (additional Canny filtering can be turned on in this step) [OpenCV].

. Final contours are drawn [OpenCV].

. Bounding rectangles are detected from final contours [OpenCV].

. PDF is being parsed region-by-region using bounding rectangles coordinates [Apache PDFBox].

Above algorithm is mostly derived from http://stackoverflow.com/a/23106594.

For more information about parsed output, refer to <>

==== single-threaded example

[source, java]

----

class SingleThreadParser {

    public static void main(String[] args) throws IOException {

        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));

        PdfTableReader reader = new PdfTableReader();

        List parsed = reader.parsePdfTablePages(pdfDoc, 1, pdfDoc.getNumberOfPages());

    }

}

----

==== multi-threaded example

[source, java]

----

class MultiThreadParser {

    public static void main(String[] args) throws IOException {

        final int THREAD_COUNT = 8;

        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));

        PdfTableReader reader = new PdfTableReader();

        // parse pages simultaneously

        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);

        List> futures = new ArrayList<>();

        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {

            Callable callable = () -> {

                ParsedTablePage page = reader.parsePdfTablePage(pdfDoc, pageNum);

                return page;

            };

            futures.add(executor.submit(callable));

        }

        // collect parsed pages

        List unsortedParsedPages = new ArrayList<>(pdfDoc.getNumberOfPages());

        try {

            for (Future f : futures) {

                ParsedTablePage page = f.get();

                unsortedParsedPages.add(page.getPageNum() - 1, page);

            }

        } catch (Exception e) {

            throw new RuntimeException(e);

        }

        // sort pages by pageNum

        List sortedParsedPages = unsortedParsedPages.stream()

                .sorted((p1, p2) -> Integer.compare(p1.getPageNum(), p2.getPageNum())).collect(Collectors.toList());

    }

}

----

=== Saving PDF pages as PNG images

PDF-Table provides methods for saving PDF pages as PNG images. +

Rendering DPI can be modified in `PdfTableSettings` (see: <>).

==== single-threaded example

[source, java]

----

class SingleThreadPNGDump {

    public static void main(String[] args) throws IOException {

        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));

        Path outputPath = Paths.get("C:", "some_directory");

        PdfTableReader reader = new PdfTableReader();

        reader.savePdfPagesAsPNG(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);

    }

}

----

==== multi-threaded example

[source, java]

----

class MultiThreadPNGDump {

    public static void main(String[] args) throws IOException {

        final int THREAD_COUNT = 8;

        Path outputPath = Paths.get("C:", "some_directory");

        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));

        PdfTableReader reader = new PdfTableReader();

        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);

        List> futures = new ArrayList<>();

        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {

            Callable callable = () -> {

                reader.savePdfPageAsPNG(pdfDoc, pageNum, outputPath);

                return true;

            };

            futures.add(executor.submit(callable));

        }

        try {

            for (Future f : futures) {

                f.get();

            }

        } catch (Exception e) {

            throw new RuntimeException(e);

        }

    }

}

----

=== Saving debug PNG images

When tables in PDF document cannot be parsed correctly with default settings, user can save debug images that show page

at various stages of processing. +

Using these images, user can adjust `PdfTableSettings` accordingly to achieve desired results

(see: <>).

==== single-threaded example

[source, java]

----

class SingleThreadDebugImgsDump {

    public static void main(String[] args) throws IOException {

        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));

        Path outputPath = Paths.get("C:", "some_directory");

        PdfTableReader reader = new PdfTableReader();

        reader.savePdfTablePagesDebugImages(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);

    }

}

----

==== multi-threaded example

[source, java]

----

class MultiThreadDebugImgsDump {

    public static void main(String[] args) throws IOException {

        final int THREAD_COUNT = 8;

        Path outputPath = Paths.get("C:", "some_directory");

        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));

        PdfTableReader reader = new PdfTableReader();

        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);

        List> futures = new ArrayList<>();

        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {

            Callable callable = () -> {

                reader.savePdfTablePagesDebugImage(pdfDoc, pageNum, outputPath);

                return true;

            };

            futures.add(executor.submit(callable));

        }

        try {

            for (Future f : futures) {

                f.get();

            }

        } catch (Exception e) {

            throw new RuntimeException(e);

        }

    }

}

----

=== Parsing settings

PDF rendering and OpenCV filtering settings are stored in `PdfTableSettings` object.

Custom settings instance can be passed to `PdfTableReader` constructor when non-default values are needed:

[source, java]

----

(...)

// build settings object

PdfTableSettings settings = PdfTableSettings.getBuilder()

                .setCannyFiltering(true)

                .setCannyApertureSize(5)

                .setCannyThreshold1(40)

                .setCannyThreshold2(190.5)

                .setPdfRenderingDpi(160)

                .build();

// pass settings to reader

PdfTableReader reader = new PdfTableReader(settings);

----

=== Output format

Each parsed PDF page is being returned as `ParsedTablePage` object:

[source, java]

----

(...)

PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));

PdfTableReader reader = new PdfTableReader();

// first page in document has index == 1, not 0 !

ParsedTablePage firstPage = reader.parsePdfTablePage(pdfDoc, 1);

// getting page number

assert firstPage.getPageNum() == 1;

// rows and cells are zero-indexed just like elements of the List

// getting first row

ParsedTablePage.ParsedTableRow firstRow = firstPage.getRow(0);

// getting third cell in second row

String thirdCellContent = firstPage.getRow(1).getCell(2);

// cell content usually contain  characters,

// so it is recommended to trim them before processing

double thirdCellNumericValue = Double.valueOf(thirdCellContent.trim());

----

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/rostrovsky/pdf-table

Awesome Lists containing this project

README