{"id":15068791,"url":"https://github.com/rostrovsky/pdf-table","last_synced_at":"2025-08-13T17:08:54.843Z","repository":{"id":27169277,"uuid":"82491694","full_name":"rostrovsky/pdf-table","owner":"rostrovsky","description":"Java utility for parsing PDF tabular data using Apache PDFBox and OpenCV","archived":false,"fork":false,"pushed_at":"2023-05-09T18:44:27.000Z","size":148,"stargazers_count":72,"open_issues_count":2,"forks_count":13,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-24T15:21:58.796Z","etag":null,"topics":["java-library","java8","opencv","opencv3","pdf-parsing","pdfbox","table","tables"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rostrovsky.png","metadata":{"files":{"readme":"README.adoc","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-02-19T21:44:09.000Z","updated_at":"2025-02-13T00:54:25.000Z","dependencies_parsed_at":"2024-10-13T04:41:05.824Z","dependency_job_id":null,"html_url":"https://github.com/rostrovsky/pdf-table","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rostrovsky%2Fpdf-table","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rostrovsky%2Fpdf-table/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rostrovsky%2Fpdf-table/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rostrovsky%2Fpdf-table/manifests","owner_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/owners/rostrovsky","download_url":"https://codeload.github.com/rostrovsky/pdf-table/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248262290,"owners_count":21074282,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["java-library","java8","opencv","opencv3","pdf-parsing","pdfbox","table","tables"],"created_at":"2024-09-25T01:39:18.067Z","updated_at":"2025-04-10T17:43:55.260Z","avatar_url":"https://github.com/rostrovsky.png","language":"Java","readme":"= PDF-table\n:toc:\n\n== What is PDF-table?\nPDF-table is a Java utility library for parsing tabular data in PDF documents. +\nCore processing of PDF documents is performed using *Apache PDFBox* and *OpenCV*.\n\n== Prerequisites\n\n=== JDK\n\nJava 8 is required.\n\n=== External dependencies\n\npdf-table requires a compiled build of *OpenCV 3.4.2* to work properly:\n\n. Download OpenCV v3.4.2 from https://github.com/opencv/opencv/releases/tag/3.4.2\n. Unpack it and add the following directory to your system PATH:\n    * Windows: `\u003copencv dir\u003e\\build\\java\\x64`\n    * Linux: `TODO`\n\n== Installation\n[source, xml]\n----\n\u003cdependency\u003e\n  \u003cgroupId\u003ecom.github.rostrovsky\u003c/groupId\u003e\n  \u003cartifactId\u003epdf-table\u003c/artifactId\u003e\n  \u003cversion\u003e1.0.0\u003c/version\u003e\n\u003c/dependency\u003e\n----\n\n== Usage\n\n=== Parsing PDFs\nWhen a PDF document page is parsed, the following operations are performed:\n\n. The page is converted to a grayscale image [OpenCV].\n. 
Binary Inverted Threshold (BIT) is applied to the grayscale image [OpenCV].\n. Contours are detected on the BIT image and a contour mask is created (additional Canny filtering can be turned on in this step) [OpenCV].\n. The contour mask is XORed with the BIT image [OpenCV].\n. Contours are detected once again on the XORed image (additional Canny filtering can be turned on in this step) [OpenCV].\n. Final contours are drawn [OpenCV].\n. Bounding rectangles are detected from the final contours [OpenCV].\n. The PDF is parsed region by region using the coordinates of the bounding rectangles [Apache PDFBox].\n\nThe above algorithm is mostly derived from http://stackoverflow.com/a/23106594.\n\nFor more information about the parsed output, refer to \u003c\u003cOutput format\u003e\u003e.\n\n==== single-threaded example\n[source, java]\n----\nclass SingleThreadParser {\n    public static void main(String[] args) throws IOException {\n        PDDocument pdfDoc = PDDocument.load(new File(\"some.pdf\"));\n        PdfTableReader reader = new PdfTableReader();\n        List\u003cParsedTablePage\u003e parsed = reader.parsePdfTablePages(pdfDoc, 1, pdfDoc.getNumberOfPages());\n    }\n}\n----\n\n==== multi-threaded example\n[source, java]\n----\nclass MultiThreadParser {\n    public static void main(String[] args) throws IOException {\n        final int THREAD_COUNT = 8;\n        PDDocument pdfDoc = PDDocument.load(new File(\"some.pdf\"));\n        PdfTableReader reader = new PdfTableReader();\n\n        // parse pages simultaneously\n        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);\n        List\u003cFuture\u003cParsedTablePage\u003e\u003e futures = new ArrayList\u003c\u003e();\n        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {\n            Callable\u003cParsedTablePage\u003e callable = () -\u003e {\n                ParsedTablePage page = reader.parsePdfTablePage(pdfDoc, pageNum);\n                return page;\n            };\n            
futures.add(executor.submit(callable));\n        }\n\n        // collect parsed pages\n        List\u003cParsedTablePage\u003e unsortedParsedPages = new ArrayList\u003c\u003e(pdfDoc.getNumberOfPages());\n        try {\n            for (Future\u003cParsedTablePage\u003e f : futures) {\n                ParsedTablePage page = f.get();\n                unsortedParsedPages.add(page.getPageNum() - 1, page);\n            }\n        } catch (Exception e) {\n            throw new RuntimeException(e);\n        }\n\n        // sort pages by pageNum\n        List\u003cParsedTablePage\u003e sortedParsedPages = unsortedParsedPages.stream()\n                .sorted((p1, p2) -\u003e Integer.compare(p1.getPageNum(), p2.getPageNum())).collect(Collectors.toList());\n    }\n}\n----\n\n=== Saving PDF pages as PNG images\nPDF-Table provides methods for saving PDF pages as PNG images. +\nRendering DPI can be modified in `PdfTableSettings` (see: \u003c\u003cParsing settings\u003e\u003e).\n\n==== single-threaded example\n[source, java]\n----\nclass SingleThreadPNGDump {\n    public static void main(String[] args) throws IOException {\n        PDDocument pdfDoc = PDDocument.load(new File(\"some.pdf\"));\n        Path outputPath = Paths.get(\"C:\", \"some_directory\");\n        PdfTableReader reader = new PdfTableReader();\n        reader.savePdfPagesAsPNG(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);\n    }\n}\n----\n\n==== multi-threaded example\n[source, java]\n----\nclass MultiThreadPNGDump {\n    public static void main(String[] args) throws IOException {\n        final int THREAD_COUNT = 8;\n        Path outputPath = Paths.get(\"C:\", \"some_directory\");\n        PDDocument pdfDoc = PDDocument.load(new File(\"some.pdf\"));\n        PdfTableReader reader = new PdfTableReader();\n\n        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);\n        List\u003cFuture\u003cBoolean\u003e\u003e futures = new ArrayList\u003c\u003e();\n        for (final int pageNum : 
IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {\n            Callable\u003cBoolean\u003e callable = () -\u003e {\n                reader.savePdfPageAsPNG(pdfDoc, pageNum, outputPath);\n                return true;\n            };\n            futures.add(executor.submit(callable));\n        }\n\n        try {\n            for (Future\u003cBoolean\u003e f : futures) {\n                f.get();\n            }\n        } catch (Exception e) {\n            throw new RuntimeException(e);\n        }\n    }\n}\n----\n\n=== Saving debug PNG images\nWhen tables in a PDF document cannot be parsed correctly with the default settings, the user can save debug images that show the page\nat various stages of processing. +\nUsing these images, the user can adjust `PdfTableSettings` to achieve the desired results\n(see: \u003c\u003cParsing settings\u003e\u003e).\n\n==== single-threaded example\n[source, java]\n----\nclass SingleThreadDebugImgsDump {\n    public static void main(String[] args) throws IOException {\n        PDDocument pdfDoc = PDDocument.load(new File(\"some.pdf\"));\n        Path outputPath = Paths.get(\"C:\", \"some_directory\");\n        PdfTableReader reader = new PdfTableReader();\n        reader.savePdfTablePagesDebugImages(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);\n    }\n}\n----\n\n==== multi-threaded example\n[source, java]\n----\nclass MultiThreadDebugImgsDump {\n    public static void main(String[] args) throws IOException {\n        final int THREAD_COUNT = 8;\n        Path outputPath = Paths.get(\"C:\", \"some_directory\");\n        PDDocument pdfDoc = PDDocument.load(new File(\"some.pdf\"));\n        PdfTableReader reader = new PdfTableReader();\n\n        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);\n        List\u003cFuture\u003cBoolean\u003e\u003e futures = new ArrayList\u003c\u003e();\n        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {\n            
Callable\u003cBoolean\u003e callable = () -\u003e {\n                reader.savePdfTablePagesDebugImage(pdfDoc, pageNum, outputPath);\n                return true;\n            };\n            futures.add(executor.submit(callable));\n        }\n\n        try {\n            for (Future\u003cBoolean\u003e f : futures) {\n                f.get();\n            }\n        } catch (Exception e) {\n            throw new RuntimeException(e);\n        }\n    }\n}\n----\n\n=== Parsing settings\n\nPDF rendering and OpenCV filtering settings are stored in a `PdfTableSettings` object.\n\nA custom settings instance can be passed to the `PdfTableReader` constructor when non-default values are needed:\n\n[source, java]\n----\n(...)\n\n// build settings object\nPdfTableSettings settings = PdfTableSettings.getBuilder()\n                .setCannyFiltering(true)\n                .setCannyApertureSize(5)\n                .setCannyThreshold1(40)\n                .setCannyThreshold2(190.5)\n                .setPdfRenderingDpi(160)\n                .build();\n\n// pass settings to reader\nPdfTableReader reader = new PdfTableReader(settings);\n----\n\n\n=== Output format\nEach parsed PDF page is returned as a `ParsedTablePage` object:\n[source, java]\n----\n(...)\n\nPDDocument pdfDoc = PDDocument.load(new File(\"some.pdf\"));\nPdfTableReader reader = new PdfTableReader();\n\n// the first page in the document has index 1, not 0!\nParsedTablePage firstPage = reader.parsePdfTablePage(pdfDoc, 1);\n\n// getting page number\nassert firstPage.getPageNum() == 1;\n\n// rows and cells are zero-indexed, just like elements of a List\n// getting first row\nParsedTablePage.ParsedTableRow firstRow = firstPage.getRow(0);\n\n// getting third cell in second row\nString thirdCellContent = firstPage.getRow(1).getCell(2);\n\n// cell content usually contains \u003cCR\u003e\u003cLF\u003e characters,\n// so it is recommended to trim them before processing\ndouble thirdCellNumericValue = 
Double.valueOf(thirdCellContent.trim());\n----\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frostrovsky%2Fpdf-table","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frostrovsky%2Fpdf-table","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frostrovsky%2Fpdf-table/lists"}