{"id":13425787,"url":"https://github.com/JonathanLink/PDFLayoutTextStripper","last_synced_at":"2025-03-15T20:31:19.450Z","repository":{"id":45725999,"uuid":"44072711","full_name":"JonathanLink/PDFLayoutTextStripper","owner":"JonathanLink","description":"Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (from the Apache PDFBox library).","archived":false,"fork":false,"pushed_at":"2023-12-17T17:19:17.000Z","size":22135,"stargazers_count":1580,"open_issues_count":25,"forks_count":213,"subscribers_count":53,"default_branch":"master","last_synced_at":"2025-03-10T10:46:25.337Z","etag":null,"topics":["data-extraction","extract","java","layout","pdf","pdfbox","text"],"latest_commit_sha":null,"homepage":"https://jonathanlink.ch/PDFLayoutTextStripper.html","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JonathanLink.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-10-11T22:49:10.000Z","updated_at":"2025-03-04T09:07:31.000Z","dependencies_parsed_at":"2024-09-25T00:07:34.126Z","dependency_job_id":"ec503c57-218a-46db-86ae-d4ac71c3a98e","html_url":"https://github.com/JonathanLink/PDFLayoutTextStripper","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JonathanLink%2FPDFLayoutTextStripper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/JonathanLink%2FPDFLayoutTextStripper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JonathanLink%2FPDFLayoutTextStripper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JonathanLink%2FPDFLayoutTextStripper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JonathanLink","download_url":"https://codeload.github.com/JonathanLink/PDFLayoutTextStripper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243790943,"owners_count":20348378,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-extraction","extract","java","layout","pdf","pdfbox","text"],"created_at":"2024-07-31T00:01:18.905Z","updated_at":"2025-03-15T20:31:14.890Z","avatar_url":"https://github.com/JonathanLink.png","language":"Java","readme":"# PDFLayoutTextStripper\n\nConverts a PDF file into a text file while keeping the layout of the original PDF. Useful to extract the content from a table or a form in a PDF file. 
PDFLayoutTextStripper is a subclass of the PDFTextStripper class (from the [Apache PDFBox](https://pdfbox.apache.org/) library).\n\n## Use cases\nData extraction from a table in a PDF file\n![example](sample.png)\n-\nData extraction from a form in a PDF file\n![example](sample2.png)\n\n## How to install\n\n### Maven\n```\n\u003cdependency\u003e\n  \u003cgroupId\u003eio.github.jonathanlink\u003c/groupId\u003e\n  \u003cartifactId\u003ePDFLayoutTextStripper\u003c/artifactId\u003e\n  \u003cversion\u003e2.2.3\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n### Manual\n1) Install **Apache PDFBox** manually ([click here to get v2.0.6](https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox/2.0.6)) along with its two dependencies,\ncommons-logging.jar and fontbox\n\n\u003e**Warning**: only PDFBox versions **2.0.0 and later** are compatible with this version of PDFLayoutTextStripper.java\n\n\n### How to use on Linux/Mac\n```\ncd PDFLayoutTextStripper\njavac -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar *.java\njava -cp .:/pathto/pdfbox-2.0.6.jar:/pathto/commons-logging-1.2.jar:/pathto/PDFLayoutTextStripper/fontbox-2.0.6.jar test\n```\n\n### How to use on Windows\n\nSame as for Linux/Mac (see above), but replace the `:` classpath separator with `;`\n\n## Sample code\n```\nimport java.io.File;\nimport java.io.FileNotFoundException;\nimport java.io.IOException;\nimport org.apache.pdfbox.io.RandomAccessFile;\nimport org.apache.pdfbox.pdfparser.PDFParser;\nimport org.apache.pdfbox.pdmodel.PDDocument;\nimport org.apache.pdfbox.text.PDFTextStripper;\n\npublic class test {\n\tpublic static void main(String[] args) {\n\t\tString string = null;\n        try {\n            PDFParser pdfParser = new PDFParser(new RandomAccessFile(new File(\"./samples/bus.pdf\"), \"r\"));\n            pdfParser.parse();\n            PDDocument pdDocument = new PDDocument(pdfParser.getDocument());\n            PDFTextStripper pdfTextStripper = new 
PDFLayoutTextStripper();\n            string = pdfTextStripper.getText(pdDocument);\n        } catch (FileNotFoundException e) {\n            e.printStackTrace();\n        } catch (IOException e) {\n            e.printStackTrace();\n        }\n        System.out.println(string);\n\t}\n}\n```\n\n## Contributors\nThanks to\n\n* Dmytro Zelinskyy for reporting an issue along with its fix (v2.2.3)\n* Ho Ting Cheng for reporting an issue (v2.1)\n* James Sullivan for updating the code to work with the latest version of PDFBox (v2.0)\n","funding_links":[],"categories":["Java","JAVA","Products"],"sub_categories":["Knowledge Graphs"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FJonathanLink%2FPDFLayoutTextStripper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FJonathanLink%2FPDFLayoutTextStripper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FJonathanLink%2FPDFLayoutTextStripper/lists"}