{"id":13588973,"url":"https://github.com/tabulapdf/tabula-java","last_synced_at":"2025-05-14T00:09:54.237Z","repository":{"id":17276190,"uuid":"20046106","full_name":"tabulapdf/tabula-java","owner":"tabulapdf","description":"Extract tables from PDF files","archived":false,"fork":false,"pushed_at":"2025-03-19T18:21:14.000Z","size":10257,"stargazers_count":1924,"open_issues_count":194,"forks_count":441,"subscribers_count":68,"default_branch":"master","last_synced_at":"2025-05-06T21:24:22.525Z","etag":null,"topics":["extracting-tables","extraction-engine","pdfs"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tabulapdf.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2014-05-22T03:11:57.000Z","updated_at":"2025-05-02T14:28:52.000Z","dependencies_parsed_at":"2023-01-16T20:16:24.772Z","dependency_job_id":"6336463f-9ec2-463f-99dd-aac83612ee39","html_url":"https://github.com/tabulapdf/tabula-java","commit_stats":{"total_commits":436,"total_committers":38,"mean_commits":"11.473684210526315","dds":0.6100917431192661,"last_synced_commit":"5d91f1d733c4895d31854a641c152220f8c5f341"},"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tabulapdf%2Ftabula-java","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tabulapdf%2Ftabula-java/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tabulapdf%2Ftabula-java/releases","mani
fests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tabulapdf%2Ftabula-java/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tabulapdf","download_url":"https://codeload.github.com/tabulapdf/tabula-java/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253528422,"owners_count":21922623,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["extracting-tables","extraction-engine","pdfs"],"created_at":"2024-08-01T16:00:16.712Z","updated_at":"2025-05-14T00:09:54.199Z","avatar_url":"https://github.com/tabulapdf.png","language":"Java","funding_links":["https://opencollective.com/tabulapdf","https://opencollective.com/tabulapdf/backer/0/website","https://opencollective.com/tabulapdf/backer/0/avatar","https://opencollective.com/tabulapdf/backer/1/website","https://opencollective.com/tabulapdf/backer/1/avatar","https://opencollective.com/tabulapdf/backer/2/website","https://opencollective.com/tabulapdf/backer/2/avatar","https://opencollective.com/tabulapdf/backer/3/website","https://opencollective.com/tabulapdf/backer/3/avatar","https://opencollective.com/tabulapdf/backer/4/website","https://opencollective.com/tabulapdf/backer/4/avatar","https://opencollective.com/tabulapdf/backer/5/website","https://opencollective.com/tabulapdf/backer/5/avatar"],"categories":["JAVA","Java","Projects","项目"],"sub_categories":["PDF"],"readme":"tabula-java [![Build 
Status](https://travis-ci.org/tabulapdf/tabula-java.svg?branch=master)](https://travis-ci.org/tabulapdf/tabula-java)\n===========\n\n`tabula-java` is a library for extracting tables from PDF files — it is the table extraction engine that powers [Tabula](http://tabula.technology/) ([repo](http://github.com/tabulapdf/tabula)). You can use `tabula-java` as a command-line tool to programmatically extract tables from PDFs.\n\n© 2014-2020 Manuel Aristarán. Available under MIT License. See [`LICENSE`](LICENSE).\n\n## Download\n\nDownload a version of the tabula-java's jar, with all dependencies included, that works on Mac, Windows and Linux from our [releases page](../../releases).\n\n## Commandline Usage Examples\n\n`tabula-java` provides a command line application:\n\n```\n$ java -jar target/tabula-1.0.5-jar-with-dependencies.jar --help\nusage: tabula [-a \u003cAREA\u003e] [-b \u003cDIRECTORY\u003e] [-c \u003cCOLUMNS\u003e] [-f \u003cFORMAT\u003e]\n       [-g] [-h] [-i] [-l] [-n] [-o \u003cOUTFILE\u003e] [-p \u003cPAGES\u003e] [-r] [-s\n       \u003cPASSWORD\u003e] [-t] [-u] [-v]\n\nTabula helps you extract tables from PDFs\n\n -a,--area \u003cAREA\u003e           -a/--area = Portion of the page to analyze.\n                            Example: --area 269.875,12.75,790.5,561.\n                            Accepts top,left,bottom,right i.e. y1,x1,y2,x2\n                            where all values are in points relative to the\n                            top left corner. If all values are between\n                            0-100 (inclusive) and preceded by '%', input\n                            will be taken as % of actual height or width\n                            of the page. Example: --area %0,0,100,50. To\n                            specify multiple areas, -a option should be\n                            repeated. 
Default is entire page\n -b,--batch \u003cDIRECTORY\u003e     Convert all .pdfs in the provided directory.\n -c,--columns \u003cCOLUMNS\u003e     X coordinates of column boundaries. Example\n                            --columns 10.1,20.2,30.3. If all values are\n                            between 0-100 (inclusive) and preceded by '%',\n                            input will be taken as % of actual width of\n                            the page. Example: --columns %25,50,80.6\n -f,--format \u003cFORMAT\u003e       Output format: (CSV,TSV,JSON). Default: CSV\n -g,--guess                 Guess the portion of the page to analyze per\n                            page.\n -h,--help                  Print this help text.\n -i,--silent                Suppress all stderr output.\n -l,--lattice               Force PDF to be extracted using lattice-mode\n                            extraction (if there are ruling lines\n                            separating each cell, as in a PDF of an Excel\n                            spreadsheet)\n -n,--no-spreadsheet        [Deprecated in favor of -t/--stream] Force PDF\n                            not to be extracted using spreadsheet-style\n                            extraction (if there are no ruling lines\n                            separating each cell)\n -o,--outfile \u003cOUTFILE\u003e     Write output to \u003cfile\u003e instead of STDOUT.\n                            Default: -\n -p,--pages \u003cPAGES\u003e         Comma separated list of ranges, or all.\n                            Examples: --pages 1-3,5-7, --pages 3 or\n                            --pages all. 
Default is --pages 1\n -r,--spreadsheet           [Deprecated in favor of -l/--lattice] Force\n                            PDF to be extracted using spreadsheet-style\n                            extraction (if there are ruling lines\n                            separating each cell, as in a PDF of an Excel\n                            spreadsheet)\n -s,--password \u003cPASSWORD\u003e   Password to decrypt document. Default is empty\n -t,--stream                Force PDF to be extracted using stream-mode\n                            extraction (if there are no ruling lines\n                            separating each cell)\n -u,--use-line-returns      Use embedded line returns in cells. (Only in\n                            spreadsheet mode.)\n -v,--version               Print version and exit.\n```\n\nIt also includes a debugging tool, run `java -cp ./target/tabula-1.0.5-jar-with-dependencies.jar technology.tabula.debug.Debug -h` for the available options.\n\nYou can also integrate `tabula-java` with any JVM language. 
For Java examples, see the [`tests`](src/test/java/technology/tabula/) folder.\n\nJVM start-up time accounts for much of the cost of the `tabula` command, so if you're trying to extract many tables from PDFs, you have a few options for speeding it up:\n\n - the `-b` option, which converts all PDFs in a given directory\n - the [drip](https://github.com/ninjudd/drip) utility\n - the [Ruby](http://github.com/tabulapdf/tabula-extractor), [Python](https://github.com/chezou/tabula-py), [R](https://github.com/leeper/tabulizer), and [Node.js](https://github.com/ezodude/tabula-js) bindings\n - writing your own program in any JVM language (Java, JRuby, Scala) that imports tabula-java.\n - waiting for us to implement an API/server-style system (it's on the [roadmap](https://github.com/tabulapdf/tabula-api))\n\n## API Usage Examples\n\nA simple Java code example that extracts all rows and cells from all tables of all pages of a PDF document:\n\n```java\nInputStream in = this.getClass().getResourceAsStream(\"my.pdf\");\ntry (PDDocument document = PDDocument.load(in)) {\n    SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();\n    PageIterator pi = new ObjectExtractor(document).extract();\n    while (pi.hasNext()) {\n        // iterate over the pages of the document\n        Page page = pi.next();\n        List\u003cTable\u003e tables = sea.extract(page);\n        // iterate over the tables of the page\n        for (Table table : tables) {\n            List\u003cList\u003cRectangularTextContainer\u003e\u003e rows = table.getRows();\n            // iterate over the rows of the table\n            for (List\u003cRectangularTextContainer\u003e cells : rows) {\n                // print all column-cells of the row plus linefeed\n                for (RectangularTextContainer content : cells) {\n                    // Note: Cell.getText() uses \\r to concat text chunks\n                    String text = content.getText().replace(\"\\r\", \" \");\n                 
   System.out.print(text + \"|\");\n                }\n                System.out.println();\n            }\n        }\n    }\n}\n```\n\n\nFor more detailed information, check the Javadoc.\nThe Javadoc API documentation can be generated (see also the '_Building from Source_' section) via\n\n```\nmvn javadoc:javadoc\n```\n\nwhich writes the HTML files to the ```target/site/apidocs/``` directory.\n\n## Building from Source\n\nClone this repo and run:\n\n```\nmvn clean compile assembly:single\n```\n\n## Contributing\n\nInterested in helping out? We'd love to have your help!\n\nYou can help by:\n\n- [Reporting a bug](https://github.com/tabulapdf/tabula-java/issues).\n- Adding or editing documentation.\n- Contributing code via a Pull Request.\n- Spreading the word about `tabula-java` to people who might be able to benefit from using it.\n\n### Backers\n\nYou can also support our continued work on `tabula-java` with a one-time or monthly donation [on OpenCollective](https://opencollective.com/tabulapdf#support). 
Organizations who use `tabula-java` can also [sponsor the project](https://opencollective.com/tabulapdf#support) for acknowledgement on [our official site](http://tabula.technology/) and this README.\n\nSpecial thanks to the following users and organizations for generously supporting Tabula with donations and grants:\n\n\u003ca href=\"https://opencollective.com/tabulapdf/backer/0/website\" target=\"_blank\"\u003e\u003cimg src=\"https://opencollective.com/tabulapdf/backer/0/avatar\"\u003e\u003c/a\u003e\n\u003ca href=\"https://opencollective.com/tabulapdf/backer/1/website\" target=\"_blank\"\u003e\u003cimg src=\"https://opencollective.com/tabulapdf/backer/1/avatar\"\u003e\u003c/a\u003e\n\u003ca href=\"https://opencollective.com/tabulapdf/backer/2/website\" target=\"_blank\"\u003e\u003cimg src=\"https://opencollective.com/tabulapdf/backer/2/avatar\"\u003e\u003c/a\u003e\n\u003ca href=\"https://opencollective.com/tabulapdf/backer/3/website\" target=\"_blank\"\u003e\u003cimg src=\"https://opencollective.com/tabulapdf/backer/3/avatar\"\u003e\u003c/a\u003e\n\u003ca href=\"https://opencollective.com/tabulapdf/backer/4/website\" target=\"_blank\"\u003e\u003cimg src=\"https://opencollective.com/tabulapdf/backer/4/avatar\"\u003e\u003c/a\u003e\n\u003ca href=\"https://opencollective.com/tabulapdf/backer/5/website\" target=\"_blank\"\u003e\u003cimg src=\"https://opencollective.com/tabulapdf/backer/5/avatar\"\u003e\u003c/a\u003e\n\n\u003ca title=\"The John S. and James L. Knight Foundation\" href=\"http://www.knightfoundation.org/\" target=\"_blank\"\u003e\u003cimg alt=\"The John S. and James L. 
Knight Foundation\" src=\"https://knightfoundation.org/wp-content/uploads/2019/10/KF_Logotype_Icon-and-Stacked-Name.png\" width=\"300\"\u003e\u003c/a\u003e\n\u003ca title=\"The Shuttleworth Foundation\" href=\"https://shuttleworthfoundation.org/\" target=\"_blank\"\u003e\u003cimg width=\"200\" alt=\"The Shuttleworth Foundation\" src=\"https://raw.githubusercontent.com/tabulapdf/tabula/gh-pages/shuttleworth.jpg\"\u003e\u003c/a\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftabulapdf%2Ftabula-java","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftabulapdf%2Ftabula-java","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftabulapdf%2Ftabula-java/lists"}