Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tabulapdf/tabula-java
Extract tables from PDF files
extracting-tables extraction-engine pdfs
- Host: GitHub
- URL: https://github.com/tabulapdf/tabula-java
- Owner: tabulapdf
- License: mit
- Created: 2014-05-22T03:11:57.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2024-10-07T21:27:27.000Z (2 months ago)
- Last Synced: 2024-10-29T15:04:39.375Z (about 2 months ago)
- Topics: extracting-tables, extraction-engine, pdfs
- Language: Java
- Size: 9.8 MB
- Stars: 1,836
- Watchers: 68
- Forks: 429
- Open Issues: 192
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-pdf - java
- awesome-github-repos - tabulapdf/tabula-java - Extract tables from PDF files (Java)
README
tabula-java [![Build Status](https://travis-ci.org/tabulapdf/tabula-java.svg?branch=master)](https://travis-ci.org/tabulapdf/tabula-java)
===========

`tabula-java` is a library for extracting tables from PDF files. It is the table extraction engine that powers [Tabula](http://tabula.technology/) ([repo](http://github.com/tabulapdf/tabula)). You can use `tabula-java` as a command-line tool to programmatically extract tables from PDFs.
© 2014-2020 Manuel Aristarán. Available under MIT License. See [`LICENSE`](LICENSE).
## Download
Download a version of the `tabula-java` jar, with all dependencies included, that works on Mac, Windows and Linux from our [releases page](../../releases).
## Commandline Usage Examples
`tabula-java` provides a command line application:
```
$ java -jar target/tabula-1.0.5-jar-with-dependencies.jar --help
usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-f <FORMAT>]
       [-g] [-h] [-i] [-l] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r] [-s
       <PASSWORD>] [-t] [-u] [-v]

Tabula helps you extract tables from PDFs

 -a,--area <AREA>           -a/--area = Portion of the page to analyze.
                            Example: --area 269.875,12.75,790.5,561.
                            Accepts top,left,bottom,right i.e. y1,x1,y2,x2
                            where all values are in points relative to the
                            top left corner. If all values are between
                            0-100 (inclusive) and preceded by '%', input
                            will be taken as % of actual height or width
                            of the page. Example: --area %0,0,100,50. To
                            specify multiple areas, -a option should be
                            repeated. Default is entire page
 -b,--batch <DIRECTORY>     Convert all .pdfs in the provided directory.
 -c,--columns <COLUMNS>     X coordinates of column boundaries. Example
                            --columns 10.1,20.2,30.3. If all values are
                            between 0-100 (inclusive) and preceded by '%',
                            input will be taken as % of actual width of
                            the page. Example: --columns %25,50,80.6
 -f,--format <FORMAT>       Output format: (CSV,TSV,JSON). Default: CSV
 -g,--guess                 Guess the portion of the page to analyze per
                            page.
 -h,--help                  Print this help text.
 -i,--silent                Suppress all stderr output.
 -l,--lattice               Force PDF to be extracted using lattice-mode
                            extraction (if there are ruling lines
                            separating each cell, as in a PDF of an Excel
                            spreadsheet)
 -n,--no-spreadsheet        [Deprecated in favor of -t/--stream] Force PDF
                            not to be extracted using spreadsheet-style
                            extraction (if there are no ruling lines
                            separating each cell)
 -o,--outfile <OUTFILE>     Write output to <file> instead of STDOUT.
                            Default: -
 -p,--pages <PAGES>         Comma separated list of ranges, or all.
                            Examples: --pages 1-3,5-7, --pages 3 or
                            --pages all. Default is --pages 1
 -r,--spreadsheet           [Deprecated in favor of -l/--lattice] Force
                            PDF to be extracted using spreadsheet-style
                            extraction (if there are ruling lines
                            separating each cell, as in a PDF of an Excel
                            spreadsheet)
 -s,--password <PASSWORD>   Password to decrypt document. Default is empty
 -t,--stream                Force PDF to be extracted using stream-mode
                            extraction (if there are no ruling lines
                            separating each cell)
 -u,--use-line-returns      Use embedded line returns in cells. (Only in
                            spreadsheet mode.)
 -v,--version               Print version and exit.
```

It also includes a debugging tool. Run `java -cp ./target/tabula-1.0.5-jar-with-dependencies.jar technology.tabula.debug.Debug -h` for the available options.
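The `--area` rules in the help text (values given as top,left,bottom,right in points, or as percentages of the page when prefixed with `%`) can be illustrated with a small standalone sketch. `AreaSpec` is a hypothetical helper written for this README, not a class from `tabula-java`, and rejecting out-of-range percentages is an assumption rather than documented behavior.

```java
/**
 * Hypothetical helper (not part of tabula-java): resolves an --area style
 * argument into absolute coordinates, measured in points from the top left
 * corner of the page.
 */
final class AreaSpec {
    final double top, left, bottom, right;

    AreaSpec(double top, double left, double bottom, double right) {
        this.top = top;
        this.left = left;
        this.bottom = bottom;
        this.right = right;
    }

    /**
     * @param spec       "top,left,bottom,right", optionally prefixed with '%'
     * @param pageWidth  page width in points
     * @param pageHeight page height in points
     */
    static AreaSpec parse(String spec, double pageWidth, double pageHeight) {
        boolean percent = spec.startsWith("%");
        String[] parts = (percent ? spec.substring(1) : spec).split(",");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected top,left,bottom,right: " + spec);
        }
        double[] v = new double[4];
        for (int i = 0; i < 4; i++) {
            v[i] = Double.parseDouble(parts[i].trim());
        }
        if (!percent) {
            // Plain values are already in points.
            return new AreaSpec(v[0], v[1], v[2], v[3]);
        }
        for (double x : v) {
            // Assumption: percentages outside 0-100 are treated as an error here.
            if (x < 0 || x > 100) {
                throw new IllegalArgumentException("percentage out of range: " + spec);
            }
        }
        // top/bottom scale with the page height, left/right with the page width.
        return new AreaSpec(v[0] * pageHeight / 100, v[1] * pageWidth / 100,
                            v[2] * pageHeight / 100, v[3] * pageWidth / 100);
    }
}
```

For instance, on a 612 x 792 point page (US Letter), `AreaSpec.parse("%0,0,100,50", 612, 792)` resolves to the left half of the page: top 0, left 0, bottom 792, right 306.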
You can also integrate `tabula-java` with any JVM language. For Java examples, see the [`tests`](src/test/java/technology/tabula/) folder.
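If you drive extraction from your own JVM code, you may want to accept the same page syntax that the CLI documents for `--pages` (`1-3,5-7`, `3`, or `all`). The following is a minimal sketch of such a parser. `PageSpec` is a hypothetical name, the class does not come from `tabula-java`, and treating pages beyond the document length as an error is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of tabula-java) for --pages style arguments. */
final class PageSpec {

    /**
     * Expands a specification such as "1-3,5-7", "3" or "all" into a list of
     * 1-based page numbers.
     *
     * @param spec      comma separated list of pages or ranges, or "all"
     * @param pageCount number of pages in the document
     */
    static List<Integer> parse(String spec, int pageCount) {
        List<Integer> pages = new ArrayList<>();
        if (spec.trim().equals("all")) {
            for (int p = 1; p <= pageCount; p++) {
                pages.add(p);
            }
            return pages;
        }
        for (String part : spec.split(",")) {
            // "5-7" gives two bounds, "3" gives a single page.
            String[] bounds = part.trim().split("-", 2);
            int first = Integer.parseInt(bounds[0].trim());
            int last = bounds.length == 2 ? Integer.parseInt(bounds[1].trim()) : first;
            // Assumption: ranges must be ascending and lie within the document.
            if (first < 1 || last < first || last > pageCount) {
                throw new IllegalArgumentException("bad page range: " + part);
            }
            for (int p = first; p <= last; p++) {
                pages.add(p);
            }
        }
        return pages;
    }
}
```

The resulting page numbers are 1-based, matching the CLI examples, so they can be looped over to request individual pages from whatever extraction code you write.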
JVM start-up time accounts for much of the cost of the `tabula` command, so if you're trying to extract many tables from PDFs, you have a few options for speeding it up:
- the `-b` option, which converts all PDFs in a given directory
- the [drip](https://github.com/ninjudd/drip) utility
- the [Ruby](http://github.com/tabulapdf/tabula-extractor), [Python](https://github.com/chezou/tabula-py), [R](https://github.com/leeper/tabulizer), and [Node.js](https://github.com/ezodude/tabula-js) bindings
- writing your own program in any JVM language (Java, JRuby, Scala) that imports tabula-java.
- waiting for us to implement an API/server-style system (it's on the [roadmap](https://github.com/tabulapdf/tabula-api))

## API Usage Examples
A simple Java code example which extracts all rows and cells from all tables of all pages of a PDF document:
```java
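// Imports needed by the snippet below
import java.io.InputStream;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.PageIterator;
import technology.tabula.RectangularTextContainer;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

InputStream in = this.getClass().getResourceAsStream("my.pdf");
try (PDDocument document = PDDocument.load(in)) {
    SpreadsheetExtractionAlgorithm sea = new SpreadsheetExtractionAlgorithm();
    PageIterator pi = new ObjectExtractor(document).extract();
    while (pi.hasNext()) {
        // iterate over the pages of the document
        Page page = pi.next();
        List<Table> tables = sea.extract(page);
        // iterate over the tables of the page
        for (Table table : tables) {
            List<List<RectangularTextContainer>> rows = table.getRows();
            // iterate over the rows of the table
            for (List<RectangularTextContainer> cells : rows) {
                // print all column-cells of the row plus linefeed
                for (RectangularTextContainer content : cells) {
                    // Note: Cell.getText() uses \r to concat text chunks
                    String text = content.getText().replace("\r", " ");
                    System.out.print(text + "|");
                }
                System.out.println();
            }
        }
    }
}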
```

For more detailed information, check the Javadoc. The Javadoc API documentation can be generated (see also the '_Building from Source_' section) via

```
mvn javadoc:javadoc
```

which generates the HTML files in the directory `target/site/apidocs/`.
## Building from Source
Clone this repo and run:
```
mvn clean compile assembly:single
```

## Contributing
Interested in helping out? We'd love to have your help!
You can help by:
- [Reporting a bug](https://github.com/tabulapdf/tabula-java/issues).
- Adding or editing documentation.
- Contributing code via a Pull Request.
- Spreading the word about `tabula-java` to people who might be able to benefit from using it.

### Backers
You can also support our continued work on `tabula-java` with a one-time or monthly donation [on OpenCollective](https://opencollective.com/tabulapdf#support). Organizations who use `tabula-java` can also [sponsor the project](https://opencollective.com/tabulapdf#support) for acknowledgement on [our official site](http://tabula.technology/) and this README.
Special thanks to the following users and organizations for generously supporting Tabula with donations and grants: