https://github.com/tabulapdf/tabula-extractor
Extract tables from PDF files
- Host: GitHub
- URL: https://github.com/tabulapdf/tabula-extractor
- Owner: tabulapdf
- License: mit
- Archived: true
- Created: 2013-05-08T01:16:42.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2016-05-17T01:26:34.000Z (over 8 years ago)
- Last Synced: 2024-04-22T14:20:50.287Z (7 months ago)
- Language: Ruby
- Size: 63.5 MB
- Stars: 352
- Watchers: 21
- Forks: 56
- Open Issues: 25
Metadata Files:
- Readme: README.md
- License: LICENSE.md
tabula-extractor (old version)
==============================

**Deprecation Note:** *This is the old version of the Tabula extraction engine. New projects wishing to integrate Tabula should use [tabula-java][tabula-java] (the new Java version of this extraction engine) unless you prefer to use JRuby. Users looking for the command-line version of Tabula should also use [tabula-java][tabula-java].*

[tabula-java]: http://www.github.com/tabulapdf/tabula-java
---
Extract tables from PDF files. `tabula-extractor` is the table extraction engine that used to power [Tabula](http://tabula.nerdpower.org).
If you're beginning a new project, consider using [tabula-java](http://www.github.com/tabulapdf/tabula-java), a pure-Java version of the extraction engine behind Tabula. If you want Ruby bindings and are okay using JRuby (or have already begun a project), you may continue to use this project. This project's JRuby backend has been replaced with the Java backend; all that remains here is a thin wrapper for Ruby compatibility. This wrapper maintains API backwards-compatibility with the old, pure-JRuby implementation that we all know and love.
## Installation
`tabula-extractor` only works with JRuby 1.7 or newer. [Install JRuby](http://jruby.org/getting-started) and run
```
jruby -S gem install tabula-extractor
```

## Usage
```
Tabula helps you extract tables from PDFs

Usage:
       tabula [options] <pdf_file>
where [options] are:

Tabula helps you extract tables from PDFs
         --pages, -p <s>:   Comma separated list of ranges. Examples: --pages
                            1-3,5-7 or --pages 3. Default is --pages 1 (default:
                            1)
          --area, -a <s>:   Portion of the page to analyze
                            (top,left,bottom,right). Example: --area
                            269.875,12.75,790.5,561. Default is entire page
       --columns, -c <s>:   X coordinates of column boundaries. Example --columns
                            10.1,20.2,30.3
      --password, -s <s>:   Password to decrypt document. Default is empty
                            (default: )
             --guess, -g:   Guess the portion of the page to analyze per page.
             --debug, -d:   Print detected table areas instead of processing.
        --format, -f <s>:   Output format (CSV,TSV,HTML,JSON) (default: CSV)
       --outfile, -o <s>:   Write output to <file> instead of STDOUT (default: -)
       --spreadsheet, -r:   Force PDF to be extracted using spreadsheet-style
                            extraction (if there are ruling lines separating each
                            cell, as in a PDF of an Excel spreadsheet)
    --no-spreadsheet, -n:   Force PDF not to be extracted using spreadsheet-style
                            extraction (if there are ruling lines separating each
                            cell, as in a PDF of an Excel spreadsheet)
            --silent, -i:   Suppress all stderr output.
  --use-line-returns, -u:   Use embedded line returns in cells.
           --version, -v:   Print version and exit
              --help, -h:   Show this message
```

## Scripting examples
`tabula-extractor` is a RubyGem that you can use to programmatically extract tabular data, using the Tabula engine, in your scripts or applications. We don't have docs yet, but [the tests](test/tests.rb) are a good source of information.
Here's a very basic example:
```ruby
require 'tabula'

pdf_file_path = "whatever.pdf"
outfilename = "whatever.csv"

out = open(outfilename, 'w')

# Build an extractor over every page of the PDF
extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, :all)
extractor.extract.each do |pdf_page|
  # Write each spreadsheet-style table found on the page as CSV
  pdf_page.spreadsheets.each do |spreadsheet|
    out << spreadsheet.to_csv
    out << "\n\n"
  end
end
out.close
```
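
If your script takes page selections in the same form as the command-line `--pages` option (for example `1-3,5-7`), you will probably want to expand them into individual page numbers first. The following is a minimal sketch in plain Ruby. `expand_page_ranges` is a hypothetical helper, not part of `tabula-extractor`, and whether the extractor accepts such a list in place of `:all` is an assumption you should check against the tests.

```ruby
# Hypothetical helper, not part of tabula-extractor: expands a --pages style
# string such as "1-3,5-7" into a sorted list of unique page numbers.
def expand_page_ranges(spec)
  spec.split(',').flat_map do |part|
    # "3" yields [3]; "1-3" yields [1, 3]
    first, last = part.strip.split('-', 2).map { |n| Integer(n, 10) }
    last ||= first
    if first.nil? || first < 1 || last < first
      raise ArgumentError, "invalid page range: #{part.inspect}"
    end
    (first..last).to_a
  end.uniq.sort
end

pages = expand_page_ranges("1-3,5-7")
puts pages.inspect # prints [1, 2, 3, 5, 6, 7]
```

Malformed input (non-numeric text, a page number below 1, or a reversed range such as `7-5`) raises `ArgumentError` rather than being silently skipped, which seems the safer default for a batch script.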