https://github.com/tabulapdf/tabula-extractor

Extract tables from PDF files
Host: GitHub
URL: https://github.com/tabulapdf/tabula-extractor
Owner: tabulapdf
License: mit
Archived: true
Created: 2013-05-08T01:16:42.000Z (over 11 years ago)
Default Branch: master
Last Pushed: 2016-05-17T01:26:34.000Z (over 8 years ago)
Last Synced: 2024-04-22T14:20:50.287Z (7 months ago)
Language: Ruby
Size: 63.5 MB
Stars: 352
Watchers: 21
Forks: 56
Open Issues: 25
Metadata Files:
- Readme: README.md
- License: LICENSE.md

tabula-extractor (old version)
==============================

**Deprecation Note:** *This is the old version of the Tabula extraction engine. New projects wishing to integrate Tabula should use [tabula-java][tabula-java] (the new Java version of this extraction engine) unless you prefer to use JRuby. Users looking for the command-line version of Tabula should also use [tabula-java][tabula-java].*

[tabula-java]: http://www.github.com/tabulapdf/tabula-java

---

Extract tables from PDF files. `tabula-extractor` is the table extraction engine that used to power [Tabula](http://tabula.nerdpower.org).

If you're beginning a new project, consider using [tabula-java](http://www.github.com/tabulapdf/tabula-java), a pure-Java version of the extraction engine behind Tabula. If you want Ruby bindings and are okay using JRuby (or have already begun a project), you may continue to use this project. This project's JRuby backend has been replaced with the Java backend; all that remains here is a thin wrapper for Ruby compatibility. This wrapper maintains API backwards-compatibility with the old, pure-JRuby implementation that we all know and love.

## Installation

`tabula-extractor` only works with JRuby 1.7 or newer. [Install JRuby](http://jruby.org/getting-started) and run:

```
jruby -S gem install tabula-extractor
```
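Because the gem only runs on JRuby, a script can check the interpreter up front and fail with a clear message. The following is a minimal sketch, not part of the gem: `jruby_at_least?` is a hypothetical helper name, while `RUBY_ENGINE` and `JRUBY_VERSION` are constants supplied by the interpreter itself.

```ruby
require 'rubygems' # provides Gem::Version for version comparison

# Hypothetical helper: true only when running on JRuby at or above `minimum`.
# `engine` and `version` are parameters so the check can be exercised on any
# interpreter; by default they are read from the running Ruby.
def jruby_at_least?(minimum, engine = RUBY_ENGINE,
                    version = (defined?(JRUBY_VERSION) ? JRUBY_VERSION : nil))
  return false unless engine == 'jruby' && version
  Gem::Version.new(version) >= Gem::Version.new(minimum)
end

unless jruby_at_least?('1.7')
  warn 'tabula-extractor needs JRuby 1.7 or newer; run this script with jruby.'
end
```

The helper avoids keyword arguments so that it also parses under the Ruby 1.9 syntax that JRuby 1.7 uses by default.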

## Usage

```
Tabula helps you extract tables from PDFs

Usage:
       tabula [options]
where [options] are:

Tabula helps you extract tables from PDFs
       --pages, -p :   Comma separated list of ranges. Examples: --pages
                          1-3,5-7 or --pages 3. Default is --pages 1 (default:
                          1)
        --area, -a :   Portion of the page to analyze
                          (top,left,bottom,right). Example: --area
                          269.875,12.75,790.5,561. Default is entire page
     --columns, -c :   X coordinates of column boundaries. Example --columns
                          10.1,20.2,30.3
    --password, -s :   Password to decrypt document. Default is empty
                          (default: )
           --guess, -g:   Guess the portion of the page to analyze per page.
           --debug, -d:   Print detected table areas instead of processing.
      --format, -f :   Output format (CSV,TSV,HTML,JSON) (default: CSV)
     --outfile, -o :   Write output to  instead of STDOUT (default: -)
     --spreadsheet, -r:   Force PDF to be extracted using spreadsheet-style
                          extraction (if there are ruling lines separating each
                          cell, as in a PDF of an Excel spreadsheet)
  --no-spreadsheet, -n:   Force PDF not to be extracted using spreadsheet-style
                          extraction (if there are ruling lines separating each
                          cell, as in a PDF of an Excel spreadsheet)
          --silent, -i:   Suppress all stderr output.
--use-line-returns, -u:   Use embedded line returns in cells.
         --version, -v:   Print version and exit
            --help, -h:   Show this message
```
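When calling the command-line tool from a Ruby script, it can help to build the argument list from a hash of settings rather than assembling a shell string by hand. The sketch below is illustrative only: `tabula_args` is a hypothetical helper, it uses only flags listed in the help text above, and it assumes the PDF path is passed as the final argument.

```ruby
# Hypothetical helper: turn a hash of settings into an argument list for the
# `tabula` command. Only flags from the help text are used; the position of
# the PDF path (last) is an assumption.
def tabula_args(pdf_path, options = {})
  args = []
  args += ['--pages', options[:pages]] if options[:pages]
  # --area takes top,left,bottom,right as one comma-separated value.
  args += ['--area', options[:area].join(',')] if options[:area]
  args += ['--format', options[:format]] if options[:format]
  args += ['--outfile', options[:outfile]] if options[:outfile]
  args << '--spreadsheet' if options[:spreadsheet]
  args << pdf_path
end

args = tabula_args('whatever.pdf',
                   :pages => '1-3,5-7',
                   :area => [269.875, 12.75, 790.5, 561],
                   :format => 'CSV',
                   :outfile => 'whatever.csv')
command_line = (['tabula'] + args).join(' ')
puts command_line
# To run it, pass the list to Kernel#system so no shell quoting is needed:
#   system('tabula', *args)
```

Passing the arguments as a list to `system` keeps file names with spaces intact, because no shell parses the command.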

## Scripting examples

`tabula-extractor` is a RubyGem that you can use to programmatically extract tabular data, using the Tabula engine, in your scripts or applications. We don't have docs yet, but [the tests](test/tests.rb) are a good source of information.

Here's a very basic example:

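```ruby
require 'tabula'

pdf_file_path = "whatever.pdf"
outfilename = "whatever.csv"

# Open the output file in a block so it is closed even if extraction raises.
File.open(outfilename, 'w') do |out|
  # :all asks the extractor to process every page of the PDF.
  extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, :all)

  extractor.extract.each do |pdf_page|
    # Write each spreadsheet-style table on the page as CSV,
    # followed by blank lines to separate it from the next table.
    pdf_page.spreadsheets.each do |spreadsheet|
      out << spreadsheet.to_csv
      out << "\n\n"
    end
  end
end
```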
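The example above writes every table into a single file, separated by blank lines. To read that file back as separate tables, one option is to split on runs of blank lines and parse each chunk with Ruby's standard `csv` library. This is a rough sketch under the assumption that no table contains an empty row or a quoted cell with blank lines in it; `split_tables` is a hypothetical helper, not part of the gem.

```ruby
require 'csv'

# Hypothetical helper: split the combined output back into one array of rows
# per table, treating any run of two or more line breaks as a separator.
def split_tables(text)
  text.split(/(?:\r?\n){2,}/)
      .reject { |chunk| chunk.strip.empty? }
      .map { |chunk| CSV.parse(chunk) }
end

# Sample text shaped like the output of the script above.
sample = "a,b\n1,2\n\n\nc,d\n3,4\n\n\n"
tables = split_tables(sample)
tables.each_with_index do |rows, index|
  puts "table #{index + 1}: #{rows.length} rows"
end
```

If tables may contain empty rows, writing each table to its own file is a more robust design than splitting on blank lines.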