{"id":13858652,"url":"https://github.com/tabulapdf/tabula-extractor","last_synced_at":"2025-07-14T01:31:10.835Z","repository":{"id":8360889,"uuid":"9925552","full_name":"tabulapdf/tabula-extractor","owner":"tabulapdf","description":"Extract tables from PDF files","archived":true,"fork":false,"pushed_at":"2016-05-17T01:26:34.000Z","size":66563,"stargazers_count":354,"open_issues_count":25,"forks_count":57,"subscribers_count":21,"default_branch":"master","last_synced_at":"2024-10-29T22:38:14.693Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tabulapdf.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-05-08T01:16:42.000Z","updated_at":"2024-09-14T02:44:54.000Z","dependencies_parsed_at":"2022-07-30T23:48:02.931Z","dependency_job_id":null,"html_url":"https://github.com/tabulapdf/tabula-extractor","commit_stats":null,"previous_names":["jazzido/tabula-extractor"],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tabulapdf%2Ftabula-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tabulapdf%2Ftabula-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tabulapdf%2Ftabula-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tabulapdf%2Ftabula-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tabulapdf","download_url":"https://codeload.github.com/tabulapdf/tabula-extractor/tar.gz/refs/heads/master","host":{"name
":"GitHub","url":"https://github.com","kind":"github","repositories_count":225814916,"owners_count":17528295,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-05T03:02:16.285Z","updated_at":"2024-11-22T17:30:18.600Z","avatar_url":"https://github.com/tabulapdf.png","language":"Ruby","funding_links":[],"categories":["Ruby"],"sub_categories":[],"readme":"tabula-extractor (old version)\n==============================\n\n**Deprecation Note:** *This is the old version of the Tabula extraction engine. New projects wishing to integrate Tabula should use \u003cb\u003e[tabula-java][tabula-java] (the new Java version of this extraction engine)\u003c/b\u003e unless you prefer to use JRuby. Users looking for the command-line version of Tabula should also use \u003cb\u003e[tabula-java][tabula-java]\u003c/b\u003e.*\n\n[tabula-java]: http://www.github.com/tabulapdf/tabula-java\n\n---\n\nExtract tables from PDF files. `tabula-extractor` is the table extraction engine that used to power [Tabula](http://tabula.nerdpower.org).\n\nIf you're beginning a new project, consider using [tabula-java](http://www.github.com/tabulapdf/tabula-java), a pure-Java version of the extraction engine behind Tabula. If you want Ruby bindings and are okay using JRuby (or have already begin a project), you may continue to use this project. This project's JRuby backend has been replaced with the Java backend; all that remains here is a thin wrapper for Ruby compatibility. 
This wrapper maintains API backwards-compatibility with the old, pure-JRuby implementation that we all know and love.\n\n\n## Installation\n\n`tabula-extractor` only works with JRuby 1.7 or newer. [Install JRuby](http://jruby.org/getting-started) and run\n\n```\njruby -S gem install tabula-extractor\n```\n\n\n## Usage\n\n```\nTabula helps you extract tables from PDFs\n\nUsage:\n       tabula [options] \u003cpdf_file\u003e\nwhere [options] are:\nTabula helps you extract tables from PDFs\n       --pages, -p \u003cs\u003e:   Comma separated list of ranges. Examples: --pages\n                          1-3,5-7 or --pages 3. Default is --pages 1 (default:\n                          1)\n        --area, -a \u003cs\u003e:   Portion of the page to analyze\n                          (top,left,bottom,right). Example: --area\n                          269.875,12.75,790.5,561. Default is entire page\n     --columns, -c \u003cs\u003e:   X coordinates of column boundaries. Example --columns\n                          10.1,20.2,30.3\n    --password, -s \u003cs\u003e:   Password to decrypt document. 
Default is empty\n                          (default: )\n           --guess, -g:   Guess the portion of the page to analyze per page.\n           --debug, -d:   Print detected table areas instead of processing.\n      --format, -f \u003cs\u003e:   Output format (CSV,TSV,HTML,JSON) (default: CSV)\n     --outfile, -o \u003cs\u003e:   Write output to \u003cfile\u003e instead of STDOUT (default: -)\n     --spreadsheet, -r:   Force PDF to be extracted using spreadsheet-style\n                          extraction (if there are ruling lines separating each\n                          cell, as in a PDF of an Excel spreadsheet)\n  --no-spreadsheet, -n:   Force PDF not to be extracted using spreadsheet-style\n                          extraction (if there are ruling lines separating each\n                          cell, as in a PDF of an Excel spreadsheet)\n          --silent, -i:   Suppress all stderr output.\n--use-line-returns, -u:   Use embedded line returns in cells.\n         --version, -v:   Print version and exit\n            --help, -h:   Show this message\n```\n\n## Scripting examples\n\n`tabula-extractor` is a RubyGem that you can use to programmatically extract tabular data, using the Tabula engine, in your scripts or applications. 
We don't have docs yet, but [the tests](test/tests.rb) are a good source of information.\n\nHere's a very basic example:\n\n````ruby\nrequire 'tabula'\n\npdf_file_path = \"whatever.pdf\"\noutfilename = \"whatever.csv\"\n\n# The block form of File.open closes the file even if extraction raises.\nFile.open(outfilename, 'w') do |out|\n  # :all asks the extractor to process every page of the PDF.\n  extractor = Tabula::Extraction::ObjectExtractor.new(pdf_file_path, :all)\n  extractor.extract.each do |pdf_page|\n    # Write each detected spreadsheet-style table as CSV, separated by blank lines.\n    pdf_page.spreadsheets.each do |spreadsheet|\n      out \u003c\u003c spreadsheet.to_csv\n      out \u003c\u003c \"\\n\\n\"\n    end\n  end\nend\n\n````\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftabulapdf%2Ftabula-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftabulapdf%2Ftabula-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftabulapdf%2Ftabula-extractor/lists"}