https://github.com/mrcsparker/ruby_tika_app
A Ruby wrapper for the Tika jar (tika-app.jar) that extracts text from files in many formats, such as PDF, XLS, and DOC
- Host: GitHub
- URL: https://github.com/mrcsparker/ruby_tika_app
- Owner: mrcsparker
- License: mit
- Created: 2011-11-30T03:33:13.000Z (almost 14 years ago)
- Default Branch: master
- Last Pushed: 2022-09-30T18:25:22.000Z (about 3 years ago)
- Last Synced: 2025-02-27T08:20:46.580Z (7 months ago)
- Topics: ruby, ruby-tika, tika
- Language: DIGITAL Command Language
- Homepage: https://github.com/mrcsparker/ruby_tika_app
- Size: 415 MB
- Stars: 26
- Watchers: 2
- Forks: 20
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: HISTORY
- License: LICENSE
README
## Ruby Tika Parser
### Introduction
This is a simple frontend to the command-line jar (tika-app) of the Java Tika parser.
It is the same as running:

    java -server -Djava.awt.headless=true -Dfile.encoding=UTF-8 -jar tika-app-1.24.1.jar FileToParse.pdf

with options like `--xml`, `--text`, etc.
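As a rough sketch of what wrapping that command from Ruby might look like, the snippet below builds the same argument list and runs it with `Open3`. The helper names `tika_command` and `run_tika`, and the idea of passing the jar path explicitly, are assumptions made for illustration; they are not taken from the gem's source.

```ruby
require 'open3'

# JVM flags copied from the command shown above.
JAVA_FLAGS = %w[-server -Djava.awt.headless=true -Dfile.encoding=UTF-8].freeze

# Hypothetical helper: builds the argument list for the java invocation.
# `option` is a tika-app output flag such as '--xml' or '--text'.
def tika_command(jar_path, file, option = '--text')
  ['java', *JAVA_FLAGS, '-jar', jar_path, option, file]
end

# Hypothetical helper: runs the command and returns Tika's standard output.
# Requires `java` on $PATH and a tika-app jar at `jar_path`.
def run_tika(jar_path, file, option = '--text')
  stdout, stderr, status = Open3.capture3(*tika_command(jar_path, file, option))
  raise "tika-app failed: #{stderr}" unless status.success?

  stdout
end
```

Passing the arguments as an array rather than one interpolated string avoids shell quoting problems with file names that contain spaces.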
### Installation
To install, add ruby_tika_app to your _Gemfile_ and run `bundle install`:

    gem 'ruby_tika_app'

### Note about installation
RubyTikaApp is a pretty big gem since it includes the tika-app jar file.
It might take a while to install.
### Usage
First, you need Java installed, and the `java` executable needs to be in your $PATH.
Then:
```ruby
require 'ruby_tika_app'

rta = RubyTikaApp.new("sample_file.pdf")
puts rta.to_xml # prints the document as XML
# You also get to_json, to_text, to_text_main, and to_metadata
```
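If you want every output format at once, the accessors listed above can be collected into a single hash. This is only a sketch: `tika_outputs` and `TIKA_ACCESSORS` are hypothetical names, not part of ruby_tika_app, and the helper assumes nothing beyond an object that responds to the five methods named in this README.

```ruby
# Accessor names as listed in the usage example above.
TIKA_ACCESSORS = %i[to_xml to_json to_text to_text_main to_metadata].freeze

# Hypothetical helper: calls each accessor and returns a hash keyed by
# accessor name. Each call may start its own Java process, so this can be slow.
def tika_outputs(parser)
  TIKA_ACCESSORS.to_h { |name| [name, parser.public_send(name)] }
end

# Usage, assuming the gem is installed and Java is on $PATH:
#   require 'ruby_tika_app'
#   outputs = tika_outputs(RubyTikaApp.new('sample_file.pdf'))
#   puts outputs[:to_text]
```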
### Testing
Run:

    bundle exec rspec spec/

*NOTE*: Since we use an underlying Java library to connect to external
URLs, we can't use a standard mocking library. Instead, the test suite starts a
rack-based web server.
### Contributing
Fork on GitHub and, after you've committed tested patches, send a pull request.