https://github.com/jhund/pdfbox_text_extraction
Provides a JRuby wrapper for the Apache PDFBox library to extract plain text from PDF documents.
- Host: GitHub
- URL: https://github.com/jhund/pdfbox_text_extraction
- Owner: jhund
- License: mit
- Created: 2016-03-18T15:20:03.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2019-07-12T21:18:06.000Z (almost 7 years ago)
- Last Synced: 2025-06-29T09:02:52.646Z (12 months ago)
- Topics: extract, jruby, pdfbox, plain-text
- Language: Ruby
- Size: 4.25 MB
- Stars: 4
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
README
# PDFBox text extraction
This gem lets you extract plain text from PDF documents. It is a JRuby wrapper for the [Apache PDFBox](https://pdfbox.apache.org/) library.
## Installation
Add this line to your application's Gemfile:

    gem 'pdfbox_text_extraction'

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install pdfbox_text_extraction
## Usage
To extract all text on every page:

    extracted_text = PdfboxTextExtraction.run(path_to_pdf)
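The extracted text is usually saved or processed further. The following is a minimal sketch, assuming `run` returns the text as a String (as the variable name above suggests). `save_extracted_text` and its `extractor:` parameter are hypothetical names introduced for illustration and are not part of the gem; the require name in the usage comment is also an assumption.

```ruby
# Hypothetical helper, not part of the gem: extracts the text of one PDF and
# writes it to a .txt file next to the source document.
# `extractor` is any callable that takes a PDF path and returns a String.
def save_extracted_text(path_to_pdf, extractor:)
  text = extractor.call(path_to_pdf)
  # Build the output path from the directory and the base name without its
  # extension, so the source PDF itself is never overwritten.
  base_name = File.basename(path_to_pdf, '.*')
  path_to_txt = File.join(File.dirname(path_to_pdf), "#{base_name}.txt")
  File.write(path_to_txt, text)
  path_to_txt
end

# Usage under JRuby (assumed require name, gem must be installed):
#   require 'pdfbox_text_extraction'
#   save_extracted_text('report.pdf', extractor: PdfboxTextExtraction.method(:run))
```

Passing the extractor as a callable keeps the helper independent of JRuby, so it can be exercised with a stub on any Ruby interpreter.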
To extract text inside a crop area:

    extracted_text = PdfboxTextExtraction.run(
      path_to_pdf,
      {
        crop_x: 0, # crop area top left corner x-coordinate
        crop_y: 1.0, # crop area top left corner y-coordinate
        crop_width: 8.5, # crop area width
        crop_height: 9.4, # crop area height
      }
    )
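If the crop values come from user input or configuration, it can help to check them before calling the gem. The sketch below builds the same options hash as above. `crop_options` is a hypothetical helper, not part of the gem, and the rule that values must be non-negative with a non-zero width and height is an assumption of this sketch, not documented behavior of the library.

```ruby
# Hypothetical helper, not part of the gem: builds the crop options hash
# and rejects values that cannot describe a crop area.
def crop_options(x:, y:, width:, height:)
  values = { crop_x: x, crop_y: y, crop_width: width, crop_height: height }
  values.each do |key, value|
    unless value.is_a?(Integer) || value.is_a?(Float)
      raise ArgumentError, "#{key} must be an Integer or Float, got #{value.inspect}"
    end
    # Assumption of this sketch: negative coordinates and sizes are rejected.
    raise ArgumentError, "#{key} must not be negative" if value.negative?
  end
  if width.zero? || height.zero?
    raise ArgumentError, 'crop_width and crop_height must be greater than zero'
  end
  values
end

options = crop_options(x: 0, y: 1.0, width: 8.5, height: 9.4)

# Under JRuby with the gem loaded:
#   extracted_text = PdfboxTextExtraction.run(path_to_pdf, options)
```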
## Contributing
1. Fork it ( https://github.com/jhund/pdfbox_text_extraction/fork )
2. Create your feature branch (`git checkout -b my-new-feature`)
3. Commit your changes (`git commit -am 'Add some feature'`)
4. Push to the branch (`git push origin my-new-feature`)
5. Create a new Pull Request
### Resources
* [Source code (GitHub)](https://github.com/jhund/pdfbox_text_extraction)
* [Issues](https://github.com/jhund/pdfbox_text_extraction/issues)
* [Rubygems.org](http://rubygems.org/gems/pdfbox_text_extraction)
### License
[MIT licensed](https://github.com/jhund/pdfbox_text_extraction/blob/master/LICENSE.txt).
### Copyright
Copyright (c) 2016 Jo Hund. See [(MIT) LICENSE](https://github.com/jhund/pdfbox_text_extraction/blob/master/LICENSE.txt) for details.