https://github.com/jhund/pdfbox_text_extraction

Provides a JRuby wrapper for the Apache PDFBox library to extract plain text from PDF documents.

Host: GitHub
URL: https://github.com/jhund/pdfbox_text_extraction
Owner: jhund
License: mit
Created: 2016-03-18T15:20:03.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2019-07-12T21:18:06.000Z (almost 7 years ago)
Last Synced: 2025-06-29T09:02:52.646Z (12 months ago)
Topics: extract, jruby, pdfbox, plain-text
Language: Ruby
Size: 4.25 MB
Stars: 4
Watchers: 2
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt


# PDFBox text extraction

This gem lets you extract plain text from PDF documents. It is a JRuby wrapper for the [Apache PDFBox](https://pdfbox.apache.org/) library.

## Installation

Add this line to your application's Gemfile:

    gem 'pdfbox_text_extraction'

And then execute:

    $ bundle

Or install it yourself as:

    $ gem install pdfbox_text_extraction

## Usage

To extract all text on every page:

    extracted_text = PdfboxTextExtraction.run(path_to_pdf)

To extract text inside a crop area:

    extracted_text = PdfboxTextExtraction.run(
      path_to_pdf,
      {
        crop_x: 0, # crop area top left corner x-coordinate
        crop_y: 1.0, # crop area top left corner y-coordinate
        crop_width: 8.5, # crop area width
        crop_height: 9.4, # crop area height
      }
    )
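
The crop options can also be derived from page dimensions and margins instead of being written by hand. The following is a minimal sketch, not part of the gem: `crop_options_for` and its keyword arguments are hypothetical names, the file path is a placeholder, and the numbers are assumed to be in whatever unit the gem expects for its crop values.

```ruby
# Hypothetical helper (not part of the gem): turns page dimensions and margins
# into a hash with the crop_x, crop_y, crop_width and crop_height keys used above.
# Units are assumed to match whatever the gem expects for those options.
def crop_options_for(page_width:, page_height:, top: 0, right: 0, bottom: 0, left: 0)
  width  = page_width - left - right
  height = page_height - top - bottom
  raise ArgumentError, 'margins leave no area to crop' unless width > 0 && height > 0

  { crop_x: left, crop_y: top, crop_width: width, crop_height: height }
end

# Example: skip a 1.0 band at the top and a 0.6 band at the bottom of an 8.5 x 11 page.
options = crop_options_for(page_width: 8.5, page_height: 11, top: 1.0, bottom: 0.6)

# The gem requires JRuby, so only call it when it has actually been loaded.
# 'path/to/document.pdf' is a placeholder path.
if defined?(PdfboxTextExtraction)
  extracted_text = PdfboxTextExtraction.run('path/to/document.pdf', options)
end
```

Keeping the margin arithmetic in one place makes it easier to reuse the same crop area across many documents, for example to drop running headers and footers.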

## Contributing

1. Fork it ( https://github.com/jhund/pdfbox_text_extraction/fork )

2. Create your feature branch (`git checkout -b my-new-feature`)

3. Commit your changes (`git commit -am 'Add some feature'`)

4. Push to the branch (`git push origin my-new-feature`)

5. Create a new Pull Request

### Resources

* [Source code (github)](https://github.com/jhund/pdfbox_text_extraction)

* [Issues](https://github.com/jhund/pdfbox_text_extraction/issues)

* [Rubygems.org](http://rubygems.org/gems/pdfbox_text_extraction)

### License

[MIT licensed](https://github.com/jhund/pdfbox_text_extraction/blob/master/LICENSE.txt).

### Copyright

Copyright (c) 2016 Jo Hund. See [(MIT) LICENSE](https://github.com/jhund/pdfbox_text_extraction/blob/master/LICENSE.txt) for details.
