{"id":28938135,"url":"https://github.com/emmeryn/hocr-turtletext","last_synced_at":"2025-10-08T18:11:58.680Z","repository":{"id":40183682,"uuid":"235540115","full_name":"emmeryn/hocr-turtletext","owner":"emmeryn","description":"A gem that parses positional text from hOCR output and provides convenience methods to find text.","archived":false,"fork":false,"pushed_at":"2022-10-20T05:16:36.000Z","size":18,"stargazers_count":3,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-05-20T04:38:42.845Z","etag":null,"topics":["extract-text","gem","hocr","ruby-on-rails"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/emmeryn.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-01-22T09:34:14.000Z","updated_at":"2023-01-12T02:53:43.000Z","dependencies_parsed_at":"2022-09-11T08:40:11.077Z","dependency_job_id":null,"html_url":"https://github.com/emmeryn/hocr-turtletext","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/emmeryn/hocr-turtletext","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emmeryn%2Fhocr-turtletext","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emmeryn%2Fhocr-turtletext/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emmeryn%2Fhocr-turtletext/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emmeryn%2Fhocr-turtletext/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/emmeryn","download_url"
:"https://codeload.github.com/emmeryn/hocr-turtletext/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/emmeryn%2Fhocr-turtletext/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261138366,"owners_count":23115124,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["extract-text","gem","hocr","ruby-on-rails"],"created_at":"2025-06-22T22:06:55.622Z","updated_at":"2025-10-08T18:11:53.644Z","avatar_url":"https://github.com/emmeryn.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# HocrTurtletext\n\nHeavily inspired by [PDF::Reader::Turtletext](https://github.com/tardate/pdf-reader-turtletext), HocrTurtletext provides convenient methods to extract content from a hOCR file. hOCR output is commonly produced by OCR software such as tesseract-ocr.\n\n## Installation\n\nAdd this line to your application's Gemfile:\n\n```ruby\ngem 'hocr_turtletext'\n```\n\nAnd then execute:\n\n    $ bundle\n\nOr install it yourself as:\n\n    $ gem install hocr_turtletext\n\n## Usage\n\n### Instantiate HocrTurtletext\n\nTypical usage: \n```ruby\nhocr_path = '/tmp/page1.hocr'\noptions = { :y_precision =\u003e 7 }\nreader = HocrTurtletext::Reader.new(hocr_path, options)\n```\n\nOptions:  \n`x_whitespace_threshold`: Words with a x distance of less than this threshold will be concatenated with a space. Try increasing this value if words/letters that are supposed to belong together are separated.   
\n`y_precision`: Rows of text whose y positions differ by less than `y_precision` are merged into one row. Try increasing this value if words that are supposed to be on the same row are detected as separate rows.\n\n### Extract text within a region described in relation to other text\n\nThis method works nearly identically to its counterpart from PDF::Reader::Turtletext. \nThe main difference is that we are not dealing with multiple pages in our hOCR input, so\nthere is no need to support page selection.\n\nIf the text you want is positioned relative to other text (for example,\nbelow one piece of text, to the left of another, and above a third), use \nthe `bounding_box` method to describe the region and extract the matching text.\n```ruby\n  textangle = reader.bounding_box do\n    below /electricity/i\n    above 10\n    right_of 240.0\n    left_of \"Total ($)\"\n  end\n  textangle.text\n  # =\u003e [['string','string'],['string']] (an array of rows; each row is an array of the text elements in that row)\n```\n\nThe methods that can be used within the `bounding_box` block are all optional:\n- `inclusive` - whether region selection should be inclusive or exclusive of the specified positions\n  (default is false).\n- `below` - a string, regex or number that describes the upper limit of the text box\n  (default is top border of the page).\n- `above` - a string, regex or number that describes the lower limit of the text box\n  (default is bottom border of the page).\n- `left_of` - a string, regex or number that describes the right limit of the text box\n  (default is right border of the page).\n- `right_of` - a string, regex or number that describes the left limit of the text box\n  (default is left border of the page).\n\nNote that `left_of` and `right_of` constraints do *not* need to be within the vertical\nrange of the box being described.\nFor example, you could use an element in the page header to 
describe the `left_of` limit\nfor a table at the bottom of the page, provided it has the alignment needed to describe your text region.\n\nSimilarly, `above` and `below` constraints do *not* need to be within the horizontal\nrange of the box being described.\n\n### Using a block parameter with the `bounding_box` method\n\nAn explicit block parameter may be used with the `bounding_box` method:\n```ruby\n  textangle = reader.bounding_box do |r|\n    r.below /electricity/i\n    r.left_of \"Total ($)\"\n  end\n  textangle.text\n  # =\u003e [['string','string'],['string']] (an array of rows; each row is an array of the text elements in that row)\n```\n\n### How to describe an inclusive `bounding_box` region\n\nBy default, the `bounding_box` method makes an exclusive selection (i.e. not including the\nregion limits).\n\nTo specify an inclusive region, use the `inclusive!` command:\n```ruby\n  textangle = reader.bounding_box do\n    inclusive!\n    below /electricity/i\n    left_of \"Total ($)\"\n  end\n```\nAlternatively, set `inclusive` to true:\n```ruby\n  textangle = reader.bounding_box do\n    inclusive true\n    below /electricity/i\n    left_of \"Total ($)\"\n  end\n```\nOr with a block parameter, you may also assign `inclusive` to true:\n```ruby\n  textangle = reader.bounding_box do |r|\n    r.inclusive = true\n    r.below /electricity/i\n    r.left_of \"Total ($)\"\n  end\n```\n\n### Extract text for a region with known positional co-ordinates\n\nIf you know (or can calculate) the x,y positions of the required text region, you can extract the region's text using the `text_in_region` method.\n```ruby\n  text = reader.text_in_region(\n    10,   # minimum x (left-most)\n    900,  # maximum x (right-most)\n    200,  # minimum y (top-most)\n    400,  # maximum y (bottom-most)\n    false # inclusive of x/y position if true (default false)\n  )\n  # =\u003e [['string','string'],['string']] (an array of rows; each row is an array of the text elements in that row)\n```\nNote that the x,y origin is at 
the **top-left**. \nThis differs from how it works in PDF::Reader::Turtletext, where the origin \nwas bottom-left of the page.\n\n### How to find the x,y co-ordinate of a specific text element\n\nIf you are doing low-level text extraction (with `text_in_region`, for example),\nyou usually need to locate specific text to serve as a positional reference.\n\nUse the `text_position` method to locate text by exact or partial match.\nIt returns a Hash of the x/y co-ordinates of the bottom-left corner of the text.\n```ruby\n  text_by_exact_match = reader.text_position(\"Transaction Table\")\n  # =\u003e { :x =\u003e 10.0, :y =\u003e 600.0 }\n  text_by_regex_match = reader.text_position(/transaction summary/i)\n  # =\u003e { :x =\u003e 10.0, :y =\u003e 300.0 }\n```\nNote: in the case of multiple matches, only the first match is returned.\n\n## Contributing\n\n- Check the issue tracker to see whether someone is already working on what you plan to work on\n- Fork the project\n- Create a new branch\n- Make your changes in the new branch\n- Submit a pull request\n\n## License\n\nThe gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).\n\n## Special Thanks\n- Paul Gallagher, creator of the [PDF::Reader::Turtletext](https://github.com/tardate/pdf-reader-turtletext) gem, from which large sections of this gem were copied or adapted.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Femmeryn%2Fhocr-turtletext","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Femmeryn%2Fhocr-turtletext","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Femmeryn%2Fhocr-turtletext/lists"}