{"id":14969845,"url":"https://github.com/planio-gmbh/plaintext","last_synced_at":"2025-11-11T18:40:35.142Z","repository":{"id":56888214,"uuid":"114996263","full_name":"planio-gmbh/plaintext","owner":"planio-gmbh","description":"This gem wraps command line tools to extract plain text from typical files, such as PDF and common office formats.","archived":false,"fork":false,"pushed_at":"2025-10-16T00:51:11.000Z","size":756,"stargazers_count":14,"open_issues_count":3,"forks_count":9,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-10-17T02:35:05.791Z","etag":null,"topics":["cv","doc","docx","extract","extraction","files","fulltext","odt","office","pdf","ppt","pptx","rtf","ruby","ruby-on-rails","xsl","xslt"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/planio-gmbh.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-12-21T11:00:43.000Z","updated_at":"2025-10-16T00:51:15.000Z","dependencies_parsed_at":"2022-08-20T17:40:28.676Z","dependency_job_id":null,"html_url":"https://github.com/planio-gmbh/plaintext","commit_stats":null,"previous_names":["planio-gmbh/text-extractor"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/planio-gmbh/plaintext","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/planio-gmbh%2Fplaintext","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/planio-gmbh%2Fplaintext/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/planio-gmbh%2Fplaintext/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/planio-gmbh%2Fplaintext/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/planio-gmbh","download_url":"https://codeload.github.com/planio-gmbh/plaintext/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/planio-gmbh%2Fplaintext/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":283910127,"owners_count":26915128,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-11T02:00:06.610Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cv","doc","docx","extract","extraction","files","fulltext","odt","office","pdf","ppt","pptx","rtf","ruby","ruby-on-rails","xsl","xslt"],"created_at":"2024-09-24T13:42:29.416Z","updated_at":"2025-11-11T18:40:35.137Z","avatar_url":"https://github.com/planio-gmbh.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# plaintext [![Test](https://github.com/planio-gmbh/plaintext/actions/workflows/test.yml/badge.svg)](https://github.com/planio-gmbh/plaintext/actions/workflows/test.yml)\n\nThis gem wraps command line tools to extract plain text from typical files such as\n\n- PDF\n- RTF\n- MS Office\n    - Word (doc, docx)\n    - Excel (xls, xlsx)\n    - PowerPoint (ppt, pptx)\n- OpenOffice + LibreOffice\n    - Presentation\n    - Text\n    - Spreadsheet\n- Image files (png, jpeg, tiff), 
such as screenshots and scanned documents, through character recognition (OCR)\n- Plaintext (txt)\n- Comma-separated values (csv)\n\n## Acknowledgements\n\nThis gem is based on work by Jens Krämer / Planio, who originally provided it as a\n[patch for Redmine](https://www.redmine.org/issues/306). It is now a collaborative effort of the\nproject management software providers [Planio](https://plan.io) and [OpenProject](https://openproject.org),\nas both systems face the same challenge of extracting plain text from attachment files.\n\n## Installation\n\nAdd this line to your application's Gemfile:\n\n```ruby\ngem 'plaintext'\n```\n\nAnd then execute:\n\n    $ bundle\n\nOr install it yourself as:\n\n    $ gem install plaintext\n\n#### Rails\n\nIn a Rails application, save `plaintext.yml.example` as `config/plaintext.yml` and adjust the settings to\nyour needs.\n\nThen load that configuration file in an initializer. Add the following lines to `config/initializers/plaintext.rb`:\n\n```ruby\npath = Rails.root.join 'config', 'plaintext.yml'\nif File.file?(path)\n  config = File.read(path)\n  Plaintext::Configuration.load(config)\nend\n```\n\n#### Plain Ruby\n\nPlease overwrite `Plaintext::Configuration.load`.\n\n### Linux\n\nOn Linux the default configuration should work. However, make sure that the following packages are installed:\n\n    $ apt-get install catdoc unrtf poppler-utils tesseract-ocr\n\n### Mac OS X\n\nOn Mac, support is still incomplete. Please help us to have the same capabilities as under Linux. 
Right now we cannot\nextract text from presentations and spreadsheets.\n\nPlease use Homebrew to install the missing command line tools.\n\n    $ brew install unrtf poppler tesseract\n    \nThe `plaintext.yml` should look like this:\n    \n```yml\npdftotext:\n  - /usr/local/bin/pdftotext\n  - -enc\n  - UTF-8\n  - __FILE__\n  - '-'\n\nunrtf:\n  - /usr/local/bin/unrtf\n  - --text\n  - __FILE__\n\ntesseract:\n  - /usr/local/bin/tesseract\n  - __FILE__\n  - stdout\n\ncatdoc:\n  - /usr/bin/textutil\n  - -convert\n  - txt\n  - -stdout\n  - __FILE__\n```\n\n## Usage\n\n```ruby\n# `file` is of type File.\n# `content_type` is a String.\nfulltext = Plaintext::Resolver.new(file, content_type).text\n```\n\nTo limit the number of bytes returned (default is 4MB), set the\n`max_plaintext_bytes` property on the resolver instance before calling `text`.\n\n## License\n\nThe `plaintext` gem is free software; you can redistribute it and/or modify it under the terms of the GNU General \nPublic License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any \nlater version.\n\nThis program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied \nwarranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.\n\nYou should have received a copy of the GNU General Public License along with the plugin. If not, see\n[www.gnu.org/licenses](https://www.gnu.org/licenses/).\n\n## Contributing\n\nBug reports and pull requests are welcome on GitHub at https://github.com/planio-gmbh/plaintext.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplanio-gmbh%2Fplaintext","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fplanio-gmbh%2Fplaintext","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplanio-gmbh%2Fplaintext/lists"}