{"id":16747906,"url":"https://github.com/amake/srx-ruby","last_synced_at":"2025-04-10T13:52:08.126Z","repository":{"id":59156229,"uuid":"335336347","full_name":"amake/srx-ruby","owner":"amake","description":"An SRX segmenting engine for Ruby","archived":false,"fork":false,"pushed_at":"2025-03-30T12:20:51.000Z","size":103,"stargazers_count":4,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-09T07:18:35.379Z","etag":null,"topics":["nlp","ruby","segmentation","srx"],"latest_commit_sha":null,"homepage":"https://rubygems.org/gems/srx","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/amake.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2021-02-02T15:34:04.000Z","updated_at":"2025-03-30T12:20:54.000Z","dependencies_parsed_at":"2024-03-28T23:38:32.866Z","dependency_job_id":"8d6da138-7404-424b-9457-fb87ddfe1877","html_url":"https://github.com/amake/srx-ruby","commit_stats":{"total_commits":120,"total_committers":1,"mean_commits":120.0,"dds":0.0,"last_synced_commit":"db3b082d954fc21acf48c536b2315aae0680197b"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amake%2Fsrx-ruby","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amake%2Fsrx-ruby/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amake%2Fsrx-ruby/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amake%2Fsrx-ruby/manifests","owner_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/amake","download_url":"https://codeload.github.com/amake/srx-ruby/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248228647,"owners_count":21068731,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["nlp","ruby","segmentation","srx"],"created_at":"2024-10-13T02:11:14.451Z","updated_at":"2025-04-10T13:52:08.109Z","avatar_url":"https://github.com/amake.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SRX for Ruby\n\nSRX is a specification for segmenting text, i.e. splitting text into sentences.\nMore specifically it is\n\n- An XML-based format for specifying segmentation rules, and\n- An algorithm by which the rules are applied\n\nSee the [SRX 2.0 Specification](http://www.ttt.org/oscarStandards/srx/srx20.html)\nfor full details.\n\nThis gem provides facilities for reading SRX files and an engine for performing\nsegmentation.\n\nOnly a minimal rule set is supplied by default; for actual usage you are\nencouraged to supply your own SRX rules. 
One such set of rules is that from\n[LanguageTool](https://languagetool.org/); this is conveniently packaged into a\ncompanion gem:\n[srx-languagetool-ruby](https://github.com/amake/srx-languagetool-ruby).\n\n## What's different about this gem?\n\nThere are lots of good segmentation gems out there such as\n\n- [pragmatic_segmenter](https://github.com/diasks2/pragmatic_segmenter)\n- [TactfulTokenizer](https://github.com/zencephalon/Tactful_Tokenizer)\n- [Punkt](https://github.com/lfcipriani/punkt-segmenter)\n\nWhat makes SRX different is:\n\n- It allows easy customization and exchange of rules via SRX files\n- It preserves whitespace surrounding break points\n- It offers advanced XML/HTML tag handling: it won't be fooled by false breaks\n  in e.g. attribute values\n\nSome other advantages that are not unique to SRX:\n\n- It is offered under a very permissive license\n- It is relatively lightweight as a dependency\n- It is fast (though this depends somewhat on the ruleset you use)\n\nSome disadvantages:\n\n- It is inherently rule-based, with all of the weaknesses that implies\n- It is not very accurate on the [Golden Rules\n  test](https://github.com/diasks2/pragmatic_segmenter#comparison-of-segmentation-tools-libraries-and-algorithms),\n  scoring 47% (English) and 48% (others) with the default rules. However you can\n  improve on that with better rules such as\n  [LanguageTool's](https://github.com/amake/srx-languagetool-ruby).\n\n## Caveats\n\nThe SRX spec calls for [ICU regular\nexpressions](https://unicode-org.github.io/icu/userguide/strings/regexp.html),\nbut this library uses standard [Ruby\nregexp](https://ruby-doc.org/core-2.7.0/Regexp.html). 
Please note:\n\n- Not all ICU syntax is supported\n- For supported syntax, in some cases the meaning of a regex may differ when\n  interpreted as Ruby regexp\n- The following ICU syntax is supported through translation to Ruby syntax:\n  - `\\x{hhhh}` → `\\u{hhhh}`\n  - `\\0ooo` → `\\u{hhhh}`\n\n## Installation\n\nAdd this line to your application's Gemfile:\n\n```ruby\ngem 'srx'\n```\n\nAnd then execute:\n\n    $ bundle install\n\nOr install it yourself as:\n\n    $ gem install srx\n\n## Usage\n\nUse the default rules like so. Specify the language according to the `\u003cmaprules\u003e`\nof your SRX (usually two-letter [ISO 639-1\ncodes](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)).\n\n```ruby\nrequire 'srx'\n\ndata = Srx::Data.default\nengine = Srx::Engine.new(data)\nengine.segment('Hi. How are you?', language: 'en') #=\u003e [\"Hi.\", \" How are you?\"]\n```\n\nOr bring your own rules:\n\n```ruby\ndata = Srx::Data.from_file(path: 'path/to/my/rules.srx')\nengine = Srx::Engine.new(data)\n```\n\nSpecify the format as `:xml` or `:html` to benefit from special handling of\ntags:\n\n```ruby\n# This should only be one segment, but handling as plain text incorrectly\n# produces two segments.\ninput = 'foo \u003cbar baz=\"a. b.\"\u003e bazinga'\n\nSrx::Engine.new(Srx::Data.default).segment(input, language: 'en')\n#=\u003e [\"foo \u003cbar baz=\\\"a.\", \" b.\\\"\u003e bazinga\"]\n\nSrx::Engine.new(Srx::Data.default, format: :xml).segment(input, language: 'en')\n#=\u003e [\"foo \u003cbar baz=\\\"a. b.\\\"\u003e bazinga\"]\n```\n\n## Development\n\nAfter checking out the repo, run `bin/setup` to install dependencies. Then, run\n`rake test` to run the tests. You can also run `bin/console` for an interactive\nprompt that will allow you to experiment.\n\nTo install this gem onto your local machine, run `bundle exec rake install`. 
To\nrelease a new version, update the version number in `version.rb`, and then run\n`bundle exec rake release`, which will create a git tag for the version, push\ngit commits and the created tag, and push the `.gem` file to\n[rubygems.org](https://rubygems.org).\n\n## Contributing\n\nBug reports and pull requests are welcome on GitHub at\nhttps://github.com/amake/srx-ruby.\n\n## License\n\nThe gem is available as open source under the terms of the [MIT\nLicense](https://opensource.org/licenses/MIT).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famake%2Fsrx-ruby","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famake%2Fsrx-ruby","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famake%2Fsrx-ruby/lists"}