{"id":19639244,"url":"https://github.com/lfcipriani/punkt-segmenter","last_synced_at":"2025-04-09T23:16:02.903Z","repository":{"id":56889591,"uuid":"740516","full_name":"lfcipriani/punkt-segmenter","owner":"lfcipriani","description":"Ruby port of the NLTK Punkt sentence segmentation algorithm","archived":false,"fork":false,"pushed_at":"2018-06-10T22:23:21.000Z","size":152,"stargazers_count":92,"open_issues_count":5,"forks_count":10,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-09T23:15:58.790Z","etag":null,"topics":["nlp-library","nltk","punkt-segmenter","ruby","ruby-port","rubynlp","sentence-boundaries","sentence-tokenizer","tokenized-sentences"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lfcipriani.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2010-06-26T02:17:41.000Z","updated_at":"2024-02-13T06:55:17.000Z","dependencies_parsed_at":"2022-08-20T14:30:07.407Z","dependency_job_id":null,"html_url":"https://github.com/lfcipriani/punkt-segmenter","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lfcipriani%2Fpunkt-segmenter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lfcipriani%2Fpunkt-segmenter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lfcipriani%2Fpunkt-segmenter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lfcipriani%2Fpunkt-segmenter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/owners/lfcipriani","download_url":"https://codeload.github.com/lfcipriani/punkt-segmenter/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248125591,"owners_count":21051770,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["nlp-library","nltk","punkt-segmenter","ruby","ruby-port","rubynlp","sentence-boundaries","sentence-tokenizer","tokenized-sentences"],"created_at":"2024-11-11T12:45:36.372Z","updated_at":"2025-04-09T23:16:02.864Z","avatar_url":"https://github.com/lfcipriani.png","language":"Ruby","funding_links":[],"categories":["NLP Pipeline Subtasks"],"sub_categories":["Segmentation"],"readme":"# Punkt sentence tokenizer\n\nThis code is a Ruby 1.9.x port of the Punkt sentence tokenizer algorithm implemented by the NLTK Project ([http://www.nltk.org/]). Punkt is a **language-independent**, unsupervised approach to **sentence boundary detection**. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified.\n\nThe full description of the algorithm is presented in the following academic paper:\n\n\u003e Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.  \n\u003e Computational Linguistics 32: 485-525.  
\n\u003e [Download paper]\n\nHere are the credits for the original implementation:\n\n- Willy (willy@csse.unimelb.edu.au) (original Python port)\n- Steven Bird (sb@csse.unimelb.edu.au) (additions)\n- Edward Loper (edloper@gradient.cis.upenn.edu) (rewrite)\n- Joel Nothman (jnothman@student.usyd.edu.au) (almost rewrite)\n\nI simply did the Ruby port and some API changes.\n\n## Install\n\n    gem install punkt-segmenter\n\nCurrently, this gem only runs on Ruby 1.9.x (because of the unicode_utils dependency).\n\n## How to use\n\nLet's suppose we have the following text:\n\n*\"A minute is a unit of measurement of time or of angle. The minute is a unit of time equal to 1/60th of an hour or 60 seconds by 1. In the UTC time scale, a minute occasionally has 59 or 61 seconds; see leap second. The minute is not an SI unit; however, it is accepted for use with SI units. The symbol for minute or minutes is min. The fact that an hour contains 60 minutes is probably due to influences from the Babylonians, who used a base-60 or sexagesimal counting system. Colloquially, a min. 
may also refer to an indefinite amount of time substantially longer than the standardized length.\"* (source: http://en.wikipedia.org/wiki/Minute)\n\nYou can split it into sentences using a Punkt::SentenceTokenizer object:\n\n    tokenizer = Punkt::SentenceTokenizer.new(text)\n    result    = tokenizer.sentences_from_text(text, :output =\u003e :sentences_text)\n\nThe result will be:\n\n    result    = [\n        [0] \"A minute is a unit of measurement of time or of angle.\",\n        [1] \"The minute is a unit of time equal to 1/60th of an hour or 60 seconds by 1.\",\n        [2] \"In the UTC time scale, a minute occasionally has 59 or 61 seconds; see leap second.\",\n        [3] \"The minute is not an SI unit; however, it is accepted for use with SI units.\",\n        [4] \"The symbol for minute or minutes is min.\",\n        [5] \"The fact that an hour contains 60 minutes is probably due to influences from the Babylonians, who used a base-60 or sexagesimal counting system.\",\n        [6] \"Colloquially, a min. may also refer to an indefinite amount of time substantially longer than the standardized length.\"\n    ]\n\nThe algorithm trains on the text passed as a parameter and then splits it into sentences. Sometimes the input text is too small to produce a well-trained model, which may cause mistakes in sentence splitting. In these cases you can train the Punkt segmenter separately:\n\n    trainer = Punkt::Trainer.new()\n    trainer.train(training_text)\n    \n    tokenizer = Punkt::SentenceTokenizer.new(trainer.parameters)\n    result    = tokenizer.sentences_from_text(text, :output =\u003e :sentences_text)\n\nIn this case, instead of passing the text to SentenceTokenizer, you pass the trainer parameters.\n\nA recommended use case for the trainer object is to train it on a large corpus in a specific language and then marshal the object to a file. Then you can load the already trained tokenizer from a file. 
You can even add more texts to the training set whenever you want.\n\nThe available options for the *sentences_from_text* method are:\n\n- array of sentence indexes (default)\n- array of sentence strings (**:output =\u003e :sentences_text**)\n- array of sentence tokens (**:output =\u003e :tokenized_sentences**)\n- realigned boundaries (**:realign_boundaries =\u003e true**): use this to realign sentences that end with, for example, parentheses, quotes, or brackets\n\nIf you have a list of tokens, you can use the *sentences_from_tokens* method, which takes only the list of tokens as a parameter.\n\nCheck the unit tests for more detailed examples in English and Portuguese.\n\n----\n*This code follows the terms and conditions of Apache License v2 (http://www.apache.org/licenses/LICENSE-2.0)*\n\n*Copyright (C) Luis Cipriani*\n  \n  [http://www.nltk.org/]: http://www.nltk.org/\n  [Download paper]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.5017\u0026rep=rep1\u0026type=pdf\n\n\n\n[![Bitdeli Badge](https://d2weczhvl823v0.cloudfront.net/lfcipriani/punkt-segmenter/trend.png)](https://bitdeli.com/free \"Bitdeli Badge\")\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flfcipriani%2Fpunkt-segmenter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flfcipriani%2Fpunkt-segmenter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flfcipriani%2Fpunkt-segmenter/lists"}