{"id":17166722,"url":"https://github.com/willf/segment","last_synced_at":"2025-04-13T15:25:58.223Z","repository":{"id":30716666,"uuid":"34272847","full_name":"willf/segment","owner":"willf","description":"A tool to segment text based on frequencies and the Viterbi algorithm \"#TheBoyWhoLived\" =\u003e ['#', 'The', 'Boy', 'Who', 'Lived']","archived":false,"fork":false,"pushed_at":"2016-04-23T04:51:08.000Z","size":2230,"stargazers_count":82,"open_issues_count":0,"forks_count":15,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-10T20:13:03.085Z","etag":null,"topics":["python","segment","viterbi-algorithm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/willf.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-04-20T16:37:26.000Z","updated_at":"2024-01-04T15:59:41.000Z","dependencies_parsed_at":"2022-09-19T00:50:21.666Z","dependency_job_id":null,"html_url":"https://github.com/willf/segment","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/willf%2Fsegment","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/willf%2Fsegment/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/willf%2Fsegment/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/willf%2Fsegment/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/willf","download_url":"https://codeload.github.com/willf/segment/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248734194,"owners_count":21153166,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["python","segment","viterbi-algorithm"],"created_at":"2024-10-14T23:06:29.581Z","updated_at":"2025-04-13T15:25:58.190Z","avatar_url":"https://github.com/willf.png","language":"Python","readme":"This module segments text according word frequency using the Viterbi algorithm. Probably\ndue to Peter Norvig somehow.\n\nThree sources of frequency information is provided.\n\nOne is from the Google NGram corpus, a general web corpus.\n\nThe second is from the Rovereto Twitter N-Gram Corpus, which is better for some Twitter data.\n\nThe third is from a webcrawl dataset of anchor text provided\nby Vinay Goel of the Internet Archive.\n\n    \u003e from segment.segmenter import Analyzer\n    \u003e e = Analyzer('en')\n    \u003e e.segment(\"AbeLincoln\")\n    ['Abe', 'Lincoln']\n    \u003e e.segment(\"BieberHeartsBeliebers\")\n    ['Bi', 'e', 'ber', 'Hearts', 'Be', 'lieber', 's']\n    \u003e t = Analyzer('twitter')\n    \u003e t.segment(\"BieberHeartsBeliebers\")\n    ['Bieber', 'Hearts', 'Beliebers']\n    \u003e t = Analyzer('anchor')\n    \u003e t.segment(\"wordpress\u0026sex\")\n    ['wordpress', '\u0026', 'sex']\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwillf%2Fsegment","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwillf%2Fsegment","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwillf%2Fsegment/lists"}