{"id":14263146,"url":"https://github.com/djc/instant-segment","last_synced_at":"2025-04-06T02:08:59.739Z","repository":{"id":43483608,"uuid":"267115252","full_name":"djc/instant-segment","owner":"djc","description":"Fast English word segmentation in Rust","archived":false,"fork":false,"pushed_at":"2025-03-17T13:34:44.000Z","size":7181,"stargazers_count":97,"open_issues_count":2,"forks_count":7,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-30T01:08:13.485Z","etag":null,"topics":["natural-language-processing","rust","segmentation"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/djc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["djc"],"patreon":"dochtman"}},"created_at":"2020-05-26T18:02:04.000Z","updated_at":"2025-03-10T15:09:14.000Z","dependencies_parsed_at":"2024-08-04T21:37:05.082Z","dependency_job_id":"34688bfd-a7ae-40fe-87fc-20557fc0256e","html_url":"https://github.com/djc/instant-segment","commit_stats":{"total_commits":185,"total_committers":10,"mean_commits":18.5,"dds":"0.16756756756756752","last_synced_commit":"6cd525c6a6153ac3317149950be1756554c987cd"},"previous_names":["instantdomain/instant-segment","instantdomainsearch/word-segmenters","djc/instant-segment"],"tags_count":22,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djc%2Finstant-segment","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djc%2Finstant-segment/t
ags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djc%2Finstant-segment/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/djc%2Finstant-segment/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/djc","download_url":"https://codeload.github.com/djc/instant-segment/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247423515,"owners_count":20936626,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["natural-language-processing","rust","segmentation"],"created_at":"2024-08-22T13:02:19.341Z","updated_at":"2025-04-06T02:08:59.713Z","avatar_url":"https://github.com/djc.png","language":"Rust","funding_links":["https://github.com/sponsors/djc","https://patreon.com/dochtman"],"categories":["Rust"],"sub_categories":[],"readme":"![Cover logo](./cover.svg)\n\n# Instant Segment: fast English word segmentation in Rust\n\n[![Documentation](https://docs.rs/instant-segment/badge.svg)](https://docs.rs/instant-segment/)\n[![Crates.io](https://img.shields.io/crates/v/instant-segment.svg)](https://crates.io/crates/instant-segment)\n[![PyPI](https://img.shields.io/pypi/v/instant-segment)](https://pypi.org/project/instant-segment/)\n[![Build status](https://github.com/instant-labs/instant-segment/workflows/CI/badge.svg)](https://github.com/instant-labs/instant-segment/actions?query=workflow%3ACI)\n[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE-APACHE)\n\nInstant Segment is a fast Apache-2.0 library for 
English word segmentation. It\nis based on the Python [wordsegment][python] project written by Grant Jenks,\nwhich is in turn based on code from Peter Norvig's chapter [Natural Language\nCorpus Data][chapter] from the book [Beautiful Data][book] (Segaran and\nHammerbacher, 2009).\n\nFor the microbenchmark included in this repository, Instant Segment is ~500x\nfaster than the Python implementation. The API was carefully constructed\nso that multiple segmentations can share the underlying state to allow parallel\nusage.\n\n## How it works\n\nInstant Segment works by segmenting a string into words by selecting the splits\nwith the highest probability given a corpus of words and their occurrences.\n\nFor instance, provided that `choose` and `spain` occur more frequently than\n`chooses` and `pain`, and that the pair `choose spain` occurs more frequently\nthan `chooses pain`, Instant Segment can help identify the domain\n`choosespain.com` as `ChooseSpain.com` which more likely matches user intent.\n\nRead about [how we built and improved][story] Instant Segment for use in production\nat [Instant Domain Search](https://instantdomainsearch.com/) to help our users\nfind relevant domains they can register.\n\n## Using the library\n\n### Python **(\u003e= 3.9)**\n\n```sh\npip install instant-segment\n```\n\n### Rust\n\n```toml\n[dependencies]\ninstant-segment = \"0.8.1\"\n```\n\n### Examples\n\nThe following examples expect `unigrams` and `bigrams` to exist. 
See the\nexamples ([Rust](./instant-segment/examples/contrived.rs),\n[Python](./instant-segment-py/examples/contrived.py)) for how to construct\nthese objects.\n\n```python\nimport instant_segment\n\nsegmenter = instant_segment.Segmenter(unigrams, bigrams)\nsearch = instant_segment.Search()\nsegmenter.segment(\"instantdomainsearch\", search)\nprint([word for word in search])\n\n# --\u003e ['instant', 'domain', 'search']\n```\n\n```rust\nuse instant_segment::{Search, Segmenter};\nuse std::collections::HashMap;\n\nlet segmenter = Segmenter::new(unigrams, bigrams);\nlet mut search = Search::default();\nlet words = segmenter\n    .segment(\"instantdomainsearch\", \u0026mut search)\n    .unwrap();\nprintln!(\"{:?}\", words.collect::\u003cVec\u003c\u0026str\u003e\u003e());\n\n// --\u003e [\"instant\", \"domain\", \"search\"]\n```\n\nCheck out the tests for more thorough examples:\n[Rust](./instant-segment/src/test_cases.rs),\n[Python](./instant-segment-py/test/test.py)\n\n## Testing\n\nTo run the tests, use the following command:\n\n```\ncargo t -p instant-segment --all-features\n```\n\nYou can also test the Python bindings with:\n\n```\nmake test-python\n```\n\n[python]: https://github.com/grantjenks/python-wordsegment\n[chapter]: http://norvig.com/ngrams/\n[story]: https://instantdomains.com/engineering/instant-word-segmentation-with-rust\n[book]: http://oreilly.com/catalog/9780596157111/\n[distributed]: https://catalog.ldc.upenn.edu/LDC2006T13\n[issues]: https://github.com/instant-labs/instant-segment/issues\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdjc%2Finstant-segment","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdjc%2Finstant-segment","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdjc%2Finstant-segment/lists"}