{"id":37114525,"url":"https://github.com/sentencizer/sentencizer","last_synced_at":"2026-01-14T13:28:27.407Z","repository":{"id":222510754,"uuid":"757492418","full_name":"sentencizer/sentencizer","owner":"sentencizer","description":"A sentence splitting (sentence boundary disambiguation) library for Go. It is rule-based and works out-of-the-box.","archived":false,"fork":false,"pushed_at":"2025-08-31T14:21:07.000Z","size":1937,"stargazers_count":44,"open_issues_count":12,"forks_count":8,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-24T14:56:42.072Z","etag":null,"topics":["ai","golang","llm","natural-language-processing","nlp-library","rag","retrieval-augmented-generation","sentence-boundary-detection","sentence-segmentation","sentence-segmenter","sentence-splitter","sentence-splitting","sentence-tokenizer","text-splitter","text-splitting"],"latest_commit_sha":null,"homepage":"https://gosbd.pages.dev/","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sentencizer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-14T15:55:42.000Z","updated_at":"2025-10-15T16:34:25.000Z","dependencies_parsed_at":"2024-02-16T03:27:31.901Z","dependency_job_id":"efe52bda-000b-46e7-a61f-218cbad16133","html_url":"https://github.com/sentencizer/sentencizer","commit_stats":null,"previous_names":["gosbd/gosbd","yohamta/gosbd","sentencizer/sentencizer"],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/sentencizer/sentencizer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/sentencizer%2Fsentencizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sentencizer%2Fsentencizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sentencizer%2Fsentencizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sentencizer%2Fsentencizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sentencizer","download_url":"https://codeload.github.com/sentencizer/sentencizer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sentencizer%2Fsentencizer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28421195,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T10:47:48.104Z","status":"ssl_error","status_checked_at":"2026-01-14T10:46:19.031Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","golang","llm","natural-language-processing","nlp-library","rag","retrieval-augmented-generation","sentence-boundary-detection","sentence-segmentation","sentence-segmenter","sentence-splitter","sentence-splitting","sentence-tokenizer","text-splitter","text-splitting"],"created_at":"2026-01-14T13:28:26.675Z","updated_at":"2026-01-14T13:28:27.392Z","avatar_url":"https://github.com/sentencizer.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# **Sentencizer: Sentence Splitting (Sentence Boundary Disambiguation) Library for Go**\n\n\u003cimg align=\"right\" width=\"320\" src=\"/artifacts/sbd-gopher.png\" alt=\"sentencizer-logo\" title=\"dsbd-logo\" /\u003e\n\n[![Godoc](http://img.shields.io/badge/go-documentation-blue.svg?style=flat-square)](https://godoc.org/github.com/sentencizer/sentencizer)\n\nSentencizer is a library for segmenting text into sentences, designed to make it easier to build Retrieval Augmented Generation (RAG) systems in Go. 
It is inspired by [pySBD](https://github.com/nipunsadvilkar/pySBD) and [pragmatic_segmenter](https://github.com/diasks2/pragmatic_segmenter), and works out-of-the-box with a rule-based approach.\n\n## Playground\n\nTry out Sentencizer in our [online playground](https://gosbd.pages.dev).\n\n## Features\n\n- **Sentence Splitting**: Efficiently breaks down a block of text into individual sentences.\n- **Lightweight and Easy Integration**: Designed to be lightweight and easy to integrate into existing Go projects.\n- **High Accuracy**: Offers high accuracy in sentence segmentation. For more details, see [pySBD](https://github.com/nipunsadvilkar/pySBD).\n- **Fast Sentence Splitting**: Sentencizer aims to provide high-performance sentence splitting by leveraging Go's efficiency.\n- **Non-Destructive Splitting**: Segments text into sentences without altering the original content.\n- **Language-Specific Configuration**: Adaptable to handle punctuation rules specific to different languages.\n- **Text Cleaning** (planned): Will manage and clean noisy text, including:\n  - Handling irregular newline characters and spacing\n  - Processing Tables of Contents\n  - Recognizing and managing URLs and HTML tags\n  - Dealing with sentences that are delimited without any space\n\n_Note: The Text Cleaning feature is not yet implemented. Contributions are very welcome._\n\n## Installation\n\nTo install Sentencizer, use `go get`:\n\n```sh\ngo get github.com/sentencizer/sentencizer\n```\n\n## Usage\n\nHere's a basic example of how to use Sentencizer:\n\n```go\npackage main\n\nimport (\n    \"fmt\"\n    \"github.com/sentencizer/sentencizer\"\n)\n\n// This example segments a text string into individual sentences.\nfunc main() {\n    segmenter := sentencizer.NewSegmenter(\"en\")\n    text := \"This is a sentence. 
And this is another one.\"\n    sentences := segmenter.Segment(text)\n    for _, sentence := range sentences {\n        fmt.Println(sentence)\n    }\n}\n```\n\n## Roadmap\n\n- [x] Add Online Playground.\n- [ ] Add a chunking feature with an overlap option.\n- [ ] Set up Codecov for monitoring test coverage.\n- [ ] Implement text cleaner.\n- [ ] Add support for more languages.\n- [ ] Add benchmark tests.\n- [ ] Set up GitHub Actions for testing.\n\n## Language Support Roadmap\n\nThe following table outlines our current language support. We're actively seeking contributions to expand this list. If you're interested in contributing, consider helping us add support for a language, whether it's listed below or not. Your expertise in a language not listed here could be a valuable addition to our project.\n\n| Language  | ISO Code | Supported | Contributed By |\n| --------- | -------- | --------- | ----------- |\n| Amharic   | am       | Planned   ||\n| Arabic    | ar       | Planned   ||\n| Armenian  | hy       | Planned   ||\n| Bulgarian | bg       | Planned   ||\n| Burmese   | my       | Planned   ||\n| Chinese   | zh       | Yes       ||\n| Danish    | da       | Planned   ||\n| Deutsch   | de       | Planned   ||\n| Dutch     | nl       | Planned   ||\n| English   | en       | Yes       ||\n| French    | fr       | Planned   ||\n| Greek     | el       | Planned   ||\n| Hindi     | hi       | Planned   ||\n| Hebrew    | he       | Yes       | [@neurlang](https://github.com/neurlang) |\n| Italian   | it       | Planned   ||\n| Japanese  | ja       | Yes       ||\n| Kazakh    | kk       | Planned   ||\n| Lithuanian| lt       | Yes       | [@naglis](https://github.com/naglis) |\n| Marathi   | mr       | Planned   ||\n| Persian   | fa       | Planned   ||\n| Polish    | pl       | Planned   ||\n| Russian   | ru       | Yes       ||\n| Slovak    | sk       | Planned   ||\n| Spanish   | es       | Planned   ||\n| Urdu      | ur       | Planned   ||\n\nWe welcome contributions 
that help us add support for these languages. Please feel free to submit a Pull Request with your contributions.\n\n## Motivation\n\nSentence splitting is a crucial step in the preprocessing pipeline of Natural Language Processing (NLP) tasks, especially for building Retrieval Augmented Generation (RAG) systems. RAG systems rely on accurately segmented sentences to retrieve relevant information and generate coherent responses.\n\nWhile libraries like pragmatic_segmenter and pySBD are known for their high accuracy and efficiency in sentence splitting, there are no equivalent libraries available in Go. This poses a challenge for developers building RAG systems in Go, as they need to rely on external libraries or implement their own sentence splitting logic.\n\nSentencizer aims to bridge this gap by providing a reliable and efficient sentence splitting solution in Go. By offering a native Go library for sentence splitting, Sentencizer simplifies the process of building RAG systems and other NLP applications entirely within the Go ecosystem. This not only streamlines the development workflow but also enables faster execution times by leveraging Go's performance characteristics.\n\n## Acknowledgement\n\nThis library builds upon the excellent foundations laid by [pySBD](https://github.com/nipunsadvilkar/pySBD) and [pragmatic_segmenter](https://github.com/diasks2/pragmatic_segmenter).\n\n## Contributing\n\nContributions are greatly appreciated and crucial for this project! 
Here are a few ways you can contribute:\n\n- **Add new tests and rules**: Improve the accuracy of sentence segmentation by adding new tests and rules.\n- **Add support for a new language**: Help expand the reach of this library by adding support for new languages.\n- **Port features**: Help improve this library by porting features that are supported in pySBD and pragmatic_segmenter.\n\nPlease feel free to submit a Pull Request with your contributions.\n\n## License\n\nThis project is licensed under the MIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsentencizer%2Fsentencizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsentencizer%2Fsentencizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsentencizer%2Fsentencizer/lists"}