{"id":13823011,"url":"https://github.com/textlint-rule/sentence-splitter","last_synced_at":"2025-04-04T07:07:56.351Z","repository":{"id":51183435,"uuid":"46096663","full_name":"textlint-rule/sentence-splitter","owner":"textlint-rule","description":"Split {Japanese, English} text into sentences.","archived":false,"fork":false,"pushed_at":"2023-11-25T05:25:21.000Z","size":373,"stargazers_count":123,"open_issues_count":7,"forks_count":18,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-06T11:07:13.707Z","etag":null,"topics":["english","japanese","javascript","nlp","segement","sentence"],"latest_commit_sha":null,"homepage":"https://sentence-splitter.netlify.app/","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/textlint-rule.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null},"funding":{"github":"azu"}},"created_at":"2015-11-13T03:08:25.000Z","updated_at":"2025-02-26T18:30:46.000Z","dependencies_parsed_at":"2024-01-18T04:10:11.455Z","dependency_job_id":"8b59e626-edce-4304-ab93-4471bed0e625","html_url":"https://github.com/textlint-rule/sentence-splitter","commit_stats":{"total_commits":122,"total_committers":4,"mean_commits":30.5,"dds":0.1311475409836066,"last_synced_commit":"fa8f67138d6702aba7f6e8032ac8fdd542226af1"},"previous_names":["azu/sentence-splitter"],"tags_count":39,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/textlint-rule%2Fsentence-splitter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/textlint-rule%2Fsentence-splitter/tags","releases_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/textlint-rule%2Fsentence-splitter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/textlint-rule%2Fsentence-splitter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/textlint-rule","download_url":"https://codeload.github.com/textlint-rule/sentence-splitter/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247135144,"owners_count":20889421,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["english","japanese","javascript","nlp","segement","sentence"],"created_at":"2024-08-04T08:02:29.501Z","updated_at":"2025-04-04T07:07:56.298Z","avatar_url":"https://github.com/textlint-rule.png","language":"TypeScript","funding_links":["https://github.com/sponsors/azu"],"categories":["TypeScript"],"sub_categories":[],"readme":"# sentence-splitter\n\nSplit {Japanese, English} text into sentences.\n\n## What is sentence?\n\nThis library split next text into 3 sentences.\n\n```\nWe are talking about pens.\nHe said \"This is a pen. I like it\".\nI could relate to that statement.\n```\n\nResult is:\n\n![Sentence Image](./docs/img/sentence-result.png)\n\nYou can check actual AST in online playground.\n\n- \u003chttps://sentence-splitter.netlify.app/#We%20are%20talking%20about%20pens.%0AHe%20said%20%22This%20is%20a%20pen.%20I%20like%20it%22.%0AI%20could%20relate%20to%20that%20statement.\u003e\n\nSecond sentence includes `\"This is a pen. 
I like it\"`, but this library does not split it into separate sentences.\nThe reason is that text in `\"...\"` or `「...」` is ambiguous: it can be a sentence or a proper noun.\nAlso, HTML does not have suitable semantics for conversation.\n\n- [html - Most semantic way to markup a conversation (or interview)? - Stack Overflow](https://stackoverflow.com/questions/8798685/most-semantic-way-to-markup-a-conversation-or-interview)\n\nAs a result, the second line is treated as one sentence, but sentence-splitter adds `contexts` info to the Sentence node.\n\n```json5\n{\n    \"type\": \"Sentence\",\n    \"children\": [\n      {\n        \"type\": \"Str\",\n        \"value\": \"He said \\\"This is a pen. I like it\\\"\"\n      },\n      ...\n    ],\n    \"contexts\": [\n        {\n            \"type\": \"PairMark\",\n            \"pairMark\": {\n                \"key\": \"double quote\",\n                \"start\": \"\\\"\",\n                \"end\": \"\\\"\"\n            },\n            \"range\": [\n                8,\n                33\n            ],\n            ...\n        }\n    ]\n}\n```\n\n- Example: \u003chttps://sentence-splitter.netlify.app/#He%20said%20%22This%20is%20a%20pen.%20I%20like%20it%22.\u003e\n\nProbably, a textlint rule should handle the `\"...\"` and `「...」` text after sentence-splitter has parsed the sentences.\n\n- Issue: [Nesting Sentences Support · Issue #27 · textlint-rule/sentence-splitter](https://github.com/textlint-rule/sentence-splitter/issues/27)\n- Related PRs\n  - https://github.com/textlint-ja/textlint-rule-no-doubled-joshi/pull/47\n  - https://github.com/textlint-ja/textlint-rule-no-doubled-conjunctive-particle-ga/pull/27\n  - https://github.com/textlint-ja/textlint-rule-max-ten/pull/24\n\n## Installation\n\n    npm install sentence-splitter\n\n## Usage\n\n```ts\nexport interface SeparatorParserOptions {\n    /**\n     * Recognize each character as a separator\n     * Example [\".\", \"!\", \"?\"]\n     */\n    separatorCharacters?: string[]\n}\n\nexport interface 
AbbrMarkerOptions {\n    language?: Language;\n}\n\nexport interface splitOptions {\n    /**\n     * Separator \u0026 AbbrMarker options\n     */\n    SeparatorParser?: SeparatorParserOptions;\n    AbbrMarker?: AbbrMarkerOptions;\n}\n\n/**\n * Split `text` into Sentence nodes.\n * This function returns an array of Sentence nodes.\n */\nexport declare function split(text: string, options?: splitOptions): TxtParentNodeWithSentenceNode[\"children\"];\n\n/**\n * Convert a Paragraph node into a Paragraph node that includes Sentence nodes.\n * The Paragraph node is defined in textlint's TxtAST.\n * See https://github.com/textlint/textlint/blob/master/docs/txtnode.md\n */\nexport declare function splitAST(paragraphNode: TxtParentNode, options?: splitOptions): TxtParentNodeWithSentenceNode;\n```\n\nSee also [TxtAST](https://github.com/textlint/textlint/blob/master/docs/txtnode.md \"TxtAST\").\n\n### Example\n\n- Online playground: \u003chttps://sentence-splitter.netlify.app/\u003e\n\n## Node\n\nThe nodes are based on [TxtAST](https://github.com/textlint/textlint/blob/master/docs/txtnode.md \"TxtAST\").\n\n### Node's type\n\n- `Str`: Str node has `value`. 
It is the same as TxtAST's `Str` node.\n- `Sentence`: Sentence node has `Str`, `WhiteSpace`, or `Punctuation` nodes as children.\n- `WhiteSpace`: WhiteSpace node holds whitespace such as `\\n`.\n- `Punctuation`: Punctuation node holds a sentence-ending mark such as `.` or `。`.\n\nYou can get these node type constants from `SentenceSplitterSyntax`, which the module exports:\n\n```js\nimport { SentenceSplitterSyntax } from \"sentence-splitter\";\n\nconsole.log(SentenceSplitterSyntax.Sentence); // \"Sentence\"\n```\n\n### Node's interface\n\n```ts\nexport type SentencePairMarkContext = {\n  type: \"PairMark\";\n  range: readonly [startIndex: number, endIndex: number];\n  loc: {\n    start: {\n      line: number;\n      column: number;\n    };\n    end: {\n      line: number;\n      column: number;\n    };\n  };\n};\nexport type TxtSentenceNode = Omit\u003cTxtParentNode, \"type\"\u003e \u0026 {\n    readonly type: \"Sentence\";\n    readonly contexts?: TxtPairMarkNode[];\n};\nexport type TxtWhiteSpaceNode = Omit\u003cTxtTextNode, \"type\"\u003e \u0026 {\n    readonly type: \"WhiteSpace\";\n};\nexport type TxtPunctuationNode = Omit\u003cTxtTextNode, \"type\"\u003e \u0026 {\n    readonly type: \"Punctuation\";\n};\n```\n\nFor more details, please see [TxtAST](https://github.com/textlint/textlint/blob/master/docs/txtnode.md \"TxtAST\").\n\n### Node layout\n\nNode layout image.\n\n- Example: \u003chttps://sentence-splitter.netlify.app/#This%20is%201st%20sentence.%20This%20is%202nd%20sentence.\u003e\n\n\u003e This is 1st sentence. 
This is 2nd sentence.\n\n```\n\u003cSentence\u003e\n    \u003cStr /\u003e                      |This is 1st sentence|\n    \u003cPunctuation /\u003e              |.|\n\u003c/Sentence\u003e\n\u003cWhiteSpace /\u003e                   | |\n\u003cSentence\u003e\n    \u003cStr /\u003e                      |This is 2nd sentence|\n    \u003cPunctuation /\u003e              |.|\n\u003c/Sentence\u003e\n```\n\nNote: This library does not split `Str` into `Str` and `WhiteSpace` (tokenize),\nbecause tokenization requires language-specific context.\n\n### For textlint rule\n\nYou can use `splitAST` in a textlint rule.\nUnlike `split`, `splitAST` preserves the original AST's positions.\n\n```ts\nimport { splitAST, SentenceSplitterSyntax } from \"sentence-splitter\";\n\nexport default function(context, options = {}) {\n    const { Syntax, RuleError, report, getSource } = context;\n    return {\n        [Syntax.Paragraph](node) {\n            const parsedNode = splitAST(node);\n            const sentenceNodes = parsedNode.children.filter(childNode =\u003e childNode.type === SentenceSplitterSyntax.Sentence);\n            console.log(sentenceNodes); // =\u003e Sentence nodes\n        }\n    }\n}\n```\n\nExamples\n\n- [textlint-ja/textlint-rule-max-ten: textlint rule that limit maxinum ten(、) count of sentence.](https://github.com/textlint-ja/textlint-rule-max-ten)\n\n## Reference\n\nThis library uses the [\"Golden Rule\" test](test/pragmatic_segmenter/test.ts) of `pragmatic_segmenter` for testing.\n\n- [diasks2/pragmatic_segmenter: Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.](https://github.com/diasks2/pragmatic_segmenter \"diasks2/pragmatic_segmenter: Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.\")\n\n## Related Libraries\n\n- [textlint-util-to-string](https://github.com/textlint/textlint-util-to-string)\n- and 
\u003chttps://github.com/textlint/textlint/wiki/Collection-of-textlint-rule#rule-libraries\u003e\n\n## Tests\n\nRun tests:\n\n    npm test\n\nCreate `input.json` from `_input.md`\n\n    npm run createInputJson\n\nUpdate snapshots(`output.json`):\n\n    npm run updateSnapshot\n\n### Adding snapshot testcase\n\n1. Create `test/fixtures/\u003ctest-case-name\u003e/` directory\n2. Put `test/fixtures/\u003ctest-case-name\u003e/_input.md` with testing content\n3. Run `npm run updateSnapshot`\n4. Check the `test/fixtures/\u003ctest-case-name\u003e/output.json`\n5. If it is ok, commit it\n\n## Contributing\n\n1. Fork it!\n2. Create your feature branch: `git checkout -b my-new-feature`\n3. Commit your changes: `git commit -am 'Add some feature'`\n4. Push to the branch: `git push origin my-new-feature`\n5. Submit a pull request :D\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftextlint-rule%2Fsentence-splitter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftextlint-rule%2Fsentence-splitter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftextlint-rule%2Fsentence-splitter/lists"}