Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/textlint-rule/sentence-splitter
Split {Japanese, English} text into sentences.
english japanese javascript nlp segement sentence
- Host: GitHub
- URL: https://github.com/textlint-rule/sentence-splitter
- Owner: textlint-rule
- License: mit
- Created: 2015-11-13T03:08:25.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2023-11-25T05:25:21.000Z (about 1 year ago)
- Last Synced: 2024-05-19T11:13:26.785Z (9 months ago)
- Topics: english, japanese, javascript, nlp, segement, sentence
- Language: TypeScript
- Homepage: https://sentence-splitter.netlify.app/
- Size: 364 KB
- Stars: 106
- Watchers: 6
- Forks: 14
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
# sentence-splitter
Split {Japanese, English} text into sentences.
## What is a sentence?
This library splits the following text into 3 sentences.
```
We are talking about pens.
He said "This is a pen. I like it".
I could relate to that statement.
```

Result is:
![Sentence Image](./docs/img/sentence-result.png)
You can check the actual AST in the online playground.

The second sentence includes `"This is a pen. I like it"`, but this library does not split it into separate sentences.
The reason is that text inside `"..."` or `「...」` is ambiguous: it may be a sentence or a proper noun.
Also, HTML does not have suitable semantics for conversation.

- [html - Most semantic way to markup a conversation (or interview)? - Stack Overflow](https://stackoverflow.com/questions/8798685/most-semantic-way-to-markup-a-conversation-or-interview)

As a result, the second line is treated as one sentence, but sentence-splitter adds `contexts` info to the Sentence node.
```json5
{
    "type": "Sentence",
    "children": [
        {
            "type": "Str",
            "value": "He said \"This is a pen. I like it\""
        },
        ...
    ],
    "contexts": [
        {
            "type": "PairMark",
            "pairMark": {
                "key": "double quote",
                "start": "\"",
                "end": "\""
            },
            "range": [
                8,
                33
            ],
            ...
        }
    ]
}
```

- Example:
A textlint rule should probably handle `"..."` and `「...」` text itself, after sentence-splitter has parsed the sentences.
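As a rough illustration of how a rule might use `contexts`, the sketch below pulls the quoted text out of a sentence string. It does not import the library: `PairMarkContext` and `quotedTexts` are hypothetical names, and the types only mirror the JSON shown above. It also assumes that `range` holds offsets into the same text the sentence was parsed from, with `range[0]` at the opening mark and `range[1]` at the closing mark, which is what the sample values `[8, 33]` suggest.

```typescript
// Local type mirroring the "contexts" entries in the JSON above (hypothetical, not imported).
type PairMarkContext = {
    type: "PairMark";
    pairMark: { key: string; start: string; end: string };
    range: readonly [number, number];
};

/**
 * Return the text enclosed by each pair mark.
 * Assumption: range[0] is the offset of the opening mark and range[1] is the
 * offset of the closing mark, so the inner text is slice(range[0] + 1, range[1]).
 */
function quotedTexts(text: string, contexts: readonly PairMarkContext[] = []): string[] {
    return contexts.map((context) => text.slice(context.range[0] + 1, context.range[1]));
}

// Sample data copied from the example above.
const sentenceText = 'He said "This is a pen. I like it".';
const sampleContexts: PairMarkContext[] = [
    {
        type: "PairMark",
        pairMark: { key: "double quote", start: '"', end: '"' },
        range: [8, 33]
    }
];

const quoted = quotedTexts(sentenceText, sampleContexts);
console.log(quoted); // [ 'This is a pen. I like it' ]
```

A rule could then decide for itself whether the quoted text should be checked as nested sentences or skipped.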
- Issue: [Nesting Sentences Support · Issue #27 · textlint-rule/sentence-splitter](https://github.com/textlint-rule/sentence-splitter/issues/27)
- Related PRs
    - https://github.com/textlint-ja/textlint-rule-no-doubled-joshi/pull/47
    - https://github.com/textlint-ja/textlint-rule-no-doubled-conjunctive-particle-ga/pull/27
    - https://github.com/textlint-ja/textlint-rule-max-ten/pull/24

## Installation

    npm install sentence-splitter

## Usage
```ts
export interface SeparatorParserOptions {
    /**
     * Recognize each character as a separator
     * Example [".", "!", "?"]
     */
    separatorCharacters?: string[]
}

export interface AbbrMarkerOptions {
    language?: Language;
}

export interface splitOptions {
    /**
     * Separator & AbbrMarker options
     */
    SeparatorParser?: SeparatorParserOptions;
    AbbrMarker?: AbbrMarkerOptions;
}

/**
 * Split `text` into Sentence nodes.
 * This function returns an array of Sentence nodes.
 */
export declare function split(text: string, options?: splitOptions): TxtParentNodeWithSentenceNode["children"];

/**
 * Convert a Paragraph Node to a Paragraph Node that includes Sentence Nodes.
 * Paragraph Node is defined in textlint's TxtAST.
 * See https://github.com/textlint/textlint/blob/master/docs/txtnode.md
 */
export declare function splitAST(paragraphNode: TxtParentNode, options?: splitOptions): TxtParentNodeWithSentenceNode;
```

See also [TxtAST](https://github.com/textlint/textlint/blob/master/docs/txtnode.md "TxtAST").
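
The array returned by `split` can contain nodes other than sentences, so callers usually filter it first. The sketch below shows one way to do that with a type guard. It runs on hand-written sample data rather than a real `split` call, and `SplitLeaf`, `SplitSentence`, `isSentenceNode`, and `sentenceTexts` are hypothetical local names, not exports of the library. It assumes each child of a Sentence node carries a `value` string, which holds for the `Str`, `WhiteSpace`, and `Punctuation` nodes described in this README.

```typescript
// Local stand-ins for the node shapes (hypothetical; the real types come from the library and TxtAST).
type SplitLeaf = { type: "Str" | "WhiteSpace" | "Punctuation"; value: string };
type SplitSentence = { type: "Sentence"; children: SplitLeaf[] };
type SplitNode = SplitSentence | SplitLeaf;

// Type guard: narrows a node to a Sentence node by its `type` field.
function isSentenceNode(node: SplitNode): node is SplitSentence {
    return node.type === "Sentence";
}

// Keep only Sentence nodes and rebuild each sentence's text from its children.
function sentenceTexts(nodes: readonly SplitNode[]): string[] {
    return nodes
        .filter(isSentenceNode)
        .map((sentence) => sentence.children.map((child) => child.value).join(""));
}

// Hand-written sample standing in for a result of split("This is 1st sentence. This is 2nd sentence.").
const sampleNodes: SplitNode[] = [
    {
        type: "Sentence",
        children: [
            { type: "Str", value: "This is 1st sentence" },
            { type: "Punctuation", value: "." }
        ]
    },
    { type: "WhiteSpace", value: " " },
    {
        type: "Sentence",
        children: [
            { type: "Str", value: "This is 2nd sentence" },
            { type: "Punctuation", value: "." }
        ]
    }
];

const texts = sentenceTexts(sampleNodes);
console.log(texts); // [ 'This is 1st sentence.', 'This is 2nd sentence.' ]
```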
### Example
- Online playground:
## Node
This node is based on [TxtAST](https://github.com/textlint/textlint/blob/master/docs/txtnode.md "TxtAST").
### Node's type
- `Str`: A Str node has a `value`. It is the same as TxtAST's `Str` node.
- `Sentence`: A Sentence node has `Str`, `WhiteSpace`, or `Punctuation` nodes as children.
- `WhiteSpace`: A WhiteSpace node has `\n`.
- `Punctuation`: A Punctuation node has `.`, `。`

Get these `SentenceSplitterSyntax` constant values from the module:

```js
import { SentenceSplitterSyntax } from "sentence-splitter";

console.log(SentenceSplitterSyntax.Sentence); // "Sentence"
```

### Node's interface
```ts
export type SentencePairMarkContext = {
    type: "PairMark";
    range: readonly [startIndex: number, endIndex: number];
    loc: {
        start: {
            line: number;
            column: number;
        };
        end: {
            line: number;
            column: number;
        };
    };
};
export type TxtSentenceNode = Omit<TxtParentNode, "type"> & {
    readonly type: "Sentence";
    readonly contexts?: TxtPairMarkNode[];
};
export type TxtWhiteSpaceNode = Omit<TxtTextNode, "type"> & {
    readonly type: "WhiteSpace";
};
export type TxtPunctuationNode = Omit<TxtTextNode, "type"> & {
    readonly type: "Punctuation";
};
```

For more details, please see [TxtAST](https://github.com/textlint/textlint/blob/master/docs/txtnode.md "TxtAST").
### Node layout
Node layout image.
- Example:
> This is 1st sentence. This is 2nd sentence.
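```
<Sentence>
    <Str />                          |This is 1st sentence|
    <Punctuation />                  |.|
</Sentence>
<WhiteSpace />                       | |
<Sentence>
    <Str />                          |This is 2nd sentence|
    <Punctuation />                  |.|
</Sentence>
```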
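
The layout can also be read as a small tree walk. The sketch below renders hand-written nodes for the example sentence into indented lines. `LayoutLeaf`, `LayoutSentence`, and `renderLayout` are hypothetical names used only for illustration, and the sample data is written by hand on the assumption that the two sentences are separated by a single `WhiteSpace` node.

```typescript
// Local stand-ins for the node shapes (hypothetical, not imported from the library).
type LayoutLeaf = { type: "Str" | "WhiteSpace" | "Punctuation"; value: string };
type LayoutSentence = { type: "Sentence"; children: LayoutLeaf[] };
type LayoutNode = LayoutSentence | LayoutLeaf;

// Render each node on its own line; children of a Sentence are indented one level.
function renderLayout(nodes: readonly LayoutNode[], indent: string = ""): string[] {
    const lines: string[] = [];
    for (const node of nodes) {
        if (node.type === "Sentence") {
            lines.push(indent + "Sentence");
            for (const line of renderLayout(node.children, indent + "    ")) {
                lines.push(line);
            }
        } else {
            // Leaf nodes show their text between bars, as in the layout above.
            lines.push(indent + node.type + " |" + node.value + "|");
        }
    }
    return lines;
}

// Hand-written nodes for "This is 1st sentence. This is 2nd sentence."
const layoutNodes: LayoutNode[] = [
    {
        type: "Sentence",
        children: [
            { type: "Str", value: "This is 1st sentence" },
            { type: "Punctuation", value: "." }
        ]
    },
    { type: "WhiteSpace", value: " " },
    {
        type: "Sentence",
        children: [
            { type: "Str", value: "This is 2nd sentence" },
            { type: "Punctuation", value: "." }
        ]
    }
];

const layoutLines = renderLayout(layoutNodes);
console.log(layoutLines.join("\n"));
```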
Note: This library does not split `Str` into `Str` and `WhiteSpace` (tokenize), because tokenizing requires language-specific context.

### For textlint rule
You can use `splitAST` in a textlint rule.
Unlike `split`, the `splitAST` function preserves the original AST's positions.

```ts
import { splitAST, SentenceSplitterSyntax } from "sentence-splitter";

export default function (context, options = {}) {
    const { Syntax, RuleError, report, getSource } = context;
    return {
        [Syntax.Paragraph](node) {
            const parsedNode = splitAST(node);
            const sentenceNodes = parsedNode.children.filter(
                (childNode) => childNode.type === SentenceSplitterSyntax.Sentence
            );
            console.log(sentenceNodes); // => Sentence nodes
        }
    };
}
```

Examples
- [textlint-ja/textlint-rule-max-ten: textlint rule that limit maxinum ten(、) count of sentence.](https://github.com/textlint-ja/textlint-rule-max-ten)
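
Once a rule has its Sentence nodes, a per-sentence check is usually a small pure function. The sketch below counts a mark in each sentence and returns the sentences that exceed a limit. It is only an illustration and not the implementation of any rule listed here: `SentenceLike`, `MarkViolation`, and `findOverusedMark` are hypothetical names, and it assumes each Sentence node exposes its source text as `raw`, as TxtAST nodes do.

```typescript
// Minimal local shape of a Sentence node (hypothetical; assumes a TxtAST-style `raw` field).
type SentenceLike = { type: "Sentence"; raw: string };
type MarkViolation = { sentence: string; count: number };

/**
 * Return the sentences in which `mark` occurs more than `max` times.
 * `mark` must be a non-empty string.
 */
function findOverusedMark(
    sentences: readonly SentenceLike[],
    mark: string,
    max: number
): MarkViolation[] {
    const violations: MarkViolation[] = [];
    for (const sentence of sentences) {
        // Splitting on the mark yields one more piece than there are occurrences.
        const count = sentence.raw.split(mark).length - 1;
        if (count > max) {
            violations.push({ sentence: sentence.raw, count });
        }
    }
    return violations;
}

// Hand-written sample sentences.
const sampleSentences: SentenceLike[] = [
    { type: "Sentence", raw: "A, B, C, and D are here." },
    { type: "Sentence", raw: "Only one, here." }
];

const violations = findOverusedMark(sampleSentences, ",", 2);
console.log(violations); // [ { sentence: 'A, B, C, and D are here.', count: 3 } ]
```

In a real rule, each violation would be passed to `report` together with the sentence node instead of being logged.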
## Reference
This library uses the ["Golden Rule" test](test/pragmatic_segmenter/test.ts) of `pragmatic_segmenter` for testing.
- [diasks2/pragmatic_segmenter: Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.](https://github.com/diasks2/pragmatic_segmenter "diasks2/pragmatic_segmenter: Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.")
## Related Libraries
- [textlint-util-to-string](https://github.com/textlint/textlint-util-to-string)

## Tests

Run tests:

    npm test

Create `input.json` from `_input.md`:

    npm run createInputJson

Update snapshots (`output.json`):

    npm run updateSnapshot

### Adding snapshot testcase
1. Create `test/fixtures//` directory
2. Put `test/fixtures//_input.md` with testing content
3. Run `npm run updateSnapshot`
4. Check the `test/fixtures//output.json`
5. If it is ok, commit it

## Contributing
1. Fork it!
2. Create your feature branch: `git checkout -b my-new-feature`
3. Commit your changes: `git commit -am 'Add some feature'`
4. Push to the branch: `git push origin my-new-feature`
5. Submit a pull request :D

## License
MIT