Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/textlint-rule/sentence-splitter

Split {Japanese, English} text into sentences.
https://github.com/textlint-rule/sentence-splitter

english japanese javascript nlp segement sentence

Last synced: 1 day ago
JSON representation

Split {Japanese, English} text into sentences.

Awesome Lists containing this project

README

        

# sentence-splitter

Split {Japanese, English} text into sentences.

## What is sentence?

This library split next text into 3 sentences.

```
We are talking about pens.
He said "This is a pen. I like it".
I could relate to that statement.
```

Result is:

![Sentence Image](./docs/img/sentence-result.png)

You can check actual AST in online playground.

-

Second sentence includes `"This is a pen. I like it"`, but this library can not split it into new sentence.
The reason is `"..."` and `「...」` text is ambiguous as a sentence or a proper noun.
Also, HTML does not have suitable semantics for conversation.

- [html - Most semantic way to markup a conversation (or interview)? - Stack Overflow](https://stackoverflow.com/questions/8798685/most-semantic-way-to-markup-a-conversation-or-interview)

As a result, The second line will be one sentence, but sentence-splitter add a `contexts` info to the sentence node.

```json5
{
"type": "Sentence",
"children": [
{
"type": "Str",
"value": "He said \"This is a pen. I like it\""
},
...
],
"contexts": [
{
"type": "PairMark",
"pairMark": {
"key": "double quote",
"start": "\"",
"end": "\""
},
"range": [
8,
33
],
...
]
]
}
```

- Example:

Probably, textlint rule should handle the `"..."` and `「...」` text after parsing sentences by sentence-splitter.

- Issue: [Nesting Sentences Support · Issue #27 · textlint-rule/sentence-splitter](https://github.com/textlint-rule/sentence-splitter/issues/27)
- Related PR
- https://github.com/textlint-ja/textlint-rule-no-doubled-joshi/pull/47
- https://github.com/textlint-ja/textlint-rule-no-doubled-conjunctive-particle-ga/pull/27
- https://github.com/textlint-ja/textlint-rule-max-ten/pull/24

## Installation

npm install sentence-splitter

## Usage

```ts
export interface SeparatorParserOptions {
/**
* Recognize each characters as separator
* Example [".", "!", "?"]
*/
separatorCharacters?: string[]
}

export interface AbbrMarkerOptions {
language?: Language;
}

export interface splitOptions {
/**
* Separator & AbbrMarker options
*/
SeparatorParser?: SeparatorParserOptions;
AbbrMarker?: AbbrMarkerOptions;
}

/**
* split `text` into Sentence nodes.
* This function return array of Sentence nodes.
*/
export declare function split(text: string, options?: splitOptions): TxtParentNodeWithSentenceNode["children"];

/**
* Convert Paragraph Node to Paragraph Node that includes Sentence Node.
* Paragraph Node is defined in textlint's TxtAST.
* See https://github.com/textlint/textlint/blob/master/docs/txtnode.md
*/
export declare function splitAST(paragraphNode: TxtParentNode, options?: splitOptions): TxtParentNodeWithSentenceNode;
```

See also [TxtAST](https://github.com/textlint/textlint/blob/master/docs/txtnode.md "TxtAST").

### Example

- Online playground:

## Node

This node is based on [TxtAST](https://github.com/textlint/textlint/blob/master/docs/txtnode.md "TxtAST").

### Node's type

- `Str`: Str node has `value`. It is same as TxtAST's `Str` node.
- `Sentence`: Sentence Node has `Str`, `WhiteSpace`, or `Punctuation` nodes as children
- `WhiteSpace`: WhiteSpace Node has `\n`.
- `Punctuation`: Punctuation Node has `.`, `。`

Get these `SentenceSplitterSyntax` constants value from the module:

```js
import { SentenceSplitterSyntax } from "sentence-splitter";

console.log(SentenceSplitterSyntax.Sentence);// "Sentence"
```

### Node's interface

```ts
export type SentencePairMarkContext = {
type: "PairMark";
range: readonly [startIndex: number, endIndex: number];
loc: {
start: {
line: number;
column: number;
};
end: {
line: number;
column: number;
};
};
};
export type TxtSentenceNode = Omit & {
readonly type: "Sentence";
readonly contexts?: TxtPairMarkNode[];
};
export type TxtWhiteSpaceNode = Omit & {
readonly type: "WhiteSpace";
};
export type TxtPunctuationNode = Omit & {
readonly type: "Punctuation";
};
```

Fore more details, Please see [TxtAST](https://github.com/textlint/textlint/blob/master/docs/txtnode.md "TxtAST").

### Node layout

Node layout image.

- Example:

> This is 1st sentence. This is 2nd sentence.

```

|This is 1st sentence|
|.|

| |

|This is 2nd sentence|
|.|

```

Note: This library will not split `Str` into `Str` and `WhiteSpace`(tokenize)
Because, Tokenize need to implement language specific context.

### For textlint rule

You can use `splitAST` for textlint rule.
`splitAST` function can preserve original AST's position unlike `split` function.

```ts
import { splitAST, SentenceSplitterSyntax } from "sentence-splitter";

export default function(context, options = {}) {
const { Syntax, RuleError, report, getSource } = context;
return {
[Syntax.Paragraph](node) {
const parsedNode = splitAST(node);
const sentenceNodes = parsedNode.children.filter(childNode => childNode.type === SentenceSplitterSyntax.Sentence);
console.log(sentenceNodes); // => Sentence nodes
}
}
}
```

Examples

- [textlint-ja/textlint-rule-max-ten: textlint rule that limit maxinum ten(、) count of sentence.](https://github.com/textlint-ja/textlint-rule-max-ten)

## Reference

This library use ["Golden Rule" test](test/pragmatic_segmenter/test.ts) of `pragmatic_segmenter` for testing.

- [diasks2/pragmatic_segmenter: Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.](https://github.com/diasks2/pragmatic_segmenter "diasks2/pragmatic_segmenter: Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.")

## Related Libraries

- [textlint-util-to-string](https://github.com/textlint/textlint-util-to-string)
- and

## Tests

Run tests:

npm test

Create `input.json` from `_input.md`

npm run createInputJson

Update snapshots(`output.json`):

npm run updateSnapshot

### Adding snapshot testcase

1. Create `test/fixtures//` directory
2. Put `test/fixtures//_input.md` with testing content
3. Run `npm run updateSnapshot`
4. Check the `test/fixtures//output.json`
5. If it is ok, commit it

## Contributing

1. Fork it!
2. Create your feature branch: `git checkout -b my-new-feature`
3. Commit your changes: `git commit -am 'Add some feature'`
4. Push to the branch: `git push origin my-new-feature`
5. Submit a pull request :D

## License

MIT