https://github.com/textlint-rule/sentence-splitter

Host: GitHub
URL: https://github.com/textlint-rule/sentence-splitter
Owner: textlint-rule
License: mit
Created: 2015-11-13T03:08:25.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2023-11-25T05:25:21.000Z (about 1 year ago)
Last Synced: 2024-05-19T11:13:26.785Z (9 months ago)
Topics: english, japanese, javascript, nlp, segement, sentence
Language: TypeScript
Homepage: https://sentence-splitter.netlify.app/
Size: 364 KB
Stars: 106
Watchers: 6
Forks: 14
Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# sentence-splitter

Split {Japanese, English} text into sentences.

## What is a sentence?

This library splits the following text into 3 sentences.

```
We are talking about pens.
He said "This is a pen. I like it".
I could relate to that statement.
```

Result is:

![Sentence Image](./docs/img/sentence-result.png)

You can check the actual AST in the online playground.

- 

The second sentence includes `"This is a pen. I like it"`, but this library does not split it into a new sentence.

The reason is that text inside `"..."` or `「...」` is ambiguous: it can be either a sentence or a proper noun.

Also, HTML does not have suitable semantics for conversation.

- [html - Most semantic way to markup a conversation (or interview)? - Stack Overflow](https://stackoverflow.com/questions/8798685/most-semantic-way-to-markup-a-conversation-or-interview)

As a result, the second line is treated as one sentence, but sentence-splitter adds `contexts` info to the Sentence node.

```json5
{
    "type": "Sentence",
    "children": [
        {
            "type": "Str",
            "value": "He said \"This is a pen. I like it\""
        },
        ...
    ],
    "contexts": [
        {
            "type": "PairMark",
            "pairMark": {
                "key": "double quote",
                "start": "\"",
                "end": "\""
            },
            "range": [
                8,
                33
            ],
            ...
        }
    ]
}
```
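
The `contexts` array shown above can be inspected without re-parsing the text. The following is a minimal sketch, assuming a node shaped like the JSON above. The `SentenceLike` and `PairMarkContextLike` types and the `pairMarkKeys` helper are hypothetical names used for illustration and are not part of sentence-splitter.

```ts
// Hypothetical local types that mirror only the fields shown in the JSON above.
type PairMarkContextLike = {
    type: "PairMark";
    pairMark: { key: string; start: string; end: string };
    range: readonly [number, number];
};
type SentenceLike = {
    type: "Sentence";
    contexts?: PairMarkContextLike[];
};

// Return the keys of all pair marks (e.g. "double quote") recorded on a sentence.
function pairMarkKeys(sentence: SentenceLike): string[] {
    const contexts = sentence.contexts ? sentence.contexts : [];
    return contexts.map((context) => context.pairMark.key);
}

// Hand-written nodes based on the example above, not real library output.
const quoted: SentenceLike = {
    type: "Sentence",
    contexts: [
        {
            type: "PairMark",
            pairMark: { key: "double quote", start: "\"", end: "\"" },
            range: [8, 33]
        }
    ]
};
const plain: SentenceLike = { type: "Sentence" };

console.log(pairMarkKeys(quoted)); // one key: "double quote"
console.log(pairMarkKeys(plain)); // empty array
```

A rule could use a check like this to decide whether a sentence needs extra handling for quoted text.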

- Example: 

A textlint rule should probably handle the `"..."` and `「...」` text itself, after sentence-splitter has parsed the sentences.

- Issue: [Nesting Sentences Support · Issue #27 · textlint-rule/sentence-splitter](https://github.com/textlint-rule/sentence-splitter/issues/27)

- Related PRs

  - https://github.com/textlint-ja/textlint-rule-no-doubled-joshi/pull/47

  - https://github.com/textlint-ja/textlint-rule-no-doubled-conjunctive-particle-ga/pull/27

  - https://github.com/textlint-ja/textlint-rule-max-ten/pull/24

## Installation

    npm install sentence-splitter

## Usage

```ts
export interface SeparatorParserOptions {
    /**
     * Recognize each character as a separator.
     * Example: [".", "!", "?"]
     */
    separatorCharacters?: string[]
}
export interface AbbrMarkerOptions {
    language?: Language;
}
export interface splitOptions {
    /**
     * Separator & AbbrMarker options
     */
    SeparatorParser?: SeparatorParserOptions;
    AbbrMarker?: AbbrMarkerOptions;
}
/**
 * Split `text` into Sentence nodes.
 * This function returns an array of Sentence nodes.
 */
export declare function split(text: string, options?: splitOptions): TxtParentNodeWithSentenceNode["children"];
/**
 * Convert a Paragraph node to a Paragraph node that includes Sentence nodes.
 * The Paragraph node is defined in textlint's TxtAST.
 * See https://github.com/textlint/textlint/blob/master/docs/txtnode.md
 */
export declare function splitAST(paragraphNode: TxtParentNode, options?: splitOptions): TxtParentNodeWithSentenceNode;
```

See also [TxtAST](https://github.com/textlint/textlint/blob/master/docs/txtnode.md "TxtAST").
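
To illustrate what the `separatorCharacters` option means, here is a deliberately naive sketch that ends a sentence at each configured separator character. `naiveSplit` is a hypothetical helper and not the library's algorithm: the real `split` returns AST nodes and also takes abbreviations and pair marks into account, which this sketch ignores. The default list here is copied from the example in the doc comment above and is an assumption, not the library's default.

```ts
// Naive illustration only: cut the text after every separator character.
function naiveSplit(text: string, separatorCharacters: string[] = [".", "!", "?"]): string[] {
    const sentences: string[] = [];
    let current = "";
    for (const char of text) {
        current += char;
        if (separatorCharacters.indexOf(char) !== -1) {
            sentences.push(current.trim());
            current = "";
        }
    }
    // Keep trailing text that has no closing separator.
    if (current.trim().length > 0) {
        sentences.push(current.trim());
    }
    return sentences;
}

console.log(naiveSplit("Hello world. How are you?"));
// => ["Hello world.", "How are you?"]

// The Japanese full stop only acts as a separator when it is listed.
console.log(naiveSplit("これは文です。次の文です。", ["。"]));
// => ["これは文です。", "次の文です。"]
```

The sketch shows why the separator list matters: text containing only `。` stays as one piece unless `。` is among the separator characters.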

### Example

- Online playground: 

## Node

These nodes are based on [TxtAST](https://github.com/textlint/textlint/blob/master/docs/txtnode.md "TxtAST").

### Node's type

- `Str`: a Str node has a `value`. It is the same as TxtAST's `Str` node.

- `Sentence`: a Sentence node has `Str`, `WhiteSpace`, or `Punctuation` nodes as children.

- `WhiteSpace`: a WhiteSpace node holds whitespace such as `\n`.

- `Punctuation`: a Punctuation node holds a separator such as `.` or `。`.

Get these `SentenceSplitterSyntax` constant values from the module:

```js
import { SentenceSplitterSyntax } from "sentence-splitter";

console.log(SentenceSplitterSyntax.Sentence); // "Sentence"
```

### Node's interface

```ts
export type SentencePairMarkContext = {
    type: "PairMark";
    range: readonly [startIndex: number, endIndex: number];
    loc: {
        start: {
            line: number;
            column: number;
        };
        end: {
            line: number;
            column: number;
        };
    };
};
export type TxtSentenceNode = Omit<TxtParentNode, "type"> & {
    readonly type: "Sentence";
    readonly contexts?: TxtPairMarkNode[];
};
export type TxtWhiteSpaceNode = Omit<TxtTextNode, "type"> & {
    readonly type: "WhiteSpace";
};
export type TxtPunctuationNode = Omit<TxtTextNode, "type"> & {
    readonly type: "Punctuation";
};
```

For more details, please see [TxtAST](https://github.com/textlint/textlint/blob/master/docs/txtnode.md "TxtAST").

### Node layout

Node layout image.

- Example: 

> This is 1st sentence. This is 2nd sentence.

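```
<Sentence>
    <Str />                      |This is 1st sentence|
    <Punctuation />              |.|
</Sentence>
<WhiteSpace />                   | |
<Sentence>
    <Str />                      |This is 2nd sentence|
    <Punctuation />              |.|
</Sentence>
```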

Note: This library does not split a `Str` node into `Str` and `WhiteSpace` nodes (tokenization), because tokenization requires language-specific context.
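
The layout above can also be read as data. Below is a minimal sketch that builds the same tree by hand and joins each sentence's children back into text. The `LeafNode`, `SentenceNode`, and `sentenceText` names are hypothetical and simplified: real nodes also carry position data such as `range` and `loc`.

```ts
// Hypothetical minimal node shapes for illustration only.
type LeafNode = { type: "Str" | "WhiteSpace" | "Punctuation"; value: string };
type SentenceNode = { type: "Sentence"; children: LeafNode[] };
type TopLevelNode = SentenceNode | LeafNode;

// Hand-written tree for "This is 1st sentence. This is 2nd sentence."
const nodes: TopLevelNode[] = [
    {
        type: "Sentence",
        children: [
            { type: "Str", value: "This is 1st sentence" },
            { type: "Punctuation", value: "." }
        ]
    },
    { type: "WhiteSpace", value: " " },
    {
        type: "Sentence",
        children: [
            { type: "Str", value: "This is 2nd sentence" },
            { type: "Punctuation", value: "." }
        ]
    }
];

// Join the values of a sentence's children back into its text.
function sentenceText(sentence: SentenceNode): string {
    return sentence.children.map((child) => child.value).join("");
}

// Keep only Sentence nodes; the WhiteSpace node between them is skipped.
const texts = nodes
    .filter((node): node is SentenceNode => node.type === "Sentence")
    .map(sentenceText);

console.log(texts); // => ["This is 1st sentence.", "This is 2nd sentence."]
```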

### For textlint rule

You can use `splitAST` in a textlint rule.

Unlike `split`, the `splitAST` function preserves the positions of the original AST.

```ts
import { splitAST, SentenceSplitterSyntax } from "sentence-splitter";

export default function(context, options = {}) {
    const { Syntax, RuleError, report, getSource } = context;
    return {
        [Syntax.Paragraph](node) {
            const parsedNode = splitAST(node);
            const sentenceNodes = parsedNode.children.filter(childNode => childNode.type === SentenceSplitterSyntax.Sentence);
            console.log(sentenceNodes); // => Sentence nodes
        }
    }
}
```

Examples

- [textlint-ja/textlint-rule-max-ten: textlint rule that limit maxinum ten(、) count of sentence.](https://github.com/textlint-ja/textlint-rule-max-ten)
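
As a rough idea of what a rule can do with Sentence nodes, the sketch below counts `、` per sentence and flags sentences over a limit. It is a simplified illustration of that kind of check, not the implementation of textlint-rule-max-ten. The node types, `countTen`, and `sentencesOverLimit` are hypothetical names, and the nodes are written by hand rather than produced by `splitAST`.

```ts
// Hypothetical minimal node shapes; real nodes also carry position data.
type LeafNode = { type: string; value: string };
type SentenceNode = { type: "Sentence"; children: LeafNode[] };

// Count occurrences of the Japanese comma "、" in the Str children of a sentence.
function countTen(sentence: SentenceNode): number {
    return sentence.children
        .filter((child) => child.type === "Str")
        .reduce((total, child) => total + (child.value.split("、").length - 1), 0);
}

// Return the sentences that contain more than `max` commas.
function sentencesOverLimit(sentences: SentenceNode[], max: number): SentenceNode[] {
    return sentences.filter((sentence) => countTen(sentence) > max);
}

// Hand-written sample sentences.
const longSentence: SentenceNode = {
    type: "Sentence",
    children: [
        { type: "Str", value: "これは、長い、文、です" },
        { type: "Punctuation", value: "。" }
    ]
};
const shortSentence: SentenceNode = {
    type: "Sentence",
    children: [
        { type: "Str", value: "短い文です" },
        { type: "Punctuation", value: "。" }
    ]
};

console.log(countTen(longSentence)); // => 3
console.log(sentencesOverLimit([longSentence, shortSentence], 2).length); // => 1
```

In a real rule, the flagged sentences would be passed to `report` together with a `RuleError` instead of being returned.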

## Reference

This library uses the ["Golden Rule" test](test/pragmatic_segmenter/test.ts) of `pragmatic_segmenter` for testing.

- [diasks2/pragmatic_segmenter: Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.](https://github.com/diasks2/pragmatic_segmenter "diasks2/pragmatic_segmenter: Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.")

## Related Libraries

- [textlint-util-to-string](https://github.com/textlint/textlint-util-to-string)

- and 

## Tests

Run tests:

    npm test

Create `input.json` from `_input.md`:

    npm run createInputJson

Update snapshots (`output.json`):

    npm run updateSnapshot

### Adding a snapshot test case

1. Create `test/fixtures//` directory

2. Put `test/fixtures//_input.md` with testing content

3. Run `npm run updateSnapshot`

4. Check the `test/fixtures//output.json`

5. If it is ok, commit it

## Contributing

1. Fork it!

2. Create your feature branch: `git checkout -b my-new-feature`

3. Commit your changes: `git commit -am 'Add some feature'`

4. Push to the branch: `git push origin my-new-feature`

5. Submit a pull request :D

## License

MIT