https://github.com/jdkato/prose

:book: A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction.

natural-language-processing nlp prose

Host: GitHub
URL: https://github.com/jdkato/prose
Owner: jdkato
License: mit
Archived: true
Created: 2017-02-17T17:08:22.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2023-05-02T05:39:17.000Z (about 2 years ago)
Last Synced: 2024-10-23T17:17:02.903Z (7 months ago)
Topics: natural-language-processing, nlp, prose
Language: Go
Homepage:
Size: 26.8 MB
Stars: 3,056
Watchers: 56
Forks: 164
Open Issues: 21
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-go - prose - Library for text processing that supports tokenization, part-of-speech tagging, named-entity extraction, and more. English only. (Natural Language Processing / Tokenizers)
zero-alloc-awesome-go - prose - Library for text processing that supports tokenization, part-of-speech tagging, named-entity extraction, and more. English only. (Natural Language Processing / Tokenizers)
my-awesome - jdkato/prose - language-processing,nlp,prose pushed_at:2023-05 star:3.1k fork:0.2k :book: A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction. (Go)
go-awesome - prose - Natural language processing library (Open source library / Word Processing)
awesome-go - prose - Library for text processing that supports tokenization, part-of-speech tagging, named-entity extraction, and more. English only. Stars:`3.1K`. (Natural Language Processing / Tokenizers)
awesome-go - prose - A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction. - ★ 1811 (Natural Language Processing)
awesome-go-extra - ARCHIVED - of-speech tagging, and named-entity extraction.|2943|147|20|2017-02-17T17:08:22Z|2022-05-17T11:03:05Z| (Bot Building / Tokenizers)
awesome-go-zh - prose

README

# prose [![Build Status](https://travis-ci.org/jdkato/prose.svg?branch=master)](https://travis-ci.org/jdkato/prose) [![GoDoc](https://godoc.org/github.com/golang/gddo?status.svg)](https://pkg.go.dev/github.com/jdkato/prose/[email protected]?tab=doc) [![Coverage Status](https://coveralls.io/repos/github/jdkato/prose/badge.svg?branch=master)](https://coveralls.io/github/jdkato/prose?branch=master) [![Go Report Card](https://goreportcard.com/badge/github.com/jdkato/prose)](https://goreportcard.com/report/github.com/jdkato/prose) [![codebeat badge](https://codebeat.co/badges/a867ec38-c025-4f65-85f9-89a9188cc458)](https://codebeat.co/projects/github-com-jdkato-prose-master) [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/avelino/awesome-go#natural-language-processing)

`prose` is a natural language processing library (English only, at the moment) in *pure Go*. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.

For a more detailed summary of the library's performance, see [Introducing `prose` v2.0.0: Bringing NLP *to Go*](https://medium.com/@errata.ai/introducing-prose-v2-0-0-bringing-nlp-to-go-a1f0c121e4a5).

## Installation

```console
$ go get github.com/jdkato/prose/v2
```

## Usage

### Contents

* [Overview](#overview)
* [Tokenizing](#tokenizing)
* [Segmenting](#segmenting)
* [Tagging](#tagging)
* [NER](#ner)

### Overview

```go
package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag, tok.Label)
        // Go NNP B-GPE
        // is VBZ O
        // an DT O
        // ...
    }

    // Iterate over the doc's named-entities:
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Go GPE
        // Google GPE
    }

    // Iterate over the doc's sentences:
    for _, sent := range doc.Sentences() {
        fmt.Println(sent.Text)
        // Go is an open-source programming language created at Google.
    }
}
```

The document-creation process adheres to the following sequence of steps:

```text
tokenization -> POS tagging -> NE extraction
            \
             segmentation
```

Each step may be disabled (assuming later steps aren't required) by passing the appropriate [*functional option*](https://dave.cheney.net/2014/10/17/functional-options-for-friendly-apis). To disable named-entity extraction, for example, you'd do the following:

```go
doc, err := prose.NewDocument(
    "Go is an open-source programming language created at Google.",
    prose.WithExtraction(false))
```
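The functional-option pattern itself can be sketched without the library. The example below is a minimal, standard-library-only illustration: the `pipeline` type, the `withTagging` and `withExtraction` constructors, and the error returned for an inconsistent configuration are all hypothetical names and choices made for this sketch, not `prose`'s internals or documented behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// pipeline is a hypothetical stand-in for a document's configuration.
// It is not a prose type.
type pipeline struct {
	segment bool
	tag     bool
	extract bool
}

// option mutates a pipeline before it runs.
type option func(*pipeline)

// withTagging and withExtraction are illustrative option constructors.
func withTagging(on bool) option    { return func(p *pipeline) { p.tag = on } }
func withExtraction(on bool) option { return func(p *pipeline) { p.extract = on } }

// newPipeline starts with every step enabled and applies opts in order.
func newPipeline(opts ...option) (pipeline, error) {
	p := pipeline{segment: true, tag: true, extract: true}
	for _, opt := range opts {
		opt(&p)
	}
	// Assumption of this sketch: extraction consumes POS tags, so a
	// configuration that extracts without tagging is rejected.
	if p.extract && !p.tag {
		return p, errors.New("extraction requires tagging")
	}
	return p, nil
}

func main() {
	p, err := newPipeline(withExtraction(false))
	fmt.Println(p.tag, p.extract, err) // true false <nil>

	_, err = newPipeline(withTagging(false))
	fmt.Println(err) // extraction requires tagging
}
```

Because each option is just a function, callers only name the steps they want to change, and new options can be added later without altering the constructor's signature.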

### Tokenizing

`prose` includes a tokenizer capable of processing modern text, including the non-word character spans shown below.

| Type            | Example                           |
|-----------------|-----------------------------------|
| Email addresses | `Jane.Doe@example.com`            |
| Hashtags        | `#trending`                       |
| Mentions        | `@jdkato`                         |
| URLs            | `https://github.com/jdkato/prose` |
| Emoticons       | `:-)`, `>:(`, `o_0`, etc.         |

```go
package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag)
        // @jdkato NN
        // , ,
        // go VB
        // to TO
        // http://example.com NN
        // thanks NNS
        // :) SYM
        // . .
    }
}
```

### Segmenting

`prose` includes a sentence segmenter that ranks second among the tools compared below, passing 39 of the 52 English [Golden Rules](https://github.com/diasks2/pragmatic_segmenter#comparison-of-segmentation-tools-libraries-and-algorithms) created by the developers of `pragmatic_segmenter`.

| Name                | Language | License   | GRS (English)  | GRS (Other) | Speed†   |
|---------------------|----------|-----------|----------------|-------------|----------|
| Pragmatic Segmenter | Ruby     | MIT       | 98.08% (51/52) | 100.00%     | 3.84 s   |
| prose               | Go       | MIT       | 75.00% (39/52) | N/A         | 0.96 s   |
| TactfulTokenizer    | Ruby     | GNU GPLv3 | 65.38% (34/52) | 48.57%      | 46.32 s  |
| OpenNLP             | Java     | APLv2     | 59.62% (31/52) | 45.71%      | 1.27 s   |
| Stanford CoreNLP    | Java     | GNU GPLv3 | 59.62% (31/52) | 31.43%      | 0.92 s   |
| Splitta             | Python   | APLv2     | 55.77% (29/52) | 37.14%      | N/A      |
| Punkt               | Python   | APLv2     | 46.15% (24/52) | 48.57%      | 1.79 s   |
| SRX English         | Ruby     | GNU GPLv3 | 30.77% (16/52) | 28.57%      | 6.19 s   |
| Scapel              | Ruby     | GNU GPLv3 | 28.85% (15/52) | 20.00%      | 0.13 s   |

> † The original tests were performed using a *Mac Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5*, while `prose` was timed using a *MacBook Pro 2.9 GHz Intel Core i7 running 10.13.3*. Because the hardware differs, the speed column is only a rough comparison.
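The GRS columns are simply the share of Golden Rules a tool passes. As a small worked example, the sketch below recomputes two of the English scores from the pass counts in the table. The helper name `grsPercent` is made up for illustration.

```go
package main

import "fmt"

// grsPercent formats a Golden Rules score as the percentage of rules passed.
func grsPercent(passed, total int) string {
	return fmt.Sprintf("%.2f%%", 100*float64(passed)/float64(total))
}

func main() {
	const englishRules = 52 // rule count taken from the table's denominators

	fmt.Println(grsPercent(51, englishRules)) // 98.08%
	fmt.Println(grsPercent(39, englishRules)) // 75.00%
}
```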

```go
package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument(strings.Join([]string{
        "I can see Mt. Fuji from here.",
        "St. Michael's Church is on 5th st. near the light."}, " "))
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's sentences:
    sents := doc.Sentences()
    fmt.Println(len(sents)) // 2
    for _, sent := range sents {
        fmt.Println(sent.Text)
        // I can see Mt. Fuji from here.
        // St. Michael's Church is on 5th st. near the light.
    }
}
```

### Tagging

`prose` includes a tagger based on TextBlob's ["fast and accurate" POS tagger](https://github.com/sloria/textblob-aptagger). Below is a comparison of its performance against [NLTK](http://www.nltk.org/)'s implementation of the same tagger on the Treebank corpus:

| Library | Accuracy | 5-Run Average (sec) |
|:--------|---------:|--------------------:|
| NLTK    |    0.893 |               7.224 |
| `prose` |    0.961 |               2.538 |

(See [`scripts/test_model.py`](https://github.com/jdkato/aptag/blob/master/scripts/test_model.py) for more information.)

The full list of supported POS tags is given below.

| TAG        | DESCRIPTION                               |
|------------|-------------------------------------------|
| `(`        | left round bracket                        |
| `)`        | right round bracket                       |
| `,`        | comma                                     |
| `:`        | colon                                     |
| `.`        | period                                    |
| `''`       | closing quotation mark                    |
| ``` `` ``` | opening quotation mark                    |
| `#`        | number sign                               |
| `$`        | currency                                  |
| `CC`       | conjunction, coordinating                 |
| `CD`       | cardinal number                           |
| `DT`       | determiner                                |
| `EX`       | existential there                         |
| `FW`       | foreign word                              |
| `IN`       | conjunction, subordinating or preposition |
| `JJ`       | adjective                                 |
| `JJR`      | adjective, comparative                    |
| `JJS`      | adjective, superlative                    |
| `LS`       | list item marker                          |
| `MD`       | verb, modal auxiliary                     |
| `NN`       | noun, singular or mass                    |
| `NNP`      | noun, proper singular                     |
| `NNPS`     | noun, proper plural                       |
| `NNS`      | noun, plural                              |
| `PDT`      | predeterminer                             |
| `POS`      | possessive ending                         |
| `PRP`      | pronoun, personal                         |
| `PRP$`     | pronoun, possessive                       |
| `RB`       | adverb                                    |
| `RBR`      | adverb, comparative                       |
| `RBS`      | adverb, superlative                       |
| `RP`       | adverb, particle                          |
| `SYM`      | symbol                                    |
| `TO`       | infinitival to                            |
| `UH`       | interjection                              |
| `VB`       | verb, base form                           |
| `VBD`      | verb, past tense                          |
| `VBG`      | verb, gerund or present participle        |
| `VBN`      | verb, past participle                     |
| `VBP`      | verb, non-3rd person singular present     |
| `VBZ`      | verb, 3rd person singular present         |
| `WDT`      | wh-determiner                             |
| `WP`       | wh-pronoun, personal                      |
| `WP$`      | wh-pronoun, possessive                    |
| `WRB`      | wh-adverb                                 |
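Many applications only need a coarse word class rather than the full tag. The sketch below groups the tags from the table by their descriptions, using only the standard library. The `coarseClass` helper and its class names are hypothetical and not part of `prose`. The sample tags are copied from the tokenizing example above.

```go
package main

import (
	"fmt"
	"strings"
)

// coarseClass maps a tag from the table to a broad word class,
// following the table's descriptions (for example, MD is listed as a
// verb and RP as an adverb).
func coarseClass(tag string) string {
	switch {
	case strings.HasPrefix(tag, "NN"):
		return "noun"
	case strings.HasPrefix(tag, "VB"), tag == "MD":
		return "verb"
	case strings.HasPrefix(tag, "JJ"):
		return "adjective"
	case strings.HasPrefix(tag, "RB"), tag == "RP", tag == "WRB":
		return "adverb"
	default:
		return "other"
	}
}

func main() {
	// Text/tag pairs taken from the tokenizing example's output.
	tagged := []struct{ Text, Tag string }{
		{"@jdkato", "NN"},
		{"go", "VB"},
		{"to", "TO"},
		{"thanks", "NNS"},
	}
	for _, t := range tagged {
		fmt.Println(t.Text, coarseClass(t.Tag))
	}
	// @jdkato noun
	// go verb
	// to other
	// thanks noun
}
```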

### NER

`prose` v2.0.0 includes a much-improved version of v1.0.0's chunk package, which can identify people (`PERSON`) and geopolitical entities (`GPE`) by default.

```go
package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    doc, err := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
    if err != nil {
        log.Fatal(err)
    }

    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Lebron James PERSON
        // Los Angeles GPE
    }
}
```

However, in an attempt to make this feature more useful, we've made it straightforward to train your own models for specific use cases. See [Prodigy + `prose`: Radically efficient machine teaching *in Go*](https://medium.com/@errata.ai/prodigy-prose-radically-efficient-machine-teaching-in-go-93389bf2d772) for a tutorial.
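The overview prints per-token labels such as `B-GPE` and `O`, while entities carry a bare label such as `GPE`. One way to picture the relationship is the conventional BIO scheme, in which `B-X` begins an entity, `I-X` continues it, and `O` marks everything else. The sketch below assumes that scheme, which this README does not confirm beyond showing `B-GPE` and `O`. The `labeledToken`, `span`, and `mergeBIO` names and the hand-written labels are hypothetical. It is not the library's extraction code.

```go
package main

import (
	"fmt"
	"strings"
)

// labeledToken is a hypothetical stand-in carrying the Text and Label
// fields printed in the overview. It is not prose's Token type.
type labeledToken struct {
	Text, Label string
}

// span is a hypothetical stand-in for an extracted entity.
type span struct {
	Text, Label string
}

// mergeBIO joins a "B-X" token and any "I-X" tokens that directly follow
// it into one span labeled X. Any other label, including a stray "I-X"
// with no open span of the same kind, ends the current span.
func mergeBIO(toks []labeledToken) []span {
	var spans []span
	var words []string
	kind := ""

	// flush stores the span collected so far, if any, and resets state.
	flush := func() {
		if len(words) > 0 {
			spans = append(spans, span{strings.Join(words, " "), kind})
		}
		words, kind = nil, ""
	}

	for _, t := range toks {
		switch {
		case strings.HasPrefix(t.Label, "B-"):
			flush()
			kind = strings.TrimPrefix(t.Label, "B-")
			words = []string{t.Text}
		case len(words) > 0 && t.Label == "I-"+kind:
			words = append(words, t.Text)
		default:
			flush()
		}
	}
	flush()
	return spans
}

func main() {
	// Labels written by hand for illustration, not library output.
	toks := []labeledToken{
		{"Lebron", "B-PERSON"}, {"James", "I-PERSON"},
		{"plays", "O"}, {"basketball", "O"}, {"in", "O"},
		{"Los", "B-GPE"}, {"Angeles", "I-GPE"}, {".", "O"},
	}
	for _, s := range mergeBIO(toks) {
		fmt.Println(s.Text, s.Label)
	}
	// Lebron James PERSON
	// Los Angeles GPE
}
```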
