https://github.com/jdkato/prose

:book: A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction.

# prose [![Build Status](https://travis-ci.org/jdkato/prose.svg?branch=master)](https://travis-ci.org/jdkato/prose) [![GoDoc](https://godoc.org/github.com/golang/gddo?status.svg)](https://pkg.go.dev/github.com/jdkato/prose/v2?tab=doc) [![Coverage Status](https://coveralls.io/repos/github/jdkato/prose/badge.svg?branch=master)](https://coveralls.io/github/jdkato/prose?branch=master) [![Go Report Card](https://goreportcard.com/badge/github.com/jdkato/prose)](https://goreportcard.com/report/github.com/jdkato/prose) [![codebeat badge](https://codebeat.co/badges/a867ec38-c025-4f65-85f9-89a9188cc458)](https://codebeat.co/projects/github-com-jdkato-prose-master) [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/avelino/awesome-go#natural-language-processing)

`prose` is a natural language processing library (English only, at the moment) in *pure Go*. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.

A more detailed summary of the library's performance is available in [Introducing `prose` v2.0.0: Bringing NLP *to Go*](https://medium.com/@errata.ai/introducing-prose-v2-0-0-bringing-nlp-to-go-a1f0c121e4a5).

## Installation

```console
$ go get github.com/jdkato/prose/v2
```

## Usage

### Contents

* [Overview](#overview)
* [Tokenizing](#tokenizing)
* [Segmenting](#segmenting)
* [Tagging](#tagging)
* [NER](#ner)

### Overview

```go
package main

import (
	"fmt"
	"log"

	"github.com/jdkato/prose/v2"
)

func main() {
	// Create a new document with the default configuration:
	doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
	if err != nil {
		log.Fatal(err)
	}

	// Iterate over the doc's tokens:
	for _, tok := range doc.Tokens() {
		fmt.Println(tok.Text, tok.Tag, tok.Label)
		// Go NNP B-GPE
		// is VBZ O
		// an DT O
		// ...
	}

	// Iterate over the doc's named-entities:
	for _, ent := range doc.Entities() {
		fmt.Println(ent.Text, ent.Label)
		// Go GPE
		// Google GPE
	}

	// Iterate over the doc's sentences:
	for _, sent := range doc.Sentences() {
		fmt.Println(sent.Text)
		// Go is an open-source programming language created at Google.
	}
}
```

The document-creation process adheres to the following sequence of steps:

```text
tokenization -> POS tagging -> NE extraction
            \
             segmentation
```

Each step may be disabled (assuming later steps aren't required) by passing the appropriate [*functional option*](https://dave.cheney.net/2014/10/17/functional-options-for-friendly-apis). To disable named-entity extraction, for example, you'd do the following:

```go
doc, err := prose.NewDocument(
	"Go is an open-source programming language created at Google.",
	prose.WithExtraction(false))
```
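
For readers new to the pattern, here is a minimal, standard-library-only sketch of how functional options can switch pipeline steps on and off. Every name in it (`pipeline`, `option`, `withExtraction`, and so on) is hypothetical and chosen for illustration; it is not `prose`'s implementation. It also assumes that a later step forces the earlier steps it depends on, which is one reading of "assuming later steps aren't required".

```go
package main

import "fmt"

// pipeline records which steps are switched on. All names in this sketch are
// hypothetical; they mirror the steps described above, not prose's internals.
type pipeline struct {
	tokenize, segment, tag, extract bool
}

// option is a functional option: a function that adjusts the configuration.
type option func(*pipeline)

func withSegmentation(on bool) option { return func(p *pipeline) { p.segment = on } }
func withTagging(on bool) option { return func(p *pipeline) { p.tag = on } }
func withExtraction(on bool) option { return func(p *pipeline) { p.extract = on } }

// newPipeline starts with every step enabled and applies the options in order.
func newPipeline(opts ...option) pipeline {
	p := pipeline{tokenize: true, segment: true, tag: true, extract: true}
	for _, apply := range opts {
		apply(&p)
	}
	return p
}

// steps lists the steps that would run. Assumption: a later step forces the
// earlier steps it depends on, so disabling tagging alone has no effect while
// extraction is still enabled.
func (p pipeline) steps() []string {
	var out []string
	if p.tokenize || p.tag || p.extract {
		out = append(out, "tokenization")
	}
	if p.tag || p.extract {
		out = append(out, "POS tagging")
	}
	if p.extract {
		out = append(out, "NE extraction")
	}
	if p.segment {
		out = append(out, "segmentation")
	}
	return out
}

func main() {
	fmt.Println(newPipeline().steps())
	fmt.Println(newPipeline(withExtraction(false)).steps())
	fmt.Println(newPipeline(withTagging(false)).steps())
}
```

The appeal of this design is that the constructor keeps a single signature: callers who want the defaults pass nothing, and new switches can be added later without breaking existing call sites.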

### Tokenizing

`prose` includes a tokenizer capable of processing modern text, including the non-word character spans shown below.

| Type | Example |
|-----------------|-----------------------------------|
| Email addresses | `user@example.com` |
| Hashtags | `#trending` |
| Mentions | `@jdkato` |
| URLs | `https://github.com/jdkato/prose` |
| Emoticons | `:-)`, `>:(`, `o_0`, etc. |

```go
package main

import (
	"fmt"
	"log"

	"github.com/jdkato/prose/v2"
)

func main() {
	// Create a new document with the default configuration:
	doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
	if err != nil {
		log.Fatal(err)
	}

	// Iterate over the doc's tokens:
	for _, tok := range doc.Tokens() {
		fmt.Println(tok.Text, tok.Tag)
		// @jdkato NN
		// , ,
		// go VB
		// to TO
		// http://example.com NN
		// thanks NNS
		// :) SYM
		// . .
	}
}
```
```
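
To give a feel for what recognizing such spans involves, the sketch below classifies an already-isolated string with a few deliberately simple regular expressions. This is not how `prose` tokenizes text: the patterns, the `classifySpan` helper, and the category names are assumptions made for illustration, and they ignore emoticons and many edge cases that a real tokenizer has to handle.

```go
package main

import (
	"fmt"
	"regexp"
)

// spanPatterns pairs a span type with a deliberately simple pattern.
// These patterns are illustrative assumptions, not the library's rules.
var spanPatterns = []struct {
	kind string
	re   *regexp.Regexp
}{
	{"URL", regexp.MustCompile(`^https?://\S+$`)},
	{"email", regexp.MustCompile(`^[^@\s]+@[^@\s]+\.[^@\s]+$`)},
	{"hashtag", regexp.MustCompile(`^#\w+$`)},
	{"mention", regexp.MustCompile(`^@\w+$`)},
}

// classifySpan returns the first matching span type, or "word" if none match.
func classifySpan(s string) string {
	for _, p := range spanPatterns {
		if p.re.MatchString(s) {
			return p.kind
		}
	}
	return "word"
}

func main() {
	for _, s := range []string{"#trending", "@jdkato", "https://github.com/jdkato/prose", "thanks"} {
		fmt.Printf("%s => %s\n", s, classifySpan(s))
	}
}
```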

### Segmenting

`prose` includes one of the most accurate sentence segmenters available, according to the [Golden Rules](https://github.com/diasks2/pragmatic_segmenter#comparison-of-segmentation-tools-libraries-and-algorithms) created by the developers of the `pragmatic_segmenter`.

| Name | Language | License | GRS (English) | GRS (Other) | Speed† |
|---------------------|----------|-----------|----------------|-------------|----------|
| Pragmatic Segmenter | Ruby | MIT | 98.08% (51/52) | 100.00% | 3.84 s |
| prose | Go | MIT | 75.00% (39/52) | N/A | 0.96 s |
| TactfulTokenizer | Ruby | GNU GPLv3 | 65.38% (34/52) | 48.57% | 46.32 s |
| OpenNLP | Java | APLv2 | 59.62% (31/52) | 45.71% | 1.27 s |
| Stanford CoreNLP | Java | GNU GPLv3 | 59.62% (31/52) | 31.43% | 0.92 s |
| Splitta | Python | APLv2 | 55.77% (29/52) | 37.14% | N/A |
| Punkt | Python | APLv2 | 46.15% (24/52) | 48.57% | 1.79 s |
| SRX English | Ruby | GNU GPLv3 | 30.77% (16/52) | 28.57% | 6.19 s |
| Scalpel | Ruby | GNU GPLv3 | 28.85% (15/52) | 20.00% | 0.13 s |

> † The original tests were performed using a *MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5*, while `prose` was timed using a *MacBook Pro 2.9 GHz Intel Core i7 running 10.13.3*.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/jdkato/prose/v2"
)

func main() {
	// Create a new document with the default configuration:
	doc, _ := prose.NewDocument(strings.Join([]string{
		"I can see Mt. Fuji from here.",
		"St. Michael's Church is on 5th st. near the light."}, " "))

	// Iterate over the doc's sentences:
	sents := doc.Sentences()
	fmt.Println(len(sents)) // 2
	for _, sent := range sents {
		fmt.Println(sent.Text)
		// I can see Mt. Fuji from here.
		// St. Michael's Church is on 5th st. near the light.
	}
}
```
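
For contrast, the standard-library sketch below shows what happens when the same text is split naively after every period followed by a space. The `naiveSplit` helper is hypothetical and exists only to show why abbreviations such as "Mt." and "St." call for a dedicated segmenter.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveSplit breaks text after every period that is followed by a space.
// It has no notion of abbreviations, so "Mt." and "St." end a "sentence".
func naiveSplit(text string) []string {
	var out []string
	for _, part := range strings.SplitAfter(text, ". ") {
		if s := strings.TrimSpace(part); s != "" {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	text := "I can see Mt. Fuji from here. St. Michael's Church is on 5th st. near the light."
	parts := naiveSplit(text)
	fmt.Println(len(parts)) // 5
	for _, p := range parts {
		fmt.Println(p)
	}
}
```

The naive approach yields five fragments where the segmenter above reports two sentences.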

### Tagging

`prose` includes a tagger based on TextBlob's ["fast and accurate" POS tagger](https://github.com/sloria/textblob-aptagger). Below is a comparison of its performance against [NLTK](http://www.nltk.org/)'s implementation of the same tagger on the Treebank corpus:

| Library | Accuracy | 5-Run Average (sec) |
|:--------|---------:|--------------------:|
| NLTK | 0.893 | 7.224 |
| `prose` | 0.961 | 2.538 |

(See [`scripts/test_model.py`](https://github.com/jdkato/aptag/blob/master/scripts/test_model.py) for more information.)

The full list of supported POS tags is given below.

| TAG | DESCRIPTION |
|------------|-------------------------------------------|
| `(` | left round bracket |
| `)` | right round bracket |
| `,` | comma |
| `:` | colon |
| `.` | period |
| `''` | closing quotation mark |
| ``` `` ``` | opening quotation mark |
| `#` | number sign |
| `$` | currency |
| `CC` | conjunction, coordinating |
| `CD` | cardinal number |
| `DT` | determiner |
| `EX` | existential there |
| `FW` | foreign word |
| `IN` | conjunction, subordinating or preposition |
| `JJ` | adjective |
| `JJR` | adjective, comparative |
| `JJS` | adjective, superlative |
| `LS` | list item marker |
| `MD` | verb, modal auxiliary |
| `NN` | noun, singular or mass |
| `NNP` | noun, proper singular |
| `NNPS` | noun, proper plural |
| `NNS` | noun, plural |
| `PDT` | predeterminer |
| `POS` | possessive ending |
| `PRP` | pronoun, personal |
| `PRP$` | pronoun, possessive |
| `RB` | adverb |
| `RBR` | adverb, comparative |
| `RBS` | adverb, superlative |
| `RP` | adverb, particle |
| `SYM` | symbol |
| `TO` | infinitival to |
| `UH` | interjection |
| `VB` | verb, base form |
| `VBD` | verb, past tense |
| `VBG` | verb, gerund or present participle |
| `VBN` | verb, past participle |
| `VBP` | verb, non-3rd person singular present |
| `VBZ` | verb, 3rd person singular present |
| `WDT` | wh-determiner |
| `WP` | wh-pronoun, personal |
| `WP$` | wh-pronoun, possessive |
| `WRB` | wh-adverb |
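
When the fine-grained tags are more detail than an application needs, they can be folded into broader groups. The sketch below is one possible grouping based on the descriptions in the table; the `coarseCategory` helper and its category names are assumptions for illustration and are not part of `prose`.

```go
package main

import (
	"fmt"
	"strings"
)

// coarseCategory folds a tag from the table above into a broad word class.
// The grouping is an illustrative assumption, not something prose provides.
func coarseCategory(tag string) string {
	switch {
	case strings.HasPrefix(tag, "NN"):
		return "noun" // NN, NNP, NNPS, NNS
	case strings.HasPrefix(tag, "VB"), tag == "MD":
		return "verb" // VB, VBD, VBG, VBN, VBP, VBZ, and modal auxiliaries
	case strings.HasPrefix(tag, "JJ"):
		return "adjective" // JJ, JJR, JJS
	case strings.HasPrefix(tag, "RB"), tag == "WRB":
		return "adverb" // RB, RBR, RBS, WRB
	case strings.HasPrefix(tag, "PRP"), strings.HasPrefix(tag, "WP"):
		return "pronoun" // PRP, PRP$, WP, WP$
	default:
		return "other"
	}
}

func main() {
	for _, tag := range []string{"NNP", "VBZ", "DT", "JJR", "WRB", "PRP$"} {
		fmt.Printf("%s => %s\n", tag, coarseCategory(tag))
	}
}
```

In a program that uses `prose`, the argument would typically be the `tok.Tag` value printed in the examples above.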

### NER

`prose` v2.0.0 includes a much improved version of v1.0.0's chunk package, which can identify people (`PERSON`) and geographical/political entities (`GPE`) by default.

```go
package main

import (
	"fmt"

	"github.com/jdkato/prose/v2"
)

func main() {
	doc, _ := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
	for _, ent := range doc.Entities() {
		fmt.Println(ent.Text, ent.Label)
		// Lebron James PERSON
		// Los Angeles GPE
	}
}
```

However, in an attempt to make this feature more useful, we've made it straightforward to train your own models for specific use cases. See [Prodigy + `prose`: Radically efficient machine teaching *in Go*](https://medium.com/@errata.ai/prodigy-prose-radically-efficient-machine-teaching-in-go-93389bf2d772) for a tutorial.
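
The overview prints per-token labels such as `B-GPE` and `O`, while `doc.Entities()` returns whole entities. The standard-library sketch below shows one way such labels can be merged into entity spans, assuming they follow the common BIO convention (`B-` begins an entity, `I-` continues it, `O` is outside any entity). The `token`, `entity`, and `mergeBIO` names are hypothetical, the labels in `main` are written by hand rather than produced by the library, and this is not a description of how `prose` builds its entities.

```go
package main

import (
	"fmt"
	"strings"
)

// token and entity are hypothetical stand-ins that carry only the fields
// this sketch needs.
type token struct{ Text, Label string }
type entity struct{ Text, Label string }

// mergeBIO groups consecutive B-/I- labeled tokens into entities.
// Assumption: labels follow the BIO convention described above.
func mergeBIO(tokens []token) []entity {
	var ents []entity
	var words []string
	var label string

	// flush emits the entity collected so far, if any.
	flush := func() {
		if len(words) > 0 {
			ents = append(ents, entity{Text: strings.Join(words, " "), Label: label})
		}
		words, label = nil, ""
	}

	for _, tok := range tokens {
		switch {
		case strings.HasPrefix(tok.Label, "B-"):
			flush()
			words = []string{tok.Text}
			label = strings.TrimPrefix(tok.Label, "B-")
		case strings.HasPrefix(tok.Label, "I-") && len(words) > 0 &&
			strings.TrimPrefix(tok.Label, "I-") == label:
			words = append(words, tok.Text)
		default:
			flush()
		}
	}
	flush()
	return ents
}

func main() {
	// Hand-labeled tokens for illustration only.
	tokens := []token{
		{"Lebron", "B-PERSON"}, {"James", "I-PERSON"}, {"plays", "O"},
		{"basketball", "O"}, {"in", "O"}, {"Los", "B-GPE"},
		{"Angeles", "I-GPE"}, {".", "O"},
	}
	for _, ent := range mergeBIO(tokens) {
		fmt.Println(ent.Text, ent.Label)
	}
}
```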