{"id":13413560,"url":"https://github.com/jdkato/prose","last_synced_at":"2025-10-05T18:31:17.413Z","repository":{"id":17599221,"uuid":"82319669","full_name":"jdkato/prose","owner":"jdkato","description":":book: A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction.","archived":true,"fork":false,"pushed_at":"2023-05-02T05:39:17.000Z","size":28087,"stargazers_count":3056,"open_issues_count":21,"forks_count":164,"subscribers_count":56,"default_branch":"master","last_synced_at":"2024-10-23T17:17:02.903Z","etag":null,"topics":["natural-language-processing","nlp","prose"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jdkato.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-02-17T17:08:22.000Z","updated_at":"2024-10-20T11:42:56.000Z","dependencies_parsed_at":"2024-06-18T11:18:00.163Z","dependency_job_id":null,"html_url":"https://github.com/jdkato/prose","commit_stats":{"total_commits":271,"total_committers":14,"mean_commits":"19.357142857142858","dds":"0.11439114391143912","last_synced_commit":"a376476c262708a9c6dcf3807cba0cd0f0aba2ff"},"previous_names":["jdkato/aptag"],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdkato%2Fprose","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdkato%2Fprose/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdkato%2Fprose/releases","manifests_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jdkato%2Fprose/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jdkato","download_url":"https://codeload.github.com/jdkato/prose/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235432148,"owners_count":18989466,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["natural-language-processing","nlp","prose"],"created_at":"2024-07-30T20:01:43.191Z","updated_at":"2025-10-05T18:31:09.234Z","avatar_url":"https://github.com/jdkato.png","language":"Go","readme":"# prose [![Build Status](https://travis-ci.org/jdkato/prose.svg?branch=master)](https://travis-ci.org/jdkato/prose) [![GoDoc](https://godoc.org/github.com/golang/gddo?status.svg)](https://pkg.go.dev/github.com/jdkato/prose/v2@v2.0.0?tab=doc) [![Coverage Status](https://coveralls.io/repos/github/jdkato/prose/badge.svg?branch=master)](https://coveralls.io/github/jdkato/prose?branch=master) [![Go Report Card](https://goreportcard.com/badge/github.com/jdkato/prose)](https://goreportcard.com/report/github.com/jdkato/prose) [![codebeat badge](https://codebeat.co/badges/a867ec38-c025-4f65-85f9-89a9188cc458)](https://codebeat.co/projects/github-com-jdkato-prose-master) [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/avelino/awesome-go#natural-language-processing)\n\n`prose` is a natural language processing library (English only, at the moment) in *pure Go*. 
It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.\n\nYou can find a more detailed summary on the library's performance here: [Introducing `prose` v2.0.0: Bringing NLP *to Go*](https://medium.com/@errata.ai/introducing-prose-v2-0-0-bringing-nlp-to-go-a1f0c121e4a5).\n\n## Installation\n\n```console\n$ go get github.com/jdkato/prose/v2\n```\n\n## Usage\n\n### Contents\n\n* [Overview](#overview)\n* [Tokenizing](#tokenizing)\n* [Segmenting](#segmenting)\n* [Tagging](#tagging)\n* [NER](#ner)\n\n### Overview\n\n\n```go\npackage main\n\nimport (\n    \"fmt\"\n    \"log\"\n\n    \"github.com/jdkato/prose/v2\"\n)\n\nfunc main() {\n    // Create a new document with the default configuration:\n    doc, err := prose.NewDocument(\"Go is an open-source programming language created at Google.\")\n    if err != nil {\n        log.Fatal(err)\n    }\n\n    // Iterate over the doc's tokens:\n    for _, tok := range doc.Tokens() {\n        fmt.Println(tok.Text, tok.Tag, tok.Label)\n        // Go NNP B-GPE\n        // is VBZ O\n        // an DT O\n        // ...\n    }\n\n    // Iterate over the doc's named-entities:\n    for _, ent := range doc.Entities() {\n        fmt.Println(ent.Text, ent.Label)\n        // Go GPE\n        // Google GPE\n    }\n\n    // Iterate over the doc's sentences:\n    for _, sent := range doc.Sentences() {\n        fmt.Println(sent.Text)\n        // Go is an open-source programming language created at Google.\n    }\n}\n```\n\nThe document-creation process adheres to the following sequence of steps:\n\n```text\ntokenization -\u003e POS tagging -\u003e NE extraction\n            \\\n             segmentation\n```\n\nEach step may be disabled (assuming later steps aren't required) by passing the appropriate [*functional option*](https://dave.cheney.net/2014/10/17/functional-options-for-friendly-apis). 
To disable named-entity extraction, for example, you'd do the following:\n\n```go\ndoc, err := prose.NewDocument(\n        \"Go is an open-source programming language created at Google.\",\n        prose.WithExtraction(false))\n```\n\n### Tokenizing\n\n`prose` includes a tokenizer capable of processing modern text, including the non-word character spans shown below.\n\n| Type            | Example                           |\n|-----------------|-----------------------------------|\n| Email addresses | `Jane.Doe@example.com`            |\n| Hashtags        | `#trending`                       |\n| Mentions        | `@jdkato`                         |\n| URLs            | `https://github.com/jdkato/prose` |\n| Emoticons       | `:-)`, `\u003e:(`, `o_0`, etc.         |\n\n\n```go\npackage main\n\nimport (\n    \"fmt\"\n    \"log\"\n\n    \"github.com/jdkato/prose/v2\"\n)\n\nfunc main() {\n    // Create a new document with the default configuration:\n    doc, err := prose.NewDocument(\"@jdkato, go to http://example.com thanks :).\")\n    if err != nil {\n        log.Fatal(err)\n    }\n\n    // Iterate over the doc's tokens:\n    for _, tok := range doc.Tokens() {\n        fmt.Println(tok.Text, tok.Tag)\n        // @jdkato NN\n        // , ,\n        // go VB\n        // to TO\n        // http://example.com NN\n        // thanks NNS\n        // :) SYM\n        // . 
.\n    }\n}\n```\n\n### Segmenting\n\n`prose` includes one of the most accurate sentence segmenters available, according to the [Golden Rules](https://github.com/diasks2/pragmatic_segmenter#comparison-of-segmentation-tools-libraries-and-algorithms) created by the developers of the `pragmatic_segmenter`.\n\n| Name                | Language | License   | GRS (English)  | GRS (Other) | Speed†   |\n|---------------------|----------|-----------|----------------|-------------|----------|\n| Pragmatic Segmenter | Ruby     | MIT       | 98.08% (51/52) | 100.00%     | 3.84 s   |\n| prose               | Go       | MIT       | 75.00% (39/52) | N/A         | 0.96 s   |\n| TactfulTokenizer    | Ruby     | GNU GPLv3 | 65.38% (34/52) | 48.57%      | 46.32 s  |\n| OpenNLP             | Java     | APLv2     | 59.62% (31/52) | 45.71%      | 1.27 s   |\n| Stanford CoreNLP    | Java     | GNU GPLv3 | 59.62% (31/52) | 31.43%      | 0.92 s   |\n| Splitta             | Python   | APLv2     | 55.77% (29/52) | 37.14%      | N/A      |\n| Punkt               | Python   | APLv2     | 46.15% (24/52) | 48.57%      | 1.79 s   |\n| SRX English         | Ruby     | GNU GPLv3 | 30.77% (16/52) | 28.57%      | 6.19 s   |\n| Scalpel             | Ruby     | GNU GPLv3 | 28.85% (15/52) | 20.00%      | 0.13 s   |\n\n\u003e † The original tests were performed using a *MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5*, while `prose` was timed using a *MacBook Pro 2.9 GHz Intel Core i7 running 10.13.3*.\n\n```go\npackage main\n\nimport (\n    \"fmt\"\n    \"strings\"\n\n    \"github.com/jdkato/prose/v2\"\n)\n\nfunc main() {\n    // Create a new document with the default configuration:\n    doc, _ := prose.NewDocument(strings.Join([]string{\n        \"I can see Mt. Fuji from here.\",\n        \"St. Michael's Church is on 5th st. 
near the light.\"}, \" \"))\n\n    // Iterate over the doc's sentences:\n    sents := doc.Sentences()\n    fmt.Println(len(sents)) // 2\n    for _, sent := range sents {\n        fmt.Println(sent.Text)\n        // I can see Mt. Fuji from here.\n        // St. Michael's Church is on 5th st. near the light.\n    }\n}\n```\n\n### Tagging\n\n`prose` includes a tagger based on Textblob's [\"fast and accurate\" POS tagger](https://github.com/sloria/textblob-aptagger). Below is a comparison of its performance against [NLTK](http://www.nltk.org/)'s implementation of the same tagger on the Treebank corpus:\n\n| Library | Accuracy | 5-Run Average (sec) |\n|:--------|---------:|--------------------:|\n| NLTK    |    0.893 |               7.224 |\n| `prose` |    0.961 |               2.538 |\n\n(See [`scripts/test_model.py`](https://github.com/jdkato/aptag/blob/master/scripts/test_model.py) for more information.)\n\nThe full list of supported POS tags is given below.\n\n| TAG        | DESCRIPTION                               |\n|------------|-------------------------------------------|\n| `(`        | left round bracket                        |\n| `)`        | right round bracket                       |\n| `,`        | comma                                     |\n| `:`        | colon                                     |\n| `.`        | period                                    |\n| `''`       | closing quotation mark                    |\n| ``` `` ``` | opening quotation mark                    |\n| `#`        | number sign                               |\n| `$`        | currency                                  |\n| `CC`       | conjunction, coordinating                 |\n| `CD`       | cardinal number                           |\n| `DT`       | determiner                                |\n| `EX`       | existential there                         |\n| `FW`       | foreign word                              |\n| `IN`       | conjunction, subordinating or preposition |\n| `JJ` 
      | adjective                                 |\n| `JJR`      | adjective, comparative                    |\n| `JJS`      | adjective, superlative                    |\n| `LS`       | list item marker                          |\n| `MD`       | verb, modal auxiliary                     |\n| `NN`       | noun, singular or mass                    |\n| `NNP`      | noun, proper singular                     |\n| `NNPS`     | noun, proper plural                       |\n| `NNS`      | noun, plural                              |\n| `PDT`      | predeterminer                             |\n| `POS`      | possessive ending                         |\n| `PRP`      | pronoun, personal                         |\n| `PRP$`     | pronoun, possessive                       |\n| `RB`       | adverb                                    |\n| `RBR`      | adverb, comparative                       |\n| `RBS`      | adverb, superlative                       |\n| `RP`       | adverb, particle                          |\n| `SYM`      | symbol                                    |\n| `TO`       | infinitival to                            |\n| `UH`       | interjection                              |\n| `VB`       | verb, base form                           |\n| `VBD`      | verb, past tense                          |\n| `VBG`      | verb, gerund or present participle        |\n| `VBN`      | verb, past participle                     |\n| `VBP`      | verb, non-3rd person singular present     |\n| `VBZ`      | verb, 3rd person singular present         |\n| `WDT`      | wh-determiner                             |\n| `WP`       | wh-pronoun, personal                      |\n| `WP$`      | wh-pronoun, possessive                    |\n| `WRB`      | wh-adverb                                 |\n\n### NER\n\n`prose` v2.0.0 includes a much improved version of v1.0.0's chunk package, which can identify people (`PERSON`) and geographical/political Entities (`GPE`) by default.\n\n```go\npackage 
main\n\nimport (\n    \"fmt\"\n\n    \"github.com/jdkato/prose/v2\"\n)\n\nfunc main() {\n    doc, _ := prose.NewDocument(\"Lebron James plays basketball in Los Angeles.\")\n    for _, ent := range doc.Entities() {\n        fmt.Println(ent.Text, ent.Label)\n        // Lebron James PERSON\n        // Los Angeles GPE\n    }\n}\n```\n\nHowever, in an attempt to make this feature more useful, we've made it straightforward to train your own models for specific use cases. See [Prodigy + `prose`: Radically efficient machine teaching *in Go*](https://medium.com/@errata.ai/prodigy-prose-radically-efficient-machine-teaching-in-go-93389bf2d772) for a tutorial.\n","funding_links":[],"categories":["Go","Misc","开源类库","Natural Language Processing","Open source library","自然语言处理","Bot Building","自然語言處理","\u003cspan id=\"自然语言处理-natural-language-processing\"\u003e自然语言处理 Natural Language Processing\u003c/span\u003e","Relational Databases"],"sub_categories":["文本处理","Strings","Word Processing","暂未分类","Tokenizers","高級控制台界面","Advanced Console UIs","\u003cspan id=\"高级控制台用户界面-advanced-console-uis\"\u003e高级控制台用户界面 Advanced Console UIs\u003c/span\u003e","Uncategorized","分词器","交流","暂未分类这些库被放在这里是因为其他类别似乎都不适合。","高级控制台界面"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjdkato%2Fprose","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjdkato%2Fprose","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjdkato%2Fprose/lists"}