https://github.com/turbopape/postagga

A Library to parse natural language in pure Clojure and ClojureScript
https://github.com/turbopape/postagga
bots clojure clojurescript natural-language-processing parser pos-tagger viterbi-algorithm
Last synced: 8 days ago
JSON representation
A Library to parse natural language in pure Clojure and ClojureScript
Host: GitHub
URL: https://github.com/turbopape/postagga
Owner: turbopape
License: mit
Created: 2017-03-10T11:19:05.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2020-12-15T05:51:02.000Z (over 4 years ago)
Last Synced: 2024-11-12T11:16:45.907Z (5 months ago)
Topics: bots, clojure, clojurescript, natural-language-processing, parser, pos-tagger, viterbi-algorithm
Language: Clojure
Homepage:
Size: 17.7 MB
Stars: 159
Watchers: 18
Forks: 16
Open Issues: 12
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project

awesome-nlp - postagga - 用於解析 Clojure 和 ClojureScript 中的自然語言的函式庫。 (函式庫 / 書籍)
README

        # postagga

[![Backers on Open Collective](https://opencollective.com/postagga/backers/badge.svg)](#backers) [![Sponsors on Open Collective](https://opencollective.com/postagga/sponsors/badge.svg)](#sponsors) [![License MIT](https://img.shields.io/badge/License-MIT-blue.svg)](http://opensource.org/licenses/MIT)

[![Gratipay](https://img.shields.io/gratipay/turbopape.svg)](https://gratipay.com/turbopape/)

[![Clojars Project](https://img.shields.io/clojars/v/postagga.svg)](https://clojars.org/postagga)



 

> "But if thought corrupts language, language can also corrupt thought."

- George Orwell, 1984

**postagga** is a suite of tools to assist you in generating 

efficient and self-contained natural language processors. You can use **postagga** 

to process annotated text samples into full fledged parsers capable of

understanding "*free speech*" input as structured data. Ah - and

you'll be able to do this easily. You're welcome.

# Getting postagga

You can refer **postagga** as a lib in your Clojure project. Grab it

from clojars - in your dependencies in **project.clj**, just add:

[![Clojars Project](https://img.shields.io/clojars/v/postagga.svg)](https://clojars.org/postagga)

You can also clone the project and walk around the source and models:

```ssh

git clone https://github.com/turbopape/postagga.git

```

The models are included under the [models folder](https://github.com/turbopape/postagga/blob/master/models). 

In JVM Clojure, provided you have cloned the repository:

```clojure

;; ...

 (def fr-model (load-edn "models/fr_tb_v_model.edn")) ;; for French for instance

;; ... 

```

We also shipped two light models as vars defined in namespaces: one for

French and one for English. As for JavaScript, the artifacts size are

a concern. You can use these models by requiring the two namespaces:

```clojure

  (ns your-cool.bot

   (: require [postagga.en-fn-v-model :refer [en-model]] ;; for English

              [postagga.fr-tb-v-model :refer [fr-model]])) ;; for French 

   ;; ...

   

```

These namespaces make it easy for you to ship parsers for ClojureScript.

You can see an example on how to work with this model, all while

making sure your code is compatible across Clojure AND ClojureScript

(thanks to readers' conditional) in the [Test File](https://github.com/turbopape/postagga/blob/master/test/postagga/core_test.cljc).

# How does it work?

To do its magic, **postagga** extracts the *phrase structure* of your input and tries to find how does this structure compare to its many semantic rules and if it finds a match, where in this structure shall it extract meaningful information.

Let's study a simple example. Look at the next sentence:

> "Rafik loves apples"

That is our "Natural language input."

First step in understanding this sentence is to extract some structure from it so it is easier to interpret. One common way to do this is extracting its grammatical phrase structure, which is close enough to what "function" words are actually meant to provide:

> Noun Verb Noun

That was the phrase structure analysis, or as we call it POS (Part Of Speech) Tagging. These "Tags" qualify parts of the sentence, as the name implies, and will be used as a hi-fidelity mechanism to write rules for parsers of such phrases.

**postagga** has tools that enable you to train POS Taggers for any language you want, without relying on external libs. Actually, it does not care about the meaning of the tags at all. However, you should be consistent and clear enough when annotating your input data samples with tags. On one hand, your parser will be more reliable. On the other hand you'll do yourself a great favor maintaining your parser.

Now comes the parser part. Actually, **postagga** offers a parser that needs semantic **rules** to be able to map a particular phrase structure into data. In our example, we know that the first **Noun** depicts a subject carrying out some action. This action is represented by the **Verb** following it. Finally, the **Noun** coming after the **Verb** will undergo this action.

**postagga** parsers lets you express such rules so it can extract the data for you. You literally tell it to take the first **Noun**, call it **Subject**, take the verb, label it **action**, take the last **Noun**, call it the **Object**, finally packaging it all into the following data structure:

```clojure

{:Subject "Rafik" :Action "Loves" :Object "Apples"}

```

Naturally, **postagga** can handle much more complex sentences!

**postagga** parsers are eventually compiled into self-contained packages, with no single third party dependency. From there it can easily run on servers (Clojure version) and on the browser (ClojureScript). Now your bots can really get what you're trying to tell them!

# The postagga workflow

## Training a POS Tagger

First of all, you need to train a POS Tagger that can qualify parts of

your natural text. **postagga** relies on Hidden Markov Models,

computed with

the

[Viterbi  Algorithm](https://en.wikipedia.org/wiki/Viterbi_algorithm). This

algorithm makes use of a set of matrices, like what states (the POS Tags)

we have, how likely we transition from one POS tag to another,

etc...

All of these constitute a **model**. These are computed out of what we

call an **annotated text corpus**. The **postagga.trainer** namespace is used create models

out of such annotated text corpus.

To train a model, make sure you have an annotated corpus like so:

```clojure

[ ;; A vector of sentences like this one:

[["-" "PONCT"] ["guerre" "NC"] ["d'" "P"] ["indochine" "NPP"]] [["-" "PONCT"] ["colloque" "NC"] ["sur" "P"] ["les" "DET"] ["fraudes" "NC"]] [["-" "PONCT"] ["dernier" "ADJ"] ["résumé" "NC"] [":" "PONCT"] ["l'" "DET"] ["\"" "PONCT"] ["affaire" "NC"] ["des" "P+D"] ["piastres" "NC"] ["\"" "PONCT"]] [["catégories" "NC"] [":" "PONCT"] ["guerre" "NC"] ["d'" "P"] ["indochine" "NPP"] ["." "PONCT"]] [["indochine" "NPP"] ["française" "ADJ"] ["." "PONCT"]] [["quatrième" "ADJ"] ["république" "NC"] ["." "PONCT"]

;; etc...

]

```

Say you have this corpus - that is: a vector of annotated sentences

in a var unsurprisingly named **corpus**. To train a **model**, just issue:

```clojure

(require '[postagga.trainer :refer [train]]

(def model (train corpus)) ;;<- Beware, these can be large vars so avoid realizing all of them like printing in your REPL!!!

```

We processed one annotated corpus for English:

- [postagga-fn-en.edn](https://github.com/turbopape/postagga/blob/master/models/postagga-fn-en.edn)

  Generated from

  the

  [Framenet Project](https://framenet.icsi.berkeley.edu/fndrupal/)

We also processed two annotated corpora for French:

- [postagga-sequoia-fr.edn](https://github.com/turbopape/postagga/blob/master/models/postagga-sequoia-fr.edn)

    Generated from

    the

    [Sequoia Corpus from INRIA](https://www.rocq.inria.fr/alpage-wiki/tiki-index.php?page=CorpusSequoia).

    

- [postagga-tb-fr.edn](https://github.com/turbopape/postagga/blob/master/models/fr_tb_v_model.edn)

    Generated from

    the

    [Free French tree Bank](https://github.com/nicolashernandez/free-french-treebank).

    

We exposed two of these models as Clojure namespaces so you can embed

them without using the **resource** functionality - as it is specific

to Clojure(JVM). We chose the two lightest ones to limit the possibility

cause network issues:

- [French Model as a namespace: postagga.fr_tb_model](https://github.com/turbopape/postagga/blob/master/src/postagga/fr_tb_model.cljc)

- [English Model as a namespace: postagga.en_fn_model](https://github.com/turbopape/postagga/blob/master/src/postagga/en_fn_model.cljc)

The suite of tools used to process these two corpora are in

the [corpuscule project](https://github.com/turbopape/corpuscule). 

**Please refer to the licensing of these corpora to see what

extent you can use work derived from them.**

We then trained a  model out of the above English corpus:

- [en_fn_v_model.edn](https://github.com/turbopape/postagga/blob/master/models/en_fn_v_model.edn)

... and two models out of these two French corpora:

- [fr_sequoia_pos_v_model.edn](https://github.com/turbopape/postagga/blob/master/models/fr_sequoia_pos_v_model.edn)

- [fr_tb_v_model.edn](https://github.com/turbopape/postagga/blob/master/models/fr_tb_v_model.edn)

      

    

Now you can use that **model** to assign POS tags to speech:

(**Note:** sentences must be fed in the form of a vector of all small-case

tokens)

```clojure

(require '[postagga.tagger :refer [viterbi]])

(viterbi model ["je" "suis" "heureux"])

;;=> ["CLS" "V" "ADJ"]

```

### Patching Viterbi's output

When the tagger encounters a word it doesn't know about- that is, was

not in the corpus used to generate the viterbi models - it arbitrarily

assigns it a tag - more or less randomly picked by the algorithm. To 

somehow enhance the detection, it is possible to *patch* the output,

that is, look it up in a dictionary of terms of a known type and force the tags

accordingly. For instance, if you have a dictionary for

proper nouns in a given language, you can patch your HMM generated

POS-tags by forcing every word happening to be an entry in this

dictionary to have the "NPP" tag.

We provide two dictionaries for proper nouns:

- [fr_tr_names.cljc](https://github.com/turbopape/postagga/blob/master/src/postagga/fr_tr_names.cljc) for French,

- [en_tr_names.cljc](https://github.com/turbopape/postagga/blob/master/src/postagga/en_tr_names.cljc) for English. 

You can see how you can integrate patching in the parsing phase

hereafter.

Technically, dictionaries are [tries](https://en.wikipedia.org/wiki/Trie) to speed up lookup for multiple

entries. But this may evolve during time and should be considered as

mere details implementation.

### Meaning of tags

A reference to the meaning of tags is provided:

- For [English](https://github.com/turbopape/postagga/blob/master/models/en_penn_tb_tags.md)

## Using the tagger to parse free speech

Now that you have your tagger trained, you can use a parser to drill the

information from your sentences. For our last example say you want

**postagga** to understand how you currently feel, or how you look. It can be done by detecting

the first token as being a Subject - **CLS**, doing a Verb - **V** and

then having an Adjective - **ADJ**. We want to detect who is having what

adjective in our sentence.

For this, we'll use the **postagga.parser** namespace.

First of all, require the namespace:

```clojure

(require '[postagga.parser :refer [parse-tags-rules]])

```

Then, you'll need to specify rules for the parser. We want to grab the

word tagged as **CLS** and the word tagged as **ADJ** as our

information. Here's what the parser rules look like:

```clojure

(def sample-rules [{;;Rule TB French "je suis heureux."

                    :id :sample-rule-tb-french

                    :optional-steps []

                    :rule [

                           :qui       ;;<----- A atep

                           #{:get-value #{"CLS"}} ;;<----- A state in the parse machine

                                           ;;i.e, a set of possible sets of POS TAGS                           

                           :!OR!

                           

                           :product

                           #{#{"DET"}}

                           #{:get-value #{"NC"}} ;;<--- an alternate possible

                                                 ;; state at this step

                                                 

                           

                           :mood              ;;<--- Another step

                           #{#{"V"}}

                           #{:get-value #{"ADJ"}}]}]

```

This deserves some explanation before we carry on with our example.

The parser is basically a state machine. It goes through **steps** *([:qui, :mood])*, with each step encompassing multiple

**states** *([#{#{"V}} ...])*. A **state** basically refers to words; it is matched with tag sets

(a word can relate to multiple tags, if your preferred tagger wants to!). 

Different tag sets can be assigned to a **state**. For instance, to say that in some **state** we require either a *Noun("NPP")* or a *Verb("V")*, you might put:

```clojure

;...

#{#{"V"} #{"NPP"}}

;...

```

Putting the keyword **:get-value** in a **state** tells the parser to grab the word having

led to this state, put in the yielded parse map, and finally assign it to a key representing

the **step** in which that state was in. Confusing, isn't it? :confused:

You'll get it with an example.

Let's say that somewhere we have:

```clojure

[:qui ; <-- A step

;;...

   {:get-value #{"CLS"}} ;;<-- A state with :get-value under the :qui step

;;...

]

```

The value of the word that yielded the tag **CLS** (which is **je** in our example) will be reflected on the

output map as an entry in some vector, associated with the related step, which is **qui**:

```clojure

{:qui ["je"]}

```

This is what the **postagga** parser is all about: you tell it where to extract information and how you want it structured for upstream processing.

If we had multiple states with **:get-value** flag on, we'll find multiple words in the corresponding entry in the output. This is why the **step** key is referring a vector of words in the output map.

It is also possible to say that a state can be encountered repeatedly

using the **:multi** keyword. If you say in certain state:

```clojure

:some-step

;...

#{:get-value :multi #{"ADJ"}

;...

```

and if you feed **postagga** the following tokenized sentence:

```clojure

["il" "parait" "beau" "grand" "heureux"]

```

you'll find in the parse map:

```clojure

{:some-step ["beau" "grand" "heureux"]}

```

The **:optional-steps** stanza tells the parser not to raise an error if a step

belonging to this vector is not present.

At any step you can specify multiple alternatives for capturing

different sets of information. In the above example, you can say that

the first step in your sentence might talk about a person captured

through the *CLS* attribute or a product captured by specified a

*DET* than a *NC*. You specify such alternatives via the :!OR! keyword.

You'll also need to tell the parser how to break down a line of text

into a vector of words. We call this a **tokenizer**. Waiting to

develop a full fledged couple with language-specific rules, we can

just start by a naive one that splits strings using space characters:

```clojure

;; Hey, this one works only on Clojure (JVM) version !!

(def sample-tokenizer-fn #(clojure.string/split % #"\s"))

```

Back to our sample. With **sample-rules** holding a set of rules as defined above,

you can parse your sentence like so:

```clojure

(def parse-result (parse-tags-rules 

                   sample-tokenizer-fn      ;; The tokenizer function.

                   (partial viterbi model)  ;; The tagger function - curried with a model

                   sample-rules             ;; The parser rules.

                   "je suis heureux"))      ;; The sentence to parse. 

```

and you'd have a detailed result like so:

```clojure

{:errors nil ;;<- The error if any

 :result {:rule :sample-rule-tb-french ;; <- Which rule was detected 

          :data {:qui ["je"],          ;; <- The data structure drilled

                                       ;;    down from the input.

                 :mood ["heureux"]}}}

```

The errors will be reported as a collection mapping each rule to what

step and state the parser failed. This can be quite large, so be

careful not to spit the contents of the result directly into your REPL - 

you can test on the **:errors** being _nil_ and work with the

**:data** value:

```clojure

;; Do something with

(:data parse-result)

 

```

To integrate patching, as discussed above to the parsing, you can

proceed as follows:

```clojure

(def patch-fr-tagger-w-name ;;<- a function that wraps viterbi into a

                            ;; "patched" version

  #(patch-w-entity  0.9 % en-names-trie

                    ;; Takes a sentence, computes tags with viterbi

                    ;; and afterwards, looks if the words are close

                    ;; enough to entries in the French names

                    ;; dictionary, in which case it will force them to

                    ;; have "NC" tag 

                    (viterbi fr-model %)

                    "NC"))

;;=> #'postagga.core-test/patch-fr-tagger-w-name                    

(-> (parse-tags-rules sample-tokenizer-fn 

                      patch-fr-tagger-w-name 

                      sample-rules "nicolas est heureux")

              (get-in [:result :data]))

;;=> {:qui["nicolas"] :mood ["heureux"]}              

```

# Complete list of features

You can see some of this workflow (other than the training) in the

[Tests](https://github.com/turbopape/postagga/blob/master/test/postagga/core_test.cljc).

Please refer to the [Changelog](https://github.com/turbopape/postagga/blob/master/CHANGELOG.md) to see included features per version.

# TODO and contributing

**postagga** can make great use of great contributors like you! I'll

track the enhancements, bugs, features, etc., in the [project issues](https://github.com/turbopape/postagga/issues)

tab and please feel free to send your PRs!

# Code Of Conduct

Please note that this project is released with a [Contributor Code of Conduct](./CODE_OF_CONDUCT.md). By

participating in this project, you agree to abide by its terms.

## Contributors

This project exists thanks to all the people who contribute. [[Contribute]](CONTRIBUTING.md).



## Backers

Thank you to all our backers! 🙏 [[Become a backer](https://opencollective.com/postagga#backer)]



## Sponsors

Support this project by becoming a sponsor. Your logo will show up here with a link to your website. [[Become a sponsor](https://opencollective.com/postagga#sponsor)]





















# License and Credits

Copyright (c) 2017 [Rafik Naccache](mailto:[email protected]).

Happily brought to you by [fekr](http://www.fekr.tech).

The Logo is created by my talented friend the great [Chakib Daoud](https://www.facebook.com/3amettaher/?fref=ts)

Distributed under the terms of the [MIT License]("http://opensource.org/licenses/MIT).
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/turbopape/postagga

Awesome Lists containing this project

README