https://github.com/chainsawriot/compromiser

⚖️ English Natural Language Processing With Compromise
https://github.com/chainsawriot/compromiser

Last synced: 3 months ago
JSON representation

⚖️ English Natural Language Processing With Compromise

Host: GitHub
URL: https://github.com/chainsawriot/compromiser
Owner: chainsawriot
License: gpl-3.0
Created: 2021-11-26T15:45:22.000Z (over 3 years ago)
Default Branch: master
Last Pushed: 2021-11-27T12:02:34.000Z (over 3 years ago)
Last Synced: 2025-02-24T09:39:51.460Z (4 months ago)
Language: R
Homepage:
Size: 105 KB
Stars: 1
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE.md

Awesome Lists containing this project

jimsghstars - chainsawriot/compromiser - ⚖️ English Natural Language Processing With Compromise (R)

README

        ---

output: github_document

---

```{r, include = FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>",

  fig.path = "man/figures/README-",

  out.width = "100%"

)

devtools::load_all()

```

# compromiser

The goal of compromiser is to make the elegant [compromise](https://compromise.cool/) javascript library available to R. So, what is `compromise`? Self-described as "modest natural language processing", `compromise` is an extremely lightweight, easy-to-use, rule-based NLP library for English. As of writing, the entire library is 200kB (smaller than a typical animated gif file). The library is optimized for speed and being "accurate enough".

Similar to the original `compromise`, compromiser is optimized for speed and ease of use. You may know [spacyr](https://github.com/quanteda/spacyr). The current package is modeled after spacyr, but with speed, out-of-the-box experience and "enough" accuracy. Similar to spacyr, all output formats are [tif](https://github.com/ropensci/tif)-compatible.

Caveats:

* At the moment `compromiser` does POS tagging only

* English only

* It won't keep punctuations

* It is only accurate enough for most applications. Don't expect state-of-the-art performance.

## Installation

``` r

devtools::install_github("chainsawriot/compromiser")

```

## Example

Unlike spacyr, udpipe, or openNLP, no post-installation setup is required for compromiser. You can directly go POS parsing a small corpus. 

```r

library(compromiser)

```

```{r}

textdata <- c("The dog has been selectively bred over millennia for various

 behaviors, sensory capabilities, and physical attributes. Dog breeds vary

 widely in shape, size, and color. They perform many roles for humans, such

 as hunting, herding, pulling loads, protection, assisting police and the

 military, companionship, therapy, and aiding disabled people. This influence

 on human society has given them the sobriquet of \"man's best friend.\"",

 "A methane gas explosion and fire in a Siberian coal mine left more than

 50 miners and rescuers dead. Another 239 people were rescued.")

x <- tag(textdata)

x

```

It gets the job done. Some tricky words are tagged incorrectly, though. (e.g. breeds, mine)

```{r}

as.data.frame(x)

```

# Integration with quanteda

```r

library(quanteda)

```

```{r, echo = FALSE, message = FALSE}

library(quanteda)

```

Conversion to Quanteda's tokens.

```{r}

x_toks <- as.tokens(x)

x_toks

```

Suppose you are only interested in nouns and adjectives.

```{r}

tokens_select(x_toks, pattern = c("*/N*", "*/JJ"))

```

# Integration with tidytext

I don't care. But probably the `as.data.frame` method is sufficient.

# A more serious example

`data_corpus_inaugural` is a corpus of all US inaugural speeches.

```{r}

system.time(inaug <- tag(data_corpus_inaugural))

inaug

```

It took < 15 seconds to do POS tagging of 138054 tokens. With that and using quanteda, we can study what are the most frequent nouns.

```{r}

inaug %>% as.tokens %>% tokens_select(pattern = c("*/NN*")) %>% dfm %>% topfeatures(n = 30)

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/chainsawriot/compromiser

Awesome Lists containing this project

README