https://github.com/chainsawriot/compromiser
⚖️ English Natural Language Processing With Compromise
- Host: GitHub
- URL: https://github.com/chainsawriot/compromiser
- Owner: chainsawriot
- License: gpl-3.0
- Created: 2021-11-26T15:45:22.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2021-11-27T12:02:34.000Z (almost 3 years ago)
- Last Synced: 2024-08-13T07:12:56.629Z (3 months ago)
- Language: R
- Homepage:
- Size: 105 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE.md
README
---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
devtools::load_all()
```

# compromiser
The goal of compromiser is to make the elegant [compromise](https://compromise.cool/) JavaScript library available to R. So, what is `compromise`? Self-described as "modest natural language processing", `compromise` is an extremely lightweight, easy-to-use, rule-based NLP library for English. At the time of writing, the entire library is 200 kB (smaller than a typical animated GIF file). The library is optimized for speed and for being "accurate enough".

Like the original `compromise`, compromiser is optimized for speed and ease of use. You may know [spacyr](https://github.com/quanteda/spacyr). compromiser is modeled after spacyr, but prioritizes speed, an out-of-the-box experience and "enough" accuracy. As with spacyr, all output formats are [tif](https://github.com/ropensci/tif)-compatible.

Caveats:

* At the moment, `compromiser` does POS tagging only.
* It supports English only.
* It does not keep punctuation.
* It is only accurate enough for most applications. Don't expect state-of-the-art performance.

## Installation
``` r
devtools::install_github("chainsawriot/compromiser")
```

## Example

Unlike spacyr, udpipe, or openNLP, compromiser requires no post-installation setup. You can start POS tagging a small corpus right away.
```r
library(compromiser)
```

```{r}
textdata <- c("The dog has been selectively bred over millennia for various
behaviors, sensory capabilities, and physical attributes. Dog breeds vary
widely in shape, size, and color. They perform many roles for humans, such
as hunting, herding, pulling loads, protection, assisting police and the
military, companionship, therapy, and aiding disabled people. This influence
on human society has given them the sobriquet of \"man's best friend.\"",
"A methane gas explosion and fire in a Siberian coal mine left more than
50 miners and rescuers dead. Another 239 people were rescued.")

x <- tag(textdata)
x
```

It gets the job done, though some tricky words (e.g. "breeds" and "mine") are tagged incorrectly.
```{r}
as.data.frame(x)
```

# Integration with quanteda

```r
library(quanteda)
```

```{r, echo = FALSE, message = FALSE}
library(quanteda)
```

Conversion to quanteda's `tokens`:
```{r}
x_toks <- as.tokens(x)
x_toks
```

Suppose you are only interested in nouns and adjectives.
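Selection by tag works with glob patterns. The sketch below only illustrates the matching idea in base R on a hand-made vector. It assumes the converted tokens take a `word/TAG` form with Penn Treebank-style tags, which is inferred from the patterns used in this README rather than confirmed output. `toy_tokens` is a hypothetical stand-in, not a result of `tag()`.

```r
# Hand-made tokens in an assumed "word/TAG" form (not real output of tag())
toy_tokens <- c("The/DT", "dog/NN", "breeds/NNS", "vary/VBP", "various/JJ")

# Glob patterns: any tag starting with N (nouns), and exactly JJ (adjectives)
patterns <- c("*/N*", "*/JJ")

# glob2rx() from utils converts each glob into an anchored regular expression
regexes <- vapply(patterns, glob2rx, character(1))

# Keep a token if it matches at least one pattern
# (case-sensitive here; quanteda's matching is case-insensitive by default)
keep <- Reduce(`|`, lapply(regexes, grepl, x = toy_tokens))
selected <- toy_tokens[keep]
selected
```

With real output, quanteda's `tokens_select()` performs this selection: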
```{r}
tokens_select(x_toks, pattern = c("*/N*", "*/JJ"))
```

# Integration with tidytext
I don't care. But probably the `as.data.frame` method is sufficient.
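
As a rough sketch of what working from a data frame could look like, the example below filters nouns and counts them per document in base R. The data frame is built by hand, and its column names (`doc_id`, `token`, `pos`) are assumptions modeled on spacyr-like output. Check the actual columns returned by `as.data.frame(x)` before relying on them.

```r
# Hand-built stand-in for as.data.frame(x); the column names are assumed
tagged <- data.frame(
  doc_id = c("text1", "text1", "text1", "text2", "text2"),
  token  = c("dog", "breeds", "vary", "explosion", "left"),
  pos    = c("NN", "NNS", "VBP", "NN", "VBD"),
  stringsAsFactors = FALSE
)

# Keep rows whose tag starts with "N" (nouns in Penn Treebank-style tagsets)
nouns <- tagged[startsWith(tagged$pos, "N"), ]

# Count the nouns found in each document
noun_counts <- table(nouns$doc_id)
noun_counts
```

The same filtering carries over to dplyr verbs such as `filter()` and `count()` if a tidy workflow is preferred.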
# A more serious example
`data_corpus_inaugural` is a corpus of all US inaugural speeches.
```{r}
system.time(inaug <- tag(data_corpus_inaugural))
inaug
```

POS tagging 138054 tokens took less than 15 seconds. With that and quanteda, we can find the most frequent nouns.
```{r}
inaug %>%
  as.tokens() %>%
  tokens_select(pattern = c("*/NN*")) %>%
  dfm() %>%
  topfeatures(n = 30)
```