https://github.com/wrathematics/meanr
A sentiment analysis package for R.
- Host: GitHub
- URL: https://github.com/wrathematics/meanr
- Owner: wrathematics
- License: other
- Created: 2016-12-01T20:47:13.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2023-12-10T18:49:04.000Z (11 months ago)
- Last Synced: 2024-08-10T10:38:07.580Z (3 months ago)
- Language: C
- Size: 513 KB
- Stars: 22
- Watchers: 2
- Forks: 5
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: ChangeLog
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - wrathematics/meanr - A sentiment analysis package for R. (C)
README
# meanr
* **Version:** 0.1-5
* **URL**: https://github.com/wrathematics/meanr
* **License:** [BSD 2-Clause](https://opensource.org/license/bsd-2-clause/)
* **Author:** Drew Schmidt

**meanr** is an R package performing sentiment analysis. Its main method, `score()`, computes sentiment as a simple sum of the counts of positive (+1) and negative (-1) sentiment words in a piece of text. More sophisticated techniques are available to R, for example in the **qdap** package's `polarity()` function. This package uses [the Hu and Liu sentiment dictionary](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), same as everybody else.
**meanr** is significantly faster than everything else I tried (which was actually the motivation for its creation), but I don't claim to have tried everything. I believe the package is quite fast. However, the method is merely a dictionary lookup, so, unlike more sophisticated methods, it ignores word context. On the other hand, the more sophisticated tools are very slow. If you have a large volume of text, I believe there is value in getting a "first glance" at the data, and **meanr** allows you to do this very quickly.
## Installation
The stable version is available on CRAN:
```r
install.packages("meanr")
```

The development version is maintained on GitHub:
```r
remotes::install_github("wrathematics/meanr")
```

## Example Usage
I have a dataset that, for legal reasons, I cannot describe, much less provide. You can think of it like a collection of tweets (they are not tweets). But take my word for it that it's real English-language text. The data is in the form of a vector of strings, which we'll call `x`.
```r
x = readRDS("x.rds")

length(x)
## [1] 655760

sum(nchar(x))
## [1] 162663972

library(meanr)
system.time(s <- score(x))
##    user  system elapsed
##   1.072   0.000   0.285

head(s)
##   positive negative score  wc
## 1        2        0     2  32
## 2        5        0     5  29
## 3        4        2     2  67
## 4       12        3     9 203
## 5        8        2     6 101
## 6        4        3     1  99
```

## How It Works
The `score()` function receives a vector of strings, and operates on each one as follows:
1. The maximum string length is found, and a buffer of that size is allocated.
2. The string is copied to the buffer.
3. All punctuation is removed. All characters are converted to lowercase.
4. Score sentiment:
   - Tokenize words as collections of chars separated by a space.
   - Check if the word is positive; if not, check if it is negative; if not, then it's assumed to be neutral. Each check is a lookup in one of two hash tables, one for each of Hu and Liu's dictionaries.
   - If the word is in the table, get its value from the hash table (positive words have value 1, negative words -1) and update the various counts. Otherwise, the word is "neutral" (score of 0).

This is all done in four passes of each string; each pass corresponds to one of the enumerated items above. The hash tables use perfect hash functions generated by gperf.
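
To make the steps above concrete, here is a rough sketch of the same idea in plain R. It is not the package's implementation (that is written in C with gperf-generated hash tables). The function name `score_sketch()` and the two tiny word lists are made up for illustration and stand in for the Hu and Liu dictionaries. Edge cases such as how words are counted may differ from what `score()` actually does.

```r
# Hypothetical stand-ins for the Hu and Liu positive/negative dictionaries.
positive_words <- c("good", "great", "love", "fast")
negative_words <- c("bad", "slow", "hate", "terrible")

# Illustrative only: mimics the per-string steps described above in base R.
score_sketch <- function(x) {
  rows <- lapply(x, function(s) {
    # Convert to lowercase and remove all punctuation.
    s <- gsub("[[:punct:]]", "", tolower(s))

    # Tokenize on single spaces and drop any empty tokens.
    words <- strsplit(s, " ", fixed = TRUE)[[1]]
    words <- words[nzchar(words)]

    # Dictionary lookups: a word is positive, negative, or (implicitly) neutral.
    pos <- sum(words %in% positive_words)
    neg <- sum(words %in% negative_words)

    data.frame(positive = pos, negative = neg, score = pos - neg,
               wc = length(words))
  })
  do.call(rbind, rows)
}

# First string: 2 positive words out of 4; second: 3 negative words out of 6.
score_sketch(c("Great package, really fast!", "Slow and bad. I hate it."))
```

The sketch relies on `%in%` for the lookups, which is convenient but nowhere near as fast as a perfect hash; it is only meant to show the shape of the computation.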