https://github.com/tonyelhabr/tetext

Personal R package for text analysis

---
output: github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
message = FALSE,
warning = FALSE,
comment = "#>",
# cache.path = "man/README",
fig.path = "man/README/README-"
)
```

```{r eval = FALSE, echo = FALSE}
viz_void <- ggplot2::ggplot() + ggplot2::theme_void()

dir_logo <- file.path("man", "figures")
# Use dir.exists() to test for the directory; exists() looks up an R object by name.
if (!dir.exists(dir_logo)) {
dir.create(dir_logo, recursive = TRUE)
}
filepath_logo <- file.path("man", "figures", paste0("logo.png"))
hexSticker::sticker(
subplot = viz_void,
package = "tetext",
filename = file.path("man", "figures", paste0("logo.png")),
p_y = 1.0,
p_color = "black",
# p_family = "sans",
p_size = 40,
h_size = 1.5,
h_color = "black",
h_fill = "yellow"
)
logo <- magick::image_read(filepath_logo)
magick::image_write(magick::image_scale(logo, "120"), path = filepath_logo)
```

# tetext

## Introduction

This package contains functions that I use for quick and "tidy" text analysis.

### Installation

`devtools::install_github("tonyelhabr/tetext")`.

## Notes

### Inspiration

This package is heavily influenced by the blogs of
[David Robinson](http://varianceexplained.org/posts/) and
[Julia Silge](https://juliasilge.com/blog/), as well as their co-authored book
[_Text Mining with R_](https://www.tidytextmining.com/).
Most of the functions in this package implement snippets of code that they have
shared.

### Main Functions

~~The following is a list of all functions in the package.~~

```{r echo = FALSE, eval = FALSE}
# library("tetext")
# ls("package:tetext")
```
The following is a list of the __main__ functions in the package.
(Not shown: the SE aliases and various helper functions.)

```{r echo = FALSE}
library("tetext")
ls_all <- ls("package:tetext")
ls_at <- grep("_at$", ls_all, value = TRUE)
ls_main <- grep("^compute_|^visualize_|^trim_|^tidify_", ls_all, value = TRUE)
ls_show <- setdiff(ls_main, ls_at)
print(sort(c(ls_show)))
```

```{r include = FALSE}
# sprintf("Code coverage: %f", covr::package_coverage())
```

Here are some short descriptions of the functions, grouped generally
by usage. Functions are listed in order of
recommended use in a script (and in the order in which I wrote them).

+ __time:__ `visualize_time()`, `visualize_time_facet()`, `visualize_time_hh()`:
Visualize data over time.
+ __tidify:__ `tidify_to_unigrams()`, `tidify_to_bigrams()`:
Tokenize data to tidy format with unigrams or bigrams.
+ __cnts:__ `visualize_cnts()`, `visualize_cnts_facet()`,
`visualize_cnts_wordcloud()`, `visualize_cnts_wordcloud_facet()`: Visualize counts of n-grams.
+ __freqs:__ `compute_freqs()`, `compute_freqs_facet()`,
`visualize_bigrams_freqs_facet()`, `compute_freqs_facet_by2()`,
`visualize_freqs_facet_by2()`:
Compute and visualize frequencies of n-grams.
+ __corrs:__ `compute_corrs()`, `visualize_corrs_network()`:
Compute and visualize pairwise correlations (of bigrams).
+ __tfidf:__ `compute_tfidf()`, `visualize_tfidf()`:
Compute and visualize TF-IDF (term frequency-inverse document frequency) of n-grams across documents.
+ __change:__ `compute_change()`, `visualize_change()`:
Compute and visualize change in n-gram usage across documents.
+ __sents:__ `compute_sent_summary()`, `compute_sent_summary_facet()`:
Compute sentiment scores for n-grams.
+ __xy:__ `compute_freqs_facet_by2()`, `visualize_freqs_facet_by2()`,
`compute_logratios_facet_by2()`, `visualize_logratios_facet_by2()`,
`compute_sentratios_facet_by2()`, `visualize_sentratios_facet_by2()`:
Compute metrics across `facet` entities. These use a handful of internal
functions that are not intended to be called directly (although they can be).
These internal functions include the following:
`create_xy_grid()`, `filter_xy_grid()`, `preprocess_xy_data()`, `postprocess_xy_data()`,
`wrapper_func()`, `add_dummy_cols()`.
Also, there are more specific internal functions, such as
`compute_freqs_facet_wide()`, `compute_logratios_facet_wide()`, `compute_sentratios_facet_wide()`.
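
The signatures of the functions listed above are not documented in this README, so the following is only a rough sketch of the kind of computation that the __tidify__, __cnts__, and __tfidf__ groups presumably wrap. It is written directly against `tidytext` and `dplyr`, in the style of _Text Mining with R_. The `docs` data and all object names are made up for illustration, and nothing here calls `tetext` itself.

```r
library(dplyr)
library(tidytext)

# Toy corpus (hypothetical data, two "documents").
docs <- tibble::tibble(
  doc = c("a", "b"),
  text = c(
    "the quick brown fox jumps",
    "the lazy dog sleeps; the dog snores"
  )
)

# Tokenize to tidy format: one row per unigram or bigram.
unigrams <- docs %>%
  tidytext::unnest_tokens(word, text)
bigrams <- docs %>%
  tidytext::unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Count unigrams per document, then attach tf, idf, and tf_idf columns.
cnts <- unigrams %>%
  dplyr::count(doc, word, sort = TRUE)
tfidf <- cnts %>%
  tidytext::bind_tf_idf(word, doc, n)
```

Words that appear in every document (such as "the" here) get a TF-IDF of zero, which is why TF-IDF is useful for finding n-grams that are distinctive to a document.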

### Function Idioms

All major functions in this package have non-standard evaluation (NSE) and
standard evaluation (SE) versions.
To distinguish the NSE and SE functions,
the `dplyr` convention of suffixing SE
functions with `_at` is used. (These functions expect character strings instead of
"bare" names to indicate column names.)
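
As a minimal sketch of this NSE/SE pairing (not the package's actual implementation), consider the two hypothetical functions below. `count_tokens()` and `count_tokens_at()` are invented names. The NSE version captures a bare column name and delegates to the `_at` version, which takes the column name as a string.

```r
library(dplyr)

# Hypothetical SE version: the column is given as a character string.
count_tokens_at <- function(data, token) {
  stopifnot(is.character(token), length(token) == 1)
  token_sym <- rlang::sym(token)
  dplyr::count(data, !!token_sym, sort = TRUE)
}

# Hypothetical NSE version: the column is given as a bare name.
count_tokens <- function(data, token) {
  token_chr <- rlang::as_string(rlang::ensym(token))
  count_tokens_at(data, token_chr)
}

words <- data.frame(word = c("tidy", "text", "tidy"), stringsAsFactors = FALSE)
res_nse <- count_tokens(words, word)
res_se <- count_tokens_at(words, "word")
```

Both calls return the same counts. The `_at` form is convenient when the column name is stored in a variable, for example inside a loop or another function.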

Main functions used to return data.frames mostly begin with the verb `compute`.
Visualization functions that call these functions internally begin with `visualize`.
There are various `default_*()` functions (e.g. `default_theme()`, `default_labs()`,
`default_facet()`) that are used to specify default
arguments in the `visualize_*()` functions--these arguments end with the noun `_base`
(e.g. `theme_base`, `labs_base`, `facet_base`, etc.). Accompanying these `_base` arguments
are similarly-named `_params` arguments (e.g. `theme_params`, `labs_params`, `facet_params`, etc.).
The intended use of this framework is that the `_base` arguments, which default to some `default_*()`
function, are not to be altered; instead, this is the purpose of the `_params` arguments.
This design is used as a means of providing "good" defaults while also providing
a suitable means of customization.

To explain this format more clearly, consider an example of proper usage of these arguments.
In the `visualize_cnts()` function, there is a `theme_base` argument set equal
to `default_theme()` by default--this function argument defines
a "baseline" theme (very similar to `ggplot2::theme_minimal()`) that should not
be directly changed. To make modifications to the `default_theme()`, the accompanying
`theme_params` argument should be set equal to a named list corresponding
to the appropriate `ggplot2::theme()` parameters that are to be customized.
For example, one might set `theme_params = list(legend.title = ggplot2::element_text(face = "bold"))`.
(The `default_theme()` function sets `ggplot2::theme(legend.title = ggplot2::element_blank())`.)
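
The `_base`/`_params` pattern described above could look roughly like the following. This is a sketch under assumptions, not the package's code: `default_theme_sketch()` and `visualize_sketch()` are hypothetical stand-ins, and applying the overrides with `do.call(ggplot2::theme, ...)` is just one plausible way to do it.

```r
library(ggplot2)

# Hypothetical stand-in for default_theme(): a minimal theme without a legend title.
default_theme_sketch <- function() {
  ggplot2::theme_minimal() +
    ggplot2::theme(legend.title = ggplot2::element_blank())
}

# Hypothetical visualize_*()-style function with `_base` and `_params` arguments.
visualize_sketch <- function(data, x, y, color,
                             theme_base = default_theme_sketch(),
                             theme_params = list()) {
  viz <- ggplot2::ggplot(
    data,
    ggplot2::aes(x = {{ x }}, y = {{ y }}, color = {{ color }})
  ) +
    ggplot2::geom_point() +
    theme_base
  # User overrides are layered on top of the baseline theme.
  viz + do.call(ggplot2::theme, theme_params)
}

# The baseline is left alone; only `theme_params` is customized.
viz <- visualize_sketch(
  mtcars, wt, mpg, factor(cyl),
  theme_params = list(legend.position = "bottom")
)
```

Keeping the baseline and the overrides in separate arguments means a caller can tweak one theme element without having to re-specify the rest of the defaults.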

(Notably, the only `_base`/`_params`/`default_*()` combination that does not truly follow this design pattern
is that for `scale_manual`. There is a `default_scale_manual()` function for
a `scale_manual_params` argument in the visualization function, but no corresponding
`_base` argument.)

### Functions to add?

+ ~~__model:__~~ `model_lda()`, `visualize_lda_betas()`, `visualize_lda_gammas()`
+ ~~__poisson:__~~
`compute_sentdiff_poisson()`, `prepare_sents_diffs_poisson()`, `visualize_sents_diffs_poisson()`

## Examples

Check out
[this blog post analyzing the text in _R Weekly posts_](https://tonyelhabr.rbind.io/posts/tidy-text-analysis-rweekly/)
to see usage of an earlier version of this package.

(_Warning:_ At the time, only SE functions had been implemented. Also, since then, some
function names and arguments may have been changed.)

Additionally, see
[this blog post analyzing personal Google search history](https://tonyelhabr.rbind.io/posts/tidy-text-analysis-google-search-history/)
to see code that is similar
to the code used to implement this package's functions.

Finally, the test files can provide worthwhile examples of function usage.

```{r echo = FALSE}
# library("tetext")
```