Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/trinker/termco
Regular Expression Counts of Terms and Substrings
classification-model frequent-terms r regex regex-model regular-expression terms-count
- Host: GitHub
- URL: https://github.com/trinker/termco
- Owner: trinker
- License: other
- Created: 2015-07-26T03:17:09.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2022-01-05T19:13:15.000Z (almost 3 years ago)
- Last Synced: 2023-03-24T05:41:15.708Z (over 1 year ago)
- Topics: classification-model, frequent-terms, r, regex, regex-model, regular-expression, terms-count
- Language: R
- Size: 6.07 MB
- Stars: 24
- Watchers: 7
- Forks: 7
- Open Issues: 15
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
README
---
title: "termco"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  md_document:
    toc: true
---

```{r, echo=FALSE, message=FALSE, warning=FALSE}
desc <- suppressWarnings(readLines("DESCRIPTION"))
regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)"
loc <- grep(regex, desc)
ver <- gsub(regex, "\\2", desc[loc])
verbadge <- sprintf('', ver, ver)
verbage <- ''
library(dplyr);library(termco);library(knitr)
```

```{r, echo=FALSE, message=FALSE, warning=FALSE}
",sep="")
knit_hooks$set(htmlcap = function(before, options, envir) {
if(!before) {
paste('
}
})
knitr::opts_knit$set(self.contained = TRUE, cache = FALSE)
knitr::opts_chunk$set(fig.path = "tools/figure/")
```

[![Project Status: Active - The project has reached a stable, usable state and is being actively developed.](http://www.repostatus.org/badges/0.1.0/active.svg)](http://www.repostatus.org/#active)
[![Build Status](https://travis-ci.org/trinker/termco.svg?branch=master)](https://travis-ci.org/trinker/termco)
[![Coverage Status](https://coveralls.io/repos/trinker/termco/badge.svg?branch=master)](https://coveralls.io/r/trinker/termco?branch=master)
[![DOI](https://zenodo.org/badge/5398/trinker/termco.svg)](https://zenodo.org/badge/latestdoi/5398/trinker/termco)

![](tools/termco_logo/r_termco.png)

**termco** is a suite of functions used to count and find terms and substrings in strings. The tools can be used to build an expert rules, regular expression based text classification model. The package wraps the [**data.table**](https://cran.r-project.org/package=data.table) and [**stringi**](https://cran.r-project.org/package=stringi) packages to create fast data frame counts of regular expression terms and substrings.
# Functions
The main function of **termco** is `term_count`. It is used to extract regex term counts by grouping variable(s) as well as to generate classification models.
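
To make the idea of grouped regex counts concrete, here is a minimal base R sketch. It is not how `term_count` is implemented (the package builds on **data.table** and **stringi**); the toy data frame `toy` and the helper `count_matches` are made up purely for illustration.

```r
# Toy data: two speakers, four utterances (hypothetical, for illustration only)
toy <- data.frame(
  person = c("A", "A", "B", "B"),
  text   = c("Oh yeah, because I said so.",
             "Hey, hey you!",
             "Yeah yeah.",
             "Nothing to see here."),
  stringsAsFactors = FALSE
)

# Named list of regexes: each name is a category/tag
term_list <- list(
  back_channels = c("uh[- ]huh", "yeah"),
  summons       = "hey",
  justification = "because"
)

# Hypothetical helper: count all matches of a set of regexes in each string
count_matches <- function(x, patterns) {
  regex <- paste(patterns, collapse = "|")
  hits <- gregexpr(regex, x, ignore.case = TRUE, perl = TRUE)
  # gregexpr() returns -1 when there is no match, so count positive positions
  vapply(hits, function(m) sum(m > 0), integer(1))
}

# One column of counts per category, then aggregate by the grouping variable
counts <- as.data.frame(lapply(term_list, count_matches, x = toy$text))
by_person <- aggregate(counts, by = list(person = toy$person), FUN = sum)
by_person
```

Each list name becomes a count column, which is the same shape of result that a named regex list produces when passed to `term_count` with a grouping variable.
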
Most of the functions *count*, *search*, *plot* terms, and *convert* between output types, while a few remaining functions are used to train, test, and interpret *model*s. Additionally, the `probe_` family of functions generates lists of function calls or plots for given search terms. The table below lists each function with its use category and a description:
| Function | Use Category | Description |
|------------------------------|--------------|---------------------------------------------------|
| `term_count` | count | Count regex term occurrence; modeling |
| `token_count` | count | Count fixed token occurrence; modeling |
| `frequent_terms`/`all_words` | count | Frequent terms |
| `important_terms` | count | Important terms |
| `hierarchical_coverage_term` | count | Unique coverage of a text vector by terms |
| `hierarchical_coverage_regex`| count | Unique coverage of a text vector by regex |
| `frequent_ngrams` | count | Weighted frequent ngram (2 & 3) collocations |
| `word_count` | count | Count words |
| `term_before`/`term_after` | count | Frequency of words before/after a regex term |
| `term_first` | count | Frequency of words at the beginning of strings |
| `colo` | search | Regex output to find term collocations |
| `search_term` | search | Search for regex terms |
| `match_word` | search | Extract words from a text matching a regular expression |
| `search_term_collocations` | search | Wrapper for `search_term` + `frequent_terms` |
| `classification_project` | modeling | Make a classification modeling project template |
| `classification_template` | modeling | Make a classification analysis script template |
| `as_dtm`/`as_tdm` | modeling | Coerce `term_count` object into `tm::DocumentTermMatrix`/`tm::TermDocumentMatrix` |
| `split_data` | modeling | Split data into `train` & `test` sets |
| `evaluate` | modeling | Check accuracy of model against human coder |
| `classify` | modeling | Assign n tags to text from a model |
| `get_text` | modeling | Get the original text for model tags |
| `coverage` | modeling | Coverage for `term_count` or `search_term` object |
| `uncovered`/`get_uncovered` | modeling | Get the uncovered text from a model |
| `mutate_counts` | modeling | Apply normalizing function to term count columns |
| `select_counts` | modeling | Select columns without stripping count classes |
| `tag_co_occurrence` | modeling | Explore co-occurrence of tags from a model |
| `validate_model`/`assign_validation_task` | modeling | Human validation of a `term_count` model |
| `read_term_list` | read/write | Read a term list from an external file |
| `write_term_list` | read/write | Write a term list to an external file |
| `term_list_template` | read/write | Write a term list template to an external file |
| `as_count` | convert | Strip pretty printing from `term_count` object |
| `as_terms` | convert | Convert a count matrix to list of term vectors |
| `as_term_list` | convert | Convert a vector of terms into a named term list |
| `weight` | convert | Weight a `term_count` object proportion/percent |
| `plot_ca` | plot | Plot `term_count` object as 3-D correspondence analysis map |
| `plot_counts` | plot | Horizontal bar plot of group counts |
| `plot_freq` | plot | Vertical bar plot of frequencies of counts |
| `plot_cum_percent` | plot | Plot `frequent_terms` object as cumulative percent |
| `probe_list` | probe | Generate list of `search_term` function calls |
| `probe_colo_list` | probe | Generate list of `search_term_collocations` function calls |
| `probe_colo_plot_list` | probe | Generate list of `search_term_collocations` + `plot` function calls |
| `probe_colo_plot` | probe | Plot `probe_colo_plot_list` directly |

# Installation
To download the development version of **termco**:
Download the [zip ball](https://github.com/trinker/termco/zipball/master) or [tar ball](https://github.com/trinker/termco/tarball/master), decompress and run `R CMD INSTALL` on it, or use the **pacman** package to install the development version:
```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(
"trinker/gofastr",
"trinker/termco"
)
```

# Contact
You are welcome to:
* submit suggestions and bug-reports at:
* send a pull request on:
* compose a friendly e-mail to:

# Examples
The following examples demonstrate some of the functionality of **termco**.
## Load the Tools/Data
```{r, message=FALSE, warning=FALSE}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, ggplot2, termco)

data(presidential_debates_2012)
```

## Build Counts Dataframe
```{r}
discoure_markers <- list(
    response_cries = c("\\boh", "\\bah", "aha", "ouch", "yuk"),
    back_channels = c("uh[- ]huh", "uhuh", "yeah"),
    summons = "hey",
    justification = "because"
)

counts <- presidential_debates_2012 %>%
    with(term_count(dialogue, grouping.var = list(person, time), discoure_markers))

counts
```

## Printing
```{r}
print(counts, pretty = FALSE)
print(counts, zero.replace = "_")
```

## Plotting
```{r}
plot(counts)
plot(counts, labels=TRUE)
plot_ca(counts, FALSE)
```

## Ngram Collocations
**termco** wraps the [**quanteda**](https://github.com/kbenoit/quanteda) package to examine important ngram collocations. **quanteda**'s `collocation` function provides measures of: `"lambda"`, `"z"`, and `"frequency"` to examine the strength of relationship between ngrams. **termco** adds stopword removal, min/max character filtering, and stemming to **quanteda**'s `collocation` as well as a generic `plot` method.
```{r}
x <- presidential_debates_2012[["dialogue"]]

frequent_ngrams(x)
frequent_ngrams(x, gram.length = 3)
frequent_ngrams(x, order.by = "lambda")
```

### Collocation Plotting
```{r, fig.width=10}
plot(frequent_ngrams(x))
plot(frequent_ngrams(x), drop.redundant.yaxis.text = FALSE)
plot(frequent_ngrams(x, gram.length = 3))
plot(frequent_ngrams(x, order.by = "lambda"))
```

## Converting to Document Term Matrix
Regular expression counts can be useful features in machine learning models. The **tm** package's `DocumentTermMatrix` is a popular data structure for machine learning in **R**. The `as_dtm` and `as_tdm` functions are useful for coercing the count `data.table` structure of a `term_count` object into a `DocumentTermMatrix`/`TermDocumentMatrix`. The result can be combined with token/word only `DocumentTermMatrix` structures using `cbind` & `rbind`.
```{r}
as_dtm(markers)

cosine_distance <- function (x, ...) {
    x <- t(slam::as.simple_triplet_matrix(x))
    stats::as.dist(1 - slam::crossprod_simple_triplet_matrix(x)/(sqrt(slam::col_sums(x^2) %*%
        t(slam::col_sums(x^2)))))
}

mod <- hclust(cosine_distance(as_dtm(markers)))
plot(mod)
rect.hclust(mod, k = 5, border = "red")

(clusters <- cutree(mod, 5))
```

# Building an Expert Rules, Regex Classifier Model
Machine learning models of classification are great when you have known tags to train with because the model scales. Qualitative, expert based human coding is terrific for when you have no tagged data. However, when you have a larger, untagged data set the machine learning approaches have no outcome to learn from and the data is too large to classify by hand. One solution for tagging larger, untagged data sets is to use an expert rules, regular expression approach that sits somewhere between machine learning and hand coding. Additionally, when each text element contains larger chunks of text, [unsupervised clustering type algorithms](https://github.com/trinker/clustext) such as k-means, non-negative matrix factorization, hierarchical clustering, or [topic modeling](https://github.com/trinker/topicmodels_learning) may be of use for creating clusters that could be interpreted and treated as categories.
This example section highlights the types of function combinations and order for a typical expert rules classification. This task typically involves the combined use of available literature, close examinations of term usage within text, and researcher experience. Building a classifier model requires the researcher to build a list of regular expressions that map to a category or tag. Below I outline minimal work flow for classification.
Note that the user may want to begin with a classification model template that contains subdirectories and files for a classification project. The `classification_project` function generates this template with a pre-populated *'classification.R'* script that can guide the user through the modeling process. The directory tree looks like the following:
```
template
    |
    |   .Rproj
    |
    +---models
    |       categories.R
    |
    +---data
    +---output
    +---plots
    +---reports
    \---scripts
            01_data_cleaning.R
            02_classification.R
```

## Load the Tools/Data
```{r, message=FALSE}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, ggplot2, termco)

data(presidential_debates_2012)
```

## Splitting Data
Many classification techniques require the data to be split into a training and test set to allow the researcher to observe how a model will perform on a new data set. This also helps guard against over-fitting the data. The `split_data` function allows easy splitting of `data.frame` or `vector` data by integer or proportion. The function returns a named list that splits the data set into a `train` and a `test` set. The printed view is a truncated version of the returned list with `|...` indicating there are additional observations.
```{r}
set.seed(111)
(pres_deb_split <- split_data(presidential_debates_2012, .75))
```

The training set can be accessed via `pres_deb_split$train`; likewise, the test set can be accessed by way of `pres_deb_split$test`.
Here I show splitting by integer.
```{r, eval=FALSE}
split_data(presidential_debates_2012, 100)
```

```{r, echo=FALSE}
set.seed(112)
(pres_deb_split <- split_data(presidential_debates_2012, 100))
```

I could have trained on the training set and tested on the testing set in the following examples around modeling but have chosen not to for simplicity.
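
For readers who want to see the mechanics of a proportion versus integer split, here is a small base R sketch. The helper `split_rows` is hypothetical and is not the package's `split_data`; the rule that values below 1 are read as proportions and larger values as a number of training rows is an assumption made for this illustration. It uses the built-in `mtcars` data (32 rows).

```r
# Hypothetical helper: split a data frame into train/test by proportion or count.
# Assumption: `n` < 1 is a proportion of rows; `n` >= 1 is a number of training rows.
split_rows <- function(df, n = 0.75) {
  n_train <- if (n < 1) round(nrow(df) * n) else min(n, nrow(df))
  idx <- sample.int(nrow(df), n_train)
  list(
    train = df[idx, , drop = FALSE],
    test  = df[setdiff(seq_len(nrow(df)), idx), , drop = FALSE]
  )
}

set.seed(111)
parts <- split_rows(mtcars, 0.75)    # 24 training rows, 8 test rows
sapply(parts, nrow)

parts_int <- split_rows(mtcars, 10)  # 10 training rows, 22 test rows
sapply(parts_int, nrow)
```

Sampling row indices once and taking the complement keeps the two sets disjoint, which is the property that matters when the test set is used to check a model.
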
## Understanding Term Use
In order to build the named list of regular expressions that map to a category/tag the researcher must understand the terms (particularly information salient terms) in context. The understanding of term use helps the researcher to begin to build a mental model of the topics being used in a fashion similar to qualitative coding techniques. Broad categories will begin to coalesce as word use is elucidated. It forms the initial names of the "named list of regular expressions". Of course building the regular expressions in the regex model building step will allow the researcher to see new ways in which terms are used as well as new important terms. This in turn will reshape, remove, and add names to the "named list of regular expressions". This recursive process is captured in the model below.
### View Most Used Words
A common task in building a model is to understand the most frequent words while excluding less information rich function words. The `frequent_terms` function produces an ordered data frame of counts. The researcher can exclude stop words and limit the terms to contain n characters between set thresholds. The output is ordered by most to least frequent n terms but can be rearranged alphabetically.
```{r}
presidential_debates_2012 %>%
with(frequent_terms(dialogue))
presidential_debates_2012 %>%
with(frequent_terms(dialogue, 40)) %>%
plot()
```

A cumulative percent can give a different view of the term usage. The `plot_cum_percent` function converts a `frequent_terms` output into a cumulative percent plot. Additionally, `frequent_ngrams` + `plot` can give insight into the frequently occurring ngrams.
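
As a rough picture of what a frequency table and its cumulative percent contain, the following base R sketch counts words in a toy text vector after dropping a few stop words. The text, the stop word list, and the tokenising rule are all made up for illustration; `frequent_terms` has its own defaults and options that are not reproduced here.

```r
# Toy text vector (hypothetical)
txt <- c("The economy needs jobs.", "Jobs and taxes matter.",
         "We will cut taxes and create jobs.")

# Small, made-up stop word list for the sketch
stops <- c("the", "and", "we", "will")

# Tokenise: lower-case, split on anything that is not a letter or apostrophe
words <- unlist(strsplit(tolower(txt), "[^a-z']+"))
words <- words[nzchar(words) & !words %in% stops]

# Frequency table ordered from most to least frequent
freq <- sort(table(words), decreasing = TRUE)
terms <- data.frame(term = names(freq), frequency = as.integer(freq),
                    stringsAsFactors = FALSE)

# Cumulative percent of all retained tokens accounted for by the top terms
terms$cum_percent <- 100 * cumsum(terms$frequency) / sum(terms$frequency)
terms
```

The cumulative percent column shows how quickly a handful of terms accounts for most of the retained tokens, which is the view a cumulative percent plot gives.
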
```{r, fig.width=8, fig.height=6}
presidential_debates_2012 %>%
with(frequent_terms(dialogue, 40)) %>%
plot_cum_percent()
```

It may also be helpful to view the unique contribution of each term to the coverage, excluding all elements from the match vector that were previously matched by another term. The `hierarchical_coverage_term` function and its accompanying `plot` method allow for hierarchical exploration of the unique coverage of terms.
```{r, fig.width=8, fig.height=6}
terms <- presidential_debates_2012 %>%
    with(frequent_terms(dialogue, 30)) %>%
    `[[`("term")

presidential_debates_2012 %>%
    with(hierarchical_coverage_term(dialogue, terms))

presidential_debates_2012 %>%
    with(hierarchical_coverage_term(dialogue, terms)) %>%
    plot(use.terms = TRUE)
```

### View Most Used Words in Context
Much of the exploration of terms in context, in an effort to build the named list of regular expressions that map to a category/tag, involves recursive views of frequent terms in context. The `probe` family of functions can generate lists of function calls (and copy them to the clipboard for easy transfer), allowing the user to circulate through term lists generated from other **termco** tools such as `frequent_terms`. This is meant to standardize and speed up the process.
The first `probe_` tool makes a list of function calls for `search_term` using a term list. Here I show just 10 terms from `frequent_terms`. This can be pasted into a script and then run line by line to explore the frequent terms in context.
```{r}
presidential_debates_2012 %>%
with(frequent_terms(dialogue, 10)) %>%
select(term) %>%
unlist() %>%
probe_list("presidential_debates_2012$dialogue")
```

The next `probe_` function generates a list of `search_term_collocations` function calls (`search_term_collocations` wraps `search_term` with `frequent_terms` and eliminates the search term from the output). This allows the user to systematically explore the words that frequently collocate with the original terms.
```{r}
presidential_debates_2012 %>%
with(frequent_terms(dialogue, 5)) %>%
select(term) %>%
unlist() %>%
probe_colo_list("presidential_debates_2012$dialogue")
```

As `search_term_collocations` has a `plot` method the user may wish to generate function calls similar to `probe_colo_list` but wrapped with `plot` for a visual exploration of the data. The `probe_colo_plot_list` makes a list of such function calls, whereas the `probe_colo_plot` plots the output directly to a single external .pdf file.
```{r}
presidential_debates_2012 %>%
with(frequent_terms(dialogue, 5)) %>%
select(term) %>%
unlist() %>%
probe_colo_plot_list("presidential_debates_2012$dialogue")
```

The plots can be generated externally with the `probe_colo_plot` function, which makes a multi-page .pdf of frequent terms bar plots, one plot for each term.
```{r, eval=FALSE}
presidential_debates_2012 %>%
with(frequent_terms(dialogue, 5)) %>%
select(term) %>%
unlist() %>%
probe_colo_plot("presidential_debates_2012$dialogue")
```

### View Important Words
It may also be useful to view the top [min-max](http://stats.stackexchange.com/a/70807/7482) scaled tf-idf weighted terms to allow the more information rich terms to bubble to the top. The `important_terms` function allows the user to do exactly this. The function works similarly to `term_count` but with an information weight.
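
The arithmetic behind a min-max scaled tf-idf score can be sketched in a few lines of base R. This is one possible variant, not necessarily what `important_terms` computes: the idf formula (`log(N / df)`) and the choice to sum tf-idf over documents before scaling are assumptions made for the illustration, and the documents are invented.

```r
# Toy documents (hypothetical)
docs <- c("jobs jobs taxes", "taxes economy", "jobs economy economy growth")

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Document-term count matrix: rows are documents, columns are terms
tf <- t(vapply(tokens,
               function(tk) as.numeric(table(factor(tk, levels = vocab))),
               numeric(length(vocab))))
colnames(tf) <- vocab

# Inverse document frequency: log(number of docs / docs containing the term)
idf <- log(nrow(tf) / colSums(tf > 0))

# Sum tf-idf over documents (an assumed aggregation), then min-max scale to [0, 1]
tfidf  <- colSums(sweep(tf, 2, idf, `*`))
scaled <- (tfidf - min(tfidf)) / (max(tfidf) - min(tfidf))
sort(scaled, decreasing = TRUE)
```

Min-max scaling only changes the range of the scores, so the ordering of terms is the same as for the unscaled tf-idf values.
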
```{r}
presidential_debates_2012 %>%
with(important_terms(dialogue, 10))
```

## Building the Model
To build a model the researcher creates a named list of regular expressions that map to a category/tag. This is fed to the `term_count` function. `term_count` allows for aggregation by grouping variables but for building the model we usually want to get observation level counts. Set `grouping.var = TRUE` to generate an `id` column of 1 through the number of observations, which gives the researcher the observation level counts.
```{r}
discoure_markers <- list(
    response_cries = c("\\boh", "\\bah", "aha", "ouch", "yuk"),
    back_channels = c("uh[- ]huh", "uhuh", "yeah"),
    summons = "hey",
    justification = "because"
)

model <- presidential_debates_2012 %>%
    with(term_count(dialogue, grouping.var = TRUE, discoure_markers))

model
```

## Testing the Model
In building a classifier the researcher is typically concerned with coverage, discrimination, and accuracy. The first two are easier to obtain while accuracy is not possible to compute without a comparison sample of expertly tagged data.
We want our model to be assigning tags to as many of the text elements as possible. The `coverage` function can provide an understanding of what percent of the data is tagged. Our model has relatively low coverage, indicating the regular expression model needs to be improved.
```{r}
model %>%
coverage()
```

Understanding how well our model discriminates is important as well. We want the model to cover as close to 100% of the data as possible, but likely want fewer tags assigned to each element. If the model is assigning many tags to each element it is not able to discriminate well. The `as_terms` + `plot_freq` combination provides a visual representation of the model's ability to discriminate. The output is a bar plot showing the distribution of the number of tags at the element level. The goal is to have a larger density at $1$ tag. Note that the plot also gives a view of coverage, as the zero bar shows the frequency of elements that could not be tagged. Our model has a larger distribution of $1$ tag compared to the $> 1$ tag distributions, though the coverage is very poor. As the number of tags increases the ability of the model to discriminate typically lessens. There is often a trade off between model coverage and discrimination.
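
Both quantities discussed so far, coverage and the number of tags per element, can be read off an observation-level count matrix. The sketch below uses a made-up matrix in base R to show the arithmetic; it is not the package's `coverage` or `plot_freq` code.

```r
# Toy observation-level count matrix: rows are text elements, columns are tags
# (made-up numbers, for illustration only)
m <- matrix(c(1, 0, 0,
              0, 0, 0,
              2, 1, 0,
              0, 0, 0,
              0, 0, 1),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("response_cries", "summons", "justification")))

# Number of distinct tags applied to each element
tags_per_element <- rowSums(m > 0)

# Coverage: proportion of elements that received at least one tag
coverage_prop <- mean(tags_per_element > 0)
coverage_prop

# Discrimination: distribution of the number of tags per element
tag_dist <- table(factor(tags_per_element, levels = 0:ncol(m)))
tag_dist
```

In this toy case three of five elements are tagged, and the zero bar of the distribution is exactly the uncovered share, which is why the tags-per-element plot doubles as a view of coverage.
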
```{r}
model %>%
as_terms() %>%
plot_freq(size=3) + xlab("Number of Tags")
```

We may also want to see the distribution of the tags. The combination of `as_terms` + `plot_counts` gives this distribution. In our model the majority of tags are applied to the **summons** category.
```{r}
model %>%
as_terms() %>%
plot_counts() + xlab("Tags")
```

## Improving the Model
### Improving Coverage
The model does not have very good coverage. To improve this the researcher will want to look at the data with no coverage to try to build additional regular expressions and categories. This requires understanding language, noticing additional features of the data with no coverage that may map to categories, and building regular expressions to model these features. This section will outline some of the tools that can be used to detect features and build regular expressions to model these language features.
We first want to view the untagged data. The `uncovered` function provides a logical vector that can be used to extract the text with no tags, while `get_uncovered` returns the untagged text directly.
```{r}
untagged <- get_uncovered(model)

head(untagged)
```

The `frequent_terms` function can be used again to understand common features of the untagged data.
```{r}
untagged %>%
frequent_terms()
```

We may see a common term such as the word *right* and want to see what other terms collocate with it. Using a regular expression that searches for multiple terms can improve a model's accuracy and ability to discriminate. Using `search_term` in combination with `frequent_terms` can be a powerful way to see which words tend to collocate. Here I pass a regex for *right* (`\\bright`) to `search_term`. This pulls up the text that contains this term. I then use `frequent_terms` to see what words frequently occur with the word *right*. We notice the word *people* tends to occur with *right*.
```{r}
untagged %>%
search_term("\\bright") %>%
frequent_terms(10, stopwords = "right")
```

The `search_term_collocations` function provides a convenient wrapper for `search_term` + `frequent_terms` which also removes the search term from the output.
```{r}
untagged %>%
search_term_collocations("\\bright", n=10)
```

This is an exploratory act. Finding the right combination of features that occur together requires lots of recursive noticing, trialling, testing, reading, interpreting, and deciding. After we noticed that the terms *people* and *course* appear with the term *right* above we will want to see these text elements. We can use a grouped-or expression with `colo` to build a regular expression that will search for any text elements that contain *right* together with either of these two terms anywhere. `colo` is more powerful than initially shown here; I demonstrate further functionality below. Here is the regex produced.
```{r}
colo("\\bright", "(people|course)")
```

This is extremely powerful when used inside of `search_term` as the text containing this regular expression will be returned along with the coverage proportion on the uncovered data.
```{r}
search_term(untagged, colo("\\bright", "(people|course)"))
```

We notice right away that the phrase *right course* appears often. We can create a search with just this expression.
***Note*** *that the decision to include a regular expression in the model is up to the researcher. We must guard against over-fitting the model, making it not transferable to new, similar contexts.*
```{r}
search_term(untagged, "right course")
```

Based on the `frequent_terms` output above, the word *jobs* also seems important. Again, we use the `search_term` + `frequent_terms` combo (via the `search_term_collocations` wrapper) to extract words collocating with *jobs*.
```{r}
search_term_collocations(untagged, "jobs", n=15)
```

As stated above, `colo` is a powerful search tool as it can take multiple regular expressions as well as allowing for multiple negations (i.e., find x but not if y). To include multiple negations use a grouped-or regex as shown below.
```{r}
## Where do `jobs` and `create` collocate?
search_term(untagged, colo("jobs", "create"))
## Where do `jobs`, `create`, and the word `not` collocate?
search_term(untagged, colo("jobs", "create", "(not|n't)"))
## Where do `jobs` and `create` collocate without a `not` word?
search_term(untagged, colo("jobs", "create", not = "(not|n't)"))
## Where do `jobs`, `romney`, and `create` collocate?
search_term(untagged, colo("jobs", "create", "romney"))
```

Here is one more example with `colo` for the words *jobs* and *overseas*. The user may want to quickly test and then transfer the regex created by `colo` to the regular expression list. By setting `options(termco.copy2clip = TRUE)` the user globally sets `colo` to use the **clipr** package to copy the regex to the clipboard for better work flow.
```{r}
search_term(untagged, colo("jobs", "overseas"))
```

The researcher uses an iterative process to continue to build the regular expression list. The `term_count` function builds the matrix of counts to further test the model. The use of (a) `coverage`, (b) `as_terms` + `plot_counts`, and (c) `as_terms` + `plot_freq` will allow for continued testing of model functioning.
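
The idea of searching for several terms anywhere in a string, optionally excluding others, can be approximated in base R with lookaheads. The helper `colo_sketch` below is hypothetical and only mimics the idea behind `colo`; the regex it builds is not claimed to match the package's output, and the example sentences are invented.

```r
# Hypothetical helper: build a regex that matches strings containing every
# pattern in `...` (in any order) and none of the patterns in `not`.
# This mimics the idea behind `colo`; it is not the package's code.
colo_sketch <- function(..., not = NULL) {
  must <- paste0("(?=.*(", c(...), "))", collapse = "")
  must_not <- if (is.null(not)) "" else paste0("(?!.*(", not, "))", collapse = "")
  paste0("^", must, must_not)
}

txt <- c("We will create jobs here.",
         "He did not create jobs.",
         "Jobs went overseas.")

# `jobs` and `create` together, but without a `not` word
rx <- colo_sketch("jobs", "create", not = "(not|n't)")
rx
hits_create <- grepl(rx, txt, perl = TRUE, ignore.case = TRUE)
hits_create

# `jobs` and `overseas` together
hits_overseas <- grepl(colo_sketch("jobs", "overseas"), txt,
                       perl = TRUE, ignore.case = TRUE)
hits_overseas
```

Each positive lookahead asserts that a term occurs somewhere in the string without consuming characters, so the order of the terms does not matter; the negative lookahead rejects strings that contain an excluded term. Lookaheads require `perl = TRUE` in base R.
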
### Improving Discrimination
It is often desirable to improve discrimination. While the bar plot highlighting the distribution of the number of tags is useful, it only indicates if there is a problem, not where the problem lies. The `tag_co_occurrence` function produces a list of `data.frame`s and matrices that aid in understanding how to improve discrimination. This list is useful, but the `plot` method provides an improved visual view of the co-occurrences of tags.
The network plot on the left shows the strength of relationships between tags, while the plot on the right shows the average number of other tags that co-occur with each regex tag. In this particular case the plot combo is not complex because of the limited number of regex tags. Note that the edge strength is relative to all other edges. The strength has to be considered in the context of the average number of other tags that co-occur with each regex tag bar/dot plot on the right. As the number of tags increases the plot increases in complexity. The unconnected nodes and shorter bars represent the tags that provide the best discriminatory power, whereas the other tags have the potential to be redundant.
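
To give a feel for the numbers behind such a plot, the base R sketch below derives a tag-by-tag co-occurrence matrix and the average number of other tags per tag from a made-up count matrix. How `tag_co_occurrence` computes its own summaries is not assumed here; this is only one straightforward way to get comparable quantities.

```r
# Toy observation-level counts: rows are text elements, columns are tags (made up)
m <- matrix(c(1, 1, 0,
              0, 1, 0,
              2, 1, 0,
              0, 0, 1),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("tag_a", "tag_b", "tag_c")))

# Binary incidence matrix: was the tag applied to the element at all?
inc <- (m > 0) * 1

# Tag-by-tag co-occurrence counts; the diagonal holds each tag's own frequency
co <- crossprod(inc)
co

# Average number of *other* tags applied alongside each tag
others <- rowSums(inc) - 1                    # other tags on an element, given one tag
avg_other <- colSums(inc * others) / colSums(inc)
avg_other
```

Here `tag_c` never co-occurs with another tag, so it is an unconnected node with an average of zero other tags, the pattern associated above with the best discriminatory power.
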
```{r, impr_disc, fig.width=8}
tag_co_occurrence(model) %>%
plot(min.edge.cutoff = .01)
```

Another way to view the overlapping complexity and relationships between tags is to use an [Upset plot](http://caleydo.org/tools/upset/). The `plot_upset` function wraps `UpSetR::upset` and is made to handle `term_count` objects directly. Upset plots are complex and require study of the method in order to interpret the results (http://caleydo.org/tools/upset). The time invested in learning this plot type can be very fruitful in utilizing a technique that scales to the types of data sets that **termco** outputs. This tool can be useful in order to understand overlap and thus improve discrimination.
```{r, impr_disc2, fig.width=8}
plot_upset(model)
```

## Categorizing/Tagging
The `classify` function enables the researcher to apply $n$ tags to each text element. Depending on the text and the regular expression list's ability, multiple tags may be applied to a text. The `n` argument allows the maximum number of tags to be set though the function does not guarantee this many (or any) tags will be assigned.
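
The "up to `n` tags per element" behaviour can be sketched in base R as follows. The helper `top_tags` is hypothetical; in particular, ranking tags by their counts and breaking ties by column order are assumptions of this sketch, not statements about how `classify` works internally.

```r
# Toy observation-level counts (made up): rows are text elements, columns are tags
m <- matrix(c(3, 1, 0,
              0, 0, 0,
              0, 2, 2),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("tag_a", "tag_b", "tag_c")))

# Hypothetical helper: keep up to `n` tags per row, highest counts first.
# Ties keep column order here; how `classify` breaks ties is not assumed.
top_tags <- function(counts, n = 1) {
  lapply(seq_len(nrow(counts)), function(i) {
    row <- counts[i, ]
    row <- row[row > 0]                       # only tags that actually matched
    ranked <- names(row)[order(-row)]         # order() keeps ties in original order
    ranked[seq_len(min(n, length(row)))]
  })
}

one_tag  <- top_tags(m, n = 1)
all_tags <- top_tags(m, n = Inf)
one_tag
all_tags
```

The second element has no matches, so it receives no tag regardless of `n`, which mirrors the point that a maximum can be set but no minimum is guaranteed.
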
Here I show the `head` of the returned vector (if `n` > 1 a `list` may be returned) as well as a `table` and plot of the counts. Use `n = Inf` to return all tags.
```{r}
classify(model) %>%
head()
classify(model) %>%
unlist() %>%
table()
classify(model) %>%
unlist() %>%
plot_counts() + xlab("Tags")
```

## Evaluation: Accuracy
### Pre Coded Data
The `evaluate` function is a more formal method of evaluation than `validate_model`. The `evaluate` function tests a model's accuracy, precision, and recall using macro and micro averages of the confusion matrices for each tag as outlined by [Dan Jurafsky & Chris Manning](https://www.youtube.com/watch?v=OwwdYHWRB5E&index=31&list=PL6397E4B26D00A269). The function requires a known, human coded sample. In the example below I randomly generate a "known human coded tagged" vector. Obviously, this is for demonstration purposes. The function outputs a pretty printing of a list. Note that if a larger, known tagging set of data is available the user may want to strongly consider machine learning models (see: [**RTextTools**](https://cran.r-project.org/package=RTextTools)).
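
As background to the macro and micro averages mentioned above (and separate from the package examples that follow), this base R sketch computes both for a tiny, invented single-label data set. It is a simplification: `evaluate` also handles multiple tags per element, which is not attempted here.

```r
# Toy single-label data (made up): human codes vs. model tags
known  <- c("a", "a", "b", "b", "b", "c")
tagged <- c("a", "b", "b", "b", "c", "c")

tags <- sort(unique(c(known, tagged)))

# Per-tag confusion counts: true positives, false positives, false negatives
conf <- t(vapply(tags, function(tg) {
  c(tp = sum(tagged == tg & known == tg),
    fp = sum(tagged == tg & known != tg),
    fn = sum(tagged != tg & known == tg))
}, numeric(3)))

# Macro average: compute precision/recall per tag, then average the scores
macro_precision <- mean(conf[, "tp"] / (conf[, "tp"] + conf[, "fp"]))
macro_recall    <- mean(conf[, "tp"] / (conf[, "tp"] + conf[, "fn"]))

# Micro average: pool the counts over tags, then compute one score
micro_precision <- sum(conf[, "tp"]) / sum(conf[, "tp"] + conf[, "fp"])
micro_recall    <- sum(conf[, "tp"]) / sum(conf[, "tp"] + conf[, "fn"])

round(c(macro_precision = macro_precision, macro_recall = macro_recall,
        micro_precision = micro_precision, micro_recall = micro_recall), 3)
```

Macro averaging weights every tag equally, so rare tags influence the score as much as common ones; micro averaging weights every tagged element equally, so frequent tags dominate.
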
This minimal example will provide insight into the way the evaluate scores behave:
```{r}
known <- list(1:3, 3, NA, 4:5, 2:4, 5, integer(0))
tagged <- list(1:3, 3, 4, 5:4, c(2, 4:3), 5, integer(0))
evaluate(tagged, known)
```

Below we create fake "known" tags to test `evaluate` with real data (though the comparison is fabricated).
```{r}
mod1 <- presidential_debates_2012 %>%
    with(term_count(dialogue, TRUE, discoure_markers)) %>%
    classify()

fake_known <- mod1
set.seed(1)
fake_known[sample(1:length(fake_known), 300)] <- "random noise"

evaluate(mod1, fake_known)
```

### Post Coding Data
It is often useful to validate a model less formally via human evaluation, checking that text is being tagged as expected. This approach is more formative and less rigorous than `evaluate`, and is intended to assess model functioning in order to improve it. The `validate_model` function provides an interactive interface for a single evaluator to sample n tags and corresponding texts and assess the accuracy of the tag to the text. The `assign_validation_task` function generates external file(s) for n coders for redundancy of code assessments. This may be of use in [Mechanical Turk](https://www.mturk.com/mturk/welcome) type applications. The example below demonstrates `validate_model`'s `print`/`summary` and `plot` outputs.
```{r, eval = FALSE}
validated <- model %>%
validate_model()
```

```{r, results='hide', echo=FALSE}
data(validated)
```

After `validate_model` has been run, the `print`/`summary` and `plot` methods provide the accuracy of each tag and a confidence level (note that the confidence band is highly affected by the number of samples per tag).
```{r}
validated
plot(validated)
```

These examples give guidance on how to use the tools in the **termco** package to build an expert rules, regular expression text classification model.