---
title: "textclean"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
md_document:
toc: true
---

```{r, echo=FALSE, warning=FALSE, error=FALSE}
desc <- suppressWarnings(readLines("DESCRIPTION"))
regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)"
loc <- grep(regex, desc)
ver <- gsub(regex, "\\2", desc[loc])
pacman::p_load_current_gh('trinker/numform')
verbadge <- sprintf('Version: %s', ver)
verbadge <- ""
```

[![Project Status: Active - The project has reached a stable, usable
state and is being actively
developed.](https://www.repostatus.org/badges/0.1.0/active.svg)](http://www.repostatus.org/#active)
[![Build Status](https://travis-ci.org/trinker/textclean.svg?branch=master)](https://travis-ci.org/trinker/textclean)
[![Coverage Status](https://coveralls.io/repos/trinker/textclean/badge.svg?branch=master)](https://coveralls.io/github/trinker/textclean)
[![](https://cranlogs.r-pkg.org/badges/textclean)](https://cran.r-project.org/package=textclean)
`r verbadge`

![](tools/textclean_logo/r_textclean.png)

**textclean** is a collection of tools to clean and normalize text. Many of these tools have been taken from the **qdap** package and revamped to be more intuitive, better named, and faster. The tools are geared toward checking for substrings that are not optimal for analysis and replacing or removing them (normalizing) with more analysis-friendly substrings (see Sproat, Black, Chen, Kumar, Ostendorf, & Richards, 2001, [doi:10.1006/csla.2001.0169](https://www.sciencedirect.com/science/article/pii/S088523080190169X)) or extracting them into new variables. For example, emoticons are often used in text but not always easily handled by analysis algorithms. The `replace_emoticon()` function replaces emoticons with word equivalents.
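
For a quick illustration (a minimal sketch; the exact output depends on the package version):

```r
## Swap an emoticon for its word equivalent
replace_emoticon("This film was great :-)")
## roughly: "This film was great smiley "
```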

Other R packages provide some of the same functionality (e.g., **english**, **gsubfn**, **mgsub**, **stringi**, **stringr**, **qdapRegex**). **textclean** differs from these packages in that it is designed to handle all of the common cleaning and normalization tasks with a single, consistent, pre-configured toolset (note that **textclean** uses many of these terrific packages as a backend). This means that the researcher spends less time on munging, leading to quicker analysis. This package is meant to be used jointly with the [**textshape**](https://github.com/trinker/textshape) package, which provides text extraction and reshaping functionality. **textclean** works well with the [**qdapRegex**](https://github.com/trinker/qdapRegex) package which provides tooling for substring substitution and extraction of pre-canned regular expressions. In addition, the functions of **textclean** are designed to work within the piping of the tidyverse framework by consistently using the first argument of functions as the data source. The **textclean** subbing and replacement tools are particularly effective within a `dplyr::mutate` statement.
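
For instance, here is a minimal sketch (the data frame and column name are made up for illustration) of chaining **textclean** replacements inside a `dplyr::mutate` call:

```r
library(dplyr)
library(textclean)

## Hypothetical messages; because the first argument is the data,
## the replacement functions drop straight into a pipeline.
msgs <- data.frame(text = c("ROFL that's GR8", "I'm :-) about it"))

msgs %>%
    mutate(
        text = text %>%
            replace_internet_slang() %>%
            replace_emoticon() %>%
            replace_contraction()
    )
```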

# Functions

The main functions, task categories, & descriptions are summarized in the table below:

| Function | Task | Description |
|---------------------------|-------------|---------------------------------------|
| `mgsub` | subbing | Multiple `gsub` |
| `fgsub` | subbing | Functional matching replacement `gsub` |
| `sub_holder` | subbing | Hold a value prior to a `strip` |
| `swap` | subbing | Simultaneously swap patterns 1 & 2 |
| `strip` | deletion | Remove all non-word characters |
| `drop_empty_row` | filter rows | Remove empty rows |
| `drop_row`/`keep_row` | filter rows | Filter rows matching a regex |
| `drop_NA` | filter rows | Remove `NA` text rows |
| `drop_element`/`keep_element` | filter elements | Filter matching elements from a vector |
| `match_tokens` | filter elements | Filter out tokens from strings that match a regex criteria |
| `replace_contractions` | replacement | Replace contractions with both words |
| `replace_date` | replacement | Replace dates |
| `replace_email` | replacement | Replace emails |
| `replace_emoji` | replacement | Replace emojis with word equivalent or unique identifier |
| `replace_emoticon` | replacement | Replace emoticons with word equivalent |
| `replace_grade` | replacement | Replace grades (e.g., "A+") with word equivalent |
| `replace_hash` | replacement | Replace Twitter style hash tags (e.g., #rstats) |
| `replace_html` | replacement | Replace HTML tags and symbols |
| `replace_incomplete` | replacement | Replace incomplete sentence end-marks |
| `replace_internet_slang` | replacement | Replace Internet slang with word equivalents |
| `replace_kern` | replacement | Remove the spaces in all-caps words of 3+ letters that contain spaces between the letters |
| `replace_misspelling` | replacement | Replace misspelled words with their most likely replacement |
| `replace_money` | replacement | Replace money in the form of `$\\d+.?\\d{0,2}` |
| `replace_names` | replacement | Replace common first/last names |
| `replace_non_ascii` | replacement | Replace non-ASCII with equivalent or remove |
| `replace_number` | replacement | Replace common numbers |
| `replace_ordinal` | replacement | Replace common ordinal number forms |
| `replace_rating` | replacement | Replace ratings (e.g., "10 out of 10", "3 stars") with word equivalent |
| `replace_symbol` | replacement | Replace common symbols |
| `replace_tag` | replacement | Replace Twitter style handle tag (e.g., @trinker) |
| `replace_time` | replacement | Replace time stamps |
| `replace_to`/`replace_from` | replacement | Remove text from the beginning/end of a string up to/from a character(s) |
| `replace_tokens` | replacement | Remove or replace a vector of tokens with a single value |
| `replace_url` | replacement | Replace URLs |
| `replace_white` | replacement | Replace regex white space characters |
| `replace_word_elongation` | replacement | Replace word elongations with shortened form |
| `add_comma_space` | replacement | Replace non-space after comma |
| `add_missing_endmark` | replacement | Replace missing endmarks with desired symbol |
| `make_plural` | replacement | Add plural endings to singular noun forms |
| `check_text` | check | Text report of potential issues |
| `has_endmark` | check | Check if an element has an end-mark |

# Installation

To download the development version of **textclean**:

Download the [zip ball](https://github.com/trinker/textclean/zipball/master) or [tar ball](https://github.com/trinker/textclean/tarball/master), decompress and run `R CMD INSTALL` on it, or use the **pacman** package to install the development version:

```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(
"trinker/lexicon",
"trinker/textclean"
)
```
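
To install the stable release from CRAN instead:

```r
install.packages("textclean")
```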

# Contact

You are welcome to:
- submit suggestions and bug reports at: <https://github.com/trinker/textclean/issues>

# Contributing

Contributions are welcome from anyone subject to the following rules:

- Abide by the [code of conduct](CONDUCT.md).
- Follow the style conventions of the package (indentation, function & argument naming, commenting, etc.).
- All contributions must be consistent with the package license (GPL-2).
- Submit contributions as a pull request. Clearly state what the changes are and try to keep the number of changes per pull request as low as possible.
- If you make big changes, add your name to the 'Author' field.

# Demonstration

## Load the Packages/Data

```{r, message=FALSE}
library(dplyr)
library(textshape)
library(lexicon)
library(textclean)
```

## Check Text

One of the most useful tools in **textclean** is `check_text`, which scans text variables and reports potential problems. Not all potential problems are definite problems for analysis, but the report provides an overview of what may need further preparation. The report also suggests functions to address the reported problems and provides information on the following:

```{r, results = 'asis', comment = NA, echo = FALSE}
cat(sapply(textclean:::.checks, function(x){
    frame <- ifelse(x$is_meta, "1. **%s** - Text variable that %s", "1. **%s** - Text elements that %s")
    sprintf(frame, x$fun, x$problem)
}), sep = '\n')
```

Note that `check_text` is running multiple checks and may be slower on larger texts. The user may provide a sample of text for review or use the `checks` argument to specify the exact checks to conduct and thus limit the compute time.
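
For instance, a minimal sketch of limiting the checks (the check names are pulled from the same internal object the chunk above iterates over; treat the assumption that `checks` accepts a character vector of these names, and the particular subset chosen, as illustrative only):

```r
## List the available checks, then run only a few of them.
## Assumption: `checks` accepts a character vector of these names.
available_checks <- unname(sapply(textclean:::.checks, function(x) x$fun))
available_checks

check_text(c("i like", "iz <3 u"), checks = available_checks[1:3])
```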

Here is a fuller example:

```{r}
x <- c("i like", "

i want.

. thet them ther .", "I am ! that|", "", NA,
""they" they,were there", ".", " ", "?", "3;", "I like goud eggs!",
"bi\xdfchen Z\xfcrcher", "i 4like...", "\\tgreat", "She said \"yes\"")
Encoding(x) <- "latin1"
x <- as.factor(x)
check_text(x)
```

And if all is well the user should be greeted by a cow:

```{r}
y <- c("A valid sentence.", "yet another!")
check_text(y)
```

## Row Filtering

It is useful to drop/remove empty rows or unwanted rows (for example, the researcher dialogue from a transcript). The `drop_empty_row` & `drop_row` functions do just this. First I'll demo the removal of empty rows.

```{r}
## create a data set with empty rows
(dat <- rbind.data.frame(DATA[, c(1, 4)], matrix(rep(" ", 4),
    ncol = 2, dimnames = list(12:13, colnames(DATA)[c(1, 4)]))))

drop_empty_row(dat)
```

Next we drop rows. The `drop_row` function takes a data set, a column (named or numeric position), and regex terms to search for. The `terms` argument takes regex(es), allowing for partial matching. `terms` is case sensitive, but this can be changed via the `ignore.case` argument.

```{r}
drop_row(dataframe = DATA, column = "person", terms = c("sam", "greg"))
drop_row(DATA, 1, c("sam", "greg"))
keep_row(DATA, 1, c("sam", "greg"))
drop_row(DATA, "state", c("Comp"))
drop_row(DATA, "state", c("I "))
drop_row(DATA, "state", c("you"), ignore.case = TRUE)
```

## Stripping

Often it is useful to remove all non-relevant symbols and case from a text (letters, spaces, and apostrophes are retained). The `strip` function accomplishes this. The `char.keep` argument allows the user to retain additional characters.

```{r}
strip(DATA$state)
strip(DATA$state, apostrophe.remove = TRUE)
strip(DATA$state, char.keep = c("?", "."))
```

## Subbing

### Multiple Subs

`gsub` is a great tool but often the user wants to replace a vector of elements with another vector. `mgsub` allows for a vector of patterns and replacements. Note that the first argument of `mgsub` is the data, not the `pattern` as is standard with base R's `gsub`. This allows `mgsub` to be used in a **magrittr** pipeline more easily. Also note that by default `fixed = TRUE`. This means the search `pattern` is not a regex per se. This makes the replacement much faster when a regex search is not needed. `mgsub` also reorders the patterns to ensure patterns contained within patterns don't overwrite the longer pattern. For example, if the pattern `c('i', 'it')` is given, the longer `'it'` is replaced first (though `order.pattern = FALSE` can be used to negate this feature).

```{r}
mgsub(DATA$state, c("it's", "I'm"), c("<>", "<>"))
mgsub(DATA$state, "[[:punct:]]", "<>", fixed = FALSE)
mgsub(DATA$state, c("i", "it"), c("<>", "[[IT]]"))
mgsub(DATA$state, c("i", "it"), c("<>", "[[IT]]"), order.pattern = FALSE)
```

#### Safe Substitutions

The default behavior of `mgsub` is optimized for speed. This means that it is very fast at multiple substitutions and in most cases works efficiently. However, it is not what Mark Ewing describes as "safe" substitution. In his vignette for the [**mgsub**](https://github.com/bmewing/mgsub) package, Mark defines "safe" as:

> 1. Longer matches are preferred over shorter matches for substitution first
> 2. No placeholders are used so accidental string collisions don't occur

Because safety is sometimes required, `textclean::mgsub` provides a `safe` argument using the **mgsub** package as the backend. In addition to the `safe` argument the `mgsub_regex_safe` function is available to make the usage more explicit. The safe mode comes at the cost of speed.

```{r}
x <- "Dopazamine is a fake chemical"
pattern <- c("dopazamin", "do.*ne")
replacement <- c("freakout", "metazamine")
```

```{r}
## Unsafe
mgsub(x, pattern, replacement, ignore.case=TRUE, fixed = FALSE)
## Safe
mgsub(x, pattern, replacement, ignore.case=TRUE, fixed = FALSE, safe = TRUE)
## Or alternatively
mgsub_regex_safe(x, pattern, replacement, ignore.case=TRUE)
```

```{r}
x <- "hey, how are you?"
pattern <- c("hey", "how", "are", "you")
replacement <- c("how", "are", "you", "hey")
```

```{r}
## Unsafe
mgsub(x, pattern, replacement)
## Safe
mgsub_regex_safe(x, pattern, replacement)
```

### Match, Extract, Operate, Replacement Subs

Again, `gsub` is a great tool, but sometimes the user wants to match a pattern, extract that pattern, apply a function to that pattern, and then replace the transformed match back into its original location. The `fgsub` function allows the user to perform this operation. It is a stripped-down version of `gsubfn` from the **gsubfn** package. For more versatile needs please see the **gsubfn** package.

In this example the regex looks for words that contain a lower case letter followed by the same letter at least 2 more times. It then extracts these words, splits them apart into letters, reverses the string, pastes them back together, wraps them with double angle braces, and then puts them back at the original locations.

```{r}
fgsub(
    x = c(NA, 'df dft sdf', 'sd fdggg sd dfhhh d', 'ddd'),
    pattern = "\\b\\w*([a-z])(\\1{2,})\\w*\\b",
    fun = function(x) {paste0('<<', paste(rev(strsplit(x, '')[[1]]), collapse = ''), '>>')}
)
```

In this example we extract numbers, strip out non-digits, coerce them to numeric, cut them in half, round up to the closest integer, add the commas back, and replace back into the original locations.

```{r}
fgsub(
    x = c(NA, 'I want 32 grapes', 'he wants 4 ice creams', 'they want 1,234,567 dollars'),
    pattern = "[\\d,]+",
    fun = function(x) {prettyNum(ceiling(as.numeric(gsub('[^0-9]', '', x))/2), big.mark = ',')}
)
```

### Stashing Character Pre-Sub

There are times the user may want to stash a set of characters before subbing out and then return the stashed characters. An example of this is when a researcher wants to remove punctuation but not emoticons. The `sub_holder` function provides tooling to stash the emoticons, allow a punctuation stripping, and then return the emoticons. First I'll create some fake text data with emoticons, then stash the emoticons (using a unique text key to hold their place), then I'll strip out the punctuation, and last put the stashed emoticons back.

```{r}
(fake_dat <- paste(hash_emoticons[1:11, 1, with=FALSE][[1]], DATA$state))
(m <- sub_holder(fake_dat, hash_emoticons[[1]]))
(m_stripped <- strip(m$output))
m$unhold(m_stripped)
```

Of course with clever regexes you can achieve the same thing:

```{r}
ord_emos <- hash_emoticons[[1]][order(nchar(hash_emoticons[[1]]))]

## This step ensures that longer strings are matched first but can
## fail in cases that use quantifiers. These can appear short but in
## reality can match long strings and would be ordered last in the
## replacement, meaning that the shorter regex took precedent.
emos <- paste(
    gsub('([().\\|[{}^$*+?])', '\\\\\\1', ord_emos),
    collapse = '|'
)

gsub(
    sprintf('(%s)(*SKIP)(*FAIL)|[^\'[:^punct:]]', emos),
    '',
    fake_dat,
    perl = TRUE
)
```

The pure regex approach can be a bit trickier (less safe) and more difficult to reason about. It also relies on the less general `(*SKIP)(*FAIL)` backtracking control verbs that are only implemented in a few applications like Perl & PCRE. Still, it's nice to see an alternative regex approach for comparison.
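
As a toy illustration of the mechanics (a minimal sketch, not from the package):

```r
## (*SKIP)(*FAIL) discards matches of the protected pattern so the
## alternation never replaces them: here punctuation is stripped
## while the ':-)' emoticon survives.
gsub("(:-\\))(*SKIP)(*FAIL)|[[:punct:]]", "", "wow :-) ... nice!!", perl = TRUE)
## roughly: "wow :-)  nice"
```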

## Replacement

**textclean** contains tools to replace substrings within text with other substrings that may be easier to analyze. This section outlines the uses of these tools.

### Contractions

Some analysis techniques require contractions to be replaced with their multi-word forms (e.g., "I'll" -> "I will"). `replace_contraction` provides this functionality.

```{r}
x <- c("Mr. Jones isn't going.",
"Check it out what's going on.",
"He's here but didn't go.",
"the robot at t.s. wasn't nice",
"he'd like it if i'd go away")

replace_contraction(x)
```

### Dates

```{r}
x <- c(NA, '11-16-1980 and 11/16/1980', "and 2017-02-08 but then there's 2/8/2017 too")

replace_date(x)
replace_date(x, replacement = '<>')
```

### Emojis

Similar to emoticons, emoji tokens may be ignored if they are not in a computer-readable form. `replace_emoji` replaces emojis with their word equivalents.

```{r}
x <- read.delim(system.file("docs/r_tweets.txt", package = "textclean"),
    stringsAsFactors = FALSE)[[3]][1:3]

x
```

```{r}
replace_emoji(x)
```

### Emoticons

Some analysis techniques examine words, meaning emoticons may be ignored. `replace_emoticon` replaces emoticons with their word equivalents.

```{r}
x <- c(
    "text from: http://www.webopedia.com/quick_ref/textmessageabbreviations_02.asp",
    "... understanding what different characters used in smiley faces mean:",
    "The close bracket represents a sideways smile )",
    "Add in the colon and you have sideways eyes :",
    "Put them together to make a smiley face :)",
    "Use the dash - to add a nose :-)",
    "Change the colon to a semi-colon ; and you have a winking face ;) with a nose ;-)",
    "Put a zero 0 (halo) on top and now you have a winking, smiling angel 0;) with a nose 0;-)",
    "Use the letter 8 in place of the colon for sunglasses 8-)",
    "Use the open bracket ( to turn the smile into a frown :-("
)

replace_emoticon(x)
```

### Grades

In analyses where grades may be discussed, it may be useful to convert the letter forms into word meanings. The `replace_grade` function can be used for this task.

```{r}
text <- c(
    "I give an A+",
    "He deserves an F",
    "It's C+ work",
    "A poor example deserves a C!"
)
replace_grade(text)
```

### HTML

Sometimes HTML tags and symbols stick around like pesky gnats. The `replace_html` function makes light work of them.

```{r}
x <- c(
    "<bold>Random</bold> text with symbols: &nbsp; &lt; &gt; &amp; &quot; &apos;",
    "<p>More text</p> &cent; &pound; &yen; &euro; &copy; &reg;"
)

replace_html(x)
```

### Incomplete Sentences

Sometimes an incomplete sentence is denoted with multiple end marks or no punctuation at all. `replace_incomplete` standardizes these sentences with a pipe (`|`) endmark (or one of the user's choice).

```{r}
x <- c("the...", "I.?", "you.", "threw..", "we?")
replace_incomplete(x)
replace_incomplete(x, '...')
```

### Internet Slang

Often in informal written and spoken communication (e.g., Twitter, texting, Facebook, etc.) people use Internet slang, shorter abbreviations and acronyms, to replace longer word sequences. These replacements may obfuscate the meaning when the machine attempts to analyze the text. The `replace_internet_slang` function replaces the slang with longer word equivalents that are more easily analyzed by machines.

```{r}
x <- c(
    "TGIF and a big w00t! The weekend is GR8!",
    "NP it was my pleasure: EOM",
    'w8...this n00b needs me to say LMGTFY...lol.',
    NA
)

replace_internet_slang(x)
```

### Kerning

In typography, kerning is the adjustment of the spacing between characters. Often, in informal writing, adding manual spaces (a form of kerning) coupled with all capital letters is used for emphasis (e.g., `"She's the B O M B!"`). These word forms would look like noise in most analyses and would likely be removed as stopwords, when in fact they likely carry a great deal of meaning. The `replace_kern` function looks for 3 or more consecutive capital letters with spaces in between and removes the spaces.

```{r}
x <- c(
    "Welcome to A I: the best W O R L D!",
    "Hi I R is the B O M B for sure: we A G R E E indeed.",
    "A sort C A T indeed!",
    NA
)

replace_kern(x)
```

### Money

There are times one may want to replace money mentions with text or normalized versions. The `replace_money` function is designed to complete this task.

```{r}
x <- c(NA, '$3.16 into "three dollars, sixteen cents"', "-$20,333.18 too", 'fff')

replace_money(x)
```

```{r}
replace_money(x, replacement = '<>')
```

### Names

Often one will want to standardize text by removing first and last names. The `replace_names` function quickly removes/replaces common first and last names. This can be made more targeted by feeding a vector of names extracted via a named entity extractor.

```{r}
x <- c(
    "Mary Smith is not here",
    "Karen is not a nice person",
    "Will will do it",
    NA
)

replace_names(x)
replace_names(x, replacement = '<>')
```
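
As a sketch of the more targeted use (assuming the `names` argument accepts a user-supplied character vector; treat the argument usage here as illustrative rather than the documented interface):

```r
## Hypothetical: names produced by a named entity extractor
extracted_names <- c("Mary", "Smith", "Karen")
replace_names(x, names = extracted_names, replacement = '<>')
```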

### Non-ASCII Characters

R can choke on non-ASCII characters. They can be re-encoded, but the new encoding may lack interpretability (e.g., ¢ may be converted to `\xA2`, which is not easily understood or likely to be matched in a hash look-up). `replace_non_ascii` attempts to replace common non-ASCII characters with a text representation (e.g., ¢ becomes "cent"). Non-recognized non-ASCII characters are simply removed (unless `remove.nonconverted = FALSE`).

```{r}
x <- c(
    "Hello World", "6 Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher",
    'This is a \xA9 but not a \xAE', '6 \xF7 2 = 3', 'fractions \xBC, \xBD, \xBE',
    'cows go \xB5', '30\xA2'
)
Encoding(x) <- "latin1"
x

replace_non_ascii(x)
replace_non_ascii(x, remove.nonconverted = FALSE)
```

### Numbers

Some analyses require numbers to be converted to text form. `replace_number` attempts to perform this task. `replace_number` handles comma-separated numbers as well.

```{r}
x <- c("I like 346,457 ice cream cones.", "They are 99 percent good")
y <- c("I like 346457 ice cream cones.", "They are 99 percent good")
replace_number(x)
replace_number(y)
replace_number(x, num.paste = TRUE)
replace_number(x, remove=TRUE)
```

### Ratings

Some texts use ratings to convey satisfaction with a particular object. The `replace_rating` function replaces the more abstract rating with word equivalents.

```{r}
x <- c("This place receives 5 stars for their APPETIZERS!!!",
"Four stars for the food & the guy in the blue shirt for his great vibe!",
"10 out of 10 for both the movie and trilogy.",
"* Both the Hot & Sour & the Egg Flower Soups were absolutely 5 Stars!",
"For service, I give them no stars.", "This place deserves no stars.",
"10 out of 10 stars.",
"My rating: just 3 out of 10.",
"If there were zero stars I would give it zero stars.",
"Rating: 1 out of 10.",
"I gave it 5 stars because of the sound quality.",
"If it were possible to give them 0/10, they'd have it."
)

replace_rating(x)
```

### Ordinal Numbers

Again, some analyses require numbers, including ordinal numbers, to be converted to text form. `replace_ordinal` attempts to perform this task for the ordinal numbers 1 through 100 (i.e., 1st - 100th).

```{r}
x <- c(
    "I like the 1st one not the 22nd one.",
    "For the 100th time stop those 3 things!",
    "I like the 3rd 1 not the 12th 1."
)
replace_ordinal(x)
replace_ordinal(x, TRUE)
replace_ordinal(x, remove = TRUE)
replace_number(replace_ordinal(x))
```

### Symbols

Text often contains shorthand representations of words/phrases. These symbols may contain analyzable information, but in symbolic form they cannot be parsed. The `replace_symbol` function attempts to replace the symbols `c("$", "%", "#", "@", "&", "w/")` with their word equivalents.

```{r}
x <- c("I am @ Jon's & Jim's w/ Marry",
"I owe $41 for food",
"two is 10% of a #"
)
replace_symbol(x)
```

### Time Stamps

Often the researcher will want to replace time stamps with a text or normalized version. The `replace_time` function works well for this task. Notice that the `replacement` argument can take a function that operates on the extracted pattern.

```{r}
x <- c(
    NA, '12:47 to "twelve forty-seven" and also 8:35:02',
    'what about 14:24.5', 'And then 99:99:99?'
)

## Textual: Word version
replace_time(x)

## Normalization: <>
replace_time(x, replacement = '<>')

## Normalization: hh:mm:ss or hh:mm
replace_time(x, replacement = function(y){
    z <- unlist(strsplit(y, '[:.]'))
    z[1] <- 'hh'
    z[2] <- 'mm'
    if (!is.na(z[3])) z[3] <- 'ss'
    textclean::glue_collapse(z, ':')
})

## Textual: Word version (forced seconds)
replace_time(x, replacement = function(y){
    z <- replace_number(unlist(strsplit(y, '[:.]')))
    z[3] <- paste0('and ', ifelse(is.na(z[3]), '0', z[3]), ' seconds')
    paste(z, collapse = ' ')
})
```

### Tokens

Often an analysis requires converting tokens of a certain type into a common form or removing them entirely. The `mgsub` function can do this task, however it is regex-based and time-consuming when the number of tokens to replace is large. For example, one may want to replace all proper nouns that are first names with the word "name". The `replace_tokens` function provides a fast way to replace a group of tokens with a single replacement value.

This example shows a use case for `replace_tokens`:

```{r}
## Set Up the Tokens to Replace
nms <- gsub("(^.)(.*)", "\\U\\1\\L\\2", lexicon::common_names, perl = TRUE)
head(nms)

## Set Up the Data
x <- textshape::split_portion(sample(c(sample(lexicon::grady_augmented, 20000),
    sample(nms, 10000, TRUE))), n.words = 12)
x$text.var <- paste0(x$text.var, sample(c('.', '!', '?'), length(x$text.var), TRUE))
head(x$text.var)

head(replace_tokens(x$text.var, nms, 'NAME'))
```

This demonstration shows how fast token replacement can be with `replace_tokens`:

```{r}
## mgsub
tic <- Sys.time()
head(mgsub(x$text.var, nms, "NAME"))
(toc <- Sys.time() - tic)

## replace_tokens
tic <- Sys.time()
head(replace_tokens(x$text.var, nms, "NAME"))
(toc <- Sys.time() - tic)
```

```{r, echo = FALSE}
tic <- Sys.time()
out <- replace_tokens(rep(x$text.var, 20), nms, "NAME")
toc <- Sys.time() - tic
```

Now let's amp it up with 20x more text data. That's `r f_comma(length(x$text.var) * 20)` rows of text (`r f_comma(sum(stringi::stri_count_words(x$text.var))*20)` words) and `r f_comma(length(nms))` replacement tokens in `r round(toc, 1)` seconds.

```
tic <- Sys.time()
out <- replace_tokens(rep(x$text.var, 20), nms, "NAME")
(toc <- Sys.time() - tic)
```
```{r, echo=FALSE}
toc
```

### White Space

Regex white space characters (e.g., `\n`, `\t`, `\r`) matched by `\s` may impede analysis. These can be replaced with a single space `" "` via the `replace_white` function.

```{r}
x <- "I go \r
to the \tnext line"
x
cat(x)
replace_white(x)
```

### Word Elongation

In informal writing, people may use a form of text embellishment called elongation (a.k.a. "word lengthening") to emphasize or alter word meanings. For example, the use of "Whyyyyy" conveys frustration. Other times the usage may be to be more sexy (e.g., "Heyyyy there"). Other times it may be used for emphasis (e.g., "This is so gooood").

The `replace_word_elongation` function replaces these un-normalized forms with the most likely normalized form. The `impart.meaning` argument can replace a short list of known elongations with semantic replacements.

```{r}
x <- c('look', 'noooooo!', 'real coooool!', "it's sooo goooood", 'fsdfds',
    'fdddf', 'as', "aaaahahahahaha", "aabbccxccbbaa", 'I said heyyy!',
    "I'm liiiike whyyyyy me?", "Wwwhhatttt!", NA)

replace_word_elongation(x)
replace_word_elongation(x, impart.meaning = TRUE)
```