Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mkearney/textfeatures

👷♂️ A simple package for extracting useful features from character objects 👷♀️

feature-extraction machine-learning mkearney-r-package neural-network neural-networks r rstats text-mining word2vec
- Host: GitHub
- URL: https://github.com/mkearney/textfeatures
- Owner: mkearney
- License: other
- Created: 2018-02-27T00:00:10.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2020-10-12T20:07:14.000Z (over 4 years ago)
- Last Synced: 2025-01-15T11:45:11.450Z (10 days ago)
- Topics: feature-extraction, machine-learning, mkearney-r-package, neural-network, neural-networks, r, rstats, text-mining, word2vec
- Language: R
- Homepage: https://textfeatures.mikewk.com
- Size: 7.64 MB
- Stars: 168
- Watchers: 9
- Forks: 17
- Open Issues: 10
Metadata Files:
- Readme: README.Rmd
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - mkearney/textfeatures - 👷♂️ A simple package for extracting useful features from character objects 👷♀️ (R)
README
---
output: github_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE, comment = "#>", cache = TRUE)
library(textfeatures)
library(magrittr)
options(width = 100)
skimrskim <- function(x) {
  skimr::skim(x[-1]) %>%
    dplyr::filter(stat %in% c("p0", "p25", "p50", "p75", "p100", "hist", "n")) %>%
    dplyr::select(-value, -level, -type) %>%
    tidyr::spread(stat, formatted) %>%
    dplyr::select(variable, `min` = p0, `25%` = p25, `mid` = p50, `75%` = p75, `max` = p100, hist) %>%
    knitr::kable()
}
```

# 👷 textfeatures 👷
[![Build status](https://travis-ci.org/mkearney/textfeatures.svg?branch=master)](https://travis-ci.org/mkearney/textfeatures)
[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/mkearney/textfeatures?branch=master&svg=true)](https://ci.appveyor.com/project/mkearney/textfeatures)
[![CRAN status](https://www.r-pkg.org/badges/version/textfeatures)](https://cran.r-project.org/package=textfeatures)
[![Coverage Status](https://codecov.io/gh/mkearney/textfeatures/branch/master/graph/badge.svg)](https://codecov.io/gh/mkearney/textfeatures?branch=master)
[![DOI](https://zenodo.org/badge/123046986.svg)](https://zenodo.org/badge/latestdoi/123046986)
![Downloads](https://cranlogs.r-pkg.org/badges/textfeatures)
![Downloads](https://cranlogs.r-pkg.org/badges/grand-total/textfeatures)
[![lifecycle](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)

> Easily extract useful features from character objects.
## Install
Install from CRAN.
```{r cran, eval=FALSE}
## download from CRAN
install.packages("textfeatures")
```

Or install the development version from GitHub.
```{r github, eval=FALSE}
## install from github
devtools::install_github("mkearney/textfeatures")
```

## Usage
### `textfeatures()`
Input a character vector.
```{r chr}
## vector of some text
x <- c(
  "this is A!\t sEntence https://github.com about #rstats @github",
  "and another sentence here",
  "THe following list:\n- one\n- two\n- three\nOkay!?!"
)

## get text features
textfeatures(x, verbose = FALSE)
```

Or input a data frame with a column named `text`.
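The `rtweet` example below requires Twitter API credentials. As a minimal offline sketch (this data frame and its contents are made up, not from the original README), any data frame with a `text` column uses the same interface:

```r
## hypothetical example: a plain data frame with a `text` column,
## no API access needed
library(textfeatures)

df <- data.frame(
  text = c(
    "Check https://example.com #rstats @user!",
    "a plain sentence with no extras"
  ),
  stringsAsFactors = FALSE
)

## same call as with a character vector; returns one row of features per text
textfeatures(df, verbose = FALSE)
```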
```{r df}
## data frame with rstats tweets
rt <- rtweet::search_tweets("rstats", n = 2000, verbose = FALSE)

## get text features
tf <- textfeatures(rt, verbose = FALSE)

## preview data
tf
```

Compare across multiple authors.
```{r news, eval = FALSE}
## data frame of tweets from multiple news media accounts
news <- rtweet::get_timelines(
  c("cnn", "nytimes", "foxnews", "latimes", "washingtonpost"),
  n = 2000
)

## get text features (including estimates for 20 word dims) for all observations
news_features <- textfeatures(news, word_dims = 20, verbose = FALSE)
```

```{r news_features, echo = FALSE, eval = FALSE}
## override id with screen names
news_features$user_id <- news$screen_name

## load the tidyverse
suppressPackageStartupMessages(library(tidyverse))

## convert to long (tidy) form and plot
p <- news_features %>%
  scale_count() %>%
  scale_standard() %>%
  group_by(user_id) %>%
  summarise_if(is.numeric, mean) %>%
  gather(var, val, -user_id) %>%
  arrange(-val) %>%
  mutate(var = factor(var, levels = unique(var)),
         user_id = paste0("@", user_id)) %>%
  ggplot(aes(x = var, y = val, fill = user_id)) +
  geom_col(width = .15, fill = "#000000bb") +
  geom_point(size = 2.5, shape = 21) +
  tfse::theme_mwk(light = "#ffffff") +
  facet_wrap(~ user_id, nrow = 1) +
  coord_flip() +
  theme(legend.position = "none",
        axis.text = element_text(colour = "black"),
        axis.text.x = element_text(size = rel(.7)),
        plot.title = element_text(face = "bold", size = rel(1.6)),
        panel.grid.major = element_line(colour = "#333333", size = rel(.05)),
        panel.grid.minor = element_line(colour = "#333333", size = rel(.025))) +
  labs(y = NULL, x = NULL,
       title = "{textfeatures}: Extract Features from Text",
       subtitle = "Features extracted from text of the most recent 2,000 tweets posted by each news media account")

## save plot
ggsave("tools/readme/readme.png", p, width = 9, height = 6, units = "in")
```

## Fast version
If you're looking for something faster, try setting `sentiment = FALSE` and `word_dims = 0`.
```{r fast}
## get non-substantive text features
textfeatures(rt, sentiment = FALSE, word_dims = 0, verbose = FALSE)
```

## Example: NASA metadata
Extract text features from NASA metadata:
```{r, include=FALSE}
if (!file.exists(".nasa.rds")) {
  ## read NASA metadata
  nasa <- jsonlite::fromJSON("https://data.nasa.gov/data.json")
  ## identify non-public or restricted data sets
  nonpub <- grepl("Not publicly available|must register",
    nasa$dataset$rights, ignore.case = TRUE) |
    nasa$dataset$accessLevel %in% c("restricted public", "non-public")
  ## create data frame with description (name it "text") and nonpub
  nd <- data.frame(text = nasa$dataset$description, nonpub = nonpub,
    stringsAsFactors = FALSE)
  ## drop duplicates (truncate text to ensure more distinct obs)
  nd <- nd[!duplicated(tolower(substr(nd$text, 1, 100))), ]
  ## filter via sampling to create equal number of pub/nonpub
  nd <- nd[c(sample(which(!nd$nonpub), sum(nd$nonpub)), which(nd$nonpub)), ]
  saveRDS(nd, ".nasa.rds")
} else {
  nd <- readRDS(".nasa.rds")
}
```

```{r nd, eval=FALSE}
## read NASA meta data
nasa <- jsonlite::fromJSON("https://data.nasa.gov/data.json")

## identify non-public or restricted data sets
nonpub <- grepl("Not publicly available|must register",
  nasa$dataset$rights, ignore.case = TRUE) |
  nasa$dataset$accessLevel %in% c("restricted public", "non-public")

## create data frame with description (name it "text") and nonpub
nd <- data.frame(text = nasa$dataset$description, nonpub = nonpub,
  stringsAsFactors = FALSE)

## drop duplicates (truncate text to ensure more distinct obs)
nd <- nd[!duplicated(tolower(substr(nd$text, 1, 100))), ]

## filter via sampling to create equal number of pub/nonpub
nd <- nd[c(sample(which(!nd$nonpub), sum(nd$nonpub)), which(nd$nonpub)), ]
```

```{r nasafinal}
## get text features
nasa_tf <- textfeatures(nd, word_dims = 20, normalize = FALSE, verbose = FALSE)

## drop columns with little to no variance
min_var <- function(x, min = 1) {
  is_num <- vapply(x, is.numeric, logical(1))
  non_num <- names(x)[!is_num]
  yminvar <- names(x[is_num])[vapply(
    x[is_num],
    function(.x) stats::var(.x, na.rm = TRUE) >= min,
    logical(1)
  )]
  x[c(non_num, yminvar)]
}
nasa_tf <- min_var(nasa_tf)

## view summary
skimrskim(nasa_tf)

## add nonpub variable
nasa_tf$nonpub <- nd$nonpub

## run model predicting whether data is restricted
m1 <- glm(nonpub ~ ., data = nasa_tf[-1], family = binomial)

## view model summary
summary(m1)

## how accurate was the model?
table(predict(m1, type = "response") > .5, nasa_tf$nonpub)
```
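The confusion table above can be reduced to a single accuracy figure. A small base-R sketch (the `pred`/`truth` vectors below are made up for illustration, not NASA data):

```r
## accuracy = proportion of cases on the diagonal of a confusion table
accuracy <- function(tab) sum(diag(tab)) / sum(tab)

## toy illustration with made-up logical vectors
pred  <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
truth <- c(TRUE, FALSE, FALSE, FALSE, TRUE)
accuracy(table(pred, truth))  # 4 of 5 correct: 0.8

## applied to the model above it would read:
## accuracy(table(predict(m1, type = "response") > .5, nasa_tf$nonpub))
```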