{"id":13857986,"url":"https://github.com/mkearney/textfeatures","last_synced_at":"2025-04-09T14:09:22.788Z","repository":{"id":56936800,"uuid":"123046986","full_name":"mkearney/textfeatures","owner":"mkearney","description":"👷‍♂️ A simple package for extracting useful features from character objects 👷‍♀️","archived":false,"fork":false,"pushed_at":"2020-10-12T20:07:14.000Z","size":8008,"stargazers_count":167,"open_issues_count":10,"forks_count":17,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-03-31T15:25:45.631Z","etag":null,"topics":["feature-extraction","machine-learning","mkearney-r-package","neural-network","neural-networks","r","rstats","text-mining","word2vec"],"latest_commit_sha":null,"homepage":"https://textfeatures.mikewk.com","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mkearney.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-02-27T00:00:10.000Z","updated_at":"2025-03-22T11:01:43.000Z","dependencies_parsed_at":"2022-08-21T07:20:33.400Z","dependency_job_id":null,"html_url":"https://github.com/mkearney/textfeatures","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mkearney%2Ftextfeatures","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mkearney%2Ftextfeatures/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mkearney%2Ftextfeatures/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mkearney%2Ftextfeatures/manifests","owner_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/owners/mkearney","download_url":"https://codeload.github.com/mkearney/textfeatures/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248054195,"owners_count":21039952,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["feature-extraction","machine-learning","mkearney-r-package","neural-network","neural-networks","r","rstats","text-mining","word2vec"],"created_at":"2024-08-05T03:01:53.239Z","updated_at":"2025-04-09T14:09:22.765Z","avatar_url":"https://github.com/mkearney.png","language":"R","readme":"---\noutput: github_document\n---\n\n```{r setup, include=FALSE}\nknitr::opts_chunk$set(echo = TRUE, collapse = TRUE, comment = \"#\u003e\", cache = TRUE)\nlibrary(textfeatures)\nlibrary(magrittr)\noptions(width = 100)\nskimrskim \u003c- function(x) {\n  skimr::skim(x[-1]) %\u003e% \n    dplyr::filter(stat %in% c(\"p0\", \"p25\", \"p50\", \"p75\", \"p100\", \"hist\", \"n\")) %\u003e%\n    dplyr::select(-value, -level, -type) %\u003e%\n    tidyr::spread(stat, formatted) %\u003e%\n    dplyr::select(variable, `min` = p0, `25%` = p25, `mid` = p50, `75%` = p75, `max` = p100, hist) %\u003e% \n    knitr::kable()\n}\n```\n\n# 👷 textfeatures 👷\u003cimg src=\"man/figures/logo.png\" width=\"160px\" align=\"right\" /\u003e \n\n[![Build status](https://travis-ci.org/mkearney/textfeatures.svg?branch=master)](https://travis-ci.org/mkearney/textfeatures)\n[![AppVeyor build 
status](https://ci.appveyor.com/api/projects/status/github/mkearney/textfeatures?branch=master\u0026svg=true)](https://ci.appveyor.com/project/mkearney/textfeatures)\n[![CRAN status](https://www.r-pkg.org/badges/version/textfeatures)](https://cran.r-project.org/package=textfeatures)\n[![Coverage Status](https://codecov.io/gh/mkearney/textfeatures/branch/master/graph/badge.svg)](https://codecov.io/gh/mkearney/textfeatures?branch=master)\n[![DOI](https://zenodo.org/badge/123046986.svg)](https://zenodo.org/badge/latestdoi/123046986)\n\n![Downloads](https://cranlogs.r-pkg.org/badges/textfeatures)\n![Downloads](https://cranlogs.r-pkg.org/badges/grand-total/textfeatures)\n[![lifecycle](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)\n\n\u003e Easily extract useful features from character objects.\n\n## Install\n\nInstall from CRAN.\n\n```{r cran, eval=FALSE}\n## download from CRAN\ninstall.packages(\"textfeatures\")\n```\n\nOr install the development version from GitHub.\n\n```{r github, eval=FALSE}\n## install from GitHub\ndevtools::install_github(\"mkearney/textfeatures\")\n```\n\n## Usage\n\n### `textfeatures()`\n\nInput a character vector.\n\n```{r chr}\n## vector of some text\nx \u003c- c(\n  \"this is A!\\t sEntence https://github.com about #rstats @github\",\n  \"and another sentence here\", \"THe following list:\\n- one\\n- two\\n- three\\nOkay!?!\"\n)\n\n## get text features\ntextfeatures(x, verbose = FALSE)\n```\n\nOr input a data frame with a column named `text`.\n\n```{r df}\n## data frame with rstats tweets\nrt \u003c- rtweet::search_tweets(\"rstats\", n = 2000, verbose = FALSE)\n\n## get text features\ntf \u003c- textfeatures(rt, verbose = FALSE)\n\n## preview data\ntf\n```\n\nCompare across multiple authors.\n\n```{r news, eval = FALSE}\n## data frame of tweets from multiple news media accounts\nnews \u003c- rtweet::get_timelines(\n  c(\"cnn\", \"nytimes\", \"foxnews\", \"latimes\", 
\"washingtonpost\"), \n  n = 2000)\n\n## get text features (including ests for 20 word dims) for all observations\nnews_features \u003c- textfeatures(news, word_dims = 20, verbose = FALSE)\n```\n\n\n```{r news_features, echo = FALSE, eval = FALSE}\n## override id with screen names\nnews_features$user_id \u003c- news$screen_name\n\n## load the tidyverse\nsuppressPackageStartupMessages(library(tidyverse))\n\n## convert to long (tidy) form and plot\np \u003c- news_features %\u003e%\n  scale_count() %\u003e%\n  scale_standard() %\u003e%\n  group_by(user_id) %\u003e%\n  summarise_if(is.numeric, mean) %\u003e%\n  gather(var, val, -user_id) %\u003e%\n  arrange(-val) %\u003e%\n  mutate(var = factor(var, levels = unique(var)), \n    user_id = paste0(\"@\", user_id)) %\u003e%\n  ggplot(aes(x = var, y = val, fill = user_id)) + \n  geom_col(width = .15, fill = \"#000000bb\") +\n  geom_point(size = 2.5, shape = 21) + \n  tfse::theme_mwk(light = \"#ffffff\") + \n  facet_wrap(~ user_id, nrow = 1) + \n  coord_flip() + \n  theme(legend.position = \"none\",\n    axis.text = element_text(colour = \"black\"),\n    axis.text.x = element_text(size = rel(.7)),\n    plot.title = element_text(face = \"bold\", size = rel(1.6)),\n    panel.grid.major = element_line(colour = \"#333333\", size = rel(.05)),\n    panel.grid.minor = element_line(colour = \"#333333\", size = rel(.025))) + \n  labs(y = NULL, x = NULL,\n    title = \"{textfeatures}: Extract Features from Text\",\n    subtitle = \"Features extracted from text of the most recent 2,000 tweets posted by each news media account\")\n\n## save plot\nggsave(\"tools/readme/readme.png\", p, width = 9, height = 6, units = \"in\")\n```\n\n\u003cp style='align:center'\u003e\u003cimg src='tools/readme/readme.png' max-width=\"600px\" /\u003e\u003c/p\u003e\n\n\n\n\n## Fast version\n\nIf you're looking for something faster try setting `sentiment = FALSE` and `word2vec = 0`.\n\n```{r fast}\n## get non-substantive text features\ntextfeatures(rt, 
sentiment = FALSE, word_dims = 0, verbose = FALSE)\n```\n\n\n## Example: NASA meta data\n\nExtract text features from NASA meta data:\n\n\n```{r, include=FALSE}\nif (!file.exists(\".nasa.rds\")) {\n  ## read NASA meta data\n  nasa \u003c- jsonlite::fromJSON(\"https://data.nasa.gov/data.json\")\n  \n  ## identify non-public or restricted data sets\n  nonpub \u003c- grepl(\"Not publicly available|must register\", \n    nasa$dataset$rights, ignore.case = TRUE) | \n    nasa$dataset$accessLevel %in% c(\"restricted public\", \"non-public\")\n  \n  ## create data frame with ID, description (name it \"text\"), and nonpub\n  nd \u003c- data.frame(text = nasa$dataset$description, nonpub = nonpub, \n    stringsAsFactors = FALSE)\n  \n  ## drop duplicates (truncate text to ensure more distinct obs)\n  nd \u003c- nd[!duplicated(tolower(substr(nd$text, 1, 100))), ]\n  \n  ## filter via sampling to create equal number of pub/nonpub\n  nd \u003c- nd[c(sample(which(!nd$nonpub), sum(nd$nonpub)), which(nd$nonpub)), ]\n  saveRDS(nd, \".nasa.rds\")\n} else {\n  nd \u003c- readRDS(\".nasa.rds\")\n}\n```\n\n\n```{r nd, eval=FALSE}\n## read NASA meta data\nnasa \u003c- jsonlite::fromJSON(\"https://data.nasa.gov/data.json\")\n\n## identify non-public or restricted data sets\nnonpub \u003c- grepl(\"Not publicly available|must register\", \n  nasa$dataset$rights, ignore.case = TRUE) | \n  nasa$dataset$accessLevel %in% c(\"restricted public\", \"non-public\")\n\n## create data frame with ID, description (name it \"text\"), and nonpub\nnd \u003c- data.frame(text = nasa$dataset$description, nonpub = nonpub, \n  stringsAsFactors = FALSE)\n\n## drop duplicates (truncate text to ensure more distinct obs)\nnd \u003c- nd[!duplicated(tolower(substr(nd$text, 1, 100))), ]\n\n## filter via sampling to create equal number of pub/nonpub\nnd \u003c- nd[c(sample(which(!nd$nonpub), sum(nd$nonpub)), which(nd$nonpub)), ]\n```\n\n\n```{r nasafinal}\n## get text features\nnasa_tf \u003c- textfeatures(nd, word_dims = 
20, normalize = FALSE, verbose = FALSE)\n\n## drop columns with little to no variance\nmin_var \u003c- function(x, min = 1) {\n  is_num \u003c- vapply(x, is.numeric, logical(1))\n  non_num \u003c- names(x)[!is_num]\n  yminvar \u003c- names(x[is_num])[vapply(x[is_num], function(.x) stats::var(.x, \n      na.rm = TRUE) \u003e= min, logical(1))]\n  x[c(non_num, yminvar)]\n}\nnasa_tf \u003c- min_var(nasa_tf)\n\n## view summary\nskimrskim(nasa_tf)\n\n## add nonpub variable\nnasa_tf$nonpub \u003c- nd$nonpub\n\n## run model predicting whether data is restricted\nm1 \u003c- glm(nonpub ~ ., data = nasa_tf[-1], family = binomial)\n\n## view model summary\nsummary(m1)\n\n## how accurate was the model?\ntable(predict(m1, type = \"response\") \u003e .5, nasa_tf$nonpub)\n```\n\n\n\n","funding_links":[],"categories":["R"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmkearney%2Ftextfeatures","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmkearney%2Ftextfeatures","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmkearney%2Ftextfeatures/lists"}