{"id":13857762,"url":"https://github.com/mkearney/wactor","last_synced_at":"2025-10-18T07:13:02.428Z","repository":{"id":56936608,"uuid":"196457380","full_name":"mkearney/wactor","owner":"mkearney","description":"Word Factor Vectors","archived":false,"fork":false,"pushed_at":"2019-12-13T05:39:39.000Z","size":387,"stargazers_count":32,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-26T05:41:50.619Z","etag":null,"topics":["r","r-package","rstats","text","text-classification","text-processing","text-vectorization","word-embeddings","word-vectors","word2vec"],"latest_commit_sha":null,"homepage":"https://github.com/mkearney/wactor","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mkearney.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-07-11T19:58:57.000Z","updated_at":"2025-03-22T10:55:07.000Z","dependencies_parsed_at":"2022-08-21T01:10:13.480Z","dependency_job_id":null,"html_url":"https://github.com/mkearney/wactor","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mkearney%2Fwactor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mkearney%2Fwactor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mkearney%2Fwactor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mkearney%2Fwactor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mkearney","download_url":"https://codeload.github.com/mkearney/wactor/tar.
gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248557844,"owners_count":21124165,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["r","r-package","rstats","text","text-classification","text-processing","text-vectorization","word-embeddings","word-vectors","word2vec"],"created_at":"2024-08-05T03:01:46.203Z","updated_at":"2025-10-18T07:12:57.379Z","avatar_url":"https://github.com/mkearney.png","language":"R","readme":"---\noutput: github_document\n---\n\n\u003c!-- README.md is generated from README.Rmd. Please edit that file --\u003e\n\n```{r, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\",\n  out.width = \"100%\"\n)\noptions(width = 104)\n```\n# wactor \u003cimg src='man/figures/logo.png' align=\"right\" height=\"200\" /\u003e\n\n\u003c!-- badges: start --\u003e\n[![Travis build status](https://travis-ci.org/mkearney/wactor.svg?branch=master)](https://travis-ci.org/mkearney/wactor)\n[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/mkearney/wactor?branch=master\u0026svg=true)](https://ci.appveyor.com/project/mkearney/wactor)\n[![CRAN status](https://www.r-pkg.org/badges/version/wactor)](https://CRAN.R-project.org/package=wactor)\n[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)\n[![Codecov test 
coverage](https://codecov.io/gh/mkearney/wactor/branch/master/graph/badge.svg)](https://codecov.io/gh/mkearney/wactor?branch=master)\n\u003c!-- badges: end --\u003e\n\nA user-friendly factor-like interface for converting strings of text into numeric vectors and rectangular data structures.\n\n## Installation\n\n\u003c!-- You can install the released version of wactor from [CRAN](https://CRAN.R-project.org) with:\n\n``` r\ninstall.packages(\"wactor\")\n```\n--\u003e\n\nYou can install the development version from [GitHub](https://github.com/mkearney/wactor) with:\n\n``` r\n## install {remotes} if not already installed\nif (!requireNamespace(\"remotes\")) {\n  install.packages(\"remotes\")\n}\n## install wactor from GitHub\nremotes::install_github(\"mkearney/wactor\")\n```\n\n## Example\n\nHere's some basic text (e.g., natural language) data:\n\n```{r example}\n## load wactor package\nlibrary(wactor)\n\n## text data (sentences)\nx \u003c- c(\n  \"This test is a test\",\n  \"This one will be a test\",\n  \"This this was a test\",\n  \"And this is the fourth test\",\n  \"Fifth: the test!\",\n  \"And the sixth test\",\n  \"This is the seventh test\",\n  \"This test is going to be a test\",\n  \"This one will have been a test\",\n  \"This this has been a test\"\n)\n\n## for demonstration purposes, store as a data frame as well\ndata \u003c- tibble::tibble(\n  text = x,\n  value = rnorm(length(x)),\n  z = c(rep(TRUE, 7), rep(FALSE, 3))\n)\n```\n\n\n### `split_test_train()`\n\nA convenience function for splitting an input object into test and train data \nframes. This is often useful for splitting a single data frame:\n\n```{r}\n## split into test/train data sets\nsplit_test_train(data)\n```\n\nBy default, `split_test_train()` returns 80% of the input data in the `train`\ndata set and 20% of the input data in the `test` data set. 
The proportion of\ndata placed in the returned training set can be adjusted via the `.p` argument:\n\n```{r}\n## split into test/train data sets, with 70% of the data in the training set\nsplit_test_train(data, .p = 0.70)\n```\n\nWhen predicting categorical variables, it's often desirable to ensure the \ntraining data set has an equal number of observations for each level of the \nresponse variable. This can be achieved by supplying the column name of the \ncategorical response variable using tidy evaluation. This prioritizes evenly\nbalanced groups over the specified proportion of training data:\n\n```{r}\n## ensure evenly balanced groups in `train` data set\nsplit_test_train(data, .p = 0.70, z)\n```\n\n`split_test_train()` doesn't only work on data frames. It's also possible to \nsplit atomic vectors (i.e., character, numeric, logical):\n\n```{r}\n## or split a character vector into test/train data sets\n(d \u003c- split_test_train(x))\n```\n\n\n### `wactor()`\n\nUse `wactor()` to convert a character vector into a `wactor` object. 
The code\nbelow uses the training portion of the previously split text data `d` described above.\n\n```{r}\n## create wactor\nw \u003c- wactor(d$train$x)\n```\n\n### `dtm()`\n\nGet the document term frequency matrix:\n\n```{r}\n## document term frequency matrix\ndtm(w)\n\n## same thing as dtm\npredict(w)\n```\n\n### `tfidf()`\n\nOr get the term frequency–inverse document frequency matrix:\n\n```{r}\n## create tf-idf matrix\ntfidf(w)\n```\n\nOr apply the wactor to **new data**:\n\n```{r}\n## document term frequency of new data\ndtm(w, d$test$x)\n\n## same thing as dtm\npredict(w, d$test$x)\n\n## term frequency–inverse document frequency of new data\ntfidf(w, d$test$x)\n```\n\n\n### `xgb_mat()`\n\nThe wactor package also makes it easy to work with the \n[{xgboost}](https://github.com/dmlc/xgboost) package:\n\n```{r}\n## convert tfidf matrix into xgb.DMatrix\nxgb_mat(tfidf(w, d$test$x))\n```\n\nThe `xgb_mat()` function also allows users to specify a response/label/outcome\nvector, e.g.:\n\n```{r}\n## include a response variable\nxgb_mat(tfidf(w, d$train$x), y = c(rep(0, 4), rep(1, 4)))\n```\n\nTo return split (into test and train) data, specify a value between 0 and 1 to set\nthe proportion of observations that should appear in the training data set:\n\n```{r}\n## split into test/train\nxgb_data \u003c- xgb_mat(tfidf(w, d$train$x), y = c(rep(0, 4), rep(1, 4)), split = 0.8)\n```\n\nThe object returned by `xgb_mat()` can then easily be passed to {xgboost}\nfunctions for powerful and fast machine learning!\n\n```{r}\n## specify hyperparameters\nparams \u003c- list(\n  max_depth = 2,\n  eta = 0.25,\n  objective = \"binary:logistic\"\n)\n\n## init training\nxgboost::xgb.train(\n  params,\n  xgb_data$train,\n  nrounds = 4,\n  watchlist = 
xgb_data)\n```\n\n","funding_links":[],"categories":["R"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmkearney%2Fwactor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmkearney%2Fwactor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmkearney%2Fwactor/lists"}