https://github.com/mkearney/wactor

Word Factor Vectors
https://github.com/mkearney/wactor

r r-package rstats text text-classification text-processing text-vectorization word-embeddings word-vectors word2vec

Last synced: 6 months ago
JSON representation

Word Factor Vectors

Host: GitHub
URL: https://github.com/mkearney/wactor
Owner: mkearney
License: other
Created: 2019-07-11T19:58:57.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2019-12-13T05:39:39.000Z (almost 6 years ago)
Last Synced: 2025-03-26T05:41:50.619Z (7 months ago)
Topics: r, r-package, rstats, text, text-classification, text-processing, text-vectorization, word-embeddings, word-vectors, word2vec
Language: R
Homepage: https://github.com/mkearney/wactor
Size: 378 KB
Stars: 32
Watchers: 2
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- License: LICENSE

Awesome Lists containing this project

jimsghstars - mkearney/wactor - Word Factor Vectors (R)

README

          ---

output: github_document

---

```{r, include = FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>",

  fig.path = "man/figures/README-",

  out.width = "100%"

)

options(width = 104)

```

# wactor 

[![Travis build status](https://travis-ci.org/mkearney/wactor.svg?branch=master)](https://travis-ci.org/mkearney/wactor)

[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/mkearney/wactor?branch=master&svg=true)](https://ci.appveyor.com/project/mkearney/wactor)

[![CRAN status](https://www.r-pkg.org/badges/version/wactor)](https://CRAN.R-project.org/package=wactor)

[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)

[![Codecov test coverage](https://codecov.io/gh/mkearney/wactor/branch/master/graph/badge.svg)](https://codecov.io/gh/mkearney/wactor?branch=master)

A user-friendly factor-like interface for converting strings of text into numeric vectors and rectangular data structures.

## Installation

You can install the development version from [GitHub](https://github.com/mkearney/wactor) with:

``` r

## install {remotes} if not already

if (!requireNamespace("remotes")) {

  install.packages("remotes")

}

## install wactor for github

remotes::install_github("mkearney/wactor")

```

## Example

Here's some basic text (e.g., natural language) data:

```{r example}

## load wactor package

library(wactor)

## text data (sentences)

x <- c(

  "This test is a test",

  "This one will be a test",

  "This this was a test",

  "And this is the fourth test",

  "Fifth: the test!",

  "And the sixth test",

  "This is the seventh test",

  "This test is going to be a test",

  "This one will have been a test",

  "This this has been a test"

)

## for demonstration purposes, store as a data frame as well

data <- tibble::tibble(

  text = x,

  value = rnorm(length(x)),

  z = c(rep(TRUE, 7), rep(FALSE, 3))

)

```

### `split_test_train()`

A convenience function for splitting an input object into test and train data 

frames. This is often useful for splitting a single data frame:

```{r}

## split into test/train data sets

split_test_train(data)

```

By default, `split_test_train()` returns 80% of the input data in the `train`

data set and 20% of the input data in the `test` data set. This proportion of

data used in the returned training data can be adjusted via the `.p` argument:

```{r}

## split into test/train data sets–with 70% of data in training set

split_test_train(data, .p = 0.70)

```

When predicting categorical variables, it's often desirable to ensure the 

training data set has an even number of observations for each level in the 

response variable. This can be achieved by indicating the [column] name of the 

categorical response variable using tidy evaluation. This will prioritize evenly

balanced observations over the specified proportion in training data:

```{r}

## ensure evenly balanced groups in `train` data set

split_test_train(data, .p = 0.70, z)

```

The `split_test_train()` doesn't only work on data frames. It's also possible to 

split atomic vectors (i.e., character, numeric, logical):

```{r}

## OR split character vector into test/train data sets

(d <- split_test_train(x))

```

### `wactor()`

Use `wactor()` to convert a character vector into a `wactor` object. The code

below uses the previously split [into test/train] text data `d` described above.

```{r}

## create wactor

w <- wactor(d$train$x)

```

### `dtm()`

Get the document term frequency matrix

```{r}

## term frequency–inverse document frequency

dtm(w)

## same thing as dtm

predict(w)

```

### `tfidf()`

or term frequency–inverse document frequency matrix

```{r}

## create tf-idf matrix

tfidf(w)

```

Or apply the wactor on **new data**

```{r}

## document term frequecy of new data

dtm(w, d$test$x)

## same thing as dtm

predict(w, d$test$x)

## term frequency–inverse document frequency of new data

tfidf(w, d$test$x)

```

### `xgb_mat`

The wactor package also makes it easy to work with the 

[{xgboost}](https://github.com/dmlc/xgboost) package:

```{r}

## convert tfidf matrix into xgb.DMatrix

xgb_mat(tfidf(w, d$test$x))

```

The `xgb_mat()` function also allows users to specify a response/label/outcome

vector, e.g.:

```{r}

## include a response variable

xgb_mat(tfidf(w, d$train$x), y = c(rep(0, 4), rep(1, 4)))

```

To return split (into test and train) data, specify a value between 0-1 to set

the proportion of observations that should appear in the training data set:

```{r}

## split into test/train

xgb_data <- xgb_mat(tfidf(w, d$train$x), y = c(rep(0, 4), rep(1, 4)), split = 0.8)

```

The object returned by `xgb_mat()` can then easily be passed to {xgboost}

functions for powerful and fast machine learning!

```{r}

## specify hyper params

params <- list(

  max_depth = 2,

  eta = 0.25,

  objective = "binary:logistic"

)

## init training

xgboost::xgb.train(

  params,

  xgb_data$train,

  nrounds = 4,

  watchlist = xgb_data)

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/mkearney/wactor

Awesome Lists containing this project

README