
---
output: github_document
---

```{r setting, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  comment = "#>",
  fig.path = "man/figures/README-"
)
```

# scdv
An R package for the SCDV (Sparse Composite Document Vectors) algorithm.

[![Travis-CI Build Status](https://api.travis-ci.com/teramonagi/scdv.svg?branch=master)](https://travis-ci.com/teramonagi/scdv)

## Installation

```r
# Not yet available on CRAN; this will work once the package is released:
# install.packages("scdv")

# The development version from GitHub:
# install.packages("devtools")
devtools::install_github("teramonagi/scdv")
```

## Example

### Get (sample) data and do pre-processing
```{r data, cache=TRUE}
library(scdv)
# Get example documents from Project Gutenberg (http://www.gutenberg.org/wiki/Main_Page)
urls <- c(
  "http://www.gutenberg.org/files/98/98-0.txt",
  "http://www.gutenberg.org/files/1342/1342-0.txt"
)
# Download each file and read the response body as UTF-8 text
x <- purrr::map(urls, ~ httr::content(httr::GET(.x), as = "text", encoding = "UTF-8"))
# Pre-process each document: tokenize into words and remove English stop words
doc <- purrr::map(x, ~ tokenizers::tokenize_words(.x, stopwords = stopwords::stopwords("en"))[[1]])
doc[[1]][1:10]
```
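
To make the pre-processing step concrete, here is a rough base R sketch of what tokenizing and stop-word removal does to a single string. The `toy_tokenize` helper and its tiny stop-word list are hypothetical stand-ins for illustration only; the example above relies on `tokenizers::tokenize_words()` and `stopwords::stopwords("en")`, whose exact tokenization rules and stop-word list may differ.

```r
# Hypothetical helper (not part of scdv): lower-case the text,
# split on anything that is not a letter or apostrophe, then drop stop words.
toy_tokenize <- function(text, stop_words = c("the", "a", "of", "and", "it", "is")) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]  # drop empty strings left by leading separators
  words[!words %in% stop_words]
}

tokens <- toy_tokenize("It is a truth universally acknowledged")
print(tokens)
```

Each element of `doc` in the example above is a character vector of this kind, one per downloaded book.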

### Calculate SCDV (Sparse Composite Document Vectors)
```{r scdv, cache=TRUE}
# Set the number of clusters (k) and the word2vec dimension (dimension)
k <- 10
dimension <- 30
# Calculate the Sparse Composite Document Vectors
dv <- scdv::scdv(doc, k, dimension, word2vec_args = list(show_by=25))
```
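
For readers who want a feel for what the call above computes, the following is a toy sketch of the general SCDV idea in base R. It is not the implementation used by this package. Everything prefixed with `toy_` is hypothetical, random vectors stand in for trained word2vec embeddings, a softmax over distances to k-means centers stands in for the Gaussian mixture posteriors used by SCDV, and the 4% sparsity setting is an arbitrary choice for illustration.

```r
# Toy sketch of the SCDV idea (assumptions are marked; not the scdv internals).
set.seed(1)
toy_docs <- list(c("cat", "dog", "cat"), c("stock", "market", "dog"))
toy_vocab <- sort(unique(unlist(toy_docs)))
toy_d <- 4  # embedding dimension
toy_k <- 2  # number of clusters

# Assumption: random vectors instead of trained word2vec embeddings.
toy_wv <- matrix(rnorm(length(toy_vocab) * toy_d),
                 nrow = length(toy_vocab), dimnames = list(toy_vocab, NULL))

# Assumption: soft cluster membership from k-means distances instead of a GMM.
centers <- kmeans(toy_wv, centers = toy_k)$centers
sq_dist <- sapply(seq_len(toy_k), function(j) rowSums(sweep(toy_wv, 2, centers[j, ])^2))
prob <- exp(-(sq_dist - apply(sq_dist, 1, min)))  # shift by row minimum for stability
prob <- prob / rowSums(prob)

# Inverse document frequency of each word.
doc_freq <- sapply(toy_vocab, function(w) sum(sapply(toy_docs, function(doc) w %in% doc)))
idf <- log(length(toy_docs) / doc_freq) + 1

# Word-topic vectors: concatenate prob[i, j] * wv[i, ] over clusters j, weighted by idf.
toy_wtv <- t(sapply(seq_along(toy_vocab), function(i) {
  idf[i] * as.vector(outer(toy_wv[i, ], prob[i, ]))
}))
rownames(toy_wtv) <- toy_vocab

# Document vectors: sum the word-topic vectors of each document's words, then normalize.
toy_dv <- t(sapply(toy_docs, function(doc) colSums(toy_wtv[doc, , drop = FALSE])))
toy_dv <- toy_dv / sqrt(rowSums(toy_dv^2))

# Sparsify: zero out entries that are small relative to the average min/max magnitude.
threshold <- 0.04 * (abs(mean(apply(toy_dv, 1, min))) + abs(mean(apply(toy_dv, 1, max)))) / 2
toy_dv[abs(toy_dv) < threshold] <- 0
dim(toy_dv)  # one row per document, toy_d * toy_k columns
```

The result has one row per document and `toy_d * toy_k` columns, which is why the document vectors grow with both the number of clusters and the embedding dimension.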

### Calculate word embeddings with word2vec and visualize them
```{r w2v, cache=FALSE}
# Calculate word embeddings with word2vec
wv <- scdv::word2vec(doc, dimension, args = list(show_by=25))
```

```{r w2v_visualize}
# Sample 10 rows (word vectors) at random and visualize them
scdv::visualize(wv[sample(nrow(wv), size = 10), ])
# You can also visualize the document vectors like this:
#scdv::visualize(dv)
```