Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/koheiw/wordmap
- Host: GitHub
- URL: https://github.com/koheiw/wordmap
- Owner: koheiw
- License: other
- Created: 2024-06-11T05:06:22.000Z (5 months ago)
- Default Branch: master
- Last Pushed: 2024-10-25T00:09:30.000Z (14 days ago)
- Last Synced: 2024-10-26T12:49:35.242Z (12 days ago)
- Language: R
- Size: 3.36 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.Rmd
- Changelog: NEWS.md
- License: LICENSE
---
output: github_document
editor_options:
  chunk_output_type: console
---

```{r, echo=FALSE, message=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "man/images/"
)
```

# Wordmap: Semi-supervised Multinomial Document Classifier
**wordmap** is a semi-supervised algorithm for multinomial document classification originally created for [newsmap](https://github.com/koheiw/newsmap). **wordmap** was separated from **newsmap** to expand the scope of its application beyond geographical classification of news.

The algorithm is also useful for extracting features associated with document metadata (industry group, patent class, etc.) from very large corpora. The list of features can be used to create a lexicon for dictionary analysis.
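
One possible way to turn a document variable into a label matrix is sketched below, using only **quanteda**. The corpus, the `industry` variable and all object names are hypothetical, and this construction is an assumption rather than a documented workflow of the package. The result has one row per document and one column per metadata value.

```r
library(quanteda)

# Hypothetical mini-corpus with an "industry" document variable
corp_meta <- corpus(
    c(d1 = "The bank raised interest rates on loans.",
      d2 = "The new drug passed its clinical trial.",
      d3 = "Lenders tightened credit for mortgages.",
      d4 = "The vaccine showed strong trial results."),
    docvars = data.frame(industry = c("finance", "pharma", "finance", "pharma"))
)

# Feature matrix: one row per document, one column per word
dfmt_words <- dfm(tokens(corp_meta, remove_punct = TRUE))

# Label matrix: treat each document's metadata value as its only token
lab <- docvars(corp_meta, "industry")
names(lab) <- docnames(corp_meta)
dfmt_meta <- dfm(as.tokens(as.list(lab)), tolower = FALSE)

print(dfmt_meta)
```

Under that assumption, `dfmt_meta` would take the place of a dictionary-based label matrix as the second argument of `textmodel_wordmap()`, with `dfmt_words` as the first.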
## How to install
**wordmap** has been available on CRAN since v0.8.0. You can install the package with the following R command.
```{r, eval=FALSE}
install.packages("wordmap")
```

If you want the latest version, install it from GitHub by running the commands below in R. This requires **devtools**, which the first command installs.
```{r, eval=FALSE}
install.packages("devtools")
devtools::install_github("koheiw/wordmap")
```

## Example
In this example, we identify the topics of sentences using a seed topic dictionary adopted from [Watanabe & Zhou (2020)](https://journals.sagepub.com/doi/full/10.1177/0894439320907027).
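
As a minimal sketch of what a seed dictionary does, the snippet below builds a small **quanteda** dictionary and uses it to label sentences. The topic names, seed words and sentences are hypothetical and are not taken from `data_dictionary_topic`. It uses the same `tokens_lookup()` and `dfm()` step that produces `dfmt_label` in the example that follows.

```r
library(quanteda)

# Hypothetical seed dictionary: topic names and seed words are illustrative only
dict_seed <- dictionary(list(
    economy = c("trade", "market*"),
    health = c("health*", "disease*")
))

# Hypothetical sentences; the third contains no seed word
txt <- c(s1 = "Open markets support trade.",
         s2 = "Diseases spread without healthcare.",
         s3 = "We thank the host country.")
toks_seed <- tokens(txt, remove_punct = TRUE)

# Seed words are replaced by their topic labels;
# sentences without any seed word end up with an all-zero row
dfmt_seed <- dfm(tokens_lookup(toks_seed, dict_seed))
print(dfmt_seed)
```

Only sentences that contain a seed word receive a label here, which is why the seed dictionary alone cannot classify every sentence.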
`data_corpus_ungd2017` contains transcripts of speeches delivered at the United Nations General Assembly in 2017.

```{r}
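require(quanteda)
require(wordmap)

# Seed topic dictionary shipped with the package
dict <- data_dictionary_topic
print(dict)

# Split the speeches into sentences (the default unit of corpus_reshape())
corp <- data_corpus_ungd2017 %>%
    corpus_reshape()

# Tokenize, dropping URLs, numbers, stopwords and one-character tokens
toks <- tokens(corp, remove_url = TRUE, remove_numbers = TRUE) %>%
    tokens_remove(stopwords("en"), min_nchar = 2, padding = TRUE) #%>%
    #tokens_remove("^[A-Z]", valuetype = "regex", case_insensitive = FALSE, padding = TRUE)

# Features: words that occur at least five times
dfmt_feat <- dfm(toks, remove_padding = TRUE) %>%
    dfm_trim(min_termfreq = 5)
# Labels: seed words replaced by their dictionary topics
dfmt_label <- tokens_lookup(toks, dict) %>%
    dfm()

# Fit the model and inspect the feature coefficients
map <- textmodel_wordmap(dfmt_feat, dfmt_label)
coef(map)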
```

### Predict topics of sentences
```{r}
dat <- data.frame(text = corp, topic = predict(map))
```

```{r echo=FALSE}
knitr::kable(head(dat, 10))
```

### Create a topic dictionary

Create a **quanteda** dictionary object from the extracted features. The dictionary can be used to analyze other corpora.
```{r}
as.dictionary(map, n = 100)
```
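
A rough sketch of applying such a dictionary to other documents is shown below. The hand-written `dict_new` is a hypothetical stand-in for the output of `as.dictionary(map, n = 100)`, and the texts are invented. The sketch counts dictionary matches per document with `dfm_lookup()` and assigns each document the topic with the most matches.

```r
library(quanteda)

# Hypothetical stand-in for the result of as.dictionary(map, n = 100)
dict_new <- dictionary(list(
    security = c("peace", "conflict", "terrorism"),
    environment = c("climate", "emissions", "ocean")
))

# Hypothetical documents from another corpus
txt_new <- c(d1 = "Climate change threatens every ocean.",
             d2 = "We must end the conflict and restore peace.")
dfmt_new <- dfm(tokens(txt_new, remove_punct = TRUE))

# Count how many words of each topic occur in each document
dfmt_topic <- dfm_lookup(dfmt_new, dict_new)

# Assign the topic with the highest count (first topic wins ties)
topic <- featnames(dfmt_topic)[max.col(as.matrix(dfmt_topic), ties.method = "first")]
print(data.frame(doc = docnames(dfmt_topic), topic = topic))
```

Counting matches is the simplest form of dictionary analysis. Documents with no matches would still be assigned the first topic here, so a real analysis should handle all-zero rows separately.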