---
output: github_document
---

```{r, echo=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "man/images/"
)
```

# Newsmap: geographical document classifier

[![CRAN
Version](https://www.r-pkg.org/badges/version/newsmap)](https://CRAN.R-project.org/package=newsmap)
[![Downloads](https://cranlogs.r-pkg.org/badges/newsmap)](https://CRAN.R-project.org/package=newsmap)
[![Total
Downloads](https://cranlogs.r-pkg.org/badges/grand-total/newsmap?color=orange)](https://CRAN.R-project.org/package=newsmap)
[![R build
status](https://github.com/koheiw/newsmap/workflows/R-CMD-check/badge.svg)](https://github.com/koheiw/newsmap/actions)
[![codecov](https://codecov.io/gh/koheiw/newsmap/branch/master/graph/badge.svg)](https://codecov.io/gh/koheiw/newsmap)

Semi-supervised Bayesian model for geographical document classification. Newsmap automatically constructs a large geographical dictionary from a corpus to accurately classify documents. Currently, the **newsmap** package contains seed dictionaries in multiple languages: *English*, *German*, *French*, *Spanish*, *Portuguese*, *Russian*, *Italian*, *Arabic*, *Turkish*, *Hebrew*, *Japanese* and *Chinese*.

The details of the algorithm are explained in [Newsmap: semi-supervised approach to geographical news classification](https://www.tandfonline.com/eprint/dDeyUTBrhxBSSkHPn5uB/full). **newsmap** has also been used in scientific research in various fields ([Google Scholar](https://scholar.google.com/scholar?oi=bibs&hl=en&cites=3438152153062747083)).
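
The seed dictionaries are shipped as **quanteda** dictionary objects. As a quick way to see what the model starts from, the sketch below lists the top-level keys of the English seed dictionary `data_dictionary_newsmap_en`, which is used later in this example; countries sit at the third level of nesting, which is why `levels = 3` appears in the lookup step below.

```{r, eval=FALSE}
library(newsmap)
# Top-level keys are world regions; sub-regions and countries are nested below
names(data_dictionary_newsmap_en)
```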

## How to install

**newsmap** has been available on CRAN since version 0.6. You can install the package through the RStudio GUI or with the command below.

```{r, eval=FALSE}
install.packages("newsmap")
```

If you want the latest development version, install it from GitHub by running these commands in R. You need to have **devtools** installed beforehand.

```{r, eval=FALSE}
install.packages("devtools")
devtools::install_github("koheiw/newsmap")
```

## Example

In this example, we use the text analysis package [**quanteda**](https://quanteda.io) to preprocess textual data, and train a geographical classification model on a [corpus of news summaries collected from Yahoo News](https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1) via RSS in 2014.

### Download example data

```{r, eval=FALSE}
download.file('https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1',
              '~/yahoo-news.RDS', mode = "wb")
```

### Train Newsmap classifier

```{r}
require(newsmap)
require(quanteda)

# Load data
dat <- readRDS('~/yahoo-news.RDS')
dat$text <- paste0(dat$head, ". ", dat$body)
dat$body <- NULL
corp <- corpus(dat, text_field = 'text')

# Custom stopwords
month <- c('January', 'February', 'March', 'April', 'May', 'June',
           'July', 'August', 'September', 'October', 'November', 'December')
day <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday')
agency <- c('AP', 'AFP', 'Reuters')

# Select training period
sub_corp <- corpus_subset(corp, '2014-01-01' <= date & date <= '2014-12-31')

# Tokenize
toks <- tokens(sub_corp)
toks <- tokens_remove(toks, stopwords('english'), valuetype = 'fixed', padding = TRUE)
toks <- tokens_remove(toks, c(month, day, agency), valuetype = 'fixed', padding = TRUE)

# quanteda v1.5 introduced 'nested_scope' to reduce ambiguity in dictionary lookup
toks_label <- tokens_lookup(toks, data_dictionary_newsmap_en,
                            levels = 3, nested_scope = "dictionary")
dfmt_label <- dfm(toks_label)

dfmt_feat <- dfm(toks, tolower = FALSE)
dfmt_feat <- dfm_select(dfmt_feat, selection = "keep", '^[A-Z][A-Za-z1-2]+',
                        valuetype = 'regex', case_insensitive = FALSE) # keep only capitalized words as a proxy for proper nouns
dfmt_feat <- dfm_trim(dfmt_feat, min_termfreq = 10)

model <- textmodel_newsmap(dfmt_feat, dfmt_label)

# Features with largest weights
coef(model, n = 7)[c("us", "gb", "fr", "br", "jp")]
```
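
The trained model can also score documents outside the training set, provided they are preprocessed in the same way. A minimal sketch with a made-up sentence, assuming `predict()` accepts a document-feature matrix through `newdata` as in other quanteda-style text models:

```{r, eval=FALSE}
# Hypothetical out-of-sample text, aligned with the training features
# via dfm_match() before prediction
txt <- "Tokyo hosted a summit between Japan and the United States."
dfmt_new <- dfm(tokens(txt), tolower = FALSE)
dfmt_new <- dfm_match(dfmt_new, features = featnames(dfmt_feat))
predict(model, newdata = dfmt_new)
```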

### Predict geographical focus of texts

```{r}
pred_data <- data.frame(text = as.character(sub_corp), country = predict(model))
```

```{r echo=FALSE}
knitr::kable(head(pred_data))
```
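
Since `predict()` returns one country code per document, base R is enough to summarize the geographical distribution of the whole corpus:

```{r, eval=FALSE}
# Ten most frequently predicted country codes across the corpus
head(sort(table(pred_data$country), decreasing = TRUE), 10)
```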