---
output: github_document
---

```{r, echo=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "man/images/"
)
```

# Newsmap: geographical document classifier

[![CRAN
Version](https://www.r-pkg.org/badges/version/newsmap)](https://CRAN.R-project.org/package=newsmap)
[![Downloads](https://cranlogs.r-pkg.org/badges/newsmap)](https://CRAN.R-project.org/package=newsmap)
[![Total
Downloads](https://cranlogs.r-pkg.org/badges/grand-total/newsmap?color=orange)](https://CRAN.R-project.org/package=newsmap)
[![R build
status](https://github.com/koheiw/newsmap/workflows/R-CMD-check/badge.svg)](https://github.com/koheiw/newsmap/actions)
[![codecov](https://codecov.io/gh/koheiw/newsmap/branch/master/graph/badge.svg)](https://codecov.io/gh/koheiw/newsmap)

Semi-supervised Bayesian model for geographical document classification. Newsmap automatically constructs a large geographical dictionary from a corpus to accurately classify documents. Currently, the **newsmap** package contains seed dictionaries in multiple languages: *English*, *German*, *French*, *Spanish*, *Portuguese*, *Russian*, *Italian*, *Arabic*, *Turkish*, *Hebrew*, *Japanese*, and *Chinese*.
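
The English seed dictionary used in the example below ships with the package as `data_dictionary_newsmap_en` (the dictionaries for the other languages follow the same naming pattern). A quick way to inspect the seed entries, assuming the package is installed:

```{r, eval=FALSE}
require(newsmap)
# Print the English seed dictionary of geographical keywords
data_dictionary_newsmap_en
```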

The details of the algorithm are explained in [Newsmap: semi-supervised approach to geographical news classification](https://www.tandfonline.com/eprint/dDeyUTBrhxBSSkHPn5uB/full). **newsmap** has also been used in scientific research in various fields ([Google Scholar](https://scholar.google.com/scholar?oi=bibs&hl=en&cites=3438152153062747083)).

## How to install

**newsmap** has been available on CRAN since version 0.6. You can install the package through the RStudio GUI or with the command below.

```{r, eval=FALSE}
install.packages("newsmap")
```

If you want the latest development version, install it from GitHub with **devtools** by running these commands in R.

```{r, eval=FALSE}
install.packages("devtools")
devtools::install_github("koheiw/newsmap")
```

## Example

In this example, we use the text analysis package [**quanteda**](https://quanteda.io) to preprocess textual data and train a geographical classification model on a [corpus of news summaries collected from Yahoo News](https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1) via RSS in 2014.

### Download example data

```{r, eval=FALSE}
download.file('https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1',
              '~/yahoo-news.RDS', mode = "wb")
```

### Train Newsmap classifier

```{r}
require(newsmap)
require(quanteda)

# Load data
dat <- readRDS('~/yahoo-news.RDS')
dat$text <- paste0(dat$head, ". ", dat$body)
dat$body <- NULL
corp <- corpus(dat, text_field = 'text')

# Custom stopwords
month <- c('January', 'February', 'March', 'April', 'May', 'June',
           'July', 'August', 'September', 'October', 'November', 'December')
day <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday')
agency <- c('AP', 'AFP', 'Reuters')

# Select training period
sub_corp <- corpus_subset(corp, '2014-01-01' <= date & date <= '2014-12-31')

# Tokenize
toks <- tokens(sub_corp)
toks <- tokens_remove(toks, stopwords('english'), valuetype = 'fixed', padding = TRUE)
toks <- tokens_remove(toks, c(month, day, agency), valuetype = 'fixed', padding = TRUE)

# quanteda v1.5 introduced 'nested_scope' to reduce ambiguity in dictionary lookup
toks_label <- tokens_lookup(toks, data_dictionary_newsmap_en,
                            levels = 3, nested_scope = "dictionary")
dfmt_label <- dfm(toks_label)

dfmt_feat <- dfm(toks, tolower = FALSE)
dfmt_feat <- dfm_select(dfmt_feat, selection = "keep", '^[A-Z][A-Za-z1-2]+',
                        valuetype = 'regex', case_insensitive = FALSE) # keep only capitalized features (proper nouns)
dfmt_feat <- dfm_trim(dfmt_feat, min_termfreq = 10)

model <- textmodel_newsmap(dfmt_feat, dfmt_label)

# Features with the largest weights for selected countries (ISO 3166-1 alpha-2 codes)
coef(model, n = 7)[c("us", "gb", "fr", "br", "jp")]
```
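
Since `coef()` returns a list keyed by country code, the top features for a single country can be pulled out directly; a small usage sketch:

```{r, eval=FALSE}
# Top 20 features associated with Japan ("jp")
coef(model, n = 20)[["jp"]]
```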

### Predict geographical focus of texts

```{r}
pred_data <- data.frame(text = as.character(sub_corp), country = predict(model))
```

```{r echo=FALSE}
knitr::kable(head(pred_data))
```
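
As a quick sanity check, the distribution of predicted countries can be tabulated with base R; a minimal sketch using the `pred_data` frame created above:

```{r, eval=FALSE}
# Most frequently predicted countries in the 2014 corpus
head(sort(table(pred_data$country), decreasing = TRUE))
```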