{"id":13857939,"url":"https://github.com/koheiw/newsmap","last_synced_at":"2025-09-08T14:31:50.283Z","repository":{"id":13086367,"uuid":"58731130","full_name":"koheiw/newsmap","owner":"koheiw","description":"Semi-supervised algorithm for geographical document classification","archived":false,"fork":false,"pushed_at":"2024-05-23T01:42:22.000Z","size":1924,"stargazers_count":56,"open_issues_count":8,"forks_count":21,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-05-30T03:09:46.879Z","etag":null,"topics":["machine-learning","news-stories","quanteda","text-analysis"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/koheiw.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-05-13T10:54:00.000Z","updated_at":"2024-06-11T05:51:16.157Z","dependencies_parsed_at":"2024-04-10T22:43:33.821Z","dependency_job_id":"07b592e1-80ec-41a9-a4ca-36758cc62b17","html_url":"https://github.com/koheiw/newsmap","commit_stats":{"total_commits":349,"total_committers":14,"mean_commits":"24.928571428571427","dds":0.09742120343839544,"last_synced_commit":"07d7c303ec621257e0616dbcca1a73d2e3fe0080"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koheiw%2Fnewsmap","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koheiw%2Fnewsmap/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koheiw%2Fnewsmap/releases","manifests_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/koheiw%2Fnewsmap/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/koheiw","download_url":"https://codeload.github.com/koheiw/newsmap/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":232314020,"owners_count":18504026,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","news-stories","quanteda","text-analysis"],"created_at":"2024-08-05T03:01:51.295Z","updated_at":"2025-01-03T09:12:30.038Z","avatar_url":"https://github.com/koheiw.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"---\noutput: github_document\n---\n\n```{r, echo=FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"##\",\n  fig.path = \"man/images/\"\n)\n```\n\n# Newsmap: geographical document classifier\n\n\u003c!-- badges: start --\u003e\n\n[![CRAN\nVersion](https://www.r-pkg.org/badges/version/newsmap)](https://CRAN.R-project.org/package=newsmap)\n[![Downloads](https://cranlogs.r-pkg.org/badges/newsmap)](https://CRAN.R-project.org/package=newsmap)\n[![Total\nDownloads](https://cranlogs.r-pkg.org/badges/grand-total/newsmap?color=orange)](https://CRAN.R-project.org/package=newsmap)\n[![R build\nstatus](https://github.com/koheiw/newsmap/workflows/R-CMD-check/badge.svg)](https://github.com/koheiw/newsmap/actions)\n[![codecov](https://codecov.io/gh/koheiw/newsmap/branch/master/graph/badge.svg)](https://codecov.io/gh/koheiw/newsmap)\n\u003c!-- badges: end --\u003e\n\nSemi-supervised Bayesian model for 
geographical document classification. Newsmap automatically constructs a large geographical dictionary from a corpus to classify documents accurately. Currently, the **newsmap** package contains seed dictionaries in multiple languages: *English*, *German*, *French*, *Spanish*, *Portuguese*, *Russian*, *Italian*, *Arabic*, *Turkish*, *Hebrew*, *Japanese*, and *Chinese*.\n\nThe algorithm is explained in detail in [Newsmap: semi-supervised approach to geographical news classification](https://www.tandfonline.com/eprint/dDeyUTBrhxBSSkHPn5uB/full). **newsmap** has also been used in scientific research in various fields ([Google Scholar](https://scholar.google.com/scholar?oi=bibs\u0026hl=en\u0026cites=3438152153062747083)).\n\n## How to install\n\n**newsmap** has been available on CRAN since version 0.6. You can install the package using the RStudio GUI or the following command.\n\n```{r, eval=FALSE}\ninstall.packages(\"newsmap\")\n```\n\nIf you want the latest version, install it from GitHub by running the following commands in R. You need to have **devtools** installed beforehand.\n\n```{r, eval=FALSE}\ninstall.packages(\"devtools\")\ndevtools::install_github(\"koheiw/newsmap\")\n```\n\n## Example\n\nIn this example, we use the text analysis package [**quanteda**](https://quanteda.io) to preprocess textual data and train a geographical classification model on a [corpus of news summaries collected from Yahoo News](https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1) via RSS in 2014.\n\n### Download example data\n\n```{r, eval=FALSE}\ndownload.file('https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1', \n              '~/yahoo-news.RDS', mode = \"wb\")\n```\n\n### Train Newsmap classifier\n\n```{r}\nrequire(newsmap)\nrequire(quanteda)\n\n# Load data\ndat \u003c- readRDS('~/yahoo-news.RDS')\ndat$text \u003c- paste0(dat$head, \". 
\", dat$body)\ndat$body \u003c- NULL\ncorp \u003c- corpus(dat, text_field = 'text')\n\n# Custom stopwords\nmonth \u003c- c('January', 'February', 'March', 'April', 'May', 'June',\n           'July', 'August', 'September', 'October', 'November', 'December')\nday \u003c- c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday')\nagency \u003c- c('AP', 'AFP', 'Reuters')\n\n# Select training period\nsub_corp \u003c- corpus_subset(corp, '2014-01-01' \u003c= date \u0026 date \u003c= '2014-12-31')\n\n# Tokenize\ntoks \u003c- tokens(sub_corp)\ntoks \u003c- tokens_remove(toks, stopwords('english'), valuetype = 'fixed', padding = TRUE)\ntoks \u003c- tokens_remove(toks, c(month, day, agency), valuetype = 'fixed', padding = TRUE)\n\n# quanteda v1.5 introduced 'nested_scope' to reduce ambiguity in dictionary lookup\ntoks_label \u003c- tokens_lookup(toks, data_dictionary_newsmap_en, \n                            levels = 3, nested_scope = \"dictionary\")\ndfmt_label \u003c- dfm(toks_label)\n\ndfmt_feat \u003c- dfm(toks, tolower = FALSE)\ndfmt_feat \u003c- dfm_select(dfmt_feat, selection = \"keep\", '^[A-Z][A-Za-z1-2]+', \n                        valuetype = 'regex', case_insensitive = FALSE) # include only proper nouns to model\ndfmt_feat \u003c- dfm_trim(dfmt_feat, min_termfreq = 10)\n\nmodel \u003c- textmodel_newsmap(dfmt_feat, dfmt_label)\n\n# Features with largest weights\ncoef(model, n = 7)[c(\"us\", \"gb\", \"fr\", \"br\", \"jp\")]\n```\n\n### Predict geographical focus of texts \n\n```{r}\npred_data \u003c- data.frame(text = as.character(sub_corp), country = predict(model))\n```\n\n```{r echo=FALSE}\nknitr::kable(head(pred_data))\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkoheiw%2Fnewsmap","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkoheiw%2Fnewsmap","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkoheiw%2Fnewsmap/lists"}