https://github.com/dime-worldbank/ulex

Unique Location Extractor
Host: GitHub
URL: https://github.com/dime-worldbank/ulex
Owner: dime-worldbank
License: other
Created: 2024-06-06T10:36:43.000Z (7 months ago)
Default Branch: main
Last Pushed: 2024-07-04T02:37:40.000Z (6 months ago)
Last Synced: 2024-10-25T07:29:22.033Z (2 months ago)
Language: R
Size: 5.46 MB
Stars: 1
Watchers: 3
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.Rmd
- License: LICENSE

README

---
output: github_document
---

```{r, include = FALSE}

knitr::opts_chunk$set(

  collapse = TRUE,

  comment = "#>",

  fig.path = "man/figures/README-",

  out.width = "100%"

)

```

# Unique Location Extractor (ULEx) 

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/ulex)](https://cran.r-project.org/package=ulex)

[![activity](https://img.shields.io/github/commit-activity/m/dime-worldbank/ulex)](https://github.com/dime-worldbank/ulex/graphs/commit-activity)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/license/mit)

[![R-CMD-check](https://github.com/dime-worldbank/ulex/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/dime-worldbank/ulex/actions/workflows/R-CMD-check.yaml)

* [Overview](#overview)

* [Installation](#installation)

* [Functions](#main-functions)

* [Quick start](#quick-start)

* [Additional information on functions](#addn-info)

## Overview 

Text often refers to the locations of events, and we may want to extract where an event occurred. For example, consider this tweet reporting a road traffic crash in Nairobi, Kenya, where we want to determine the location of the crash:

> crash occurred near garden city on thika road on your way towards roysambu.

The tweet contains three location references: (1) garden city, (2) Thika road and (3) roysambu. The name 'garden city' is shared by multiple locations. Here, we want to extract the garden city location on Thika road, which represents the crash site.

__The Unique Location Extractor (ULEx) geoparses text to extract the unique location of events.__ The algorithm first determines which location references refer to the event of interest and which location references should be ignored. The algorithm then determines the location of the event by checking text against dictionaries of landmarks, roads, and areas (such as neighborhoods). Moreover, the algorithm accounts for differences in spelling between how a user writes a location and how the location is captured in a dictionary of locations; users may use short, informal names while a location dictionary may contain formal names. For example, a user may write _"crash near mathare center"_, while a landmark dictionary contains _"mathare social justice centre"_.
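
The spelling tolerance described above can be illustrated with a small base R sketch. This is not the package's implementation: `fuzzy_lookup` is a hypothetical helper, the toy gazetteer is made up, and the threshold of two edits is an assumption chosen for the example. It assumes the gazetteer already contains a shortened name such as `mathare centre`.

``` r
# Toy gazetteer; the entries are illustrative, not taken from the package.
gazetteer <- c("mathare centre", "garden city", "roysambu")

# Return gazetteer names within `max_dist` edits of `phrase`.
# `fuzzy_lookup` is a hypothetical helper, not a ulex function.
fuzzy_lookup <- function(phrase, dict_names, max_dist = 2) {
  # Levenshtein distance from the phrase to every dictionary name
  d <- drop(utils::adist(phrase, dict_names))
  dict_names[d <= max_dist]
}

# "center" and "centre" differ by two edits, so the entry is still found
fuzzy_lookup("mathare center", gazetteer)
#> [1] "mathare centre"
```

A small threshold keeps short names from matching unrelated entries; a suitable value depends on the gazetteer and is left as a tuning choice here.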

This package was originally developed to extract locations of road traffic crashes from reports of crashes via Twitter, specifically in the context of Nairobi, Kenya using the Twitter feed [@Ma3Route](https://twitter.com/Ma3Route?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor). For more information, see our article here:

> [Milusheva S, Marty R, Bedoya G, Williams S, Resor E, Legovini A (2021) Applying machine learning and geolocation techniques to social media data (Twitter) to develop a resource for urban planning. PLoS ONE 16(2): e0244317. https://doi.org/10.1371/journal.pone.0244317](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0244317)

## Installation 

The package can be installed via CRAN.

``` r

install.packages("ulex")

```

You can install the development version of `ulex` from GitHub with:

``` r

# install.packages("devtools")

devtools::install_github("dime-worldbank/ulex")

```

## Functions 

The package contains two functions:

* __augment_gazetteer:__ The backbone of locating events is looking up location references in a gazetteer, or geographic dictionary. The `augment_gazetteer` function facilitates cleaning a gazetteer that may have been constructed from sources such as [OpenStreetMap](https://cran.r-project.org/web/packages/osmdata/vignettes/osmdata.html), [Geonames](https://github.com/ropensci/geonames) or [Google Maps](https://www.rdocumentation.org/packages/googleway/versions/2.7.1/topics/google_places). For more information on the function, see [here](#addn-aug).

* __locate_event:__ Takes text as input and returns the location of the relevant event. Key inputs include the text to geoparse, a gazetteer of landmarks, spatial files of roads and areas (e.g., neighborhoods) and a list of event words. For more information on the function, see [here](#addn-loc).

## Quick Start 

* [Setup](#setup)

* [Create location datasets](#create-loc-data)

  - [Dataset of Wards](#create-areas)

  - [Dataset of roads](#create-roads)

  - [Dataset of landmarks (landmark gazetteer)](#create-landmarks)

* [Augment gazetteer](#aug-gazettee)

* [Locate events](#loc-events)

### Setup 

```{r, message = F, warning=F}

# Load ULEx

library(ulex)

## Load other packages, such as those for creating location dictionaries

library(dplyr)

library(geodata)

library(osmdata)

library(basemaps)

library(sf)

library(ggplot2)

library(stringr)

```

### Create location datasets 

#### Dataset of Wards 

We create a dataset of Wards in Nairobi from [GADM](https://gadm.org/data.html).

```{r, message = F, warning=F}

ken_sf <- gadm(country = "KEN", level = 3, path = tempdir()) %>% st_as_sf()

nbo_sf <- ken_sf %>%

  filter(NAME_1 %in% "Nairobi") %>%

  rename(name = NAME_3) %>%

  dplyr::select(name)

head(nbo_sf)

```

#### Dataset of roads 

We create a dataset of roads from [OpenStreetMap](https://www.openstreetmap.org/).

```{r, message = F, warning=F}

roads_sf <- opq(st_bbox(nbo_sf), timeout = 999) %>%

  add_osm_feature(key = "highway", value = c("motorway",

                                             "trunk",

                                             "primary",

                                             "secondary",

                                             "tertiary",

                                             "unclassified")) %>%

  osmdata_sf()

roads_sf <- roads_sf$osm_lines

roads_sf <- roads_sf %>%

  filter(!is.na(name)) %>%

  dplyr::select(name) %>%

  mutate(name = name %>% tolower())

head(roads_sf)

```

#### Dataset of landmarks (landmark gazetteer) 

We create a gazetteer of landmarks from [OpenStreetMap](https://www.openstreetmap.org/), using all amenities and bus stops.

```{r, message = F, warning=F}

# Amenities --------------------------------------------------------------------

amenities_sf <- opq(st_bbox(nbo_sf), timeout = 999) %>%

  add_osm_feature(key = "amenity") %>%

  osmdata_sf()

amenities_pnt_sf <- amenities_sf$osm_points

amenities_ply_sf <- amenities_sf$osm_polygons %>%

  st_centroid()

amenities_sf <- bind_rows(amenities_pnt_sf,

                          amenities_ply_sf) %>%

  dplyr::mutate(type = amenity)

# Bus Stops --------------------------------------------------------------------

busstops_sf <- opq(st_bbox(nbo_sf), timeout = 999) %>%

  add_osm_feature(key = "highway",

                  value = "bus_stop") %>%

  osmdata_sf()

busstops_sf <- busstops_sf$osm_points

busstops_sf <- busstops_sf %>%

  mutate(type = "bus_stop")

# Append -----------------------------------------------------------------------

landmarks_sf <- bind_rows(amenities_sf,

                          busstops_sf) %>%

  filter(!is.na(name)) %>%

  dplyr::select(name, type) %>%

  mutate(name = name %>% tolower())

head(landmarks_sf)

```

#### Map landmark, road, and area dictionaries

The map below shows the locations in the landmark, road, and area dictionaries.

```{r, message = F, warning=F}

ggplot() +

  geom_sf(data = roads_sf,

          aes(color = "Roads"),

          linewidth = 0.6) +

  geom_sf(data = landmarks_sf,

          aes(color = "Landmarks"),

          size = 0.1,

          alpha = 0.5) +

  geom_sf(data = nbo_sf,

          fill = "gray",

          aes(color = "Wards"),

          linewidth = 0.5,

          alpha = 0.2) +

  labs(color = NULL,

       title = "Landmarks, Roads, and Wards") +

  scale_color_manual(values = c("blue", "chartreuse3", "black")) +

  theme_void() +

  theme(plot.title = element_text(face = "bold"))

```

### Augment Gazetteer 

Here, we augment the landmark gazetteer, which increases the number of entries from about 11,000 to about 50,000.

```{r, message = F, warning=F}

landmarks_aug_sf <- augment_gazetteer(landmarks_sf)

print(nrow(landmarks_sf))

print(nrow(landmarks_aug_sf))

head(landmarks_aug_sf)

```

### Locate Events 

We geolocate the crashes described in five texts.

```{r, message = F, warning=F}

texts <- c("crash at garden city",

            "crash occurred near garden city on thika road towards roysambu",

            "crash at intersection of juja road and outer ring rd",

            "crash occured near roysambu on thika rd",

            "crash near mathare centre along juja road")

crashes_sf <- locate_event(text = texts,

                           landmark_gazetteer = landmarks_aug_sf,

                           areas = nbo_sf,

                           roads = roads_sf,

                           event_words = c("accident", "crash", "collision", 

                                           "wreck", "overturn"))

```

```{r, message = F, warning=F}

ext <- crashes_sf %>%

  st_buffer(dist = 500) %>%

  st_bbox()

ggplot() +

  geom_sf() +

  basemap_gglayer(ext) +

  geom_sf(data = crashes_sf %>%

            st_transform(3857),

          pch = 21,

          color = "black",

          fill = "red") +

  scale_fill_identity() + 

  theme_void()

```

The output of `locate_event()` has the following variables:

* __text:__ Original text to geocode.

* __matched_words_correct_spelling:__ Names of locations used to geocode the event, as names appear in landmark, road, and area datasets.

* __matched_words_text_spelling:__ Names of locations used to geocode event, as names appear in text.      

* __dist_closest_event_word:__ Distance of landmark to event word (i.e., number of words between event word and location word).

* __type:__ Type of location (e.g., landmark, intersection).                            

* __how_determined_location:__ Information on how location was determined.           

* __dist_mentioned_road_m:__ Distance (meters) of event location to mentioned road.             

* __lon_all:__ All landmark locations found in text (longitude).                           

* __lat_all:__ All landmark locations found in text (latitude).                           

* __landmarks_all_text_spelling:__ Names of all landmarks found, as names appear in text.      

* __landmarks_all_correct_spelling:__ Names of all landmarks found, as names appear in landmark gazetteer.   

* __landmarks_all_location:__ Names of landmarks and locations (name,latitude,longitude).            

* __roads_all_text_spelling:__ Names of roads in text, as names appear in text.          

* __roads_all_correct_spelling:__ Names of roads in text, as names appear in road dataset.

* __intersection_all_text_spelling:__ Name of intersection (e.g., pairs of roads that make intersection), as names appear in text.

* __intersection_all_correct_spelling:__ Name of intersection (e.g., pairs of roads that make intersection), as names appear in road dataset.

* __intersection_all_location:__ Name and locations of intersections (name,latitude,longitude).         

* __geometry:__ Geometry of event location.

```{r, message = F, warning=F}

head(crashes_sf)

```

## Additional information on functions 

### `augment_gazetteer()` 

The `augment_gazetteer` function adds additional landmarks to account for different ways of saying the same landmark name. For example, raw gazetteers may contain long, formal names, where shorter versions of the name are more often used. In addition, the function facilitates removing landmark names that are spurious or may confuse the algorithm; these include landmark names that are common words that may be used in different contexts, or frequent and generic landmarks such as `hotel`. Key components of the function include:

1. Adding additional landmarks based off of n-grams and skip-grams of landmark names. For example, from the original landmark `garden city mall`, the following landmarks will be added: `garden city`, `city mall`, and  `garden mall`.

2. Adding landmarks according to a set of rules: for example, if a landmark starts or ends with a certain word, an alternative version of the landmark is added that removes that word. Here, words that describe the category of a landmark are removed, because a user may not mention the category; for example, a user will more likely say `McDonalds` than `McDonalds restaurant`.

3. Removing landmarks that refer to large geographic areas (e.g., roads). Roads and areas are dealt with separately; this function focuses on cleaning a gazetteer of specific points/landmarks.
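
The first two components above can be sketched in base R. This is only an illustration under stated assumptions, not the code used inside `augment_gazetteer`: `two_word_variants` and `drop_category_word` are hypothetical helpers, the list of category words is made up, and only trailing category words are handled.

``` r
# Derive two-word variants of a landmark name: adjacent pairs (bigrams)
# and non-adjacent pairs (skip-grams), keeping the original word order.
# `two_word_variants` is a hypothetical helper, not part of ulex.
two_word_variants <- function(name) {
  words <- strsplit(name, "\\s+")[[1]]
  if (length(words) < 3) return(character(0))  # nothing shorter to derive
  pairs <- utils::combn(words, 2)              # one ordered pair per column
  unique(apply(pairs, 2, paste, collapse = " "))
}

# Drop a trailing category word, if present.
# `drop_category_word` and its default word list are assumptions.
drop_category_word <- function(name,
                               category_words = c("restaurant", "hotel")) {
  pattern <- paste0("\\s+(", paste(category_words, collapse = "|"), ")$")
  sub(pattern, "", name)
}

two_word_variants("garden city mall")
#> [1] "garden city" "garden mall" "city mall"

drop_category_word("mcdonalds restaurant")
#> [1] "mcdonalds"
```

Each derived name would inherit the coordinates of the landmark it came from, which is why the augmented gazetteer has many more rows than the raw one.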

__Pages S4 to S6 in the supplementary information file [here](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0244317#sec005) provide further details on the augment gazetteer algorithm.__

### `locate_event()` 

The `locate_event` function extracts landmarks from text and determines the unique location of events from the text. The algorithm works in two steps: (1) finding locations in text and (2) determining a unique location.

_Finding location references in text_

To extract location references from text, the function implements the following steps.

1. Determines whether any text matches names in the gazetteer. Both exact and 'fuzzy' matches (allowing a certain Levenshtein distance) are used.

2. Relies on words after prepositions to find locations. The algorithm starts with a word after a preposition and extracts all landmarks that contain that word. Then, the algorithm takes the next word in the text and further subsets the landmarks. This process is repeated until adding a word removes all landmarks. If a road or area (e.g., neighborhood) is found in the previous step, only landmarks near that road or neighborhood are considered. Landmarks with the fewest words are kept (i.e., if this process finds 5 landmarks with 2 words and 7 landmarks with 3 words, only the 5 landmarks with 2 words are kept).

3. If a road or area is mentioned and a landmark is not near that road or area, longer versions of the landmark that are near the road or area are searched for. For example, if a user says `crash near garden on thika road`, the algorithm may extract multiple landmarks with the name `garden`, none of which are near Thika road. It will then search for all landmarks that contain `garden` in them (e.g., `garden city mall`) that are near Thika road.

4. If two roads are mentioned, the algorithm extracts the intersection of the roads.
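
The word-by-word narrowing in step 2 can be sketched as follows. This is a simplified illustration, not the package's code: `match_after_preposition` is a hypothetical helper, the preposition list and toy landmark names are assumptions, and the restriction to landmarks near a mentioned road or area is left out.

``` r
# `match_after_preposition` is a hypothetical helper, not a ulex function.
match_after_preposition <- function(text, landmarks,
                                    prepositions = c("at", "near", "on")) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  start <- which(words %in% prepositions)[1] + 1  # word after first preposition
  if (is.na(start) || start > length(words)) return(character(0))

  # Whole-word test: does each landmark name include the word `w`?
  contains_word <- function(landmark_names, w) {
    vapply(strsplit(landmark_names, "\\s+"), function(x) w %in% x, logical(1))
  }

  candidates <- landmarks
  matched_any <- FALSE
  for (w in words[start:length(words)]) {
    narrowed <- candidates[contains_word(candidates, w)]
    if (length(narrowed) == 0) break  # this word would remove every landmark
    candidates <- narrowed
    matched_any <- TRUE
  }
  if (!matched_any) return(character(0))

  # Keep the candidates with the fewest words
  n_words <- lengths(strsplit(candidates, "\\s+"))
  candidates[n_words == min(n_words)]
}

landmarks <- c("garden city", "garden city mall", "garden estate", "city hall")
match_after_preposition("crash near garden city on thika road", landmarks)
#> [1] "garden city"
```

In this example `garden` keeps three landmarks, `city` narrows them to two, and `on` matches none, so the loop stops and the shorter of the two remaining names is returned.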

_Determine unique location_

After extracting landmarks, the algorithm seeks to identify a single location using a series of steps. These steps consider a defined list of event words (e.g., for road traffic crashes, these could include 'crash', 'accident', 'overturn', etc.), whether the user mentions a junction word (e.g., 'junction' or 'intersection') and a list of prepositions. Certain prepositions are given precedence over others to distinguish between locations indicating the location of an event versus locations further away that provide additional context; for example, `at` takes higher precedence than `towards`. The following main steps are applied in order:

1. Locations that follow the pattern [event word] [preposition] [location] are extracted.

2. Locations that follow the pattern [preposition] [location] are extracted. If there are multiple occurrences, the location near the higher-order preposition is used. If there is a tie, the location closest to the event word is used.

3. If a junction word is used, two roads are mentioned, and the two roads intersect once, the intersection point is used.

4. The location closest to the event word within the text is used.

5. If the location name has multiple locations, we (1) restrict to locations near any mentioned road or area, (2) check for a dominant cluster of locations and (3) prioritize certain landmark types over others (e.g., a user is more likely to reference a large, well known location type like a stadium).

6. If a landmark is not found, but a road or area is found, the road or area is returned. If both a road and an area are mentioned, the intersection of the road and area is returned.
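
The preposition precedence and tie-breaking rule in step 2 can be sketched in base R. This is a minimal illustration, not the package's logic: `pick_location` is a hypothetical helper, the candidate table and its distances are made up, and the ordering of `near` and `on` is an assumption (the text above only states that `at` ranks above `towards`).

``` r
# Assumed precedence, highest first. Only "at" above "towards" comes from
# the description above; the positions of "near" and "on" are illustrative.
preposition_rank <- c("at", "near", "on", "towards")

# Choose the candidate with the highest-ranked preposition, breaking ties
# by the number of words separating it from the event word.
# `pick_location` is a hypothetical helper, not a ulex function.
pick_location <- function(candidates, ranking = preposition_rank) {
  rank <- match(candidates$preposition, ranking)  # NA for unknown prepositions
  ord <- order(rank, candidates$dist_event_word)  # NA ranks sort last
  candidates$location[ord[1]]
}

# Illustrative candidates for the example tweet in the overview
candidates <- data.frame(
  location = c("garden city", "thika road", "roysambu"),
  preposition = c("near", "on", "towards"),
  dist_event_word = c(2, 5, 10),
  stringsAsFactors = FALSE
)

pick_location(candidates)
#> [1] "garden city"
```

Under this assumed ranking, `roysambu` is ignored as context because `towards` ranks lowest.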

__Pages S15 to S19 in the supplementary information file [here](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0244317#sec005) provide further details on the locate event algorithm.__