{"id":24318222,"url":"https://github.com/rdmpage/glasgow-geoparser","last_synced_at":"2026-04-21T01:32:28.947Z","repository":{"id":142280172,"uuid":"409977596","full_name":"rdmpage/glasgow-geoparser","owner":"rdmpage","description":"Simple geoparsing using a gazetteer based on Wikidata and FlashText search","archived":false,"fork":false,"pushed_at":"2025-04-08T08:30:47.000Z","size":4061,"stargazers_count":1,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-25T13:00:50.142Z","etag":null,"topics":["flashtext","gazetteer","geoparser","geotagging","wikidata"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rdmpage.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2021-09-24T13:40:57.000Z","updated_at":"2025-04-08T08:30:51.000Z","dependencies_parsed_at":"2024-02-29T17:33:23.976Z","dependency_job_id":"7c0ed723-e7f9-4254-a6f5-87567952e5c0","html_url":"https://github.com/rdmpage/glasgow-geoparser","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/rdmpage/glasgow-geoparser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdmpage%2Fglasgow-geoparser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdmpage%2Fglasgow-geoparser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdmpage%2
Fglasgow-geoparser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdmpage%2Fglasgow-geoparser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rdmpage","download_url":"https://codeload.github.com/rdmpage/glasgow-geoparser/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rdmpage%2Fglasgow-geoparser/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32072953,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T21:26:33.338Z","status":"ssl_error","status_checked_at":"2026-04-20T21:26:22.081Z","response_time":94,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["flashtext","gazetteer","geoparser","geotagging","wikidata"],"created_at":"2025-01-17T14:37:36.751Z","updated_at":"2026-04-21T01:32:28.913Z","avatar_url":"https://github.com/rdmpage.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Glasgow Geoparser\n\nA simple tool to [geoparse](https://en.wikipedia.org/wiki/Toponym_resolution) using a gazetteer derived from Wikidata, combined with FlashText search. 
\n\nTypically geoparsing involves taking a body of text and undertaking the following two steps:\n- using [Named-entity recognition (NER)](https://en.wikipedia.org/wiki/Named-entity_recognition) to identify named entities in the text (e.g., place names, people names, etc.)\n- using a gazetteer of geographic names (e.g., [GeoNames](http://www.geonames.org)) to try to match the place names found by NER.\n\nAn example of such a parser is the [Edinburgh Geoparser](https://www.ltg.ed.ac.uk/software/geoparser/) (which the name “Glasgow Geoparser” is a play on). Typically geoparsing software can be large and tricky to install, especially if you are looking to make your installation publicly accessible. Geoparsing services seem to have a short half-life (e.g., [Geoparser.io](https://geoparser.io)), perhaps because they are so useful they quickly get swamped by users.\n\nBearing this in mind, the approach I’ve taken here is to create a very simple geoparser that is focussed on fairly large areas, especially those relevant to biodiversity, and is aimed at geoparsing text such as abstracts of scientific papers. It is not intended to be used to parse locality data for specimens, for example. For that level of granularity I think GBIF is probably the best gazetteer we have (see https://lyrical-money.glitch.me ).\n\nTo create the Glasgow Geoparser I fetch localities from Wikidata and create a CSV file with basic data including names and geographic coordinates, along the lines of [wikidata-gazetteer](https://github.com/Wikidata-Gazetteer/wikidata-gazetteer) but with fewer terms, and with a much smaller subset of Wikidata. Once this data is assembled I parse it and create a trie. This makes it quick to test whether any given string occurs in the dataset. I use the FlashText algorithm to parse a block of text and extract all strings that match names in the dataset. 
\n\n\n## Wikidata\n\nI use a series of SPARQL queries to generate CSV files of localities that I consider to be most likely to appear in the abstracts of articles relevant to biodiversity (e.g., taxonomic papers). The data fields are:\n\nkey | value\n-- | --\nwikidata_id | Wikidata id (QID)\nname | English name\nenwiki_title | title of English Wikipedia article\nalternate_names | names in other languages, synonyms\ncountry_code | ISO country code (e.g., FR)\nlatitude | latitude in decimal degrees\nlongitude | longitude in decimal degrees\ngeonames_id | id in GeoNames\nosm_id | id in OpenStreetMap\n\nFor some queries Wikidata times out, so an alternative is to use two queries. The first finds a set of ids that match a query (e.g., islands \u003e 20,000 km\u003csup\u003e2\u003c/sup\u003e in area). Those ids are then run through `idquery.php`, which chunks the set of ids into smaller pieces and runs a query for each chunk. In other words, we are simply asking for properties of known ids. This approach also enables us to include a list of localities that might not match a simple query (e.g., countries) but which are of interest.\n\n\n## FlashText\n\nThe algorithm for locating geographic names in text uses a [trie](https://en.wikipedia.org/wiki/Trie) and is described in a paper by [Vikash Singh](https://github.com/vi3k6i5).\n\n\u003e Singh, V. (2017). Replace or Retrieve Keywords In Documents at Scale. CoRR, abs/1711.00046. http://arxiv.org/abs/1711.00046\n\nTo implement this algorithm I used [Trie tree (prefix tree) detailed-PHP code implementation](https://www.programmerall.com/article/4530755185/) as a starting point, then modified it following the Python code given in the article by Vikash Singh. Typically in a trie there is a “stop” character to indicate the end of a word. In this case we also have a pointer to a data object, namely the data from Wikidata. 
Hence if we find a term in the trie we can instantly retrieve the corresponding data.\n\n## Output\n\nThe output of the algorithm is a list of all the strings in the text that match geographic names in the gazetteer. This can be easily converted into GeoJSON for display.\n\n\n## Build database\n\nDatasets are created in the `Wikidata` folder. Run `wikidata-to-trie.php` in the root folder to parse the CSV files into a trie structure.\n\n\n## Queries\n\n\n### Countries\n\n### ADM1 areas\n\n### Large islands\n\n\n\nSELECT * WHERE {\n  ?item wdt:P31 wd:Q23442.\n  \n  ?item wdt:P2046 ?area .\n  #FILTER(?area \u003e 100000) .\n  \n  \n}\nLIMIT 10\n\n\n\n\n## Examples to work through\n\n\nhttps://www.biodiversitylibrary.org/part/270384\n\nhttps://www.biodiversitylibrary.org/page/56235889\n\nNote that we need to remove line breaks, and that we also get hits in Africa and New Guinea(!)\n\n\u003e Richards, S.J., Oliver, P., Brown, R.M.: A new scansorial species of Platymantis (Anura: Ceratobatrachidae)...  (plate 1)  Guinea, New Britain Island, East New Britain  Province, Vouvou Camp: SAMA R64805.  Platymantis caesiops Kraus, Allison, 2009  Material examined: 2 specimens, Papua New  Guinea, New Britain Island, East New Britain  Province, Vouvou Camp: SAMA R10730, 10732.  Platymantis cheesmanae Parker, 1940  Material examined: 3 specimens, Indonesia,  Cyclops Mountains, Wambena Camp: SJR 6212,  6201, 6204.  Platymantis citrinospilus Brown, Richards,  Broadhead, 2013  Material examined: 4 specimens, Papua New  Guinea, New Britain Island, East New Britain  Province, Nakanai Mountains, Tompoi Camp, 1700  m above sea level: SAMA R64758 (holotype), SAMA  R64756, R64757, PNGNM 24042 (paratypes).  Platymantis desticans Brown, Richards, 2008  Material examined: 4 specimens, Solomon  Islands, Isabel Province, Barora Faa Island, (off  the western tip of Isabel Island): SAMA R56849  (holotype), and SAMA R56850-52 (paratypes).  
Platymantis gillardi Zweifel, 1960  Material examined: 17 specimens, Papua New  Guinea, Bismarck Archipelago, New Britain Island,  West New Britain Province, S coast, ca 7 mi NW  Pomugu, Kandrian: CAS-SU 22877-78; Papua  New Guinea, West New Britain Province, northern  Nakanai Mountains, ridge between the Ivule and  Sigole rivers on the northern edge of the Nakanai  Plateau: UWZM 23787-96, 23799-800; East New  Britain Province, Vouvou Camp: SAMA R64801-02.  Platymantis guppyi (Boulenger, 1884)  Material examined: 59 specimens, Papua New  Guinea, Bougainville Island, Bougainville Province, Camp Torokina: USNM 120852-53; Kunua: MCZ-A 38628, 38632-33, 38635, 38638-39, 38664-  666, 38668, 38674, KU 93736-40, 98159-65,  98468; Melilup: MCZ-A 38629, 38659-60, 38667,  38669-72, 59498-501; Mutahi: CAS 106553-  106565; Solomon Islands, Barora Faa Island (near  Isabel Island): SAMA R56839, 56840; Guadalcanal Island, Tadai District, Mt. Austen, Barana Village:  KU 307359, 307375-76, 307381, 307384-86.  Platymantis latro Richards, Mack, Austin, 2007  Material examined: 18 specimens, Papua New  Guinea, Admiralty Islands, Manus Province, Manus Island: KU 93750-54; Chachuau Camp near Tulu  1 Village: SAMA R62819 (holotype), UPNG 10051,  SAMA R62820; Natnewai Camp: SAMA R62826;  Lorengau: UPNG 10052-54, SAMA R62821-23;  Rambutyo Island, Penchal Village: SAMA R62827;  Los Negros Island, Salami Village: SAMA R62828-  29 (paratypes).  Platymantis macrops (Brown, 1965)  Material examined: 4 specimens, Solomon  Islands, North Solomons, Bougainville Island,  Bougainville Province, Kunua: MCZ-A 38195-96  (paratypes); Aresi, S. of Kunua: MCZ-A 41864  (holotype); Matsiogu: MCZ-A 78820.  
Platymantis macrosceles Zweifel, 1975  Material examined: 4 specimens, Papua New  Guinea, West New Britain Province, Ti, Nakanai  Mountains (central New Britain): BPBM 1005  (holotype); Nakanai Mountains, ridge between  the Ivule and Sigole Rivers: UWZM 23721, UPNG  10007; Papua New Guinea, East New Britain  Province, Vouvou Camp: SAMA R64815.  Platymantis magnus Brown, Menzies, 1979  Material examined: 4 specimens, Papua New  Guinea, New Ireland Island, New Ireland Province,  W. Coast, approx. 88 km S Kavieng (“Madina  High School area”): CAS 143640, (holotype); CAS  143639 (paratype); Utu, 1 km S, 5 km E Kavieng:  MCZ-A 92671-72 (paratypes).  Platymantis mamusiorum Foufopoulos, Brown,  2004  Material examined: 2 specimens, Papua New  Guinea, West New Britain Province, northern  Nakanai Mountains, ridge between the Ivule and  Sigole rivers on the northern edge of the Nakanai  Plateau (05°33.112’S, 151°04.269’E): UWZM  23720 (holotype), UWZM 23719, 23722, UPNG  9992 (Paratypes); Papua New Guinea, East New  Britain Province, Vouvou Camp: SAMA R64713-14.  Platymantis man us Kraus, Allison, 2009  Material examined: 2 specimens, Papua New  Guinea, Admiralty Islands, Manus Province, Manus Island, lorengau, MCZ-A 87512 (holotype), 87513  (paratopotype)  Platymantis mimicus Brown, Tyler, 1968  Material examined: 6 specimens, Papua New  Guinea, Bismarck Archipelago, New Britain Island,  West New Britain Province, ca 18 mi S of Talasea,  Numundo Plantation on Willaumez Peninsula: CAS-\n\n### Broke original trie code\n\n\"Cocirculation of Rio Negro Virus (RNV) and Pixuna Virus (PIXV) in Tucumán province, Argentina\" https://pubmed.ncbi.nlm.nih.gov/20497404/ https://doi.org/10.1111/j.1365-3156.2010.02541.x\n\n\u003e Venezuelan equine encephalitis complex includes viruses considered emerging pathogens for humans and animals in the Americas. 
Two members of this complex have been detected previously in Argentina: Rio Negro Virus (RNV), detected in mosquitoes from Chaco province and rodents from Formosa province, and Pixuna Virus (PIXV), detected in mosquitoes from Chaco province. To carry out surveillance studies in other parts of the country, detection of a 195-bp fragment of alphaviruses by RT-nested PCR was performed in mosquito samples from San Miguel de Tucumán city. Four pools resulted positive and three were sequenced. Two amplicons grouped with RNV and one with PIXV. This is the first report of viral activity of members of the Venezuelan equine encephalitis complex in north-eastern Argentina.\n\nFor one extracted point we had no data, which broke the GeoJSON export. Fixed https://github.com/rdmpage/glasgow-geoparser/commit/3327f6393a68b0822b2be3f5e0cb197ab823e3a6\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frdmpage%2Fglasgow-geoparser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frdmpage%2Fglasgow-geoparser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frdmpage%2Fglasgow-geoparser/lists"}