{"id":20915273,"url":"https://github.com/benjamindpb/wikidata-preprocessing","last_synced_at":"2025-07-15T22:20:59.382Z","repository":{"id":45429527,"uuid":"495136248","full_name":"benjamindpb/wikidata-preprocessing","owner":"benjamindpb","description":"Wikidata dump preprocessing \u0026 analysis of georreferencial entities","archived":false,"fork":false,"pushed_at":"2023-02-05T07:27:27.000Z","size":36,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-12T23:13:52.939Z","etag":null,"topics":["data-analysis","preprocessing","wikidata","wikidata-dump"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/benjamindpb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-05-22T17:57:15.000Z","updated_at":"2022-12-16T23:56:50.000Z","dependencies_parsed_at":"2024-11-19T05:01:47.322Z","dependency_job_id":null,"html_url":"https://github.com/benjamindpb/wikidata-preprocessing","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/benjamindpb/wikidata-preprocessing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benjamindpb%2Fwikidata-preprocessing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benjamindpb%2Fwikidata-preprocessing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benjamindpb%2Fwikidata-preprocessing/releases","manifests_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/benjamindpb%2Fwikidata-preprocessing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/benjamindpb","download_url":"https://codeload.github.com/benjamindpb/wikidata-preprocessing/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/benjamindpb%2Fwikidata-preprocessing/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265464610,"owners_count":23770325,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","preprocessing","wikidata","wikidata-dump"],"created_at":"2024-11-18T16:13:52.601Z","updated_at":"2025-07-15T22:20:59.334Z","avatar_url":"https://github.com/benjamindpb.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Wikidata Preprocessing\n\n## Objective\nThe main objective of this work is to carry out a preprocessing and subsequent analysis of the Wikidata database to see the feasibility of its use as a potential data source for carrying out an entity geolocation project. In addition, once the data source is obtained, the aim is to analyze the performance of generating a world map with georeferenced instances.\n\n## Preprocessing\nTo carry out the preprocessing of the Wikidata database, the [truthy dump](https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Truthy_statements) is first downloaded and read to obtain all the triples that contain the property [P625](https://www.wikidata.org/wiki/Property:P625) (*coordinate location*) as a predicate. 
In this way, all the Wikidata entities that can potentially be georeferenced on a world map are obtained. These are saved in a TSV file for later analysis.\n\n## Data analysis\nOnce the georeferenced entities were obtained, the **types** to which these entities correspond were analyzed. To do this, the dump was read again in the same way, and for each triple with the property [P31](https://www.wikidata.org/wiki/Property:P31) (*instance of*) as a predicate, the type was saved in a dictionary as the *key* and its count as the *value*. Note that an entity may have more than one P31 value: for example, the University of Chile is an instance of public university, open access publisher and research institute.\n\n## Results\nThe visualizations were made with [D3.js](https://d3js.org/) and [Folium](https://python-visualization.github.io/folium/).\n\nThe following images show the geolocation of 500 thousand Wikidata entities using D3.js and Folium, respectively.\n\n![d3_500k](https://user-images.githubusercontent.com/48598318/184800667-672df18e-0a3c-408e-94d6-a01e567189a2.png)\n![folium_500k](https://user-images.githubusercontent.com/48598318/184800687-d937c4b2-5903-4f9d-980c-b69259d709c0.png)\n\nAs explained above, the analysis of the data yielded the distribution of the types of georeferenced entities.\nThe following table shows the 25 most frequent entity types in Wikidata:\n\n| Entity URL                              | Label       |   Count |\n|:----------------------------------------|-------------|--------:|\n| https://www.wikidata.org/wiki/Q8502     |mountain|  519904 |\n| https://www.wikidata.org/wiki/Q486972   |human settlement|  418608 |\n| https://www.wikidata.org/wiki/Q79007    |street|  406014 |\n| https://www.wikidata.org/wiki/Q4022     |river|  366991 |\n| https://www.wikidata.org/wiki/Q54050    |hill|  321257 |\n| 
https://www.wikidata.org/wiki/Q41176    |building|  259995 |\n| https://www.wikidata.org/wiki/Q23397    |lake|  257618 |\n| https://www.wikidata.org/wiki/Q3947     |house|  193489 |\n| https://www.wikidata.org/wiki/Q16970    |church building|  191711 |\n| https://www.wikidata.org/wiki/Q532      |village|  176979 |\n| https://www.wikidata.org/wiki/Q355304   |watercourse|  173187 |\n| https://www.wikidata.org/wiki/Q23442    |island|  148484 |\n| https://www.wikidata.org/wiki/Q27686    |hotel|  121843 |\n| https://www.wikidata.org/wiki/Q47521    |stream|  121753 |\n| https://www.wikidata.org/wiki/Q9842     |primary school|  107988 |\n| https://www.wikidata.org/wiki/Q811979   |architectural structure|  101900 |\n| https://www.wikidata.org/wiki/Q55488    |railway station|   98867 |\n| https://www.wikidata.org/wiki/Q39816    |valley|   95799 |\n| https://www.wikidata.org/wiki/Q22698    |park|   81944 |\n| https://www.wikidata.org/wiki/Q39614    |cemetery|   81427 |\n| https://www.wikidata.org/wiki/Q12323    |dam|   73837 |\n| https://www.wikidata.org/wiki/Q67383935 |co-educational school|   73757 |\n| https://www.wikidata.org/wiki/Q124714   |spring|   69248 |\n| https://www.wikidata.org/wiki/Q19855165 |rural school|   68024 |\n| https://www.wikidata.org/wiki/Q55659167 |natural watercourse|   66348 |\n\n---\n[![Alt text](https://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Wikidata_Stamp_Rec_Dark.svg/200px-Wikidata_Stamp_Rec_Dark.svg.png \"Powered by Wikidata\")](https://www.wikidata.org/wiki/Wikidata:Main_Page)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbenjamindpb%2Fwikidata-preprocessing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbenjamindpb%2Fwikidata-preprocessing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbenjamindpb%2Fwikidata-preprocessing/lists"}