{"id":14959397,"url":"https://github.com/msorkhpar/wiki-entity-summarization-preprocessor","last_synced_at":"2026-01-31T02:31:41.299Z","repository":{"id":243824894,"uuid":"776663236","full_name":"msorkhpar/wiki-entity-summarization-preprocessor","owner":"msorkhpar","description":"Convert Wikidata and Wikipedia raw files to filterable formats with a focus of marking Wikidata  as summaries based on their Wikipedia abstracts.","archived":false,"fork":false,"pushed_at":"2024-08-20T12:31:46.000Z","size":1151,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-09-05T12:21:23.230Z","etag":null,"topics":["distilbert","java","neo4j","networkx","postgresql","python","transformers","wikes","wiki-entity-summarization","wikies"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/msorkhpar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-03-24T05:50:43.000Z","updated_at":"2025-01-16T02:12:12.000Z","dependencies_parsed_at":"2024-09-22T09:02:10.343Z","dependency_job_id":"4923d810-00c4-4e9c-b2b1-a61c3207bb1f","html_url":"https://github.com/msorkhpar/wiki-entity-summarization-preprocessor","commit_stats":{"total_commits":13,"total_committers":2,"mean_commits":6.5,"dds":"0.23076923076923073","last_synced_commit":"26cc60c8f79d6ed00dbc305f479ad537effc0219"},"previous_names":["msorkhpar/wiki-entity-summarization-preprocessor"],"tags_count":3,"template":false,"template_full_name":null,"purl":"
pkg:github/msorkhpar/wiki-entity-summarization-preprocessor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msorkhpar%2Fwiki-entity-summarization-preprocessor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msorkhpar%2Fwiki-entity-summarization-preprocessor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msorkhpar%2Fwiki-entity-summarization-preprocessor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msorkhpar%2Fwiki-entity-summarization-preprocessor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/msorkhpar","download_url":"https://codeload.github.com/msorkhpar/wiki-entity-summarization-preprocessor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msorkhpar%2Fwiki-entity-summarization-preprocessor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28927159,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-30T22:32:35.345Z","status":"online","status_checked_at":"2026-01-31T02:00:09.179Z","response_time":128,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["distilbert","java","neo4j","networkx","postgresql","python","transformers","wikes","wiki-entity-summarization","wikies"],"created_at":"2024-09-24T13:19:36.795Z","updated_at":"2026-01-31T02:31:41.283Z","avatar_url":"https://github.c
om/msorkhpar.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![arXiv](https://img.shields.io/badge/arXiv-2406.08435-B31B1B.svg)](https://doi.org/10.48550/arXiv.2406.08435)![GitHub License](https://img.shields.io/github/license/msorkhpar/wiki-entity-summarization-preprocessor)\n\n# Wiki Entity Summarization Pre-processing\n\n## Overview\n\nThis project covers the pre-processing steps required for the Wiki Entity Summarization (Wiki ES) project. It\ninvolves building the necessary databases and loading data from various sources to prepare for the entity summarization\ntasks.\n\n### Server Specifications\n\nFor the pre-processing steps, we used an r5a.4xlarge instance on AWS with the following specifications:\n\n- vCPU: 16 (AMD EPYC 7571, 16 MiB cache, 2.5 GHz)\n- Memory: 128 GB (DDR4, 2667 MT/s)\n- Storage: 500 GB (EBS, 2880 Max Bandwidth)\n\n### Getting Started\n\nTo get started with the pre-processing, follow these steps:\n\n1. Build the [wikimapper](https://github.com/jcklie/wikimapper) database:\n\n```shell\npip install wikimapper\n```\n\nTo download the latest English Wikipedia dump files, run the following:\n\n```shell\nEN_WIKI_REDIRECT_AND_PAGES_PATH={your_files_path}\nwikimapper download enwiki-latest --dir $EN_WIKI_REDIRECT_AND_PAGES_PATH\n```\n\nOnce `enwiki-{VERSION}-page.sql.gz`, `enwiki-{VERSION}-redirect.sql.gz`,\nand `enwiki-{VERSION}-page_props.sql.gz` are in your data directory, run the following commands:\n\n```shell\nVERSION={VERSION}\nEN_WIKI_REDIRECT_AND_PAGES_PATH={your_files_path}\nINDEX_DB_PATH=\"`pwd`/data/index_enwiki-$VERSION.db\"\nwikimapper create enwiki-$VERSION --dumpdir $EN_WIKI_REDIRECT_AND_PAGES_PATH --target $INDEX_DB_PATH\n```\n\n2. 
Load the created database into the Postgres database.\n   See the [pgloader documentation](https://pgloader.readthedocs.io/en/latest/install.html) for installation instructions.\n\n```shell\n./config-files-generator.sh\nsource .env\ncat \u003c\u003cEOT \u003e sqlite-to-page-migration.load\nload database\n    from $INDEX_DB_PATH\n    into postgresql://$DB_USER:$DB_PASS@$DB_HOST:$DB_PORT/$DB_NAME\nwith include drop, create tables, create indexes, reset sequences\n;\nEOT\n\npgloader ./sqlite-to-page-migration.load\n```\n\n3. Correct missing data: while running our experiments, we encountered some issues with the wikimapper library.\n   To correct the missing data, run the following script:\n\n```shell\npython3 missing_data_correction.py\n```\n\n## Data Sources\n\nThe pre-processing steps involve loading data from the following sources:\n\n- **Wikidata**, [wikidatawiki latest version](https://dumps.wikimedia.org/wikidatawiki/latest/):\n  First, download the latest version of the Wikidata dump. With the dump, you can run the following command to load the\n  metadata of the Wikidata dataset into the Postgres database, and the relationships between the entities into the Neo4j\n  database. This module is called `Wikidata Graph Builder (wdgp)`.\n  ```shell\n  docker-compose up wdgp\n  ```\n- **Wikipedia**, [enwiki latest version](https://dumps.wikimedia.org/enwiki/latest/):\n  The Wikipedia pages are used to extract the abstract and infobox of the corresponding Wikidata entity. The abstract\n  and infobox are then used to annotate the summary in Wikidata. To provide this information, you need to load the\n  latest version of the Wikipedia dump into the Postgres database. 
This module is called `Wikipedia Page Extractor (wppe)`.\n  ```shell\n  docker-compose up wppe\n  ```\n\n## Summary Annotation\n\nOnce both datasets are loaded into the databases, we process all the available pages in the Wikipedia dataset\nto extract the abstract and infobox of the corresponding Wikidata entity. The pages referenced in the\nextracted data are then marked, and the edges containing the marked pages are marked as candidates. Since Wikidata is a heterogeneous\ngraph with multiple types of edges, we need to pick the most relevant edge as a summary between two entities for the\nsummarization task. This module is called `Wiki Summary Annotator (wsa)`, and we\nuse [DistilBERT](https://arxiv.org/abs/1910.01108) to select the most relevant edge.\n\n```shell\ndocker-compose up wsa\n```\n\n## Conclusion\n\nAfter running the above commands, you will have the necessary databases and data loaded to start the Wiki Entity\nSummarization project. The next steps involve providing a set of seed nodes of your choice, along with other\nconfiguration parameters, to get a fully customized Entity Summarization Dataset.\n\n## Citation\n\nIf you use this project in your research, please cite the following paper:\n\n```bibtex\n@misc{javadi2024wiki,\n    title = {Wiki Entity Summarization Benchmark},\n    author = {Saeedeh Javadi and Atefeh Moradan and Mohammad Sorkhpar and Klim Zaporojets and Davide Mottin and Ira Assent},\n    year = {2024},\n    eprint = {2406.08435},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.IR}\n}\n```\n\n## License\n\nThis project is licensed under the CC BY 4.0 License. 
See the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsorkhpar%2Fwiki-entity-summarization-preprocessor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmsorkhpar%2Fwiki-entity-summarization-preprocessor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsorkhpar%2Fwiki-entity-summarization-preprocessor/lists"}