{"id":30344470,"url":"https://github.com/7mza/wikigrapher-generator","last_synced_at":"2026-05-05T04:07:00.610Z","repository":{"id":309320965,"uuid":"1034266297","full_name":"7mza/wikigrapher-generator","owner":"7mza","description":"explore wikipedia as a graph","archived":false,"fork":false,"pushed_at":"2025-08-11T07:00:12.000Z","size":262,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-08-11T09:17:34.708Z","etag":null,"topics":["bash","data-processing","graphs","neo4j","python","shell"],"latest_commit_sha":null,"homepage":"https://wikigrapher.com","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/7mza.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-08T05:50:20.000Z","updated_at":"2025-08-11T07:00:16.000Z","dependencies_parsed_at":"2025-08-11T09:17:46.412Z","dependency_job_id":"b45e3130-b8d4-4530-a766-5f15d0d1d289","html_url":"https://github.com/7mza/wikigrapher-generator","commit_stats":null,"previous_names":["7mza/wikigrapher-generator"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/7mza/wikigrapher-generator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/7mza%2Fwikigrapher-generator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/7mza%2Fwikigrapher-generator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/7mza%2Fwikigrapher-generator/releases","manifests_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/7mza%2Fwikigrapher-generator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/7mza","download_url":"https://codeload.github.com/7mza/wikigrapher-generator/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/7mza%2Fwikigrapher-generator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270996198,"owners_count":24681933,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-18T02:00:08.743Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bash","data-processing","graphs","neo4j","python","shell"],"created_at":"2025-08-18T12:42:21.502Z","updated_at":"2026-05-05T04:07:00.593Z","avatar_url":"https://github.com/7mza.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# \u003cimg src=\"./misc/wikigrapher.png\" alt=\"drawing\" width=\"50\"/\u003e Wikigrapher-Generator\n\nTransform Wikipedia into a knowledge graph [https://wikigrapher.com](https://wikigrapher.com)\n\n**TLDR: Wikipedia SQL dumps -\u003e Wikigrapher-Generator -\u003e Wikipedia Neo4j graph**\n\nExplore how Wikipedia pages are connected beneath the surface\n\n---\n\n**Standalone web app for this project is available at [7mza/wikigrapher-slim](https://github.com/7mza/wikigrapher-slim)**\n\n## Overview\n\nBuilt by 
transforming Wikipedia SQL dumps (pages, links, redirects, templates, categories) from the [relational model](https://www.mediawiki.org/wiki/manual:database_layout)\n\n![sql_db](./misc/db.svg)\n\ninto a navigable graph\n\n![graph_db](./misc/graph.jpg)\n\nTechnically a set of bash scripts to download and clean dumps + python scripts to handle dictionary/set operations and serialize in-memory objects (RAM offloading + snapshotting of processing steps)\n\n---\n\nThis project is loosely based on [jwngr/sdow](https://github.com/jwngr/sdow)\n\nIt's heavily modified to rely entirely on the graph model and [neo4j/apoc](https://github.com/neo4j/apoc) instead of rewriting graph algorithms, and adds support for more Wikipedia node types (redirects, categories, templates ...)\n\n## Local generation\n\n[python \u003c= 3.14](https://github.com/pyenv/pyenv)\n\nbash\n\n```shell\n#apt install wget aria2 pigz\n\nchmod +x ./*.sh\n\n(venv)\n\npip3 install --upgrade pip -r requirements.txt\n\n./clean.sh \u0026\u0026 ./generate_tsv.sh\n\n# or\n\n./clean.sh \u0026\u0026 ./generate_tsv.sh --date YYYYMMDD --lang XX\n```\n\n**Dumps are released on the 1st and 20th of each month. A 404 or checksum error means the dump is still in progress: wait a few days or pick an earlier date**\n\n**--date YYYYMMDD** represents desired date of [dump](https://dumps.wikimedia.org/enwiki)\n\n- If not provided, will default to latest dump available\n- **--date 11111111 will generate an EN dummy dump based on [example.sql](./misc/example.sql) for testing purposes**\n\n**--lang XX** represents desired language of dump\n\n- **EN/AR/FR are tested**\n- If not provided, will default to EN\n\nTo test another language, enable it in line `en | ar | fr)` in [generate_tsv.sh](./generate_tsv.sh)\n\n**Dump download speed depends on the Wikimedia servers' rate limit; graph generation for Wikipedia EN takes around 2h on a 6c/32g/nvme machine**\n\n## Docker generation\n\n\u003cspan style=\"color:red\"\u003eLimit generator service RAM and CPU in 
[compose.yml](./compose.yml)\u003c/span\u003e\n\n```shell\ndocker compose run --remove-orphans --build generator\n\n# or\n\nDUMP_DATE=YYYYMMDD DUMP_LANG=XX docker compose run --remove-orphans --build generator\n\n# using run instead of up for tqdm and aria2 progress indicators\n```\n\n```shell\n# linux only: change ownership of generated files to current user/group\n# not needed for win/mac\nsudo chown -R \"$(id -u):$(id -g)\" ./dump/ ./output/\n```\n\n## Neo4j setup\n\nRemove the previous neo4j volume when processing a newer dump\n\n```shell\ndocker volume rm wikigrapher_neo4j_data\n```\n\nAfter successful generation of graph TSVs by the previous step\n\n`[INFO] graph generated successfully: Sun Aug 01 08:00:00 2025`\n\n(exit 0 + check [output folder](./output/))\n\nUncomment the neo4j service command line in [compose.yml](./compose.yml) to prevent the default db from starting immediately after the neo4j server starts\n\nCommunity version only allows 1 db and prevents importing on a running one\n\nThen\n\n```shell\ndocker compose --profile neo4j up --build --remove-orphans\n```\n\nAfter the container starts and returns \"not starting database automatically\", leave it running, and in a separate terminal (project dir)\n\n```shell\ndocker compose exec neo4j bash -c \"\ncd /import \u0026\u0026 \\\nneo4j-admin database import full neo4j \\\n--overwrite-destination --delimiter='\\t' --array-delimiter=';' \\\n--nodes=./pages.header.tsv.gz,./pages.final.tsv.gz \\\n--nodes=./categories.header.tsv.gz,./categories.final.tsv.gz \\\n--nodes=./meta.header.tsv.gz,./meta.final.tsv.gz \\\n--relationships=./redirect_to.header.tsv.gz,./redirect_to.final.tsv.gz \\\n--relationships=./link_to.header.tsv.gz,./link_to.final.tsv.gz \\\n--relationships=./belong_to.header.tsv.gz,./belong_to.final.tsv.gz \\\n--relationships=./contains.header.tsv.gz,./contains.final.tsv.gz \\\n--verbose\"\n```\n\nAfter importing is finished, revert the changes to [compose.yml](./compose.yml), stop the previously running neo4j container 
then `docker compose --profile neo4j up --build --remove-orphans` again, you should be able to connect to neo4j ui at http://localhost:7474/ (login/pwd in [.env](./.env))\n\n### Neo4j text lookup indexes :\n\nNodes indexes\n\n```sql\nCREATE TEXT INDEX index_page_title IF NOT EXISTS FOR (n:page) on (n.title);\nCREATE TEXT INDEX index_page_id IF NOT EXISTS FOR (n:page) on (n.pageId);\n\nCREATE TEXT INDEX index_redirect_title IF NOT EXISTS FOR (n:redirect) on (n.title);\nCREATE TEXT INDEX index_redirect_id IF NOT EXISTS FOR (n:redirect) on (n.pageId);\n\nCREATE TEXT INDEX index_category_title IF NOT EXISTS FOR (n:category) on (n.title);\nCREATE TEXT INDEX index_category_id IF NOT EXISTS FOR (n:category) on (n.categoryId);\n\nCREATE TEXT INDEX index_meta_property IF NOT EXISTS FOR (n:meta) on (n.property);\nCREATE TEXT INDEX index_meta_value IF NOT EXISTS FOR (n:meta) on (n.value);\nCREATE TEXT INDEX index_meta_id IF NOT EXISTS FOR (n:meta) on (n.metaId);\n```\n\n```sql\nSHOW INDEXES;\n// wait for 100% populationPercent\n```\n\nIf you want to flag orphans (pages with no incoming or outgoing links)\n\n```sql\nCALL apoc.periodic.iterate(\n  \"MATCH (node:page) RETURN node\",\n  \"WITH node\n  WHERE NOT EXISTS ((node)-[:link_to]-\u003e())\n  AND NOT EXISTS ((node)\u003c-[:link_to|redirect_to]-())\n  AND NOT EXISTS ((node)\u003c-[:contains]-(:category {title: 'Redirects_to_Wiktionary'}))\n  CREATE (orphan:orphan {\n    id: node.pageId,\n    title: node.title,\n    type: labels(node)[0],\n    createdAt: timestamp()\n  })\n  RETURN orphan\",\n  {batchSize: 100000, parallel: true}\n)\nYIELD batches, total\nRETURN batches, total\n\n// wait for procedure to finish\n```\n\nthen orphan indexes\n\n```sql\nCREATE TEXT INDEX index_orphan_title IF NOT EXISTS FOR (n:orphan) on (n.title);\nCREATE TEXT INDEX index_orphan_type IF NOT EXISTS FOR (n:orphan) on (n.type);\nCREATE TEXT INDEX index_orphan_id IF NOT EXISTS FOR (n:orphan) on (n.id);\nCREATE TEXT INDEX index_orphan_created IF NOT 
EXISTS FOR (n:orphan) on (n.createdAt);\n```\n\n```sql\nSHOW INDEXES;\n// wait for 100% populationPercent\n```\n\n## Some Neo4j queries\n\n\u003cspan style=\"color:red\"\u003eneo4j auto escapes special chars before saving\u003c/span\u003e\n\n```\nL'Avare_(film)            savedAs     L\\\\'Avare_(film)\n\nPowelliphanta_\"Matiri\"    savedAs     Powelliphanta_\\\\\\\"Matiri\\\\\\\"\n```\n\n\n### shortest path between two nodes\n\n```sql\nMATCH (source:page|redirect {title: \"Albus_Dumbledore\"})\nMATCH (target:page|redirect {title: \"Ubuntu\"})\nMATCH path = SHORTESTPATH((source)-[:link_to|redirect_to*1..50]-\u003e(target))\nRETURN path\n```\n\n### all shortest paths between two nodes\n\n```sql\nMATCH (source:page|redirect {title: \"L\\\\'Avare_(film)\"})\nMATCH (target:page|redirect {title: \"Ubuntu\"})\nMATCH paths = ALLSHORTESTPATHS((source)-[:link_to|redirect_to*1..50]-\u003e(target))\nWITH paths, [node IN nodes(paths) | node.title] AS titles\nORDER BY titles\n// SKIP 0 LIMIT 10\nreturn paths\n```\n\n### all shortest paths between two nodes + consider redirects as target\n\n```sql\nMATCH (source:page|redirect {title: \"L\\\\'Avare_(film)\"})\nMATCH (target:page|redirect {title: \"Powelliphanta_\\\\\\\"Matiri\\\\\\\"\"})\nOPTIONAL MATCH (redirects:redirect)-[:redirect_to]-\u003e(target)\nMATCH paths = ALLSHORTESTPATHS((source)-[:link_to|redirect_to*1..50]-\u003e(target))\nWITH paths AS tmp, length(paths) AS len, source, redirects\nCALL\n  apoc.cypher.run(\n    \"CALL (source, len, redirects, tmp) {\n        OPTIONAL MATCH paths = ALLSHORTESTPATHS(\n          (source)-[:link_to|redirect_to*1..\" + len + \"]-\u003e(redirects)\n        )\n        RETURN paths\n        UNION\n        RETURN tmp as paths }\n        WITH paths, [node IN nodes(paths) | node.title] AS titles\n        ORDER BY titles\n        RETURN paths\",\n    {source: source, redirects: redirects, len: len, tmp:tmp}\n  )\nYIELD value\nWITH DISTINCT value.paths AS paths\n// SKIP 0 LIMIT 10\nRETURN 
paths\n```\n\n### find orphan nodes\n\n```sql\nMATCH (orphan:orphan {type: \"page\"}) // or \"redirect\"\nRETURN orphan\nORDER BY orphan.title\n// SKIP 0 LIMIT 10\n```\n\n### all nodes belonging to a category\n\n```sql\nMATCH (target:category {title: \"The_Lord_of_the_Rings_characters\"})\nMATCH (node)-[:belong_to]-\u003e(target)\nRETURN node\nORDER BY node.title\n// SKIP 0 LIMIT 10 // careful, will hang your host\n```\n\n### top/bottom N categories\n\n```sql\nMATCH (category:category)\u003c-[:belong_to]-()\nWITH category, count(*) AS categoryCount\nRETURN category.title AS categoryTitle, categoryCount\nORDER BY categoryCount DESC // or ASC for bottom\nSKIP 0 LIMIT 3 // careful, will hang your host\n```\n\n## Collaboration \u0026 scope\n\nThere’s more structured Wikipedia data to be added (revisions, revision authors, etc.)\n\nOther tools like Spark are better for large-scale processing, but the goal here is simplicity:\nruns on a personal machine, easy to understand and easy to extend\n\nIf you have ideas or want to contribute, feel free to open an issue or PR\n\n## Todo\n\n- Wikipedia templates\n- Split sh files\n- Unit tests\n- Lower RAM needs by moving from dill/pickle to a better way (mmap, hdf5 ...)\n- ~~Pgzip not working on py \u003e= 3.12 (dumps are gz and neo4j-admin can only read gz/zip)~~\n\n## Misc\n\nAll links DB are changing according to [https://phabricator.wikimedia.org/T300222](https://phabricator.wikimedia.org/T300222)\n\nFormat/lint:\n\n```shell\n#apt install shfmt shellcheck\n\n(venv)\n\npip3 install --upgrade pip -r requirements_dev.txt\n\nisort ./scripts/*.py \u0026\u0026 black ./scripts/*.py \u0026\u0026 shfmt -l -w ./*.sh \u0026\u0026 shellcheck ./*.sh\n\npylint ./scripts/*.py\n```\n\n## License\n\nThis project is licensed under the [GNU Affero General Public License v3.0](./LICENSE.txt)\n\nWikipedia® is a registered trademark of the Wikimedia Foundation\n\nThis project is independently developed and not affiliated with or endorsed by 
the Wikimedia Foundation\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F7mza%2Fwikigrapher-generator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F7mza%2Fwikigrapher-generator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F7mza%2Fwikigrapher-generator/lists"}