{"id":17645743,"url":"https://github.com/kleinhenz/wiki-network-extractor","last_synced_at":"2025-05-07T05:12:18.823Z","repository":{"id":191735948,"uuid":"165935351","full_name":"kleinhenz/wiki-network-extractor","owner":"kleinhenz","description":"python module for extracting link networks from wikimedia xml dumps","archived":false,"fork":false,"pushed_at":"2020-04-17T03:27:53.000Z","size":9,"stargazers_count":7,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-05-07T05:12:12.037Z","etag":null,"topics":["data-science","network-graph","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kleinhenz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-01-15T22:46:55.000Z","updated_at":"2025-04-29T23:23:57.000Z","dependencies_parsed_at":"2023-08-31T12:23:07.751Z","dependency_job_id":null,"html_url":"https://github.com/kleinhenz/wiki-network-extractor","commit_stats":null,"previous_names":["kleinhenz/wiki-network-extractor"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kleinhenz%2Fwiki-network-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kleinhenz%2Fwiki-network-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kleinhenz%2Fwiki-network-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kleinhenz%2Fwiki-network-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/owners/kleinhenz","download_url":"https://codeload.github.com/kleinhenz/wiki-network-extractor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252817653,"owners_count":21808707,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","network-graph","python"],"created_at":"2024-10-23T10:59:19.233Z","updated_at":"2025-05-07T05:12:18.806Z","avatar_url":"https://github.com/kleinhenz.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# wiki-network-extractor\nwiki-network-extractor (`wikinet`) is a python module for extracting link networks from [wikimedia xml dumps](https://dumps.wikimedia.org).\n\n## Installation\n`pip install git+https://github.com/kleinhenz/wiki-network-extractor.git`\n\n## Usage\nThe following commands download and parse the latest xml dump of [simple english wikipedia](https://simple.wikipedia.org/wiki/Main_Page).\nThis takes ~300MB of disk space and ~1 minute to parse.\n```\ncurl -L \"https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2\" \u003e simplewiki.xml.bz2\npython -m wikinet xml2json simplewiki.xml.bz2 simplewiki.json\npython -m wikinet json2hdf simplewiki.json simplewiki.h5\n```\nThis creates a hdf5 archive (`simplewiki.h5`) containing page titles, lengths (in characters) and the adjacency matrix of the link network in [CSR](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format)) format.\nThe structure of the hdf5 archive is shown below (obtained from 
`h5dump -n 1 simplewiki.h5`).\n```\nHDF5 \"simplewiki.h5\" {\nFILE_CONTENTS {\n group      /\n group      /graph\n group      /graph/adjacency\n group      /graph/adjacency/csr\n dataset    /graph/adjacency/csr/IA\n dataset    /graph/adjacency/csr/JA\n group      /graph/vertices\n dataset    /graph/vertices/lengths\n dataset    /graph/vertices/titles\n }\n}\n```\n\n## Inspiration\n* [six degrees of wikipedia](http://mu.netsoc.ie/wiki/)\n* [sdow](https://github.com/jwngr/sdow)\n\n## Implementation Notes\n\n### Two Stage Parsing\nNetwork extraction is done in two stages to optimize speed, memory, and disk space requirements.\nIn the first and slowest stage, `xml2json` incrementally reads a (bzip2 compressed) xml dump, extracts all links from the text of each page using a regex, and produces a [newline delimited json](http://ndjson.org/) file where each line is a json object containing the title, length and links for a single page/redirect.\nIn the second stage, `json2hdf` reads the json file, applies filters, resolves all links and saves the results in a hdf5 archive.\n\nThis two stage approach is used because, in order to resolve links, a complete list of all page titles and redirects must be available.\nTherefore, either all links must be held in memory as text until the entire wiki has been read or else the wiki must be read twice (once to collect all titles and redirects and once to resolve links).\nFor large wikis, e.g. 
[english wikipedia](https://en.wikipedia.org/wiki/Main_Page), holding all links as text requires a prohibitive amount of memory for typical machines, so it is necessary to use the second approach and take two passes.\nHowever, reading the xml dump twice would be slow because of its size and compression (and it is desirable to keep it compressed in order to save disk space).\nThe intermediate ndjson representation solves these problems since it can be written on the fly, requiring only one pass through the xml dump, and can be read quickly, making the two pass approach feasible in the second stage.\n\n### Link Extraction\n[Wikitext links](https://en.wikipedia.org/wiki/Help:Link) have the general form `[[Page name#Section name|displayed text]]`.\nThe target for each link is extracted in `xml2json` using the following python regex: `re.compile(\"(?:\\\\[\\\\[)(.+?)(?:[\\\\]|#])\")`.\n\n### Link Resolution\nResolving links is a mostly straightforward process except for two details.\nFirst, links can point to redirects that must be followed.\nSecond, the first letter of a link is case insensitive unless the link is only a single letter.\n`wikinet` takes care to handle both of these details correctly by keeping track of all redirects and normalizing links.\n\n### Page Filtering\nThe xml dumps contain many pages that are not normal articles, such as files and help pages.\nThese are filtered in `json2hdf` by checking page titles against the following python regex:\n```\nre.compile(\"(?:Wikipedia:|:?File:|Media:|:?Image:|:?Template:|Draft:|Portal:|Module:|TimedText:|MediaWiki:|Help:)\")\n```\n\n### Storage Format\nThe output is stored as a hdf5 archive rather than as GraphML or some other text based format because the data can be large enough to make these formats inconvenient in terms of both disk space and parsing time.\nFor example, the link network of [simple english wikipedia](https://simple.wikipedia.org/wiki/Main_Page) takes up ~125MB as a GraphML file but only ~16MB as a hdf5 
archive.\nThese savings become more important for larger wikis such as [english wikipedia](https://en.wikipedia.org/wiki/Main_Page) which takes up about 1GB as a hdf5 archive.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkleinhenz%2Fwiki-network-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkleinhenz%2Fwiki-network-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkleinhenz%2Fwiki-network-extractor/lists"}