{"id":18060613,"url":"https://github.com/lschmid83/wikipedia-extractor","last_synced_at":"2025-08-19T04:32:43.463Z","repository":{"id":193835834,"uuid":"683403184","full_name":"lschmid83/Wikipedia-Extractor","owner":"lschmid83","description":"Wikipedia Extractor is a lightweight C# library which can be used to extract XML page data from a Wikipedia data dump.","archived":false,"fork":false,"pushed_at":"2024-12-01T11:55:05.000Z","size":33,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-12-01T12:37:43.213Z","etag":null,"topics":["data-dump","integration-testing","page-title","regex","search-algorithm","search-index","unit-testing","wikipedia","xml"],"latest_commit_sha":null,"homepage":"https://www.nuget.org/packages/WikipediaExtractor","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lschmid83.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-26T13:12:48.000Z","updated_at":"2024-12-01T11:55:09.000Z","dependencies_parsed_at":"2024-12-01T12:42:29.753Z","dependency_job_id":null,"html_url":"https://github.com/lschmid83/Wikipedia-Extractor","commit_stats":null,"previous_names":["lschmid83/wikipedia-extractor"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lschmid83%2FWikipedia-Extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lschmid83%2FWikipedia-Extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
repositories/lschmid83%2FWikipedia-Extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lschmid83%2FWikipedia-Extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lschmid83","download_url":"https://codeload.github.com/lschmid83/Wikipedia-Extractor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230320811,"owners_count":18208251,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-dump","integration-testing","page-title","regex","search-algorithm","search-index","unit-testing","wikipedia","xml"],"created_at":"2024-10-31T04:09:55.827Z","updated_at":"2024-12-18T18:29:40.820Z","avatar_url":"https://github.com/lschmid83.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Wikipedia Extractor\n\nWikipedia Extractor is a lightweight C# library which can be used to extract XML page data from a Wikipedia data dump. It uses the index file included with the compressed data dump to find the position of a page and quickly retrieve it from the archive. It was developed using Visual Studio 2022.\n\nThe data dumps are currently available at https://dumps.wikimedia.org/enwiki/. You will need to download both the dump and its index file, extract the index (but leave the dump compressed), and enter the correct paths so the library can find the files.
\n\nThe test project can be run without the data dump, as all of the index and page contents are created in memory.\n\nThis library does not parse the XML page elements; it just returns an object containing the XML. There are other projects on GitHub for parsing the XML.\n\nHere are some screenshots of the library running:\n\n\u003cimg align='left' src='https://drive.google.com/uc?id=1d5y_9GKCelsbyn61Ui7oHYZYQhCB1MKG' width='240'\u003e\n\u003cimg src='https://drive.google.com/uc?id=1IQeyd8hGIURlNH6VW9GjyjnShMoV9GYF' width='240'\u003e\n\n# Example\n\n```cs\n// Titles of the pages to look up.\nvar pageTitles = new List\u003cstring\u003e\n{\n\t\"Software development\",\n\t\"Microsoft Visual Studio\",\n\t\"JavaScript\"\n};\n\n// Search the extracted index file for the requested titles.\nusing (var indexSearcher = new PageIndexSearcher(@\"c:\\enwiki-20190701-pages-articles-multistream-index.txt\"))\n{\n\tvar pageIndexItems = indexSearcher.Search(pageTitles);\n\tforeach (PageIndexItem pii in pageIndexItems)\n\t{\n\t\tConsole.WriteLine(pii.PageId + \": \" + pii.PageTitle);\n\t}\n\n\t// Use the index items to retrieve each page from the compressed dump.\n\tusing (var dataDumpReader = new DataDumpReader(@\"c:\\enwiki-20190701-pages-articles-multistream.xml.bz2\"))\n\t{\n\t\tvar results = dataDumpReader.Search(pageIndexItems);\n\t\tforeach (var result in results)\n\t\t{\n\t\t\tConsole.WriteLine(result.Name + \": \" + result.Value);\n\t\t}\n\t}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flschmid83%2Fwikipedia-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flschmid83%2Fwikipedia-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flschmid83%2Fwikipedia-extractor/lists"}