{"id":24955116,"url":"https://github.com/jrie/wicked","last_synced_at":"2025-08-24T13:46:03.982Z","repository":{"id":83346265,"uuid":"64584795","full_name":"jrie/wicked","owner":"jrie","description":"A xml/wikipedia dump parser and processor written in C","archived":false,"fork":false,"pushed_at":"2020-10-18T09:37:16.000Z","size":827,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-02-03T05:28:33.195Z","etag":null,"topics":["parsing","parsing-engine","wikipedia","wikipedia-parser","xml","xml-parser","xml-parsing"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jrie.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-07-31T08:38:47.000Z","updated_at":"2022-12-26T00:27:56.000Z","dependencies_parsed_at":"2023-03-12T18:00:03.847Z","dependency_job_id":null,"html_url":"https://github.com/jrie/wicked","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jrie%2Fwicked","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jrie%2Fwicked/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jrie%2Fwicked/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jrie%2Fwicked/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jrie","download_url":"https://codeload.github.com/jrie/wicked/t
ar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246093184,"owners_count":20722402,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["parsing","parsing-engine","wikipedia","wikipedia-parser","xml","xml-parser","xml-parsing"],"created_at":"2025-02-03T05:24:11.678Z","updated_at":"2025-03-28T20:14:21.490Z","avatar_url":"https://github.com/jrie.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# wicked\nA Wikipedia dump parser written in pure C.\n\n![screenshot of wicked](screenshot.png)\n\n![screenshot of wicked in action](screenshot2.png)\n\n## In short\n*wicked* parses (Wikipedia) XML dumps. It is meant to serve as a foundation for a compressor: it reads the data, attributes, keys and values of XML and Wikipedia tags into RAM, saves the parsing results to files, and delivers debug output to help check for errors or to provide a file count/byte detection statistic.\n\n## Origin of \"enwik8_small\"\n\n\"enwik8_small\" is an excerpt of the Wikipedia dump file \"enwik8\" used by the **Hutter Prize** (http://prize.hutter1.net/). Since *enwik8* is 100 MB in size, it is not included in this GitHub repository, so \"enwik8_small\" has to suffice for quick tests. For testing *wicked*, however, I would recommend using \"enwik8\" or a current Wikipedia (meta) dump or similar. 
Any other Wikipedia dump from https://dumps.wikimedia.org/ should work as well.\n\n## Status and further information\n**wicked is work in progress.**\n\nXML nodes are read into memory in structured form, including their keys and values as well as the data they contain, which can consist of words, wikitags and entities.\n\nBy default, wicked currently produces a lot of debug output, which outlines what data has been added and shows wikitag styling information as well as link targets, anchors and images. Multi-line tables are not supported yet, but this may come in a later release.\n\nWord data is written to **words.txt** and wikitag link targets to **wikitags.txt**. Wikitags are processed further so that the words and styling tags they contain are handled as well. Entities are written to **entities.txt**, and XML data is written to **xmltags.txt** and **xmldata.txt**.\n\nEach of these elements carries background information inside *wicked*: pre- and post-spacing, styling information, position in the row by index, whether it is a format start or end, and other details. I would recommend checking out *struct word*, *struct wikitag* and *struct entity* as well as the other data structures.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjrie%2Fwicked","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjrie%2Fwicked","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjrie%2Fwicked/lists"}