{"id":13676069,"url":"https://github.com/mosuka/wikipedia-jsonl","last_synced_at":"2026-05-17T04:33:05.617Z","repository":{"id":40423024,"uuid":"442082667","full_name":"mosuka/wikipedia-jsonl","owner":"mosuka","description":"wikipedia-jsonl is a CLI that converts Wikipedia dump XML to JSON Lines format.","archived":false,"fork":false,"pushed_at":"2023-03-28T16:00:55.000Z","size":26445,"stargazers_count":3,"open_issues_count":2,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-02-10T15:13:48.737Z","etag":null,"topics":["cli","go","golang","jsonl","mediawiki","ndjson","wikipedia","xml"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mosuka.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null},"funding":{"github":"mosuka","patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"custom":null}},"created_at":"2021-12-27T07:20:16.000Z","updated_at":"2023-03-30T13:39:19.000Z","dependencies_parsed_at":"2024-01-14T14:31:14.164Z","dependency_job_id":"bc3a8ff6-5f4f-4e33-b982-e723d1c6b0bb","html_url":"https://github.com/mosuka/wikipedia-jsonl","commit_stats":{"total_commits":31,"total_committers":2,"mean_commits":15.5,"dds":"0.12903225806451613","last_synced_commit":"25b8ca4d81aa4c7f9c6fe048ce33cdcff63d8116"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mosuka%2Fwikipedia-jsonl","tags_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/mosuka%2Fwikipedia-jsonl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mosuka%2Fwikipedia-jsonl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mosuka%2Fwikipedia-jsonl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mosuka","download_url":"https://codeload.github.com/mosuka/wikipedia-jsonl/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247301266,"owners_count":20916477,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","go","golang","jsonl","mediawiki","ndjson","wikipedia","xml"],"created_at":"2024-08-02T13:00:17.671Z","updated_at":"2026-05-17T04:33:00.569Z","avatar_url":"https://github.com/mosuka.png","language":"Go","funding_links":["https://github.com/sponsors/mosuka"],"categories":["Go"],"sub_categories":[],"readme":"# wikipedia-jsonl\n\nwikipedia-jsonl is a CLI that converts Wikipedia dump XML to JSON Lines format.\n\n\n## Requirement\n\nThis command uses [SQLite](https://sqlite.org). 
Make sure to install SQLite for your platform in advance.\n\n\n## Download Wikipedia dumps\n\nDownload Wikipedia dumps from [Wikimedia Downloads](https://dumps.wikimedia.org/backup-index.html).\n\n- enwiki-YYYYMMDD-pages-articles-multistream.xml.bz2\n- enwiki-YYYYMMDD-categorylinks.sql.gz\n\n\n## Import dumps\n\nClone [mysql2sqlite](https://github.com/dumblob/mysql2sqlite)\n\n```\n% git clone git@github.com:dumblob/mysql2sqlite.git\n```\n\nConvert the dump file to SQLite SQL and import it into SQLite.\n\n```\n% gunzip -c enwiki-20211201-categorylinks.sql.gz | ./mysql2sqlite/mysql2sqlite - | sqlite3 enwiki-20211201.db\n```\n\n## Convert Wikipedia XML to JSONL\n\nRun the following command to convert the XML to JSONL and output it to stdout.\n\n```\n% bzcat enwiki-20211201-pages-articles-multistream.xml.bz2 | ./bin/wikipedia-jsonl -a -c -d enwiki-20211201.db -e -m -l -r\n```\n\nExecuting the above command will output the results as shown below.\n\n```\n{\"categories\":[\"Redirects_from_moves\",\"Redirects_with_old_history\",\"Unprintworthy_redirects\"],\"external_links\":[],\"id\":10,\"links\":[{\"Namespace\":\"\",\"PageName\":\"Computer accessibility\",\"Anchor\":\"\"}],\"media\":[],\"redirect\":\"Computer accessibility\",\"text\":\" Computer accessibility\",\"timestamp\":\"2021-01-23T15:15:01Z\",\"title\":\"AccessibleComputing\"}\n{\"categories\":[\"Redirects_with_old_history\",\"Unprintworthy_redirects\"],\"external_links\":[],\"id\":14,\"links\":[{\"Namespace\":\"\",\"PageName\":\"Geography of Afghanistan\",\"Anchor\":\"\"}],\"media\":[],\"redirect\":\"Geography of Afghanistan\",\"text\":\" Geography of Afghanistan\",\"timestamp\":\"2017-06-05T04:18:23Z\",\"title\":\"AfghanistanGeography\"}\n{\"categories\":[\"Redirects_with_old_history\",\"Unprintworthy_redirects\"],\"external_links\":[],\"id\":15,\"links\":[{\"Namespace\":\"\",\"PageName\":\"Demographics of Afghanistan\",\"Anchor\":\"\"}],\"media\":[],\"redirect\":\"Demographics of Afghanistan\",\"text\":\" 
Demographics of Afghanistan\",\"timestamp\":\"2017-06-05T04:19:42Z\",\"title\":\"AfghanistanPeople\"}\n{\"categories\":[\"Redirects_with_old_history\",\"Unprintworthy_redirects\"],\"external_links\":[],\"id\":18,\"links\":[{\"Namespace\":\"\",\"PageName\":\"Communications in Afghanistan\",\"Anchor\":\"\"}],\"media\":[],\"redirect\":\"Communications in Afghanistan\",\"text\":\" Communications in Afghanistan\",\"timestamp\":\"2017-06-05T04:19:45Z\",\"title\":\"AfghanistanCommunications\"}\n{\"categories\":[\"Redirects_with_old_history\",\"Unprintworthy_redirects\"],\"external_links\":[],\"id\":19,\"links\":[{\"Namespace\":\"\",\"PageName\":\"Transport in Afghanistan\",\"Anchor\":\"\"}],\"media\":[],\"redirect\":\"Transport in Afghanistan\",\"text\":\" Transport in Afghanistan\",\"timestamp\":\"2017-06-04T21:42:11Z\",\"title\":\"AfghanistanTransportations\"}\n{\"categories\":[\"Redirects_with_old_history\",\"Unprintworthy_redirects\"],\"external_links\":[],\"id\":20,\"links\":[{\"Namespace\":\"\",\"PageName\":\"Afghan Armed Forces\",\"Anchor\":\"\"}],\"media\":[],\"redirect\":\"Afghan Armed Forces\",\"text\":\" Afghan Armed Forces\",\"timestamp\":\"2017-06-04T21:43:11Z\",\"title\":\"AfghanistanMilitary\"}\n{\"categories\":[\"Redirects_with_old_history\",\"Unprintworthy_redirects\"],\"external_links\":[],\"id\":21,\"links\":[{\"Namespace\":\"\",\"PageName\":\"Foreign relations of Afghanistan\",\"Anchor\":\"\"}],\"media\":[],\"redirect\":\"Foreign relations of Afghanistan\",\"text\":\" Foreign relations of Afghanistan\",\"timestamp\":\"2017-06-04T21:43:14Z\",\"title\":\"AfghanistanTransnationalIssues\"}\n{\"categories\":[\"Redirects_with_old_history\",\"Unprintworthy_redirects\"],\"external_links\":[],\"id\":23,\"links\":[{\"Namespace\":\"\",\"PageName\":\"Assistive technology\",\"Anchor\":\"\"}],\"media\":[],\"redirect\":\"Assistive technology\",\"text\":\" 
Assistive_technology\",\"timestamp\":\"2017-06-05T04:19:50Z\",\"title\":\"AssistiveTechnology\"}\n\n...\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmosuka%2Fwikipedia-jsonl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmosuka%2Fwikipedia-jsonl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmosuka%2Fwikipedia-jsonl/lists"}