https://github.com/mosuka/wikipedia-jsonl
wikipedia-jsonl is a CLI that converts Wikipedia dump XML to JSON Lines format.
- Host: GitHub
- URL: https://github.com/mosuka/wikipedia-jsonl
- Owner: mosuka
- License: mit
- Created: 2021-12-27T07:20:16.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-03-28T16:00:55.000Z (almost 2 years ago)
- Last Synced: 2024-11-16T17:42:14.584Z (2 months ago)
- Topics: cli, go, golang, jsonl, mediawiki, ndjson, wikipedia, xml
- Language: Go
- Size: 25.2 MB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
# wikipedia-jsonl
wikipedia-jsonl is a CLI that converts Wikipedia dump XML to JSON Lines format.
## Requirement
This command uses [SQLite](https://sqlite.org). Make sure to install SQLite for your platform in advance.
## Download Wikipedia dumps
Download the following Wikipedia dump files from [Wikimedia Downloads](https://dumps.wikimedia.org/backup-index.html), where `YYYYMMDD` is the dump date:
- enwiki-YYYYMMDD-pages-articles-multistream.xml.bz2
- enwiki-YYYYMMDD-categorylinks.sql.gz
## Import dumps
Check out [mysql2sqlite](https://github.com/dumblob/mysql2sqlite):
```
% git clone git@github.com:dumblob/mysql2sqlite.git
```
Convert the dump file to SQLite-compatible SQL and import it into SQLite.
```
% gunzip -c enwiki-20211201-categorylinks.sql.gz | ./mysql2sqlite/mysql2sqlite - | sqlite3 enwiki-20211201.db
```
## Convert Wikipedia XML to JSONL
Run the following command to convert the XML to JSONL and output it to stdout.
```
% bzcat enwiki-20211201-pages-articles-multistream.xml.bz2 | ./bin/wikipedia-jsonl -a -c -d enwiki-20211201.db -e -m -l -r
```
Executing the above command will output the results as shown below.
```
{"categories":["Redirects_from_moves","Redirects_with_old_history","Unprintworthy_redirects"],"external_links":[],"id":10,"links":[{"Namespace":"","PageName":"Computer accessibility","Anchor":""}],"media":[],"redirect":"Computer accessibility","text":" Computer accessibility","timestamp":"2021-01-23T15:15:01Z","title":"AccessibleComputing"}
{"categories":["Redirects_with_old_history","Unprintworthy_redirects"],"external_links":[],"id":14,"links":[{"Namespace":"","PageName":"Geography of Afghanistan","Anchor":""}],"media":[],"redirect":"Geography of Afghanistan","text":" Geography of Afghanistan","timestamp":"2017-06-05T04:18:23Z","title":"AfghanistanGeography"}
{"categories":["Redirects_with_old_history","Unprintworthy_redirects"],"external_links":[],"id":15,"links":[{"Namespace":"","PageName":"Demographics of Afghanistan","Anchor":""}],"media":[],"redirect":"Demographics of Afghanistan","text":" Demographics of Afghanistan","timestamp":"2017-06-05T04:19:42Z","title":"AfghanistanPeople"}
{"categories":["Redirects_with_old_history","Unprintworthy_redirects"],"external_links":[],"id":18,"links":[{"Namespace":"","PageName":"Communications in Afghanistan","Anchor":""}],"media":[],"redirect":"Communications in Afghanistan","text":" Communications in Afghanistan","timestamp":"2017-06-05T04:19:45Z","title":"AfghanistanCommunications"}
{"categories":["Redirects_with_old_history","Unprintworthy_redirects"],"external_links":[],"id":19,"links":[{"Namespace":"","PageName":"Transport in Afghanistan","Anchor":""}],"media":[],"redirect":"Transport in Afghanistan","text":" Transport in Afghanistan","timestamp":"2017-06-04T21:42:11Z","title":"AfghanistanTransportations"}
{"categories":["Redirects_with_old_history","Unprintworthy_redirects"],"external_links":[],"id":20,"links":[{"Namespace":"","PageName":"Afghan Armed Forces","Anchor":""}],"media":[],"redirect":"Afghan Armed Forces","text":" Afghan Armed Forces","timestamp":"2017-06-04T21:43:11Z","title":"AfghanistanMilitary"}
{"categories":["Redirects_with_old_history","Unprintworthy_redirects"],"external_links":[],"id":21,"links":[{"Namespace":"","PageName":"Foreign relations of Afghanistan","Anchor":""}],"media":[],"redirect":"Foreign relations of Afghanistan","text":" Foreign relations of Afghanistan","timestamp":"2017-06-04T21:43:14Z","title":"AfghanistanTransnationalIssues"}
{"categories":["Redirects_with_old_history","Unprintworthy_redirects"],"external_links":[],"id":23,"links":[{"Namespace":"","PageName":"Assistive technology","Anchor":""}],"media":[],"redirect":"Assistive technology","text":" Assistive_technology","timestamp":"2017-06-05T04:19:50Z","title":"AssistiveTechnology"}...
```
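Since each output line is a standalone JSON object, downstream tools can decode the stream one line at a time. The following is a minimal sketch of such a consumer in Go, not code from wikipedia-jsonl itself. The `Page` and `Link` types and the `decodeLine` helper are hypothetical names introduced for illustration; their field names are taken from the sample output above. The element shape of `external_links` and `media` is not visible in the sample (both arrays are empty), so they are kept as raw JSON here.

```
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// Link mirrors one element of the "links" array in the sample output.
// (Hypothetical type; not taken from the wikipedia-jsonl source.)
type Link struct {
	Namespace string `json:"Namespace"`
	PageName  string `json:"PageName"`
	Anchor    string `json:"Anchor"`
}

// Page mirrors the top-level keys visible in the sample output.
// (Hypothetical type; not taken from the wikipedia-jsonl source.)
type Page struct {
	ID         int      `json:"id"`
	Title      string   `json:"title"`
	Redirect   string   `json:"redirect"`
	Text       string   `json:"text"`
	Timestamp  string   `json:"timestamp"`
	Categories []string `json:"categories"`
	Links      []Link   `json:"links"`
	// Assumption: element shape unknown, so the raw JSON is preserved.
	ExternalLinks []json.RawMessage `json:"external_links"`
	Media         []json.RawMessage `json:"media"`
}

// decodeLine parses a single JSON Lines record into a Page.
func decodeLine(line []byte) (Page, error) {
	var p Page
	err := json.Unmarshal(line, &p)
	return p, err
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	// Article text can be long, so raise the scanner's default 64 KiB token limit.
	scanner.Buffer(make([]byte, 0, 1024*1024), 64*1024*1024)

	for scanner.Scan() {
		line := scanner.Bytes()
		if len(line) == 0 {
			continue // skip blank lines
		}
		p, err := decodeLine(line)
		if err != nil {
			log.Fatalf("decode: %v", err)
		}
		if p.Redirect != "" {
			fmt.Printf("%d\t%s -> %s\n", p.ID, p.Title, p.Redirect)
		} else {
			fmt.Printf("%d\t%s (%d categories)\n", p.ID, p.Title, len(p.Categories))
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatalf("read: %v", err)
	}
}
```

Saved as, say, `main.go` (an illustrative file name), the program can be placed at the end of the pipeline above by appending `| go run main.go` to the conversion command.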