Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/derhuerst/build-wikipedia-feed
Build a hyperdb of Wikipedia articles.
- Host: GitHub
- URL: https://github.com/derhuerst/build-wikipedia-feed
- Owner: derhuerst
- License: ISC
- Created: 2017-07-18T23:33:12.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2022-04-07T21:49:10.000Z (almost 3 years ago)
- Last Synced: 2024-12-25T00:30:12.423Z (17 days ago)
- Topics: database, hyperdb, p2p, wikipedia
- Language: JavaScript
- Homepage: https://github.com/derhuerst/build-wikipedia-feed
- Size: 60.5 KB
- Stars: 24
- Watchers: 2
- Forks: 4
- Open Issues: 3
Metadata Files:
- Readme: readme.md
- License: license.md
README
# build-wikipedia-feed
**Build a [hyperdrive](https://npmjs.com/package/hyperdrive) feed of Wikipedia articles**, including historical revisions.
[![npm version](https://img.shields.io/npm/v/build-wikipedia-feed.svg)](https://www.npmjs.com/package/build-wikipedia-feed)
[![build status](https://img.shields.io/travis/derhuerst/build-wikipedia-feed.svg)](https://travis-ci.org/derhuerst/build-wikipedia-feed)
![ISC-licensed](https://img.shields.io/github/license/derhuerst/build-wikipedia-feed.svg)
[![chat on gitter](https://badges.gitter.im/derhuerst.svg)](https://gitter.im/derhuerst)
[![support me on Patreon](https://img.shields.io/badge/support%20me-on%20patreon-fa7664.svg)](https://patreon.com/derhuerst)

## Rationale
### Problem
[Wikipedia](https://en.wikipedia.org/wiki/Wikipedia) is an incredibly important collection of knowledge on the internet. It is free for everyone to read and edit. However, **it is stored on only a handful of servers in a handful of countries, controlled by a single organisation**. This causes two main problems:
- Currently, it is **too easy to censor** Wikipedia. We need a system that provides redundancy without any additional effort.
- It **does not work offline**. Making an offline copy is complicated, and you usually have to download all articles for a language.

### Solution
Let's **store Wikipedia's content in a [peer-to-peer (P2P) system](https://en.wikipedia.org/wiki/Peer-to-peer)**. By leveraging software from [the *Dat* project](https://docs.datproject.org), we won't have to reinvent the wheel. [The *Dat* protocol](https://github.com/datproject/docs/blob/master/papers/dat-paper.pdf) efficiently syncs only the changes between two versions of data, allows for sparse & live replication, and is completely [distributed](https://en.wikipedia.org/wiki/Peer-to-peer#Unstructured_networks).
This tool can extract articles from a [Wikipedia dump](https://dumps.wikimedia.org/enwiki) or download them directly, and store them in a *Dat* archive. See below for more details.
## Installing
```shell
npm install -g build-wikipedia-feed
```

## Usage
This module exposes several command line building blocks.
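The building blocks read from stdin and write to stdout, so they should compose into a single pipeline. Here is a minimal sketch, assuming the commands chain exactly as in the individual examples below (using the same dump URL):

```shell
# sketch: list the latest revisions from a dump and store them in one go
curl -s 'https://dumps.wikimedia.org/enwiki/20181001/enwiki-20181001-stub-meta-current1.xml.gz' \
  | gunzip \
  | wiki-revisions-list \
  | wiki-store-revisions
```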
### read *all* revisions of every article
Pipe [a `stub-meta-history*` XML file](https://dumps.wikimedia.org/enwiki/) into `wiki-revisions-list`. You will get an [ndjson](http://ndjson.org) list of page revisions.
```shell
curl -s 'https://dumps.wikimedia.org/enwiki/20181001/enwiki-20181001-stub-meta-history1.xml.gz' | gunzip | wiki-revisions-list >revisions.ndjson
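# optional sanity checks on the output (assumes jq is installed; the exact
# record fields are not specified here):
wc -l revisions.ndjson            # one JSON object per revision
head -n 1 revisions.ndjson | jq . # pretty-print the first record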
```

### read the *most recent* revision of every article
Pipe [a `stub-meta-current*` XML file](https://dumps.wikimedia.org/enwiki/) into `wiki-revisions-list`. You will get an [ndjson](http://ndjson.org) list of page revisions.
```shell
curl -s 'https://dumps.wikimedia.org/enwiki/20181001/enwiki-20181001-stub-meta-current1.xml.gz' | gunzip | wiki-revisions-list >revisions.ndjson
```

### read articles being *edited right now*
Use `wiki-live-revisions`. You will get an [ndjson](http://ndjson.org) list of page revisions.
```shell
wiki-live-revisions >revisions.ndjson
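# the live stream does not terminate on its own; stop it with Ctrl+C or,
# as a sketch, bound it with coreutils' timeout:
timeout 60 wiki-live-revisions >revisions.ndjson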
```

### fetch & store revisions in a [hyperdrive](https://npmjs.com/package/hyperdrive)
Use `wiki-store-revisions` to write the HTML content of all revisions in `revisions.ndjson` into a [hyperdrive](https://npmjs.com/package/hyperdrive). The archive will be created under `p2p-wiki` in [your system's data directory](https://github.com/sindresorhus/env-paths#usage).
```shell
cat revisions.ndjson | wiki-store-revisions
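# the listing tools write ndjson to stdout, so they can presumably feed
# the store directly, without an intermediate file:
wiki-live-revisions | wiki-store-revisions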
```

## Related
- [distributed-wikipedia-mirror](https://github.com/ipfs/distributed-wikipedia-mirror) – Putting wikipedia on IPFS
- [`fetch-wikipedia-page-revision`](https://github.com/derhuerst/fetch-wikipedia-page-revision#fetch-wikipedia-page-revision) – Fetch a revision of a Wikipedia page as mobile HTML.
- [`wikipedia-edits-stream`](https://github.com/derhuerst/wikipedia-edits-stream#wikipedia-edits-stream) – A live stream of page edits on Wikipedia.
- [`commons-photo-url`](https://github.com/derhuerst/commons-photo-url#commons-photo-url) – Download Wikimedia Commons photos.
- [`wiki-article-name-encoding`](https://github.com/derhuerst/wiki-article-name-encoding#wiki-article-name-encoding) – Encode & decode Wiki(pedia) article names/slugs.

## Contributing
If you have a question or have difficulties using `build-wikipedia-feed`, please double-check your code and setup first. If you think you have found a bug or want to propose a feature, refer to [the issues page](https://github.com/derhuerst/build-wikipedia-feed/issues).