
# dumpster-dive

wikipedia dump parser

by Spencer Kelly, Devrim Yasar, and others

gets a wikipedia xml dump into mongo, so you can mess-around.

💂 Yup 💂

do it on your laptop.


`dumpster-dive` is a **node** script that puts a **highly-queryable** wikipedia on your computer in a nice afternoon.

It uses [worker-nodes](https://github.com/allegro/node-worker-nodes) to process pages in parallel, and [wtf_wikipedia](https://github.com/spencermountain/wtf_wikipedia) to turn **_wikiscript_** into any json.
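
to see what that transform looks like on a single page, here's a tiny standalone sketch (illustrative only - the real pipeline batches pages across worker processes):

```js
// one page's wikiscript, run through wtf_wikipedia by itself
const wtf = require('wtf_wikipedia');

let wikiscript = `[[Toronto]] is the most populous city in [[Canada]].`;
let doc = wtf(wikiscript);

console.log(doc.text()); // plain text, with the [[link]] syntax stripped
console.log(doc.json()); // roughly the json shape that gets written to mongo
```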


-- en-wikipedia takes about 5 hours, end-to-end --

![dumpster](https://user-images.githubusercontent.com/399657/40262198-a268b95a-5ad3-11e8-86ef-29c2347eec81.gif)

this library writes to a database. If you'd like to simply write files to the filesystem, use **[dumpster-dip](https://github.com/spencermountain/dumpster-dip)** instead.

```bash
npm install -g dumpster-dive
```

### 😎 API

```js
var dumpster = require('dumpster-dive');
dumpster({ file: './enwiki-latest-pages-articles.xml', db: 'enwiki' }, () => console.log('done!'));
```

### Command-Line:

```bash
dumpster /path/to/my-wikipedia-article-dump.xml --citations=false --images=false
```

_then check em out in mongo:_

```bash
$ mongo #enter the mongo shell
use enwiki #grab the database
db.pages.count()
# 4,926,056...
db.pages.find({title:"Toronto"})[0].categories
#[ "Former colonial capitals in Canada",
# "Populated places established in 1793" ...]
```
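
_or, if you'd rather poke at it from node - a quick sketch using the official `mongodb` driver (assuming a local mongod, like above):_

```js
// the same queries, from node instead of the mongo shell
const { MongoClient } = require('mongodb');

async function peek() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const pages = client.db('enwiki').collection('pages');

  console.log(await pages.countDocuments()); // ~5 million for en-wiki
  const toronto = await pages.findOne({ title: 'Toronto' });
  console.log(toronto.categories);

  await client.close();
}
peek();
```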

# Steps:

### 1️⃣ you can do this.

you can do this.
just a few GB. you can do this.

### 2️⃣ get ready

Install [nodejs](https://nodejs.org/en/) (at least `v6`) and [mongodb](https://docs.mongodb.com/manual/installation/) (at least `v3`).

```bash
# install this script
npm install -g dumpster-dive # (that gives you the global command `dumpster`)
# start mongo up
mongod --config /mypath/to/mongod.conf
```

### 3️⃣ download a wikipedia

The Afrikaans wikipedia (around 93,000 articles) only takes a few minutes to download, and 5 mins to load into mongo on a macbook:

```bash
# download an xml dump (38mb, couple minutes)
wget https://dumps.wikimedia.org/afwiki/latest/afwiki-latest-pages-articles.xml.bz2
```

the english dump is 16 GB. The [download page](https://dumps.wikimedia.org/enwiki/latest/) is confusing, but you'll want this file:

```bash
wget https://dumps.wikimedia.org/${LANG}wiki/latest/${LANG}wiki-latest-pages-articles.xml.bz2
```

for example, the English version is:
```bash
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```

### 4️⃣ unzip it

i know, this sucks. but it makes the parser so much faster.

```bash
bzip2 -d ./afwiki-latest-pages-articles.xml.bz2
```

On a macbook, unzipping en-wikipedia takes an hour or so. This is the most boring part. Eat some lunch.

Or use a multithreaded bzip2 implementation like `lbzip2`, which takes around 4 minutes on an M1 Pro Mac with 13 threads:

```bash
brew install lbzip2
lbzip2 -d ./afwiki-latest-pages-articles.xml.bz2
```

The english wikipedia is around 100 GB once unzipped.

### 5️⃣ OK, start it off

```bash
#load it into mongo (10-15 minutes)
dumpster ./afwiki-latest-pages-articles.xml
```

### 6️⃣ take a bath

just put some [epsom salts](https://www.youtube.com/watch?v=QSlIHCu2Smw) in there, it feels great.

The en-wiki dump should take a few hours. Maybe 8. Maybe 4. Have a snack prepared.

The console will update you every couple seconds to let you know where it's at.

### 7️⃣ done!

![image](https://user-images.githubusercontent.com/399657/40262181-7c1f17bc-5ad3-11e8-95ab-55f324022d43.png)

hey, go check-out your data - hit-up the mongo console:

```js
$ mongo
use afwiki //your db name

//show a random page
db.pages.find().skip(200).limit(2)

//find a specific page
db.pages.findOne({title:"Toronto"}).categories

//find the last page
db.pages.find().sort({$natural:-1}).limit(1)

// all the governors of Kentucky
db.pages.count({ categories: { $eq: "Governors of Kentucky" } })

//pages without images
db.pages.count({ images: {$size: 0} })
```

alternatively, you can run `dumpster-report afwiki` to see a quick spot-check of the records it has created across the database.

### Same for the English wikipedia:

the english wikipedia will work under the same process, but the download will take an afternoon, and the loading/parsing a couple hours. The en wikipedia dump is 13 GB (for [enwiki-20170901-pages-articles.xml.bz2](https://dumps.wikimedia.org/enwiki/20170901/enwiki-20170901-pages-articles.xml.bz2)), and becomes a pretty legit mongo collection once uncompressed. It's something like 51 GB, but mongo can do it 💪.

## Options:

dumpster follows all the conventions of [wtf_wikipedia](https://github.com/spencermountain/wtf_wikipedia), and you can pass-in any fields for it to include in its json.

- **human-readable plaintext** **_--plaintext_**

```js
dumpster({ file: './myfile.xml.bz2', db: 'enwiki', plaintext: true, categories: false });
/*
[{
  _id: 'Toronto',
  title: 'Toronto',
  plaintext: 'Toronto is the most populous city in Canada and the provincial capital...'
}]
*/
```

- **disambiguation pages / redirects** **_--skip_disambig_**, **_--skip_redirects_**
by default, dumpster skips entries in the dump that aren't full-on articles. You can include them by setting these options to `false`:

```js
let obj = {
  file: './path/enwiki-latest-pages-articles.xml.bz2',
  db: 'enwiki',
  skip_redirects: false,
  skip_disambig: false
};
dumpster(obj, () => console.log('done!'));
```

- **reducing file-size:**
you can tell wtf_wikipedia what you want it to parse, and which data you don't need:

```bash
dumpster ./my-wiki-dump.xml --infoboxes=false --citations=false --categories=false --links=false
```
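
the same switches can also be passed as options from node. A minimal sketch, assuming the CLI flags map one-to-one onto option keys (as the `--plaintext` example above suggests):

```js
const dumpster = require('dumpster-dive');

// assumed: each CLI flag has a matching boolean option key
dumpster(
  { file: './my-wiki-dump.xml', db: 'enwiki', infoboxes: false, citations: false, categories: false, links: false },
  () => console.log('done!')
);
```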

- **custom json formatting**
you can grab whatever data you want, by passing-in a `custom` function. It takes a [wtf_wikipedia](https://github.com/spencermountain/wtf_wikipedia) `Doc` object, and you can return your cool data:

```js
let obj = {
  file: path,
  db: dbName,
  custom: function (doc) {
    return {
      _id: doc.title(), //for duplicate-detection
      title: doc.title(), //for the logger..
      sections: doc.sections().map((i) => i.json({ encode: true })),
      categories: doc.categories() //whatever you want!
    };
  }
};
dumpster(obj, () => console.log('custom wikipedia!'));
```

if you're using any `.json()` methods, pass a `{encode:true}` in to avoid mongo complaints about key-names.

- **non-main namespaces:**
do you want to parse all the navboxes? change `namespace` in ./config.js to [another number](https://en.wikipedia.org/wiki/Wikipedia:Namespace)

- **remote db:**
if your database is non-local, or requires authentication, set it like this:

```js
dumpster({ db_url: 'mongodb://username:password@localhost:27017/' }, () => console.log('done!'));
```

## how it works:

this library uses:

- [sunday-driver](https://github.com/spencermountain/sunday-driver) to stream the gnarly xml file
- [wtf_wikipedia](https://github.com/spencermountain/wtf_wikipedia) to brute-parse the article wikiscript contents into JSON.

## Addendum:

### \_ids

since wikimedia makes all pages have globally unique titles, we also use them for the mongo `_id` fields.
The benefit is that if it crashes half-way through, or you want to run it again, re-running the script will not multiply your data. We do an 'upsert' on each record.
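
roughly the idea, sketched with the official `mongodb` driver (given a `pages` collection handle like in the earlier node example - this is not dumpster-dive's actual code):

```js
// upsert keyed on the page title: re-running the script replaces
// existing records instead of duplicating them
async function savePage(pages, doc) {
  await pages.replaceOne(
    { _id: doc.title },  // wikipedia titles are globally unique
    doc,                 // replace the whole record
    { upsert: true }     // insert if missing, overwrite if present
  );
}
```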

### encoding special characters

mongo has some opinions about special characters in key names. It is weird, but we're using this [standard(ish)](https://stackoverflow.com/a/30254815/168877) form of encoding them:

```
\ --> \\
$ --> \u0024
. --> \u002e
```
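
in javascript terms, that's just three replacements. A tiny sketch of the mapping above (not the library's internal code):

```js
// encode a key name the way described above, so mongo accepts it
function encodeKey(key) {
  return key
    .replace(/\\/g, '\\\\')     // \  -->  \\
    .replace(/\$/g, '\\u0024')  // $  -->  \u0024
    .replace(/\./g, '\\u002e'); // .  -->  \u002e
}

console.log(encodeKey('price.usd$')); // price\u002eusd\u0024
```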

### Non-wikipedias

This library should also work on other wikis with standard xml dumps from [MediaWiki](https://www.mediawiki.org/wiki/MediaWiki) (except wikidata!). I haven't tested them, but wtf_wikipedia supports all sorts of non-standard wiktionary/wikivoyage templates, and if you can get a bz-compressed xml dump from your wiki, this should work fine. Open an issue if you find something weird.

### did it break?

if the script trips at a specific spot, it's helpful to know the article it breaks on, by setting `verbose:true`:

```js
dumpster({
  file: '/path/to/file.xml',
  verbose: true
});
```

this prints out every page's title while processing it.

### 16mb limit?

To go faster, this library writes a ton of articles at a time (default 800). Mongo has a **16mb** limit on writes, so if you're adding a bunch of data, like `latex`, or `html`, it may make sense to turn this down.

```bash
dumpster --batch_size=100
```

that should do the trick.
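
from node, the same knob should be available as an option - assuming the flag maps onto a `batch_size` key like the other flags do:

```js
const dumpster = require('dumpster-dive');

// assumed: batch_size is accepted as an option key (default 800)
dumpster({ file: './my-wiki-dump.xml', db: 'enwiki', batch_size: 100 }, () => console.log('done!'));
```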

### PRs welcome!

This is an important project, come [help us out](./contributing.md).