# [dumpster-dip](https://github.com/spencermountain/dumpster-dip)
parse a wikipedia dump into tiny files
The data exports from wikimedia, arguably the world's most important datasets, exist as huge xml files in a notorious markup format.
dumpster-dip can flip this dataset into individual json or text files.
Sister-project dumpster-dive puts this data into mongodb, instead - use whatever you prefer!
Both projects use wtf_wikipedia as a parser.
## Command-Line
the easiest way to get started is to simply run:
```bash
npx dumpster-dip
```
which is a wild, no-install, no-dependency way to get going.
Follow the prompts, and this will **download**, **unzip**, and **parse** any-language wikipedia, into a selected format.
The optional params are:
```bash
--lang fr # do the french wikipedia
--output encyclopedia # add all 'E' pages to ./E/
--text # return plaintext instead of json
```
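For example, combining these flags, `npx dumpster-dip --lang fr --text` should fetch the french wikipedia and write plaintext files.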
## JS API
Also available as a powerful javascript library:
```bash
npm install dumpster-dip
```

```js
import dumpster from 'dumpster-dip' // or require('dumpster-dip')

await dumpster({ file: './enwiki-latest-pages-articles.xml' }) // 😅
```

This will require you to download and unzip a dump yourself. Instructions below.
Depending on the language, it may take a couple hours.

### Instructions
1. Download a dump
cruise the wikipedia dump page and look for `${LANG}wiki-latest-pages-articles.xml.bz2`
2. Unzip the dump
`bzip2 -d ./enwiki-latest-pages-articles.xml.bz2`
3. Start the javascript
```js
import dip from 'dumpster-dip'

const opts = {
  input: './enwiki-latest-pages-articles.xml',
  parse: function (doc) {
    return doc.sentences()[0].text() // return the first sentence of each page
  }
}

dip(opts).then(() => {
  console.log('done!')
})
```

en-wikipedia takes about 4hrs on a macbook. See expected article counts [here](https://meta.wikimedia.org/wiki/List_of_Wikipedias).
### Options
```js
{
  file: './enwiki-latest-pages-articles.xml', // path to unzipped dump file relative to cwd
  outputDir: './dip', // directory for all our new file(s)
  outputMode: 'nested', // how we should write the results

  // define how many concurrent workers to run
  workers: cpuCount, // default is cpu count

  // interval to log status
  heartbeat: 5000, // every 5 seconds

  // which wikipedia namespaces to handle (null will do all)
  namespace: 0, // (default article namespace)

  // parse redirects, too
  redirects: false,

  // parse disambiguation pages, too
  disambiguation: true,

  // allow a custom wtf_wikipedia parsing library
  libPath: 'wtf_wikipedia',

  // should we skip this page or return something?
  doPage: function (doc) {
    return true
  },

  // what to return, for every page
  // - avoid using an arrow-function
  parse: function (doc) {
    return doc.json()
  }
}
```
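As a rough sketch of how these fit together (the paths and worker count below are only illustrative; anything omitted keeps the defaults above):

```js
import dip from 'dumpster-dip'

// illustrative values - omitted options fall back to the defaults listed above
await dip({
  input: './enwiki-latest-pages-articles.xml',
  outputDir: './dip',
  outputMode: 'hash', // the default mode, written out explicitly
  workers: 4,
  heartbeat: 5000,
  parse: function (doc) {
    return doc.json()
  }
})
```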
### Output formats:
dumpster-dip comes with 5 output formats:
- **'flat'** - all files in 1 directory
- **'encyclopedia'** - all `'E..'` pages in `./e`
- **'encyclopedia-two'** - all `'Ed..'` pages in `./ed`
- **'hash'** (default) - 2 evenly-distributed directories
- **'ndjson'** - all data in one file

Sometimes operating systems don't like having ~6m files in one folder - so these options allow different nesting structures:
##### Encyclopedia
to put files in folders indexed by their first letter, do:
```js
let opts = {
outputDir: './results',
outputMode: 'encyclopedia'
}
```

Remember, some directories become way larger than others. Also remember that titles are UTF-8.
For two-letter folders, use `outputMode: 'encyclopedia-two'`
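For instance, mirroring the example above (a sketch; the `./results` path is just illustrative):

```js
let opts = {
  outputDir: './results',
  outputMode: 'encyclopedia-two' // e.g. pages starting with 'Ed..' land in ./ed/
}
```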
#### Hash (default)
This format nests each file 2-deep, using the first 4 characters of the filename's hash:
```
/BE
  /EF
    /Dennis_Rodman.txt
    /Hilary_Clinton.txt
```

Although these directory names are meaningless, the advantage of this format is that files will be distributed evenly, instead of piling-up in the 'E' directory.
This is the same scheme that wikipedia uses internally.
as a helper, this library exposes a function for navigating this directory scheme:
```js
import getPath from 'dumpster-dip/nested-path'
let file = getPath('Dennis Rodman')
// ./BE/EF/Dennis_Rodman.txt
```
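As a usage sketch, this helper can be combined with the output directory of a previous run to read a page back (the `./dip` directory and `.txt` extension here assume the defaults and plaintext output):

```js
import getPath from 'dumpster-dip/nested-path'
import fs from 'fs'
import path from 'path'

// assumes a previous 'hash'-mode run into the default './dip' directory, with text output
let rel = getPath('Dennis Rodman') // ./BE/EF/Dennis_Rodman.txt
let text = fs.readFileSync(path.join('./dip', rel), 'utf8')
console.log(text.slice(0, 100))
```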
##### Flat:
if you want all files in one flat directory, you can cross your fingers and do:
```js
let opts = {
outputDir: './results',
outputMode: 'flat'
}
```

##### Ndjson
You may want all results in one newline-delimited file.
Using this format, you can produce TSV or CSV files:

```js
let opts = {
outputDir: './results',
outputMode: 'ndjson',
parse: function (doc) {
return [doc.title(), doc.text().length].join('\t')
}
}
```
### Examples:
Wikipedia is often a complicated place. Getting specific data may require some investigation and experimentation:
_See runnable examples in [./examples](https://github.com/spencermountain/dumpster-dip/tree/main/src)_
#### Birthdays of basketball players
Process only the 13,000 pages with the category [American men's basketball players](https://en.m.wikipedia.org/wiki/Category:American_men%27s_basketball_players)
```js
await dip({
input: `./enwiki-latest-pages-articles.xml`,
doPage: function (doc) {
return doc.categories().find((cat) => cat === `American men's basketball players`)
},
parse: function (doc) {
return doc.infobox().get('birth_date')
}
})
```

#### Film Budgets
Look for pages with the [Film infobox](https://en.m.wikipedia.org/wiki/Template:Infobox_film) and grab some properties:
```js
await dip({
input: `./enwiki-latest-pages-articles.xml`,
outputMode: 'encyclopedia',
doPage: function (doc) {
// look for anything with a 'Film' infobox
return doc.infobox() && doc.infobox().type() === 'film'
},
parse: function (doc) {
let inf = doc.infobox()
// pluck some values from its infobox
return {
title: doc.title(),
budget: inf.get('budget'),
gross: inf.get('gross')
}
}
})
```

#### Talk Pages
Talk pages are not found in the normal 'latest-pages-articles.xml' dump. Instead, you must download the larger 'latest-pages-meta-current.xml' dump.
To process only Talk pages, set 'namespace' to 1.

```js
import dip from 'dumpster-dip'

const opts = {
  input: `./enwiki-latest-pages-meta-current.xml`,
  namespace: 1, // do talk pages only
  parse: function (doc) {
    return doc.text() // return their text
  }
}

dip(opts)
```
---
### Customization
Given the `parse` callback, you're free to return anything you'd like.
One of the charms of [wtf_wikipedia](https://github.com/spencermountain/wtf_wikipedia) is its [plugin system](https://observablehq.com/@spencermountain/wtf-wikipedia-plugins?collection=@spencermountain/wtf_wikipedia), which allows users to add any new features.
Here we apply a [custom plugin](https://observablehq.com/@spencermountain/wtf-wikipedia-plugins) to our wtf lib, and pass it in so it's available to each worker:
in `./myLib.js`
```js
import wtf from 'wtf_wikipedia'

// add custom analysis as a plugin
wtf.plugin((models, templates) => {
// add a new method
models.Doc.prototype.firstSentence = function () {
return this.sentences()[0].text()
}
// support a missing plugin
templates.pingponggame = function (tmpl, list) {
let arr = tmpl.split('|')
return arr[1] + ' to ' + arr[2]
}
})
export default wtf
```

then we can pass this version into dumpster-dip:
```js
import dip from 'dumpster-dip'

dip({
input: '/path/to/dump.xml',
libPath: './myLib.js', // our version (relative to cwd)
parse: function (doc) {
return doc.firstSentence() // use custom method
}
})
```

See the [plugins available](https://github.com/spencermountain/wtf_wikipedia/tree/master/plugins), such as the [NHL season parser](https://github.com/spencermountain/wtf_wikipedia/tree/master/plugins/sports), the [nsfw tagger](https://github.com/spencermountain/wtf-plugin-nsfw), or a parser for [disambiguation pages](https://github.com/spencermountain/wtf_wikipedia/tree/master/plugins/disambig).
---
#### 👋
We are committed to making this library into a great tool for parsing mediawiki projects.
**[Prs](https://github.com/spencermountain/compromise/wiki/Contributing) welcomed and respected.**
MIT