https://github.com/audiodude/wiki-wc-scripts

Scripts used to support the https://github.com/audiodude/wiki-wc project

Host: GitHub
URL: https://github.com/audiodude/wiki-wc-scripts
Owner: audiodude
Created: 2015-02-25T19:08:47.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2015-02-27T00:26:34.000Z (over 10 years ago)
Last Synced: 2025-01-17T23:30:28.606Z (6 months ago)
Language: Python
Homepage:
Size: 137 KB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Scripts for counting the number of words on Wikipedia

## You will need

* An [XML dump](http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia) of the articles of English Wikipedia. You want the `pages-articles.xml.bz2` file.
* The scripts in this project
* A bunch of disk space
* An Amazon Web Services (AWS) account

## Steps

0. Download your Wikipedia XML article dump and extract it.
0. Run the following command to create a stripped version of the dump that contains only the article text, with punctuation and markup removed: `echo enwiki-20141106-pages-articles.xml | ./grab_articles.py | ./process_wiki.pl > enwiki_words.txt`
0. Upload `mapper.py`, `reducer.py`, and `enwiki_words.txt` to Amazon Elastic MapReduce (EMR).
0. Run the MapReduce job.
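
The contents of `mapper.py` and `reducer.py` are not shown in this README. As a rough illustration of what a streaming-style word count does in the last two steps, here is a minimal sketch. It assumes whitespace-separated tokens and tab-separated `token`/`count` lines; the function names `map_lines` and `reduce_pairs` are hypothetical and are not taken from this project's scripts.

```python
import itertools


def map_lines(lines):
    """Mapper sketch: emit one "token<TAB>1" line per whitespace-separated token."""
    for line in lines:
        for token in line.split():
            yield f"{token}\t1"


def reduce_pairs(sorted_lines):
    """Reducer sketch: sum the counts of consecutive lines that share a token.

    Assumes the input is already sorted by token, which is how the shuffle
    phase of a streaming MapReduce job delivers it to a reducer.
    """
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for token, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{token}\t{total}"


if __name__ == "__main__":
    # Simulate map -> sort -> reduce locally on a tiny, made-up sample.
    sample = ["apple banana apple", "banana apple"]
    for out in reduce_pairs(sorted(map_lines(sample))):
        print(out)
```

Running a map, sort, reduce pipeline like this on a small slice of `enwiki_words.txt` is a cheap way to sanity-check the logic before paying for a cluster.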

## Output file
The output file will contain entries like "apple\t79077", where "\t" is a tab
character. This entry means that the token "apple" appears 79,077 times in the
English Wikipedia dump you processed.
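
To make the format concrete, here is a small sketch that parses lines of this shape and lists the most frequent tokens. The helper names `parse_counts` and `top_tokens` are hypothetical, and the sample lines stand in for a real output file, whose name depends on how the job was configured.

```python
def parse_counts(lines):
    """Parse "token<TAB>count" lines into a dict mapping token -> int count."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        token, count = line.split("\t")
        counts[token] = int(count)
    return counts


def top_tokens(counts, n=10):
    """Return the n (token, count) pairs with the highest counts."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    # Made-up sample lines in the documented format.
    sample = ["apple\t79077\n", "pear\t1200\n"]
    for token, count in top_tokens(parse_counts(sample)):
        print(f"{token}: {count:,}")
```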
