https://github.com/audiodude/wiki-wc-scripts
Scripts used to support the https://github.com/audiodude/wiki-wc project
- Host: GitHub
- URL: https://github.com/audiodude/wiki-wc-scripts
- Owner: audiodude
- Created: 2015-02-25T19:08:47.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2015-02-27T00:26:34.000Z (almost 10 years ago)
- Last Synced: 2024-10-15T00:47:56.928Z (3 months ago)
- Language: Python
- Homepage:
- Size: 137 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Scripts for counting the number of words on Wikipedia
## You will need
* An [XML dump](http://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia) of the articles of English Wikipedia. You want `pages-articles.xml.bz2`.
* The scripts in this project
* A bunch of disk space
* An Amazon AWS account
## Steps
0. Download the Wikipedia XML article dump and extract it.
0. Run the following command to create a stripped version of the dump that contains only the article text, with punctuation and markup removed: `echo enwiki-20141106-pages-articles.xml | ./grab_articles.py | ./process_wiki.pl > enwiki_words.txt`
0. Upload `mapper.py`, `reducer.py`, and `enwiki_words.txt` to Amazon Elastic MapReduce (EMR).
0. Run the map reduce
## Output file
The output file will contain entries like "apple\t79077", where "\t" is a tab
character. This means that the token "apple" appears 79,077 times in the
English Wikipedia dump you processed.
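The contents of `mapper.py` and `reducer.py` are not reproduced here, so the following is only a minimal sketch of how a word count in this style could produce the "word, tab, count" entries described above. It assumes a Hadoop-streaming-style job in which mapper output is sorted by key before it reaches the reducer. The function names `map_words`, `reduce_counts`, and `load_counts` are hypothetical and are not taken from the project.

```python
from itertools import groupby


def map_words(lines):
    """Emit one tab-separated (word, 1) record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_counts(sorted_records):
    """Sum the counts of consecutive records that share the same word.

    Assumes the records arrive sorted by word, as they would after the
    shuffle/sort phase of a streaming map reduce job.
    """
    parsed = (record.rstrip("\n").split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def load_counts(lines):
    """Parse tab-separated (word, count) output lines into a dict."""
    counts = {}
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        counts[word] = int(count)
    return counts


if __name__ == "__main__":
    # Tiny in-memory stand-in for enwiki_words.txt (made-up sample text).
    sample = ["apple banana apple", "banana apple"]
    # sorted() plays the role of the sort step between map and reduce.
    output = list(reduce_counts(sorted(map_words(sample))))
    counts = load_counts(output)
    print(counts["apple"])
```

Once the real output file has been downloaded, `load_counts` could be applied to it by passing an open file handle, since it only iterates over lines.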