https://github.com/physikerwelt/wikifilter
Simple script to filter wikidumps for wiki-tags
- Host: GitHub
- URL: https://github.com/physikerwelt/wikifilter
- Owner: physikerwelt
- Created: 2014-02-20T10:44:28.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2019-04-18T09:38:50.000Z (over 5 years ago)
- Last Synced: 2024-10-10T18:51:56.710Z (26 days ago)
- Topics: dump, filter, filter-wikidumps, wiki-tags, wikipedia
- Language: Python
- Size: 11.7 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
wikiFilter
==========

Simple script to filter wikidumps for wiki-tags
To extract all pages that contain the math tag from enwiki, run the following:
```
git clone https://github.com/physikerwelt/wikiFilter
cd wikiFilter
mkdir wout
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
./wikiFilter.py
```
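To illustrate the idea behind the filtering step, here is a minimal sketch in Python. It is not the code from `wikiFilter.py`; the function names `iter_pages` and `filter_pages` are hypothetical. It assumes the dump places `<page>` and `</page>` on their own lines and that wikitext inside the dump is XML-escaped, so a `<math>` tag appears as `&lt;math&gt;`.

```python
import bz2
import io
import re


def iter_pages(stream):
    """Yield each <page>...</page> block from a line-oriented XML dump stream."""
    buffer = []
    inside = False
    for line in stream:
        if "<page>" in line:
            inside = True
            buffer = []
        if inside:
            buffer.append(line)
        if inside and "</page>" in line:
            inside = False
            yield "".join(buffer)


def filter_pages(stream, tag="math"):
    """Yield only the pages whose text contains an opening tag with the given name."""
    # Wikitext in the dump is XML-escaped, so "<math" shows up as "&lt;math".
    pattern = re.compile(r"&lt;" + re.escape(tag) + r"\b")
    for page in iter_pages(stream):
        if pattern.search(page):
            yield page


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real dump, compressed like the real file.
    sample = (
        "<mediawiki>\n"
        "<page>\n<title>A</title>\n"
        "<text>E = &lt;math&gt;mc^2&lt;/math&gt;</text>\n</page>\n"
        "<page>\n<title>B</title>\n<text>no formula</text>\n</page>\n"
        "</mediawiki>\n"
    )
    compressed = io.BytesIO(bz2.compress(sample.encode("utf-8")))
    with bz2.open(compressed, "rt", encoding="utf-8") as stream:
        for page in filter_pages(stream):
            print(page)
```

For a real dump, passing the file name to `bz2.open(..., "rt", encoding="utf-8")` instead of the in-memory buffer streams the archive line by line without decompressing it to disk first.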
All options of wikiFilter are listed in its help output:
```
./wikiFilter.py --help
usage: wikiFilter.py [-h] [-f FILE] [-s SIZE] [-d DIR] [-t TAG] [-v] [-T]

extract wikipages that contain the math tag

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --filename FILE
                        the bz2-file to be split and filtered (default:
                        enwiki-latest-pages-articles.xml.bz2)
  -s SIZE, --splitsize SIZE
                        the number of pages contained in each split (default:
                        1000000)
  -d DIR, --outputdir DIR
                        the directory name where the files go (default: wout)
  -t TAG, --tagname TAG
                        the tag to search for (default: math)
  -v, --verbosity
  -T, --template        include all templates (default: False)
```
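The options above map naturally onto an `argparse` parser. The following is a sketch reconstructed from the help text only, not the parser from `wikiFilter.py`; the function name `build_parser` is hypothetical, and treating `-v` as a repeatable counter and `SIZE` as an integer are assumptions, since the help output does not state either.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options shown in the help output."""
    parser = argparse.ArgumentParser(
        prog="wikiFilter.py",
        description="extract wikipages that contain the math tag",
    )
    parser.add_argument(
        "-f", "--filename", metavar="FILE",
        default="enwiki-latest-pages-articles.xml.bz2",
        help="the bz2-file to be split and filtered (default: %(default)s)",
    )
    parser.add_argument(
        "-s", "--splitsize", metavar="SIZE", type=int, default=1000000,
        help="the number of pages contained in each split (default: %(default)s)",
    )
    parser.add_argument(
        "-d", "--outputdir", metavar="DIR", default="wout",
        help="the directory name where the files go (default: %(default)s)",
    )
    parser.add_argument(
        "-t", "--tagname", metavar="TAG", default="math",
        help="the tag to search for (default: %(default)s)",
    )
    # Assumption: -v counts repetitions (-v, -vv, ...); the help text is silent on this.
    parser.add_argument("-v", "--verbosity", action="count", default=0)
    parser.add_argument(
        "-T", "--template", action="store_true",
        help="include all templates (default: %(default)s)",
    )
    return parser


if __name__ == "__main__":
    # Example: search for the "ref" tag and write 500 pages per split.
    args = build_parser().parse_args(["-t", "ref", "-s", "500"])
    print(args.tagname, args.splitsize, args.outputdir)
```

Under these assumptions, invoking the script with `-t ref -s 500` would search for the `ref` tag instead of `math` and write 500 pages per split into the default `wout` directory.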