https://github.com/davideuler/offline-wikipedia
Automatically exported from code.google.com/p/offline-wikipedia
- Host: GitHub
- URL: https://github.com/davideuler/offline-wikipedia
- Owner: davideuler
- Created: 2015-04-11T09:56:59.000Z (almost 11 years ago)
- Default Branch: master
- Last Pushed: 2015-04-11T09:59:32.000Z (almost 11 years ago)
- Last Synced: 2025-06-27T15:06:56.694Z (7 months ago)
- Language: JavaScript
- Size: 1.1 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README
README
Offline-wikipedia is a Django application that lets you keep all of English Wikipedia (in text format!) on your system.
You need to have all the fragmented dump files in the data/xml_blocks/ folder and the enwiki_db tables created in the corresponding database.
Remember to replace the parameters in page/class_con.py and views.py for the database connection and other settings to make it work.
views.py uses class_con.py to convert wiki text to HTML, and urls.py handles the CSS files and generates a Wikipedia-like page.
This configuration does not include pictures and other media; it is text only, displayed in a format similar to the English Wikipedia.
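As a rough illustration of the connection parameters that need replacing, here is a minimal sketch; only the enwiki_db name comes from this README, while the MySQLdb driver and the placeholder values are assumptions and may not match what class_con.py actually expects.

```python
# Hypothetical sketch of the connection parameters to fill in before running
# the application.  Only the enwiki_db name is taken from this README; the
# MySQLdb driver and the placeholder values are assumptions.
import MySQLdb

def get_connection():
    return MySQLdb.connect(
        host="localhost",    # database host
        user="wiki_user",    # replace with your database user
        passwd="secret",     # replace with your database password
        db="enwiki_db",      # database holding the article tables (name is an assumption)
    )
```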
If you just want to test the wiki-to-HTML converter without going through the pain of downloading the XML file and configuring the database, you can use parser.py in the main folder directly. Provide the path to the wiki file and the output will be generated as result.html. Also cross-check the location of the en_template file and change it accordingly.
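For a sense of the overall shape of such a standalone pass (read one wiki text file, rewrite a few markup constructs, write result.html), here is a heavily simplified sketch; the regexes below are illustrative only and do not reflect what parser.py or the en_template file actually do.

```python
# Simplified sketch of a standalone wikitext-to-HTML pass, in the spirit of
# parser.py: read one wiki file, rewrite a few markup constructs, and emit
# result.html.  The real parser.py handles far more (templates, en_template).
import re
import sys

def wikitext_to_html(text):
    # '''bold''' and ''italic''
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)
    # == Heading == lines
    text = re.sub(r"^==\s*(.+?)\s*==\s*$", r"<h2>\1</h2>", text, flags=re.M)
    # [[Target|label]] and [[Target]] internal links
    text = re.sub(r"\[\[([^|\]]+)\|([^\]]+)\]\]", r'<a href="\1">\2</a>', text)
    text = re.sub(r"\[\[([^\]]+)\]\]", r'<a href="\1">\1</a>', text)
    return text

if __name__ == "__main__":
    with open(sys.argv[1]) as f:        # path to the wiki text file
        html = wikitext_to_html(f.read())
    with open("result.html", "w") as f:
        f.write(html)
```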
If you want to contribute, or have anything to say about this attempt, feel free to contact me at shantanu[at]sarai[dot]net.
index_file.py creates the archive.csv file with the names of all articles and their boundaries in CSV format.
The mod file is the sorted form of archive.csv.
segragate.py takes this mod file and creates a separate CSV file for each letter of the alphabet to make searching faster.
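To make the index layout concrete, here is a hedged sketch of the per-letter split idea behind segragate.py: read a sorted CSV of article names and boundaries and write one CSV per leading letter, so a lookup only has to scan a small file. The file names, column layout, and function names are assumptions, not the actual ones used by the scripts.

```python
# Illustrative sketch of a per-letter index split: take a sorted CSV of
# (title, start_offset, end_offset) rows and write one CSV per leading
# letter.  File and column names are assumptions, not the real ones.
import csv
from collections import defaultdict

def split_index_by_letter(sorted_index_path="mod"):
    buckets = defaultdict(list)
    with open(sorted_index_path, newline="") as f:
        for title, start, end in csv.reader(f):
            letter = title[:1].upper() if title[:1].isalpha() else "other"
            buckets[letter].append((title, start, end))

    # One small CSV per letter keeps each lookup confined to one file.
    for letter, rows in buckets.items():
        with open(f"index_{letter}.csv", "w", newline="") as out:
            csv.writer(out).writerows(rows)

if __name__ == "__main__":
    split_index_by_letter()
```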
Planned changes:
* Use mwlib for parsing and creating HTML from the XML files.
* Use a binary format to store the hash values of articles and their boundaries rather than CSV files (see the sketch after this list).
* Check the latest dumps.
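As a rough sketch of what such a binary index could look like, the snippet below packs a 64-bit hash of the article title together with its start and end offsets into fixed-size records using struct; the record layout and the hash choice are assumptions, not a decided format.

```python
# Hedged sketch of a fixed-size binary index record: a 64-bit title hash plus
# 64-bit start/end byte offsets, packed with struct.  The layout and the use
# of a truncated SHA-1 are assumptions, not the project's chosen format.
import hashlib
import struct

RECORD = struct.Struct("<QQQ")   # title hash, start offset, end offset

def title_hash(title):
    # Stable 64-bit hash derived from the first 8 bytes of SHA-1.
    return int.from_bytes(hashlib.sha1(title.encode("utf-8")).digest()[:8], "little")

def write_records(path, entries):
    # entries: iterable of (title, start_offset, end_offset)
    with open(path, "wb") as f:
        for title, start, end in entries:
            f.write(RECORD.pack(title_hash(title), start, end))

def read_records(path):
    with open(path, "rb") as f:
        data = f.read()
    for hash_, start, end in RECORD.iter_unpack(data):
        yield hash_, start, end
```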
I am also available as baali (IRC nick) on freenode.net.