https://github.com/i80and/quelt
A simple and fast offline Wikipedia reader
Last synced: 8 months ago
- Host: GitHub
- URL: https://github.com/i80and/quelt
- Owner: i80and
- License: mit
- Created: 2011-12-23T15:36:55.000Z (over 14 years ago)
- Default Branch: master
- Last Pushed: 2012-12-01T15:38:22.000Z (over 13 years ago)
- Last Synced: 2025-02-02T23:56:43.153Z (over 1 year ago)
- Language: C
- Homepage:
- Size: 164 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
Quelt
=====
A lightweight offline Wikipedia reader.
Compilation Requirements
------------
* C99 compiler
* Unix environment. Some win32 shims exist, but they are untested.
* Expat (only for quelt-split)
* Zlib
Building
--------
    $ make
Usage
-----
    $ ./quelt-split [path to XML dump] [-v]
    $ ./quelt [part of title] --search [--plain]
    $ ./quelt [exact title] [--plain]
File format
-----------
The initial plan was for Quelt to use a separate file for every article, using
the path for the article name. A cute idea, but reality set in fairly quickly:
* Filename restrictions, especially on NTFS
* Standard filesystem tools are not built to handle 3 million+ files easily
So instead, Quelt uses a custom binary format split across two files:
`quelt.db` and `quelt.index`.
`quelt.index`:
    | n_articles: Int32
    | segment_length: Int32
    | article 0 title: Byte[255]
    | article 0 offset: Int64
    | article 1 title: Byte[255]
    | article 1 offset: Int64
    | article n title: Byte[255]
    | article n offset: Int64
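As a rough illustration of the layout above, here is a minimal sketch that decodes the header and a single entry from an index already loaded into memory. It is not Quelt's actual code. The field sizes come from the layout, but the little-endian byte order, the absence of padding between fields, and all names (`read_le`, `parse_header`, `parse_entry`, `struct index_entry`) are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sizes taken from the layout above. Byte order and the absence of
 * padding are assumptions of this sketch, not documented facts. */
enum { TITLE_LEN = 255, HEADER_LEN = 8, ENTRY_LEN = TITLE_LEN + 8 };

/* Hypothetical in-memory form of one index entry. */
struct index_entry {
    char title[TITLE_LEN + 1]; /* +1 so the title is always NUL-terminated */
    int64_t offset;            /* start of the article's stream in quelt.db */
};

/* Decode a little-endian unsigned integer of n bytes (n <= 8). */
static uint64_t read_le(const unsigned char *p, size_t n)
{
    uint64_t v = 0;
    assert(n <= 8);
    for (size_t i = 0; i < n; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Read the two header fields. Returns 0 on success, -1 if buf is too short. */
static int parse_header(const unsigned char *buf, size_t len,
                        int32_t *n_articles, int32_t *segment_length)
{
    if (len < HEADER_LEN)
        return -1;
    *n_articles = (int32_t)read_le(buf, 4);
    *segment_length = (int32_t)read_le(buf + 4, 4);
    return 0;
}

/* Read entry number i. Returns 0 on success, -1 if it lies outside buf. */
static int parse_entry(const unsigned char *buf, size_t len, size_t i,
                       struct index_entry *out)
{
    size_t start = HEADER_LEN + i * ENTRY_LEN;
    if (start > len || len - start < ENTRY_LEN)
        return -1;
    memcpy(out->title, buf + start, TITLE_LEN);
    out->title[TITLE_LEN] = '\0';
    out->offset = (int64_t)read_le(buf + start + TITLE_LEN, 8);
    return 0;
}
```

Under these assumptions every entry occupies 263 bytes, so entry `i` starts at byte `8 + 263 * i` and can be read without scanning the entries before it.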
`quelt.db` is a concatenated sequence of zlib streams, where the start of each
article is given by the article offsets in `quelt.index`.
The index is broken up into segments, all of which (except the last) are of
length `segment_length` and sorted independently. A lookup runs a binary search
on one segment after another, which gives an average search time of
`O((n_segments/2) * log(segment_length))` comparisons while still allowing
quelt and quelt-split to run on memory-constrained machines. Note that the
sorted segments could also serve as the first step toward a real external
merge sort.
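The segment-by-segment lookup can be sketched as follows. This is only an illustration, not the code Quelt ships. It assumes the titles are already in memory as C strings and that each segment is sorted in `strcmp` order. The ordering Quelt really uses is not stated here, and the names `search_segment` and `find_title` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Binary search in the half-open range titles[lo, hi).
 * Returns the index of key, or -1 if it is not in this segment. */
static long search_segment(const char *const *titles, size_t lo, size_t hi,
                           const char *key)
{
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(key, titles[mid]);
        if (cmp == 0)
            return (long)mid;
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return -1;
}

/* Try each independently sorted segment in turn.
 * Returns the index of key in titles, or -1 if no segment contains it. */
static long find_title(const char *const *titles, size_t n_articles,
                       size_t segment_length, const char *key)
{
    assert(segment_length > 0);
    for (size_t start = 0; start < n_articles; start += segment_length) {
        size_t end = start + segment_length;
        if (end > n_articles)
            end = n_articles; /* the last segment may be shorter */
        long hit = search_segment(titles, start, end, key);
        if (hit >= 0)
            return hit;
    }
    return -1;
}
```

A successful lookup stops at the first segment that contains the title, which is where the `n_segments/2` factor in the average comes from. A title that is absent has to be searched for in every segment.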