{"id":15692248,"url":"https://github.com/passcod/wikt","last_synced_at":"2025-05-08T02:24:50.529Z","repository":{"id":139959904,"uuid":"384165388","full_name":"passcod/wikt","owner":"passcod","description":"Experimental playground for wiktionary data","archived":false,"fork":false,"pushed_at":"2021-07-09T07:39:19.000Z","size":25,"stargazers_count":7,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-31T16:32:41.330Z","etag":null,"topics":["experiment","full-text","playground","wiktionary"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/passcod.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-08T15:13:23.000Z","updated_at":"2023-09-08T18:24:08.000Z","dependencies_parsed_at":null,"dependency_job_id":"f62bdbae-3465-45c9-bda5-98a867932676","html_url":"https://github.com/passcod/wikt","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/passcod%2Fwikt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/passcod%2Fwikt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/passcod%2Fwikt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/passcod%2Fwikt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/passcod","download_url":"https://codeload.github.com/passcod/wikt/tar.gz/refs/heads/main","host":
{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252985344,"owners_count":21835958,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["experiment","full-text","playground","wiktionary"],"created_at":"2024-10-03T18:30:10.632Z","updated_at":"2025-05-08T02:24:50.499Z","avatar_url":"https://github.com/passcod.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# wikt\n\nExperimental playground for wiktionary data.\n\n**This document might not update as often as the code does.**\n\n## Set up\n\nYou'll want a minimum of 10 GB free space, a decent internet connection to download the dump (it's\nabout 1GB), an SSD (it's very dependent on I/O), a multi-core processor (it's very parallel). It\nwill work on a single core, just multiply every duration below by 8–10×.\n\n### Install\n\nAs this is an experiment/playground it's probably best to clone this repo locally.\n\nThen call the tool with `cargo run --release -- OPTIONS`. For brevity below I say `wikt OPTIONS` but\nI'm actually using the cargo invocation.\n\nBuilding without `--release` is somewhat faster but 40–100× slower to process data so really not\nworth it.\n\nA useful global option is `-V`, which takes a log level. Set to `debug` for a little more logging,\nset to `trace` for a lot more logs including dumps of intermediate data, set to `warn` or `error` to\nomit the default (`info`) logging.\n\n### To get a dump:\n\n1. https://dumps.wikimedia.org/enwiktionary/\n2. Select the penultimate dated folder. 
Not the last one, which might be incomplete, the one before\n    last. Or the last, if you're sure it's complete.\n3. You want the `pages-articles-multistream-xml` file. Not the index file, we don't use that.\n4. Download it and unpack it.\n\nIt might be helpful to keep the packed version around so you can just delete the unpacked file once\ndone with it, to save space and time. Alternatively, you can repack it with zstd; it will be faster\nto decompress should you need to and take up about the same space. If you're working on btrfs you\ncan probably use transparent zstd compression for the same effect without having to unpack again.\n\nThe tooling is focussed on the English Wiktionary; it may or may not work on other languages.\n\n### Do once: extract dump into the store\n\nThe dump is a single massive XML file. That's pretty much impossible to query or do anything with\nin any kind of efficient way.\n\nSo, the first step with `wikt` is to extract just the useful information into a custom binary format\nwhich I designed to be trivial to parse but massively parallelisable, as well as having some random\naccess capability. That's stored in (by default) a `store` folder, and actually works out to a bit\nsmaller than the dump itself. At the time of writing, my store folder contains 1472 Zstandard-compressed files\nof this custom format.\n\nYou generate the store with:\n\n```\nwikt store make path/to/dump.xml\n```\n\nThis will take hours.\n\nEach file in the store is called a \"block\", each block contains up to 10k \"entries\", which contain\nthe raw title and body of a wiktionary page. Blocks have a short header with the number of entries\nwithin and an array of byte offsets into the subsequent data section where each entry starts. Blocks\nare zstd compressed by wikt, with a dictionary trained on the first block. 
Entries have a header\nwith two byte lengths, one each for the title and body data.\n\nSo you can read an entry given the name of the block and the number of the entry within that block.\nThat's expressed as a \"ref\" or \"refid\" which is two u32s separated by a slash in the human/textual\nform, or a u64 containing the concatenation of the two u32s in machine form.\n\nAnd you can read all entries by iterating (in parallel) the entire `store` folder, and then opening\neach block, decompressing it, and after parsing the block header, parsing every entry in parallel.\n\nIt could be made faster by decompressing only the block header, and then seeking to the required\nposition (for random access) or chunking along byte offsets and parsing each chunk from the zstd\nstream (for sequential access). Also there might be facilities in the zstd format itself for that\npattern of use that we don't take advantage of currently.\n\n### Query the store\n\nYou can search the store for substrings in the text of entries, or for negative matches. This is\nfairly slow for specific queries because it literally iterates the entire store and runs the matcher\non every entry, but on my machine a single substring search takes ~30 seconds to run through it all,\nso it's not that terrible.\n\nYou can query the store while it's extracting the dump.\n\n```\nwikt store query word \"phrase with spaces\" ~negative\n```\n\nEach entry returned is just the title prefixed by the refid in `[`brackets`]`; you can use that\nto get the full text of the entry:\n\n```\nwikt store get 10000/1234\n```\n\nYou can use the `--count` flag to instead return the number of entries it matched; this is faster\nsimply by virtue of not having to write to output for every entry.\n\n### Build the index\n\nOnce you've gotten a full store, you can build the index:\n\n```\nwikt index make\n```\n\nThis will take 5-15 minutes.\n\nThis is one of the places where you can play around, by changing how the index is built. 
To make it\neasier to see changes in effect, there's a `--limited N` option. Set `N` to e.g. 10, that will stop\nafter reading 10 blocks into the index.\n\nEach entry is read _at least once_ into the index. A \"document\" is an indexed entry or subentry.\nAs of writing, the full index is ~7.3 million entries and indexes out to ~20 million documents.\n\nThe index has the title of each entry stored, so it can return titles fast, and the refid, so the\ntext of an entry can be fetched from the store. The full text of the entry is _not_ stored, which\nsaves considerable space. Still, as of writing the index was several gigabytes large.\n\n### Query the index\n\nYou pass a Tantivy full text query, and it returns the top scored results.\n\n```\nwikt index query 'star system'\n```\n\nYou can query phrases with `\"phrase\"`, and make a term requirement stronger with a `+` prefix, or\nexclude a term with a `-` prefix. See Tantivy for more.\n\nBy default it searches in the body, and you can query specific fields with `field:expression`. For\nexample, to get results for English nouns:\n\n```\nwikt index query '+lang:english +gram:noun'\n```\n\nObviously the fields depend on how you built your index.\n\nYou can't query an index that was created with different fields than how you're querying it. So if\nyou make changes to the schema you'll need to rebuild the index before querying. Contrary to the\nstore, you can't query the index until changes are committed, and the `index make` process only\ncommits once at the end.\n\nThe query output contains a bit of metadata:\n\n```\nscore=24.352657 [2440000/439] (english/?) 
double star system\n        ===Noun=== {{en-noun|head=[[double]] [[star system]]}}  # {{lb|en|star}} a [[bi…\n```\n\nThat's:\n- the index search score\n- the refid\n- the lang/gram indicator (here english language, unset grammatical category)\n- the title of the entry (in the actual output it's in bold)\n- an excerpt (80 chars) of the entry\n\nBy default it fetches an excerpt of the text for display. You can have it show the entire entry with\n`--full`. Or you can skip fetching the text, which will be faster, with `--titles`.\n\nUse `-n` to change the number of results returned (default 20).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpasscod%2Fwikt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpasscod%2Fwikt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpasscod%2Fwikt/lists"}