{"id":16042736,"url":"https://github.com/tatuylonen/wiktextract","last_synced_at":"2025-05-14T10:08:23.199Z","repository":{"id":40625872,"uuid":"155235363","full_name":"tatuylonen/wiktextract","owner":"tatuylonen","description":"Wiktionary dump file parser and multilingual data extractor","archived":false,"fork":false,"pushed_at":"2025-04-10T09:33:14.000Z","size":18671,"stargazers_count":882,"open_issues_count":22,"forks_count":92,"subscribers_count":16,"default_branch":"master","last_synced_at":"2025-04-10T09:37:57.860Z","etag":null,"topics":["dictionary","extractor","lua","multilingual","parser","scribunto","templates","wikitext","wiktionary","wiktionary-parser"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tatuylonen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-29T15:27:35.000Z","updated_at":"2025-04-10T09:33:18.000Z","dependencies_parsed_at":"2023-02-15T08:15:49.399Z","dependency_job_id":"f7d580ce-8a0d-4028-8d3b-ccd6941db12c","html_url":"https://github.com/tatuylonen/wiktextract","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tatuylonen%2Fwiktextract","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tatuylonen%2Fwiktextract/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tatuylonen%2Fwiktextract/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/tatuylonen%2Fwiktextract/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tatuylonen","download_url":"https://codeload.github.com/tatuylonen/wiktextract/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248695311,"owners_count":21146952,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dictionary","extractor","lua","multilingual","parser","scribunto","templates","wikitext","wiktionary","wiktionary-parser"],"created_at":"2024-10-09T00:03:00.649Z","updated_at":"2025-04-13T09:56:12.436Z","avatar_url":"https://github.com/tatuylonen.png","language":"Python","readme":"# Wiktextract\n\nThis is a utility and Python package for extracting data from Wiktionary.\n\nPlease report issues on github and we'll try to address them reasonably\nsoon.\n\nSome extracted Wiktionary editions data are available for browsing\nand downloading at https://kaikki.org, the website will be updated\nevery few days.\n\nNote: extracting all data for all languages from the English\nWiktionary may take from an hour to several days, depending\non your computer.  Expanding Lua modules is not cheap, but it enables\nsuperior extraction quality and maintainability! You may want to look\nat the data downloads instead of running it yourself.\n\n## Overview\n\nThis is a Python package and tool for extracting information from various\nWiktionary data dumps, most notably and completely the English edition\n(enwiktionary).  
Note that an edition of Wiktionary contains extensive\ndictionaries and inflectional information for many languages, not just the\nlanguage it has been written in.\n\nOne thing that distinguishes this tool from any system we're aware of is\nthat this tool expands templates and Lua macros in Wiktionary.  That\nenables much more accurate rendering and extraction of glosses, word\nsenses, inflected forms, and pronunciations.  It also makes the system\nmuch easier to maintain.  All this results in much higher extraction\nquality and accuracy.\n\nThe English edition extraction 'module' extracts glosses, parts-of-speech,\ndeclension/conjugation information when available, translations for all\nlanguages when available, pronunciations (including audio file links),\nqualifiers including usage notes, word forms, links between words including\nhypernyms, hyponyms, holonyms, meronyms, related words, derived terms,\ncompounds, alternative forms, etc.  Links to Wikipedia pages, Wikidata\nidentifiers, and other such data are also extracted when available. For many\nclasses of words, a word sense is annotated with specific information such as\nwhat word it is a form of, what is the RGB value of the color it represents,\nwhat is the numeric value of a number, what SI unit it represents, etc.\n\nOther editions are less complete (or the Wiktionary edition itself doesn't\nnecessarily have the same width of data), but we try to cover the basics.\n\nThis tool extracts information for all languages that have data in the\nwiktionary edition.  It also extracts translingual data and\ninformation about characters (anything that has an entry in Wiktionary).\n\nThis tool reads a ``\u003clanguage-code\u003ewiktionary-\u003cdate\u003e-pages-articles.xml.bz2``\ndump file and outputs JSONL-format (json objects separated with newlines)\ndictionaries containing most of the information in Wiktionary.  
The dump files\ncan be downloaded from https://dumps.wikimedia.org.\n\nThis utility will be useful for many natural language processing,\nsemantic parsing, machine translation, and language generation\napplications both in research and industry.\n\nThe tool can be used to extract machine translation dictionaries,\nlanguage understanding dictionaries, semantically annotated\ndictionaries, and morphological dictionaries with\ndeclension/conjugation information (where this information is\navailable for the target language).  Dozens of languages have\nextensive vocabulary in ``enwiktionary``, and several thousand\nlanguages have partial coverage.\n\nThe ``wiktwords`` script makes extracting the information for use by other tools\ntrivial without writing a single line of code.  It extracts the information\nspecified by command options for languages specified on the command line, and\nwrites the extracted data to a file or standard output in JSONL format (JSON\nobjects separated by newlines) for processing by other tools.\n\nAs far as we know, this is the most comprehensive tool available for\nextracting information from Wiktionary as of December 2020.\n\nIf you find this tool and/or the pre-extracted data helpful, please\ngive this a star on GitHub!\n\n## Pre-extracted data\n\nFor most people, it may be easiest to just download the pre-extracted data.\nPlease see\n[https://kaikki.org/dictionary/rawdata.html](https://kaikki.org/dictionary/rawdata.html).\nThe raw wiktextract data, extracted category tree, extracted templates\nand modules, as well as a bulk download of audio files for\npronunciations in both \u003ccode\u003e.ogg\u003c/code\u003e and \u003ccode\u003e.mp3\u003c/code\u003e formats\nare available.\n\nThere is also a download link at the bottom of every page and a button\nto view the JSON produced for each page.  
You can download all data,\ndata for a specific language, data for just a single word, or data for\na list of related words (e.g., a particular part-of-speech or words\nrelating to a particular topic or having a particular inflectional\nform).  All downloads are in [JSON Lines](https://jsonlines.org/) format (each line is a separate JSON\nobject).  The bigger downloads are also available in compressed form.\n\nSome people have asked for the full data as a single JSON object\n(instead of the current one JSON object per line format).  I've\ndecided to keep it as a JSON object per line, because loading all the\ndata into Python requires about 120 GB of memory.  It is much easier to\nprocess the data line-by-line, especially if you are only interested\nin a part of the information.  You can easily read the files using the\nfollowing code:\n\n```python\nimport json\n\nwith open(\"filename.json\", encoding=\"utf-8\") as f:\n    for line in f:\n        data = json.loads(line)\n        ... # parse the data in this record\n```\n\nIf you want to collect all the data into a list, you can read the file\ninto a list with:\n\n```python\nimport json\n\nlst = []\nwith open(\"filename.json\", encoding=\"utf-8\") as f:\n    for line in f:\n        data = json.loads(line)\n        lst.append(data)\n```\n\nYou can also easily pretty-print the data into a more human-readable form using:\n\n```python\nprint(json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False))\n```\n\nHere is a pretty-printed example of an extracted word entry for the\nword `thrill` as an English verb (only one part-of-speech is shown here):\n\n```python\n{\n  \"categories\": [\n    \"Emotions\"\n  ],\n  \"derived\": [\n    {\n      \"word\": \"enthrill\"\n    }\n  ],\n  \"forms\": [\n    {\n      \"form\": \"thrills\",\n      \"tags\": [\n        \"present\",\n        \"simple\",\n        \"singular\",\n        \"third-person\"\n      ]\n    },\n    {\n      \"form\": \"thrilling\",\n      \"tags\": [\n       
 \"present\"\n      ]\n    },\n    {\n      \"form\": \"thrilled\",\n      \"tags\": [\n        \"participle\",\n        \"past\",\n        \"simple\"\n      ]\n    }\n  ],\n  \"head_templates\": [\n    {\n      \"args\": {},\n      \"expansion\": \"thrill (third-person singular simple present thrills, present participle thrilling, simple past and past participle thrilled)\",\n      \"name\": \"en-verb\"\n    }\n  ],\n  \"lang\": \"English\",\n  \"lang_code\": \"en\",\n  \"pos\": \"verb\",\n  \"senses\": [\n    {\n      \"glosses\": [\n        \"To suddenly excite someone, or to give someone great pleasure; to electrify; to experience such a sensation.\"\n      ],\n      \"tags\": [\n        \"ergative\",\n        \"figuratively\"\n      ]\n    },\n    {\n      \"glosses\": [\n        \"To (cause something to) tremble or quiver.\"\n      ],\n      \"tags\": [\n        \"ergative\"\n      ]\n    },\n    {\n      \"glosses\": [\n        \"To perforate by a pointed instrument; to bore; to transfix; to drill.\"\n      ],\n      \"tags\": [\n        \"obsolete\"\n      ]\n    },\n    {\n      \"glosses\": [\n        \"To hurl; to throw; to cast.\"\n      ],\n      \"tags\": [\n        \"obsolete\"\n      ]\n    }\n  ],\n  \"sounds\": [\n    {\n      \"ipa\": \"/\\u03b8\\u0279\\u026al/\"\n    },\n    {\n      \"ipa\": \"[\\u03b8\\u027e\\u032a\\u030a\\u026a\\u026b]\",\n      \"tags\": [\n        \"UK\",\n        \"US\"\n      ]\n    },\n    {\n      \"ipa\": \"[\\u03b8\\u027e\\u032a\\u030a\\u026al]\",\n      \"tags\": [\n        \"Ireland\"\n      ]\n    },\n    {\n      \"ipa\": \"[t\\u032a\\u027e\\u032a\\u030a\\u026al]\",\n      \"tags\": [\n        \"Ireland\"\n      ]\n    },\n    {\n      \"rhymes\": \"-\\u026al\"\n    },\n    {\n      \"audio\": \"en-us-thrill.ogg\",\n      \"mp3_url\": \"https://upload.wikimedia.org/wikipedia/commons/transcoded/d/db/En-us-thrill.ogg/En-us-thrill.ogg.mp3\",\n      \"ogg_url\": 
\"https://upload.wikimedia.org/wikipedia/commons/d/db/En-us-thrill.ogg\",\n      \"tags\": [\n        \"US\"\n      ],\n      \"text\": \"Audio (US)\"\n    }\n  ],\n  \"translations\": [\n    {\n      \"code\": \"nl\",\n      \"lang\": \"Dutch\",\n      \"sense\": \"suddenly excite someone, or to give someone great pleasure; to electrify\",\n      \"word\": \"opwinden\"\n    },\n    {\n      \"code\": \"fi\",\n      \"lang\": \"Finnish\",\n      \"sense\": \"suddenly excite someone, or to give someone great pleasure; to electrify\",\n      \"word\": \"syk\\u00e4hdytt\\u00e4\\u00e4\"\n    },\n    {\n      \"code\": \"fi\",\n      \"lang\": \"Finnish\",\n      \"sense\": \"suddenly excite someone, or to give someone great pleasure; to electrify\",\n      \"word\": \"riemastuttaa\"\n    },\n...\n    {\n      \"code\": \"tr\",\n      \"lang\": \"Turkish\",\n      \"sense\": \"slight quivering of the heart that accompanies a cardiac murmur\",\n      \"word\": \"\\u00e7arp\\u0131nt\\u0131\"\n    }\n  ],\n  \"wikipedia\": [\n    \"thrill\"\n  ],\n  \"word\": \"thrill\"\n}\n```\n\n## Getting started\n\n### Installing\n\n#### Use container:\n\n```\n$ podman run -v /data:/data -it --rm ghcr.io/tatuylonen/wiktextract --all --all-languages --out /data/fr-20250101.jsonl --edition fr /data/frwiktionary-20250101-pages-articles.xml.bz2\n```\n\n#### Install from source:\n\nOn Linux (example from Ubuntu 20.04), you may need to\nfirst install the `build-essential` and `python3-dev` packages\nwith `apt update \u0026\u0026 apt install build-essential python3-dev python3-pip lbzip2`.\n\n```\ngit clone https://github.com/tatuylonen/wiktextract.git\ncd wiktextract\npython -m venv .venv\nsource .venv/bin/activate\npython -m pip install -U pip\npython -m pip install -e .\n```\n\nUse `pip install` command's `--force-reinstall` and `-e` option to\nreinstall the wikitextprocessor package from source in editable\nmode if you want to update both packages' code with `git pull`.\n\n### Running 
tests\n\nThis package includes tests written using the `unittest` framework.\nThe test dependencies can be installed with the command\n`python -m pip install -e .[dev]`.\n\nTo run the tests, use the following command in the top-level directory:\n\n```\nmake test\n```\n\n### Expected performance\n\nExtracting all data for all languages from English Wiktionary takes\nabout 1.25 hours on a 128-core dual AMD EPYC 7702 system.  The\nextraction speed is expected to scale approximately linearly with the\nnumber of processor cores, provided you have enough memory (about 10GB/core or\n5GB/hyperthread recommended).\n\nAs the extractor expands, these times will change.\n\nYou can control the number of parallel processes to use with the\n`--num-processes` option; the default is to use the number of\navailable cores/hyperthreads.\n\nYou can download the full pre-extracted data from\n[kaikki.org](https://kaikki.org/dictionary/). The pre-extraction is\nupdated regularly with the latest Wiktionary dump.  Using the\npre-extracted data may be the easiest option unless you have special\nneeds or want to modify the code.\n\n## Using the command-line tool\n\nThe ``wiktwords`` script is the easiest way to extract data from\nWiktionary.  Just download the data dump file from\n[dumps.wikimedia.org](https://dumps.wikimedia.org/enwiktionary/) and\nrun the script.  The correct dump file has the name\n``enwiktionary-\u003cdate\u003e-pages-articles.xml.bz2``.\n\nAn example of a typical invocation for extracting all data would be:\n```\nwiktwords --all --all-languages --out data.json --edition en enwiktionary-20230801-pages-articles.xml.bz2\n```\n\nIf you wish to modify the code or test processing individual pages,\nthe following may also be useful:\n\n1. Pass a path for saving a database file that you can use for quickly\nprocessing individual pages:\n\n```\nwiktwords --db-path en_20230801.db --edition en enwiktionary-20230801-pages-articles.xml.bz2\n```\n\n2. 
To process a single page and produce a human-readable output file\nfor debugging:\n\n```\nwiktwords --db-path en_20230801.db --edition en --all --all-languages --out outfile --page page_title\n```\n\nThe following command-line options can be used to control its operation:\n\n* --out FILE: specifies the name of the file to write (specifying \"-\" as the file writes to stdout)\n* --all-languages: extract words for all available languages\n* --language-code LANGUAGE_CODE: extracts the given language (this option may be specified multiple times; defaults to dump file language code and `mul`(Translingual))\n* --language-name LANGUAGE_NAME: Similar to `--language-code` except this option accepts language name\n* --edition LANGUAGE_CODE: specifies the language code for the Wiktionary edition that the dump file is for (supported editions are listed in `-h` help descriptions)\n* --skip-extraction: Used to create a database file from the dump file without waiting for the extraction process to complete.\n* --all: causes all data to be captured for the selected languages\n* --translations: causes translations to be captured\n* --pronunciation: causes pronunciation information to be captured\n* --linkages: causes linkages (synonyms etc.) 
to be captured\n* --examples: causes usage examples to be captured\n* --etymologies: causes etymology information to be captured\n* --descendants: causes descendants information to be captured\n* --inflections: causes inflection tables to be captured\n* --redirects: causes redirects to be extracted\n* --pages-dir DIR: save all wiktionary pages under this directory (mostly for debugging)\n* --db-path PATH: save/use database from this path (for debugging)\n* --page FILE or TITLE: read page from file or database, can be specified multiple times(first line must be \"TITLE: pagetitle\"; file should use UTF-8 encoding)\n* --num-processes PROCESSES: use this many parallel processes (needs 4GB/process)\n* --human-readable: print human-readable JSON with indentation (no longer\nmachine-readable)\n* --override PATH: override pages with files in this directory (first line of the file must be TITLE: pagetitle)\n* --templates-file: extract Template namespace to this tar file\n* --modules-file: extract Module namespace to this tar file\n* --categories-file: extract Wiktionary category tree into this file as JSON (see description below)\n* --inflection_tables_file: extract and expand tables into this file as wikitext; use this to create tests\n* --help: displays help text (with some more options than listed here)\n\n## Calling the library\n\nWhile this package has been mostly intended to be used using the\n`wiktwords` command, it is also possible to call this as a library.\nUnderneath, this uses the `wikitextprocessor` module. 
For more usage\nexamples please read the [wiktwords.py](https://github.com/tatuylonen/wiktextract/blob/master/src/wiktextract/wiktwords.py) and [wiktionary.py](https://github.com/tatuylonen/wiktextract/blob/master/src/wiktextract/wiktionary.py) files.\n\nThis code can be called from an application as follows:\n\n```python\nfrom wiktextract import (\n    WiktextractContext,\n    WiktionaryConfig,\n    parse_wiktionary,\n)\nfrom wikitextprocessor import Wtp\n\nconfig = WiktionaryConfig(\n    dump_file_lang_code=\"en\",\n    capture_language_codes=[\"en\", \"mul\"],\n    capture_translations=True,\n    capture_pronunciation=True,\n    capture_linkages=True,\n    capture_compounds=True,\n    capture_redirects=True,\n    capture_examples=True,\n    capture_etymologies=True,\n    capture_descendants=True,\n    capture_inflections=True,\n)\nwxr = WiktextractContext(Wtp(), config)\n\nRECOGNIZED_NAMESPACE_NAMES = [\n    \"Main\",\n    \"Category\",\n    \"Appendix\",\n    \"Project\",\n    \"Thesaurus\",\n    \"Module\",\n    \"Template\",\n    \"Reconstruction\"\n]\n\nnamespace_ids = {\n    wxr.wtp.NAMESPACE_DATA.get(name, {}).get(\"id\")\n    for name in RECOGNIZED_NAMESPACE_NAMES\n}\nwith open(\"output.json\", \"w\", encoding=\"utf-8\") as f:\n    parse_wiktionary(wxr, dump_path, None, False, namespace_ids, f)\n```\n\nThe capture arguments default to ``True``, so they only need to be set if\nsome values are not to be captured (note that the ``wiktwords``\nprogram sets them to ``False`` unless the ``--all`` or specific capture\noptions are used).\n\n#### parse_wiktionary()\n\n```python\ndef parse_wiktionary(\n    wxr: WiktextractContext,\n    dump_path: str,\n    num_processes: Optional[int],\n    phase1_only: bool,\n    namespace_ids: Set[int],\n    out_f: TextIO,\n    human_readable: bool = False,\n    override_folders: Optional[List[str]] = None,\n    skip_extract_dump: bool = False,\n    save_pages_path: Optional[str] = None,\n) -\u003e None:\n```\n\nThe 
``parse_wiktionary`` function will call ``word_cb(data)`` for\nwords and redirects found in the Wiktionary dump.  ``data`` is\ninformation about a single word and part-of-speech as a dictionary and\nmay include several word senses.  It may also be a redirect (indicated\nby the presence of a \"redirect\" key in the dictionary).  It is in the same\nformat as the JSONL-formatted dictionaries returned by the\n``wiktwords`` tool.\n\nIts arguments are as follows:\n* ``wxr`` (WiktextractContext) - a Wiktextract-level processing context\n  containing fields that point to a Wtp context and a WiktionaryConfig object\n  (below).\n** ``wxr.wtp`` (Wtp) - a\n  [wikitextprocessor](https://github.com/tatuylonen/wikitextprocessor/)\n  processing context.  The number of parallel processes to use can be\n  given as the ``num_threads`` argument to the constructor, and a database\n  file path can be provided as the ``db_path`` argument.\n** ``wxr.config`` (WiktionaryConfig) - a configuration object describing\n  what to extract (see below)\n* `dump_path` (str) - path to a Wiktionary dump file (*-pages-articles.xml.bz2)\n* ``phase1_only`` - if this is set to ``True``, then only a cache file will\n  be created but no extraction will take place.  In this case the ``Wtp``\n  constructor should probably be given the ``db_path`` argument when\n  creating ``wxr.wtp``.\n* `namespace_ids` - a set of namespace ids; pages whose namespace id is\n  not included in this set won't be processed. 
Available id values can\n  be found in the wikitextprocessor project's [data/en/namespaces.json](https://github.com/tatuylonen/wikitextprocessor/blob/main/wikitextprocessor/data/en/namespaces.json)\n  file and the Wiktionary *.xml.bz2 dump file.\n* `out_f` - output file object.\n* `human_readable` - if set to `True`, the output JSON will be formatted with indentation.\n* `override_folders` - override pages with files in these directories.\n* `skip_extract_dump` - skip extracting the dump file if the database exists.\n* `save_pages_path` - path for storing extracted pages.\n\nThis call gathers statistics in ``wxr.config``.  This function will\nautomatically parallelize the extraction.  ``page_cb`` will be called in\nthe parent process, however.\n\n#### parse_page()\n\n```python\ndef parse_page(\n    wxr: WiktextractContext, page_title: str, page_text: str\n) -\u003e List[Dict[str, str]]:\n```\n\n* ``wxr`` (WiktextractContext) - a ``wiktextract`` context containing:\n** ``wxr.wtp`` (Wtp) - a ``wikitextprocessor`` context\n** ``wxr.config`` (WiktionaryConfig) - specifies what to capture and is also used\n  for collecting statistics\n* `page_title` (str) - the title to use for the page\n* `page_text` (str) - contents of the page (wikitext)\n\n#### PARTS_OF_SPEECH\n\nThis is a constant set of all part-of-speech values (``pos`` key) that\nmay occur in the extracted data.  
Note that the list is somewhat larger than\nwhat a conventional part-of-speech list would be.\n\n### class WiktextractContext(object)\n\nThe ``WiktextractContext`` object holds the ``wikitextprocessor``-specific\n``Wtp`` context object and wiktextract's ``WiktionaryConfig`` object.\nIn the future it will hold actual context that doesn't belong in ``Wtp``,\nand ``WiktionaryConfig`` will most probably be integrated\ninto the ``WiktextractContext`` object proper.\n\nThe constructor is called simply by supplying a Wtp and a WiktionaryConfig\nobject:\n\n```python\n# Blank slate, usually for testing\nwxr = WiktextractContext(Wtp(), WiktionaryConfig())\n```\n\nor\n\n```python\n# separately initialized config with a bunch of arguments, as in the\n# constructor shown in the \"class WiktionaryConfig(object)\" section below\nwxr = WiktextractContext(wtp, config)\n```\n\nif it is more convenient.\n\n### class WiktionaryConfig(object)\n\nThe ``WiktionaryConfig`` object is used for specifying what data to collect\nfrom Wiktionary and is also used for collecting statistics during\nextraction. Currently, it is a field of the WiktextractContext context object.\n\nThe constructor:\n\n```python\ndef __init__(\n    self,\n    dump_file_lang_code=\"en\",\n    capture_language_codes=[\"en\", \"mul\"],\n    capture_translations=True,\n    capture_pronunciation=True,\n    capture_linkages=True,\n    capture_compounds=True,\n    capture_redirects=True,\n    capture_examples=True,\n    capture_etymologies=True,\n    capture_inflections=True,\n    capture_descendants=True,\n    verbose=False,\n    expand_tables=False,\n):\n```\n\nThe arguments are as follows:\n* ``capture_language_codes`` (list/tuple/set of strings) - codes of\n  languages for which to capture data.  It defaults to ``[\"en\",\n  \"mul\"]``. To capture all languages, set it to `None`.\n* ``capture_translations`` (boolean) - set to ``False`` to disable capturing\n  translations.  
Translation information seems to be most\n  widely available for the English language, which has translations into\n  other languages.\n* ``capture_pronunciation`` (boolean) - set to ``False`` to disable\n  capturing pronunciations.  Typically, pronunciations include\n  IPA transcriptions and any audio files included in the word entries, along\n  with other information (including dialectal tags).  The type and amount of\n  pronunciation information vary widely between languages.\n* ``capture_linkages`` (boolean) - set to ``False`` to disable capturing\n  linkages between words, such as hypernyms, antonyms, synonyms, etc.\n* ``capture_compounds`` (boolean) - set to ``False`` to disable capturing\n  compound words containing the word.  Compound word capturing is not currently\n  fully implemented.\n* ``capture_redirects`` (boolean) - set to ``False`` to disable capturing\n  redirects.  Redirects are not associated with any specific language\n  and thus requesting them returns them for all words in all languages.\n* ``capture_examples`` (boolean) - set to ``False`` to disable\n  capturing usage examples.\n* ``capture_etymologies`` (boolean) - set to ``False`` to\n  disable capturing etymologies.\n* ``capture_descendants`` (boolean) - set to ``False`` to\n  disable capturing descendants.\n* ``capture_inflections`` (boolean) - set to ``False`` to\n  disable capturing inflection tables.\n\n## Format of extracted redirects\n\nSome pages in Wiktionary are redirects.  For these, ``word_cb`` will\nbe called with data in a special format.  In this case, the dictionary\nwill have a ``redirect`` key, which will contain the page title that\nthe entry redirects to.  The ``title`` key contains the word/term that\ncontains the redirect.  Redirect entries do not have ``pos`` or any of\nthe other fields.  
Redirects also are not associated with any\nlanguage, so all redirects are always returned regardless of the\ncaptured languages (if extracting redirects has been requested).\n\n## Format of the extracted word entries\n\nInformation returned for each word is a dictionary.  The dictionary has the\nfollowing keys (others may also be present or added later):\n\n* ``word`` - the word form\n* ``pos`` - part-of-speech, such as \"noun\", \"verb\", \"adj\", \"adv\", \"pron\", \"determiner\", \"prep\" (preposition), \"postp\" (postposition), and many others.  The complete list of possible values returned by the package can be found in ``wiktextract.PARTS_OF_SPEECH``.\n* ``lang`` - name of the language this word belongs to (e.g., ``English``)\n* ``lang_code`` - Wiktionary language code corresponding to ``lang`` key (e.g., ``en``)\n* ``senses`` - list of word senses (dictionaries) for this word/part-of-speech (see below)\n* ``forms`` - list of inflected or alternative forms specified for the word (e.g., plural, comparative, superlative, roman script version).  This is a list of dictionaries, where each dictionary has a ``form`` key and a ``tags`` key.  The ``tags`` identify what type of form it is.  It may also contain \"ipa\", \"roman\", and \"source\" fields.  The form can be \"-\" when the word is marked as not having that form (some of those will be word-specific, while others are language-specific; post-processing can drop such forms when no word has a value for that tag combination).\n* ``sounds`` - list of dictionaries containing pronunciation, hyphenation, rhyming, and related information.  Each dictionary may have a ``tags`` key containing tags that clarify what kind of form that entry is.  
Different types of information are stored in different fields: ``ipa`` is [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet) pronunciation, ``enPR`` is [enPR](https://en.wikipedia.org/wiki/Pronunciation_respelling_for_English) pronunciation, ``audio`` is the name of a sound file in Wikimedia Commons.\n* ``categories`` - list of non-disambiguated categories for the word\n* ``topics`` - list of non-disambiguated topics for the word\n* ``translations`` - non-disambiguated translation entries (see below)\n* ``etymology_text`` - etymology section as cleaned text\n* ``etymology_templates`` - templates and their arguments and expansions from\n  the etymology section.  These can be used to easily parse etymological\n  relations.  Certain common templates that do not signify etymological\n  relations are not included.\n* ``etymology_number`` - for words with multiple numbered etymologies, this contains the number of the etymology under which this entry appeared\n* ``descendants`` - descendants of the word (see below)\n* ``synonyms`` - non-disambiguated synonym linkages for the word (see below)\n* ``antonyms`` - non-disambiguated antonym linkages for the word (see below)\n* ``hypernyms`` - non-disambiguated hypernym linkages for the word (see below)\n* ``holonyms`` - non-disambiguated linkages indicating being part of something (see below) (not systematically encoded)\n* ``meronyms`` - non-disambiguated linkages indicating having a part (see below) (fairly rare)\n* ``derived`` - non-disambiguated derived word linkages for the word (see below)\n* ``related`` - non-disambiguated related word linkages for the word (see below)\n* ``coordinate_terms`` - non-disambiguated coordinate term linkages for the word (see below)\n* ``wikidata`` - non-disambiguated Wikidata identifier\n* ``wiktionary`` - non-disambiguated page title in Wikipedia (possibly prefixed by language id)\n* ``head_templates`` - part-of-speech specific head tags for the word.  
This basically just captures the templates (their name and arguments) as a list of dictionaries.  Most applications may want to ignore this.\n* ``inflection_templates`` - conjugation and declension templates found for the word, as dictionaries.  These basically capture the language-specific inflection template for the word.  Note that for some languages inflection information is also contained in ``head_templates``.  XXX in the very near future, we will start parsing inflections from the inflection tables into ``forms``, so there is usually no need to use the ``inflection_templates`` data.\n\nThere may also be other fields.\n\n### Word senses\n\nEach word entry may have multiple glosses under the ``senses`` key.  Each\nsense is a dictionary that may contain the following keys (among others, and more may be added in the future):\n\n* ``glosses`` - list of gloss strings for the word sense (usually only one).  This has been cleaned, and should be straightforward text with no tagging.\n* ``raw_glosses`` - list of gloss strings for the word sense, with less cleaning than ``glosses``.  In particular, parenthesized parts that have been parsed from the gloss into ``tags`` and ``topics`` are still present here.  This version may be easier for humans to interpret.\n* ``tags`` - list of qualifiers and tags for the gloss.  
This is a list of strings, and may include words such as \"archaic\", \"colloquial\", \"present\", \"participle\", \"plural\", \"feminine\", and many others (new words may appear arbitrarily).\n* ``categories`` - list of sense-disambiguated category names extracted from (a subset of) the Category links on the page\n* ``topics`` - list of sense-disambiguated topic names (kind of similar to categories but determined differently)\n* ``alt_of`` - list of words that this sense is an alternative form of; this is a list of dictionaries, with field ``word`` containing the linked word and optionally ``extra`` containing additional text\n* ``form_of`` - list of words that this sense is an inflected form of; this is a list of dictionaries, with field ``word`` containing the linked word and optionally ``extra`` containing additional text\n* ``translations`` - sense-disambiguated translation entries (see below)\n* ``synonyms`` - sense-disambiguated synonym linkages for the word (see below)\n* ``antonyms`` - sense-disambiguated antonym linkages for the word (see below)\n* ``hypernyms`` - sense-disambiguated hypernym linkages for the word (see below)\n* ``holonyms`` - sense-disambiguated linkages indicating being part of something (see below) (not systematically encoded)\n* ``meronyms`` - sense-disambiguated linkages indicating having a part (see below) (fairly rare)\n* ``coordinate_terms`` - sense-disambiguated coordinate term linkages for the word (see below)\n* ``derived`` - sense-disambiguated derived word linkages for the word (see below)\n* ``related`` - sense-disambiguated related word linkages for the word (see below)\n* ``senseid`` - list of textual identifiers collected for the sense.  
If there are QIDs for the entry (e.g., Q123), they are stored in the ``wikidata`` field.\n* ``wikidata`` - list of QIDs (e.g., Q123) for the sense\n* ``wikipedia`` - list of Wikipedia page titles (with optional language code prefix)\n* ``examples`` - list of usage examples, each example being a dictionary with ``text`` field containing the example text, optional ``ref`` field containing a source reference, optional ``english`` field containing English translation, optional ``type`` field containing example type (currently ``example`` or ``quotation`` if present), optional ``roman`` field containing romanization (for some languages written in non-Latin scripts), and optional (rare) ``note`` field containing an English-language parenthesized note from the beginning of a non-English example.\n* ``english`` - if the word sense has a qualifier that could not be parsed, that qualifier is put in this field (rare).  Most qualifiers are parsed into ``tags`` and/or ``topics``.  The gloss with the qualifier still present can be found in ``raw_glosses``.\n\n### Pronunciation\n\nPronunciation information is stored under the ``sounds`` key.  
It is a\nlist of dictionaries, each of which may contain the following keys,\namong others:\n\n* ``ipa`` - pronunciation specifications as an IPA string /.../ or [...]\n* ``enpr`` - pronunciation in English pronunciation respelling\n* ``audio`` - name of a sound file in Wikimedia Commons\n* ``ogg_url`` - URL for an OGG Vorbis format sound file\n* ``mp3_url`` - URL for an MP3 format sound file\n* ``audio-ipa`` - IPA string associated with the audio file, generally giving the IPA transcription of what is in the sound file\n* ``homophones`` - list of homophones for the word\n* ``hyphenation`` - list of hyphenations\n* ``tags`` - other labels or context information attached to the pronunciation entry (e.g., might indicate regional variant or dialect)\n* ``text`` - text associated with an audio file (often not very useful)\n\nNote that Wiktionary audio files are available for bulk download at\n[https://kaikki.org/dictionary/rawdata.html](https://kaikki.org/dictionary/rawdata.html).\nFiles in the download are named with the last component of the URL in\n``ogg_url`` and/or ``mp3_url``.  Downloading them individually takes\nseveral days and puts unnecessary load on Wikimedia servers.\n\n### Translations\n\nTranslations are stored under the ``translations`` key in the word's\ndata (if not sense-disambiguated) or in the word sense (if\nsense-disambiguated).  
They are stored in a list of dictionaries,\nwhere each dictionary has the following keys (and possibly others):\n\n* ``alt`` - optional alternative form of the translation (e.g., in a different script)\n* ``code`` - Wiktionary's 2- or 3-letter language code for the language the translation is for\n* ``english`` - English text, generally clarifying the target sense of the translation\n* ``lang`` - the language name that the translation is for\n* ``note`` - optional text describing or commenting on the translation\n* ``roman`` - optional romanization of the translation (when in non-Latin characters)\n* ``sense`` - optional sense indicating the meaning for which this is a translation (this is a free-text string, and may not match any gloss exactly)\n* ``tags`` - optional list of qualifiers for the translations, e.g., gender\n* ``taxonomic`` - optional taxonomic name of an organism mentioned in the translation\n* ``word`` - the translation in the specified language (may be missing when ``note`` is present)\n\n### Etymologies\n\nEtymological information is stored under the ``etymology_text`` and\n``etymology_templates`` keys in the word's data.  When multiple parts of speech\nare listed under the same etymology, the same data is copied to each\npart-of-speech entry under that etymology.\n\nThe ``etymology_text`` field contains the contents of the whole etymology\nsection cleaned into human-readable text (i.e., templates have been expanded\nand HTML tags removed, among other things).\n\nThe ``etymology_templates`` field contains a list of templates from\nthe etymology section.  Some common templates considered not relevant\nfor etymological information have been removed (e.g., ``redlink\ncategory`` and ``isValidPageName``).  The list also includes nested\ntemplates referenced from templates directly used in the etymology\ndescription.  
Each template in the list is a dictionary with the following\nkeys:\n\n* ``name`` - name of the template\n* ``args`` - dictionary mapping argument names to their cleaned values.  Positional arguments have keys that are numeric strings, starting with \"1\".\n* ``expansion`` - the (cleaned) text the template expands to.\n\n### Descendants\n\nIf a word has a \"Descendants\" section, the `descendants` key will appear in the word's data. It contains a list of objects, one for each line in the section, where each object has the following keys:\n\n* `depth`: The level of indentation of the current line. This can be used to track the hierarchical structure of the list.\n* `templates`: An array of objects corresponding to templates that appear on the line. The structure of each of these objects is the same as the structure of each object in `etymology_templates`.\n* `text`: The expanded and cleaned line text, akin to `etymology_text`.\n\n`descendants` data will also appear for the special case of \"Derived terms\" and \"Extensions\" sections for words that are roots in reconstructed languages, as these sections have the same format.\n\n### Linkages to other words\n\nLinkages (``synonyms``, ``antonyms``, ``hypernyms``, ``holonyms``,\n``meronyms``, ``derived``, ``related``, ``coordinate_terms``) are\nstored in the word's data if not\nsense-disambiguated, and in the word sense if sense-disambiguated.\nThey are lists of dictionaries, where each dictionary can contain the\nfollowing keys, among others:\n\n* ``alt`` - optional alternative form of the target (e.g., in a different script)\n* ``english`` - optional English text associated with the sense, usually identifying the linked target sense\n* ``roman`` - optional romanization of a linked word in a non-Latin script\n* ``sense`` - text identifying the word sense or context (e.g., ``\"to rain very heavily\"``)\n* ``tags``: qualifiers specified for the sense (e.g., field of study, region, dialect, 
style)\n* ``taxonomic``: optional taxonomic name associated with the linkage\n* ``topics``: list of topic descriptors for the linkage (e.g., ``military``)\n* ``word`` - the word this links to (string)\n\n## Category tree data format\n\nThe ``--categories-file`` option extracts the Wiktionary category tree\nas JSON into the specified file.  The data is extracted from the Wiktionary\nLua modules by evaluating them.\n\nThe data written to the JSON file is a dictionary, with the top-level\nkeys ``roots`` and ``nodes``.\n\nRoots is a list of top-level nodes that are not children of other\nnodes.  ``Fundamental`` is the normal top-level node; other roots may\nreflect errors in the hierarchy in Wiktionary.  While not a root, the\ncategory ``all topics`` contains the subhierarchy of topical\ncategories (e.g., ``food and drink``, ``nature``, ``sciences``, etc.).\n\nNodes is a dictionary mapping lowercased category name to a dictionary\ncontaining data about the category.  For each category, the following\nfields may be present:\n\n* ``name`` (always present): non-lowercased name of the category (note, however,\n  that many categories are originally lowercase in the Wiktionary\n  hierarchy)\n* ``desc``: optional description of the category\n* ``clean_desc``: optional cleaned description of the category, with wikitext formatting cleaned to human-readable text, except {{{langname}}} (and possibly other similar tags) are left intact.\n* ``children``: optional list of child categories of the category\n* ``sort``: optional list of sorts (types of subcategories?).\n\nThe categories are returned as they are in the original Wiktionary\ncategory data.  Language-specific categories are generally not\nincluded.  However, there is a category ``{{{langcat}}}`` that appears\nto contain a lot of the categories that have language-specific\nvariants.  
Also, the category tree data does not contain language\nprefixes (the tree is defined in Wiktionary without prefixes and then\nreplicated for each language).\n\n## Related packages\n\nThe\n[wikitextprocessor](https://github.com/tatuylonen/wikitextprocessor)\nis a generic module for extracting data from Wiktionary, Wikipedia, and\nother Wikimedia dump files.  ``wiktextract`` is built using this module.\n\n*When using a version of wiktextract from GitHub, please also set up\nwikitextprocessor so that they have rough parity. The PyPI versions of these\npackages are usually out of date, and mixing a newer version with an older\none will lead to bugs. These packages are being developed in parallel.*\n\nThe [wiktfinnish](https://github.com/tatuylonen/wiktfinnish) package\ncan be used to interpret Finnish noun declensions and verb\nconjugations and for generating Finnish inflected word forms.\n\n## Publications\n\nIf you use Wiktextract or the extracted data in academic work, please\ncite the following article:\n\nTatu Ylonen: [Wiktextract: Wiktionary as Machine-Readable Structured\ndata](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.140.pdf),\nProceedings of the 13th Conference on Language Resources and\nEvaluation (LREC), pp. 1317-1325, Marseille, 20-25 June 2022.\n\nLinking to [https://kaikki.org](https://kaikki.org) or the relevant\nsub-pages would also be greatly appreciated.\n\n## Related tools\n\nA few other tools also exist for parsing Wiktionaries.  These include\n[Dbnary](http://kaiko.getalp.org/about-dbnary/),\n[Wikiparse](https://github.com/frankier/wikiparse), and [DKPro\nJWKTL](https://dkpro.github.io/dkpro-jwktl/).\n\n## Contributing and reporting bugs\n\nPlease report bugs and other issues on GitHub.  I also welcome\nsuggestions for improvement.\n\nPlease email ``ylo`` at ``clausal.com`` if you wish to contribute\nor have patches or suggestions.\n\n## License\n\nCopyright (c) 2018-2020 [Tatu Ylonen](https://ylonen.org).  
This\npackage is free for both commercial and non-commercial use.  It is\nlicensed under the MIT license.  See the file\n[LICENSE](https://github.com/tatuylonen/wiktextract/blob/master/LICENSE)\nfor details.  (Certain files have different open source licenses)\n","funding_links":[],"categories":["Developer Resources"],"sub_categories":["Dictionary Data"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftatuylonen%2Fwiktextract","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftatuylonen%2Fwiktextract","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftatuylonen%2Fwiktextract/lists"}