{"id":15134673,"url":"https://github.com/tatuylonen/wikitextprocessor","last_synced_at":"2025-04-13T00:42:13.339Z","repository":{"id":53075447,"uuid":"302691576","full_name":"tatuylonen/wikitextprocessor","owner":"tatuylonen","description":"Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution.  For data extraction, bulk syntax checking, error detection, and offline formatting.","archived":false,"fork":false,"pushed_at":"2025-03-31T04:18:51.000Z","size":5577,"stargazers_count":97,"open_issues_count":4,"forks_count":23,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-13T00:42:00.728Z","etag":null,"topics":["mediawiki","scribuntu","wikipedia","wikitext","wiktionary"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tatuylonen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-09T16:09:03.000Z","updated_at":"2025-03-31T04:18:56.000Z","dependencies_parsed_at":"2024-02-07T13:25:26.828Z","dependency_job_id":"5c8b6540-b6c7-4abb-8bcc-9d5e2d9ec078","html_url":"https://github.com/tatuylonen/wikitextprocessor","commit_stats":{"total_commits":945,"total_committers":15,"mean_commits":63.0,"dds":0.6412698412698412,"last_synced_commit":"66545a619dac146e47ea81e9cc13872b890ad97e"},"previous_names":[],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tatuylonen%2Fwikitextprocessor","tags_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tatuylonen%2Fwikitextprocessor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tatuylonen%2Fwikitextprocessor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tatuylonen%2Fwikitextprocessor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tatuylonen","download_url":"https://codeload.github.com/tatuylonen/wikitextprocessor/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248650417,"owners_count":21139672,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["mediawiki","scribuntu","wikipedia","wikitext","wiktionary"],"created_at":"2024-09-26T05:23:42.078Z","updated_at":"2025-04-13T00:42:13.314Z","avatar_url":"https://github.com/tatuylonen.png","language":"Python","readme":"# wikitextprocessor\n\nThis is a Python package for processing [WikiMedia dump\nfiles](https://dumps.wikimedia.org) for\n[Wiktionary](https://www.wiktionary.org),\n[Wikipedia](https://www.wikipedia.org), etc., for data extraction,\nerror checking, offline conversion into HTML or other formats, and\nother uses.  
Key features include:\n\n* Parsing dump files, including built-in support for processing pages\n  in parallel\n* [Wikitext](https://en.wikipedia.org/wiki/Help:Wikitext) syntax\n  parser that converts the whole page into a parse tree\n* Extracting template definitions and\n  [Scribunto](https://www.mediawiki.org/wiki/Extension:Scribunto/Lua_reference_manual)\n  Lua module definitions from dump files\n* Expanding selected templates or all templates, and\n  heuristically identifying templates that need to be expanded before\n  parsing is reasonably possible (e.g., templates that emit table\n  start and end tags)\n* Processing and expanding wikitext parser functions\n* Processing, executing, and expanding Scribunto Lua modules (they are\n  very widely used in, e.g., Wiktionary, for generating\n  [IPA](https://en.wikipedia.org/wiki/International_Phonetic_Alphabet)\n  strings for many languages)\n* Controlled expansion of parts of pages for applications that parse\n  overall page structure first and then expand templates only on\n  certain sections of the page\n* Capturing information from template arguments while expanding them,\n  as template arguments often contain useful information not available\n  in the expanded content.\n\nThis module is primarily intended as a building block for other\npackages that process Wiktionary or Wikipedia data, particularly for\ndata extraction.  You will need to write code to use this.\n\nFor pre-existing extraction modules that use this package, please see:\n\n* [Wiktextract](https://github.com/tatuylonen/wiktextract/) for\n  extracting rich machine-readable dictionaries from Wiktionary.  
You can also\n  find pre-extracted machine-readable Wiktionary data in JSON format at\n  [kaikki.org](https://kaikki.org/dictionary).\n\n## Getting started\n\n### Installing\n\nInstall from source:\n\n```\ngit clone --recurse-submodules --shallow-submodules https://github.com/tatuylonen/wikitextprocessor.git\ncd wikitextprocessor\npython -m venv .venv\nsource .venv/bin/activate\npython -m pip install -U pip\npython -m pip install -e .\n```\n\n### Running tests\n\nThis package includes tests written using the `unittest` framework.\nThe test dependencies can be installed with command\n`python -m pip install -e .[dev]`.\n\nTo run the tests, use the following command in the top-level directory:\n\n```\nmake test\n```\n\nTo run a specific test, use the following syntax:\n\n```\npython -m unittest tests.test_[module].[Module]Tests.test_[name]\n```\n\nPython's unittest framework help and options can be accessed through:\n\n```\npython -m unittest -h\n```\n\n### Obtaining WikiMedia dump files\n\nThis package is primarily intended for processing Wiktionary and\nWikipedia dump files (though you can also use it for processing\nindividual pages or other files that are in wikitext format).  To\ndownload WikiMedia dump files, go to the [dump download\npage](https://dumps.wikimedia.org/backup-index.html).  
We recommend\nusing the \u0026lt;name\u0026gt;-\u0026lt;date\u0026gt;-pages-articles.xml.bz2 files.\n\n## API documentation\n\nUsage example:\n\n```python\nfrom functools import partial\nfrom typing import Any\n\nfrom wikitextprocessor import Wtp, WikiNode, NodeKind, Page\nfrom wikitextprocessor.dumpparser import process_dump\n\ndef page_handler(wtp: Wtp, page: Page) -\u003e Any:\n    wtp.start_page(page.title)\n    # process parse tree\n    tree = wtp.parse(page.body)\n    # or get expanded plain text\n    text = wtp.expand(page.body)\n\nwtp = Wtp(\n    db_path=\"en_20230801.db\", lang_code=\"en\", project=\"wiktionary\"\n)\n\n# extract dump file then save pages to SQLite file\nprocess_dump(\n    wtp,\n    \"enwiktionary-20230801-pages-articles.xml.bz2\",\n    {0, 10, 110, 828},  # namespace id, can be found at the start of dump file\n)\n\nfor _ in map(\n    partial(page_handler, wtp), wtp.get_all_pages([0])\n):\n    pass\n```\n\nThe basic operation is as follows:\n* Extract templates, modules, and other pages from the dump file and save\n  them in a SQLite file\n* Heuristically analyze which templates need to be pre-expanded before\n  parsing to make sense of the page structure (this cannot detect templates\n  that call Lua code that outputs wikitext that affects parsed structure).\n  These first steps together are called the \"first phase\".\n* Process the pages again, calling a page handler function for each page.\n  The page handler can extract, parse, and otherwise process the page, and\n  has full access to templates and Lua macros defined in the dump.  This may\n  call the page handler in multiple processes in parallel.  Return values\n  from the page handler calls are returned to the caller. This is\n  called the second phase.\n\nMost of the functionality is hidden behind the ``Wtp`` object.\n``WikiNode`` objects are used for representing the parse\ntree that is returned by the ``Wtp.parse()`` function.  
``NodeKind``\nis an enumeration type used to encode the type of a ``WikiNode``.\n\n### class Wtp\n\n```python\ndef __init__(\n    self,\n    db_path: Optional[Union[str, Path]] = None,\n    lang_code=\"en\",\n    template_override_funcs: Dict[str, Callable[[Sequence[str]], str]] = {},\n    project: str = \"wiktionary\",\n):\n```\n\nThe initializer can usually be called without arguments, but recognizes\nthe following arguments:\n* `db_path` can be `None`, in which case a temporary database file\n  will be created under `/tmp`, or a path for the database file which contains\n  page texts and other data of the dump file.\n  There are two reasons why you might want to set this:\n  1) you don't have enough space on `/tmp` (3.4G for English dump file),\n  or 2) for testing.\n  If you specify the path and an existing database file exists, that file will\n  be used, eliminating the time needed for Phase 1 (this is very\n  important for testing, allowing processing single pages reasonably fast).\n  In this case, you should not call ``Wtp.process()`` but instead use\n  ``Wtp.reprocess()`` or just call ``Wtp.expand()`` or ``Wtp.parse()`` on\n  wikitext that you have obtained otherwise (e.g., from some file).\n  If the file doesn't exist, you will need to call `Wtp.process()`\n  to parse a dump file, which will initialize the database file during the\n  first phase. If you wish to re-create the database, you should remove\n  the old file first.\n* `lang_code` - the language code of the dump file.\n* `template_override_funcs` - Python functions for overriding expanded template text.\n* `project` - \"wiktionary\" or \"wikipedia\".\n\n```python\ndef read_by_title(\n    self, title: str, namespace_id: Optional[int] = None\n) -\u003e Optional[str]:\n```\n\nReads the contents of the page with the specified title from the cache\nfile.  There is usually no need to call this function explicitly, as\n``Wtp.process()`` and ``Wtp.reprocess()`` normally load the page\nautomatically. 
This function does not automatically call `Wtp.start_page()`.\n\nArguments are:\n* `title` - the title of the page to read\n* `namespace_id` - the namespace id number; this argument is required if\n  `title` doesn't have a namespace prefix such as `Template:`.\n\nThis returns the page contents as a string, or ``None`` if the page\ndoes not exist.\n\n```python\ndef parse(\n    self,\n    text: str,\n    pre_expand=False,\n    expand_all=False,\n    additional_expand=None,\n    do_not_pre_expand=None,\n    template_fn=None,\n    post_template_fn=None,\n) -\u003e WikiNode:\n```\n\nParses wikitext into a parse tree (``WikiNode``), optionally expanding\nsome or all of the templates and Lua macros in the wikitext (using the definitions\nfor the templates and macros in the cache files, as added by ``Wtp.process()``\nor calls to ``Wtp.add_page()``).\n\nThe ``Wtp.start_page()`` function must be called before this function\nto set the page title (which may be used by templates, Lua macros, and\nerror messages).  
The ``Wtp.process()`` and ``Wtp.reprocess()``\nfunctions will call it automatically.\n\nThis accepts the following arguments:\n* ``text`` (str) - the wikitext to be parsed\n* ``pre_expand`` (boolean) - if set to ``True``, the templates that were\n  heuristically detected as affecting parsing (e.g., expanding to table start\n  or end tags or list items) will be automatically expanded before parsing.\n  Any Lua macros those templates use may also be called.\n* ``expand_all`` - if set to ``True``, expands all templates and Lua\n  macros in the wikitext before parsing.\n* ``additional_expand`` (set or ``None``) - if this argument is provided, it\n  should be a set of template names that should be expanded in addition to\n  those specified by the other options (i.e., in addition to the\n  heuristically detected templates if ``pre_expand`` is ``True`` or just these\n  if it is ``False``; this option is meaningless if ``expand_all`` is set to\n  ``True``).\n\nThis returns the parse tree.  See below for documentation of the ``WikiNode``\nclass used for representing the parse tree.\n\n```python\ndef node_to_wikitext(self, node)\n```\n\nConverts a part of a parse tree back to wikitext.\n* ``node`` (``WikiNode``, str, list/tuple of these) - This is the part of the\n  parse tree that is to be converted back to wikitext.  We also allow\n  strings and lists, so that ``node.children`` can be used directly as\n  the argument.\n\n\n```python\ndef expand(self, text, template_fn=None, post_template_fn=None,\n           pre_expand=False, templates_to_expand=None,\n           expand_parserfns=True, expand_invoke=True)\n```\n\nExpands the selected templates, parser functions, and Lua macros in the\ngiven wikitext.  This can selectively expand some or all templates.  
This can\nalso capture the arguments and/or the expansion of any template as well as\nsubstitute custom expansions instead of the default expansions.\n\nThe ``Wtp.start_page()`` function must be called before this function to\nset the page title (which may be used by templates and Lua macros).  The\n``Wtp.process()`` and ``Wtp.reprocess()`` will call it automatically.  The\npage title is also used in error messages.\n\nThe arguments are as follows:\n* ``text`` (str) - the wikitext to be expanded\n* ``template_fn`` (function) - if set, this will be called as\n  ``template_fn(name, args)``, where ``name`` (str) is the name of the\n  template and ``args`` is a dictionary containing arguments to the\n  template.  Positional arguments (and named arguments with numeric\n  names) will have integer keys in the dictionary, whereas other named\n  arguments will have their names as keys.  All values corresponding\n  to arguments are strings (after they have been expanded).  This\n  function can return ``None`` to cause the template to be expanded in\n  the normal way, or a string that will be used instead of the\n  expansion of the template.  This can return ``\"\"`` (empty string) to\n  expand the template to nothing.  This can also capture the template name\n  and its arguments.\n* ``post_template_fn`` (function) - if set, this will be called\n  as ``post_template_fn(name, ht, expansion)`` after the template has\n  been expanded in the normal way.  
This can return ``None`` to use the\n  default expansion, or a string to use that string as the expansion.\n  This can also be used to capture the template, its arguments, and/or its\n  expansion.\n* ``pre_expand`` (boolean) - if set to ``True``, all templates that were\n  heuristically determined as needing to be expanded before parsing will be\n  expanded.\n* ``templates_to_expand`` (``None`` or set or dictionary) - if this is set,\n  these templates will be expanded in addition to any other templates that\n  have been specified to be expanded.  If a dictionary is provided, its keys\n  will be taken as the names of the templates to be expanded.  If this has not\n  been set or is ``None``, all templates will be expanded.\n* ``expand_parserfns`` (boolean) - Normally, wikitext parser functions will\n  be expanded.  This can be set to ``False`` to prevent parser function\n  expansion.\n* ``expand_invoke`` (boolean) - Normally, the ``#invoke`` parser function\n  (which calls a Lua module) will be expanded along with other parser\n  functions.  This can be set to ``False`` to prevent expansion of the\n  ``#invoke`` parser function.\n\n```python\ndef start_page(self, title)\n```\n\nThis function should be called before starting the processing of a new page\nor file.  This saves the page title (which is frequently accessed by\ntemplates, parser functions, and Lua macros).  The page title is also\nused in error messages.\n\nThe ``Wtp.process()`` and ``Wtp.reprocess()`` functions will automatically\ncall this before calling the page handler for each page.  This needs to be\ncalled manually when processing wikitext obtained from other sources.\n\nThe arguments are as follows:\n* ``title`` (str) - The page title.  For normal pages, there is usually no\n  prefix.  
Templates typically have ``Template:`` prefix and Lua modules\n  ``Module:`` prefix, and other prefixes are also used (e.g., ``Thesaurus:``).\n  This does not care about the form of the name, but some parser functions do.\n\n```python\ndef start_section(self, title)\n```\n\nSets the title of the current section on the page.  This is\nautomatically reset to ``None`` by ``Wtp.start_page()``.  The section\ntitle is only used in error, warning, and debug messages.\n\nThe arguments are:\n* ``title`` (str) - the title of the section, or ``None`` to clear it.\n\n\n```python\ndef start_subsection(self, title)\n```\n\nSets the title of the current subsection of the current section on the\npage.  This is automatically reset to ``None`` by ``Wtp.start_page()``\nand ``Wtp.start_section()``.  The subsection title is only used in error,\nwarning, and debug messages.\n\nThe arguments are:\n* ``title`` (str) - the title of the subsection, or ``None`` to clear it.\n\n```python\ndef add_page(self, title: str, namespace_id: int, body: Optional[str] = None,\n             redirect_to: Optional[str] = None, need_pre_expand: bool = False,\n             model: str = \"wikitext\") -\u003e None:\n```\n\nThis function is used to add pages, templates, and modules for\nprocessing.  
There is usually no need to use this if ``Wtp.process()``\nis used; however, this can be used to add templates and pages for\ntesting or other special processing needs.\n\nThe arguments are:\n* `title` - the title of the page to be added (normal pages typically\n  have no prefix in the title, templates begin with `Template:`, and Lua\n  modules begin with `Module:`)\n* `namespace_id` - namespace id\n* `body` - the content of the page, template, or module\n* `redirect_to` - title of redirect page\n* `need_pre_expand` - set to `True` if the page is a template that needs to\n  be expanded before parsing.\n* `model` - the model value for the page (usually `wikitext`\n  for normal pages and templates and `Scribunto` for Lua modules)\n\nThe ``Wtp.analyze_templates()`` function needs to be called after\ncalling ``Wtp.add_page()`` before pages can be expanded or parsed (it should\npreferably only be called once after adding all pages and templates).\n\n```python\ndef analyze_templates(self)\n```\n\nAnalyzes the template definitions in the cache file and determines which\nof them should be pre-expanded before parsing because they affect the\ndocument structure significantly.  Some templates in, e.g., Wiktionary\nexpand to table start tags, table end tags, or list items, and parsing\nresults are generally much better if they are expanded before parsing.\nThe actual expansion only happens if ``pre_expand`` or some other argument\nto ``Wtp.expand()`` or ``Wtp.parse()`` tells them to do so.\n\nThe analysis is heuristic and is not guaranteed to find every such template.\nIn particular, it cannot detect templates that call Lua modules that output\nwikitext control structures (there are several templates in Wiktionary that\ncall Lua code that outputs list items, for example).  
Such templates may need\nto be identified manually and specified as additional templates to expand.\nLuckily, there seem to be relatively few such templates, at least in\nWiktionary.\n\nThis function is automatically called by ``Wtp.process()`` at the end of\nphase 1.  An explicit call is only necessary if ``Wtp.add_page()`` has been\nused by the application.\n\n### Error handling\n\nVarious functions in this module, including ``Wtp.parse()`` and\n``Wtp.expand()``, may generate errors and warnings.  Those will be displayed\non ``stdout`` as well as collected in ``Wtp.errors``, ``Wtp.warnings``, and\n``Wtp.debugs``.  These fields will contain lists of dictionaries, where\neach dictionary describes an error/warning/debug message.  The dictionary can\nhave the following keys (not all of them are always present):\n* ``msg`` (str) - the error message\n* ``trace`` (str or ``None``) - optional stack trace where the error occurred\n* ``title`` (str) - the page title on which the error occurred\n* ``section`` (str or ``None``) - the section where the error occurred\n* ``subsection`` (str or ``None``) - the subsection where the error occurred\n* ``path`` (tuple of str) - a path of title, template names, parser function\n  names, or Lua module/function names, giving information about where the\n  error occurred during expansion or parsing.\n\nThe fields containing the error messages will be cleared by every call\nto ``Wtp.start_page()`` (including the implicit calls during\n``Wtp.process()`` and ``Wtp.reprocess()``).  Thus, the\n``page_handler`` function often returns these lists together with any\ninformation extracted from the page, and they can be collected\ntogether from the values returned by the iterators returned by these\nfunctions.  The ``Wtp.to_return()`` function may be useful for this.\n\nThe following functions can be used for reporting errors.  
These can\nalso be called by application code from within the ``page_handler``\nfunction as well as ``template_fn`` and ``post_template_fn`` functions\nto report errors, warnings, and debug messages in a uniform way.\n\n```python\ndef error(self, msg, trace=None)\n```\n\nReports an error message.  The error will be added to ``Wtp.errors`` list and\nprinted to stdout.  The arguments are:\n* msg (str) - the error message (need not include page title or section)\n* trace (str or ``None``) - an optional stack trace giving more information\n  about where the error occurred\n\n```python\ndef warning(self, msg, trace=None)\n```\n\nReports a warning message.  The warning will be added to ``Wtp.warnings`` list\nand printed to stdout.  The arguments are the same as for ``Wtp.error()``.\n\n```python\ndef debug(self, msg, trace=None)\n```\n\nReports a debug message.  The message will be added to ``Wtp.debugs`` list\nand printed to stdout.  The arguments are the same as for ``Wtp.error()``.\n\n```python\ndef to_return(self)\n```\n\nProduces a dictionary containing the error, warning, and debug\nmessages from ``Wtp``.  This would typically be called at the end of a\n``page_handler`` function and the value returned along with whatever\ndata was extracted from that page.  The error lists are reset by\n``Wtp.start_page()`` (including the implicit calls from\n``Wtp.process()`` and ``Wtp.reprocess()``), so they should be saved\n(e.g., by this call) for each page.  (Given the parallelism in\nthe processing of the pages, they cannot just be accumulated in the\nsubprocesses.)\n\nThe returned dictionary contains the following keys:\n* ``errors`` - a list of dictionaries describing any error messages\n* ``warnings`` - a list of dictionaries describing any warning messages\n* ``debugs`` - a list of dictionaries describing any debug messages.\n\n### class WikiNode\n\nThe ``WikiNode`` class represents a parse tree node and is returned by\n``Wtp.parse()``.  
This object can be printed or converted to a string\nand will display a human-readable format that is suitable for\ndebugging purposes (at least for small parse trees).\n\nThe ``WikiNode`` objects have the following fields:\n* ``kind`` (NodeKind, see below) - The type of the node.  This determines\n  how to interpret the other fields.\n* ``children`` (list) - Contents of the node.  This is generally used when\n  the node has arbitrary size content, such as subsections, list items/sublists,\n  other HTML tags, etc.\n* ``args`` (list or str, depending on ``kind``) - Direct arguments to the\n  node.  This is used, for example, for templates, template arguments, parser\n  function arguments, and link arguments, in which case this is a list.\n  For some node types (e.g., list, list item, and HTML tag), this is\n  directly a string.\n* ``attrs`` - A dictionary containing HTML attributes or a definition list\n  definition (under the ``def`` key).\n\n### class NodeKind(enum.Enum)\n\nThe ``NodeKind`` type is an enumerated value for parse tree (``WikiNode``)\nnode types.  Currently the following values are used (typically these\nneed to be prefixed by ``NodeKind.``, e.g., ``NodeKind.LEVEL2``):\n* ``ROOT`` - The root node of the parse tree.\n* ``LEVEL2`` - Level 2 subtitle (==).  The ``args`` field contains the title\n  and ``children`` field contains any contents that are within this section\n* ``LEVEL3`` - Level 3 subtitle (===)\n* ``LEVEL4`` - Level 4 subtitle (====)\n* ``LEVEL5`` - Level 5 subtitle (=====)\n* ``LEVEL6`` - Level 6 subtitle (======)\n* ``ITALIC`` - Italic, content is in ``children``\n* ``BOLD`` - Bold, content is in ``children``\n* ``HLINE`` - A horizontal line (no arguments or children)\n* ``LIST`` - Indicates a list.  Each list and sublist will start with\n  this kind of node.  ``args`` will contain the prefix used to open the\n  list (e.g., ``\"##\"`` - note this is stored directly as a string\n  in ``args``).  
List items will be stored in ``children``.\n* ``LIST_ITEM`` - A list item in the children of a ``LIST`` node.  ``args``\n  is the prefix used to open the list item (same as for the ``LIST`` node).\n  The contents of the list item (including any possible sublists) are in\n  ``children``.  If the list is a definition list (i.e., the prefix ends\n  in ``\";\"``), then ``children`` contains the item label to be defined\n  and ``definition`` contains the definition.\n* ``PREFORMATTED`` - Preformatted text where markup is interpreted.  Content\n  is in ``children``.  This is used for lines starting with a space in\n  wikitext.\n* ``PRE`` - Preformatted text where markup is not interpreted.  Content\n  is in ``children``.  This is indicated in wikitext by\n  \u0026lt;pre\u0026gt;...\u0026lt;/pre\u0026gt;.\n* ``LINK`` - An internal wikimedia link ([[...]] in wikitext).  The link\n  arguments are in ``args``.  This tag is also used for media inclusion.\n  Links with a trailing word ending immediately after the link have the trailing\n  part in ``children``.\n* ``TEMPLATE`` - A template call (transclusion).  Template name is in the\n  first argument and template arguments in subsequent arguments in ``args``.\n  The ``children`` field is not used.  In wikitext templates are marked up\n  as {{name|arg1|arg2|...}}.\n* ``TEMPLATE_ARG`` - A template argument.  The argument name is in the first\n  item in ``args`` followed by any subsequent arguments (normally at most two\n  items, but I've seen arguments with more - probably an error in those\n  template definitions).  The ``children`` field is not used.  In wikitext\n  template arguments are marked up as {{{name|defval}}}.\n* ``PARSER_FN`` - A parser function invocation.  This is also used for built-in\n  variables such as {{PAGENAME}}.  The parser function name is in the\n  first element of ``args`` and parser function arguments in subsequent\n  elements.\n* ``URL`` - An external URL. The first argument is the URL.  
The optional second\n  argument (in ``args``) is the display text.  The ``children``\n  field is not used.\n* ``TABLE`` - A table.  Content is in ``children``.  In wikitext, a table\n  is encoded as {| ... |}.\n* ``TABLE_CAPTION`` - A table caption.  This can only occur under\n  ``TABLE``.  The content is in ``children``.  The ``attrs`` field contains\n  a dictionary of any HTML attributes given to the table.\n* ``TABLE_ROW`` - A table row.  This can only occur under ``TABLE``.  The\n  content is in ``children`` (normally the content would be ``TABLE_CELL``\n  or ``TABLE_HEADER_CELL`` nodes).  The ``attrs`` field contains a dictionary\n  of any HTML attributes given to the table row.\n* ``TABLE_HEADER_CELL`` - A table header cell.  This can only occur under\n  ``TABLE_ROW``.  Content is in ``children``.  The ``attrs`` field contains\n  a dictionary of any HTML attributes given to the table row.\n* ``TABLE_CELL`` - A table cell.  This can only occur under ``TABLE_ROW``.\n  Content is in ``children``.  The ``attrs`` field contains a dictionary\n  of any HTML attributes given to the table row.\n* ``MAGIC_WORD`` - A MediaWiki magic word.  The magic word is assigned\n  directly to ``args`` as a string (i.e., not in a list).  ``children`` is\n  not used.  An example of a magic word would be ``__NOTOC__``.\n* ``HTML`` - An HTML tag (or a matched pair of HTML tags).  ``args`` is the\n  name of the HTML tag directly (not in a list and always without a slash).\n  ``attrs`` is set to a dictionary of any HTML attributes from the tag.\n  The contents of the HTML tag are in ``children``.\n\n## Expected performance\n\nThis can generally process a few Wiktionary pages per second per processor\ncore, including expansion of all templates, Lua macros, parsing the\nfull page, and analyzing the parse.  
On a multi-core machine, this can\ngenerally process a few dozen to a few hundred pages per second,\ndepending on the speed and number of cores.\n\nMost of the processing effort goes to expanding Lua macros.  You can\nelect not to expand Lua macros, but they are used extensively in\nWiktionary and for important information.  Expanding templates and Lua\nmacros allows much more robust and complete data extraction, but does\nnot come cheap.\n\n## Contributing and bug reports\n\nPlease create an issue on GitHub to report bugs or to contribute!\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftatuylonen%2Fwikitextprocessor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftatuylonen%2Fwikitextprocessor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftatuylonen%2Fwikitextprocessor/lists"}