{"id":13817584,"url":"https://github.com/5j9/wikitextparser","last_synced_at":"2025-05-15T03:07:18.159Z","repository":{"id":28635963,"uuid":"32154877","full_name":"5j9/wikitextparser","owner":"5j9","description":"A Python library to parse MediaWiki WikiText","archived":false,"fork":false,"pushed_at":"2025-05-14T01:33:47.000Z","size":1857,"stargazers_count":309,"open_issues_count":1,"forks_count":23,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-05-14T03:14:01.862Z","etag":null,"topics":["mediawiki","parsing","python","text-analysis"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/5j9.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGELOG.rst","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-03-13T12:34:51.000Z","updated_at":"2025-05-14T01:33:51.000Z","dependencies_parsed_at":"2024-10-19T15:21:06.279Z","dependency_job_id":null,"html_url":"https://github.com/5j9/wikitextparser","commit_stats":{"total_commits":1555,"total_committers":11,"mean_commits":"141.36363636363637","dds":0.05144694533762062,"last_synced_commit":"40f7f884dbe568dabe7c102afbbc9bb046272391"},"previous_names":[],"tags_count":98,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/5j9%2Fwikitextparser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/5j9%2Fwikitextparser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/5j9%2Fwikitextparser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/5j9%2Fwikitextparser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/5j9","download_url":"https://codeload.github.com/5j9/wikitextparser/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254103510,"owners_count":22015280,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["mediawiki","parsing","python","text-analysis"],"created_at":"2024-08-04T06:00:51.389Z","updated_at":"2025-05-15T03:07:13.145Z","avatar_url":"https://github.com/5j9.png","language":"Python","readme":".. image:: https://github.com/5j9/wikitextparser/actions/workflows/tests.yml/badge.svg\n    :target: https://github.com/5j9/wikitextparser/actions/workflows/tests.yml\n.. image:: https://codecov.io/github/5j9/wikitextparser/coverage.svg?branch=master\n    :target: https://codecov.io/github/5j9/wikitextparser\n.. image:: https://readthedocs.org/projects/wikitextparser/badge/?version=latest\n    :target: http://wikitextparser.readthedocs.io/en/latest/?badge=latest\n\n==============\nWikiTextParser\n==============\n.. Quick Start Guide\n\nA simple-to-use WikiText parsing library for `MediaWiki \u003chttps://www.mediawiki.org/wiki/MediaWiki\u003e`_.\n\nThe purpose is to allow users to easily extract and/or manipulate templates, template parameters, parser functions, tables, external links, wikilinks, lists, etc. found in wikitexts.\n\n.. contents:: Table of Contents\n\nInstallation\n============\n\n- Python 3.8+ is required\n- ``pip install wikitextparser``\n\nUsage\n=====\n\n.. 
code:: python\n\n    \u003e\u003e\u003e import wikitextparser as wtp\n\nWikiTextParser can detect sections, parser functions, templates, wiki links, external links, arguments, tables, wiki lists, and comments in your wikitext. The following sections are a quick overview of some of these functionalities.\n\nYou may also want to have a look at the test modules for more examples and probable pitfalls (expected failures).\n\nTemplates\n---------\n\n.. code:: python\n\n    \u003e\u003e\u003e parsed = wtp.parse(\"{{text|value1{{text|value2}}}}\")\n    \u003e\u003e\u003e parsed.templates\n    [Template('{{text|value1{{text|value2}}}}'), Template('{{text|value2}}')]\n    \u003e\u003e\u003e parsed.templates[0].arguments\n    [Argument(\"|value1{{text|value2}}\")]\n    \u003e\u003e\u003e parsed.templates[0].arguments[0].value = 'value3'\n    \u003e\u003e\u003e print(parsed)\n    {{text|value3}}\n\nThe ``pformat`` method returns a pretty-print formatted string for templates:\n\n.. code:: python\n\n    \u003e\u003e\u003e parsed = wtp.parse('{{t1 |b=b|c=c| d={{t2|e=e|f=f}} }}')\n    \u003e\u003e\u003e t1, t2 = parsed.templates\n    \u003e\u003e\u003e print(t2.pformat())\n    {{t2\n        | e = e\n        | f = f\n    }}\n    \u003e\u003e\u003e print(t1.pformat())\n    {{t1\n        | b = b\n        | c = c\n        | d = {{t2\n            | e = e\n            | f = f\n        }}\n    }}\n\nThe ``Template.rm_dup_args_safe`` and ``Template.rm_first_of_dup_args`` methods can be used to clean up `pages using duplicate arguments in template calls \u003chttps://en.wikipedia.org/wiki/Category:Pages_using_duplicate_arguments_in_template_calls\u003e`_:\n\n.. 
code:: python\n\n    \u003e\u003e\u003e t = wtp.Template('{{t|a=a|a=b|a=a}}')\n    \u003e\u003e\u003e t.rm_dup_args_safe()\n    \u003e\u003e\u003e t\n    Template('{{t|a=b|a=a}}')\n    \u003e\u003e\u003e t = wtp.Template('{{t|a=a|a=b|a=a}}')\n    \u003e\u003e\u003e t.rm_first_of_dup_args()\n    \u003e\u003e\u003e t\n    Template('{{t|a=a}}')\n\nTemplate parameters:\n\n.. code:: python\n\n    \u003e\u003e\u003e param = wtp.parse('{{{a|b}}}').parameters[0]\n    \u003e\u003e\u003e param.name\n    'a'\n    \u003e\u003e\u003e param.default\n    'b'\n    \u003e\u003e\u003e param.default = 'c'\n    \u003e\u003e\u003e param\n    Parameter('{{{a|c}}}')\n    \u003e\u003e\u003e param.append_default('d')\n    \u003e\u003e\u003e param\n    Parameter('{{{a|{{{d|c}}}}}}')\n\n\nWikiLinks\n---------\n\n.. code:: python\n\n    \u003e\u003e\u003e wl = wtp.parse('... [[title#fragment|text]] ...').wikilinks[0]\n    \u003e\u003e\u003e wl.title = 'new_title'\n    \u003e\u003e\u003e wl.fragment = 'new_fragment'\n    \u003e\u003e\u003e wl.text = 'X'\n    \u003e\u003e\u003e wl\n    WikiLink('[[new_title#new_fragment|X]]')\n    \u003e\u003e\u003e del wl.text\n    \u003e\u003e\u003e wl\n    WikiLink('[[new_title#new_fragment]]')\n\nAll WikiLink properties support get, set, and delete operations.\n\nSections\n--------\n\n.. code:: python\n\n    \u003e\u003e\u003e parsed = wtp.parse(\"\"\"\n    ... == h2 ==\n    ... t2\n    ... === h3 ===\n    ... t3\n    ... === h3 ===\n    ... t3\n    ... == h22 ==\n    ... t22\n    ... {{text|value3}}\n    ... [[Z|X]]\n    ... 
\"\"\")\n    \u003e\u003e\u003e parsed.sections\n    [Section('\\n'),\n     Section('== h2 ==\\nt2\\n=== h3 ===\\nt3\\n=== h3 ===\\nt3\\n'),\n     Section('=== h3 ===\\nt3\\n'),\n     Section('=== h3 ===\\nt3\\n'),\n     Section('== h22 ==\\nt22\\n{{text|value3}}\\n[[Z|X]]\\n')]\n    \u003e\u003e\u003e parsed.sections[1].title = 'newtitle'\n    \u003e\u003e\u003e print(parsed)\n\n    ==newtitle==\n    t2\n    === h3 ===\n    t3\n    === h3 ===\n    t3\n    == h22 ==\n    t22\n    {{text|value3}}\n    [[Z|X]]\n    \u003e\u003e\u003e del parsed.sections[1].title\n    \u003e\u003e\u003e\u003e print(parsed)\n\n    t2\n    === h3 ===\n    t3\n    === h3 ===\n    t3\n    == h22 ==\n    t22\n    {{text|value3}}\n    [[Z|X]]\n\nTables\n------\n\nExtracting cell values of a table:\n\n.. code:: python\n\n    \u003e\u003e\u003e p = wtp.parse(\"\"\"{|\n    ... |  Orange    ||   Apple   ||   more\n    ... |-\n    ... |   Bread    ||   Pie     ||   more\n    ... |-\n    ... |   Butter   || Ice cream ||  and more\n    ... |}\"\"\")\n    \u003e\u003e\u003e p.tables[0].data()\n    [['Orange', 'Apple', 'more'],\n     ['Bread', 'Pie', 'more'],\n     ['Butter', 'Ice cream', 'and more']]\n\nBy default, values are arranged according to ``colspan`` and ``rowspan`` attributes:\n\n.. code:: python\n\n    \u003e\u003e\u003e t = wtp.Table(\"\"\"{| class=\"wikitable sortable\"\n    ... |-\n    ... ! a !! b !! c\n    ... |-\n    ... !colspan = \"2\" | d || e\n    ... |-\n    ... |}\"\"\")\n    \u003e\u003e\u003e t.data()\n    [['a', 'b', 'c'], ['d', 'd', 'e']]\n    \u003e\u003e\u003e t.data(span=False)\n    [['a', 'b', 'c'], ['d', 'e']]\n\nCalling the ``cells`` method of a ``Table`` returns table cells as ``Cell`` objects. Cell objects provide methods for getting or setting each cell's attributes or values individually:\n\n.. 
code:: python\n\n    \u003e\u003e\u003e cell = t.cells(row=1, column=1)\n    \u003e\u003e\u003e cell.attrs\n    {'colspan': '2'}\n    \u003e\u003e\u003e cell.set('colspan', '3')\n    \u003e\u003e\u003e print(t)\n    {| class=\"wikitable sortable\"\n    |-\n    ! a !! b !! c\n    |-\n    !colspan = \"3\" | d || e\n    |-\n    |}\n\nHTML attributes of Table, Cell, and Tag objects are accessible via\nthe ``get_attr``, ``set_attr``, ``has_attr``, and ``del_attr`` methods.\n\nLists\n-----\n\nThe ``get_lists`` method provides access to lists within the wikitext.\n\n.. code:: python\n\n    \u003e\u003e\u003e parsed = wtp.parse(\n    ...     'text\\n'\n    ...     '* list item a\\n'\n    ...     '* list item b\\n'\n    ...     '** sub-list of b\\n'\n    ...     '* list item c\\n'\n    ...     '** sub-list of b\\n'\n    ...     'text'\n    ... )\n    \u003e\u003e\u003e wikilist = parsed.get_lists()[0]\n    \u003e\u003e\u003e wikilist.items\n    [' list item a', ' list item b', ' list item c']\n\nThe ``sublists`` method can be used to get all sub-lists of the current list or just the sub-lists of specific items:\n\n.. code:: python\n\n    \u003e\u003e\u003e wikilist.sublists()\n    [WikiList('** sub-list of b\\n'), WikiList('** sub-list of b\\n')]\n    \u003e\u003e\u003e wikilist.sublists(1)[0].items\n    [' sub-list of b']\n\nIt also has an optional ``pattern`` argument that works similarly to ``lists``, except that the current list pattern will be automatically added to it as a prefix:\n\n.. code:: python\n\n    \u003e\u003e\u003e wikilist = wtp.WikiList('#a\\n#b\\n##ba\\n#*bb\\n#:bc\\n#c', '\\#')\n    \u003e\u003e\u003e wikilist.sublists()\n    [WikiList('##ba\\n'), WikiList('#*bb\\n'), WikiList('#:bc\\n')]\n    \u003e\u003e\u003e wikilist.sublists(pattern='\\*')\n    [WikiList('#*bb\\n')]\n\n\nConvert one type of list to another using the ``convert`` method. Specifying the starting pattern of the desired lists can facilitate finding them and improve performance:\n\n.. 
code:: python\n\n        \u003e\u003e\u003e wl = wtp.WikiList(\n        ...     ':*A1\\n:*#B1\\n:*#B2\\n:*:continuing A1\\n:*A2',\n        ...     pattern=':\\*'\n        ... )\n        \u003e\u003e\u003e print(wl)\n        :*A1\n        :*#B1\n        :*#B2\n        :*:continuing A1\n        :*A2\n        \u003e\u003e\u003e wl.convert('#')\n        \u003e\u003e\u003e print(wl)\n        #A1\n        ##B1\n        ##B2\n        #:continuing A1\n        #A2\n\nTags\n----\n\nAccessing HTML tags:\n\n.. code:: python\n\n        \u003e\u003e\u003e p = wtp.parse('text\u003cref name=\"c\"\u003ecitation\u003c/ref\u003e\\n\u003creferences/\u003e')\n        \u003e\u003e\u003e ref, references = p.get_tags()\n        \u003e\u003e\u003e ref.name = 'X'\n        \u003e\u003e\u003e ref\n        Tag('\u003cX name=\"c\"\u003ecitation\u003c/X\u003e')\n        \u003e\u003e\u003e references\n        Tag('\u003creferences/\u003e')\n\nWikiTextParser is able to handle common usages of HTML and extension tags. However, it is not a full-fledged HTML parser and may fail on edge cases or malformed HTML input. Please open an issue on GitHub if you encounter bugs.\n\nMiscellaneous\n-------------\nThe ``parent`` and ``ancestors`` methods can be used to access a node's parent or ancestors respectively:\n\n.. code:: python\n\n    \u003e\u003e\u003e template_d = wtp.parse(\"{{a|{{b|{{c|{{d}}}}}}}}\").templates[3]\n    \u003e\u003e\u003e template_d.ancestors()\n    [Template('{{c|{{d}}}}'),\n     Template('{{b|{{c|{{d}}}}}}'),\n     Template('{{a|{{b|{{c|{{d}}}}}}}}')]\n    \u003e\u003e\u003e template_d.parent()\n    Template('{{c|{{d}}}}')\n    \u003e\u003e\u003e _.parent()\n    Template('{{b|{{c|{{d}}}}}}')\n    \u003e\u003e\u003e _.parent()\n    Template('{{a|{{b|{{c|{{d}}}}}}}}')\n    \u003e\u003e\u003e _.parent()  # Returns None\n\nUse the optional ``type_`` argument if looking for ancestors of a specific type:\n\n.. 
code:: python\n\n    \u003e\u003e\u003e parsed = wtp.parse('{{a|{{#if:{{b{{c\u003c!----\u003e}}}}}}}}')\n    \u003e\u003e\u003e comment = parsed.comments[0]\n    \u003e\u003e\u003e comment.ancestors(type_='ParserFunction')\n    [ParserFunction('{{#if:{{b{{c\u003c!----\u003e}}}}}}')]\n\n\nTo delete/remove an object from its parent, use ``del object[:]`` or ``del object.string``.\n\nThe ``remove_markup`` function or ``plain_text`` method can be used to remove wiki markup:\n\n.. code:: python\n\n    \u003e\u003e\u003e from wikitextparser import remove_markup, parse\n    \u003e\u003e\u003e s = \"'''a'''\u003c!--comment--\u003e [[b|c]] [[d]]\"\n    \u003e\u003e\u003e remove_markup(s)\n    'a c d'\n    \u003e\u003e\u003e parse(s).plain_text()\n    'a c d'\n\nCompared with mwparserfromhell\n==============================\n\n`mwparserfromhell \u003chttps://github.com/earwig/mwparserfromhell\u003e`_ is a mature and widely used library with nearly the same purposes as ``wikitextparser``. The main reason leading me to create ``wikitextparser`` was that ``mwparserfromhell`` could not parse wikitext in certain situations that I needed it for. See mwparserfromhell's issues `40 \u003chttps://github.com/earwig/mwparserfromhell/issues/40\u003e`_, `42 \u003chttps://github.com/earwig/mwparserfromhell/issues/42\u003e`_, `88 \u003chttps://github.com/earwig/mwparserfromhell/issues/88\u003e`_, and other related issues. In many of those situations ``wikitextparser`` may be able to give you more acceptable results.\n\nAlso note that ``wikitextparser`` is still using a 0.x.y version, `meaning \u003chttps://semver.org/\u003e`_ that the API is not stable and may change in future versions.\n\nThe tokenizer in ``mwparserfromhell`` is written in C. Tokenization in ``wikitextparser`` is mostly done using the ``regex`` library, which is also implemented in C.\nI have not rigorously compared the two libraries in terms of performance, i.e. execution time and memory usage. 
In my limited experience, ``wikitextparser`` has decent performance in realistic cases; it should be able to compete with ``mwparserfromhell`` and may even offer slight performance benefits in some situations.\n\nIf you have had a chance to compare these libraries in terms of performance or capabilities, please share your experience by opening an issue on GitHub.\n\nSome of the unique features of ``wikitextparser`` are: providing access to individual cells of each table, pretty-printing templates, a WikiList class with rudimentary methods to work with `lists \u003chttps://www.mediawiki.org/wiki/Help:Lists\u003e`_, and a few other functions.\n\nKnown issues and limitations\n============================\n\n* The contents of templates/parameters are not known to offline parsers. For example, an offline parser cannot know whether the markup ``[[{{z|a}}]]`` should be treated as a wikilink or not; it depends on the inner workings of the ``{{z}}`` template. In these situations ``wikitextparser`` makes a best guess. ``[[{{z|a}}]]`` is treated as a wikilink (why else would anyone call a template inside wikilink markup, and even if it is not a wikilink, usually no harm is done).\n* Localized namespace names are unknown, so for example ``[[File:...]]`` links are treated as normal wikilinks. ``mwparserfromhell`` has a similar issue; see `#87 \u003chttps://github.com/earwig/mwparserfromhell/issues/87\u003e`_ and `#136 \u003chttps://github.com/earwig/mwparserfromhell/issues/136\u003e`_. As a workaround, `Pywikibot \u003chttps://www.mediawiki.org/wiki/Manual:Pywikibot\u003e`_ can be used for determining the namespace.\n* `Linktrails \u003chttps://www.mediawiki.org/wiki/Help:Links\u003e`_ are language-dependent and are not supported. `Also not supported by mwparserfromhell \u003chttps://github.com/earwig/mwparserfromhell/issues/82\u003e`_. 
However, given the trail pattern and knowing that ``wikilink.span[1]`` is the ending position of a wikilink, it is possible to compute a WikiLink's linktrail.\n* Templates adjacent to external links are never considered part of the link. In reality, this depends on the contents of the template. Example: ``parse('http://example.com{{dead link}}').external_links[0].url == 'http://example.com'``\n* The list of valid `extension tags \u003chttps://www.mediawiki.org/wiki/Parser_extension_tags\u003e`_ depends on the extensions installed on the wiki. The ``tags`` method currently only supports the ones on English Wikipedia. A configuration option might be added in the future to address this issue.\n* ``wikitextparser`` currently does not provide an `ast.walk \u003chttps://docs.python.org/3/library/ast.html#ast.walk\u003e`_-like method yielding all descendant nodes.\n* `Parser functions \u003chttps://www.mediawiki.org/wiki/Help:Extension:ParserFunctions\u003e`_ and `magic words \u003chttps://www.mediawiki.org/wiki/Help:Magic_words\u003e`_ are not evaluated.\n\n\nCredits\n=======\n* `python \u003chttps://www.python.org/\u003e`_\n* `regex \u003chttps://github.com/mrabarnett/mrab-regex\u003e`_\n* `wcwidth \u003chttps://github.com/jquast/wcwidth\u003e`_\n","funding_links":[],"categories":["پایتون Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F5j9%2Fwikitextparser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F5j9%2Fwikitextparser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F5j9%2Fwikitextparser/lists"}