{"id":16168564,"url":"https://github.com/thatxliner/unmarkd","last_synced_at":"2025-03-18T23:30:59.760Z","repository":{"id":37031757,"uuid":"340810179","full_name":"ThatXliner/unmarkd","owner":"ThatXliner","description":"An extremely configurable markdown reverser for Python3.","archived":false,"fork":false,"pushed_at":"2024-02-15T18:08:51.000Z","size":2278,"stargazers_count":15,"open_issues_count":4,"forks_count":5,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-14T00:04:28.632Z","etag":null,"topics":["beautifulsoup","flexible","html","html2text","markdown","markdown-reverser","parser","python","python3","reverse-engineering","reverse-markdown","reverser"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/unmarkd/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ThatXliner.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-02-21T03:39:30.000Z","updated_at":"2024-11-09T01:22:32.000Z","dependencies_parsed_at":"2024-02-12T16:07:10.862Z","dependency_job_id":"4b47e027-ab33-44c3-9139-4a80c9105e4a","html_url":"https://github.com/ThatXliner/unmarkd","commit_stats":{"total_commits":93,"total_committers":3,"mean_commits":31.0,"dds":"0.032258064516129004","last_synced_commit":"53809de73146825681e5235eda9aa7b1429addb0"},"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThatXliner%2Funmarkd","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Th
atXliner%2Funmarkd/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThatXliner%2Funmarkd/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ThatXliner%2Funmarkd/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ThatXliner","download_url":"https://codeload.github.com/ThatXliner/unmarkd/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243955722,"owners_count":20374373,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup","flexible","html","html2text","markdown","markdown-reverser","parser","python","python3","reverse-engineering","reverse-markdown","reverser"],"created_at":"2024-10-10T03:12:14.207Z","updated_at":"2025-03-18T23:30:59.314Z","avatar_url":"https://github.com/ThatXliner.png","language":"Python","readme":"**NOTE: This project is _maintained._** While it may seem inactive, it is because there is nothing to add. 
If you want an enhancement or want to file a bug report, please go to the [issues](https://github.com/ThatXliner/unmarkd/issues).\n\n# 🔄 Unmarkd\n\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/charliermarsh/ruff/main/assets/badge/v1.json)](https://github.com/charliermarsh/ruff)\n[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat\u0026labelColor=ef8336)](https://pycqa.github.io/isort/)\n[![Checked with mypy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/)[![codecov](https://codecov.io/gh/ThatXliner/unmarkd/branch/master/graph/badge.svg?token=PWVIERHTG3)](https://codecov.io/gh/ThatXliner/unmarkd) [![CI](https://github.com/ThatXliner/unmarkd/actions/workflows/ci.yml/badge.svg)](https://github.com/ThatXliner/unmarkd/actions/workflows/ci.yml) [![PyPI - Downloads](https://img.shields.io/pypi/dm/unmarkd)](https://pypi.org/project/unmarkd/)\n\n\u003e A markdown reverser.\n\n---\n\nUnmarkd is a [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)-powered [Markdown](https://en.wikipedia.org/wiki/Markdown) reverser written in Python and for Python.\n\n## Why\n\nThis was created as a dependency of [StackSearch](http://github.com/ThatXliner/stacksearch) (one of my other projects). In order to create a better API, I needed a way to convert HTML back into Markdown. So I created this.\n\nThere are [similar projects](https://github.com/xijo/reverse_markdown) (written in Ruby) ~~but I have not found any written in Python (or for Python)~~ later I found a popular library, [html2text](https://github.com/Alir3z4/html2text).\n\n## Installation\n\nYou know the drill:\n\n```bash\npip install unmarkd\n```\n\n## Comparison\n\n**TL;DR: Html2Text is fast. 
If you don't need much configuration, you could use Html2Text for the small speed increase.**\n\n\u003cdetails\u003e\n\n\u003csummary\u003eClick to expand\u003c/summary\u003e\n\n### Speed\n\n**TL;DR: Unmarkd \u003c Html2Text**\n\nHtml2Text is basically faster:\n\n![Benchmark](./assets/benchmark.png)\n\n(The `DOC` variable used can be found [here](./assets/benchmark.html))\n\nUnmarkd sacrifices speed for [power](#configurability).\n\nHtml2Text directly uses Python's [`html.parser`](https://docs.python.org/3/library/html.parser.html) module (in the standard library). On the other hand, Unmarkd uses the powerful HTML parsing library, `beautifulsoup4`. BeautifulSoup can be configured to use different HTML parsers. In Unmarkd, we configure it to use Python's `html.parser`, too.\n\nBut another layer of code means more code is run.\n\nI hope that's a good explanation of the speed difference.\n\n### Correctness\n\n**TL;DR: Unmarkd == Html2Text**\n\nI actually found _two_ html-to-markdown libraries. One of them was [Tomd](https://github.com/gaojiuli/tomd), which had an _incorrect implementation_:\n\n![Actual results](./assets/tomd_cant_handle.png)\n\nIt seems to be abandoned, anyway.\n\nNow with Html2Text and Unmarkd:\n\n![Epic showdown](./assets/correct.png)\n\nIn other words, they _work_.\n\n### Configurability\n\n**TL;DR: Unmarkd \u003e Html2Text**\n\nThis is Unmarkd's strong point.\n\nIn Html2Text, you only have a limited [set of options](https://github.com/Alir3z4/html2text/blob/master/docs/usage.md#available-options).\n\nIn Unmarkd, you can subclass the `BaseUnmarker` and implement conversions for new tags (e.g. `\u003cq\u003e`), etc. In my opinion, it's much easier to extend and configure Unmarkd.\n\nUnmarkd was originally written as a StackSearch dependency.\n\nHtml2Text has no options for configuring parsing of code blocks. 
Unmarkd does\n\n\u003c/details\u003e\n\n## Documentation\n\nHere's an example of basic usage\n\n```python\nimport unmarkd\nprint(unmarkd.unmark(\"\u003cb\u003eI \u003ci\u003elove\u003c/i\u003e markdown!\u003c/b\u003e\"))\n# Output: **I *love* markdown!**\n```\n\nor something more complex (shamelessly taken from [here](https://markdowntohtml.com)):\n\n```python\nimport unmarkd\nhtml_doc = R\"\"\"\u003ch1 id=\"sample-markdown\"\u003eSample Markdown\u003c/h1\u003e\n\u003cp\u003eThis is some basic, sample markdown.\u003c/p\u003e\n\u003ch2 id=\"second-heading\"\u003eSecond Heading\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003eUnordered lists, and:\u003col\u003e\n\u003cli\u003eOne\u003c/li\u003e\n\u003cli\u003eTwo\u003c/li\u003e\n\u003cli\u003eThree\u003c/li\u003e\n\u003c/ol\u003e\n\u003c/li\u003e\n\u003cli\u003eMore\u003c/li\u003e\n\u003c/ul\u003e\n\u003cblockquote\u003e\n\u003cp\u003eBlockquote\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003eAnd \u003cstrong\u003ebold\u003c/strong\u003e, \u003cem\u003eitalics\u003c/em\u003e, and even \u003cem\u003eitalics and later \u003cstrong\u003ebold\u003c/strong\u003e\u003c/em\u003e. Even \u003cdel\u003estrikethrough\u003c/del\u003e. 
\u003ca href=\"https://markdowntohtml.com\"\u003eA link\u003c/a\u003e to somewhere.\u003c/p\u003e\n\u003cp\u003eAnd code highlighting:\u003c/p\u003e\n\u003cpre\u003e\u003ccode class=\"lang-js\"\u003e\u003cspan class=\"hljs-keyword\"\u003evar\u003c/span\u003e foo = \u003cspan class=\"hljs-string\"\u003e'bar'\u003c/span\u003e;\n\n\u003cspan class=\"hljs-function\"\u003e\u003cspan class=\"hljs-keyword\"\u003efunction\u003c/span\u003e \u003cspan class=\"hljs-title\"\u003ebaz\u003c/span\u003e\u003cspan class=\"hljs-params\"\u003e(s)\u003c/span\u003e \u003c/span\u003e{\n   \u003cspan class=\"hljs-keyword\"\u003ereturn\u003c/span\u003e foo + \u003cspan class=\"hljs-string\"\u003e':'\u003c/span\u003e + s;\n}\n\u003c/code\u003e\u003c/pre\u003e\n\u003cp\u003eOr inline code like \u003ccode\u003evar foo = \u0026#39;bar\u0026#39;;\u003c/code\u003e.\u003c/p\u003e\n\u003cp\u003eOr an image of bears\u003c/p\u003e\n\u003cp\u003e\u003cimg src=\"http://placebear.com/200/200\" alt=\"bears\"\u003e\u003c/p\u003e\n\u003cp\u003eThe end ...\u003c/p\u003e\n\"\"\"\nprint(unmarkd.unmark(html_doc))\n```\n\nand the output:\n\n````markdown\n    # Sample Markdown\n\n\n    This is some basic, sample markdown.\n\n    ## Second Heading\n\n\n\n    - Unordered lists, and:\n     1. One\n     2. Two\n     3. Three\n    - More\n\n    \u003eBlockquote\n\n\n    And **bold**, *italics*, and even *italics and later **bold***. Even ~~strikethrough~~. 
[A link](https://markdowntohtml.com) to somewhere.\n\n    And code highlighting:\n\n\n    ```js\n    var foo = 'bar';\n\n    function baz(s) {\n       return foo + ':' + s;\n    }\n    ```\n\n\n    Or inline code like `var foo = 'bar';`.\n\n    Or an image of bears\n\n    ![bears](http://placebear.com/200/200)\n\n    The end ...\n````\n\n### Extending\n\n#### Brief Overview\n\nMost functionality should be covered by the `BasicUnmarker` class defined in `unmarkd.unmarkers`.\n\nIf you need to reverse markdown from StackExchange (as is the case for my other project), you may use the `StackOverflowUnmarker` (or its alias, `StackExchangeUnmarker`), which is also defined in `unmarkd.unmarkers`.\n\n#### Customizing\n\nIf the above two classes do not suit your needs, you can subclass the `unmarkd.unmarkers.BaseUnmarker` abstract class.\n\nCurrently, you can _optionally_ override the following methods:\n\n- `detect_language` (parameters: **1**)\n  - **Parameters**:\n    - html: `bs4.BeautifulSoup`\n  - When a fenced code block is encountered, this function is called with a single `bs4.BeautifulSoup` argument; this is the element the code block was detected from (i.e. `pre`).\n  - This function is responsible for detecting the programming language (or returning `''` if none was detected) of the code block.\n  - Note: This method is different from `unmarkd.unmarkers.BasicUnmarker`. 
It is simpler and does less checking/filtering.\n\nBut Unmarkd is more flexible than that.\n\n##### Customizable constants\n\nThere are currently 3 constants you may override:\n\n- Formats:\n  NOTE: Use the [**Format String Syntax**](https://docs.python.org/3/library/string.html#formatstrings)\n  - `UNORDERED_FORMAT`\n    - The string format of unordered (bulleted) lists.\n  - `ORDERED_FORMAT`\n    - The string format of ordered (numbered) lists.\n- Miscellaneous:\n  - `ESCAPABLES`\n    - A container (preferably a `set`) of single-character strings that should be escaped.\n\n##### Customize converting HTML tags\n\nFor an HTML tag `some_tag`, you can customize how it's converted to markdown by overriding a method like so:\n\n```python\nfrom unmarkd.unmarkers import BaseUnmarker\nclass MyCustomUnmarker(BaseUnmarker):\n    def tag_some_tag(self, element) -\u003e str:\n        ...  # parse code here\n```\n\nTo reduce code duplication, if your tag also has aliases (e.g. `strong` is an alias for `b` in HTML), you may modify `TAG_ALIASES`.\n\nIf you really need to, you may also modify `DEFAULT_TAG_ALIASES`. Be warned: if you do so, **you will also need to implement the aliases** (currently `em` and `strong`).\n\n###### Common Patterns\n\nI often find myself iterating through the children of a tag. Those children can themselves be tags of any kind, each of which needs handling. So here's the template/pattern I recommend:\n\n```python\nimport bs4\nfrom unmarkd.unmarkers import BaseUnmarker\nclass MyCustomUnmarker(BaseUnmarker):\n    def tag_some_tag(self, element) -\u003e str:\n        output = ''  # markdown accumulated so far\n        for child in element.children:\n            # Non-tag children (e.g. plain text) are handled by parse_non_tags\n            if non_tag_output := self.parse_non_tags(child):\n                output += non_tag_output\n                continue\n            assert isinstance(child, bs4.Tag), type(child)\n            ...   
# Do whatever you want with the child\n        return output\n```\n\n##### Utility functions when overriding\n\nYou may use (when extending) the following functions:\n\n- `__parse`: 2 parameters:\n  - `html`: _bs4.BeautifulSoup_\n    - The HTML to unmark. This is used internally by the `unmark` method and is slightly faster.\n  - `escape`: _bool_\n    - Whether to escape the characters inside the string or not. Defaults to `False`.\n- `escape`: 1 parameter:\n  - `string`: _str_\n    - The string to escape and make markdown-safe.\n- `wrap`: 2 parameters:\n  - `element`: _bs4.BeautifulSoup_\n    - The element to wrap.\n  - `around_with`: _str_\n    - The string to wrap around the element. **WILL NOT BE ESCAPED**\n- And, of course, `tag_*` and `detect_language`.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthatxliner%2Funmarkd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthatxliner%2Funmarkd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthatxliner%2Funmarkd/lists"}