https://github.com/thatxliner/unmarkd

An extremely configurable markdown reverser for Python3.
beautifulsoup flexible html html2text markdown markdown-reverser parser python python3 reverse-engineering reverse-markdown reverser

Host: GitHub
URL: https://github.com/thatxliner/unmarkd
Owner: ThatXliner
License: gpl-3.0
Created: 2021-02-21T03:39:30.000Z (almost 4 years ago)
Default Branch: master
Last Pushed: 2024-02-15T18:08:51.000Z (12 months ago)
Last Synced: 2024-10-11T03:12:14.916Z (4 months ago)
Topics: beautifulsoup, flexible, html, html2text, markdown, markdown-reverser, parser, python, python3, reverse-engineering, reverse-markdown, reverser
Language: Python
Homepage: https://pypi.org/project/unmarkd/
Size: 2.17 MB
Stars: 14
Watchers: 2
Forks: 5
Open Issues: 4
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.md

**NOTE: This project is _maintained._** It may seem inactive, but that is only because there is nothing to add. If you want an enhancement or want to file a bug report, please go to the [issues](https://github.com/ThatXliner/unmarkd/issues).

# 🔄 Unmarkd

[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/charliermarsh/ruff/main/assets/badge/v1.json)](https://github.com/charliermarsh/ruff)

[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)

[![Checked with mypy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/) [![codecov](https://codecov.io/gh/ThatXliner/unmarkd/branch/master/graph/badge.svg?token=PWVIERHTG3)](https://codecov.io/gh/ThatXliner/unmarkd) [![CI](https://github.com/ThatXliner/unmarkd/actions/workflows/ci.yml/badge.svg)](https://github.com/ThatXliner/unmarkd/actions/workflows/ci.yml) [![PyPI - Downloads](https://img.shields.io/pypi/dm/unmarkd)](https://pypi.org/project/unmarkd/)

> A markdown reverser.

---

Unmarkd is a [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)-powered [Markdown](https://en.wikipedia.org/wiki/Markdown) reverser written in Python and for Python.

## Why

This was created as a dependency of [StackSearch](http://github.com/ThatXliner/stacksearch) (one of my other projects). In order to create a better API, I needed a way to convert HTML back into Markdown. So I created this.

There are [similar projects](https://github.com/xijo/reverse_markdown) (written in Ruby), ~~but I have not found any written in Python (or for Python)~~ and later I found a popular Python library, [html2text](https://github.com/Alir3z4/html2text).

## Installation

You know the drill:

```bash
pip install unmarkd
```

## Comparison

**TL;DR: Html2Text is faster. If you don't need much configuration, you could use Html2Text for the small speed increase.**

### Speed

**TL;DR: Unmarkd < Html2Text**

Html2Text is faster:

![Benchmark](./assets/benchmark.png)

(The `DOC` variable used can be found [here](./assets/benchmark.html))

Unmarkd sacrifices speed for [power](#configurability).

Html2Text directly uses Python's [`html.parser`](https://docs.python.org/3/library/html.parser.html) module (in the standard library). Unmarkd, on the other hand, uses the powerful HTML parsing library `beautifulsoup4`. BeautifulSoup can be configured to use different HTML parsers, and in Unmarkd we configure it to use Python's `html.parser`, too.

But another layer of code on top of the same parser means more code is run.

I hope that's a good explanation of the speed difference.
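To make the difference concrete, here is a minimal sketch of the event-driven style that working directly on `html.parser` allows: Markdown is emitted as the parser reports each tag, and no document tree is ever built. This is a toy for illustration only. It is not Html2Text's code, and the `TinyReverser` class, the `reverse` function, and the `INLINE_MARKERS` table are names made up for this example.

```python
from html.parser import HTMLParser

# Markdown delimiters for a handful of inline tags; deliberately incomplete.
INLINE_MARKERS = {"b": "**", "strong": "**", "i": "*", "em": "*"}


class TinyReverser(HTMLParser):
    """Toy converter (hypothetical) that emits Markdown straight from parser events."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Unknown tags contribute nothing in this sketch.
        self.parts.append(INLINE_MARKERS.get(tag, ""))

    def handle_endtag(self, tag):
        self.parts.append(INLINE_MARKERS.get(tag, ""))

    def handle_data(self, data):
        self.parts.append(data)


def reverse(html: str) -> str:
    """Feed the HTML through the parser and join the collected pieces."""
    parser = TinyReverser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(reverse("<b>I <i>love</i> markdown!</b>"))
```

A tree-based design has to build the whole document first and then walk it, which costs time but gives each conversion access to an element's parents, children, and attributes.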

### Correctness

**TL;DR: Unmarkd == Html2Text**

I actually found _two_ html-to-markdown libraries. One of them was [Tomd](https://github.com/gaojiuli/tomd) which had an _incorrect implementation_:

![Actual results](./assets/tomd_cant_handle.png)

It seems to be abandoned, anyway.

Now with Html2Text and Unmarkd:

![Epic showdown](./assets/correct.png)

In other words, they _work_.

### Configurability

**TL;DR: Unmarkd > Html2Text**

This is Unmarkd's strong point.

In Html2Text, you only have a limited [set of options](https://github.com/Alir3z4/html2text/blob/master/docs/usage.md#available-options).

In Unmarkd, you can subclass the `BaseUnmarker` and implement conversions for new tags (e.g. ``), etc. In my opinion, it's much easier to extend and configure Unmarkd.

Unmarkd was originally written as a StackSearch dependency.

Html2Text has no options for configuring the parsing of code blocks. Unmarkd does.

## Documentation

Here's an example of basic usage:

```python
import unmarkd

print(unmarkd.unmark("<b>I <i>love</i> markdown!</b>"))
# Output: **I *love* markdown!**
```

or something more complex (shamelessly taken from [here](https://markdowntohtml.com)):

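```python
import unmarkd

html_doc = R"""
<h1>Sample Markdown</h1>
<p>This is some basic, sample markdown.</p>
<h2>Second Heading</h2>
<ul>
<li>Unordered lists, and:<ol>
<li>One</li>
<li>Two</li>
<li>Three</li>
</ol>
</li>
<li>More</li>
</ul>
<blockquote>
<p>Blockquote</p>
</blockquote>
<p>And <strong>bold</strong>, <em>italics</em>, and even <em>italics and later <strong>bold</strong></em>. Even <del>strikethrough</del>. <a href="https://markdowntohtml.com">A link</a> to somewhere.</p>
<p>And code highlighting:</p>
<pre><code class="lang-js">var foo = 'bar';

function baz(s) {
   return foo + ':' + s;
}
</code></pre>
<p>Or inline code like <code>var foo = 'bar';</code>.</p>
<p>Or an image of bears</p>
<p><img src="http://placebear.com/200/200" alt="bears"></p>
<p>The end ...</p>
"""

print(unmarkd.unmark(html_doc))
```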

and the output:

````markdown
# Sample Markdown

This is some basic, sample markdown.

## Second Heading

- Unordered lists, and:
 1. One
 2. Two
 3. Three
- More

>Blockquote

And **bold**, *italics*, and even *italics and later **bold***. Even ~~strikethrough~~. [A link](https://markdowntohtml.com) to somewhere.

And code highlighting:

```js
var foo = 'bar';

function baz(s) {
   return foo + ':' + s;
}
```

Or inline code like `var foo = 'bar';`.

Or an image of bears

![bears](http://placebear.com/200/200)

The end ...
````

### Extending

#### Brief Overview

Most functionality should be covered by the `BasicUnmarker` class defined in `unmarkd.unmarkers`.

If you need to reverse markdown from StackExchange (as is the case for my other project), you may use the `StackOverflowUnmarker` (or its alias, `StackExchangeUnmarker`), which is also defined in `unmarkd.unmarkers`.

#### Customizing

If the above two classes do not suit your needs, you can subclass the `unmarkd.unmarkers.BaseUnmarker` abstract class.

Currently, you can _optionally_ override the following methods:

- `detect_language` (parameters: **1**)

  - **Parameters**:

    - html: `bs4.BeautifulSoup`

  - When a fenced code block is encountered, this function is called with the element the code block was detected from (i.e. `pre`), passed as a `bs4.BeautifulSoup`.

  - This function is responsible for detecting the programming language of the code block (or returning `''` if none was detected).

  - Note: This method is different from the one in `unmarkd.unmarkers.BasicUnmarker`. It is simpler and does less checking/filtering.
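As a rough sketch of what a `detect_language` override could look like, the example below reads the language from a `language-*` or `lang-*` CSS class. Everything in it is an assumption made for illustration: the prefix list, the `language_from_classes` helper, and the `LanguageDetectionMixin` class are not part of Unmarkd, and the only things it relies on are the `find` and `get` methods that BeautifulSoup elements provide.

```python
# Class-name prefixes often used by syntax highlighters (an assumed list).
LANGUAGE_PREFIXES = ("language-", "lang-")


def language_from_classes(classes) -> str:
    """Return the language named by a list of CSS classes, or '' if none is found."""
    for css_class in classes:
        for prefix in LANGUAGE_PREFIXES:
            if css_class.startswith(prefix):
                return css_class[len(prefix):]
    return ""


class LanguageDetectionMixin:
    """Hypothetical mixin meant to be combined with a BaseUnmarker subclass."""

    def detect_language(self, html) -> str:
        # `html` is the <pre> element; highlighters usually put the class on
        # the nested <code> element, so look there first.
        code = html.find("code")
        target = code if code is not None else html
        return language_from_classes(target.get("class") or [])
```

Listing the mixin first in the bases of your own unmarker class (for example, ahead of `BasicUnmarker`) should let its `detect_language` take precedence over the inherited one.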

But Unmarkd is more flexible than that.

##### Customizable constants

There are currently 3 constants you may override:

- Formats:

  NOTE: Use the [**Format String Syntax**](https://docs.python.org/3/library/string.html#formatstrings)

  - `UNORDERED_FORMAT`

    - The string format of unordered (bulleted) lists.

  - `ORDERED_FORMAT`

    - The string format of ordered (numbered) lists.

- Miscellaneous:

  - `ESCAPABLES`

    - A container (preferably a `set`) of length-1 `str` that should be escaped
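For readers unfamiliar with the Format String Syntax, here is a small standalone sketch of how such list templates behave. The placeholder names `index` and `item` and the `render_list` helper are assumptions made for this example; check Unmarkd's source for the placeholder names its own `UNORDERED_FORMAT` and `ORDERED_FORMAT` expect before overriding them.

```python
# Hypothetical templates: the placeholder names `index` and `item` are
# assumptions for this sketch, not necessarily the names Unmarkd uses.
UNORDERED_FORMAT = "\n* {item}"
ORDERED_FORMAT = "\n{index}) {item}"


def render_list(items, ordered: bool) -> str:
    """Render each item with whichever template applies."""
    template = ORDERED_FORMAT if ordered else UNORDERED_FORMAT
    lines = []
    for index, item in enumerate(items, start=1):
        # str.format ignores keyword arguments the template does not use,
        # so the same call works for both templates.
        lines.append(template.format(index=index, item=item))
    return "".join(lines)


print(render_list(["One", "Two"], ordered=True))
print(render_list(["More"], ordered=False))
```

Changing only the template string is enough to switch bullet styles (say, `*` instead of `-`) without touching any conversion logic.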

##### Customize converting HTML tags

For an HTML tag `some_tag`, you can customize how it's converted to markdown by overriding a method like so:

```python
from unmarkd.unmarkers import BaseUnmarker


class MyCustomUnmarker(BaseUnmarker):
    def tag_some_tag(self, element) -> str:
        ...  # parse code here
```

To reduce code duplication, if your tag also has aliases (e.g. `strong` is an alias for `b` in HTML), you may modify `TAG_ALIASES`.

If you really need to, you may also modify `DEFAULT_TAG_ALIASES`. Be warned: if you do so, **you will also need to implement the aliases** (currently `em` and `strong`).
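The idea behind tag aliases can be sketched without Unmarkd at all. In the toy class below, an alias table maps a tag name to the one tag that actually has a `tag_*` method, so only one implementation is needed per group of equivalent tags. The `TinyDispatcher` class, its `convert` method, and the alias-to-canonical mapping direction are assumptions for illustration and may not match how Unmarkd stores or resolves `TAG_ALIASES`.

```python
class TinyDispatcher:
    """Toy stand-in showing alias-based dispatch; not Unmarkd's actual code."""

    # Alias -> canonical tag name. Only the canonical tag needs a method.
    TAG_ALIASES = {"strong": "b", "em": "i"}

    def tag_b(self, text: str) -> str:
        return f"**{text}**"

    def tag_i(self, text: str) -> str:
        return f"*{text}*"

    def convert(self, tag_name: str, text: str) -> str:
        # Resolve the alias first, then look up the matching tag_* method.
        canonical = self.TAG_ALIASES.get(tag_name, tag_name)
        handler = getattr(self, f"tag_{canonical}", None)
        # Unknown tags fall back to their plain text in this sketch.
        return handler(text) if handler else text


print(TinyDispatcher().convert("strong", "bold"))
```

This also shows why replacing the default aliases is risky: once an alias entry is gone, nothing handles that tag unless you write a `tag_*` method for it yourself.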

###### Common Patterns

I find myself iterating through the children of a tag a lot. But each child could be anything: plain text, or another tag that needs its own handling. So here's the template/pattern I recommend:

```python
import bs4

from unmarkd.unmarkers import BaseUnmarker


class MyCustomUnmarker(BaseUnmarker):
    def tag_some_tag(self, element) -> str:
        output = ""
        for child in element.children:
            # Let the library handle plain text and other non-tag nodes
            if non_tag_output := self.parse_non_tags(child):
                output += non_tag_output
                continue
            # Anything left must be a real tag
            assert isinstance(child, bs4.Tag), type(child)
            ...  # Do whatever you want with the child
        return output
```

##### Utility functions when overriding

You may use (when extending) the following functions:

- `__parse`: 2 parameters:

  - `html`: _bs4.BeautifulSoup_

    - The HTML to unmark. `__parse` is used internally by the `unmark` method and is slightly faster.

  - `escape`: _bool_

    - Whether to escape the characters inside the string or not. Defaults to `False`.

- `escape`: 1 parameter:

  - `string`: _str_

    - The string to escape and make markdown-safe

- `wrap`: 2 parameters:

  - `element`: _bs4.BeautifulSoup_

    - The element to wrap.

  - `around_with`: _str_

    - The string to wrap around the element. **WILL NOT BE ESCAPED**

- And, of course, `tag_*` and `detect_language`.
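To illustrate what `escape` and `wrap` are for, here is a simplified standalone sketch. It is not Unmarkd's implementation: the `ESCAPABLES` set below is a made-up example, and this `wrap` takes plain text rather than a BeautifulSoup element so that the sketch stays self-contained.

```python
# Hypothetical set of characters to escape; Unmarkd's own ESCAPABLES may differ.
ESCAPABLES = {"*", "_", "`", "\\"}


def escape(string: str) -> str:
    """Backslash-escape every character listed in ESCAPABLES."""
    return "".join("\\" + char if char in ESCAPABLES else char for char in string)


def wrap(text: str, around_with: str) -> str:
    """Surround text with a delimiter; the delimiter itself is not escaped."""
    return f"{around_with}{text}{around_with}"


# Escape the content first, then add the unescaped Markdown delimiters.
print(wrap(escape("snake_case"), "**"))
```

Escaping the content but not the delimiter is what keeps a literal `_` or `*` in the source text from being read as Markdown, while the surrounding `**` still produces bold.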