https://github.com/matthewwithanm/python-markdownify

Convert HTML to Markdown

Host: GitHub
URL: https://github.com/matthewwithanm/python-markdownify
Owner: matthewwithanm
License: mit
Created: 2012-06-29T16:30:23.000Z (over 13 years ago)
Default Branch: develop
Last Pushed: 2025-04-28T10:37:33.000Z (7 months ago)
Last Synced: 2025-05-01T16:17:41.921Z (7 months ago)
Language: Python
Homepage:
Size: 224 KB
Stars: 1,577
Watchers: 13
Forks: 158
Open Issues: 20
Metadata Files:
- Readme: README.rst
- License: LICENSE


|build| |version| |license| |downloads|

.. |build| image:: https://img.shields.io/github/actions/workflow/status/matthewwithanm/python-markdownify/python-app.yml?branch=develop
    :alt: GitHub Workflow Status
    :target: https://github.com/matthewwithanm/python-markdownify/actions/workflows/python-app.yml?query=workflow%3A%22Python+application%22

.. |version| image:: https://img.shields.io/pypi/v/markdownify
    :alt: Pypi version
    :target: https://pypi.org/project/markdownify/

.. |license| image:: https://img.shields.io/pypi/l/markdownify
    :alt: License
    :target: https://github.com/matthewwithanm/python-markdownify/blob/develop/LICENSE

.. |downloads| image:: https://pepy.tech/badge/markdownify
    :alt: Pypi Downloads
    :target: https://pepy.tech/project/markdownify

Installation
============

``pip install markdownify``

Usage
=====

Convert some HTML to Markdown:

.. code:: python

    from markdownify import markdownify as md

    md('<b>Yay</b> <a href="http://github.com">GitHub</a>')  # > '**Yay** [GitHub](http://github.com)'

Specify tags to exclude:

.. code:: python

    from markdownify import markdownify as md

    md('<b>Yay</b> <a href="http://github.com">GitHub</a>', strip=['a'])  # > '**Yay** GitHub'

\...or specify the tags you want to include:

.. code:: python

    from markdownify import markdownify as md

    md('<b>Yay</b> <a href="http://github.com">GitHub</a>', convert=['b'])  # > '**Yay** GitHub'

Options
=======

Markdownify supports the following options:

strip
  A list of tags to strip. This option can't be used with the
  ``convert`` option.

convert
  A list of tags to convert. This option can't be used with the
  ``strip`` option.

autolinks
  A boolean indicating whether the "automatic link" style should be used when
  an ``a`` tag's contents match its href. Defaults to ``True``.

default_title
  A boolean to enable setting the title of a link to its href, if no title is
  given. Defaults to ``False``.

heading_style
  Defines how headings should be converted. Accepted values are ``ATX``,
  ``ATX_CLOSED``, ``SETEXT``, and ``UNDERLINED`` (which is an alias for
  ``SETEXT``). Defaults to ``UNDERLINED``.

bullets
  An iterable (string, list, or tuple) of bullet styles to be used. If the
  iterable only contains one item, it will be used regardless of how deeply
  lists are nested. Otherwise, the bullet will alternate based on nesting
  level. Defaults to ``'*+-'``.

strong_em_symbol
  In markdown, both ``*`` and ``_`` are used to encode **strong** or
  *emphasized* texts. Either of these symbols can be chosen by the options
  ``ASTERISK`` (default) or ``UNDERSCORE`` respectively.

sub_symbol, sup_symbol
  Define the chars that surround ``<sub>`` and ``<sup>`` text. Defaults to an
  empty string, because this is non-standard behavior. Could be something like
  ``~`` and ``^`` to result in ``~sub~`` and ``^sup^``.  If the value starts
  with ``<`` and ends with ``>``, it is treated as an HTML tag and a ``/`` is
  inserted after the ``<`` in the string used after the text; this allows
  specifying ``<sub>`` to use raw HTML in the output for subscripts, for
  example.

newline_style
  Defines the style of marking linebreaks (``<br>``) in markdown. The default
  value ``SPACES`` of this option will adopt the usual two spaces and a newline,
  while ``BACKSLASH`` will convert a linebreak to ``\\n`` (a backslash and a
  newline). While the latter convention is non-standard, it is commonly
  preferred and supported by a lot of interpreters.

code_language
  Defines the language that should be assumed for all ``<pre>`` sections.
  Useful if all code on a page is in the same programming language and
  should be annotated with `````python`` or similar.
  Defaults to ``''`` (empty string) and can be any string.

code_language_callback
  When the HTML code contains ``pre`` tags that in some way provide the code
  language, for example as class, this callback can be used to extract the
  language from the tag and prefix it to the converted ``pre`` tag.
  The callback gets one single argument, a BeautifulSoup_ object, and returns
  a string containing the code language, or ``None``.
  An example to use the class name as code language could be::

    def callback(el):
        return el['class'][0] if el.has_attr('class') else None

  Defaults to ``None``.

escape_asterisks
  If set to ``False``, do not escape ``*`` to ``\*`` in text.
  Defaults to ``True``.

escape_underscores
  If set to ``False``, do not escape ``_`` to ``\_`` in text.
  Defaults to ``True``.

escape_misc
  If set to ``True``, escape miscellaneous punctuation characters
  that sometimes have Markdown significance in text.
  Defaults to ``False``.

keep_inline_images_in
  Images are converted to their alt-text when the images are located inside
  headlines or table cells. If some inline images should be converted to
  markdown images instead, this option can be set to a list of parent tags
  that should be allowed to contain inline images, for example ``['td']``.
  Defaults to an empty list.

table_infer_header
  Controls handling of tables with no header row (as indicated by ``<thead>``
  or ``<th>``). When set to ``True``, the first body row is used as the header row.
  Defaults to ``False``, which leaves the header row empty.

wrap, wrap_width
  If ``wrap`` is set to ``True``, all text paragraphs are wrapped at
  ``wrap_width`` characters. Defaults to ``False`` and ``80``.
  Use with ``newline_style=BACKSLASH`` to keep line breaks in paragraphs.
  A ``wrap_width`` value of ``None`` reflows lines to unlimited line length.

strip_document
  Controls whether leading and/or trailing separation newlines are removed from
  the final converted document. Supported values are ``LSTRIP`` (leading),
  ``RSTRIP`` (trailing), ``STRIP`` (both), and ``None`` (neither). Newlines
  within the document are unaffected.
  Defaults to ``STRIP``.

beautiful_soup_parser
  Specify the Beautiful Soup parser to be used for interpreting HTML markup.
  Any parser installed in the execution environment can be used, such as
  ``html5lib``, ``lxml``, or a custom parser. Defaults to ``html.parser``.

.. _BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
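
As a minimal sketch of the ``code_language_callback`` option described above:
the helper name ``language_from_class`` and the ``language-`` class prefix are
assumptions chosen for illustration and are not part of markdownify.

.. code:: python

    from markdownify import markdownify as md

    def language_from_class(el):
        # el is the <pre> element; BeautifulSoup exposes "class" as a list
        for css_class in el.get('class', []):
            if css_class.startswith('language-'):
                return css_class[len('language-'):]
        return None  # no language found: the code_language option applies

    html = '<pre class="language-python">print("hi")</pre>'
    fenced = md(html, code_language_callback=language_from_class)
    print(fenced)

The printed result should be a fenced code block annotated with ``python``.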

Options may be specified as kwargs to the ``markdownify`` function, or as a
nested ``Options`` class in ``MarkdownConverter`` subclasses.
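
The two ways of passing options can be sketched as follows. The class name
``AtxConverter`` and the sample HTML are made up for illustration; the
``ATX`` constant is assumed to be importable from the ``markdownify`` package.

.. code:: python

    from markdownify import ATX, MarkdownConverter, markdownify as md

    html = '<h2>Title</h2><ul><li>one</li><li>two</li></ul>'

    # Options passed as keyword arguments
    as_kwargs = md(html, heading_style=ATX, bullets='-')

    # The same options declared once on a converter subclass
    class AtxConverter(MarkdownConverter):
        class Options:
            heading_style = ATX
            bullets = '-'

    as_subclass = AtxConverter().convert(html)
    print(as_kwargs == as_subclass)

The subclass form is convenient when the same settings are reused in many
places.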

Converting BeautifulSoup objects
================================

.. code:: python

    from markdownify import MarkdownConverter

    # Create shorthand method for conversion
    def md(soup, **options):
        return MarkdownConverter(**options).convert_soup(soup)
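
A usage sketch for the shorthand above, assuming the ``beautifulsoup4``
package (a dependency of markdownify) is importable as ``bs4``. The HTML
snippet and the ``main`` id are hypothetical. It converts only one part of a
parsed document:

.. code:: python

    from bs4 import BeautifulSoup
    from markdownify import MarkdownConverter

    def md(soup, **options):
        return MarkdownConverter(**options).convert_soup(soup)

    html = ('<div id="nav"><a href="/">Home</a></div>'
            '<div id="main"><p>Hello <b>world</b></p></div>')
    soup = BeautifulSoup(html, 'html.parser')

    # Convert only the hypothetical "main" element; strip() drops any
    # surrounding newlines that may remain when converting a fragment
    main_markdown = md(soup.find(id='main')).strip()
    print(main_markdown)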

Creating Custom Converters
==========================

If you have a special use case that calls for a special conversion, you can
always inherit from ``MarkdownConverter`` and override the method you want to
change.
The function that handles an HTML tag named ``abc`` is called
``convert_abc(self, el, text, parent_tags)`` and returns a string
containing the converted HTML tag.
The ``MarkdownConverter`` object will handle the conversion based on the
function names:

.. code:: python

    from markdownify import MarkdownConverter

    class ImageBlockConverter(MarkdownConverter):
        """
        Create a custom MarkdownConverter that adds two newlines after an image
        """
        def convert_img(self, el, text, parent_tags):
            return super().convert_img(el, text, parent_tags) + '\n\n'

    # Create shorthand method for conversion
    def md(html, **options):
        return ImageBlockConverter(**options).convert(html)

.. code:: python

    from markdownify import MarkdownConverter

    class IgnoreParagraphsConverter(MarkdownConverter):
        """
        Create a custom MarkdownConverter that ignores paragraphs
        """
        def convert_p(self, el, text, parent_tags):
            return ''

    # Create shorthand method for conversion
    def md(html, **options):
        return IgnoreParagraphsConverter(**options).convert(html)
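
The same naming convention can also be used for a tag that is not covered by
the examples above. The following is a sketch only: the class name
``HighlightConverter`` and the ``==text==`` output syntax are assumptions for
illustration, and ``==`` highlighting is not standard Markdown.

.. code:: python

    from markdownify import MarkdownConverter

    class HighlightConverter(MarkdownConverter):
        """
        Hypothetical converter that renders <mark> elements as ==text==
        """
        def convert_mark(self, el, text, parent_tags):
            # text holds the already converted contents of the <mark> tag
            return '==' + text + '=='

    highlighted = HighlightConverter().convert('<p>This is <mark>important</mark></p>')
    print(highlighted)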

Command Line Interface
======================

Use ``markdownify example.html > example.md`` or pipe input from stdin
(``cat example.html | markdownify > example.md``).
Call ``markdownify -h`` to see all available options.
They are the same as listed above and take the same arguments.

Development
===========

To run tests and the linter, run ``pip install tox`` once, then ``tox``.
