{"id":19066690,"url":"https://github.com/washingtonpost/html2ans","last_synced_at":"2025-04-28T12:48:18.846Z","repository":{"id":57437637,"uuid":"170710318","full_name":"washingtonpost/html2ans","owner":"washingtonpost","description":"Converts HTML into the Washington Post's ANS format","archived":false,"fork":false,"pushed_at":"2023-01-12T14:29:19.000Z","size":511,"stargazers_count":9,"open_issues_count":0,"forks_count":7,"subscribers_count":11,"default_branch":"dev","last_synced_at":"2024-04-25T16:02:56.791Z","etag":null,"topics":["arc","arc-publishing","washington-post"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/washingtonpost.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-02-14T15:08:30.000Z","updated_at":"2023-08-07T17:45:03.000Z","dependencies_parsed_at":"2023-02-09T12:15:41.497Z","dependency_job_id":null,"html_url":"https://github.com/washingtonpost/html2ans","commit_stats":null,"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/washingtonpost%2Fhtml2ans","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/washingtonpost%2Fhtml2ans/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/washingtonpost%2Fhtml2ans/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/washingtonpost%2Fhtml2ans/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/washingtonpost","download_url":"https://codeload.github.com/washingtonpost/html2ans/tar.
gz/refs/heads/dev","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223772159,"owners_count":17199977,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arc","arc-publishing","washington-post"],"created_at":"2024-11-09T00:57:43.263Z","updated_at":"2024-11-09T00:57:43.904Z","avatar_url":"https://github.com/washingtonpost.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"html2ans\n========\n\n.. image:: https://img.shields.io/pypi/v/html2ans.svg\n    :target: https://pypi.org/project/html2ans/\n\n.. image:: https://img.shields.io/pypi/pyversions/html2ans.svg\n    :target: https://pypi.org/project/html2ans/\n\n.. image:: https://circleci.com/gh/washingtonpost/html2ans.svg?style=shield\n    :target: https://circleci.com/gh/washingtonpost/html2ans\n\n.. image:: https://img.shields.io/pypi/l/html2ans.svg\n    :target: https://pypi.python.org/pypi/html2ans/\n\n\nThis project provides a standardized method of parsing HTML elements into `ANS elements\n\u003chttps://github.com/washingtonpost/ans-schema\u003e`_. 
It is mainly used by Arc Publishing's\nprofessional services team to migrate client data into the Arc platform, but can also be\nused for arbitrary conversion of HTML to JSON.\n\nhtml2ans is hosted on `pypi \u003chttps://pypi.org/project/html2ans/\u003e`_.\n\nPlease use the `GitHub issue tracker \u003chttps://github.com/washingtonpost/html2ans/issues\u003e`_ to submit bugs or request features.\n\nFull documentation can be found `here \u003chttps://washingtonpost.github.io/html2ans/\u003e`_.\n\n\nQuickstart\n----------\n\nGenerating ANS from HTML\n~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code-block:: python\n\n    from html2ans.default import Html2Ans\n\n    parser = Html2Ans()\n    content_elements = parser.generate_ans(your_html_here)\n\n\nAdding Parsers\n~~~~~~~~~~~~~~\n\nBasic Addition\n^^^^^^^^^^^^^^\n\nIf you need to parse a certain tag in a customized way, you can write your own parser class and add it to the\nparsers ``Html2Ans`` will use like so:\n\n\n.. code-block:: python\n\n    from html2ans.default import Html2Ans\n\n    parser = Html2Ans()\n    parser.add_parser(YourCustomImageParser())\n    parser.generate_ans(your_html_here)\n\n\nThe types of items your parser can parse should be listed in its ``applicable_elements`` attribute.\n\nThe default parser class (``DefaultHtmlAnsParser`` or ``Html2Ans``) has parsers for text, links, images, various social media embeds, etc.\n\n\nPrioritized Addition\n^^^^^^^^^^^^^^^^^^^^\n\nThe parsers that can be used for each element type (e.g. ``img``, ``p``) are held in a list. If you want your parser to have a higher priority than the default parsers, add it like so:\n\n.. code-block:: python\n\n    from html2ans.default import Html2Ans\n\n    parser = Html2Ans()\n    parser.insert_parser('img', YourCustomImageParser(), 0)\n    parser.generate_ans(your_html_here)\n\n\nCreating Custom Parsers\n~~~~~~~~~~~~~~~~~~~~~~~\n\nMissing from the snippet above is a definition of ``YourCustomImageParser``. 
Before talking about how to create such a parser,\nlet's examine why you might need to do so.\n\nThe default image parser ``html2ans.parsers.image.ImageParser`` applies to HTML ``img`` tags only. Imagine you need to parse HTML whose images come in ``div`` tags (labelled with the class ``fancy-figure``) that also hold a caption (labelled with the class ``fancy-caption``). Here is a possible implementation of a parser for such images (note: this returns basic image ANS, not a reference):\n\n.. code-block:: python\n\n    from html2ans.parsers.image import ImageParser\n    from html2ans.parsers.base import ParseResult\n\n    class YourCustomImageParser(ImageParser):\n        applicable_elements = ['div', 'figure']\n        applicable_classes = ['fancy-figure']\n\n        def parse(self, element, *args, **kwargs):\n            image_tag = element.find('img')\n            caption_tag = element.find('p', {\"class\": \"fancy-caption\"})\n            if image_tag:\n                image = self.construct_output(image_tag)\n                if caption_tag:\n                    image[\"caption\"] = caption_tag.text\n                return ParseResult(image, True)\n            return ParseResult(None, True)\n\n\nCustom Parsing Tips\n~~~~~~~~~~~~~~~~~~~\n\nANS Versions\n^^^^^^^^^^^^\n\nSome ANS types require a version. You can set a version in your main parser (``Html2Ans``) and then automatically include that version in any element parser's output by setting the parser's ``version_required`` attribute to ``True``.\n\n*Note: this doesn't mean valid, version-compatible ANS is automatically produced!*\n\n\nKeeping HTML in ``text`` Output\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nTo control which HTML tags are left inline when parsing text, adjust the ``INLINE_TAGS`` attribute on the text parser. Every parser inherits from ``html2ans.parsers.utils.AbstractParserUtilities``, which provides a list of default ``INLINE_TAGS`` that can be used to make sure text formatters (e.g. 
``strong``, ``em``, etc.) are left in place when text is parsed.\n\n\nLink Parsing\n^^^^^^^^^^^^\n\nBy default, ``a`` tags are left inline in text, assuming there is text outside of the link. A link by itself (e.g. ``\u003cp\u003e\u003ca href=\"google.com\"\u003eSearch\u003c/a\u003e\u003c/p\u003e``) will be turned into an ``interstitial_link``. If ``interstitial_link`` elements are unwanted, simply add ``a`` to the list of ``applicable_elements`` for the ``ParagraphParser``.\n\n\nRemoving Unnecessary Tags\n^^^^^^^^^^^^^^^^^^^^^^^^^\n\nSometimes it is helpful to remove unnecessary tags (e.g. ``\u003cp\u003e\u003c/p\u003e``, ``\u003cdiv\u003e\u003cimg src=\"...\" /\u003e\u003c/div\u003e``). By default, ``Html2Ans`` considers ``p`` and ``div`` tags with no attributes other than ``id``, ``class``, or ``style`` to be unnecessary \"wrappers\". When these are encountered, they are ignored and their children are parsed.\n\nThe benefit of this is that ``\u003cp\u003e\u003c/p\u003e`` is ignored and ``\u003cdiv\u003e\u003cimg src=\"...\" /\u003e\u003c/div\u003e`` is parsed as an image.\n\nThe downside is that sometimes you don't want your HTML removed! There are a few options in this case. You can configure which tags can be considered wrappers via the ``WRAPPER_TAGS`` attribute on ``Html2Ans``. So if ``div`` tags should never be removed, simply remove ``div`` from this list. If a more complicated set of rules is necessary, override the ``is_wrapper`` method on ``Html2Ans``.\n\nIf it's easier to modify the HTML than to modify this library, you can also add an arbitrary attribute like so: ``\u003cdiv no_parse_flag=\"true\"\u003e...\u003c/div\u003e``. 
This ``div`` will not be considered a wrapper when it is encountered.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwashingtonpost%2Fhtml2ans","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwashingtonpost%2Fhtml2ans","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwashingtonpost%2Fhtml2ans/lists"}