{"id":22283727,"url":"https://github.com/veaba/pyhtmd","last_synced_at":"2025-07-28T21:33:03.565Z","repository":{"id":62580899,"uuid":"218495800","full_name":"veaba/pyhtmd","owner":"veaba","description":"A Python HTML to Markdown parser, without using any third-party dependency.（一款Python版本的HTML转markdown解析器，不使用任何第三方工具）","archived":false,"fork":false,"pushed_at":"2023-08-30T08:10:19.000Z","size":96,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-11-18T23:56:27.618Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/veaba.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-10-30T09:57:35.000Z","updated_at":"2024-09-30T01:49:12.000Z","dependencies_parsed_at":"2022-11-03T21:30:56.668Z","dependency_job_id":null,"html_url":"https://github.com/veaba/pyhtmd","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/veaba%2Fpyhtmd","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/veaba%2Fpyhtmd/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/veaba%2Fpyhtmd/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/veaba%2Fpyhtmd/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/veaba","download_url":"https://codeload.github.com/veaba/pyhtmd/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227
960431,"owners_count":17847788,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-03T16:41:54.869Z","updated_at":"2024-12-03T16:41:55.559Z","avatar_url":"https://github.com/veaba.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# A Python HTML-to-Markdown parser with no third-party dependencies (experimental demo)\n\nDo not use this in production; it is only an experimental demo project.\n\n\n## install\n\n\u003e pip install pyhtmd\n\n## usage\n\n```python\nfrom pyhtmd import Pyhtmd\nhtml=\"\u003ccode\u003e Hello, world ! by Pyhtmd. \u003c/code\u003e\"\nmd=Pyhtmd(html)\ncontent=md.markdown()\nprint(content) # `Hello, world ! by Pyhtmd.`\n```\n\n## API\nPyhtmd(html,\n language=\"\",\n img=True\n)\n\n- language: type string (js, python, java, etc.)\n- img: {Boolean}, default `True`; can be turned off if `img` rendering is not needed\n\n```python\nfrom pyhtmd import Pyhtmd\nhtml=\"\u003cpre\u003e\u003ccode\u003eimport time\\n print(time.time()) \u003c/code\u003e\u003c/pre\u003e\"\nmd=Pyhtmd(html,language=\"python\")\ncontent=md.markdown()\nprint(content) # `Hello, world ! 
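by Pyhtmd.`\n```\n\n\n## Notes\n\n- Cannot parse multi-level (nested) HTML\n- The input must be a single node\n- This project currently targets only the [tensorflow-docs](https://github.com/veaba/tensorflow-docs) project\n- A few custom tags are not recognized; it is generally adequate for everyday cases\n- Official Python html.parser documentation: https://docs.python.org/zh-cn/3.7/library/html.parser.html\n- html.parser source code: https://github.com/python/cpython/blob/3.7/Lib/html/parser.py\n- Pretty-printed (beautified) HTML is not supported; the input must be compact\n\n## parser\n- [x] single html node element\n- [x] infinite html list node element (arbitrarily deep nested ul/ol tags)\n- [x] img tag\n- [x] head html node element\n- [x] built-in conversion of svg icons to img images, so math symbols in svg format are still parsed\n\n\n## todo\n- The core problem: how should code that is stuck together be split?\n- Fundamentally it has to be split, but how exactly?\n- table\n\n## fix\n- Resolved the list algorithm problem:\n    - The earlier list tag algorithm could not parse this kind of structure:\n    - because it assumed that the closing ul tags appear in sequential order\n    - Core algorithm 1: compute the level of each opening tag\n    - Core algorithm 2: from the index of each opening tag on the left, compute the sequence of corresponding closing-tag indices on the right. I gave it a flashy name of my own: the marker reverse-order odd-even mutual-exclusion algorithm\n    - I worked out both algorithms myself; the first took two days, the second one to two weeks\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fveaba%2Fpyhtmd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fveaba%2Fpyhtmd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fveaba%2Fpyhtmd/lists"}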